Table 4.2 Collection statistics for Reuters-RCV1. Values are rounded for the computations in this book. The unrounded values are: 806,791 documents, 222 tokens per document, 391,523 (distinct) terms, 6.04 bytes per token with spaces and punctuation, 4.5 bytes per token without spaces and punctuation, 7.5 bytes per term, and 96,969,056 tokens. The numbers in this table correspond to the third line (“case folding”) in Table 5.1 (page 87).