Reuters-RCV1 has 100 million tokens. Collecting all termID–docID pairs of
the collection, one pair per token with 4 bytes each for termID and docID,
therefore requires 0.8 GB of storage. Typical collections today are often one
or two orders of magnitude larger than Reuters-RCV1. It is easy to see how
such collections would overwhelm even large computers if we tried to sort
their termID–docID pairs
in memory. If the size of the intermediate files during index construction is
within a small factor of available memory, then the compression techniques
introduced in Chapter 5 can help; however, the postings file of many large
collections cannot fit into memory even after compression.
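
The storage estimate above can be checked with a few lines of arithmetic. The following is only a sketch: the function name `pair_storage_bytes` and its parameters are hypothetical, and it assumes one termID–docID pair per token and decimal gigabytes (10^9 bytes), as the 0.8 GB figure implies.

```python
def pair_storage_bytes(num_tokens: int,
                       termid_bytes: int = 4,
                       docid_bytes: int = 4) -> int:
    """Bytes needed to hold one termID-docID pair per token (uncompressed)."""
    return num_tokens * (termid_bytes + docid_bytes)


RCV1_TOKENS = 100_000_000  # token count stated in the text

# Reuters-RCV1 itself, then collections one and two orders of magnitude larger.
for scale in (1, 10, 100):
    size_gb = pair_storage_bytes(RCV1_TOKENS * scale) / 1e9
    print(f"{scale:>3}x Reuters-RCV1: {size_gb:g} GB of termID-docID pairs")
```

Under these assumptions, a collection 100 times the size of Reuters-RCV1 needs about 80 GB for the uncompressed pairs alone, before counting any working space the sort itself requires.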
With main memory insufficient, we need to use an external sorting algorithm, that is, one that uses disk. For acceptable speed, the central require