Web-scale collections are considerably larger than the AP89 collection.
The AP89 collection contains about 40 million words, but the (relatively small) TREC Web collection GOV2 contains more than 20 billion words.
With that many words, it seems likely that the number of new words would eventually drop to near zero and Heaps’ law would not be applicable.
It turns out this is not the case.
Figure 4.4 shows a plot of vocabulary growth for GOV2 together with a graph of Heaps’ law with k = 7.34 and β = 0.648. This data indicates that the number of unique words continues to grow steadily even after the vocabulary reaches 30 million words.
This has significant implications for the design of search engines, which will be discussed in Chapter 5.
Heaps’ law provides a good fit for this data, although the parameter values are very different from those for other TREC collections and fall outside the range considered typical for these and other smaller collections.
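
To make the GOV2 fit concrete, the following minimal Python sketch evaluates Heaps’ law with the parameters quoted above. It assumes the usual form of the law, v = k · n^β, where n is the number of words seen so far and v is the predicted vocabulary size. The function names `heaps_vocabulary` and `heaps_new_word_rate` are illustrative and do not come from the text.

```python
# Sketch only: assumes Heaps' law in the form v = k * n**beta.
# K and BETA are the GOV2 values quoted in the text; the function
# names are hypothetical and chosen for illustration.

K = 7.34
BETA = 0.648


def heaps_vocabulary(n, k=K, beta=BETA):
    """Predicted number of unique words after n words of text."""
    return k * n ** beta


def heaps_new_word_rate(n, k=K, beta=BETA):
    """Derivative of the law: expected new words per additional word at size n."""
    return k * beta * n ** (beta - 1)


if __name__ == "__main__":
    # Print predictions at several collection sizes, up to 20 billion words.
    for n in (40e6, 1e9, 10e9, 20e9):
        v = heaps_vocabulary(n)
        rate = heaps_new_word_rate(n)
        print(f"n = {n:.0e}  predicted vocabulary = {v:,.0f}  new words per word = {rate:.2e}")
```

Under these assumptions the predicted vocabulary at 20 billion words is in the tens of millions, and the growth rate, although it shrinks as n increases, stays above zero. This matches the observation that new words keep appearing even in very large collections.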