With that many words, it seems likely that the number of new words would eventually drop to near zero and Heaps’ law would not be applicable.
This is not the case. Figure 4.4 shows a plot of vocabulary growth for GOV2 together with a graph of Heaps’ law with k = 7.34 and β = 0.648. The data indicate that the number of unique words continues to grow steadily even after reaching 30 million.
This has significant implications for the design of search engines, which will be discussed in Chapter 5.
Heaps’ law provides a good fit for this data, although the parameter values are very different from those for the other TREC collections and fall outside the range considered typical for those and other smaller collections.
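To make the continued growth concrete, the following is a minimal sketch that evaluates Heaps’ law with the GOV2 parameters quoted above, assuming the usual form v = k · n^β, where n is the number of words processed and v is the predicted vocabulary size. The function names and the sample values of n are illustrative choices, not taken from the text. The second function gives the derivative of the curve, that is, the predicted number of new words per additional word, which shrinks as n grows but never reaches zero.

```python
def heaps_vocabulary(n_words: float, k: float = 7.34, beta: float = 0.648) -> float:
    """Predicted vocabulary size v = k * n^beta (Heaps' law).

    The defaults are the GOV2 parameters quoted in the text.
    """
    return k * n_words ** beta


def heaps_growth_rate(n_words: float, k: float = 7.34, beta: float = 0.648) -> float:
    """Derivative dv/dn = k * beta * n^(beta - 1).

    Interpreted as the predicted number of new vocabulary words
    per additional word of text after n_words have been seen.
    """
    return k * beta * n_words ** (beta - 1.0)


if __name__ == "__main__":
    # Illustrative corpus sizes (hypothetical, chosen only to show the trend).
    for n in (1e6, 1e8, 1e10):
        v = heaps_vocabulary(n)
        rate = heaps_growth_rate(n)
        print(f"n = {n:.0e}: predicted vocabulary = {v:,.0f}, "
              f"new words per word = {rate:.6f}")
```

Because β is less than 1, the growth rate falls as the collection gets larger, but it stays positive for every n, which is consistent with the observation that new words keep appearing even in a very large collection.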