Assuming a Zipf distribution for words, we would expect the number of new words that occur in a given amount of new text to decrease as the size of the corpus increases.
New words will, however, always occur due to sources such as invented words (think of all those drug names and start-up company names), spelling errors, product numbers, people’s names, email addresses, and many others.
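The expectation that new words become rarer as the corpus grows can be illustrated with a small simulation. The following is a minimal sketch, not a measurement on a real corpus: it assumes a closed vocabulary in which the word at rank r has probability proportional to 1/r, draws tokens at random, and counts how many previously unseen words appear in each successive block of text. The function name `new_words_per_block` and all parameter values are hypothetical choices made for illustration. Because the vocabulary is closed, the sketch does not model open-ended sources such as names or spelling errors.

```python
import random
from itertools import accumulate


def new_words_per_block(vocab_size=50_000, block_size=10_000,
                        num_blocks=10, seed=0):
    """Count previously unseen words in each successive block of tokens.

    Assumption: tokens are drawn independently from a Zipf-like
    distribution over a fixed vocabulary (probability of rank r ~ 1/r).
    """
    rng = random.Random(seed)
    # Cumulative Zipf-like weights for ranks 1..vocab_size.
    cum_weights = list(accumulate(1.0 / r for r in range(1, vocab_size + 1)))

    seen = set()   # word ids observed so far
    counts = []    # number of new words contributed by each block
    for _ in range(num_blocks):
        block = rng.choices(range(vocab_size),
                            cum_weights=cum_weights, k=block_size)
        before = len(seen)
        seen.update(block)
        counts.append(len(seen) - before)
    return counts


if __name__ == "__main__":
    for i, n in enumerate(new_words_per_block(), start=1):
        print(f"block {i}: {n} new words")
```

Under these assumptions, the first block contributes the most new words, and later blocks of the same size contribute fewer, which is the qualitative behavior described above.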
The relationship between the size of the corpus and the size of the vocabulary was found empirically by Heaps (1978) to be: