These estimates are much better than the ones produced assuming independence,
but they are still too low. Rather than storing even more information, such
as the number of occurrences of word triples, it turns out that reasonable estimates
of result size can be made using just word frequency and the size of the current
result set. Search engines estimate the result size because they do not rank all
the documents that contain the query words. Instead, they rank a much smaller
subset of the documents that are likely to be the most relevant. If we know the
proportion of the total documents that have been ranked (s) and the number of
documents found that contain all the query words (C), we can simply estimate
the result size as C/s, which assumes that the documents containing all the words
are distributed uniformly.⁹ The proportion of documents processed is measured
by the proportion of the documents containing the least frequent word that have
been processed, since all results must contain that word.
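
As a minimal sketch of the C/s estimate described above, the following Python function computes s from the fraction of the least frequent word's documents that have been processed, then scales the number of matches found so far. The function name, parameter names, and the numbers in the usage example are illustrative assumptions, not taken from any particular search engine.

```python
def estimate_result_size(matches_found: int,
                         rare_docs_processed: int,
                         rare_docs_total: int) -> float:
    """Estimate the total result size as C / s.

    matches_found       -- C, documents found so far that contain all query words
    rare_docs_processed -- documents containing the least frequent query word
                           that have been processed so far (hypothetical name)
    rare_docs_total     -- total documents containing the least frequent query
                           word (hypothetical name)
    """
    if rare_docs_total <= 0:
        raise ValueError("rare_docs_total must be positive")
    if not 0 < rare_docs_processed <= rare_docs_total:
        raise ValueError("rare_docs_processed must be in (0, rare_docs_total]")

    # s: proportion of the collection processed, measured on the least
    # frequent word because every result must contain that word.
    s = rare_docs_processed / rare_docs_total

    # Assumes matching documents are spread uniformly through the collection.
    return matches_found / s


# Hypothetical numbers: the least frequent word occurs in 8,000 documents,
# 2,000 of them have been processed, and 150 matches have been found.
print(estimate_result_size(150, 2000, 8000))  # s = 0.25, so prints 600.0
```

Because the estimate relies on the uniformity assumption, it will be less accurate when the documents are processed in an order that concentrates matches early or late.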