Most commonly, researchers evaluate clustering results by
comparing them to a ground-truth set of class labels for the documents. This poses problems when evaluating large-scale
collections containing hundreds of millions of documents.
Human assessors or rule-based intelligent systems are required to label the entire collection, choosing among many thousands of
potential topics. Even if only a small percentage of the collection is labeled, how does an assessor choose between many
thousands of potential topics in a general-purpose document
collection such as the