Most commonly, researchers evaluate clustering results by
comparing them to a ground-truth set of class labels for documents.
This poses problems when evaluating large-scale
collections containing hundreds of millions of documents.
Human assessors or rule-based intelligent systems would be required
to assign every document in the collection to one of many thousands of
potential topics. Even if only a small percentage of the collection
is labeled, how does an assessor choose between many
thousands of potential topics in a general-purpose document
collection such as the