In this chapter we look at a particular type of classification task, where the
objects are text documents such as articles in newspapers, scientific papers
in journals or perhaps abstracts of papers, or even just their titles. The aim
is to use a set of pre-classified documents to classify those that have not yet
been seen. This is becoming an increasingly important practical problem as
the volume of printed material in many fields keeps increasing and even in
specialist fields it can be very difficult to locate relevant documents. Much
of the terminology used reflects the origins of this work in librarianship and
information science, long before data mining techniques became available.
In principle we can use any of the standard methods of classification (Naive
Bayes, Nearest Neighbour, decision trees etc.) for this task, but datasets of text
documents have a number of specific features compared with the datasets we
have seen so far, which require separate explanation. The special case where
the documents are web pages will be covered in Section 15.9.