FOA Home | UP: Building hypotheses about documents

Feature selection

When you think about how you classify your email, the keywords it contains are almost certainly among the first FEATURES that come to mind. Recall, however, that the keyword vocabulary can be very large. Using this feature space, then, individual documents' representations will be very SPARSE. In terms of the vector space model of Section §3.4, many of the vector elements will be zero. To use Littlestone's lovely expression, "Irrelevant attributes abound" [Littlestone88], and so it should come as no surprise that his learning techniques are especially appropriate in the FOA learning applications of Section §7.5.3.
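To make the sparsity concrete, here is a minimal sketch in Python. The three toy messages, the whitespace tokenizer, and the function names are all illustrative assumptions, not part of the FOA text. It stores only the nonzero keyword counts for each message and reports what fraction of the vocabulary each message leaves at zero.

```python
from collections import Counter

# Toy corpus; these messages are made up for illustration.
emails = [
    "meeting moved to friday",
    "budget review meeting friday",
    "your order has shipped",
]

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

# The vocabulary is every distinct keyword seen anywhere in the corpus.
vocabulary = sorted({tok for msg in emails for tok in tokenize(msg)})

def sparse_vector(text):
    """Keep only the nonzero keyword counts, keyed by keyword."""
    return dict(Counter(tokenize(text)))

def sparsity(text):
    """Fraction of vocabulary dimensions that are zero for this message."""
    nonzero = len(set(tokenize(text)) & set(vocabulary))
    return 1.0 - nonzero / len(vocabulary)

for msg in emails:
    vec = sparse_vector(msg)
    print(f"{len(vec)} of {len(vocabulary)} dimensions nonzero "
          f"(sparsity {sparsity(msg):.2f}): {vec}")
```

Even with only three short messages, most dimensions of each vector are zero; with a realistic vocabulary of many thousands of keywords the proportion of zeros becomes far larger.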

Efforts to control the keyword vocabulary and make the lexical features as meaningful as possible are therefore important preconditions for good classification performance. For example, name-tagging techniques (cf. Section §6.6.1) which reliably identify proper names can provide valuable classification features. A good proper-name tagger would be especially sophisticated about capitalization, name order, and abbreviation conventions. When both people's proper names and institutional names (government agencies, universities, corporations, etc.) are identified, the recognition of complex, multi-token phrases becomes possible.
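As a rough illustration of how multi-token names can become single features, the sketch below groups runs of capitalized words and initials into one phrase. It is a deliberately crude stand-in for a real name tagger, not the technique of Section §6.6.1: the regular expression, the function names, and the sample sentence are assumptions, and the pattern relies on capitalization alone, so it will wrongly absorb a capitalized sentence-initial word that precedes a name.

```python
import re

# A run of two or more capitalized words or initials, e.g. "R. K. Belew".
# Capitalization is the only cue used here (an illustrative simplification).
NAME_RUN = re.compile(
    r"\b(?:[A-Z][a-z]+|[A-Z]\.)(?:\s+(?:[A-Z][a-z]+|[A-Z]\.))+"
)

def tag_names(text):
    """Return candidate multi-token proper names found in the text."""
    return [m.group(0) for m in NAME_RUN.finditer(text)]

def name_features(text):
    """Turn each candidate name into a single feature token."""
    return [name.replace(" ", "_") for name in tag_names(text)]

sentence = "We met Nick Littlestone at Stanford University yesterday."
print(tag_names(sentence))
print(name_features(sentence))
```

Joining the tokens of a name into one feature keeps a phrase such as a university's name from being scattered across several unrelated keyword dimensions.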

In part because of the difficult issues that lexical, keyword-based representations entail, it is worth thinking briefly about some of the alternatives. There are also less obvious features we might use to classify documents. META-DATA associated with the document, for example information about its date and place of publication, is one possibility. Geographic place information associated with a document can also be useful; cf. Section §6.6.1. Finally, recall the bibliographic citations that many documents contain (cf. Section §6.1). The set of references one document makes to other ones (representable as links in a graph) can be used as the basis of classification in much the same way as its keywords.
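The parallel between citations and keywords can be sketched as follows. Each document carries a set of keywords and a set of cited references, and the same set-overlap measure is applied to either one. The Jaccard coefficient is used here only as a convenient example measure; the text does not prescribe it. The document identifiers, keywords, and reference labels (refA and so on) are hypothetical.

```python
def jaccard(a, b):
    """Overlap of two feature sets: |A & B| / |A | B| (0.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical documents; every keyword and reference label is made up.
docs = {
    "d1": {"keywords": {"sparse", "features", "learning"},
           "cites": {"refA", "refB", "refC"}},
    "d2": {"keywords": {"citation", "graph", "learning"},
           "cites": {"refB", "refC", "refD"}},
}

def similarity(x, y, feature):
    """Compare two documents on one feature type ('keywords' or 'cites')."""
    return jaccard(docs[x][feature], docs[y][feature])

print("keyword overlap: ", similarity("d1", "d2", "keywords"))
print("citation overlap:", similarity("d1", "d2", "cites"))
```

In this toy case the two documents share few keywords but half of their combined references, so the citation features reveal a relationship that the keywords largely miss. The two scores are reported separately rather than merged into one number, since combining heterogeneous feature types is a delicate matter.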

In summary, while keywords provide the most obvious set of features on which classifications can be based, they result in very large and sparse learning problems. Other features are also available, and may also be useful. It is important to note, however, that careful Bayesian reasoning about dependencies among keyword features is a very difficult problem, as discussed in Section §5.5.7. Any attempt to extend this inference to include other, heterogeneous types of features must be made carefully.



FOA © R. K. Belew - 00-09-21