FOA Home | UP: Building hypotheses about documents

Hypothesis spaces

However selected, the features discussed in the previous section must now be composed into hypotheses concerning how we might describe documents. There is an extraordinary range of representational alternatives here. Some of these are shown in Figure (figure).

Decision trees are formed by asking questions about individual features and using the answers to navigate through a series of tests until documents are ultimately classified at the leaves. Weighted, linear combinations of the features can also be formed. Neural networks are best viewed as non-linear compositions of weighted features [Crestani93] [Crestani94] [Gallant91] [Kwok95] [Wong93]. Boolean formulae over the features can be formed using simple conjunctive or disjunctive combinations. Our focus here will be on Bayesian networks, which attempt to represent probabilistic relationships among the features.
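To make these alternatives concrete, here is a toy sketch of three of the hypothesis forms applied to a single document represented by binary term features. Everything in it (the feature names, the weights, the particular tests) is invented for illustration and is not from the text:

```python
# One document, represented as binary term features (1 = term present).
doc = {"football": 1, "goal": 1, "election": 0}

# Decision tree: a sequence of feature tests ending in a class label.
def tree_classify(d):
    if d["election"]:
        return "politics"
    elif d["football"]:
        return "sports"
    return "other"

# Weighted, linear combination: a score summed over the features.
weights = {"football": 1.0, "goal": 0.5, "election": -1.25}
def linear_score(d):
    return sum(weights[t] * v for t, v in d.items())

# Boolean formula: a simple conjunctive/disjunctive combination.
def boolean_classify(d):
    return bool(d["football"] or (d["goal"] and not d["election"]))

print(tree_classify(doc))     # sports
print(linear_score(doc))      # 1.5
print(boolean_classify(doc))  # True
```

The same document can thus be judged by tests (the tree), by a threshold on a score (the linear combination), or by a logical condition (the Boolean formula); which form is preferable depends on the learning algorithm and its bias, discussed next.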

In any of these cases, machine learning techniques must be sensitive to their inductive bias. That is, given a fixed amount of data, we must have some a priori preference for some kinds of hypotheses over others. For example, decision tree learning algorithms [Quinlan93] prefer small trees, while neural networks prefer smooth mappings [Mitchell97].
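One way to see an inductive bias at work is to make it explicit in a search procedure. The toy sketch below (with invented training data and feature names) searches candidate hypotheses smallest-first, so that among all hypotheses consistent with the data, the one testing the fewest features wins:

```python
from itertools import combinations

# Invented training set: (binary feature dict, class label).
train = [
    ({"ball": 1, "vote": 0, "score": 1}, True),
    ({"ball": 0, "vote": 1, "score": 0}, False),
    ({"ball": 1, "vote": 0, "score": 0}, True),
    ({"ball": 0, "vote": 0, "score": 0}, False),
]
features = ["ball", "vote", "score"]

def consistent(subset):
    # Hypothesis: label is True iff any feature in `subset` is present.
    return all(any(d[f] for f in subset) == y for d, y in train)

def simplest_hypothesis():
    # Bias: enumerate feature subsets smallest-first and stop at the
    # first one consistent with all training examples.
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            if consistent(subset):
                return subset
    return None

print(simplest_hypothesis())  # ('ball',)
```

Larger subsets such as ('ball', 'score') also fit the data, but the smallest-first enumeration never reaches them: the a priori preference for simplicity is built into the search order itself.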

A common feature of all these learning algorithms is a general preference for parsimony, or simplicity. This preference is typically attributed first to William of Occam (c. 1330). OCCAM'S RAZOR has been used ever since to cleave simpler hypotheses from more complex ones.

Another motivation for the parsimony bias has been recognized more recently within machine learning: simple hypotheses are also the most likely to generalize accurately beyond the data used to train them to other, unseen data. That is, while very complicated hypotheses have a tendency to OVER-FIT the training data, the good fit they can accomplish on this set is not matched when the same classification is done on new data. The issues involved in evaluating a classifier's performance are an important topic within machine learning [Mitchell97].
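The over-fitting phenomenon can be seen even in a tiny deterministic sketch. Below, the "documents," their binary features, and the labeling rule are all invented: a maximally complicated hypothesis that simply memorizes its training documents fits them perfectly but falls to chance on unseen documents, while a simpler one-feature hypothesis generalizes:

```python
from itertools import product

# 16 invented "documents", each a tuple of 4 binary features; the true
# label depends only on the first feature.
docs = [(bits, bool(bits[0])) for bits in product((0, 1), repeat=4)]
train, test = docs[::2], docs[1::2]  # disjoint training and test splits

# Over-fit hypothesis: a lookup table memorizing the training documents.
table = dict(train)
def memorizer(d):
    return table.get(d, False)  # unseen documents default to False

# Simple hypothesis: test only the first feature.
def simple(d):
    return bool(d[0])

def accuracy(h, data):
    return sum(h(d) == y for d, y in data) / len(data)

print(accuracy(memorizer, train), accuracy(memorizer, test))  # 1.0 0.5
print(accuracy(simple, train), accuracy(simple, test))        # 1.0 1.0
```

The memorizer's perfect training accuracy is exactly the misleading "good fit" described above; only performance on held-out data reveals which hypothesis has actually captured the regularity.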


FOA © R. K. Belew - 00-09-21