When irrelevant attributes abound

In documents where "irrelevant attributes abound" [Littlestone88] (e.g., when any one document contains a small fraction of the full vocabulary, but still more features than are important to a classifier), this rapid "winnowing" of features is critical. One approach is to use BOOSTING methods, which begin with many very weak hypotheses but focus in on the most successful [Schapire98].
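The multiplicative updates behind this winnowing can be sketched with Littlestone's Winnow algorithm itself. This is a minimal illustration over hypothetical binary feature vectors, not an implementation from this text; the function name and parameter values are my own.

```python
# Sketch of Littlestone's Winnow: multiplicative (promote/demote) updates
# quickly drive the weights of irrelevant binary features toward zero.

def winnow_train(examples, n_features, alpha=2.0, iters=20):
    """examples: list of (binary feature vector, label in {0, 1})."""
    w = [1.0] * n_features
    theta = n_features / 2.0          # conventional fixed threshold
    for _ in range(iters):
        for x, y in examples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
            if pred == 1 and y == 0:   # false positive: demote active features
                w = [wi / alpha if xi else wi for wi, xi in zip(w, x)]
            elif pred == 0 and y == 1: # false negative: promote active features
                w = [wi * alpha if xi else wi for wi, xi in zip(w, x)]
    return w
```

Trained on examples whose label depends only on the first feature, the weights of the remaining (irrelevant) features are repeatedly halved on each mistake, which is the exponentially fast decay the text refers to.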

Kivinen and Warmuth's EXPONENTIATED GRADIENT (EG) algorithm extends Winnow's binary features and output to real-valued quantities [Warmuth97]. The result shares a great deal with the Widrow-Hoff model (cf. Section 7.3.1), but rather than having weight changes make an additive change in the direction of the gradient, EG recommends making a multiplicative change to each element of the document vector. Using $R_{\mathbf{d}}$ as in Equation (FOAref) above:
\[
\mathbf{d}' = \frac{\mathbf{d} \, \exp\bigl(-2 \eta \, (\mathbf{d} \cdot \mathbf{q} - R_{\mathbf{d}}) \, \mathbf{q}\bigr)}{|\mathbf{d}''|}
\]
where $\mathbf{d}'$ is the updated document vector and $|\mathbf{d}''|$ is the length of the document vector after all weights have been updated. That is, the weights are always renormalized so that their sum remains one (and each weight remains non-negative). Renormalization is an important feature of EG which, in conjunction with the multiplicative increases in those "relevant" features that are shared with a query, results in quick (i.e., exponentially fast) reduction to zero for irrelevant weights [Lewis96]. Callan has also found that, rather than training the EG classifier with a target of zero for incorrect classifications and unity for correct ones, using more moderate target values pegged to the minimum and maximum feature values was more successful [Callan98].
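One EG step as described above can be sketched as follows; the symbols mirror the equation ($\mathbf{d}$, $\mathbf{q}$, $R_{\mathbf{d}}$, $\eta$), while the function name and data are hypothetical.

```python
import math

# One exponentiated-gradient step: a multiplicative, renormalized update.
# d and q are equal-length lists; r_d plays the role of R_d in the text.

def eg_update(d, q, r_d, eta=0.1):
    err = sum(di * qi for di, qi in zip(d, q)) - r_d    # d.q - R_d
    raw = [di * math.exp(-2.0 * eta * err * qi)         # multiplicative change
           for di, qi in zip(d, q)]
    total = sum(raw)                                    # the |d''| normalizer
    return [w / total for w in raw]                     # weights again sum to one
```

Note how features absent from the query ($q_i = 0$) receive a multiplier of $\exp(0) = 1$ and so lose weight only through renormalization, while shared "relevant" features are boosted directly.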

Another very recent approach is to apply Vapnik's SUPPORT VECTOR MACHINES (SVM) [Vapnik95]. Rather than searching for dichotomizing planes within a pre-defined representational space (e.g., the hyperplanes that gradient descent methods adjust), SVMs search in the "dual" space defined by the set of training instances for KERNELS (representations) wherein the classes can be conveniently separated. As Joachims [Joachims98] has recently emphasized, the way in which these techniques avoid searching the vast space of potential keyword features seems to make SVMs a very appropriate technology for this application.
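The dual representation can be illustrated with a kernel perceptron: a simpler, mistake-driven learner than a true SVM (it does not maximize the margin), but one that shares the SVM's dual form, in which the hypothesis is a weighted sum of kernel evaluations against training instances rather than an explicit weight per keyword feature. The functions and data below are illustrative assumptions, not anything from this text.

```python
# Dual-space learning sketch: a kernel perceptron. The classifier never
# materializes the (vast) explicit feature space; it works entirely through
# kernel evaluations against the training instances.

def kernel(x, z, degree=2):
    """Polynomial kernel: an implicit higher-dimensional representation."""
    return (1 + sum(xi * zi for xi, zi in zip(x, z))) ** degree

def dual_train(data, epochs=10):
    """data: list of (vector, label in {-1, +1}); returns dual coefficients."""
    alpha = [0] * len(data)
    for _ in range(epochs):
        for i, (x, y) in enumerate(data):
            s = sum(a * yj * kernel(xj, x)
                    for a, (xj, yj) in zip(alpha, data))
            if y * s <= 0:            # mistake: weight this instance more
                alpha[i] += 1
    return alpha

def dual_predict(alpha, data, x):
    s = sum(a * yj * kernel(xj, x) for a, (xj, yj) in zip(alpha, data))
    return 1 if s > 0 else -1
```

With the degree-2 polynomial kernel this learner separates even an XOR-style labeling that no hyperplane in the original space can, which is the sense in which the kernel's implicit representation lets the classes "be conveniently separated."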

FOA © R. K. Belew - 00-09-21