
Training against manual indices

We will be especially concerned with corpora that have benefited from extensive manual indexing. For example, the articles in the Encyclopedia Britannica have benefited from man-centuries of effort applied to organizing these textual passages into coherent indices, thesauri, and taxonomies. This manual attention provides two advantages in the context of machine learning.

First, the manual classification of documents into categories can be used as training data in the context of supervised learning (§7.4). Second, manually constructed representations provide a kind of upper bound on what we can hope our automatic learning techniques will build. Ultimately, however, we can expect that the most successful applications will not oppose manual, editorial enhancement to automatic induction but will integrate learning into the editorial process. Machine learning can already do much of the job that has traditionally been done by human editors, and yet many aspects of the editorial function will remain beyond our learning techniques for the foreseeable future. Harnessing machine learning as part of an EDITOR'S WORKBENCH promises to leverage scarce editorial attention most effectively.
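To make the first advantage concrete, here is a minimal sketch of how manually assigned categories might serve as training data for a supervised learner. It is an illustration only, not the method developed in §7.4: the tiny corpus, the category names, and the helper functions tokenize, train, and classify are all hypothetical, and the learner is a plain multinomial Naive Bayes classifier with add-one smoothing chosen for brevity.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: (text, manually assigned category) pairs.
# In practice these labels would come from an editor-built index.
MANUAL_INDEX = [
    ("planets orbit the sun", "astronomy"),
    ("stars and galaxies fill the sky", "astronomy"),
    ("cells divide and grow", "biology"),
    ("plants and animals are made of cells", "biology"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train(labeled_docs):
    """Count documents per category and word occurrences per category."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        for word in tokenize(text):
            word_counts[category][word] += 1
            vocab.add(word)
    return doc_counts, word_counts, vocab


def classify(model, text):
    """Return the category with the highest smoothed log-probability."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        # Log prior estimated from the manual category frequencies.
        score = math.log(doc_counts[category] / total_docs)
        category_total = sum(word_counts[category].values())
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                count = word_counts[category][word]
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((count + 1) / (category_total + len(vocab)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


model = train(MANUAL_INDEX)
prediction = classify(model, "galaxies contain many stars")
```

The editor's judgments enter only through the labels in MANUAL_INDEX; everything else is induced from word counts. The same labels could equally be held out to measure how closely the learned classifier approaches the manual classification.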

Of course, corpora that have benefited from such careful manual attention are few and far between. Much more typical is the textual corpus without any manual indexing whatsoever. A third advantage, then, of those special corpora that do have attendant editorial enhancements is that if our learning techniques can generate analogous structures on these special collections, we can realistically expect the same techniques to generate useful structure on other collections as well.


FOA © R. K. Belew - 00-09-21