FOA Home | UP: Extracting Lexical Features

Building useful tools

The promise offered by the first chapter is that many real-world problems can be viewed as instances of the FOA problem. The proof is to be found in concrete code - a relatively small technology base that will prove useful in a wide array of applicatons. In this chapter we will present a suite of software tools that together generate a ``search engine" for a wide variety of situations. Source code is provided so that these tools can be easily modified for applications of your own. We will work through two different examples of IR systems, in order to demonstrate how slight variants of the same basic code can handle both.

Compared to the broad generalities of Chapter 1, the technical details of this chapter will sound a much different tone. Describing a complex algorithm requires the specification of many, sometimes tedious details. To make the software executable on machines that are likely to be available to you, the details are provided for several operating systems. But the processor speeds, internal memory and harddisk sizes available on computers is changing dramatically each year, so many of the assumptions on which these routines are based will require constant re-evaluation.

You will develop the software tools in three phases. The first phase will convert an arbitrary pile of textual objects into a well-defined corpus of documents, each containing a string of terms to be indexed. The second phase involves building efficient data structures to ``invert" the $Index$ relation so that, rather than seeing all the words that are in a particular document, we can find all documents containing particular keywords. All of these efforts are in anticipation of the third and final phase, which matches queries against indices in order to retrieve those that are most similar. These three major phases are central to building any search engine.

This chapter will be most concerned with the first two phases that together extract lexical features. Our goal will be the extraction of a set of features worthy of subsequent analysis. As in any cognitive science, the specification of an appropriate LEVEL OF ANALYSIS -- whether it is the resolution and depth of an image; the sub-phonemes of continuous speech; the speech acts of language ... -- the specification of this atomic feature set is the first important step.

This will involve a great deal of work, much of it unpleasant except to those who enjoy designing efficient algorithms and data-structures (some of us actually do enjoy this!:). The promise is that we will, as a consequence of good software design, develop useful tools that allow us to spend the rest of our time exploring interesting features of language.

Top of Page | UP: Extracting Lexical Features | ,FOA Home

FOA © R. K. Belew - 00-09-21