FOA Home | UP: Intra-document parsing


We have described the lexical analyzer in terms of the job it must do processing each and every document in the corpus, because this is the task that confronts us first. But because our central task will be to match these documents against users' subsequent queries, it is critical that queries receive the identical lexical analysis that was performed on the documents. This creates several implementation constraints (e.g., that the same code libraries be available to the indexer and to the query processing interface), but these are minor. If the query language is designed to support special operators (e.g., Boolean combinators, proximity operators), the query's lexical analyzer may accept a superset of the tokens accepted by the document's analyzer. In either case, sharing a single code library between the two is the easiest way to ensure that both streams are analyzed identically.
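As a minimal sketch of this idea (not the FOA code itself), the Python fragment below routes both documents and queries through one shared function. The names `analyze`, `analyze_query`, and `OPERATORS`, the lowercase alphanumeric token rule, and the three Boolean operators are all assumptions made for illustration.

```python
import re

# Hypothetical shared analyzer: imported by both the indexer and the query
# interface, so documents and queries are tokenized in exactly the same way.
_WORD = re.compile(r"[a-z0-9]+")

def analyze(text):
    """Lowercase the text and return its runs of letters and digits."""
    return _WORD.findall(text.lower())

# Assumed query operators (illustrative only); they must appear in upper case.
OPERATORS = {"AND", "OR", "NOT"}

def analyze_query(query):
    """Tokenize a query: operators pass through, everything else uses analyze()."""
    tokens = []
    for chunk in query.split():
        if chunk in OPERATORS:
            tokens.append(("OP", chunk))
        else:
            # Ordinary terms get the same treatment as document text.
            tokens.extend(("TERM", t) for t in analyze(chunk))
    return tokens
```

Here the query analyzer accepts a superset of the document analyzer's tokens: it adds the operator tokens, but every ordinary term still goes through `analyze`, so a query term can only ever take a form that the indexer could also have produced.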

It may seem nonsensical to worry so much about processing each character efficiently when we assume that some other, previous process has already identified each inter-document break. Doesn't that earlier processing require the same computational effort and, if so, doesn't this make our current efficiency worries moot?

Perhaps. A conclusive answer depends on many architecture and operating system specifics. There are two reasons we have made this assumption. The first is that the practicalities of delivering the FOA corpora and code currently make it convenient. But the more serious reason is that the most theoretically and intellectually interesting questions involve analysis of operations downstream from the first stages of inter-document parsing: how to identify tokens, how to count them, etc. If these latter operations are made especially efficient, we can afford to do more experimentation, more playfully. For a textbook, that is the primary concern.


FOA © R. K. Belew - 00-09-21