FOA Home | UP: Extracting Lexical Features

Inter-document parsing

The first step is to break the corpus -- an arbitrary ``pile of text'' -- into individually retrievable documents. This demands that we be specific about the format of the corpus, and that we decide how it is to be divided into individual documents. For all operating systems we will consider, this problem can be defined more precisely in terms of paths, directories, files, and position within file. For any application in which the corpus can be described by the path to its root, these tools will translate directories/files/documents-within-files into a homogenous corpus. Of course, there are some situations (e.g., when documents are maintained within a database) that cannot be captured in these terms, but these primitives do allow a wide range corpora to be specified.

Our model will assume that many documents may be contained within a single file, and that each document occupies a contiguous region within the file. Extend the software to allow a document to be comprised of multiple, non-contiguous textual fields.

Issues concerning structure within a single document are closely related to assumptions we may or may not be able to make about the lengths of the documents in question. Our assumptions about how long a typical document is will recur throughout this book. It is obvious, for example, that different document browsers are necessary if we need to browse through an entire book rather than look at a single paragraph. Less obvious is that the fundamental weighting algorithms used by our indexing techniques will depend very sensitively on the number of tokens contained in each document.

Take a large \textbf{LaTeX}\ document and run it repeatedly through \texttt{LaTeX2HTML}, systematically varying the logical unit of document structure at which individual HTML pages are constructed. Discuss the impact of these ``arbitrary'' decisions on the weight of the key words. \end{exercise}

In this textbook we will focus primarily on two particular test corpora, AI theses (AIT) and email; these are discussed in more detail in Section §2.4 . Each of these have natural notions of the individual document: In the case of the AIT it is the thesis's abstract, and for email it is the entire message. In both cases, more refined notions of document (the individual paragraphs within the abstract or within the email message) are possible.

With these assumptions, we can define our corpus simply with two files: one specifying full path information for each file, and a second specifying where within these files each and every message resides. A large portion of the task of navigating a directory full of files and visiting each of them can be accomplished using the dirent. { The {\tt dirent} interface began with a Berkeley Software Distribution (BSD) specification written by Kirk McKusick in the mid-1980s. It has evolved to be a part of the POSIX standard. Ports to various platformds (Linux, MSDOS, MacOS) are available [Gwyn94] .

} This utility allows the recursive descent through all directories from a specified root, visiting every file contained therein.

In many cases, the files we will be indexing will have a great deal of syntactic structural information above and beyond the meaningful text itself. For example, our email will often contain a great deal of mail header information, as (loosely:) face? specified in RFC822. Many word processing systems, for example in \TeX, XML and HTML, now produce documents with a well-defined syntax. If, for example, the documents are written in HTML, we don't want to index pseudo-words like

. In many of these situations, FILTERS exist that can extract just the meaningful text from surrounding header or format information; DeTeX {an example of a useful filter for removing \LaTeX and \TeX markup.} Use of such utilities spares us the task of parsing this elaborate structure, but it also means that more elaborate solutions for maintaining the difference between the document's index and the document's presentation must be addressed.

The basic data elements to be parsed from our two examples, email and AIT, are shown in Figure (figure) .

Top of Page | UP: Extracting Lexical Features | ,FOA Home

FOA © R. K. Belew - 00-09-21