# Inter-document parsing

The first step is to break the corpus -- an arbitrary ``pile of text'' -- into individually retrievable documents. This demands that we be specific about the format of the corpus and that we decide how it is to be divided into individual documents. For all operating systems we will consider, this problem can be defined more precisely in terms of paths, directories, files, and position within a file. For any application in which the corpus can be described by the path to its root, these tools will translate directories, files, and documents-within-files into a homogeneous corpus. Of course, there are some situations (e.g., when documents are maintained within a database) that cannot be captured in these terms, but these primitives do allow a wide range of corpora to be specified.

Our model will assume that many documents may be contained within a single file, and that each document occupies a contiguous region within that file.

\begin{exercise} Extend the software to allow a document to be composed of multiple, non-contiguous textual fields. \end{exercise}
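Under the contiguous-region assumption above, a document can be located by naming its containing file and the byte range it occupies. The sketch below is one minimal way to represent such a descriptor; the type and field names are illustrative only and are not part of any system described in this book.

```c
/* Sketch (illustrative names): one retrievable document is identified by
 * the file that contains it and the contiguous byte range it occupies. */
#include <stddef.h>

typedef struct {
    const char *path;   /* full path to the containing file          */
    long        offset; /* byte offset of the document's first byte  */
    long        length; /* number of bytes the document occupies     */
} doc_extent;
```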

Issues concerning structure within a single document are closely related to assumptions we may or may not be able to make about the lengths of the documents in question. Our assumptions about how long a typical document is will recur throughout this book. It is obvious, for example, that different document browsers are necessary if we need to browse through an entire book rather than look at a single paragraph. Less obvious is that the fundamental weighting algorithms used by our indexing techniques will depend very sensitively on the number of tokens contained in each document.
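To make the length-dependence concrete, consider the familiar tf--idf weight with cosine length normalization; this is one common scheme, not necessarily the exact weighting developed later in this book:
\[
w_{t,d} \;=\; \frac{\mathrm{tf}_{t,d}\,\log\frac{N}{\mathrm{df}_t}}
{\sqrt{\sum_{t'}\left(\mathrm{tf}_{t',d}\,\log\frac{N}{\mathrm{df}_{t'}}\right)^{2}}}
\]
where $\mathrm{tf}_{t,d}$ is the number of occurrences of term $t$ in document $d$, $\mathrm{df}_t$ is the number of documents containing $t$, and $N$ is the number of documents in the corpus. Splitting or merging documents changes the term counts, the document frequencies, $N$, and the normalizing denominator all at once, which is why the choice of document unit matters so much.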

\begin{exercise} Take a large \LaTeX\ document and run it repeatedly through \texttt{LaTeX2HTML}, systematically varying the logical unit of document structure at which individual HTML pages are constructed. Discuss the impact of these ``arbitrary'' decisions on the weights of the key words. \end{exercise}

In this textbook we will focus primarily on two particular test corpora, AI theses (AIT) and email; these are discussed in more detail in Section 2.4. Each of these has a natural notion of the individual document: in the case of AIT it is the thesis's abstract, and for email it is the entire message. In both cases, more refined notions of document (the individual paragraphs within the abstract or within the email message) are possible.

With these assumptions, we can define our corpus simply with two files: one specifying full path information for each file, and a second specifying where within these files each document resides. A large portion of the task of navigating a directory full of files and visiting each of them can be accomplished using the {\tt dirent} interface.\footnote{The {\tt dirent} interface began with a Berkeley Software Distribution (BSD) specification written by Kirk McKusick in the mid-1980s. It has evolved to become part of the POSIX standard. Ports to various platforms (Linux, MSDOS, MacOS) are available [Gwyn94].} This utility allows recursive descent through all directories from a specified root, visiting every file contained therein.
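A minimal sketch of such a traversal using the POSIX {\tt dirent} calls follows. The function names {\tt walk} and {\tt visit\_file} are our own placeholders, and production code would need fuller error handling and protection against symbolic-link cycles.

```c
/* Sketch: recursive descent from a root directory using POSIX dirent. */
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

static void visit_file(const char *path)
{
    printf("%s\n", path);            /* stand-in for per-file indexing work */
}

static void walk(const char *dir)
{
    DIR *dp = opendir(dir);
    if (dp == NULL)
        return;                       /* unreadable directory: skip it */

    struct dirent *entry;
    while ((entry = readdir(dp)) != NULL) {
        if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
            continue;                 /* avoid recursing into self/parent */

        char path[4096];
        snprintf(path, sizeof path, "%s/%s", dir, entry->d_name);

        struct stat st;
        if (stat(path, &st) != 0)
            continue;
        if (S_ISDIR(st.st_mode))
            walk(path);               /* descend into subdirectory */
        else if (S_ISREG(st.st_mode))
            visit_file(path);         /* regular file: hand off to indexer */
    }
    closedir(dp);
}

int main(int argc, char **argv)
{
    walk(argc > 1 ? argv[1] : ".");
    return 0;
}
```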

In many cases, the files we will be indexing will have a great deal of syntactic structural information above and beyond the meaningful text itself. For example, our email will often contain a great deal of mail header information, as (loosely) specified in RFC 822. Many documents are now produced in formats with a well-defined syntax, for example \TeX, XML, and HTML. If, for example, the documents are written in HTML, we don't want to index pseudo-words like