FOA Home | UP: Fine points


Dependencies on document type

The process of indexing has been idealized, as having a first stage where we worry about what kind of document it is (e.g., whether it's a thesis or an email message), and then assuming subsequent processing is completely independent of document type. Like all software designs, this idealization breaks down in the face of real data.

Consider email messages. One common element of these documents is {quoted} text from another email message. Often this is marked by a caret prefix, as shown in Figure (FOAref) . The role of inter-document citations like this is considered in depth in Section §6.1 , but for the present a reasonable design decision is that all text should be indexed only once. Especially appropriate if we have both the original email message and the quoted version of it, we might want to elide (ignore) quoted lines.

On Thu, 28 Aug 1997 23:10:03 GMT, Michael Query wrote:

>My question is, should I get another 2 Gb SCSI disk for putting the >OS (NT 4.0 WS), software, etc on, or should I get an IDE disk for this?

Having played around with different configs for a while, I'd say go SCSI. I'd do that even if I had to get a second SCSI controller.

(You'll ``hear'' a lot of people arguing that IDE is good enough, but if you are after overall improved performance SCSI is best.)

my 2Y. \caption{Quoted lines in an Email message}

Other software designs are possible, but the easiest way to implement this is to check for quoted lines within the {postdoc} routine - if the first character of a line is a caret mark, don't do any of the subsequent processing. Don't check it against noise words, don't stem, index, or install it in the term tree. Unfortunately this creates precisely the kind of Email-specific processing that should be avoided by well-engineered software.


Top of Page | UP: Fine points | ,FOA Home


FOA © R. K. Belew - 00-09-21