FOA Home | UP: A statistical basis for keyword meaning

Lexical consequences, internal/external perspectives

The plot in Figure (FOAref) is based on word-frequency statistics like those shown in Table (FOAref). Note that on this log-log plot the points fall close to a straight line of negative slope: log frequency decreases nearly linearly with log rank.
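
The near-linearity on a log-log plot can be checked numerically. The following is a minimal sketch, not the procedure used to produce the figure: it ranks words by frequency and fits a least-squares line to log frequency against log rank. The function names rank_frequency and loglog_slope are illustrative, and the "ideal" data are synthetic, constructed so that frequency is exactly proportional to 1/rank.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return token frequencies sorted from most to least frequent.

    The position in the returned list (starting at 1) is the word's rank.
    """
    counts = Counter(tokens)
    return sorted(counts.values(), reverse=True)


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Assumes at least two strictly positive frequencies, listed in rank order.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic, idealized data (an assumption, not corpus statistics):
# frequency exactly proportional to 1/rank.
ideal = [1000.0 / rank for rank in range(1, 51)]
ideal_slope = loglog_slope(ideal)
```

For the idealized data the fitted slope is -1 up to rounding error. Applied to real counts, rank_frequency(tokens) would be passed to loglog_slope in the same way, and the slope would be expected to be only approximately constant across the range of ranks.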

One way to make the lexical decisions considered in the last chapter is to examine their effects on statistics such as these. Table (FOAref) shows the statistics for the stemmed, non-noise word tokens we typically assume (shown in typewriter font, e.g. SYSTEM), together with noise words (shown in italics, e.g. the). As expected, the noise words are very frequent. But it is interesting to contrast the very frequent words defined a priori in the NEGATIVE DICTIONARY with those that are especially frequent in this particular corpus. In many ways the latter are excellent candidates for EXTERNAL KEYWORDS: characterizations of this corpus' content, from the ``external'' perspective of general language use. That is, these are exactly the words (cf. NEURAL NETWORK, BASE, LEARN, WORLD, KNOWLEDGE) that could suggest to a browsing WWW user that the AIT corpus might be worth visiting. Once ``inside'' the topical domain of AI, however, these same words become as ineffective as other noise words when used as INTERNAL KEYWORDS, which must discriminate the contents of one AI dissertation from the next (cf. SYSTEM, MODEL, PROCESS, DESIGN).
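
One possible way to operationalize the ``external'' perspective is to compare a word's relative frequency in the corpus with its relative frequency in a sample of general language, after setting aside the negative dictionary. The sketch below is only an illustration of that idea, not a method given in the text: the function names, the min_ratio threshold, and all of the counts are invented for the example.

```python
def relative_frequencies(counts):
    """Convert raw counts to proportions of the total token count."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def external_keyword_candidates(corpus_counts, general_counts,
                                negative_dictionary, min_ratio=5.0):
    """Rank words that are far more frequent in the corpus than in general use.

    min_ratio is a hypothetical threshold chosen for illustration.
    """
    corpus_rel = relative_frequencies(corpus_counts)
    general_rel = relative_frequencies(general_counts)
    # Floor for words never seen in the general-language sample,
    # so that the ratio below is always defined.
    floor = 1.0 / (sum(general_counts.values()) + 1)
    candidates = []
    for word, rel in corpus_rel.items():
        if word in negative_dictionary:
            continue  # a priori noise words are never candidates
        ratio = rel / general_rel.get(word, floor)
        if ratio >= min_ratio:
            candidates.append((word, ratio))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Invented toy counts, NOT statistics from the AIT corpus.
corpus_counts = {"the": 500, "system": 200, "learn": 100, "house": 10}
general_counts = {"the": 600, "system": 10, "learn": 10, "house": 30}
negative_dictionary = {"the"}

candidates = external_keyword_candidates(
    corpus_counts, general_counts, negative_dictionary)
candidate_words = [word for word, _ in candidates]
```

With these toy counts the candidates are "system" and "learn", in that order. Note that a high ratio says nothing about whether a word discriminates among documents inside the corpus; a word used in nearly every document would still fail as an internal keyword.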

Also shown in Table (FOAref) are statistics both with and without stemming. For example, the token SYSTEM itself appeared only 8632 times; variations like SYSTEMS, SYSTEMATIC, etc. must account for the other 12856 occurrences counted under the stem. This simple example also demonstrates how issues of phrase recognition (cf. NEURAL NETWORK) and other messy issues (e.g., the presence of French noise words in some of the dissertation abstracts but not in our English negative dictionary!) can arise in even the simplest, ``cleanest'' corpora.
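
The split between exact-token counts and stemmed counts can be sketched as follows. The figures quoted above imply a stemmed total of 8632 + 12856 = 21488 for SYSTEM. The code is a minimal illustration on an invented token list; crude_stem is a toy suffix stripper written for this example and is an assumption, not the stemmer used to build the table.

```python
from collections import Counter

# Toy suffix list, chosen only so the example below works.
SUFFIXES = ("atic", "s")


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def stem_breakdown(tokens):
    """Map each stem to a Counter of the surface forms that reduce to it."""
    by_stem = {}
    for token in tokens:
        by_stem.setdefault(crude_stem(token), Counter())[token] += 1
    return by_stem


# Invented token list, not drawn from the corpus.
tokens = ["system", "systems", "systematic", "system", "model", "models"]
breakdown = stem_breakdown(tokens)

exact = breakdown["system"]["system"]              # occurrences of the bare token
stemmed_total = sum(breakdown["system"].values())  # all forms sharing the stem
from_variants = stemmed_total - exact              # contributed by variant forms
```

Keeping the per-form counts, rather than only the stemmed total, makes it easy to see how much of a stem's frequency comes from its variants.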


FOA © R. K. Belew - 00-09-21