The first constraint we can apply to the set of keywords we will allow
in our vocabulary is to define a DOMAIN OF DISCOURSE - the
subject area within which each and every user of our search engine is
assumed to be searching. While we might imagine building a truly
encyclopedic reference work, one capable of answering questions
about any topic whatsoever, it is much more common to build a
search engine with more limited goals, capable of answering questions
about some particular subject. We will choose the simpler path
(it will prove enough of a challenge!), and focus on a particular topic.
To be concrete, throughout this text we will assume that the domain of
discourse is {\tt ARTIFICIAL INTELLIGENCE} (AI). Briefly, AI can be
defined as a sub-discipline of computer science, especially concerned
with algorithms that mimic inferences which, had they been made by a
human, would be considered ``intelligent.'' It typically includes such
topics as {\tt KNOWLEDGE REPRESENTATION, MACHINE LEARNING, ROBOTICS},
etc.
Thus \term{COMPUTER SCIENCE} is a BROADER TERM than \term{ARTIFICIAL INTELLIGENCE}.
This HYPERNYM relationship between the two phrases is something
we will return to later (cf. Section 6.3). For example, our task becomes more
difficult if we assume that the corpus of documents contains materials
on the broader topic of \term{COMPUTER SCIENCE}, rather than just (!)
\term{ARTIFICIAL INTELLIGENCE}. Conversely, the topics \term{KNOWLEDGE
REPRESENTATION}, \term{MACHINE LEARNING}, \term{ROBOTICS} are all
NARROWER TERMS, and our task would, {\em caeteris
paribus\/}\footnote{(Assuming) all other things being equal.}, be made
easier if we only had to help users FOA one of them.
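
The broader/narrower relationships just described can be sketched as a small lookup table. This is only an illustration, not a structure taken from the text: the `BROADER` dictionary and the two helper functions are hypothetical names, and the sketch assumes each term has at most one immediate broader term.

```python
# Hypothetical toy thesaurus: each term maps to its single broader term.
# The terms come from the text; the single-parent structure is an assumption.
BROADER = {
    "ARTIFICIAL INTELLIGENCE": "COMPUTER SCIENCE",
    "KNOWLEDGE REPRESENTATION": "ARTIFICIAL INTELLIGENCE",
    "MACHINE LEARNING": "ARTIFICIAL INTELLIGENCE",
    "ROBOTICS": "ARTIFICIAL INTELLIGENCE",
}


def broader_terms(term):
    """Return the chain of broader terms for `term`, nearest first."""
    chain = []
    while term in BROADER:
        term = BROADER[term]
        chain.append(term)
    return chain


def narrower_terms(term):
    """Return the terms whose immediate broader term is `term`, sorted."""
    return sorted(t for t, parent in BROADER.items() if parent == term)


print(broader_terms("ROBOTICS"))
print(narrower_terms("ARTIFICIAL INTELLIGENCE"))
```

Walking up from \term{ROBOTICS} passes through \term{ARTIFICIAL INTELLIGENCE} to \term{COMPUTER SCIENCE}, while walking down from \term{ARTIFICIAL INTELLIGENCE} lists its three narrower terms.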
Constraining the
vocabulary so that it is EXHAUSTIVE enough that any imaginable
topic is expressible within the language, while remaining
SPECIFIC enough that any particular subjects a user is likely to
investigate can be distinguished from others, will become a central goal
of our design. \term{ROBOTICS}, for example, would seem a descriptive
keyword because it identifies a relatively small sub-area of
\term{ARTIFICIAL INTELLIGENCE}. \term{COMPUTER SCIENCE} would be silly
as a keyword (for this corpus), as we are assuming it would apply to
each and every document and hence does nothing to discriminate them - it
is too exhaustive. At the other extreme, \term{ROBOTIC VACUUM CLEANERS
FOR 747 AIRLINERS} is almost certainly too specific.
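
One rough way to see the trade-off is to count the fraction of documents to which a keyword applies. The sketch below is illustrative only: the four-document corpus is invented, the function name `document_fraction` is hypothetical, and each document is assumed to be represented simply as a set of assigned keywords.

```python
def document_fraction(keyword, corpus):
    """Fraction of documents in `corpus` whose keyword set contains `keyword`."""
    if not corpus:
        return 0.0
    return sum(keyword in doc for doc in corpus) / len(corpus)


# Invented toy corpus: every document is about AI, hence about computer science.
corpus = [
    {"COMPUTER SCIENCE", "ARTIFICIAL INTELLIGENCE", "ROBOTICS"},
    {"COMPUTER SCIENCE", "ARTIFICIAL INTELLIGENCE", "MACHINE LEARNING"},
    {"COMPUTER SCIENCE", "ARTIFICIAL INTELLIGENCE", "KNOWLEDGE REPRESENTATION"},
    {"COMPUTER SCIENCE", "ARTIFICIAL INTELLIGENCE", "MACHINE LEARNING"},
]

for kw in ("COMPUTER SCIENCE", "ROBOTICS",
           "ROBOTIC VACUUM CLEANERS FOR 747 AIRLINERS"):
    print(kw, document_fraction(kw, corpus))
```

In this toy corpus \term{COMPUTER SCIENCE} applies to every document (fraction 1.0) and so separates nothing, \term{ROBOTICS} picks out a small subset (0.25), and the overly specific phrase matches no document at all (0.0).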
The VOCABULARY
SIZE -- the total number of keywords -- depends on many factors,
including the scope of the domain of discourse. A typical language user
has a reading vocabulary of approximately 50,000 words. Web search
engines and large test corpora formed from the union of many document
types may require vocabularies ten times this large. It is unlikely that
such a large lexicon of keywords is required for restricted corpora, but
it is also true that even a narrow field can develop an extensive,
specialized JARGON or TERMS OF ART. In practice, search
engines typically have difficulty reducing the number of usable keywords
much below 10,000.
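
As a minimal sketch of what measuring vocabulary size involves, the code below counts the distinct word tokens in a handful of documents. The tokenization rule (lowercased runs of the letters a to z) and the function name `vocabulary` are simplifying assumptions made here for illustration; the two sample sentences are invented.

```python
import re


def vocabulary(documents):
    """Return the set of distinct lowercased word tokens across `documents`."""
    vocab = set()
    for text in documents:
        # Assumption: a token is any maximal run of the letters a-z.
        vocab.update(re.findall(r"[a-z]+", text.lower()))
    return vocab


docs = [
    "Robots learn to plan.",
    "Machine learning helps robots plan.",
]
vocab = vocabulary(docs)
print(len(vocab), sorted(vocab))
```

Note that this naive rule treats "learn" and "learning" as separate keywords, so how tokens are defined is itself one of the factors that determines the vocabulary size.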