By TEST CORPORA we refer to collections of documents that also
have associated with them a series of queries for which RELEVANCE
ASSESSMENTS are available. One of the earliest such test sets was a
collection of 1400 research papers on aerodynamics developed by C.
Cleverdon in the mid-1960s, known as the Cranfield corpus [Cleverdon63]. For most of the
1980s, a set of corpora known as CACM, CISI, INSPEC, MED and NPL
(sometimes referred to as the CORNELL CORPORA) were developed,
maintained and distributed by Gerald Salton and his students at Cornell
and became the {\em de facto} standard for testing within the IR
community. For some time the most influential test corpora have been the
TREC corpora associated
with the Text REtrieval Conference (TREC) meetings \cite{Harman95}.
Table
(FOAref) gives a sample of statistics for a number of the most
widely used corpora. One obvious trend is the increasing size of these
collections over time. The Reuters
corpus provides classification labels that are invaluable for training
classifiers (cf. Section §7.4 ). Along with
our AIT corpus, the OHSUMED collection [Hersh94]
is one of the few to provide multiple relevance assessments of the same
$\langle q,d \rangle$ pair.
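A test corpus in this sense can be pictured as three linked pieces: documents, queries, and relevance assessments keyed by $\langle q,d \rangle$ pair. The Python sketch below is only illustrative. The class name `TestCorpus`, its fields, the `agreement` measure and the toy data are all assumptions made for this example, not the format of any actual corpus. Keeping a list of judgments per pair is one simple way to accommodate collections that record more than one assessment of the same pair.

```python
from dataclasses import dataclass, field


@dataclass
class TestCorpus:
    """Hypothetical container: documents, queries, relevance assessments."""
    docs: dict = field(default_factory=dict)       # doc_id -> text
    queries: dict = field(default_factory=dict)    # query_id -> text
    # (query_id, doc_id) -> list of judgments, one per assessor
    judgments: dict = field(default_factory=dict)

    def assess(self, qid, did, relevant):
        """Record one assessor's judgment of the <q,d> pair."""
        self.judgments.setdefault((qid, did), []).append(bool(relevant))

    def agreement(self, qid, did):
        """Fraction of assessors siding with the majority (None if unjudged)."""
        votes = self.judgments.get((qid, did), [])
        if not votes:
            return None
        yes = sum(votes)
        return max(yes, len(votes) - yes) / len(votes)


corpus = TestCorpus()
corpus.docs["d1"] = "A trial of AZT for AIDS patients."
corpus.queries["q1"] = "AIDS treatments"
for vote in (True, True, False):       # three assessors, toy data
    corpus.assess("q1", "d1", vote)
print(corpus.agreement("q1", "d1"))
```

With three toy assessors, two of whom judge the document relevant, the majority share for the pair is two thirds; a pair with no judgments yields `None`.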
Table (FOAref) shows a sample query
from the TREC experiments.
\footnotesize
1 > Science and Technology \\
2 > AIDS treatments \\
3 > Document will mention a specific AIDS or ARC treatment. \\
4 > To be relevant, a document must include a reference to at least one specific potential Acquired Immune Deficiency Syndrome (AIDS) or AIDS Related Complex treatment. \\
5 > 1. Acquired Immune Deficiency Syndrome (AIDS), AIDS Related Complex (ARC) \\
6 > 2. treatment, drug, pharmaceutical \\
7 > 3. test, trials, study \\
8 > 4. AZT, TPA \\
9 > 5. Genentech, Burroughs-Wellcome \\
10 > ARC - AIDS Related Complex. \\
11 > A set of symptoms similar to AIDS. \\
12 > AZT - Azidothymidine, a drug for the treatment of Acquired Immune Deficiency Syndrome, its related pneumonia, and for severe AIDS Related Complex. \\
13 > TPA - Tissue Plasminogen Activator - a blood clot-dissolving drug. \\
14 > treatment - any drug or procedure used to reduce the debilitating effects of AIDS or ARC.
\caption{TREC query}

For this query, and hundreds of others like it,
considerable manual effort has gone into assessing whether documents in
the TREC corpus should be considered ``relevant'' or not. Note
especially the way the ``basic'' query (Line 2) has been embellished with
general and specific topical orientation (Lines 1 and 3), and how important terms
and abbreviations have been explicated. This is much more
information than most users typically provide, but it also allows much
more refined assessment of systems' performance.
As the testing
procedures of the TREC participants have developed over the years,
multiple ``tracks'' have formed, corresponding to typical search engine
usage patterns. The task on which we have focused throughout this
section is termed AD HOC RETRIEVAL , in the sense that a constant
corpus is repeatedly searched with respect to a series of ad
hoc queries. This is distinguished from the ROUTING task,
which assumes a relatively constant, standing set of queries (for
example, corresponding to the interests of various employees of the same
corporation). Then, an ongoing stream of documents is compared against these queries, with
relevant documents routed to the appropriate recipients.
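The difference between the two tasks is mostly a matter of which side is held constant. The Python sketch below makes this concrete under loudly stated assumptions: `matches` is a toy rule (every query term must occur in the document) standing in for a real retrieval model, and the function names, user names and data are hypothetical.

```python
def matches(query, doc):
    """Toy matching rule (an assumption): every query term occurs in the doc."""
    return set(query.lower().split()) <= set(doc.lower().split())


def ad_hoc(corpus, queries):
    """Constant corpus, changing queries: yield the hits for each new query."""
    for q in queries:
        yield q, [d for d in corpus if matches(q, d)]


def route(profiles, stream):
    """Standing queries, changing documents: yield recipients per document."""
    for d in stream:
        yield d, [user for user, q in profiles.items() if matches(q, d)]


# Ad hoc retrieval: the corpus stays put, the queries vary.
corpus = ["aids treatment trial", "wing aerodynamics study"]
hits = dict(ad_hoc(corpus, ["aids trial", "aerodynamics"]))

# Routing: the standing queries stay put, the documents stream past.
profiles = {"alice": "aids treatment", "bob": "aerodynamics"}
routed = dict(route(profiles, ["new aids treatment approved", "stock report"]))

print(hits)
print(routed)
```

Note that the same matching function serves both tasks; only the loop structure changes, which is why the two are often evaluated with closely related measures.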
More recently, a
special type of routing termed FILTERING has also been
considered. In the filtering task, the standing query is allowed to
adapt to the stream of RelFbk generated by the users as they
receive and evaluate routed documents (cf. Section §7.3 ).
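To see how an adapting standing query differs from a fixed one, consider the following minimal Python sketch. It is not the method of any TREC participant: the additive weight update is a simplified stand-in for relevance feedback, and the names `score`, `update`, `filter_stream`, the `rate` and `threshold` parameters, and the `judge` function (playing the role of the user) are all assumptions made for illustration.

```python
def score(profile, doc):
    """Sum the profile weights of the distinct terms appearing in the doc."""
    return sum(profile.get(t, 0.0) for t in set(doc.lower().split()))


def update(profile, doc, relevant, rate=0.5):
    """Shift weights toward (or away from) the terms of a judged document."""
    delta = rate if relevant else -rate
    for t in set(doc.lower().split()):
        profile[t] = profile.get(t, 0.0) + delta


def filter_stream(profile, stream, judge, threshold=1.0):
    """Deliver documents scoring at or above threshold, adapting as we go."""
    delivered = []
    for doc in stream:
        if score(profile, doc) >= threshold:
            delivered.append(doc)
            update(profile, doc, judge(doc))   # feedback on delivered docs only
    return delivered


profile = {"aids": 1.0}                        # initial standing query
stream = ["aids azt trial", "aids charity concert", "azt trial results"]
judge = lambda doc: "azt" in doc               # stand-in for a user's feedback
delivered = filter_stream(profile, stream, judge)
print(delivered)
```

In this toy run the third document shares no term with the initial standing query, so a fixed query would never deliver it; it is delivered only because feedback on the first document raised the weights of the terms it shares with that document.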