# Test corpora

By TEST CORPORA we refer to collections of documents that are accompanied by a series of queries for which RELEVANCE ASSESSMENTS are available. One of the earliest such test sets was a collection of 1400 research papers on aerodynamics developed by C. Cleverdon in the mid-1960s, known as the Cranfield corpus [Cleverdon63]. For most of the 1980s, a set of corpora known as CACM, CISI, INSPEC, MED and NPL (sometimes referred to as the CORNELL CORPORA) was developed, maintained and distributed by Gerard Salton and his students at Cornell, and became the *de facto* standard for testing within the IR community. For some time now, the most influential test corpora have been the TREC corpora associated with the Text REtrieval Conference (TREC) meetings [Harman95].

Table (FOAref) gives a sample of statistics for a number of the most widely used corpora. One obvious trend is the increasing size of these collections over time. The Reuters corpus provides classification labels that are invaluable for training classifiers (cf. Section §7.4). Along with our AIT corpus, the OHSUMED corpus [Hersh94] is one of the few to provide multiple relevance assessments of the same $\langle q,d \rangle$ pair.
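As a minimal sketch of what such a test corpus holds, the structure below stores documents, queries, and a list of judgments for each $\langle q,d \rangle$ pair, so that several assessors can judge the same pair. The class and method names (`EvalCorpus`, `add_assessment`, `majority_relevant`) are hypothetical, and the majority-vote rule for combining judgments is an assumption made for illustration, not the procedure used by any of the corpora named above.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class EvalCorpus:
    """Hypothetical container: documents, queries, and relevance assessments."""
    documents: dict = field(default_factory=dict)  # doc_id -> text
    queries: dict = field(default_factory=dict)    # query_id -> text
    # (query_id, doc_id) -> list of boolean judgments, one per assessor
    assessments: dict = field(default_factory=lambda: defaultdict(list))

    def add_assessment(self, query_id, doc_id, relevant):
        """Record one assessor's judgment of a <q,d> pair."""
        self.assessments[(query_id, doc_id)].append(bool(relevant))

    def majority_relevant(self, query_id, doc_id):
        """True if more than half of the recorded judgments say 'relevant'.

        Pairs with no judgments are treated as non-relevant (an assumption).
        """
        votes = self.assessments.get((query_id, doc_id), [])
        return 2 * sum(votes) > len(votes)

    def relevant_docs(self, query_id):
        """Set of doc ids judged relevant to query_id by majority vote."""
        return {
            d for (q, d) in self.assessments
            if q == query_id and self.majority_relevant(q, d)
        }
```

Keeping every individual judgment, rather than only the combined verdict, is what makes it possible to study disagreement among assessors later.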

Table (FOAref) shows a sample query from the TREC experiments.

1 > Science and Technology
2 > AIDS treatments
3 > Document will mention a specific AIDS or ARC treatment.
4 > To be relevant, a document must include a reference to at least one specific potential Acquired Immune Deficiency Syndrome (AIDS) or AIDS Related Complex (ARC) treatment.
5 > 1. Acquired Immune Deficiency Syndrome (AIDS), AIDS Related Complex (ARC)
6 > 2. treatment, drug, pharmaceutical
7 > 3. test, trials, study
8 > 4. AZT, TPA
9 > 5. Genentech, Burroughs-Wellcome
10 > ARC - AIDS Related Complex.
11 > A set of symptoms similar to AIDS.
12 > AZT - Azidothymidine, a drug for the treatment of Acquired Immune Deficiency Syndrome, its related pneumonia, and for severe AIDS Related Complex.
13 > TPA - Tissue Plasminogen Activator - a blood clot-dissolving drug.
14 > treatment - any drug or procedure used to reduce the debilitating effects of AIDS or ARC.

Table: TREC query

For this query, and hundreds of others like it, considerable manual effort has gone into assessing whether documents in the TREC corpus should be considered "relevant" or not. Note especially the way the "basic" query (Line 2) has been embellished with general and specific topical orientation (Lines 1, 3), and how important terms and abbreviations have been explicated. This is much more information than most users typically provide, but it also allows much more refined assessments of systems' performance.

As the testing procedures of the TREC participants have developed over the years, multiple "tracks" have formed, corresponding to typical search engine usage patterns. The task on which we have focused throughout this section is termed AD HOC RETRIEVAL, in the sense that a constant corpus is repeatedly searched with respect to a series of ad hoc queries. This is distinguished from the ROUTING task, which assumes a relatively constant, standing set of queries (for example, corresponding to the interests of various employees of the same corporation). An ongoing stream of documents is then compared against these standing queries, with relevant documents routed to the appropriate recipients.
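The contrast between the two tasks can be sketched in a few lines: in ad hoc retrieval the corpus is fixed and the query varies, while in routing the queries are fixed and the documents arrive one at a time. The function names, the term-overlap `score`, and the `threshold` parameter below are all illustrative assumptions, not the matching method of any TREC system.

```python
def score(query_terms, doc_text):
    """Toy matching score: number of distinct query terms found in the document.

    Query terms are assumed to be lowercase already.
    """
    doc_terms = set(doc_text.lower().split())
    return len(set(query_terms) & doc_terms)


def ad_hoc_search(corpus, query_terms, k=10):
    """Ad hoc retrieval: a constant corpus searched with a new query each call.

    Returns up to k doc ids with a nonzero score, best first
    (ties broken by doc id).
    """
    scored = [(score(query_terms, text), doc_id) for doc_id, text in corpus.items()]
    ranked = sorted((sd for sd in scored if sd[0] > 0), key=lambda sd: (-sd[0], sd[1]))
    return [doc_id for _, doc_id in ranked[:k]]


def route(standing_queries, doc_stream, threshold=1):
    """Routing: constant standing queries compared against a stream of documents.

    Yields (doc_id, recipient) for every standing query the document matches.
    """
    for doc_id, text in doc_stream:
        for recipient, query_terms in standing_queries.items():
            if score(query_terms, text) >= threshold:
                yield doc_id, recipient
```

Note that `ad_hoc_search` needs the whole corpus in hand before it can rank anything, whereas `route` must decide about each document as it arrives, without seeing the rest of the stream.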

More recently, a special type of routing termed FILTERING has also been considered. In the filtering task, the standing query is allowed to adapt to the stream of relevance feedback (RelFbk) generated by the users as they receive and evaluate routed documents (cf. Section §7.3).
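One way to picture filtering is as a standing query whose term weights shift with each piece of feedback. The sketch below is a hedged illustration only: the class `AdaptiveFilter`, its additive update rule, and the `threshold` and `rate` parameters are assumptions chosen for simplicity, not the adaptation method of any particular TREC filtering system.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


class AdaptiveFilter:
    """Hypothetical standing query whose term weights adapt to user feedback."""

    def __init__(self, initial_terms, threshold=1.0, rate=0.5):
        self.weights = Counter({t: 1.0 for t in initial_terms})
        self.threshold = threshold  # minimum score for a document to be passed on
        self.rate = rate            # size of each feedback adjustment

    def score(self, text):
        # Sum the weights of the distinct terms the document contains;
        # terms never seen before contribute 0.
        return sum(self.weights[t] for t in set(tokenize(text)))

    def accepts(self, text):
        """Would this document be passed on to the user?"""
        return self.score(text) >= self.threshold

    def feedback(self, text, relevant):
        # Raise the weight of every term in a document judged relevant,
        # and lower it for a document judged non-relevant.
        delta = self.rate if relevant else -self.rate
        for t in set(tokenize(text)):
            self.weights[t] += delta
```

After a few positive judgments, terms that never appeared in the initial query can come to trigger acceptance on their own, which is the adaptive behavior that separates filtering from plain routing.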

FOA © R. K. Belew - 00-09-21