Traditional evaluation methodologies

Before surveying all of the ways that evaluation might be performed, it is worthwhile sketching how it has typically been done in the past [Cleverdon63]. In the beginning, computers were slow and had very limited disk space and even more limited memories, so initial test corpora needed to be small, too. One benefit of these small corpora was that they allowed at least the possibility of comparing a set of test queries exhaustively against each and every document in the corpus.

The source of these test queries, and the assessment of their relevance, varied in early experiments. For example, in the Cranfield experiments [Lancaster68], 1400 documents in metallurgy were searched according to 221 queries generated by some of the documents' authors. In Salton's experiments with the ADI corpus, 82 papers presented at a 1963 American Documentation Institute meeting were searched against 35 queries and evaluated by students and ``staff experts'' associated with Salton's lab [Salton68]. Lancaster's construction of the MEDLARS collection was similar [Lancaster69].

As computers have increased in capacity, reasonable evaluation has required much larger corpora. The Text Retrieval Conference (TREC), begun in 1992 and continuing to the present, has set a new standard for search engine evaluation [Harman95]. The TREC methodology is notable in several respects. First, it avoids exhaustive assessment of all documents by using the POOLING method, a proposal for the construction of ``ideal'' test collections that predates TREC by decades [SparckJones76]. The basic idea is to run each search engine independently and then ``pool'' their results to form a set of those documents that have at least this recommendation of potential relevance. Each search engine retrieves a ranked list of $k$ potentially relevant documents, and the union of these retrieved sets is presented to human judges for relevance assessment.
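The pooling step can be sketched in a few lines of Python. This is a minimal illustration only: the function name `build_pool`, the system names, and the document identifiers are all hypothetical, and a real pool would be built per query from much longer ranked lists.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], k: int) -> Set[str]:
    """Return the union of the top-k documents from every system's ranked list."""
    pool: Set[str] = set()
    for ranking in runs.values():
        # Only the k highest-ranked documents from each system are nominated.
        pool.update(ranking[:k])
    return pool


# Hypothetical ranked lists for a single query, best document first.
runs = {
    "system_a": ["d3", "d1", "d7", "d2"],
    "system_b": ["d1", "d9", "d3", "d5"],
    "system_c": ["d4", "d1", "d8", "d6"],
}

pool = build_pool(runs, k=2)
print(sorted(pool))  # the documents that would be sent to the human judges
```

Because the pool is a set union, a document nominated by several systems is judged only once, which is where the savings over exhaustive assessment come from.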

In the case of TREC, $k=100$ and the human assessors were retired security analysts, like those who work at the National Security Agency (NSA) watching the world's communications. Since only documents retrieved by at least one of the systems being tested are evaluated, there remains the possibility that relevant documents go undiscovered, and we might worry that our evaluations will change as new systems retrieve new documents and these are evaluated. Recent analysis suggests that, at least in the case of the TREC corpus, evaluations are in fact quite stable [Voorhees98].

An important consequence of this methodological convenience is that unassessed documents are assumed to be irrelevant. This creates an unfortunate dependence on the retrieval methods used to nominate documents, which we can expect to be most pronounced when the methods are similar to one another. For example, if the alternative retrieved sets are the result of manipulating single parameters of the same basic retrieval procedure, the resulting assessments may have little overlap with, and hence be useless for comparison of, methods producing significantly different retrieval sets. For the TREC collection, this problem was handled by drawing the top 200 documents from a wide range of 25 methods which had little overlap [Harman95]. Vogt [Vogt98] has explored how similarities and differences between retrieval methods can be similarly exploited as part of combined, hybrid retrieval systems (cf. Section 7.4.4).
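The effect of treating unassessed documents as irrelevant can be seen in a small sketch. The function `precision_recall`, the judgments in `qrels`, and both ranked lists are hypothetical; the point is only that a system which did not contribute to the judged set is scored as if every unjudged document it retrieves were irrelevant.

```python
from typing import Dict, List, Tuple


def precision_recall(ranking: List[str], qrels: Dict[str, bool], k: int) -> Tuple[float, float]:
    """Precision and Recall at cutoff k; documents absent from qrels count as irrelevant."""
    retrieved = ranking[:k]
    # qrels.get(d, False) encodes the assumption: unjudged means irrelevant.
    hits = sum(1 for d in retrieved if qrels.get(d, False))
    total_relevant = sum(1 for is_relevant in qrels.values() if is_relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / total_relevant if total_relevant else 0.0
    return precision, recall


# Hypothetical judgments covering only the documents that were assessed.
qrels = {"d1": True, "d3": False, "d4": True, "d9": False}

judged_run = ["d1", "d4", "d3", "d9"]   # every document here was assessed
new_run = ["d1", "d5", "d6", "d4"]      # d5 and d6 were never assessed

print(precision_recall(judged_run, qrels, k=2))
print(precision_recall(new_run, qrels, k=2))
```

If `d5` were in fact relevant, the second system would be penalized twice: its Precision is understated, and the total number of relevant documents used for Recall is too small for every system.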

It is also possible to sample a small subset of a corpus, submit the entire sample to review by the human expert, and extrapolate from the number of relevant documents found to an expected number across the entire corpus. One famous example of this methodology is Blair \& Maron's assessment of IBM's STAIRS retrieval system in the early 1980s [Blair85]. This evaluation studied the real-world use of STAIRS by a legal firm as part of a LITIGATION SUPPORT task: 40,000 memos, design documents, etc. were to be searched with respect to 51 different queries. The lawyers themselves then agreed to evaluate documents' relevance. As they report:

``To find the unretrieved relevant documents we developed sample frames consisting of subsets of the unretrieved database that we believed to be rich in relevant documents\ldots. Random samples were taken from these subsets, and the samples were examined by the lawyers in a blind evaluation; the lawyers were not aware they were evaluating sample sets rather than retrieved sets they had personally generated. The total number of relevant documents that existed in these subsets could then be estimated. We sampled from subsets of the database rather than the entire database because, for most queries, the percentage of relevant documents in the database was less than 2\%, making it almost impossible to have both manageable sample sizes and a high level of confidence in the resulting Recall estimates. Of course, no extrapolation to the entire database could be made from these Recall calculations. Nonetheless, the estimation of the number of relevant unretrieved documents in the subsets did give us a maximum value for Recall for each request.'' [p. 291--293]

This is a difficult methodology, but it still allows some of the best estimates of Recall available. And their news was not good: on average, retrievals captured only 20\% of Relevant documents!
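The arithmetic behind this kind of estimate can be sketched as follows. The functions `estimate_relevant` and `max_recall` and all of the numbers are hypothetical illustrations, not figures from [Blair85]; the sketch also assumes a simple random sample drawn from a single subset of the unretrieved documents.

```python
def estimate_relevant(subset_size: int, sample_size: int, relevant_in_sample: int) -> float:
    """Scale the proportion of relevant documents in the sample up to the whole subset."""
    return subset_size * relevant_in_sample / sample_size


def max_recall(relevant_retrieved: int, estimated_unretrieved_relevant: float) -> float:
    """Upper bound on Recall: relevant documents outside the sampled subset are not counted."""
    return relevant_retrieved / (relevant_retrieved + estimated_unretrieved_relevant)


# Hypothetical numbers: judges find 9 relevant documents in a random sample
# of 150 drawn from a 1500-document subset of the unretrieved collection.
estimated = estimate_relevant(subset_size=1500, sample_size=150, relevant_in_sample=9)

# Suppose the query itself retrieved 30 documents judged relevant.
bound = max_recall(relevant_retrieved=30, estimated_unretrieved_relevant=estimated)

print(estimated, bound)
```

The result is a maximum rather than an estimate of true Recall because any relevant documents lying outside the sampled subsets would enlarge the denominator further.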

In short, methodologies for valid search engine evaluations require much more sophistication and care than is generally appreciated. Careful experimental design [Tague-Sutcliffe92], statistical analysis [Hull93] and presentation [Keen92] are all critical.


FOA © R. K. Belew - 00-09-21