FOA Home | UP: INDIVIDUALS' assessment: search engine performance

Ordering the \Ret set

Do not worry about large numbers of results: the best ones come first! [AltaVista, 1998]

The next step is to move beyond thinking of as simply a set. We will suppose that retrieved documents are returned {\em in some order} by the search engine, reflecting its assessment of how well each document matches the query. Following current Web vernacular, we will call this ordering of the \Ret set a HITLIST and a retrieved document's position its HITLIST RANK ${\bf \mathname{Rank}}(d_{i})$. This is a positive integer assigned to each document in the \Ret set, in descending order of similarity with respect to a the matching function ${\bf \mathname{Match}}(q,d)$: {\bf \mathname{Match}}(q,d) \;\in \Re \nonumber \\ {\bf \mathname{Rank}}(d) \;\in J^{+} \nonumber \\ {\bf \mathname{Rank}}(d_{i}) < {\bf \mathname{Rank}}(d_{j}) \Longleftrightarrow {\bf \mathname{Match}}(q,d_{i}) > {\bf \mathname{Match}}(q,d_{j})

Sparck Jones [SparckJones72] and others have historically referred to a document's rank in \Ret as its COORDINATION LEVEL . Strictly speaking, coordination level refers to the number of keywords shared by document and query. In Boolean retrieval systems, sensitive only to the presence or absence of keywords, ranking by coordination level may be the only measure on document/query similarity availability.

For long queries, hitlist rank and coordination level are likely to be similar, since it is unlikely that different documents will match exactly the same number of words from the query. But for short queries it is likely that coordination level will only partially order the \Ret set. This is why \vanR{161}, speaking of the Boolean systems typical at that time, says, ``Unfortunately, the ranking generated by a matching function is rarely a simple ordering, but more commonly a weak ordering.'' Most modern search engines, however, exploit keyword weightings and can provide much more refined measures, thereby providing a TOTAL ORDERING of the hitlist.

According to the Probability Ranking Principle (cf. Section §5.5.1 ), a retrieval system is performing optimally just in case it retrieves documents in order of decreasing probability of relevance. For now we simply assume that there is a total ordering imposed over \Ret. We will use the hitlist ranking to effectively define a series of retrievals. Setting a very high threshold on this ordering would mean retrieving a very small set, while setting a lower threshold will retrieve a much larger one.

Now consider a particular query $q$ and the set $Rel_{q}$ of relevant documents associated with it. Assuming that is totally ordered makes it possible for us to define the fundamental analytic tool for search engine performance: the RECALL/PRECISION CURVE (Re/Pre Curve). The basic procedure is to consider each retrieved document in hitlist rank order and ask the precision and recall of a retrieval of all documents up to and including this one.

Consider the first of the two hypothetical retrievals shown in Table (FOAref) .

With respect to this query, we will assume there are exactly five relevant documents out of a total of 25 in the corpus. The very first one retrieved is deemed relevant; if we stopped retrieval at this point, our recall would be 0.2 (since we have retrieved 1 of 5 relevant documents) and our precision is perfect (the one retrieved document is relevant). Our good luck continues as we consider the next document, also relevant; this generates a second Re/Pre data point of (0.4,1.0). We are not so lucky with the third document retrieved , and precision drops to 2/3 while recall remains at 0.4. Proceeding down the retrieval in Rank order, and plotting each and every point in this fashion gives the Re/Pre curve shown in Figure (figure) .

At this point we can already make several observations . Asymptotically, we know that the final recall must go to one; once we have retrieved every document we've also retrieved every relevant document. The precision will be the ratio of the number of relevant documents to the total corpus size. Ordinarily, unless we are interested in very general queries, and/or very small sets of documents, this ratio will be very close to zero.

The other end of the curve, however, turns out to be much less stable. We would hope that a retrieval system's very first candidate for retrieval, that document with hitlist rank = 1, will be relevant but it may not. Figure (figure) shows a second pair of hypothetical data points (in \blue), corresponding to the case that a single irrelevant document is ranked higher than the relevant ones. This relatively small change in assessment creates a fairly dramatic effect on the curve, with real consequence once we need to juxtapose multiple queries' curves (see Section §4.3.7 , below). Such instability is an inevitable consequence of the definitions of $\mathname{Precision}$ and $\mathname{Recall}$: if the first retrieved document happens to be relevant its Re/Pre coordinates will be $<1,\frac{1}{\mathname{NRel}}>$, otherwise it will be $<0,0>$.

Figure (figure) puts this particular retrieval in the context of the best and worst retrievals we might imagine. The best possible retrieval would be to retrieve the five relevant documents first, and then all other documents. This would produce the upper, square Re/Pre curve. Alternatively, the worst possible retrieval would retrieve all but the relevant documents before returning these; this produces the lower line.

Top of Page | UP: INDIVIDUALS' assessment: search engine performance | ,FOA Home

FOA © R. K. Belew - 00-09-21