**FOA Home**
**UP:** Probabilistic retrieval

Perhaps the simplest model proceeds by imagining binary, independent
features; it is conventionally called (surprise!) the **BINARY
INDEPENDENCE MODEL** (BIM) [Robertson76, vanR77]. First, the binary assumption is
that all the features $x_{i}$ are binary. This is not a very restrictive
assumption and is used only to simplify the derivation.

The much bigger
assumption is that the documents' features occur independently of one
another. We have discussed the problems with such an assumption before;
J. H. Williams expressed the paradox this way:

> The assumption of independence of words in a document is usually made
> as a matter of mathematical convenience. Without the assumption, many
> of the subsequent mathematical relations could not be expressed. With
> it, many of the conclusions should be accepted with extreme caution.
> [Williams65]

The key advantage the assumption allows is that the probability of a
feature vector ${\bf x}$ becomes simply the *product* of the
**MARGINAL PROBABILITIES** of the individual features:

$$\Pr({\bf x} \mid \mathrm{Rel}) = \prod_{i} \Pr(x_i \mid \mathrm{Rel})$$

Very convenient, and very unrealistic. {Cooper has noted that we don't
need quite such a restrictive notion of independence [Cooper94]. All
that is required is that the *ratios* of ``linked'' ir/relevant feature
probabilities

$$\frac{\Pr(x_i \mid \mathrm{Rel})}{\Pr(x_i \mid \overline{\mathrm{Rel}})}$$

are independent of one another. While slightly less restrictive, this
assumption seems no more realistic.}
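To make the independence assumption concrete, here is a minimal sketch (with invented marginal probabilities) of how the joint probability of a binary feature vector reduces to a product of per-feature marginals:

```python
from math import prod

def pr_vector_given_rel(x, p):
    """Pr(x | Rel) under the independence assumption: a product of
    marginals, using p_i when the feature is present and 1 - p_i
    when it is absent."""
    return prod(pi if xi == 1 else 1 - pi for xi, pi in zip(x, p))

# Hypothetical marginals Pr(x_i = 1 | Rel) for three features,
# invented purely for illustration.
p = [0.8, 0.3, 0.5]

# Pr([1,0,1] | Rel) = 0.8 * (1 - 0.3) * 0.5
joint = pr_vector_given_rel([1, 0, 1], p)  # ~0.28
```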

Applying this decomposition to our odds calculation gives:

$$\mathrm{Odds}(\mathrm{Rel} \mid {\bf x}) = \mathrm{Odds}(\mathrm{Rel}) \cdot \prod_i \frac{\Pr(x_i \mid \mathrm{Rel})}{\Pr(x_i \mid \overline{\mathrm{Rel}})}$$

It will be convenient to introduce the variables $p_i$ and $q_i$ to
capture the probabilities that feature $x_i$ is present, given that a
document is/not relevant:

$$p_i \equiv \Pr(x_i=1 \mid \mathrm{Rel}) \qquad q_i \equiv \Pr(x_i=1 \mid \overline{\mathrm{Rel}})$$

The complementary probabilities, concerning documents in which the
feature is *absent*, can also be defined easily:

$$1-p_i = \Pr(x_i=0 \mid \mathrm{Rel}) \qquad 1-q_i = \Pr(x_i=0 \mid \overline{\mathrm{Rel}})$$

These definitions break the product into two portions, the first having to do with those features that are present in a particular document and the second with those that are not:

$$\mathrm{Odds}(\mathrm{Rel} \mid {\bf x}) = \mathrm{Odds}(\mathrm{Rel}) \cdot \prod_{x_i=1} \frac{p_i}{q_i} \cdot \prod_{x_i=0} \frac{1-p_i}{1-q_i}$$
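As a sketch of this decomposed calculation (the feature probabilities and prior odds below are invented for illustration):

```python
from math import prod

def odds_rel_given_x(x, p, q, prior_odds):
    """Decomposed odds: prior odds, times p_i/q_i over features present
    in the document, times (1-p_i)/(1-q_i) over features absent."""
    present = prod(pi / qi for xi, pi, qi in zip(x, p, q) if xi == 1)
    absent = prod((1 - pi) / (1 - qi) for xi, pi, qi in zip(x, p, q) if xi == 0)
    return prior_odds * present * absent

p = [0.8, 0.3, 0.5]  # assumed Pr(x_i = 1 | Rel)
q = [0.2, 0.3, 0.1]  # assumed Pr(x_i = 1 | not Rel)

# 0.25 * (0.8/0.2) * (0.5/0.1) * (0.7/0.7) = 5.0
odds = odds_rel_given_x([1, 0, 1], p, q, prior_odds=0.25)
```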

Recall that both queries
and documents live within the same vector space defined over the
features $x_{i}$. The *two* products of Equation
*(FOAref)* (defined in terms of presence/absence of a feature in
a document) can be further broken into *four* subcases depending
on whether the features do or do not occur in the query. We next make
another ``background'' assumption concerning all of those features
$x_i$ that do *not* appear in the query, viz., that the probability of
these features being present in relevant and irrelevant documents is
equal: $p_i = q_i$. That is, for those terms we don't care about
(because they don't affect this query/document comparison) we are happy
to think that the occurrence of these terms is independent of their
ir/relevance.

Consider the sets $D$ and $Q$ shown in Figure *(figure)*, defined in
terms of those features $x_i$ present and absent in the document and
query, respectively. {We use $Q$ to name the set of query features and
$q_i$ for the probability of the presence of a feature in irrelevant
documents, but there is no intended, direct connection between these
two quantities.} Regrouping the two products of Equation *(FOAref)*
into four products created by the two sets $D$ and $Q$, the factors all
cancel except in the intersection of the query and document (where the
feature $x_i$ is present in both, contributing $\frac{p_i}{q_i}$) and
in $Q \backslash D$, the set difference of $Q$ take-away $D$
(contributing $\frac{1-p_i}{1-q_i}$):

$$\mathrm{Odds}(\mathrm{Rel} \mid {\bf x}) = \mathrm{Odds}(\mathrm{Rel}) \cdot \prod_{x_i \in Q \cap D} \frac{p_i}{q_i} \cdot \prod_{x_i \in Q \backslash D} \frac{1-p_i}{1-q_i}$$

In the retrieval situation we will exploit the sparseness that makes it much more efficient to keep track of where a feature does occur ($x_i=1$) than all the places it does not ($x_i=0$). Since the second product is defined over all the features of $Q$ except those in $D$, if we are careful to ``pre-multiply'' each feature in their intersection by a reciprocal, we can then safely multiply everything in the query by the same ratio:

$$\mathrm{Odds}(\mathrm{Rel} \mid {\bf x}) = \mathrm{Odds}(\mathrm{Rel}) \cdot \prod_{x_i \in Q} \frac{1-p_i}{1-q_i} \cdot \prod_{x_i \in Q \cap D} \frac{p_i (1-q_i)}{q_i (1-p_i)}$$
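Since only the intersection product varies from document to document, a ranking engine can score each document by summing, over just the terms shared by the query and that document, the logarithm of the per-term ratio $\frac{p_i(1-q_i)}{q_i(1-p_i)}$. A minimal sketch, with invented terms and probability estimates:

```python
from math import log

def term_weight(p_i, q_i):
    """Log of the per-term ratio p_i(1-q_i) / (q_i(1-p_i))."""
    return log(p_i * (1 - q_i) / (q_i * (1 - p_i)))

def rank_score(query_terms, doc_terms, p, q):
    """Document-dependent part of the log-odds: summed over only
    those terms present in both query and document."""
    return sum(term_weight(p[t], q[t]) for t in query_terms & doc_terms)

# Invented per-term probability estimates.
p = {"cat": 0.8, "dog": 0.5}
q = {"cat": 0.2, "dog": 0.1}

# Only "cat" is in both query and document, so the score is
# log(0.8 * 0.8 / (0.2 * 0.2)) = log(16).
score = rank_score({"cat", "dog"}, {"cat", "fish"}, p, q)
```

Because the score is a sum over the query/document intersection only, it can be computed from an inverted index without ever touching the features a document lacks.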

The next section will show the utility of separating the last term, which depends on features of the document in question, from the first two, which do not, as part of an online retrieval calculation.

But first, it is worthwhile considering how we might attempt to estimate some of the required statistics. Fuhr [Fuhr92], for example, considers the case when we have RelFbk from a user who has evaluated each of the top $N$ documents in an initial retrieval and has found $R$ of these to be relevant (and hence found the remaining $N-R$ to be irrelevant!). If a particular feature $x_i$ is present in $n$ of the retrieved documents, with $r$ of these relevant, then this bit of RelFbk provides reasonable estimates for $p_i$ and $q_i$:

$$\widetilde{p_i} = \frac{r}{R} \qquad \widetilde{q_i} = \frac{n-r}{N-R}$$
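These estimates follow directly from the four feedback counts; a small sketch with invented numbers:

```python
def estimate_p_q(N, R, n, r):
    """Point estimates from relevance feedback: of N judged documents,
    R are relevant; the feature occurs in n of them, r of those relevant."""
    p_hat = r / R              # fraction of relevant docs containing the feature
    q_hat = (n - r) / (N - R)  # fraction of irrelevant docs containing it
    return p_hat, q_hat

# E.g., 20 judged docs, 5 relevant; feature in 8 docs, 4 of them relevant.
p_hat, q_hat = estimate_p_q(N=20, R=5, n=8, r=4)  # p_hat = 0.8, q_hat = 4/15
```

Note that zero counts make these point estimates degenerate (e.g., $r=0$ forces $\widetilde{p_i}=0$), which is why smoothed variants that add a small constant such as 0.5 to each count are commonly used in practice.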
