**FOA Home**
**UP:** Probabilistic retrieval

The retrieval problem is made simpler by noting that all we really need
is a simple **DISCRIMINATION THRESHOLD**, above which the evidence
for $\mathrm{Rel}$ is sufficient that our retrieval system elects to
show a user the document as part of a hitlist. We reflect this by making
the $\mathrm{Rank}$ function proportional to the odds of relevance
given the document:

$$\mathrm{Rank}(\mathbf{x})\propto \prod_{x_{i}\in (D \cap Q)}\frac{p_{i}(1-q_{i})}{q_{i}(1-p_{i})}$$
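As a concrete sketch of this scoring rule (the function name and the toy probability estimates below are ours, purely for illustration), the product of per-feature odds ratios over the features shared by document and query might be computed as:

```python
def rank_product(doc_terms, query_terms, p, q):
    """Score proportional to the product, over features shared by the
    document and the query, of the per-feature odds ratio
    p_i(1-q_i) / (q_i(1-p_i)).  p[i] and q[i] are the estimated
    probabilities of feature x_i in relevant and irrelevant documents."""
    score = 1.0
    for i in doc_terms & query_terms:
        score *= (p[i] * (1 - q[i])) / (q[i] * (1 - p[i]))
    return score

# toy estimates: "fee" is far more common in relevant documents
p = {"fee": 0.8, "fie": 0.5}
q = {"fee": 0.2, "fie": 0.5}
score = rank_product({"fee", "fie", "foe"}, {"fee", "fie"}, p, q)
```

Here "fee" contributes an odds ratio of $16$, while "fie", equally likely under either hypothesis, contributes a neutral factor of $1$.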

Assuming the features $x_{i}$ are stochastically independent is very helpful, but manipulating the *product* of terms is still awkward. For example, we might reasonably want to **REGRESS** (train) the feature variables against known $\mathrm{Rel}$ documents to compute weights for their relative contributions [Cooper92, Gey94]. While nonlinear regression algorithms (e.g., neural networks) do exist, regression with respect to a linear combination of features is most straightforward.

**LOGARITHMIC**
functions perform just this transformation (i.e., transforming products
to sums), and are also guaranteed to be **MONOTONIC**. Monotonicity
guarantees that if $a < b$ then $\log a < \log b$, so taking logarithms
will not change the **relative order** of any two documents on our hitlist.
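A quick numeric check of this monotonicity claim, using a few made-up (and, so that the logarithm is defined, positive) odds-based scores:

```python
import math

scores = [3.5, 0.2, 12.0, 1.0]   # hypothetical odds-based Rank scores
by_score = sorted(range(len(scores)), key=lambda i: scores[i])
by_log   = sorted(range(len(scores)), key=lambda i: math.log(scores[i]))
same_order = (by_score == by_log)  # log preserves the hitlist order
```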

The two steps of

1. first comparing two probabilities in an $\mathrm{Odds}$ ratio; and then

2. taking logarithms to form linear combinations rather than products

arise so frequently that the two operators are often composed and considered a single $\mathrm{LogOdds}$ calculation.
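A minimal sketch of that composed operator (the function name is ours, not the book's):

```python
import math

def log_odds(pr):
    """Compose the two steps: form the Odds ratio pr/(1-pr), then take its log."""
    return math.log(pr / (1.0 - pr))
```

`log_odds(0.5)` is `0.0`: even odds. Probabilities above one half map to positive values, below one half to negative ones.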

For our discrimination purposes, then, $\mathrm{LogOdds}$ will suffice:

$$\mathrm{LogOdds}(\mathrm{Rel} \mid \mathbf{x})=\mathrm{LogOdds}(\mathrm{Rel}) + \sum_{x_{i}\in Q} \log \left(\frac{1-p_{i}}{1-q_{i}}\right) + \sum_{x_{i}\in (D \cap Q)}\log \left(\frac{p_{i}(1-q_{i})}{q_{i}(1-p_{i})}\right)$$
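The rearrangement behind this equation, splitting the query terms into a document-independent sum plus a sum over the terms shared with the document, can be checked numerically; the feature probabilities below are made up for illustration:

```python
import math

p = {"fee": 0.8, "fie": 0.5, "foe": 0.3}   # Pr(x_i | Rel), toy values
q = {"fee": 0.2, "fie": 0.5, "foe": 0.6}   # Pr(x_i | not Rel), toy values
query = {"fee", "fie", "foe"}
doc   = {"fee", "foe"}                      # "fie" is absent from the document

# direct per-term log likelihood ratios: p/q when present, (1-p)/(1-q) when absent
direct = sum(math.log(p[i] / q[i]) if i in doc
             else math.log((1 - p[i]) / (1 - q[i])) for i in query)

# the equation's form: document-independent sum over Q, plus the D-and-Q sum
split = (sum(math.log((1 - p[i]) / (1 - q[i])) for i in query) +
         sum(math.log((p[i] * (1 - q[i])) / (q[i] * (1 - p[i]))) for i in query & doc))
```

Both routes give the same value; the prior term $\mathrm{LogOdds}(\mathrm{Rel})$ appears identically on both sides and so is omitted here.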

$\mathrm{LogOdds}(\mathrm{Rel})$ reflects the *prior* probability that we are
likely to find a relevant document in our corpus, independent of any
features of the query/document. This could vary, for example, between
very general queries versus very specific ones, but again it should not alter our
ranking of documents pertaining to a particular query. Focusing
exclusively on those factors that will be useful in discriminating, with
respect to a particular query, one document from another, Equation
*(FOAref)* simplifies considerably, and we can come up with a
$\mathrm{Rank}$ measure:

$$\mathrm{Rank}(\mathbf{x})=\sum_{x_{i}\in (Q \cap D)} c_{i} + \kappa$$

where the weight

$$c_{i}=\log \frac{p_{i}(1-q_{i})}{q_{i}(1-p_{i})}$$

associated with each feature is
called the **RELEVANCE WEIGHT** (also known as **RETRIEVAL STATUS
VALUE (RSV)**) of this feature, and $\kappa$ is the query-invariant
constant:

$$\kappa =\mathrm{LogOdds}(\mathrm{Rel})+\sum_{x_{i}\in Q}\log \frac{1-p_{i}}{1-q_{i}}$$

corresponding to the first two document-insensitive terms of Equation *(FOAref)*.
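Putting the pieces together, a sketch of the final $\mathrm{Rank}$ measure (the names are ours; $\kappa$ is dropped because, being constant across documents for a fixed query, it cannot change the ordering):

```python
import math

def relevance_weight(p_i, q_i):
    """c_i = log( p_i (1 - q_i) / ( q_i (1 - p_i) ) )"""
    return math.log((p_i * (1 - q_i)) / (q_i * (1 - p_i)))

def rank(doc_terms, query_terms, p, q):
    # kappa is query-invariant, so it is omitted from the ranking sum
    return sum(relevance_weight(p[i], q[i])
               for i in doc_terms & query_terms)
```

A feature with $p_i = 0.8$ and $q_i = 0.2$ carries weight $\log 16$, so a document containing it outranks one sharing no query terms at all.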
