
Linear discriminators

The retrieval problem is made simpler by noting that all we really need is a simple DISCRIMINATION THRESHOLD, above which the evidence for $\mathname{Rel}$ is sufficient that our retrieval system elects to show a user the document as part of a hitlist. We reflect this by making the $\mathname{Rank}$ function proportional to the odds of relevance given the document:

\mathname{Rank}({\bf x})\propto \prod_{x_{i}\in (D \cap Q)}{\frac{p_{i}(1-q_{i})}{q_{i}(1-p_{i})}}
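As a sketch of this product-of-odds-ratios ranking (the term names and the $p_{i}$, $q_{i}$ estimates below are hypothetical, chosen only for illustration):

```python
from functools import reduce

def odds_ratio_rank(doc_terms, query_terms, p, q):
    # Score proportional to the product, over terms shared by the document
    # and the query, of the per-term odds ratio p_i(1 - q_i) / (q_i(1 - p_i)).
    shared = doc_terms & query_terms
    return reduce(
        lambda acc, t: acc * (p[t] * (1 - q[t])) / (q[t] * (1 - p[t])),
        shared, 1.0)

# Hypothetical per-term estimates: p_i = P(x_i | Rel), q_i = P(x_i | not Rel)
p = {"fish": 0.8, "tank": 0.6}
q = {"fish": 0.1, "tank": 0.3}
score = odds_ratio_rank({"fish", "tank", "water"}, {"fish", "tank"}, p, q)
```

A document sharing no terms with the query scores 1.0 (the empty product), and each shared term whose $p_{i}$ exceeds $q_{i}$ multiplies the score by a factor greater than one.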

Assuming the features $x_{i}$ are stochastically independent is very helpful, but manipulating the \emph{product} of terms is still awkward. For example, we might reasonably want to \defn{regress} (train) the feature variables against known \Rel documents to compute weights for their relative contributions \cite{Cooper92,Gey94}. While nonlinear regression algorithms (e.g., neural networks) do exist, regression with respect to a linear combination of features is most straightforward.

LOGARITHMIC functions perform just this transformation (i.e., transforming products to sums), and are also guaranteed to be MONOTONIC. Monotonicity guarantees that if $a > b$ then $\log a > \log b$, so taking logarithms will not change the relative order of any two documents on our hitlist.

The two steps of

\item first comparing two probabilities in an $\mathname{Odds}$ ratio; and then

\item taking logarithms to form linear combinations rather than products

arise so frequently that the two operators are often composed and considered a single $\mathname{LogOdds}$ calculation.
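The composed operator can be sketched in a few lines:

```python
import math

def log_odds(prob):
    # LogOdds(p) = log(p / (1 - p)): first compare a probability with its
    # complement as an Odds ratio, then take the logarithm, so that
    # independent pieces of evidence combine by addition rather than
    # multiplication.
    return math.log(prob / (1.0 - prob))
```

Note that `log_odds(0.5)` is zero: probabilities above one half map to positive values and those below to negative, which is what makes the quantity convenient as a discrimination threshold.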

For our discrimination purposes, then, $\mathname{LogOdds}$ will suffice:

\mathname{LogOdds}(\mathname{Rel} | {\bf x})=\mathname{LogOdds}(\mathname{Rel}) + \sum_{x_{i}\in Q} \log \,\left({\frac{1-p_{i}}{1-q_{i}}}\right) + \sum_{x_{i}\in (D \cap Q)}\log \,\left({\frac{p_{i}(1-q_{i})}{q_{i}(1-p_{i})}}\right)

$\mathname{LogOdds}(\mathname{Rel})$ reflects the prior probability that we are likely to find a relevant document in our corpus, independent of any features of the query/document. This could vary, for example, between very general queries and very specific ones, but again it should not alter our ranking of documents pertaining to a particular query. Focusing exclusively on those factors that are useful in discriminating, with respect to a particular query, one document from another, Equation (FOAref) simplifies considerably, and we can come up with a $\mathname{Rank}$ measure:

\mathname{Rank}({\bf x})=\sum_{x_{i}\in (Q \cap D)} c_i + \kappa

where the weight $c_{i}$ associated with each feature:

c_{i}=\log \,{\frac{p_{i}({1-q_{i}})}{q_{i}({1-p_{i}})}}

is called the RELEVANCE WEIGHT (also known as the RETRIEVAL STATUS VALUE (RSV) ) of this feature, and $\kappa$ is the query-invariant constant:

\kappa =\mathname{LogOdds}(\mathname{Rel})+\sum_{x_{i}\in Q}\log \,{\frac{1-p_{i}}{1-q_{i}}}

corresponding to the first two document-insensitive terms of Equation (FOAref).
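A minimal sketch of this linear discriminator (term names and the $p_{i}$, $q_{i}$ estimates are hypothetical; the query-invariant constant $\kappa$ is dropped, since it cannot change the relative order of documents):

```python
import math

def relevance_weight(p_i, q_i):
    # c_i = log( p_i(1 - q_i) / (q_i(1 - p_i)) ): the relevance weight
    # (retrieval status value) of feature x_i.
    return math.log((p_i * (1.0 - q_i)) / (q_i * (1.0 - p_i)))

def rank(doc_terms, query_terms, p, q):
    # Linear discriminator: sum the relevance weights of the terms the
    # document shares with the query, omitting the constant kappa.
    return sum(relevance_weight(p[t], q[t]) for t in doc_terms & query_terms)

# Hypothetical per-term estimates for a two-term query
p = {"fish": 0.8, "tank": 0.6}
q = {"fish": 0.1, "tank": 0.3}
query = {"fish", "tank"}
d1 = {"fish", "tank", "water"}   # matches both query terms
d2 = {"tank", "filter"}          # matches only one
```

Because the sum of log weights is the log of the product of odds ratios, this sum orders `d1` and `d2` exactly as the earlier product form does.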


FOA © R. K. Belew - 00-09-21