**FOA Home**
**UP:** Modeling documents

Arguably the simplest model captures only the presence or absence of words in the document. That is, the document is modeled as a composition of keywords drawn from the vocabulary in so many independent Bernoulli trials: we imagine that a document $\mathbf{d}$ is constructed by repeatedly selecting a word for each of its $|\mathbf{d}|$ positions.
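This generative story can be sketched in a few lines of Python. The vocabulary and the word-selection probabilities below are illustrative assumptions, not parameters from the text; the point is only that each of the $|\mathbf{d}|$ positions is filled by an independent draw.

```python
import random

def generate_document(vocab, word_prob, length, seed=0):
    """Construct a document by filling each of `length` positions with
    an independent draw from the vocabulary, weighted by `word_prob`.
    Each position is its own trial, independent of all the others."""
    rng = random.Random(seed)
    weights = [word_prob[k] for k in vocab]
    return [rng.choices(vocab, weights=weights)[0] for _ in range(length)]

# Toy vocabulary and probabilities (purely illustrative).
vocab = ["bayes", "model", "word", "trial"]
word_prob = {"bayes": 0.4, "model": 0.3, "word": 0.2, "trial": 0.1}
doc = generate_document(vocab, word_prob, length=6)
```

Because every position is drawn independently, nothing about position $i$ constrains position $j$ — which is exactly the simplification examined next.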

A reasonable simplification is to assume that a word's position within the document does not affect its conditional probability:

$$
(\forall i,j) \qquad \Pr(k_{i} \mid c ; \Theta) \;=\; \Pr(k_{j} \mid c ; \Theta) \;\equiv\; \Pr(k \mid c ; \Theta)
$$

When we become interested in realistic document structures and writing conventions (e.g., abstract paragraphs, introductions and conclusions, **spiral expositions** of news stories (cf. Section §6.2), etc.), this assumption must be reconsidered.

If we associate a biased coin with each keyword $k$, we can decompose the desired model into two sets of parameters:

$$
\theta_{c} \;\equiv\; \Pr(c) \qquad\qquad \theta_{ck} \;\equiv\; \Pr(k \mid c)
$$

i.e., the prior probability of each class $c$, and the probability that keyword $k$ is present given that the document containing it is in class $c$. The ``naive Bayesian'' assumption then allows us to treat the keywords at each position as occurring independently of one another:

$$
\Pr(\mathbf{d} \mid c) \;=\; \prod_{i=1}^{|\mathbf{d}|} \theta_{c k_{i}}
$$

where $k_{i}$ is the keyword occupying position $i$ of $\mathbf{d}$.
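A minimal sketch of how these two parameter sets combine in practice: score each class by $\Pr(c)\prod_i \Pr(k_i \mid c)$ and pick the maximum. The two-class example and its parameter values are hypothetical, chosen only to exercise the formula; working in log space is a standard implementation choice to avoid floating-point underflow on long documents.

```python
import math

def log_score(doc, theta_c, theta_ck):
    """log [ Pr(c) * prod_i Pr(k_i | c) ] under the naive
    independence assumption; sums of logs avoid underflow."""
    score = math.log(theta_c)
    for k in doc:
        score += math.log(theta_ck[k])
    return score

def classify(doc, priors, cond):
    """Return the class c maximizing the naive Bayes score."""
    return max(priors, key=lambda c: log_score(doc, priors[c], cond[c]))

# Toy two-class model (illustrative parameters, not from the text).
priors = {"sports": 0.5, "politics": 0.5}
cond = {
    "sports":   {"ball": 0.6, "vote": 0.1, "game": 0.3},
    "politics": {"ball": 0.1, "vote": 0.6, "game": 0.3},
}
label = classify(["ball", "game", "ball"], priors, cond)  # -> "sports"
```

Note that the score depends only on which keywords occur (and how often), not on where they occur — the position-independence assumption made explicit above.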
