**FOA Home**
**UP:** Dimensionality Reduction

Imagine that we've collected data on the `HEIGHT` and `WEIGHT` of everyone in a classroom of $N$ students. If these are plotted, the result is something like Figure *(figure)*. Notice the correlation around an axis we might call something like `SIZE`. Students vary most along this dimension; it captures most of the information about their distribution. It is possible to capture a major source of variation across the `HEIGHT/WEIGHT` sample because, just as with our keywords, the two quantities are correlated.
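The `SIZE` axis is exactly the first principal component of the data. A minimal sketch, using synthetic (hypothetical) classroom numbers and a plain eigendecomposition of the covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic classroom data (illustrative numbers): height in cm, weight in kg.
# Both are driven by an underlying "size" factor, so they are correlated.
size = rng.normal(0.0, 1.0, 100)
height = 170 + 10 * size + rng.normal(0, 3, 100)
weight = 65 + 8 * size + rng.normal(0, 3, 100)
X = np.column_stack([height, weight])

# Principal components = eigenvectors of the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

# Fraction of total variance captured by the dominant ("SIZE") axis.
ratio = eigvals[-1] / eigvals.sum()
print(f"variance along first axis: {ratio:.2f}")
```

With these parameters the single `SIZE` axis accounts for the large majority of the variance, which is why projecting onto it loses little information.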

In this section we analyze similar statistical correlations among the keywords and documents contained in the much larger vector space model first mentioned in Section §3.4. Recall that in the vector space model, the $\mathrm{Index}$ relation places $D \equiv \mathrm{NDoc}$ vectors corresponding to the corpus documents within the space $\Re^{V}$, $V \equiv \mathrm{NKw}$ (for **VOCABULARY SIZE**) defined by its keyword vocabulary.

Here we describe this in the terms of linear algebra, where $J = \mathrm{Index}$ is a $D \times V$ element matrix. Within this section, we will use $V \equiv \mathrm{NKw}$ and $D \equiv \mathrm{NDoc}$. There are several problems with such high-dimensional spaces that we discuss here. Some of these involve the **CURSE OF DIMENSIONALITY**, which makes the computational expense of many important questions grow exponentially with the number of dimensions.

Attempts
to reduce this large dimensional space into something smaller are called
**DIMENSIONALITY REDUCTION** . There are two reasons we might be
interested in reducing dimensions. The first is probably most obvious:
it's a very unwieldy representation of documents' content. Individual
documents will have many, many zeros, corresponding to the many words in
the corpus $V$ not present in an individual document; the vector space
matrix is very **SPARSE** . Dimensionality reduction is a search for
a representation that is denser, more compressed.
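The sparsity is easy to see even on a toy corpus. Below is a sketch that builds a small $D \times V$ term-count matrix $J$ (documents and vocabulary are invented for illustration; a real corpus, with $V$ in the tens of thousands, is far sparser still):

```python
import numpy as np

# Toy corpus: D documents over a small vocabulary of V terms.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell on market news",
    "market stocks rose",
]
vocab = sorted({w for d in docs for w in d.split()})
D, V = len(docs), len(vocab)
col = {w: j for j, w in enumerate(vocab)}

# J[i, j] = number of times term j occurs in document i.
J = np.zeros((D, V))
for i, d in enumerate(docs):
    for w in d.split():
        J[i, col[w]] += 1

sparsity = (J == 0).mean()
print(f"{D} x {V} matrix, {sparsity:.0%} zeros")
```

Even with only a dozen vocabulary terms, most entries are zero; each document mentions only a small fraction of the corpus vocabulary.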

Another reason might be
to exploit what has become known as **LATENT SEMANTIC**
relationships among these keywords. When we make each term in our
vocabulary a dimension, we are effectively assuming they are
**ORTHOGONAL** to one another; we expect their effects to be
independent. But many features of FOA suggest that index terms are
highly dependent, highly correlated with one another. If that's the
case, we can exploit that correlation by capturing only those axes of
maximal variation and throwing away the rest.
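"Capturing only the axes of maximal variation" is what a truncated singular value decomposition of $J$ does. A minimal sketch on a toy matrix (values are illustrative, not from the text):

```python
import numpy as np

# Toy D x V Index matrix J: two docs about pets, two about markets.
J = np.array([
    [2., 1., 1., 0., 0.],
    [1., 2., 0., 0., 0.],
    [0., 0., 0., 2., 1.],
    [0., 0., 1., 1., 2.],
])

# SVD factors J into U (doc axes), s (axis strengths), Vt (term axes).
U, s, Vt = np.linalg.svd(J, full_matrices=False)

# Keep only the k strongest axes; discard the rest.
k = 2
J_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation

# Each document is now a dense k-dimensional vector instead of a
# sparse V-dimensional one.
doc_coords = U[:, :k] * s[:k]
print(doc_coords.shape)  # (4, 2)
```

Correlated terms (e.g. the two "market" columns) collapse onto a shared latent axis, so documents that never share a literal keyword can still end up close together in the reduced space.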
