CSE 254 LECTURE NOTES

April 3, 2007


ANNOUNCEMENTS

Handouts today: Introductory announcement, and tutorial notes by Klein and Manning.  These notes focus on NLP applications, but they will not be our focus in 254.


JOINT VERSUS CONDITIONAL MODELS

Let x designate a training or test example, and let y be its label.  A joint model is a probability distribution p(x,y).  A conditional model is a distribution p(y|x).  

In order to classify a test example, all we really need to know is the conditional model.  The joint model is more ambitious, since p(x,y) = p(x)p(y|x).  The claim is that we should not bother learning a model for p(x).

Suppose we decompose p(x,y) in some other way: either p(x,y) = p(y)p(x|y)  or p(x,y) = f(x,y) for some functional form f(x,y).  Whatever the decomposition chosen, if we train by maximizing likelihood, we are trading off fitting the observed p(x) well with fitting the observed p(y|x) well.  The claim is that we should focus on the latter.

Empirically, conditional training does work better.  Example: let x be a word in a sentence, represented by many many features.  Let y be the word-sense-disambiguation of x.  With exactly the same features, maximizing joint likelihood during training leads to 86.8/73.6% training/test accuracy.  Maximizing conditional likelihood leads to 98.5/76.1% accuracy.  (But notice that the improvement on test data is much smaller than on training data.)


NAIVE BAYES VERSUS LOGISTIC REGRESSION

Let's look at a specific example.  A naive Bayes (NB) classifier uses the decomposition p(x,y) = p(y)p(x|y) and then the simplification p(x|y) = PROD_i p(x_i | y).  The graphical model has arrows pointing away from the label variable y.

The graphical model for logistic regression (LR) has the arrows pointing from the features x_i to the label y.  In general, the conditional probability table (or conditional distribution) would be very complex because the in-degree of y is high.  The basic simplification in LR is the use of a linear model here.  There is no free lunch: learning a classifier from limited training data requires assumptions and/or simplifications of some sort.

Note the pictures are slightly misleading.  The NB model does have an unconditional distribution for y, its single node with indegree 0.  The LR model does not have any distribution for its in-degree-0 nodes.


WHAT ARE FEATURES?

With conditional models of the sort we will investigate (i.e. log-linear models), anything and the kitchen sink can be a feature.  We can have lots of classes, lots of features, and we can pay attention to different features for different classes.

So, in general, a feature is a real-valued function of both the data space X and the label space Y; formally a feature is a function f: X x Y -> R.

Almost all features are presence/absence indicator functions so the real value is either 0 or 1.  We often sum over a training set and talk about the value of a feature on the whole training set.

If we have a conventional attribute a(x) with k alternative values, and n classes, we can make k*n different features as defined above. 

Because features are just functions of x and y, they can overlap in arbitrary ways.  E.g. if x is a word we can have features like "starts with a capital letter", "starts with "G", is "Graham", "is six letters long."  Generally we can code suffixes, prefixes, facts from a lexicon, preceding/following punctuation, etc., as features.


SOFTMAX PROBABILITY MODELING

Log-linear models are actually very simple: there is one real-valued weight for each feature, no more no less.  Let the features be subscripted i and let the weights be w_i.

Define  p(y | x, w)  =  exp SUM_i w_i f_i(x,y)  /   SUM_y' exp SUM_i w_i f_i(x,y)  

Why this functional form?  Several aspects to the answer:
(1) The linear combination  SUM_i w_i f_i(x,y)  can be any real value.  The exponential makes it positive, like a valid probability.
(2) The division makes the results between 0 and 1, i.e. a valid probability.
(3) The ranking of the probabilities will be the same as the ranking of the linear values.

This function is called a softmax because the exponentials make the biggest linear value dominate all others, like a standard "max" function does.

With this functional form for probabilities, we can choose parameter values w_i to maximize the conditional probability of the training examples.

Question: Why use the softmax?  Answer:  The softmax is certainly not the only possible choice.  In neural networks people have used other similar logistic-shaped
functions like tanh, and in statistics people have used the "probit" function instead, for example.  Softmax is the most widely used now.  One reason is that its derivative is
very simple; see below.  References:
        http://en.wikipedia.org/wiki/Probit#Related_Topics
        http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-12.html