CSE 254 LECTURE NOTES
April 3, 2007
ANNOUNCEMENTS
Handouts today: Introductory announcement, and tutorial notes by
Klein and Manning. These notes focus on NLP applications, but they will not be our focus in 254.
JOINT VERSUS CONDITIONAL MODELS
Let x designate a training or test example, and let y be its label.
A joint model is a probability distribution p(x,y). A
conditional model is a distribution p(y|x).
In order to classify a test example, all we really need to know is the
conditional model. The joint model is more ambitious, since
p(x,y) = p(x)p(y|x). The claim is that we should not bother
learning a model for p(x).
Suppose we decompose p(x,y) in some other way: either p(x,y) =
p(y)p(x|y) or p(x,y) = f(x,y) for some functional form f(x,y).
Whatever the decomposition chosen, if we train by maximizing
likelihood, we are trading off fitting the observed p(x) well with
fitting the observed p(y|x) well. The claim is that we should
focus on the latter.
Empirically, conditional training does work better. Example: let
x be
a word in a sentence, represented by many many features. Let y be
the word-sense-disambiguation of x. With exactly the same
features, maximizing joint likelihood during training leads to
86.8/73.6% training/test accuracy.
Maximizing conditional likelihood leads to 98.5/76.1% accuracy.
(But notice that the improvement on test data is much smaller
than on training data.)
NAIVE BAYES VERSUS LOGISTIC REGRESSION
Let's look at a specific example. A naive Bayes (NB)
classifier uses the decomposition p(x,y) = p(y)p(x|y) and then the
simplification p(x|y) = PROD_i p(x_i | y). The graphical model
has arrows pointing away from the label variable y.
The graphical model for logistic regression (LR) has the arrows
pointing from the features x_i to the label y. In general, the
conditional probability table (or conditional distribution) would be
very complex because the in-degree of y is high. The basic
simplification in LR is the use of a linear model here. There is
no free lunch: learning a classifier from limited training data
requires assumptions and/or simplifications of some sort.
Note the pictures are slightly misleading. The NB model does have
an unconditional distribution for y, its single node with indegree 0.
The LR model does not have any distribution for its in-degree-0
nodes.
WHAT ARE FEATURES?
With conditional models of the sort we will investigate (i.e.
log-linear models), anything and the kitchen sink can be a feature.
We can have lots of classes, lots of features, and we can pay
attention to different features for different classes.
So, in general, a feature is a real-valued function of both the data
space X and the label space Y; formally a feature is a function f: X x
Y -> R.
Almost all features are presence/absence indicator functions so the
real value is either 0 or 1. We often sum over a training set and
talk about the value of a feature on the whole training set.
If we have a conventional attribute a(x) with k alternative values, and
n classes, we can make k*n different features as defined above.
Because features are just functions of x and y, they can overlap in
arbitrary ways. E.g. if x is a word we can have features like
"starts with a capital letter", "starts with "G", is "Graham", "is six
letters long." Generally we can code suffixes, prefixes, facts
from a lexicon, preceding/following punctuation, etc., as features.
SOFTMAX PROBABILITY MODELING
Log-linear models are actually very simple: there is one real-valued
weight for each feature, no more no less. Let the features be
subscripted i and let the weights be w_i.
Define p(y | x, w) = exp SUM_i w_i f_i(x,y) / SUM_y' exp SUM_i w_i f_i(x,y)
Why this functional form? Several aspects to the answer:
(1) The linear combination SUM_i w_i f_i(x,y) can be any
real value. The exponential makes it positive, like a valid
probability.
(2) The division makes the results between 0 and 1, i.e. a valid probability.
(3) The ranking of the probabilities will be the same as the ranking of the linear values.
This function is called a softmax because the exponentials make the
biggest linear value dominate all others, like a standard "max"
function does.
With this functional form for probabilities, we can choose parameter
values w_i to maximize the conditional probability of the training
examples.
Question: Why use the softmax? Answer: The softmax is
certainly not the only possible choice. In neural networks people
have used other similar logistic-shaped
functions like tanh, and in statistics people have used the "probit"
function instead, for example. Softmax is the most widely used
now. One reason is that its derivative is
very simple; see below. References:
http://en.wikipedia.org/wiki/Probit#Related_Topics
http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-12.html