CSE 254 LECTURE NOTES
April 12, 2007
Handout today: The CRF tutorial by Sutton and McCallum.
We have a function from R^n to R that we want to maximize. We'll
use a global method: the entire x vector will be updated at once.
(An alternative is stochastic gradient descent, which updates based on one training example at a time.) The method
will be iterative: we will find x_1, x_2, etc. such that f(x_i+1) >=
f(x_i) for each iteration i.
The method will use a "line search": x_i+1 = argmax_t f(x_i +
t*s_i) for fixed x_i and s_i. The direction vector s_i
will be a function of the gradient at x_i. Let's assume we
know how to do line search: this is a one-dimensional problem, so it is not difficult.
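The scheme above can be sketched in a few lines of NumPy. This is a minimal illustration, not a serious optimizer: the exact line search is replaced by a grid search over step sizes, and the quadratic test function is just for demonstration.

```python
import numpy as np

def line_search_ascent(f, grad, x0, iters=50):
    """Iterative ascent: at each step the search direction s_i is the
    gradient at x_i, and the step length t is chosen by a crude 1-D
    line search over a grid of candidate values."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        s = grad(x)                      # direction from the gradient
        if np.linalg.norm(s) < 1e-10:    # (near-)stationary point reached
            break
        ts = np.logspace(-4, 1, 40)      # candidate step sizes
        vals = [f(x + t * s) for t in ts]
        x = x + ts[int(np.argmax(vals))] * s
    return x

# Maximize f(x) = -(x1-1)^2 - (x2+2)^2; the maximum is at (1, -2).
f = lambda x: -(x[0] - 1) ** 2 - (x[1] + 2) ** 2
grad = lambda x: np.array([-2 * (x[0] - 1), -2 * (x[1] + 2)])
x_star = line_search_ascent(f, grad, [0.0, 0.0])
```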
[ Adapted from Wikipedia. ] Provided that f(x) is a twice-differentiable function and the initial guess x_0 is chosen close enough to x*, the sequence (x_n) defined by

  x_n+1 = x_n - f'(x_n) / f''(x_n)

will converge towards x*.
This iterative scheme can be generalized to several dimensions by replacing the derivative with the gradient, grad f(x), and the reciprocal of the second derivative with the inverse of the Hessian matrix, H(x)^-1. One obtains the iterative scheme

  x_n+1 = x_n - H(x_n)^-1 grad f(x_n).
Usually Newton's method is modified to include a small step size γ > 0 instead of γ = 1:

  x_n+1 = x_n - γ H(x_n)^-1 grad f(x_n).
The geometric interpretation of Newton's method is that at each iteration one approximates f(x) by a quadratic function around x_n, and then takes a step towards the maximum/minimum of that quadratic function. (If f happens to be a quadratic function, then the exact extremum is found in one step.)
Newton's method converges much faster towards a local maximum or
minimum than gradient descent. In fact, every local minimum has a
neighborhood N such that, if we start in N, Newton's method with step size γ = 1 converges quadratically (provided the Hessian is invertible in that neighborhood).
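The one-dimensional update above is easy to sketch. The test function log(x) - x is illustrative: its derivative is 1/x - 1 and its second derivative is -1/x^2, so the unique maximum is at x = 1.

```python
def newton_max_1d(fprime, fsecond, x0, iters=20):
    """1-D Newton iteration x_{n+1} = x_n - f'(x_n) / f''(x_n); near a
    maximum (where f'' < 0) the error roughly squares at each step,
    i.e. convergence is quadratic."""
    x = x0
    for _ in range(iters):
        x = x - fprime(x) / fsecond(x)
    return x

# Maximize f(x) = log(x) - x on x > 0: the maximum is at x = 1.
x_star = newton_max_1d(lambda x: 1 / x - 1,
                       lambda x: -1 / x ** 2,
                       x0=0.5)
```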
Finding the inverse of the Hessian is an expensive operation, so the linear system

  H(x_n) s_n = -grad f(x_n)

for the step s_n is often solved approximately (but to great accuracy) using a method
such as conjugate gradient. There also exist various quasi-Newton
methods, where an approximation for the Hessian is used instead.
BFGS is a quasi-Newton method. It uses the
previously computed gradients to compute an estimate of the inverse of the Hessian
matrix, then uses this approximate matrix to compute the new search direction.
The limited-memory BFGS method uses just a few most recent
gradients to define a low-rank approximation of the Hessian that does
not need to be computed or stored explicitly.
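In practice one rarely implements L-BFGS by hand. Assuming SciPy is available, its `scipy.optimize.minimize` with `method="L-BFGS-B"` provides a standard implementation; since it minimizes rather than maximizes, we negate the objective and gradient. The quadratic test objective here is illustrative.

```python
import numpy as np
from scipy.optimize import minimize

# Objective to MAXIMIZE: f(w) = -||w - c||^2, maximized at w = c.
c = np.array([3.0, -1.0, 2.0])
neg_f = lambda w: np.sum((w - c) ** 2)    # minimize the negation
neg_grad = lambda w: 2 * (w - c)          # its gradient

res = minimize(neg_f, x0=np.zeros(3), jac=neg_grad, method="L-BFGS-B")
```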
Impose a diagonal Gaussian prior on the parameters with mean zero and
covariance s^2 I. This is equivalent to adding to the conditional log-likelihood (CLL) objective function the term

  - SUM_i w_i^2 / (2 s^2)

where the sum is over all weights w_i.
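Folding this penalty into an objective/gradient pair is mechanical: the penalty contributes -w_i/s^2 to the gradient for each weight. In the sketch below, `cll` and `grad_cll` are placeholders for whatever computes the conditional log-likelihood and its gradient; the toy linear CLL used in the check is illustrative.

```python
import numpy as np

def penalized_objective(cll, grad_cll, w, s2):
    """CLL plus the Gaussian-prior penalty -SUM_i w_i^2 / (2 s^2),
    with the matching gradient contribution -w_i / s^2."""
    obj = cll(w) - np.sum(w ** 2) / (2 * s2)
    grad = grad_cll(w) - w / s2
    return obj, grad

# Toy check: with CLL(w) = w.v, the penalized gradient v - w/s^2
# vanishes exactly at w = s^2 * v.
v = np.array([1.0, -2.0])
s2 = 0.5
obj, g = penalized_objective(lambda w: w @ v, lambda w: v, s2 * v, s2)
```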
There is no useful theory to suggest a good choice for s^2, but
results are typically insensitive to changes in s^2 by up to a factor of 10.
Use cross-validation to find the value of s^2 that seems to lead
to the best test-set performance.
An advantage of Gaussian regularization is that it makes the objective
function strictly convex, which avoids optimization difficulties.
Note that Gaussian regularization is different from the
regularization used in the discriminant NB paper referred to on
Assignment 1. It would be interesting to investigate
experimentally which is preferable.
To promote sparsity, i.e. many weights being exactly zero, we
can use the absolute value of w_i instead of its square. This
also has a probabilistic interpretation: it corresponds to a Laplace
(double-exponential) prior on each weight.
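To see why the absolute-value penalty produces exact zeros, consider the simplest one-dimensional case: minimizing (1/2)(w - a)^2 + lam*|w|. The solution is the well-known soft-thresholding operator, sketched below; the particular inputs are illustrative.

```python
def soft_threshold(a, lam):
    """argmin_w (1/2)(w - a)^2 + lam * |w|: the L1 penalty shrinks a
    toward zero and clips it to EXACTLY zero once |a| <= lam.
    (An L2 penalty would only shrink a, never zero it out.)"""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0

# Inputs with magnitude below the penalty strength become exactly zero.
ws = [soft_threshold(a, 0.3) for a in (0.1, -0.2, 0.5, -1.0)]
```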
THE TASK OF PART-OF-SPEECH TAGGING
Finally, we can start talking about CRFs. We'll extend log-linear
models to be CRFs. First, let's think about an example of a
learning task for which CRFs are useful.
Given a sentence, the task is to label each word as noun, verb,
adjective, preposition, etc. There is a fixed known set of these
part-of-speech (POS) labels. Each sentence is a separate training/test example. We will
represent a sentence by features based on its words. Features
will be very general:
The highest-accuracy POS taggers use 100,000s of features.
- Some features will be position-specific while others will be sums over the length of the sentence.
- Some features will look just at a word:
"starts with a capital letter," "starts with 'G'," "is 'Graham',"
"is six letters long." Generally we can code suffixes, prefixes, facts
from a lexicon, preceding/following punctuation, etc., as features.
- Features can also use the words one to the left, one to the
right, two to the left etc., and the whole sentence.
- Restriction: Features can only use neighboring tags.
This is an example of a "structured prediction" task. We want to
predict a complex label (a POS sequence) for a complex input (an entire
sentence).
Why is this task difficult, i.e. why is it not a standard classifier learning task?
(1) We lose too much information by learning a per-word classifier.
(2) Different sentences have different lengths, so how can we have a common vector of attributes for each sentence?
(3) The set of all possible sequences of labels is intractably large.
Linear conditional random fields (CRFs) are a way to extend
log-linear models to this type of task. Use vector notation for
sequences, so boldface x or xbar means a sequence of
variable length. Let xbar be a sequence of n words and let
ybar be a corresponding sequence of tags.
Define the log-linear model:
p(ybar | xbar; w) = [ 1/Z(xbar) ] exp [ SUM_j wj Fj(xbar,ybar) ].
Assume that each feature Fj is actually a sum along the sentence, for i=1 to i=n:
Fj(xbar, ybar) = SUM_i fj(y_i-1, y_i, xbar, i).
This notation means that each feature fj can depend on the whole
sentence, the current label and the previous label, and the current
position i within the sentence. Note that a feature can depend
on only a subset of these four possible influences. Examples of such features:
"the current label is NOUN and the current word is capitalized"
"the word at the start of the sentence is Mr."
"the previous label was SALUTATION".
All the fj features share the same weight wj. Summing each fj
over i means that we can have a fixed set of features for log-linear
training even though the training examples are not fixed-length.
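The definitions above can be made concrete with a small sketch. The feature functions below are hypothetical examples in the spirit of those listed earlier, and the START pseudo-label for the predecessor of y_1 is an assumption about boundary handling that the notes leave unspecified.

```python
# Hypothetical feature functions f_j(y_prev, y, xbar, i).
def f_cap_noun(y_prev, y, xbar, i):
    # "the current label is NOUN and the current word is capitalized"
    return 1.0 if y == "NOUN" and xbar[i][0].isupper() else 0.0

def f_after_salutation(y_prev, y, xbar, i):
    # "the previous label was SALUTATION"
    return 1.0 if y_prev == "SALUTATION" else 0.0

def F(fj, xbar, ybar):
    """F_j(xbar, ybar) = SUM_i f_j(y_{i-1}, y_i, xbar, i); a START
    pseudo-label stands in for the predecessor of the first tag."""
    labels = ["START"] + list(ybar)
    return sum(fj(labels[i], labels[i + 1], xbar, i)
               for i in range(len(xbar)))

xbar = ["Mr.", "Graham", "smiled"]
ybar = ["SALUTATION", "NOUN", "VERB"]
```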
THREE PROBLEMS TO SOLVE
(1) In words: How can we compute the maximum-probability
sequence of labels for a test sentence? Formally, given a set of
weights, to make predictions we need to solve ybar* = argmax_ybar
p(ybar | xbar; w). This is difficult since the number of
alternative ybar is exponential.
(2) Training: Given a set of xbar, ybar pairs we want to find the
weights w that maximize the product of conditional probabilities.
We can do this as for any log-linear model, assuming we can
compute log p(ybar | xbar; w), so:
(3) Compute p(ybar | xbar; w) = [ 1/Z(xbar) ]
exp (SUM_j wj Fj(xbar,ybar)). The difficulty here is that
the denominator is again a sum over all ybar: Z(xbar) =
SUM_ybar' exp (SUM_j wj Fj(xbar, ybar')).
For all these tasks, we will need tricks to sum over all possible ybar.
The fact that features can depend on at most two labels,
which have to be adjacent, will make these tricks possible.
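Before seeing those tricks, it helps to see the naive approach they replace. The sketch below computes Z(xbar) and the argmax sequence by enumerating all |labels|^n tag sequences; the two-label tag set {N, V} and the two features are illustrative. The cost is exponential in sentence length, which is exactly what the forward and Viterbi algorithms avoid.

```python
import itertools
import math

def score(w, feats, xbar, ybar):
    """SUM_j wj Fj(xbar, ybar), with Fj the position-sum of fj;
    a START pseudo-label precedes the first tag."""
    labels = ["START"] + list(ybar)
    return sum(wj * fj(labels[i], labels[i + 1], xbar, i)
               for wj, fj in zip(w, feats)
               for i in range(len(xbar)))

def brute_force(w, feats, xbar, label_set):
    """Enumerate every tag sequence to get Z(xbar) and the argmax ybar*.
    Exponential in len(xbar) -- usable only for toy examples."""
    seqs = list(itertools.product(label_set, repeat=len(xbar)))
    scores = [score(w, feats, xbar, ybar) for ybar in seqs]
    Z = sum(math.exp(s) for s in scores)
    best = seqs[max(range(len(seqs)), key=lambda k: scores[k])]
    return Z, best

# Two illustrative features: reward N (noun) on capitalized words, and
# reward a V (verb) immediately after an N.
feats = [
    lambda yp, y, xbar, i: 1.0 if y == "N" and xbar[i][0].isupper() else 0.0,
    lambda yp, y, xbar, i: 1.0 if yp == "N" and y == "V" else 0.0,
]
w = [2.0, 1.0]
Z, ybar_star = brute_force(w, feats, ["Graham", "smiled"], ["N", "V"])
```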