CSE 254 LECTURE NOTES
April 17, 2007
ANNOUNCEMENTS
The schedule of presentations is mostly set.
I have assigned only one slot to each person so far. We'll
plan the rest of the quarter after the schedule of first presentations
has settled down.
LINEAR CRFS
Use vector notation for
sequences, so boldface x or xbar means a sequence of
variable length. Let xbar be a sequence of n words and let
ybar be a corresponding sequence of tags. Define the log-linear model:
p(ybar | xbar; w) = [ 1/Z(xbar) ] exp [ SUM_j wj Fj(xbar,ybar) ].
Assume that each feature Fj is actually a sum along the sentence, for i=1 to i=n:
Fj(xbar, ybar) = SUM_i fj(y_i-1, y_i, xbar, i).
All the fj features share the same weight wj. Summing each fj
over i means that we can have a fixed set of features for log-linear
training even though the training examples are not fixed-length.
TWO PROBLEMS TO SOLVE
(1) In words: How can we compute the maximum-probability
sequence of labels for a test sentence? Formally, given a set of
weights, to make predictions we need to solve
ybar* = argmax_ybar
p(ybar | xbar; w).
This is difficult since the number of
alternative ybar is exponential.
For training: Given a set of xbar, ybar pairs we want to find the
weights w that maximize the product of conditional probabilities.
We can do this in the same way as for any log-linear model assuming we can
compute log probabilities and gradients. Specifically, we need to do the following.
(2) For each training example compute
p(ybar | xbar; w) = [ 1/Z(xbar) ]
exp (SUM_j wj Fj(xbar,ybar).
The difficulty here is that
the denominator is again a sum over all ybar: Z(xbar) =
SUM_ybar' exp (SUM_j wj Fj(xbar, ybar').
For both these tasks, we need tricks to sum over all possible ybar.
The fact that features can depend on at most two labels,
which have to be adjacent, makes these tricks possible.
THE VITERBI ALGORITHM
Let's solve problem (1) above efficiently. First note that we can ignore the denominator. We want to compute
ybar* = argmax_ybar
p(ybar | xbar; w) = argmax_ybar exp [ SUM_j wj Fj(xbar,ybar) ].
Now use the definition of Fj. We get
ybar* = argmax_ybar [
SUM_j SUM_i fj(y_i-1, y_i, xbar, i) ] = argmax_ybar [
SUM_i gi(y_i-1, y_i, xbar) ]
where
gi(y_i-1, y_i) = SUM_j wj fj(y_i-1, y_i, xbar, i).
Note that the xbar and i arguments of fj have been dropped in the
definition of gi. Remember that each entry of the ybar vector is
one of a finite set of labels. Let v range over these labels.
Define U(k,v) to be the score of the best sequence of labels from 1 to k, where label k is required to be v. Formally:
U(k,v) = max {y1, ..., yk-1} [ gk(y_k-1, v) + SUM_i<=k-1 gi(y_i-1, y_i) ]
Now we can write down a recurrence that lets us compute U(k,v) efficiently:
U(k,v) = max_w [ U(k-1,w) + gk(w, v) ]
With this recurrence we can compute ybar for any xbar in O(m^2n) time
where n is the length of xbar and m is the cardinality of the set of
labels.
This algorithm is a variation of the Viterbi algorithm for computing the highest-probability path through a hidden Markov model.
COMPUTING THE PARTITION FUNCTION
The denominator of the probability formula is called the partition function: Z(xbar) =
SUM_ybar' exp [ SUM_j wj Fj(xbar, ybar') ].
As above, we can write
Z(xbar) =
SUM_ybar exp [ SUM_i gi(y_i-1, y_i, xbar) ] =
SUM_ybar PROD_i [ exp gi(y_i-1, y_i) ]
We can compute this efficiently by matrix multiplication. Define
Mt(v,w) = exp [ gt(v, w) ]. Consider multiplying
M1 and M2. We have
M12(u,w) = SUM_v M1(u,v)M2(v,w) = SUM_v exp [ g1(u, v, xbar, 1) + g2(v, w, xbar, 2)].
Similarly
M123(u,x) = SUM_w
M12(u,w)M3(w,x) = SUM_w [ SUM_v M1(u,v)M2(v,w) ]M3(w,x)
= SUM_v,w M1(u,v)M2(v,w)M3(w,x)
and so on.
Consider the <start,stop> entry of the entire product M123...n+1.
This is T = SUM_ybar M1(start,y1) M2(y1,y2) ...
Mn+1(yn,stop).
We have
T = SUM_ybar exp[g1(start,y1)] exp[g2(y1,y2)] ... exp[gn+1(yn,stop)] = SUM_ybar PROD_k exp[gk(yk-1,yk)]
which is exactly what we need.
Computational complexity: Each matrix is m by m where m is the
cardinality of the label set. Each matrix multiplication requires
O(m^3) time, so the total time is O(nm^3). We have reduced a sum
over an exponential number of alternatives to a polynomial-time
computation. However, even though polynomial, this is way worse
than the time needed by the Viterbi algorithm. Interesting
question: Is computing the partition function harder in some
fundamental way than computing the most likely label sequence?
(Note on notation: u, v, w, and x above are all single labels; w is not a weight and x is not a component of xbar.)
FORWARD-BACKWARD ALGORITHM
The matrix multiplication method for computing the partition function
is called a "forward-backward" algorithm. A similar algorithm can
be used to compute any function of the form SUM_ybar ht(y_t-1, yt)
EXTENSIONS TO THE BASIC LINEAR-CHAIN CRF
The input vector xbar is treated as a unit so it does not have to
be a sequence. It could be a two-dimensional image for example,
or a collection of separate individuals, e.g. telephone customers.
What is fundamental is that the label set ybar has some
structure. Specifically, only small subsets of labels interact
directly with each other. Here, every interacting subset is a
pair. Generally, tractability arises from the interacting subsets
being small. Often, the real-world reason interacting subsets are
small is that interactions are short-distance.
The objective function used for training is not the same one that we
really want to maximize on test data. Maximizing CLL instead of
LL on training data is a step forward, but further progress can be
imagined. Instead of maximizing CLL we could maximize (all on
training data):
- yes/no accuracy
- pointwise conditional log likelihood
- mean-squared error for binary labels
- some measure of distance for numerical labels
A fundamental issue is whether we want to maximize a pointwise
objective. For a long sequence, we may have a vanishing chance of
predicting the entire label sequence correctly. The single
sequence with highest probability may be very different from the most
probable label at each position.
.