CSE 254 LECTURE NOTES

April 17, 2007


ANNOUNCEMENTS

The schedule of presentations is mostly set.  I have assigned only one slot to each person so far.  We'll plan the rest of the quarter after the schedule of first presentations has settled down.

LINEAR CRFS

Use vector notation for sequences, so boldface x or xbar means a sequence of variable length.  Let xbar be a sequence of n words and let ybar be a corresponding sequence of tags.  Define the log-linear model:

p(ybar | xbar; w)  =  [ 1/Z(xbar) ]  exp [ SUM_j wj Fj(xbar,ybar) ].

Assume that each feature Fj is actually a sum along the sentence, for i=1 to i=n:

Fj(xbar, ybar)  =  SUM_i fj(y_i-1, y_i, xbar, i).

All the fj features share the same weight wj.  Summing each fj over i means that we can have a fixed set of features for log-linear training even though the training examples are not fixed-length.


TWO PROBLEMS TO SOLVE

(1) In words:  How can we compute the maximum-probability sequence of labels for a test sentence?  Formally, given a set of weights, to make predictions we need to solve  
ybar* = argmax_ybar p(ybar | xbar; w).  
This is difficult since the number of alternative ybar is exponential.

For training:  Given a set of xbar, ybar pairs we want to find the weights w that maximize the product of conditional probabilities.  We can do this in the same way as for any log-linear model assuming we can compute log probabilities and gradients.   Specifically, we need to do the following.

(2) For each training example compute  
p(ybar | xbar; w)  =  [ 1/Z(xbar) ]  exp (SUM_j wj Fj(xbar,ybar).  
The difficulty here is that the denominator is again a sum over all ybar:  Z(xbar)  =  SUM_ybar'   exp (SUM_j wj Fj(xbar, ybar').  

For both these tasks, we need tricks to sum over all possible ybar.  The fact that features can depend on at most two labels, which have to be adjacent, makes these tricks possible.


THE VITERBI ALGORITHM

Let's solve problem (1) above efficiently.  First note that we can ignore the denominator.  We want to compute  
ybar*  =  argmax_ybar  p(ybar | xbar; w)  =  argmax_ybar  exp [ SUM_j wj Fj(xbar,ybar) ].

Now use the definition of Fj.  We get
ybar*  =  argmax_ybar  [ SUM_j SUM_i fj(y_i-1, y_i, xbar, i) ]  =  argmax_ybar  [ SUM_i gi(y_i-1, y_i, xbar) ]
where
gi(y_i-1, y_i)  =  SUM_j wj fj(y_i-1, y_i, xbar, i).

Note that the xbar and i arguments of fj have been dropped in the definition of gi.  Remember that each entry of the ybar vector is one of a finite set of labels.  Let v range over these labels.

Define U(k,v) to be the score of the best sequence of labels from 1 to k, where label k is required to be v.  Formally:
U(k,v)  =  max {y1, ..., yk-1} [ gk(y_k-1, v) + SUM_i<=k-1 gi(y_i-1, y_i) ]

Now we can write down a recurrence that lets us compute U(k,v) efficiently:
U(k,v)  =  max_w [ U(k-1,w) +  gk(w, v) ]

With this recurrence we can compute ybar for any xbar in O(m^2n) time where n is the length of xbar and m is the cardinality of the set of labels.

This algorithm is a variation of the Viterbi algorithm for computing the highest-probability path through a hidden Markov model.


COMPUTING THE PARTITION FUNCTION

The denominator of the probability formula is called the partition function:  Z(xbar)  =  SUM_ybar' exp [ SUM_j wj Fj(xbar, ybar') ].  
As above, we can write
Z(xbar)  =  SUM_ybar exp [ SUM_i gi(y_i-1, y_i, xbar) ]   =  SUM_ybar PROD_i [ exp gi(y_i-1, y_i) ]  

We can compute this efficiently by matrix multiplication.  Define Mt(v,w)  =  exp [ gt(v, w) ].  Consider multiplying M1 and M2.  We have
M12(u,w)  =  SUM_v M1(u,v)M2(v,w)  =  SUM_v exp [ g1(u, v, xbar, 1) + g2(v, w, xbar, 2)].
Similarly
M123(u,x)  =  SUM_w M12(u,w)M3(w,x)  =  SUM_w [ SUM_v M1(u,v)M2(v,w) ]M3(w,x)  =  SUM_v,w M1(u,v)M2(v,w)M3(w,x)
and so on.

Consider the <start,stop> entry of the entire product M123...n+1.  This is  T = SUM_ybar M1(start,y1) M2(y1,y2) ... Mn+1(yn,stop).
We have
T = SUM_ybar exp[g1(start,y1)] exp[g2(y1,y2)] ... exp[gn+1(yn,stop)]  =  SUM_ybar PROD_k exp[gk(yk-1,yk)]
which is exactly what we need.

Computational complexity:  Each matrix is m by m where m is the cardinality of the label set.  Each matrix multiplication requires O(m^3) time, so the total time is O(nm^3).  We have reduced a sum over an exponential number of alternatives to a polynomial-time computation.  However, even though polynomial, this is way worse than the time needed by the Viterbi algorithm.  Interesting question:  Is computing the partition function harder in some fundamental way than computing the most likely label sequence?

(Note on notation: u, v, w, and x above are all single labels;  w is not a weight and x is not a component of xbar.)


FORWARD-BACKWARD ALGORITHM

The matrix multiplication method for computing the partition function is called a "forward-backward" algorithm.  A similar algorithm can be used to compute any function of the form SUM_ybar ht(y_t-1, yt)


EXTENSIONS TO THE BASIC LINEAR-CHAIN CRF

The input vector xbar is treated as a unit so it does not have to be a sequence.  It could be a two-dimensional image for example, or a collection of separate individuals, e.g. telephone customers.  What is fundamental is that the label set ybar has some structure.  Specifically, only small subsets of labels interact directly with each other.  Here, every interacting subset is a pair.  Generally, tractability arises from the interacting subsets being small.  Often, the real-world reason interacting subsets are small is that interactions are short-distance.

The objective function used for training is not the same one that we really want to maximize on test data.  Maximizing CLL instead of LL on training data is a step forward, but further progress can be imagined.  Instead of maximizing CLL we could maximize (all on training data):
A fundamental issue is whether we want to maximize a pointwise objective.  For a long sequence, we may have a vanishing chance of predicting the entire label sequence correctly.  The single sequence with highest probability may be very different from the most probable label at each position.
.