April 12, 2007


Handout today:  The CRF tutorial by Sutton and McCallum.


We have a function from R^n to R that we want to maximize.  We'll use a global method: the entire x vector will be updated at once.  (An alternative is called stochastic descent.)  The method will be iterative: we will find x_1, x_2, etc. such that f(x_i+1) >= f(x_i) for each iteration i.

The method will use a "line search:"  x_i+1 = argmax_t f(x_i + t*s_i)  for fixed x_i and s_i.   The direction vector s_i  will be a function of the gradient at x_i.  Let's assume we know how do line search: this is a one-dimensional problem so it is not too difficult.  


[ Adapted from Wikipedia.]  Provided that f(x) is a twice-differentiable function and the initial guess x0 is chosen close enough to x * , the sequence (xn) defined by

x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}, \ n \ge 0

will converge towards x * .

This iterative scheme can be generalized to several dimensions by replacing the derivative with the gradient, \nabla f(\mathbf{x}), and the reciprocal of the second derivative with the inverse of the Hessian matrix, H f(\mathbf{x}). One obtains the iterative scheme

\mathbf{x}_{n+1} = \mathbf{x}_n - [H f(\mathbf{x}_n)]^{-1} \nabla f(\mathbf{x}_n), \ n \ge 0.

Usually Newton's method is modified to include a small step size γ > 0 instead of γ = 1

\mathbf{x}_{n+1} = \mathbf{x}_n - \gamma[H f(\mathbf{x}_n)]^{-1} \nabla f(\mathbf{x}_n).

The geometric interpretation of Newton's method is that at each iteration one approximates f(\mathbf{x}) by a quadratic function around \mathbf{x}_n, and then takes a step towards the maximum/minimum of that quadratic function. (If f(\mathbf{x}) happens to be a quadratic function, then the exact extremum is found in one step.)

Newton's method converges much faster towards a local maximum or minimum than gradient descent. In fact, every local minimum has a neighborhood N such that, if we start with \mathbf{x}_0 \in N, Newton's method with step size γ = 1 converges quadratically (if the Hessian is invertible in that neighborhood).

Finding the inverse of the Hessian is an expensive operation, so the linear equation

\mathbf{p}_{n} = \mathbf{x}_{n+1}-\mathbf{x}_{n} = [H f(\mathbf{x}_n)]^{-1} \nabla f(\mathbf{x}_n), \ n \ge 0..

is often solved approximately (but to great accuracy) using a method such as conjugate gradient.  There also exist various quasi-Newton methods, where an approximation for the Hessian is used instead.


This is a quasi-Newton method.  It uses the previously computed gradients to compute an estimate of the inverse of the Hessian matrix, then uses this approximate matrix to compute the new search direction.  The limited-memory BFGS method uses just a few most recent gradients to define a low-rank approximation of the Hessian that does not need to be computed or stored explicitly.


Impose a diagonal Gaussian prior on the parameters with mean zero and covariance s^2I.  This is equivalent to adding to the CLL objective function a term
- SUM_i wi^2/2s^2  
where the sum is over all weights w_i.

There is no useful theory to suggest a good choice for s^2 but results are typically insensitive to changes in s^2 up to 10 times.  Use crossvalidation to find the value of s^2 that seems to lead to best test-set performance.

An advantage of Gaussian regularization is that it makes the objective function strictly convex, which avoids optimization difficulties.

Note that Gaussian regularization is different from the regularization used in the discriminant NB paper referred to on Assignment 1.  It would be interesting to investigate experimentally which is preferable.

To promote sparsity, i.e. many weights being exactly zero, we can use the absolute value of w_i insteadof its square.  This also has a probabilistic interpretation.


Finally, we can start talking about CRFs.  We'll extend log-linear models to be CRFs.  First, let's think about an example of a learning task for which CRFs are useful.

Given a sentence, the task is to label each word as noun, verb, adjective, preposition, etc.  There is a fixed known set of these part-of-speech (POS) labels.  Each sentence is a separate training/test example.  We will represent a sentence by features based on its words.  Features will be very general:
The highest-accuracy POS taggers use 100,000s of features.

This is an example of a "structured prediction" task.  We want to predict a complex label (a POS sequence) for a complex input (an entire sentence).  

Why is this task difficult. i.e. why is it not a standard classifier learning task?
(1) We lose too much information by learning a per-word classifier.
(2) Different sentences have different lengths, so how can we have a common vector of attributes for each sentence?
(3) The set of all possible sequences of labels is intractably large.


Linear conditional random fields (CRFs) are a way to extend log-linear models to this type of task.  Use vector notation for sequences, so boldface x or xbar means a sequence of variable length.  Let xbar be a sequence of n words and let ybar be a corresponding sequence of tags.

Define the log-linear model:

p(ybar | xbar; w)  =  [ 1/Z(xbar) ]  exp [ SUM_j wj Fj(xbar,ybar) ].

Assume that each feature Fj is actually a sum along the sentence, for i=1 to i=n:

Fj(xbar, ybar)  =  SUM_i fj(y_i-1, y_i, xbar, i).

This notation means that each feature fj can depend on the whole sentence, the current label and the previous label, and the current position i within the sentence.   Note that a feature can depend on only a subset of these four possible influences.  Examples of features:
"the current label is NOUN and the current word is capitalized"
"the word at the start of the sentence is Mr."
"the previous label was SALUTATION".
All the fj features share the same weight wj.  Summing each fj over i means that we can have a fixed set of features for log-linear training even though the training examples are not fixed-length.


(1) In words:  How can we compute the maximum-probability sequence of labels for a test sentence?  Formally, given a set of weights, to make predictions we need to solve  ybar* = argmax_ybar p(ybar | xbar; w).  This is difficult since the number of alternative ybar is exponential.

(2) Training:  Given a set of xbar, ybar pairs we want to find the weights w that maximize the product of conditional probabilities.  We can do this like for any log-linear model assuming we can compute  log p(ybar | xbar; w), so:

(3) Compute  p(ybar | xbar; w)  =  [ 1/Z(xbar) ]  exp (SUM_j wj Fj(xbar,ybar).  The difficulty here is that the denominator is again a sum over all ybar: Z(xbar)  =  SUM_ybar'   exp (SUM_j wj Fj(xbar, ybar').  

For all these tasks, we will need tricks to sum over all possible ybar.  The fact that features can depend on at most two labels, which have to be adjacent, will make these tricks possible.