Handout today: The CRF tutorial by Sutton and McCallum.

The method will use a "line search": x_{i+1} = argmax_t f(x_i + t*s_i) for fixed x_i and s_i. The direction vector s_i will be a function of the gradient at x_i. Let's assume we know how to do line search: this is a one-dimensional problem, so it is not too difficult.
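As a concrete illustration, here is a minimal sketch of gradient ascent with a crude line search. The step-halving strategy and all function names are placeholders for illustration, not a prescribed method:

```python
import numpy as np

def line_search(f, x, s, t_max=1.0, n_halvings=20):
    # Crudely approximate argmax_t f(x + t*s) by trying
    # t_max, t_max/2, t_max/4, ... and keeping the best t.
    best_t, best_val = 0.0, f(x)
    t = t_max
    for _ in range(n_halvings):
        val = f(x + t * s)
        if val > best_val:
            best_t, best_val = t, val
        t /= 2.0
    return best_t

def gradient_ascent(f, grad_f, x0, n_iters=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        s = grad_f(x)              # direction s_i from the gradient
        t = line_search(f, x, s)   # one-dimensional maximization
        x = x + t * s
    return x
```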

[Adapted from Wikipedia.] Provided that f(x) is a twice-differentiable function and the initial guess x_0 is chosen close enough to x*, the sequence (x_n) defined by

x_{n+1} = x_n - f'(x_n) / f''(x_n)

will converge towards x*.

This iterative scheme can be generalized to several dimensions by replacing the derivative with the gradient, grad f(x), and the reciprocal of the second derivative with the inverse of the Hessian matrix, [H f(x)]^{-1}. One obtains the iterative scheme

x_{n+1} = x_n - [H f(x_n)]^{-1} grad f(x_n).

Usually Newton's method is modified to include a small step size γ > 0 instead of γ = 1:

x_{n+1} = x_n - γ [H f(x_n)]^{-1} grad f(x_n).
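A minimal sketch of this multidimensional iteration, assuming callables grad_f and hess_f that return the gradient vector and Hessian matrix; the linear system is solved rather than forming the inverse, as discussed below:

```python
import numpy as np

def newton_method(grad_f, hess_f, x0, gamma=1.0, n_iters=50, tol=1e-8):
    # Iterate x <- x - gamma * [H f(x)]^{-1} grad f(x).
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:   # stationary point reached
            break
        # Solve H step = g instead of computing H^{-1} explicitly.
        step = np.linalg.solve(hess_f(x), g)
        x = x - gamma * step
    return x
```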

The geometric interpretation of Newton's method is that at each iteration one approximates f(x) by a quadratic function around x_n, and then takes a step towards the maximum/minimum of that quadratic function. (If f(x) happens to *be* a quadratic function, then the exact extremum is found in one step.)

Newton's method converges much faster towards a local maximum or
minimum than gradient descent. In fact, every local minimum has a
neighborhood N such that, if we start with x_0 in N, Newton's method with step size γ = 1 converges quadratically (if the Hessian is invertible in that neighborhood).

Finding the inverse of the Hessian is an expensive operation, so the linear equation

[H f(x_n)] (x_{n+1} - x_n) = - grad f(x_n)

is often solved approximately (but to great accuracy) using a method such as conjugate gradient. There also exist various quasi-Newton methods, where an approximation for the Hessian is used instead.
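For example, one Newton update via conjugate gradient might look like the following sketch (toy numbers; scipy's cg assumes a symmetric positive-definite matrix, which holds for the Hessian near a local minimum):

```python
import numpy as np
from scipy.sparse.linalg import cg

x_n = np.zeros(2)                  # current iterate (toy values)
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])         # Hessian at x_n
g = np.array([1.0, 2.0])           # gradient at x_n

# Solve H * delta = -g approximately instead of computing H^{-1}.
delta, info = cg(H, -g)            # info == 0 means cg converged
x_next = x_n + delta
```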

For Gaussian regularization, we subtract from the log conditional likelihood the penalty

SUM_i w_i^2 / (2 s^2)

where the sum is over all weights w_i. There is no useful theory to suggest a good choice for s^2, but results are typically insensitive to changing s^2 by up to a factor of 10. Use cross-validation to find the value of s^2 that seems to lead to the best test-set performance.
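In code, the regularized objective and its gradient are a one-line change. Here log_lik and grad_log_lik are hypothetical callables for the unregularized log conditional likelihood and its gradient:

```python
import numpy as np

def regularized(log_lik, grad_log_lik, w, s2):
    # Log conditional likelihood minus the Gaussian penalty,
    # together with the corresponding gradient.
    obj = log_lik(w) - np.sum(w ** 2) / (2.0 * s2)
    grad = grad_log_lik(w) - w / s2
    return obj, grad

# Cross-validate s^2 over a coarse grid; by the remark above,
# factor-of-10 spacing is typically fine.
candidate_s2 = [0.01, 0.1, 1.0, 10.0, 100.0]
```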

An advantage of Gaussian regularization is that it makes the objective function strictly concave (equivalently, its negative strictly convex), which avoids optimization difficulties.

Note that Gaussian regularization is different from the regularization used in the discriminant NB paper referred to on Assignment 1. It would be interesting to investigate experimentally which is preferable.

To promote sparsity, i.e. many weights being exactly zero, we can use the absolute value of w_i instead of its square. This also has a probabilistic interpretation.
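A sketch of the same idea with an absolute-value penalty; the scale s and the use of np.sign as a subgradient are illustrative assumptions:

```python
import numpy as np

def l1_regularized(log_lik, grad_log_lik, w, s):
    # Absolute-value penalty; not differentiable at w_i = 0,
    # so np.sign(w) is only a subgradient there.
    obj = log_lik(w) - np.sum(np.abs(w)) / s
    grad = grad_log_lik(w) - np.sign(w) / s
    return obj, grad
```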

Given a sentence, the task is to label each word as noun, verb, adjective, preposition, etc. There is a fixed known set of these part-of-speech (POS) labels. Each sentence is a separate training/test example. We will represent a sentence by features based on its words. Features will be very general:

- Some features will be position-specific while others will be sums over the length of the sentence.
- Some features will look just at a word: "starts with a capital letter", "starts with 'G'", "is 'Graham'", "is six letters long". Generally we can code suffixes, prefixes, facts from a lexicon, preceding/following punctuation, etc., as features (see the sketch after this list).
- Features can also use the words one to the left, one to the right, two to the left etc., and the whole sentence.
- Restriction: Features can only use neighboring tags.
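To make the word-level bullets concrete, here are those example predicates as code (the names are illustrative; lexicon lookups and affix checks would be analogous):

```python
def starts_with_capital(word):
    return word[:1].isupper()

def starts_with_g(word):
    return word.startswith("G")

def is_graham(word):
    return word == "Graham"

def is_six_letters_long(word):
    return len(word) == 6

checks = [starts_with_capital, starts_with_g, is_graham, is_six_letters_long]
print([f("Graham") for f in checks])   # [True, True, True, True]
```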

This is an example of a "structured prediction" task. We want to predict a complex label (a POS sequence) for a complex input (an entire sentence).

Why is this task difficult, i.e. why is it not a standard classifier learning task?

(1) We lose too much information by learning a per-word classifier.

(2) Different sentences have different lengths, so how can we have a common vector of attributes for each sentence?

(3) The set of all possible sequences of labels is intractably large.

Define the log-linear model:

p(ybar | xbar; w) = [ 1/Z(xbar) ] exp [ SUM_j wj Fj(xbar,ybar) ].

Assume that each feature Fj is actually a sum along the sentence, for i = 1 to i = n:

Fj(xbar, ybar) = SUM_i fj(y_{i-1}, y_i, xbar, i).

This notation means that each feature fj can depend on the whole sentence, the current label, the previous label, and the current position i within the sentence. Note that a feature can depend on only a subset of these four possible influences. Examples of features:

- "the current label is NOUN and the current word is capitalized"
- "the word at the start of the sentence is Mr."
- "the previous label was SALUTATION"


All occurrences of a feature fj, at every position i, share the same weight wj. Summing each fj over i means that we can have a fixed set of features for log-linear training even though the training examples are not of fixed length.
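A minimal sketch of this decomposition; the particular feature, labels, and the START placeholder for y_0 are illustrative conventions, not fixed by the notes:

```python
def f_noun_capitalized(y_prev, y_cur, xbar, i):
    # Low-level feature: current label is NOUN and the current
    # word is capitalized (uses only y_cur, xbar, and i).
    return 1 if y_cur == "NOUN" and xbar[i][:1].isupper() else 0

def F(fj, xbar, ybar, start="START"):
    # Fj(xbar, ybar) = SUM_i fj(y_{i-1}, y_i, xbar, i),
    # with a placeholder label for the position before the sentence.
    total = 0
    for i in range(len(xbar)):
        y_prev = ybar[i - 1] if i > 0 else start
        total += fj(y_prev, ybar[i], xbar, i)
    return total

xbar = ["Mr.", "Graham", "speaks"]
ybar = ["SALUTATION", "NOUN", "VERB"]
print(F(f_noun_capitalized, xbar, ybar))   # 1, from "Graham"
```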

(1) Prediction: Given a sentence xbar, find the most likely label sequence ybar* = argmax_ybar p(ybar | xbar; w), where the argmax ranges over all possible sequences.

(2) Training: Given a set of xbar, ybar pairs, we want to find the weights w that maximize the product of conditional probabilities. We can do this as for any log-linear model, assuming we can compute log p(ybar | xbar; w), which leads to:

(3) Compute p(ybar | xbar; w) = [ 1/Z(xbar) ] exp [ SUM_j wj Fj(xbar, ybar) ]. The difficulty here is that the denominator is again a sum over all ybar: Z(xbar) = SUM_ybar' exp [ SUM_j wj Fj(xbar, ybar') ].
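For intuition, here is a brute-force computation of Z(xbar) that enumerates every label sequence. With |labels|^n terms it is only usable for tiny sentences, which is exactly the blow-up the tricks below avoid. Each element of `features` is assumed to be a callable Fj(xbar, ybar), like F above with fj fixed:

```python
import itertools
import math

def brute_force_Z(weights, features, xbar, labels):
    # Z(xbar) = SUM over all ybar' of exp(SUM_j wj Fj(xbar, ybar')).
    # Enumerates |labels|^n sequences: exponential in sentence length.
    Z = 0.0
    for ybar in itertools.product(labels, repeat=len(xbar)):
        score = sum(w * Fj(xbar, ybar) for w, Fj in zip(weights, features))
        Z += math.exp(score)
    return Z
```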

For all these tasks, we will need tricks to sum over all possible ybar. The fact that features can depend on at most two labels, which have to be adjacent, will make these tricks possible.