Department of Computer Science and Engineering
University of California, San Diego
CSE 254
Spring 2007

Assignment 2 (Revised)

DUE AT THE START OF CLASS ON TUESDAY MAY 8, 2007


Important: Please use this discussion board to ask questions.

This assignment is a second warm-up project for the main project for 254.  The objective is to understand the basic CRF algorithms thoroughly by implementing them yourself.  Using Matlab is recommended, with the same nonlinear optimization package that you used for the first assignment.

The goal is to learn a CRF model that can place hyphens in English words correctly.  The training set consists of 11697 English words with correct hyphens indicated.  The hyphens indicated are those that TeX considers correct.  Generally these are a strict subset of those that a standard dictionary will indicate as correct.  I have checked the list but it is likely that a few errors remain.  Imperfect training data is normal for real applications.

The algorithms you need to implement include the matrix multiplication method for computing the partition function and the Viterbi algorithm for computing the most probable labeling of a sequence.  Very important: you need to be sure that your code is correct, and in your report you need to convince the reader that your code is correct.  Use the checkgrad.m function written by Carl Rasmussen.

After you are sure that your implementation is correct, tune your learning algorithm using crossvalidation.  Use two performance metrics: word-level accuracy and letter-level accuracy.  The former is the fraction of words that your method gets exactly right.  The latter is the fraction of letters for which your method predicts correctly whether or not a hyphen is legal in front of this letter.

You need to define and code your own set of log-linear features.  Be creative, but start with a small set for debugging purposes.

Answer some or all of the following questions:

(1) Most classifiers produce numerical scores, and yes/no predictions are based on whether the predicted score is above or below a threshold.  With your CRF, are there one or more thresholds that you can tune to improve word-level accuracy and/or letter-level accuracy?

(2) Is there an objective function for training that is different from conditional log likelihood that leads to better results for word-level accuracy, or for letter-level accuracy?

(3) How does performance depend on which features you use?  Is there some ceiling on achievable performance regardless of what features are used?

(4) Can you extend the standard CRF algorithms to allow a label to depend directly on other labels in addition to the immediately preceding label?  Does this extension improve performance?

What to submit.
  To quote Sanjoy Dasgupta, "Discuss your results in precise and lucid prose.  Content is king, but looks matter too!"  For further advice, see the article Crafting Papers on Machine Learning by Pat Langley and Three Sins of Authors in Computer Science and Math by Jonathan Shewchuk.