This assignment is a second warm-up project for the main project for
254.
The objective is to understand the basic CRF algorithms
thoroughly by implementing them yourself. Using Matlab is
recommended, with the same nonlinear optimization package that you used
for the first assignment.
The goal is to learn a CRF model that can place hyphens in English words correctly. The training set consists of 11697 English words with correct hyphens
indicated. The hyphens indicated are those that TeX considers
correct. Generally these are a strict subset of those that a
standard dictionary will indicate as correct. I have checked the
list but it is likely that a few errors remain. Imperfect
training data is normal for real applications.
The algorithms you need to implement include the matrix multiplication
method for computing the partition function and the Viterbi algorithm
for computing the most probable labeling of a sequence. Very
important: you need to be sure that your code is correct, and in
your report you need to convince the reader that your code is correct.
Use the checkgrad.m function written by Carl Rasmussen.
After you are sure that your implementation is correct, tune your
learning algorithm using crossvalidation. Use two performance
metrics: word-level accuracy
and letter-level accuracy. The former is the fraction of words
that your method gets exactly right. The latter is the fraction
of letters for which your method predicts correctly whether or not a
hyphen is legal in front of this letter.
You need to define and code your own set of log-linear features.
Be creative, but start with a small set for debugging purposes.
Answer some or all of the following questions:
(1) Most classifiers produce numerical scores, and yes/no predictions
are based on whether the predicted score is above or below a threshold.
With your CRF, are there one or more thresholds that you can
tune to improve word-level accuracy and/or letter-level accuracy?
(2) Is there an objective function for training that is different from
conditional log likelihood that leads to better results for word-level
accuracy, or for letter-level accuracy?
(3) How does performance depend on which features you use?
Is there some ceiling on achievable performance regardless of
what features are used?
(4) Can you extend the standard CRF algorithms to allow a label to
depend directly on other labels in addition to the immediately
preceding label? Does this extension improve performance?
What to submit.
To quote Sanjoy Dasgupta, "Discuss your results in precise and lucid
prose. Content is king, but looks matter too!" For further
advice, see
the article Crafting
Papers on Machine Learning by Pat Langley and Three
Sins of Authors in Computer Science and Math by Jonathan Shewchuk.