CSE 150 LECTURE NOTES

November 8, 2004
 
 

ANNOUNCEMENTS

Remember that the midterm exam is on Wednesday.


NAIVE BAYES IMPLEMENTATION ISSUES

Problem 1:  Suppose we have hundreds of features.  Then we are multiplying together hundreds of numbers (one for each feature), each of which is less than 1.0.  We could get numerical underflow, since the smallest non-zero number representable with standard 64 bit arithmetic is about 10 to the power -300.

Solution:  Instead of multiplying numerical probabilities, add the logarithms of the probabilities.

Problem 2:  Suppose one value of one feature, for example value y of feature j, is never observed in the training data for some class c.  Then we will have the estimated P(Xj = y | C = c) = 0 exactly.  By multiplication, for any test example with feature j = y we will have P(C = c | X) = 0 regardless of the values of all other features.

Solution:  Make all probabilities non-zero (and also not exactly 1.0) by "smoothing" with pseudo-counts.  Define
            P(C = c)   = (estimated)   count(C = c) + 1 / N + r.
P(Xj=x |  C = c)   = (estimated)   count(Xj = x and C = c) + 1 / count(C = c) + q.


GENERALIZATION, OVERFITTING, CONCEPT DRIFT

A desirable classifier is one that classifies correctly all possible examples.  Classifying examples given in the training set correctly is just a proxy for this objective.  Terminology: A correct concept generalizes to unseen examples.

The philosophical problem of induction is that a finite set of examples never allows you to generalize with perfect certainty to the infinite set of potential future examples.

The practical problem of overfitting is that usually, a classifier is specialized for its training examples.  Its accuracy on test data is almost always lower than on training data.

Concept drift is a different problem.  It is that the true concept may be different for test examples than for training examples.  For example, the training examples may contain spam from 2003, while the test examples consist of spam for 2004.  Usually, the issue in the concept drift is not that the true label of any given example has changed.  Instead, the problem is that the relative proportions of different classes, and of different groups with in each class, have changed.


MEASURING ACCURACY

For supervised learning with two possible classes, performance must be measured with four numbers computed using a test set of examples that is independent of the training set: Above, "tp" means the number of test examples that are true positives, etc.  Summary statistics and costs can be calculated from these numbers, e.g. accuracy = (tp + tn)/n.  However, no single number such as "accuracy" describes performance completely.  In some applications (e.g. spam filtering) false positives (saying that a spam message is actually legitimate) are much less important than false negatives.  In other applications, the opposite is true.

 

CROSS-VALIDATION

What if we want to use all available examples in the training set?  That is, we have too few to spare any for an independent test set?

Usual answer: use k-fold cross-validation for some k.  Procedure:

randomly partition the training set S into k disjoint subsets of equal size named S1 through Sk
for i = 1 to k do
    run the learning algorithm on S with Si removed
    test the resulting classifier on Si
    sum tp, fp, tn, fn obtained for each Si
Under full cross-validation k = |S|, i.e. the number of examples in the dataset.  Full cross-validation is common but not recommended because it is computationally very expensive, and because its results can be misleading.  For example, if each example is duplicated in the training set and we use a nearest-neighbor classifier, then full cross-validation will show a zero error rate. Ten-fold cross-validation is used most often nowadays.