CSE 150 LECTURE NOTES
November 8, 2004
ANNOUNCEMENTS
Remember that the midterm exam is on Wednesday.
NAIVE BAYES IMPLEMENTATION ISSUES
Problem 1: Suppose we have hundreds of features. Then we
are
multiplying together hundreds of numbers (one for each feature), each
of which is less than 1.0. We could get numerical underflow,
since the smallest non-zero number representable with standard 64 bit
arithmetic is about 10 to the power -300.
Solution: Instead of multiplying numerical probabilities, add the
logarithms of the probabilities.
Problem 2: Suppose one value of one feature, for example value y
of
feature j, is never observed in the training data for some class
c. Then we will have the estimated P(Xj = y | C = c) = 0
exactly. By multiplication, for any test example with feature j =
y we will have P(C = c | X) = 0 regardless of the values of all other
features.
Solution: Make all probabilities non-zero (and also not exactly
1.0) by "smoothing" with pseudo-counts. Define
P(C = c) = (estimated) count(C = c) + 1 / N + r.
P(Xj=x | C = c) = (estimated) count(Xj =
x and C = c) + 1 / count(C = c) + q.
GENERALIZATION, OVERFITTING, CONCEPT DRIFT
A desirable classifier is one that classifies correctly all
possible examples. Classifying examples
given in the training set correctly is just a proxy for this
objective. Terminology: A correct concept generalizes to
unseen examples.
The philosophical problem of induction is that a finite set of
examples
never allows you to generalize with perfect certainty to the infinite
set
of potential future examples.
The practical problem of overfitting is that usually, a classifier is
specialized for its training examples. Its accuracy on test data
is almost always lower than on training data.
Concept drift is a different problem. It is that the true concept
may be different for test examples than for training examples.
For example, the training examples may contain spam from 2003, while
the test examples consist of spam for 2004. Usually, the issue in
the concept drift is not that the true label of any given example has
changed. Instead, the problem is that the relative proportions of
different classes, and of different groups with in each class, have
changed.
MEASURING ACCURACY
For supervised learning with two possible classes, performance must be
measured with four numbers computed using a test set of examples that
is
independent of the training set:
- true
positives
tp
- false
positives
fp
- true
negatives
tn
- false
negatives
fn
Above, "tp" means the number of test examples that are true positives,
etc. Summary statistics and costs can be calculated from these
numbers, e.g.
accuracy = (tp + tn)/n. However, no single number such as
"accuracy" describes performance completely. In some applications
(e.g. spam filtering) false positives (saying that a spam message is
actually legitimate) are much less important than false
negatives. In other applications, the opposite is true.
CROSS-VALIDATION
What if we want to use all available examples in the training
set?
That is, we have too few to spare any for an independent test set?
Usual answer: use k-fold cross-validation for some k.
Procedure:
randomly partition the training set S into k disjoint
subsets
of equal size named S1 through Sk
for i = 1 to k do
run the learning algorithm on S with Si removed
test the resulting classifier on Si
sum tp, fp, tn, fn obtained for each Si
Under full cross-validation k = |S|, i.e. the number of examples in the
dataset. Full cross-validation is common but not recommended
because
it is computationally very expensive, and because its results can be
misleading.
For example, if each example is duplicated in the training set and we
use
a nearest-neighbor classifier, then full cross-validation will show a
zero
error rate.
Ten-fold cross-validation is used most often nowadays.