CSE 150 LECTURE NOTES

November 1, 2004
 
 

ANNOUNCEMENTS

Today's handout is the description of the third assignment.

 

FRAMEWORK FOR LEARNING A CLASSIFIER

With N training examples, the training data are a matrix with N rows and p columns, where each example is represented by values for p different features.  Let feature value j for example i be x_ij.
The label of example i is y_i, for example y_i = 1 if message i is spam and y_i = 0 if it is not spam.

Each test example is also represented as a row vector of length p.  The label y for a test example is unknown.  The output of the classifier is a guess at y.

  

RANDOM VARIABLES

We view each training example as a vector of random values.  Mathematically, each feature is a random variable.  Note that the class is a random variable also.

Let C be the class random variable, and let X1 to Xp be the feature random variables.  For a training example, the value of all of these is known.  For a test example, only the value of X1 to Xp is known. 

Notation:  P(X = x) means the probability that the random variable X (always capitalized) has the particular value x (always lower case).  A "probability density function" or "pdf" assigns a probability to each value of a random variable.  We write X ~ f(x) to mean P(X = x) = f(x).

When we write just P(X) we mean P(X = x) for any individual value x.  The notation P(X|Y) means P(X=x | Y = y). 

If X and Y are independent, then P(X,Y) = P(X)P(Y).  This formula says that for all values x and for all values y, P(X =x and Y = y) = P(X =x)P(Y = y).


BAYESIAN LEARNING

Bayesian learning is an approach to learning a classifier using the language of probability theory and Bayes' rule.

Given a test example, we the classifier to want to estimate P(C), i.e. P(C = c) for each different value C might take.  Often, the only possible values are C = 0 and C = 1.  More precisely, we want to estimate the conditional probability P(C | X1 and ... and Xp), which is different for each example since each example has different values for the random variables X1 to Xp. 

Let's use Bayes' rule on the test example:        P(C=c | X1=x1 and ... and Xp=xp)  =  P(X1=x1 and ... and Xp=xp | C=c) P(C=c) / P(X1=x1 and ... and Xp=xp).

There are three factors above that we need to learn from the training data.

The easiest is P(C = c).  The unconditional P(C = c) is the overall probability of label value c in the whole population (also called the incidence of C=c or the base rate of C=c).  We estimate this as count(C = c)/n where count(C =c) means the number of training examples with value c for the class, and n is the total number of training examples

The denominator is next easiest, because for any fixed X, it is the same for all different c.  There are three different ways to deal with the denominator:
  1. If we just want to know which c is most probable given X, we just need to find which numerator is highest.
  2. SUM_c  P(C=c | X1=x1 and ... and Xp=xp) = 1 always, so we can calculate the numerator for every c and then calculate the value for the denominator that makes this sum equal to 1.  In other words, the different labels c are mutually exclusive, so
P(X1=x1 and ... and Xp=xp) = SUM_c P(X1=x1 and ... and Xp=xp | C=c) P(C=c). 

With just two labels 0 and 1 we have

P(X1=x1 and ... and Xp=xp) = P(X1=x1 and ... and Xp=xp | C=0) P(C=0) + P(X1=x1 and ... and Xp=xp | C=1) P(C=1).

All the equations above are mathematically always true.  We have not made any assumptions yet.


THE NAIVE BAYES ASSUMPTION: CONDITIONAL INDEPENDENCE

So, we need to estimate P(X1=x1 and ... and Xp=xp | C=c) somehow.

Now we make an important assumption that in general is not exactly true.  Let's assume that

P(X1=x1 and ... and Xp=xp | C=c) = P(X1=x1 C = c) ... P(Xp=xp | C=c).

This is an assumption that the features are independent inside each class.  It is not an overall independence assumption.

GRE and GPA example.