CSE 291 LECTURE NOTES

March 11, 2004
 
 

ANNOUNCEMENTS

The final exam is next Monday from 11:50am to 2:30pm, in this room.  I don't know why the strange time.

See the sample final and the extra credit optional assignment.
 
 

THE LOGISTIC REGRESSION MODEL

On Tuesday I talked about how we can have the same model, but different modeling principles (e.g. MLE versus MVUE) can give us different parameter estimates.  Let's see an important example.  Here, different models lead to a classifier with the same form, but different numerical parameter values.
 
Suppose we have K different classes.  Let x be a multidimensional example.  Assume the following model:
log P(C=j | X=x) / P(C=K | X=x)  =  bj0 + bjTx
Notice there are K-1 linear functions of x.  Function number j gives the log-odds (also called log-ratio) of class j against class K.

Bayes rule says that

P(C=j | X=x)  =  exp( bj0 + bjTx ) / [ 1 + SUM bk0 + bkTx ]

and the probabilities sum to one.

Usually, there are just two classes and just one linear function of x.


MAX LIKELIHOOD

Consider a training set of size N.  In the two class case, let yi = 1 for class 1 and 0 for class 2.  Let p(x,b) be the probability of class 1 given x according to the model with parameters b.

Then the log-likelihood for a training set of size N is

SUMi  yi log p(xi, b) + (1-yi) log (1-p(xi, b))

The likelihood is

SUMi  yi bTxi +  log (1+exp(bTxi))
The score function, i.e. the derivative with respect to the parameters, is
SUMi  xi (yi - p(xi,b))
Setting this to zero, the first equation we get is SUMi  yi  =  SUMi  p(xi,b)  because the first component of each x is 1.
To maximize the likelihood, we use Newton-Raphson optimization.

 

ENTROPY AND KL DISTANCE

Definition:  The entropy of a discrete probability distribution is
H(P)  =  - SUM p(y) log p(y)
Note that the sum here is over all possible outcomes.  The most common base for logarithms is e.  The negative sign makes the entropy always positive.  If p(y) = 0, we define p(y) log p(y) = 0.

For a fixed support of the distribution, the entropy is maximized when p(y) is the same for all y.  This means that the outcome is maximally unpredictable, and hence that the information content of an outcome is maximal.

Entropy and variance both measure unpredictability, but entropy does not depend on the actual values of the possible outcomes.

Definition:  The Kullback-Liebler (KL)-distance between two distributions Pa and Pb is

H(Pa || Pb)  =  SUM pa(y) log pa(y)/pb(y)
Again, the sum here is over all possible outcomes.  If pa(y) = 0, we define pa(y) log pa(y)/pb(y) = 0.  If pa(y) > 0 and pb(y) = 0 then H(Pa || Pb) is infinite.

The KL-distance is always positive, and zero only if P0 and P1 are the same distribution. 

Note that the KL -distance is not symmetric, so it should not be called a distance.  An alternative name is relative entropy.

A reasonable distance metric is called the divergence:  J(P0 || P1)  =  H(P0 || P1) + H(P1 || P0).



KL DISTANCE AND LOGISTIC REGRESSION

Given a particular x_i, Suppose Pa is the empirical distribution based on y_i and Pb is the ML logistic regression distribution.  Then the MLE formula above is the negative of the average KL distance between the empirical distribution and the model distribution:
- SUMyi log p(xi, b)/ y+ (1-yi) log (1-p(xi, b))/(1-yi)
We can evaluate the same expression over a test set to evaluate the accuracy of any method for predicting probabilities (in this case plain accuracy, defined as percentage of correct predictions, is not appropriate because predictions are not discrete.)  The big disadvantage of the evaluation formula above is that if the method predicts probability exactly zero for the correct label of any test example, then the formula yields infinity.  It does not distinguish between one totally false prediction versus many.


LINEAR DISCRIMINANT ANALYSIS (LDA)

This is a probabilistic classifier that is based on a different model, but leads to a formula with the same parameters.

Suppose the prior probability of each class is pik. and each class has pdf f_k(x).  Then
P(C=j | X=x) = pij fj(x) / SUMk pik fk(x)
This says that if we knew the f_k(x), classification would be easCSE 291 Lecture Notes for November 20, 2002y.

Suppose we assume that the data in each class are generated by a single multivariate Gaussian:

fk(x) =  constant exp ( - 0.5 (x - muk)Tinvcov (x - muk )
So the log-odds for two classes is
log P(C=j | X=x) / P(C=k | X=x)  =  log pij/pik - 0.5 (muj + muk)Tinvcov (muj + muk ) + xTinvcov (muj - muk )
This is an equation linear in x.
 
 

SAME MODEL, DIFFERENT PARAMETER VALUES

We have seen that under LDA, the log odds are also a linear function of x:
log P(C=j | X=x) / P(C=K | X=x)  =  aj0 + ajTx
The two models have the same representation, but different learning algorithms.

Logistic regression learning is discriminative: it just tries to estimate conditional probabilities P(C=j | X=x).

Linear discriminant analysis (LDA) is generative: we try to estimate the parameters of a model that generates X and C.

We set the LDA parameters by maximizing P(X=x,C=c) = P(X = x)P(C=c|X=x) for the training data.

The underlying LDA model is P(X=x) = SUMk pi phi(x,muk ,cov)
 
 

GENERATIVE VERSUS DISCRIMINATIVE

Discriminative modeling makes no assumptions about the distribution of X, hence it is not affected by the data not satisfying the Gaussian assumption, e.g. if there are outliers.

But generative modeling can benefit from unlabeled data.

Rule of thumb:  With 30% more data, discriminative modeling will do as well as generative, if the data satisfy the generative Gaussian assumption.