log P(C=j | X=x) / P(C=K | X=x) = bj0 + bjTxNotice there are K-1 linear functions of x. Function number j gives the log-odds (also called log-ratio) of class j against class K.
Bayes rule says that
and the probabilities sum to one.
Usually, there are just two classes and just one linear function of
x.
Then the log-likelihood for a training set of size N is
SUMi yi log p(xi, b) + (1-yi) log (1-p(xi, b))
The likelihood is
SUMi yi bTxi + log (1+exp(bTxi))The score function, i.e. the derivative with respect to the parameters, is
SUMi xi (yi - p(xi,b))Setting this to zero, the first equation we get is SUMi yi = SUMi p(xi,b) because the first component of each x is 1.
For a fixed support of the distribution, the entropy is maximized when p(y) is the same for all y. This means that the outcome is maximally unpredictable, and hence that the information content of an outcome is maximal.
Entropy and variance both measure unpredictability, but entropy does not depend on the actual values of the possible outcomes.
Definition: The Kullback-Liebler (KL)-distance between two distributions Pa and Pb is
H(Pa || Pb) = SUM pa(y) log pa(y)/pb(y)Again, the sum here is over all possible outcomes. If pa(y) = 0, we define pa(y) log pa(y)/pb(y) = 0. If pa(y) > 0 and pb(y) = 0 then H(Pa || Pb) is infinite.
Note that the KL -distance is not symmetric, so it should not be called a distance. An alternative name is relative entropy.
A reasonable distance metric is called the divergence: J(P0 || P1) = H(P0 || P1) + H(P1 || P0).- SUMi yi log p(xi, b)/ yi + (1-yi) log (1-p(xi, b))/(1-yi)We can evaluate the same expression over a test set to evaluate the accuracy of any method for predicting probabilities (in this case plain accuracy, defined as percentage of correct predictions, is not appropriate because predictions are not discrete.) The big disadvantage of the evaluation formula above is that if the method predicts probability exactly zero for the correct label of any test example, then the formula yields infinity. It does not distinguish between one totally false prediction versus many.
P(C=j | X=x) = pij fj(x) / SUMk pik fk(x)This says that if we knew the f_k(x), classification would be easCSE 291 Lecture Notes for November 20, 2002y.
Suppose we assume that the data in each class are generated by a single multivariate Gaussian:
fk(x) = constant exp ( - 0.5 (x - muk)Tinvcov (x - muk )So the log-odds for two classes is
log P(C=j | X=x) / P(C=k | X=x) = log pij/pik - 0.5 (muj + muk)Tinvcov (muj + muk ) + xTinvcov (muj - muk )This is an equation linear in x.
log P(C=j | X=x) / P(C=K | X=x) = aj0 + ajTxThe two models have the same representation, but different learning algorithms.
Logistic regression learning is discriminative: it just tries to estimate conditional probabilities P(C=j | X=x).
Linear discriminant analysis (LDA) is generative: we try to estimate the parameters of a model that generates X and C.
We set the LDA parameters by maximizing P(X=x,C=c) = P(X = x)P(C=c|X=x) for the training data.
The underlying LDA model is P(X=x) = SUMk pik
phi(x,muk ,cov)
But generative modeling can benefit from unlabeled data.
Rule of thumb: With 30% more data, discriminative modeling
will
do as well as generative, if the data satisfy the generative Gaussian
assumption.