CSE 291 LECTURE NOTES

January 20, 2004
 
 

ANNOUNCEMENTS

Note that Pierre Baldi will give a lecture in the DL series on Thursday this week at 2pm.


EXPONENTIAL FAMILY COMPLETENESS THEOREM

Definition:  A distribution family P_theta on a Euclidean space belongs to the family of exponential families its density function has the form
p_theta(x)  =  C(theta) exp[  Q1(theta)*t1(x) + ... + Qk(theta)*tk(x) ] h(x)
where theta is any collection of parameters and the Q and t functions are real-valued.

Note that by the factorization theorem, the vector  [t1(x), ...,  tk(x)] is sufficient.

The exponential family includes binomial, Gaussian, Poisson, and many other families.  It does not include discrete distributions, or uniform distributions.

Theorem:  The family of distributions of [t1(x), ...,  tk(x)]  is complete if the parameter space Theta contains a k-dimensional rectangle.

Proof:  Omitted.

Notes:  When you define a family of distributions, you have to say not just what the parameters are (e.g. mu and sigma^2) but also what the allowable ranges are for these (e.g. mu > 0, sigma^2 > mu).

To show that a family is exponential, you may have to write theta as a function of the natural parameters, e.g. theta = [-0.5sigma^2, mu/sigma^2]

To use the theorem, you have to find a rectangle in the range of the transformed parameter theta.  When a family of distributions is highly restricted, e.g. we only consider Gaussians with mu = sigma^2, then completeness can fail.


DISCUSSION

What have we achieved?  We know how to recognize sufficient statistics.  Note that the concept of sufficiency refers to theta, not to a particular g(theta).  We can use the same t for many different functions g of the same theta.

We have a proof that many sufficient statistics have a family of distributions that is complete.  In this case, we know how to obtain a minimum-variance unbiased estimator (MVUE).

The remarkable aspect of this argument is that is an algorithm for generating an endless number of optimal learning algorithms.  Each MVUE is the best possible algorithm for learning a particular piece of knowledge from any dataset satisfying the assumption that it is from a populatopn P_theta.

On the other hand, optimality here is very limited.  There may be biased estimators that have better variance, even better men squared error (MSE, i.e. variance plus bias squared).

We also still need to find the expectations for each application.  Exercise/research topic: Can you use numerical methods instead of algebra and calculus to do applications?
 
For the history of the Rao-Blackwell theorem see http://www.mrs.umn.edu/~sungurea/introstat/history/w98/Rao.html.


LIKELIHOOD AND THE SCORE FUNCTION

The function  log p(x,theta)  is called the log-likelihood function.  Its derivative  s(x,theta) = d log p(x,theta) / d theta  is called the score function.

Because we use natural logarithm and d/dx log x = 1/x, the chain rule for derivatives says that

s(x,theta)  =  1/p(x,theta) * d p(x,theta) / d theta
Generally, given x we want to guess theta such that p(x,theta) is high and  d p(x,theta) / d theta = 0, to be at a local maximum for p(x,theta).  Hence for fixed x, the score function says which values of theta are best: the optimum score is zero and any non-zero score is less desirable.
 
 

MEAN AND VARIANCE OF THE SCORE FUNCTION

Lemma:  For any fixed value of theta, the score function has zero mean:

Proof:  By definition E[s(x,theta)]  =  INT_x dx p(x,theta) d/dtheta log p(x,theta).

So  E[s(x,theta)]  =  INT_x dx p(x,theta) 1/p(x,theta) * d/dtheta p(x,theta)
                              =  INT_x dx d/dtheta p(x,theta)  =  d/dtheta INT_x dx p(x,theta)  =  d/dtheta 1  =  0.

Intuitively, the integral of a derivative is the derivative of the integral because the derivative of a sum is the sum of the derivatives.  This equality can fail if the bounds over which we average x are different for different theta, but we won't go into these complications.

Because the score function has zero mean, its variance is just the expected value of its square:

var[s(x,theta)]  =  E_theta [ s(x,theta)^2 ]
Note that the variance, like the mean, is an average over all values of x, given a certain theta.  The mean is always zero but the variance can be different for different theta.
 

CRAMER-RAO LOWER BOUND

Sometimes we can't find an MVUE, but we can find an unbiased estimator.  In this case we'd like to know how good its variance is.  One way to do this is to compare it to some lower bound.  The result we'll see now gives such a lower bound.

Suppose that the score function has small variance, for some theta.  This means that all x have scores close to zero, so whatever the x that we observe, it doesn't provide much information about the value of theta.  Hence every estimator of theta based on x is likely to be bad.

More specifically, the smaller the variance of s(x,theta), the bigger the variance of any unbiased estimator g(x), including the MVUE.

Theorem [Cramer, Rao]:  Suppose the family of distributions P_theta is defined by a density function p(x,theta) where theta is a single real-valued parameter.  Let g(x) be any unbiased estimator of theta.  Then

var_theta[g(x)]  >=  1/ var[s(x,theta)].
Proof:  We start with some properties of g(x).  First, the expectation of g(x) is theta so
INT_x g(x) p(x,theta) dx  =  theta

d/ d theta  INT_x g(x) p(x,theta) dx  =  1

INT_x g(x) d/ d theta  p(x,theta) dx  =  1

The last step above comes from the fact that g(x) is not a function of theta.  It also assumes regularity conditions that we won't go into.  Now using the fact  s(x,theta)  =  d log p(x,theta)/dtheta  =   1/p(x,theta) * d p(x,theta) / d theta
INT_x g(x) * d log p(x,theta)/dtheta * p(x,theta) dx  =  1
which is the expectation of  g(x) * s(x,theta).

We proved above that E[s(x,theta)] = 0.  Consider the definition of the covariance of g(x) and s(x,theta):

cov[g(x), s(x,theta)]  =  E[ (g(x)-theta)*(s(x,theta)-0) ]
                                     =  E[ g(x)*s(x,theta) -  theta*s(x,theta) ]  =  E[ g(x)*s(x,theta) ]  -  0
Using the general result that the covariance squared is less than the product of the variances gives
var[g(x)]*var[s(x,theta)]  >=  cov[g(x), s(x,theta)]^2  =  E[ g(x)*s(x,theta) ]^2  =  1
so var[g(x)]  >=  1/ var[s(x,theta)] as wanted.
 
 



Copyright (c) by Charles Elkan, 2004.