p_theta(x) = C(theta) exp[ Q1(theta)*t1(x) + ... + Qk(theta)*tk(x) ] h(x)
where theta is any collection of parameters and the Q and t functions are real-valued.
Note that by the factorization theorem, the vector [t1(x), ..., tk(x)] is sufficient.
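The exponential family includes binomial, Gaussian, Poisson, and many other families, both discrete and continuous. It does not include every family, however: for example, the uniform distributions on [0, theta] are excluded, because their support depends on theta.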
Theorem: The family of distributions of [t1(x), ..., tk(x)] is complete if the parameter space Theta contains a k-dimensional rectangle.
Proof: Omitted.
Notes: When you define a family of distributions, you have to say not just what the parameters are (e.g. mu and sigma^2) but also what the allowable ranges are for these (e.g. mu > 0, sigma^2 > mu).
To show that a family is exponential, you may have to write a transformed parameter theta as a function of the usual parameters, e.g. theta = [-0.5/sigma^2, mu/sigma^2] for the Gaussian family, with t(x) = [x^2, x].
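As a sanity check on the Gaussian case, the sketch below compares the usual density with the form C(theta) exp[ Q1*t1(x) + Q2*t2(x) ] h(x), taking Q = [-0.5/sigma^2, mu/sigma^2], t(x) = [x^2, x] and h(x) = 1. The function names and the test points are illustrative assumptions, not part of the notes.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Usual Gaussian density with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def gaussian_expfam(x, mu, sigma2):
    """Same density written as C(theta) * exp(Q1*t1(x) + Q2*t2(x)) * h(x)."""
    q1, q2 = -0.5 / sigma2, mu / sigma2   # transformed parameters
    t1, t2 = x * x, x                     # sufficient statistics
    c = math.exp(-mu ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
    h = 1.0
    return c * math.exp(q1 * t1 + q2 * t2) * h

# Largest disagreement between the two forms over a few (x, mu, sigma2) combinations.
max_gap = max(
    abs(gaussian_pdf(x, mu, s2) - gaussian_expfam(x, mu, s2))
    for x in (-2.0, 0.0, 0.7, 3.0)
    for mu, s2 in ((0.0, 1.0), (1.5, 0.5), (-1.0, 4.0))
)
print(max_gap)
```

The two forms should agree up to floating-point rounding, since expanding -(x - mu)^2 / (2 sigma^2) gives exactly the two terms in the exponent plus a constant absorbed into C(theta).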
To use the theorem, you have to find a rectangle in the range of the
transformed parameter theta. When a family of distributions is
highly restricted, e.g. we only consider Gaussians with mu = sigma^2,
then completeness can fail.
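To see how completeness can fail under the restriction mu = sigma^2, consider n i.i.d. observations, for which the sufficient vector is [sum x_i^2, sum x_i]. The sample mean and the unbiased sample variance are both functions of this vector and both have expectation sigma^2, so their difference has expectation zero for every sigma^2 without being identically zero. The Monte Carlo sketch below illustrates this; the sample size, number of replicates, seed and function name are arbitrary assumptions.

```python
import random

def mean_of_f(sigma2, n=5, reps=20000, seed=0):
    """Monte Carlo estimate of E[xbar - S^2] when x_i ~ N(sigma2, sigma2)."""
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    total = 0.0
    for _ in range(reps):
        xs = [rng.gauss(sigma2, sd) for _ in range(n)]
        xbar = sum(xs) / n
        s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)  # unbiased sample variance
        total += xbar - s2  # depends on the data only through sum x and sum x^2
    return total / reps

# The estimates should be close to zero for each value of sigma^2 tried.
estimates = {s2: mean_of_f(s2) for s2 in (0.5, 2.0)}
print(estimates)
```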
The remarkable aspect of this argument is that it is an algorithm for generating an endless number of optimal learning algorithms. Each MVUE is the best possible algorithm for learning a particular piece of knowledge from any dataset satisfying the assumption that it is from a population P_theta.
On the other hand, optimality here is very limited. There may be biased estimators that have lower variance, or even lower mean squared error (MSE, i.e. variance plus bias squared).
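A standard illustration is estimating the variance of a Gaussian with unknown mean. Writing SS = sum (x_i - xbar)^2, the estimator SS/(n-1) is unbiased, while SS/(n+1) is biased but has smaller MSE. The sketch below computes bias, variance and MSE in closed form from the fact that SS/sigma^2 has a chi-squared distribution with n-1 degrees of freedom; the function name and the choice n = 10 are assumptions made for the example.

```python
def variance_estimator_mse(n, divisor, sigma2=1.0):
    """Bias, variance and MSE of SS/divisor for n i.i.d. Gaussian observations,
    using E[SS] = (n-1)*sigma2 and var[SS] = 2*(n-1)*sigma2**2."""
    mean = (n - 1) * sigma2 / divisor
    var = 2 * (n - 1) * sigma2 ** 2 / divisor ** 2
    bias = mean - sigma2
    return bias, var, var + bias ** 2

n = 10
unbiased = variance_estimator_mse(n, n - 1)  # the usual sample variance
shrunk = variance_estimator_mse(n, n + 1)    # biased towards zero
print("divisor n-1:", unbiased)
print("divisor n+1:", shrunk)
```

With sigma^2 = 1 the MSE works out to 2/(n-1) for the unbiased estimator and 2/(n+1) for the biased one, so the biased estimator wins for every n.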
We also still need to find the expectations for each application. Exercise/research topic: Can you use numerical methods instead of algebra and calculus to do applications?
The score function is s(x,theta) = d/dtheta log p(x,theta). Because we use the natural logarithm and d/dx log x = 1/x, the chain rule for derivatives says that
s(x,theta) = 1/p(x,theta) * d p(x,theta) / d theta
Generally, given x we want to guess theta such that p(x,theta) is high and d p(x,theta) / d theta = 0, to be at a local maximum for p(x,theta). Hence for fixed x, the score function says which values of theta are best: the optimum score is zero and any non-zero score is less desirable.
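For a concrete case, take a Gaussian with known variance and unknown mean mu playing the role of theta. The score is then (x - mu)/sigma^2, which is zero exactly when mu = x. The sketch below checks the closed form against a finite-difference derivative of log p; the function names, the observed value x = 1.3 and the step size are illustrative assumptions.

```python
import math

def log_p(x, mu, sigma2=1.0):
    """Log density of a Gaussian with mean mu and known variance sigma2."""
    return -0.5 * math.log(2 * math.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)

def score(x, mu, sigma2=1.0):
    """Closed form of d/dmu log p(x, mu)."""
    return (x - mu) / sigma2

def score_numeric(x, mu, eps=1e-6):
    """Central finite difference of log p in mu (variance fixed at 1)."""
    return (log_p(x, mu + eps) - log_p(x, mu - eps)) / (2 * eps)

x = 1.3
for mu in (0.0, 1.3, 2.0):
    # The score is positive when mu is too small, zero at mu = x, negative beyond.
    print(mu, score(x, mu), score_numeric(x, mu))
```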
Claim: For every theta, E[s(x,theta)] = 0, where the expectation is over x drawn from p(x,theta).
Proof: By definition E[s(x,theta)] = INT_x dx p(x,theta) d/dtheta log p(x,theta). So
E[s(x,theta)] = INT_x dx p(x,theta) * 1/p(x,theta) * d/dtheta p(x,theta)
= INT_x dx d/dtheta p(x,theta)
= d/dtheta INT_x dx p(x,theta)
= d/dtheta 1 = 0.
Intuitively, the integral of a derivative is the derivative of the integral because the derivative of a sum is the sum of the derivatives. This equality can fail if the bounds over which we average x are different for different theta, but we won't go into these complications.
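The zero-mean property is easy to check numerically for a discrete family, where the integral becomes a sum. The sketch below uses the Poisson family, whose score is x/theta - 1, and truncates the sum at a large x. It also records the uniform family on [0, theta] as a case where the property fails. The function names and the truncation point are assumptions made for the example.

```python
import math

def poisson_pmf(x, theta):
    """Poisson probability of x, computed via logs for numerical stability."""
    return math.exp(-theta + x * math.log(theta) - math.lgamma(x + 1))

def poisson_score(x, theta):
    """d/dtheta log p(x, theta) = x/theta - 1 for the Poisson family."""
    return x / theta - 1.0

def expected_score(theta, x_max=200):
    """E_theta[s(x, theta)], with the sum over x truncated at x_max."""
    return sum(poisson_pmf(x, theta) * poisson_score(x, theta)
               for x in range(x_max + 1))

poisson_means = [expected_score(theta) for theta in (0.5, 2.0, 10.0)]
print(poisson_means)  # each entry should be zero up to rounding

def uniform_expected_score(theta):
    """Uniform on [0, theta]: p = 1/theta on the support, so the score is
    -1/theta for every x in the support and its mean is -1/theta, not zero."""
    return -1.0 / theta

print(uniform_expected_score(2.0))
```

In the uniform case the range of x depends on theta, which is exactly the situation where the derivative and the integral cannot be exchanged.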
Because the score function has zero mean, its variance is just the expected value of its square:
var[s(x,theta)] = E_theta [ s(x,theta)^2 ]
Note that the variance, like the mean, is an average over all values of x, given a certain theta. The mean is always zero but the variance can be different for different theta.
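As a small worked case, take a single Bernoulli observation with p(x,theta) = theta^x (1-theta)^(1-x). The score is x/theta - (1-x)/(1-theta), and summing over x = 0, 1 gives E[s^2] = 1/(theta (1-theta)), which does change with theta. The sketch below computes both moments exactly; the function names and the values of theta are illustrative assumptions.

```python
def bernoulli_score(x, theta):
    """d/dtheta log p(x, theta) for p = theta**x * (1 - theta)**(1 - x)."""
    return x / theta - (1 - x) / (1 - theta)

def score_moments(theta):
    """Mean and second moment of the score, summing over x in {0, 1}."""
    probs = {0: 1 - theta, 1: theta}
    mean = sum(p * bernoulli_score(x, theta) for x, p in probs.items())
    second = sum(p * bernoulli_score(x, theta) ** 2 for x, p in probs.items())
    return mean, second

for theta in (0.1, 0.5, 0.9):
    mean, second = score_moments(theta)
    # Columns: theta, E[s], E[s^2], closed form 1/(theta*(1-theta))
    print(theta, mean, second, 1 / (theta * (1 - theta)))
```

The variance of the score is smallest at theta = 0.5 and grows as theta approaches 0 or 1.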
Suppose that the score function has small variance, for some theta. This means that the x values we are likely to observe all have scores close to zero, so whatever the x that we observe, it doesn't provide much information about the value of theta. Hence every estimator of theta based on x is likely to be bad.
More specifically, the smaller the variance of s(x,theta), the larger the lower bound on the variance of any unbiased estimator g(x), including the MVUE.
Theorem [Cramer, Rao]: Suppose the family of distributions P_theta is defined by a density function p(x,theta) where theta is a single real-valued parameter. Let g(x) be any unbiased estimator of theta. Then
var_theta[g(x)] >= 1/ var[s(x,theta)].
Proof: We start with some properties of g(x). First, the expectation of g(x) is theta so
INT_x g(x) p(x,theta) dx = theta
Differentiating both sides with respect to theta gives
d/ d theta INT_x g(x) p(x,theta) dx = 1
INT_x g(x) d/ d theta p(x,theta) dx = 1
The last step above comes from the fact that g(x) is not a function of theta. It also assumes regularity conditions that we won't go into. Now using the fact s(x,theta) = d log p(x,theta)/dtheta = 1/p(x,theta) * d p(x,theta) / d theta, we get
INT_x g(x) * d log p(x,theta)/dtheta * p(x,theta) dx = 1
which is the expectation of g(x) * s(x,theta).
We proved above that E[s(x,theta)] = 0. Consider the definition of the covariance of g(x) and s(x,theta):
cov[g(x), s(x,theta)] = E[ (g(x)-theta)*(s(x,theta)-0) ]
= E[ g(x)*s(x,theta) - theta*s(x,theta) ] = E[ g(x)*s(x,theta) ] - 0
Using the general result that the covariance squared is at most the product of the variances gives
var[g(x)]*var[s(x,theta)] >= cov[g(x), s(x,theta)]^2 = E[ g(x)*s(x,theta) ]^2 = 1
so var[g(x)] >= 1/ var[s(x,theta)] as wanted.
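The steps of the proof can be checked numerically on a family where everything is computable. The sketch below takes a single Poisson observation with the unbiased estimator g(x) = x, and evaluates var[g(x)], 1/var[s(x,theta)] and E[g(x)*s(x,theta)] by truncated sums. The function names, the truncation point and the values of theta are assumptions made for the example.

```python
import math

def poisson_pmf(x, theta):
    """Poisson probability of x, computed via logs for numerical stability."""
    return math.exp(-theta + x * math.log(theta) - math.lgamma(x + 1))

def expect(f, theta, x_max=200):
    """E_theta[f(x)] for Poisson x, with the sum truncated at x_max."""
    return sum(poisson_pmf(x, theta) * f(x) for x in range(x_max + 1))

def cramer_rao_check(theta):
    """Return (var[g], 1/var[s], E[g*s]) for the estimator g(x) = x."""
    s = lambda x: x / theta - 1.0      # Poisson score
    g = lambda x: float(x)             # unbiased estimator of theta
    mean_g = expect(g, theta)
    var_g = expect(lambda x: (g(x) - mean_g) ** 2, theta)
    var_s = expect(lambda x: s(x) ** 2, theta)  # the score has mean zero
    e_gs = expect(lambda x: g(x) * s(x), theta)
    return var_g, 1.0 / var_s, e_gs

for theta in (0.5, 3.0, 10.0):
    print(theta, cramer_rao_check(theta))
```

In this case E[g(x)*s(x,theta)] comes out as 1, as in the proof, and var[g(x)] equals the bound 1/var[s(x,theta)] = theta, so the inequality holds with equality.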