CSE 291 LECTURE NOTES

January 8, 2004
 
 

MEAN SQUARED ERROR (MSE)

Definition: The mean squared error (MSE) of an estimator g hat is
E_theta [(g hat(x) - g(theta))^2]

Example: Suppose x = (x1 ... xn) is an iid sample from a univariate normal distribution with parameter theta = (mu, sigma^2).  The obvious estimator for mu is x bar.  What is the MSE of x bar?  Answer: ...

A desirable property would be minimum mean squared error: For every theta and every other estimator g bar, we want
E_theta [(g hat(x) - g(theta))^2]  <=  E_theta [(g bar(x) - g(theta))^2]
This is not achievable in general.  Consider the estimator g bar(x) = g(theta_0) regardless of x.  Although this is a bad estimator in general, it has zero error for theta = theta_0.  So g hat would have to have zero error for theta_0.  Since theta_0 is arbitrary, g hat would have to have zero error for all theta.  (Analogy: A stopped clock is perfectly accurate twice a day.)
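
Numerical illustration: The stopped-clock argument can be checked by simulation.  The sketch below is only an illustration under assumed values (normal samples with sigma = 1, n = 10, and theta_0 taken to be mu = 0).  The function names mse, sample_mean and stopped_clock are made up for this example.  It estimates the MSE of x bar and of the constant estimator at two values of mu.

```python
import random

MU_0 = 0.0  # assumed theta_0: the value the "stopped clock" always guesses


def sample_mean(x):
    return sum(x) / len(x)


def stopped_clock(x):
    return MU_0  # ignores the data entirely


def mse(estimator, mu, sigma, n, trials, rng):
    """Monte Carlo estimate of E[(estimator(x) - mu)^2] for an iid
    N(mu, sigma^2) sample x of size n."""
    total = 0.0
    for _ in range(trials):
        x = [rng.gauss(mu, sigma) for _ in range(n)]
        total += (estimator(x) - mu) ** 2
    return total / trials


rng = random.Random(0)
results = {}
for mu in (0.0, 1.0):
    results[mu] = (
        mse(sample_mean, mu, 1.0, 10, 5000, rng),
        mse(stopped_clock, mu, 1.0, 10, 5000, rng),
    )
    print("mu =", mu, " MSE(x bar) ~", results[mu][0],
          " MSE(stopped clock) =", results[mu][1])
```

At mu = MU_0 the constant estimator has exactly zero error, so no estimator with positive error there can beat it.  Away from MU_0 its error is the squared distance to the true mu, however large the sample.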
 
 

UNBIASEDNESS

Traditionally in machine learning the word bias means any sort of prior knowledge, e.g. the belief that survival rates are non-increasing.  In mathematical statistics the word has a different, narrower meaning.  (Both meanings are quite different from the ordinary meaning "unjustified prejudice.")

Definition: The estimator g hat is unbiased if E_theta [g hat(x)] = g(theta) for all theta.

Example continued: Suppose x = (x1 ... xn) is an iid sample from a univariate normal distribution with parameter theta = (mu, sigma^2).  The obvious estimators for mu and sigma^2 are x bar and s^2 = (1/n) SUM (xi - x bar)^2.

It can be computed that x bar is unbiased, but s^2 is not: E[x bar] = mu and E[s^2] =/= sigma^2.
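
Numerical illustration: The claim can be checked empirically before doing the computation.  The sketch below assumes arbitrary illustrative values (mu = 2, sigma = 3, n = 5).  It averages x bar and s^2 over many simulated samples, using the 1/n definition of s^2 given above.  It is a sanity check, not a derivation.

```python
import random

rng = random.Random(1)
mu, sigma, n, trials = 2.0, 3.0, 5, 50000  # assumed illustrative values

sum_xbar = 0.0
sum_s2 = 0.0
for _ in range(trials):
    x = [rng.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(x) / n
    s2 = sum((xi - xbar) ** 2 for xi in x) / n  # divides by n, as in the notes
    sum_xbar += xbar
    sum_s2 += s2

mean_xbar = sum_xbar / trials  # approximates E[x bar]
mean_s2 = sum_s2 / trials      # approximates E[s^2]
print("average of x bar:", mean_xbar, " true mu:", mu)
print("average of s^2:  ", mean_s2, " true sigma^2:", sigma ** 2)
```

The average of x bar should settle near mu, while the average of s^2 should settle visibly away from sigma^2.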

Exercise: Compute E[s^2].  This is related to Question 2 on the current assignment.

 

THE CONCEPT OF SUFFICIENCY

Often, many aspects of a sample x provide no information about an unknown parameter theta.

Example: Suppose x = (x1 ... xn) is the result of n independent binomial trials.  Intuitively, the order of the 1s and 0s is irrelevant, and the sum SUM x_i captures all available information about the probability theta of success.

Note: We assume without question that the trials are independent.  Information other than the sum would be relevant if we wanted to check this assumption! 

Definition:  A statistic t is any function of the sample x.  Intuitively, a statistic is a summary of the observed data.

Intuitively, a statistic is sufficient if it preserves all information from x that is relevant for estimating theta.

The function x |-> SUM x_i is a statistic.  We shall prove that this statistic is sufficient.

 

FORMALIZING SUFFICIENCY

Let the family of subsets {A} be a partition of the sample space X.  For each A we have a probability distribution restricted to A: P_theta(x|x in A).  Assume that for every A this distribution is the same regardless of theta.

 Suppose we cannot observe x directly, but just that x belongs to the set A.  Clearly this information is relevant for estimating theta.

Now suppose we discover exactly which x in A was the outcome.  This extra information does not help us refine our estimate of the value of theta.

Example continued: Suppose x = (x1 ... xn) in X = {0,1}^n.  Partition X into {A0 ... An} where Ak = {x: SUM xi = k}.

Now P_theta(x|Ak) = 1/(n choose k) if x in Ak and zero if x not in Ak, for any theta.
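
Numerical illustration: For small n the conditional distribution can be computed by brute-force enumeration.  The sketch below uses exact rational arithmetic.  The helper names p and conditional_given_sum, and the choices n = 4, k = 2, are assumptions made for the example.  It checks that P_theta(x|Ak) equals 1/(n choose k) for several values of theta.

```python
from fractions import Fraction
from itertools import product
from math import comb


def p(x, theta):
    """Probability of the binary sequence x under independent trials
    with success probability theta."""
    z = sum(x)
    return theta ** z * (1 - theta) ** (len(x) - z)


def conditional_given_sum(n, k, theta):
    """Return {x: P_theta(x | A_k)} for every x in A_k = {x : SUM xi = k}."""
    block = [x for x in product((0, 1), repeat=n) if sum(x) == k]
    total = sum(p(x, theta) for x in block)  # P_theta(A_k)
    return {x: p(x, theta) / total for x in block}


n, k = 4, 2
for theta in (Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)):
    cond = conditional_given_sum(n, k, theta)
    # Every sequence in A_k has the same conditional probability,
    # and that probability does not involve theta.
    assert all(v == Fraction(1, comb(n, k)) for v in cond.values())
    print("theta =", theta, " conditional probabilities:", set(cond.values()))
```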

Definition: The partition {A} of X is sufficient for the family P_theta if for every set A in the partition, P_theta(x|A) is the same for all theta.

The partition {A} is minimal sufficient if it is sufficient and it is the coarsest such partition: every set of every other sufficient partition is contained in one of its sets.
 
 

HOW TO ACHIEVE BEST-POSSIBLE SUFFICIENCY

Write x ~ x' iff p_theta(x)/p_theta(x') is the same for all theta.  The equivalence classes of ~ define a partition of X.

Lemma:  This partition is minimal sufficient (under certain natural conditions).

Example: For x being the outcome of n independent binomial trials (i.e. a binary sequence of length n), p_theta(x) = theta^z (1-theta)^(n - z) where z = sum xi.

We have p_theta(x)/p_theta(x') = theta^(z-z') (1-theta)^(-z+z').  This ratio is the same for all theta if and only if z = z'.  Hence the partition based on SUM xi is minimal sufficient.
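
Numerical illustration: The equivalence classes of ~ can be built mechanically for small n.  The sketch below tests whether the likelihood ratio is constant on a small, arbitrarily chosen grid of theta values.  A finite grid is a check, not a proof.  The names THETAS, equivalent and ratio_partition are made up for this example.

```python
from fractions import Fraction
from itertools import product

# Assumed grid of parameter values on which the ratio is compared.
THETAS = [Fraction(1, 5), Fraction(1, 2), Fraction(3, 4)]


def p(x, theta):
    """p_theta(x) = theta^z (1 - theta)^(n - z) with z = SUM xi."""
    z = sum(x)
    return theta ** z * (1 - theta) ** (len(x) - z)


def equivalent(x, y):
    """x ~ y iff p_theta(x) / p_theta(y) takes a single value on the grid."""
    ratios = {p(x, th) / p(y, th) for th in THETAS}
    return len(ratios) == 1


def ratio_partition(n):
    """Group all binary sequences of length n into classes of ~."""
    classes = []
    for x in product((0, 1), repeat=n):
        for c in classes:
            if equivalent(x, c[0]):
                c.append(x)
                break
        else:
            classes.append([x])
    return classes


classes = ratio_partition(3)
for c in classes:
    print("sums in class:", sorted({sum(x) for x in c}), " members:", c)
```

Each class found this way contains exactly the sequences with one value of SUM xi, so the classes coincide with the sets Ak.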
 
 

SUFFICIENT STATISTICS

Definition (again): A statistic is any function X -> Y for any range Y.  Often Y = R (the real numbers) but not always.

Any statistic t generates a partition of X based on the equivalence relation x ~ x' iff t(x) = t(x').

Definition:  The statistic t is (minimal) sufficient for P_theta if this partition is (minimal) sufficient.

A minimal sufficient statistic is a function of every other sufficient statistic, i.e. it retains no more information than any of these.

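Note that minimal sufficient statistics are never unique: any one-to-one function of a minimal sufficient statistic is also minimal sufficient, since it generates the same partition.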
 
 

THE RAO-BLACKWELL THEOREM

Minimum mean squared error (MSE) is unachievable, but suppose we restrict attention to unbiased estimators.  Does there exist one of these with minimum MSE?

Note that if E_theta [g hat(x)] = g(theta) then E_theta [(g hat(x) - g(theta))^2] is the variance var_theta(g hat).  Therefore we are looking for minimum variance unbiased estimators: MVUEs.

Theorem: Let P_theta be a family of distributions on a sample space X, where theta in Theta.  Suppose g tilde: X -> R is any unbiased estimator of g: Theta -> R.  Let t be a statistic that is sufficient for theta.  Then g hat(t) = E_theta[ g tilde | t], which does not depend on theta because t is sufficient, is an unbiased estimator for g with variance less than or equal to that of g tilde.

Intuition: If we average over all possible observations x' that have the same value for the sufficient statistic t, then we reduce the variance of the estimator.
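
Numerical illustration: The theorem can be seen at work in the binomial-trials example.  The sketch below is an assumed worked case, not taken from the lecture.  It starts from the crude unbiased estimator g tilde(x) = x1 of g(theta) = theta, conditions on t = SUM xi by exact enumeration, and compares means and variances.  The function name rao_blackwell_demo and the values n = 4, theta = 1/3 are choices made for the example.

```python
from fractions import Fraction
from itertools import product


def p(x, theta):
    """Probability of the binary sequence x under independent trials."""
    z = sum(x)
    return theta ** z * (1 - theta) ** (len(x) - z)


def rao_blackwell_demo(n, theta):
    """Return (g_hat, (mean, var) of g tilde, (mean, var) of g hat),
    where g tilde(x) = x1 and g_hat[t] = E[g tilde | SUM xi = t]."""
    xs = list(product((0, 1), repeat=n))

    def g_tilde(x):
        return x[0]  # uses only the first trial

    # Conditional expectation of g tilde within each block {x : SUM xi = t}.
    g_hat = {}
    for t in range(n + 1):
        block = [x for x in xs if sum(x) == t]
        total = sum(p(x, theta) for x in block)
        g_hat[t] = sum(g_tilde(x) * p(x, theta) for x in block) / total

    def moments(f):
        mean = sum(f(x) * p(x, theta) for x in xs)
        var = sum((f(x) - mean) ** 2 * p(x, theta) for x in xs)
        return mean, var

    return g_hat, moments(g_tilde), moments(lambda x: g_hat[sum(x)])


theta = Fraction(1, 3)
g_hat, (mean_tilde, var_tilde), (mean_hat, var_hat) = rao_blackwell_demo(4, theta)
print("g hat as a function of t:", g_hat)
print("g tilde: mean", mean_tilde, " variance", var_tilde)
print("g hat:   mean", mean_hat, " variance", var_hat)
```

Both estimators have mean theta, but averaging x1 over all sequences with the same sum replaces it by the sample proportion t/n, whose variance is smaller.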

 



Copyright (c) by Charles Elkan, 2004.