Learning Mixture Models
Learning mixture models is a natural formalization of the clustering
problem. A mixture model is a collection of distributions D1, ..., Dk
together with weights w1, ..., wk. To sample from a mixture model, we first
draw an index i with probability wi, and then draw a random sample from Di.
Thus, each cluster in the data corresponds to a different distribution Di in
the mixture, and each weight wi corresponds to the fraction of points in the
entire dataset that come from that cluster.
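As a concrete illustration, here is a minimal sketch of this sampling procedure in Python/numpy for a mixture of spherical Gaussians; the particular weights, means, and variance are arbitrary values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixture of k = 2 spherical Gaussians in d = 3 dimensions.
weights = np.array([0.3, 0.7])            # w1, ..., wk, summing to 1
means = np.array([[0.0, 0.0, 0.0],
                  [5.0, 5.0, 5.0]])       # mean of each Di
sigma = 1.0                               # common directional standard deviation

def sample_mixture(n):
    """Draw n samples: pick component i with probability wi, then sample from Di."""
    labels = rng.choice(len(weights), size=n, p=weights)
    points = means[labels] + sigma * rng.standard_normal((n, means.shape[1]))
    return points, labels

X, true_labels = sample_mixture(1000)
```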
In the problem of learning mixture models, we are given samples from a
mixture model, and our goal is to (a) learn the parameters of the
distributions Di and (b) classify each sample according to its source
distribution.
The most popular algorithms for learning mixture models in practice are the
Expectation-Maximization (EM) and k-means algorithms.
Separation
If the distributions in the mixture are very close together in space, then
intuitively the mixture will be hard to learn. To quantify this effect,
Dasgupta [Das99] introduced the idea of a separation condition. A separation
condition is a promise ensuring that every pair of distributions in the
mixture is far apart. Given a mixture with a certain separation, the goal of
the algorithm designer is to learn the correct clustering of the data.
A long line of literature has produced algorithms with smaller and smaller
separation requirements. Typically, separation is measured by the minimum
distance between the means of any pair of distributions in the mixture,
expressed in units of a standard deviation. The relevant parameters are d,
the dimensionality of the data; σ, the maximum directional standard
deviation; and σ*, the maximum directional standard deviation in the
subspace containing the means.
The table below summarizes some of the bounds.
| Reference | Separation | Comments | Algorithm |
|---|---|---|---|
| [Das99] | σ d^(1/2) | Spherical Gaussians | Random Projections |
| [DS00] | σ d^(1/4) | Bounded eccentricity | EM |
| [AK01] | σ d^(1/4) | Includes other conditions, such as concentric Gaussians | Distance thresholding |
| [VW02] | σ k^(1/4) | Spherical Gaussians | PCA Projections |
| [KSV05, AM05] | σ / wmin^(1/2) | | PCA Projections |
| [CR08] | σ* k^(1/2) | Product Distributions | Canonical Correlations |
| [BV08] | Low Fisher discriminant values | | Isotropic PCA |
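To make the separation measure in the table concrete, the following sketch computes the minimum pairwise distance between component means in units of the maximum directional standard deviation; it assumes spherical components, so that σ is simply the largest per-component standard deviation, and the example values are arbitrary.

```python
import numpy as np
from itertools import combinations

def separation(means, sigmas):
    """Minimum pairwise distance between component means, in units of the
    maximum directional standard deviation (spherical components assumed)."""
    sigma = max(sigmas)
    gaps = [np.linalg.norm(mu_i - mu_j)
            for mu_i, mu_j in combinations(means, 2)]
    return min(gaps) / sigma

# Example: two components whose means are sqrt(d) standard deviations apart,
# the regime handled by [Das99] for spherical Gaussians.
d = 100
means = [np.zeros(d), np.ones(d)]            # distance between means = sqrt(d) = 10
print(separation(means, sigmas=[1.0, 1.0]))  # prints 10.0 = d**0.5
```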
Learning Mixtures with No Separation Condition
In the absence of a lower bound on the separation, the problem of learning
mixture models is considered to be quite hard, particularly when there are
more than two clusters. In this case, one is again given samples from a
mixture model, and the goal is to produce a mixture which has KL-divergence
at most ε to the true mixture.
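Since mixture densities have no closed-form KL-divergence, one simple way to check this objective empirically is a Monte Carlo estimate: draw samples from the true mixture and average the log-density ratio. The sketch below does this for Gaussian mixtures; the particular parameters for the "true" and "learned" mixtures are arbitrary stand-ins.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_pdf(x, weights, means, covs):
    """Density of a Gaussian mixture evaluated at the rows of x."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

def kl_estimate(samples, true_params, learned_params):
    """Monte Carlo estimate of KL(true || learned) from samples of the true mixture."""
    p = mixture_pdf(samples, *true_params)
    q = mixture_pdf(samples, *learned_params)
    return np.mean(np.log(p) - np.log(q))

# Illustrative (weights, means, covariances) for a true and a learned mixture.
true_params = ([0.5, 0.5], [np.zeros(2), 3 * np.ones(2)], [np.eye(2), np.eye(2)])
learned_params = ([0.45, 0.55], [0.1 * np.ones(2), 2.9 * np.ones(2)], [np.eye(2), np.eye(2)])

rng = np.random.default_rng(0)
labels = rng.choice(2, size=5000, p=true_params[0])
samples = np.array([rng.multivariate_normal(true_params[1][i], true_params[2][i])
                    for i in labels])
print(kl_estimate(samples, true_params, learned_params))  # small when the learned mixture is close
```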
For a mixture of two product distributions, Freund and Mansour [FM99]
provide an algorithm with running time polynomial in n and 1/ε.
For a mixture of k product distributions, Feldman, O'Donnell and Servedio
[FOS05] provide an algorithm that produces such a mixture with a running
time of d^(O(k^3/ε^3)).
Sample Requirement
Another aspect of an algorithm that learns mixture models is its sample
requirement. [CHRZ07, Cha07] show an algorithm which learns a mixture of
two binary product distributions given Ω(d/s^4) samples, where s is the
separation. [BCFZ07] show an SVD-based algorithm that learns a mixture of
k binary product distributions given Ω(d/s^2) samples from the mixture.
[L96] shows an information-theoretic lower bound of Ω(1/s^4) on the sample
requirement of any algorithm for learning a mixture of two spherical
Gaussians on a line.
[SSR06] experimentally studies the sample requirement of EM for learning
Gaussian mixtures. They conjecture that learning mixtures has three phases,
depending on the number of samples: with fewer than about d/s^4 samples,
the problem is information-theoretically hard; with more than about d/s^2
samples, it is computationally easy; and in between, it is computationally
hard but easy in an information-theoretic sense.
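A minimal version of this kind of experiment can be sketched with an off-the-shelf EM implementation (here scikit-learn's GaussianMixture, used only as a stand-in for the setup in [SSR06]): generate a mixture at a fixed separation, fit it with increasing sample sizes, and record the clustering error. The dimension, separation, and sample sizes below are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
d, s = 50, 2.0                                     # dimension; separation in units of sigma
means = np.stack([np.zeros(d),
                  (s / np.sqrt(d)) * np.ones(d)])  # ||mu_1 - mu_2|| = s

def clustering_error(n):
    """Fraction of points misclassified by EM when fit on n samples."""
    labels = rng.choice(2, size=n)                 # equal-weight mixture
    X = means[labels] + rng.standard_normal((n, d))
    pred = GaussianMixture(n_components=2, n_init=5, random_state=0).fit_predict(X)
    err = np.mean(pred != labels)
    return min(err, 1 - err)                       # labels are identifiable only up to permutation

for n in [100, 1000, 10000]:
    print(n, clustering_error(n))
```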
Learning Mixtures of More General Distributions
[DHKS05] examines the question of learning mixtures of more general
distributions than Gaussians, particularly distributions with heavy tails,
and provides an algorithm for this problem which runs in time exponential
in the dimensionality of the data. [CR08b] provides a polynomial-time
algorithm for this problem, using ideas from the metric-embedding
literature.
Also, [AM05] and [KSV05] show that their results apply to mixtures of a
slightly broader class of distributions than Gaussians, including mixtures
of log-concave distributions.
EM and k-means
The most common algorithms used in practice for learning mixture models
remain k-means and EM. However, not much is known about these
algorithms theoretically.
[DS00] shows a two-round variant of EM that produces the correct clustering
when the separation between the Gaussians is Ω(σ d^(1/4)).
[RW84] analyze the likelihood surface around the optimal solution and show
that EM has first-order convergence; [XJ96] characterize some conditions
under which EM has second-order convergence.
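For reference, a minimal EM loop for a mixture of spherical Gaussians with a known, common variance looks like the sketch below; this is the textbook update, not the two-round variant of [DS00], and the initialization at random data points is just one simple choice.

```python
import numpy as np

def em_spherical(X, k, n_iter=100, sigma2=1.0, seed=0):
    """Basic EM for a mixture of k spherical Gaussians with known variance sigma2.
    Returns estimated weights, means, and soft assignments (responsibilities)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, size=k, replace=False)]    # initialize at random data points
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility r[i, j] proportional to wj * exp(-||xi - muj||^2 / (2 sigma2)).
        sq_dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        log_r = np.log(weights) - sq_dists / (2 * sigma2)
        log_r -= log_r.max(axis=1, keepdims=True)      # subtract row max for numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and means from the responsibilities.
        weights = r.mean(axis=0)
        means = (r.T @ X) / r.sum(axis=0)[:, None]
    return weights, means, r
```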
References
- [RW84] R. Redner, H. Walker, "Mixture densities, maximum likelihood
and the EM algorithm", SIAM Review, 1984.
- [L96] B.G. Lindsay, "Mixture Models: Theory, Geometry and
Applications", IMS Publishers, 1996.
- [XJ96] L. Xu, M.I. Jordan, "On Convergence Properties of the EM
Algorithm for Gaussian Mixtures", Neural Computation, 1996.
- [Das99] S. Dasgupta, "Learning Mixtures of Gaussians", Proc. of
Foundations of Comp. Sci, 1999.
- [FM99] Y. Freund, Y. Mansour, "Estimating a Mixture of Two Product
Distributions", In Proc. of Conf. on Learning Theory, 1999.
- [DS00] S. Dasgupta, L. Schulman, "A two-round variant of EM for
Gaussian Mixtures", Proc. of Uncertainty in Artificial Intelligence, 2000.
- [AK01] S. Arora, R. Kannan, "Learning Mixtures of Arbitrary
Gaussians", In Proc. of Symp. on Theory of Computing, 2001.
- [VW02] S. Vempala, G. Wang, "A Spectral Algorithm for Learning
Mixtures of Distributions", In Proc. of Foundations of Comp. Sci, 2002.
- [FOS05] J. Feldman, R. O'Donnell, R. Servedio, "Learning Mixtures of
Product Distributions over Discrete Domains", In Proc. of Foundations of
Comp. Sci, 2005.
- [KSV05] R. Kannan, H. Salmasian, S. Vempala, "The Spectral Method
for General Mixture Models", In Proc. of Conf. on Learning Theory, 2005.
- [AM05] D. Achlioptas, F. McSherry, "On Spectral Learning of Mixtures
of Distributions", In Proc. of Conf. on Learning Theory, 2005.
- [DHKS05] A. Dasgupta, J. Hopcroft, J. Kleinberg, M. Sandler, "On
Learning Mixtures of Heavy-Tailed Distributions", In Proc. of Foundations
of Comp. Sci, 2005.
- [SSR06] N. Srebro, G. Shakhnarovich, S. Roweis, "An Investigation of
the Computational and Informational Limits in Gaussian Mixture
Clustering", In Proc. of Intl. Conf. on Machine Learning, 2006.
- [CHRZ07] K. Chaudhuri, E. Halperin, S. Rao, S. Zhou, "A Rigorous
Analysis of Population Stratification with Limited Data", In Proc. of
Symp. on Discrete Algorithms, 2007.
- [Cha07] K. Chaudhuri, "Learning Mixtures of Distributions", Ph.D.
Thesis, EECS Department, Univ. of California, Berkeley, 2007.
- [BCFZ07] A. Blum, A. Coja-Oghlan, A. Frieze, S. Zhou, "Separating
Populations with Wide Data: A Spectral Analysis", In Proc. of ISAAC,
2007.
- [CR08] K. Chaudhuri, S. Rao, "Learning Mixtures of Product
Distributions using Correlations and Independence", In Proc. of Conf. on
Learning Theory, 2008.
- [CR08b] K. Chaudhuri, S. Rao, "Beyond Gaussians: Spectral Methods
for Learning Mixtures of Heavy-Tailed Distributions", In Proc. of Conf.
on Learning Theory, 2008.
- [BV08] S.C. Brubaker, S. Vempala, "Isotropic PCA and
Affine-Invariant Clustering", In Proc. of Foundations of Comp. Sci, 2008.