Learning Mixture Models

Learning mixture models is a natural formalization of the clustering problem. A mixture model is a collection of distributions D1, ..., Dk together with mixing weights w1, ..., wk. To sample from the mixture, we first draw an index i with probability wi, and then draw a random sample from Di. Thus each cluster in the data corresponds to a distinct distribution Di in the mixture, and the weight wi corresponds to the fraction of points in the dataset that come from cluster i. In the problem of learning mixture models, we are given samples from a mixture model, and the goal is to (a) learn the parameters of the distributions Di and (b) classify each sample according to its source distribution. The most popular algorithms for learning mixture models in practice are the Expectation-Maximization (EM) and k-means algorithms.
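As a concrete illustration of the two-step sampling process just described (not taken from any of the papers below), here is a minimal sketch in Python/NumPy, assuming Gaussian components; sample_mixture is a hypothetical helper name.

    import numpy as np

    def sample_mixture(weights, means, covs, n, rng=None):
        # Draw n points from a Gaussian mixture: first choose a component
        # index i with probability weights[i], then sample from component i.
        # Returns the points and the (normally hidden) component labels.
        rng = np.random.default_rng(rng)
        labels = rng.choice(len(weights), size=n, p=weights)
        points = np.array([rng.multivariate_normal(means[i], covs[i])
                           for i in labels])
        return points, labels

    # Example: a two-component mixture in the plane with weights (0.3, 0.7).
    means = [np.zeros(2), np.array([5.0, 0.0])]
    covs = [np.eye(2), np.eye(2)]
    X, z = sample_mixture([0.3, 0.7], means, covs, n=1000, rng=0)

In the learning problem, only the points X are observed; the labels z are hidden and must be recovered.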

Separation

If the distributions in the mixture are very close together in space, then intuitively the mixture will be hard to learn. To quantify this effect, Dasgupta [Das99] introduced the idea of a separation condition. A separation condition is a promise that every pair of distributions in the mixture is far apart. Given a mixture with a certain separation, the goal of the algorithm designer is to learn the correct clustering of the data.

A long line of work has produced algorithms with smaller and smaller separation requirements. Typically, separation is measured by the minimum distance between the means of any pair of distributions in the mixture, expressed in units of the standard deviation. The relevant parameters are d, the dimensionality of the data; σ, the maximum directional standard deviation of any distribution in the mixture; and σ*, the maximum directional standard deviation in the subspace containing the means.
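One way to compute these quantities from known component parameters is sketched below (Python/NumPy; separation_stats is a hypothetical helper, and treating σ* as the largest directional standard deviation after restricting each covariance to the span of the centered means is an interpretive assumption).

    import numpy as np

    def separation_stats(means, covs):
        # means: (k, d) array of component means; covs: list of (d, d) covariances.
        means = np.asarray(means)
        k, d = means.shape

        # Minimum distance between any pair of component means.
        min_dist = min(np.linalg.norm(means[i] - means[j])
                       for i in range(k) for j in range(i + 1, k))

        # sigma: sqrt of the largest eigenvalue of any component covariance,
        # i.e. the maximum directional standard deviation.
        sigma = max(np.sqrt(np.linalg.eigvalsh(c)[-1]) for c in covs)

        # Orthonormal basis for the span of the centered means (dimension <= k-1).
        centered = means - means.mean(axis=0)
        _, s, Vt = np.linalg.svd(centered, full_matrices=False)
        basis = Vt[s > 1e-8 * s.max()].T                 # shape (d, r)

        # sigma*: maximum directional standard deviation within that subspace.
        sigma_star = max(np.sqrt(np.linalg.eigvalsh(basis.T @ c @ basis)[-1])
                         for c in covs)
        return min_dist, sigma, sigma_star

For example, the [Das99] row of the table below requires the minimum distance between means to be on the order of σ d^{1/2}, while later rows replace d by k or σ by σ*.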

The table below summarizes some of the bounds; a sketch of the PCA-projection step used by several of these algorithms follows the table.
Reference        Separation           Comments                                                Algorithm
[Das99]          σ d^{1/2}            Spherical Gaussians                                     Random projections
[DS00]           σ d^{1/4}            Bounded eccentricity                                    EM
[AK01]           σ d^{1/4}            Includes other conditions, e.g. concentric Gaussians    Distance thresholding
[VW02]           σ k^{1/4}            Spherical Gaussians                                     PCA projections
[KSV05, AM05]    σ / w_min^{1/2}      -                                                       PCA projections
[CR08]           σ* k^{1/2}           Product distributions                                   Canonical correlations
[BV08]           -                    Low Fisher discriminant values                          Isotropic PCA
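Several of the algorithms in the table share a common first step: project the data onto its top-k principal directions, which approximately contain the span of the means, and then cluster in that low-dimensional space. The following is a rough sketch of that idea (Python/NumPy, with a plain Lloyd-style clustering step standing in for the papers' more careful procedures).

    import numpy as np

    def pca_project(X, k):
        # Project the (n, d) data matrix X onto its top-k principal directions.
        Xc = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:k].T                          # (n, k) projected data

    def lloyd_kmeans(Y, k, n_iter=50, rng=0):
        # Plain Lloyd iterations on the projected points Y.
        rng = np.random.default_rng(rng)
        centers = Y[rng.choice(len(Y), size=k, replace=False)]
        for _ in range(n_iter):
            labels = np.argmin(((Y[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
            centers = np.array([Y[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        return labels, centers

Intuitively, projecting first replaces the ambient dimension d with k, which is why the later separation bounds in the table depend on k rather than d.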

Learning Mixtures with No Separation Condition

In the absence of a lower bound on the separation, learning mixture models is considered to be quite hard, particularly when there are more than two clusters. In this setting, one is again given samples from a mixture model, and the goal is to produce a mixture whose KL-divergence to the true mixture is at most ε.
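Since the success criterion here is a divergence between densities rather than a clustering, it can be estimated by Monte Carlo. The sketch below is a rough illustration for Gaussian mixtures (using SciPy's multivariate_normal; mixture_logpdf and kl_estimate are hypothetical helper names): it estimates KL(P || Q) by averaging log p(x) - log q(x) over samples drawn from P.

    import numpy as np
    from scipy.stats import multivariate_normal

    def mixture_logpdf(x, weights, means, covs):
        # Log-density of a Gaussian mixture evaluated at the rows of x.
        comp = np.stack([np.log(w) + multivariate_normal(m, c).logpdf(x)
                         for w, m, c in zip(weights, means, covs)])
        return np.logaddexp.reduce(comp, axis=0)

    def kl_estimate(p, q, n_samples=20000, rng=0):
        # Monte Carlo estimate of KL(P || Q); p and q are
        # (weights, means, covs) triples describing Gaussian mixtures.
        rng = np.random.default_rng(rng)
        weights, means, covs = p
        idx = rng.choice(len(weights), size=n_samples, p=weights)
        x = np.array([rng.multivariate_normal(means[i], covs[i]) for i in idx])
        return np.mean(mixture_logpdf(x, *p) - mixture_logpdf(x, *q))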

For a mixture of two product distributions, Freund and Mansour [FM99] provide an algorithm with running time polynomial in n and 1/ε. For a mixture of k product distributions, Feldman, O'Donnell and Servedio [FOS05] provide an algorithm with running time d^{O(k^3)}.

Sample Requirement

Another aspect of an algorithm for learning mixture models is its sample requirement. [CHRZ07, Cha07] give an algorithm that learns a mixture of two binary product distributions from Ω(d/s^4) samples, where s is the separation. [BCFZ07] gives an SVD-based algorithm that learns a mixture of k binary product distributions from Ω(d/s^2) samples. [L96] shows an information-theoretic lower bound of Ω(1/s^4) samples for any algorithm that learns a mixture of two spherical Gaussians on a line.

[SSR06] experimentally studies the sample requirement of EM for learning Gaussian mixtures. The authors conjecture that the learning problem has three phases, depending on the number of samples: with fewer than about d/s^4 samples, it is information-theoretically hard; with more than about d/s^2 samples, it is computationally easy; and in between, it is computationally hard but easy in an information-theoretic sense.

Learning Mixtures of More General Distributions

[DHKS05] examines the question of learning mixtures of distributions more general than Gaussians, in particular distributions with heavy tails, and provides an algorithm whose running time is exponential in the dimensionality of the data. [CR08b] provides a polynomial-time algorithm for this problem, using ideas from the metric-embedding literature.

Also, [AM05] and [KSV05] show that their results apply to mixtures of a slightly broader class of distributions than Gaussians, including mixtures of log-concave distributions.

EM and k-means

The most common algorithms used in practice for learning mixture models remain k-means and EM. However, not much is known about these algorithms theoretically.
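For concreteness, here is a bare-bones EM loop for a Gaussian mixture (Python/NumPy/SciPy); the initialization and fixed iteration count are simplistic placeholders, not the variants analyzed in the results below. The E-step computes each component's posterior responsibility for every point, and the M-step re-estimates weights, means, and covariances from those responsibilities.

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, k, n_iter=100, rng=0):
        # A bare-bones EM loop for a k-component Gaussian mixture on data X (n, d).
        rng = np.random.default_rng(rng)
        n, d = X.shape

        # Crude initialization: random data points as means, a shared global
        # covariance, and uniform weights.
        means = X[rng.choice(n, size=k, replace=False)].copy()
        covs = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(d) for _ in range(k)])
        weights = np.full(k, 1.0 / k)

        for _ in range(n_iter):
            # E-step: posterior responsibility of each component for each point.
            log_r = np.stack([np.log(weights[j]) +
                              multivariate_normal(means[j], covs[j]).logpdf(X)
                              for j in range(k)], axis=1)      # (n, k)
            log_r -= log_r.max(axis=1, keepdims=True)
            resp = np.exp(log_r)
            resp /= resp.sum(axis=1, keepdims=True)

            # M-step: re-estimate weights, means, and covariances.
            nk = resp.sum(axis=0)                              # (k,)
            weights = nk / n
            means = (resp.T @ X) / nk[:, None]
            for j in range(k):
                diff = X - means[j]
                covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)

        return weights, means, covs, resp

Roughly speaking, k-means corresponds to the hard-assignment limit of this loop, in which each point is assigned entirely to its nearest center and the covariances are ignored.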

[DS00] gives a two-round variant of EM that produces the correct clustering when the separation between the Gaussians is on the order of σ d^{1/4}. [RW84] analyze the likelihood surface around the optimal solution and show that EM has first-order convergence; [XJ96] characterizes conditions under which EM has second-order convergence.

References