CSE 291D. Unsupervised Learning

Spring 2011
Professor: Lawrence Saul
Lectures: Tue/Thu 11:00 am - 12:20 pm
Location: CogSci Building, Room 4
Office hours: after class and/or by appointment.
Units: 4

Course Description

The lectures in this course will survey leading algorithms for unsupervised learning and high dimensional data analysis. The first part of the course will cover probabilistic/generative models of high dimensional data, such as Gaussian mixture models, factor analysis, nonnegative matrix factorization, exponential family PCA, probabilistic latent semantic analysis, latent Dirichlet allocation, independent component analysis, and deep neural networks. The second part of the course will cover spectral methods for dimensionality reduction, including multidimensional scaling, Isomap, maximum variance unfolding, locally linear embedding, graph Laplacian methods, spectral clustering, and kernel PCA.

Prerequisites

The course is aimed at graduate students in machine learning and related fields. Students should have earned a high grade in a previous, related course, such as CSE 250A, CSE 250B, ECE 271A, or ECE 271B. The course will be taught by lecture in the same style as CSE 250A, though at a more advanced mathematical level. Enrollment is by permission of the instructor.

Grading

There will be three homework assignments (60-75%) and a final course project (25-40%). Students may also be required to submit "scribe notes" (handwritten or typeset) on a small subset of lectures.

Homework #1 (out Apr 05, due Apr 19)
Homework #2 (out Apr 19, due May 03)
Homework #3 (out May 11, due Jun 09)

Tentative Syllabus

Tue Mar 29	Course overview. Review of clustering: k-means algorithm, Gaussian mixture modeling.
Thu Mar 31	Review of linear dimensionality reduction: principal component analysis, factor analysis.
Tue Apr 05	EM algorithms for factor analysis, principal component analysis, and mixture of factor analyzers.
Thu Apr 07	Nonnegative matrix factorization: cost functions and multiplicative updates.
Tue Apr 12	Nonnegative matrix factorization: auxiliary functions and proofs of convergence.
Thu Apr 14	Exponential family PCA.
Tue Apr 19	Singular value decomposition, low-rank matrix approximations, multidimensional scaling.
Thu Apr 21	Manifold learning, Isomap algorithm.
Tue Apr 26	Nystrom approximation; maximum variance unfolding (MVU).
Thu Apr 28	Spectral clustering, normalized cuts, graph partitioning.
Tue May 03	Laplacian eigenmaps, locally linear embedding (LLE).
Thu May 05	Low rank factorizations for MVU, kernel PCA, class evaluations.
Tue May 10	Document modeling: bag-of-words representation, probabilistic latent semantic indexing, Dirichlet models.
Thu May 12	Latent Dirichlet allocation.
Tue May 17	Variational approximations for inference.
Thu May 19	Independent component analysis: maximum likelihood, contrast functions.
Tue May 24	Fixed point methods; blind source separation.
Thu May 26	Student presentations: Andrew Gross, Edward O'Brien and Chris DeBoever, Rohan Anil, Mohammad Saberian.
Tue May 31	Student presentations: Vineet Kumar, Baris Aksanli, Samyeul Noh and Sunghee Woo, Katherine Ellis, Daryl Lim, He Huang.
Thu Jun 02	Student presentations: Matt Der, Vivek Ramavajjala, Akshay Balsubramani, Ashish Venkat, Elkin Dario Gutierrez.

Readings

Probabilistic PCA

S. Roweis (1998). EM algorithms for PCA and SPCA. In M. I. Jordan, M. S. Kearns, and S. A. Solla (eds.), Advances in Neural Information Processing Systems 10, pages 626-632. MIT Press: Cambridge, MA.

M. E. Tipping and C. M. Bishop (1999). Mixtures of probabilistic principal component analysers. Neural Computation 11(2), 443-482.

Nonnegative matrix factorization

D. D. Lee and H. S. Seung (1999). Learning the parts of objects by nonnegative matrix factorization. Nature 401: 788-791.
D. D. Lee and H. S. Seung (2001). Algorithms for nonnegative matrix factorization. In T. K. Leen, T. G. Dietterich, and V. Tresp (eds.), Advances in Neural Information Processing Systems 13, pages 556-562. MIT Press: Cambridge, MA.

Exponential family PCA

M. Collins, S. Dasgupta, and R. E. Schapire (2002). A generalization of principal component analysis to the exponential family. In T. G. Dietterich, S. Becker, and Z. Ghahramani (eds.), Advances in Neural Information Processing Systems 14, pages 617-624. MIT Press: Cambridge, MA.

I. Rish, G. Grabarnik, G. Cecchi, F. Pereira, and G. J. Gordon (2008). Closed-form supervised dimensionality reduction with generalized linear models. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML-08), pages 832-839. Helsinki, Finland.

Document modeling

T. Hofmann (1999). Probabilistic latent semantic analysis. In Proceedings of the Fifteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-99), pages 289-296. Stockholm, Sweden.

D. Blei, A. Ng, and M. I. Jordan (2003). Latent Dirichlet allocation. Journal of Machine Learning Research 3:993-1022.

Independent component analysis

A. Hyvarinen (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks 10(3):626-634.

L. Molgedey and H.G. Schuster (1994). Separation of a mixture of independent signals using time delayed correlations. Physical Review Letters 72(23): 3634-3637.

Deep architectures

G. E. Hinton and R. R. Salakhutdinov (2006). Reducing the dimensionality of data with neural networks. Science 313: 504-507.

Y. Bengio and Y. LeCun (2007). Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston (eds.), Large-Scale Kernel Machines. MIT Press: Cambridge, MA.

Multidimensional scaling and Nystrom approximation

J. C. Platt (2004). Fast embedding of sparse music similarity graphs. In S. Thrun, L. K. Saul, and B. Schoelkopf (eds.), Advances in Neural Information Processing Systems 16, pages 571-578. MIT Press: Cambridge, MA.
J. C. Platt (2005). FastMap, MetricMap, and landmark MDS are all Nystrom algorithms. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS-05), pages 261-268. Barbados, West Indies.

Isomap and extensions

J. B. Tenenbaum, V. de Silva, and J. C. Langford (2000). A global geometric framework for nonlinear dimensionality reduction. Science 290: 2319-2323.

V. de Silva and J. B. Tenenbaum (2003). Global versus local methods in nonlinear dimensionality reduction. In S. Becker, S. Thrun, and K. Obermayer (eds.), Advances in Neural Information Processing Systems 15, pages 705-712. MIT Press: Cambridge, MA.

Maximum variance unfolding

K. Q. Weinberger and L. K. Saul (2006). An introduction to nonlinear dimensionality reduction by maximum variance unfolding. In Proceedings of the Twenty First National Conference on Artificial Intelligence (AAAI-06), pages 1683-1686. Boston, MA.
J. Sun, S. Boyd, L. Xiao, and P. Diaconis (2006). The fastest mixing Markov process on a graph and a connection to a maximum variance unfolding problem. SIAM Review 48(4):681-699.
K. Q. Weinberger, F. Sha, Q. Zhu, and L. K. Saul (2007). Graph Laplacian regularization for large-scale semidefinite programming. In B. Schoelkopf, J. Platt, and T. Hofmann (eds.), Advances in Neural Information Processing Systems 19, pages 1489-1496. MIT Press: Cambridge, MA.

Spectral clustering

J. Shi and J. Malik (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 22(8): 888-905.
A. Ng, M. Jordan, and Y. Weiss (2002). On spectral clustering: analysis and an algorithm. In T. Dietterich, S. Becker and Z. Ghahramani (eds.), Advances in Neural Information Processing Systems 14, pages 849-856. MIT Press: Cambridge, MA.

Graph Laplacian methods

M. Belkin and P. Niyogi (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15(6): 1373-1396.
M. Belkin and P. Niyogi (2004). Semi-supervised learning on Riemannian manifolds. Machine Learning 56:209-239.

Locally linear embedding and related work

S. T. Roweis and L. K. Saul (2000). Nonlinear dimensionality reduction by locally linear embedding. Science 290: 2323-2326.
L. K. Saul and S. T. Roweis (2003). Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research 4:119-155.
D. Donoho and C. Grimes (2003). Hessian eigenmaps: locally linear embedding techniques for high dimensional data. Proceedings of the National Academy of Sciences 100(10):5591-5596.
Z. Zhang and H. Zha (2004). Principal manifolds and nonlinear dimension reduction via local tangent space alignment. SIAM Journal on Scientific Computing 26:313-338.
Z. Zhang and J. Wang (2007). MLLE: modified locally linear embedding using multiple weights. In B. Schoelkopf, J. Platt, and T. Hofmann (eds.), Advances in Neural Information Processing Systems 19, pages 1593-1600. MIT Press: Cambridge, MA.

Kernel PCA

B. Schoelkopf, A. J. Smola, and K. R. Mueller (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10:1299-1319.
J. Ham, D. D. Lee, S. Mika, and B. Schoelkopf (2004). A kernel view of the dimensionality reduction of manifolds. In Proceedings of the Twenty First International Conference on Machine Learning (ICML-04), pages 369-376. Banff, Canada.
K. Q. Weinberger, F. Sha, and L. K. Saul (2004). Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of the Twenty First International Conference on Machine Learning (ICML-04), pages 839-846. Banff, Canada.