DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
UNIVERSITY OF CALIFORNIA, SAN DIEGO

References on data mining and analytics


To keep current with what is happening in the world of data mining, subscribe (free) to the KDnuggets newsletter.

For a business perspective on data mining and analytics, without technical detail, see Competing on Analytics: The New Science of Winning by Thomas H. Davenport and Jeanne G. Harris. For the table of contents see http://www.amazon.com/gp/reader/1422103323/

Machine learning is the name of the principal research area underlying data mining. One of the best undergraduate-level textbooks in this area is Introduction to Machine Learning by Ethem Alpaydin. For the detailed table of contents, see here. This book is recommended for students who have no previous experience with machine learning. Read the relevant sections as the corresponding topics arise in 255. Feel free to ask on Piazza which sections to read, and about anything in the book that is not clear.

The R system, with the RStudio frontend, is recommended for assignments. There are dozens of books available on R; choose one that you like. The Art of R Programming: A Tour of Statistical Software Design is by a good computer science author, Norman Matloff. The interactive data mining environment Rattle is also recommended. Its author has written a good hands-on guide, Data Mining with Rattle and R. Because this book is published by Springer and UCSD has a subscription, its full text is available online from campus IP addresses. For access from off campus, use a VPN.

The best-known graduate-level textbook on machine learning is Pattern Recognition and Machine Learning by Christopher M. Bishop. For the table of contents see http://www.amazon.com/gp/reader/0387310738. The full texts of two good newer books are available free: Introduction to Machine Learning by Alex Smola and S.V.N. Vishwanathan, and Bayesian Reasoning and Machine Learning by David Barber.

Understanding Complex Datasets: Data Mining with Matrix Decompositions by David Skillicorn is a good specialized book on a growing technical subfield, namely matrix methods applied to modeling two-dimensional data. For the table of contents see http://www.amazon.com/gp/reader/1584888326/

It is conventional wisdom that 80% of the effort in a data mining project is devoted to data acquisition and cleaning. A recommended book on this topic is Data Preparation for Data Mining by Dorian Pyle. Here is the table of contents. Unfortunately, this book is out of print, and online sellers now charge inflated prices for it.

Two good and up-to-date books in related areas are also available online: Analyzing Linguistic Data: A practical introduction to statistics by R. H. Baayen, which includes an introduction to R, and Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze.

 

Most recently updated on April 2, 2013 by Charles Elkan, elkan@cs.ucsd.edu