DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
UNIVERSITY OF CALIFORNIA, SAN DIEGO


CSE 255: Data mining and predictive analytics

Spring 2013


For the version of CSE 255 taught in Winter 2015 by Prof. Julian McAuley, see http://cseweb.ucsd.edu/~jmcauley/cse255/.




CSE 255 in the spring quarter of 2013 was a graduate-level lecture course devoted to current methods for data mining and predictive analytics. No previous background in machine learning is necessary, but all participants should be comfortable with programming and with following arguments that use calculus and linear algebra. Although the course will use some mathematics, the focus will be on useful skills more than on theoretical foundations. Lecture notes in the file dm.pdf are updated every few days. Please ask questions on Piazza.
 
The course meets once a week on Tuesday evenings for ten weeks. Meetings will be from 6:30pm to 9pm in room 2111 of Warren Lecture Hall (near the CSE building), beginning on Tuesday April 2. The last lecture will be on Tuesday June 4. The final exam will be on Tuesday June 11 at 7pm.

Note that CSE 255 used to be numbered CSE 291. To register for the course, use section id 779310.

| date | topics | lecture notes | handouts | quiz | assignment |
|---|---|---|---|---|---|
| April 2 | Course outline, supervised learning, overfitting | Chapters 1, 2 | FT article | Sample on page 11 | p. 22 |
| April 9 | Linear regression, preprocessing, missing data, regularization | Chapters 3, 6 | | Quiz 1 with solution | p. 29 |
| April 16 | Linear and nonlinear support vector machines | Chapter 5 | | Quiz 2 with solution | p. 48 |
| April 23 | Learning when one class is rare, F1 and AUC scores | Chapter 7 | | Quiz 3 with solution | p. 65 |
| April 30 | Estimating calibrated probabilities, making cost-sensitive decisions | Chapters 8, 9 | | Quiz 4 with solution | p. 87 |
| May 7 | Sample selection bias, importance weighting, reject inference | Chapter 10 | | Quiz 5 with solution | p. 99 |
| May 14 | Recommender systems, collaborative filtering, matrix factorization via alternating least squares | Chapter 11 | | Quiz 6 with solution | p. 111 |
| May 21 | Text mining: bag of words representation, classifier learning | Chapter 12 | | Quiz 7 with solution | p. 125 |
| May 28 | Network analytics, link prediction, singular value decomposition (SVD) | Chapter 14 and Section 13.1 | | Quiz 8 with solution | p. 145 |
| June 4 | Guest lecture by Dr. Ramon Huerta | | | Quiz 9 with solution | |
| June 11 | Final exam at 7pm | | | | |

For the 2012 version of the course, see http://cseweb.ucsd.edu/users/elkan/291spring2012. Topics for 2013 will include most of the following:
The course will not follow a textbook, and no specific book is required. See here for some suggested books.

Enrollment is unrestricted. Everyone is welcome. Community members will register through UCSD Extension concurrent enrollment. Participants may take the course for a letter grade (recommended), or for a satisfactory/unsatisfactory (S/U) grade. All participants should register for exactly four units, and are expected to do the assignments, quizzes, and final exam. There will not be any midterm.

The instructor is Charles Elkan (Professor), whose office is in the CSE building, room 4134. Feel free to send email to arrange an appointment. The teaching assistant is Eric Christiansen. He will have office hours in room B250A in the basement of the CSE building three days per week: at 5:30pm on Tuesdays, 2:30pm on Thursdays, and 2pm on Fridays.

Each week there will be a hands-on assignment due in class the next week. Assignments will include pointers to datasets. You should do each assignment in a team of exactly two people. You are free to keep the same partner for multiple assignments, or to switch. You should look for intellectual diversity in whom you choose as a partner. Specifically, people from the same company should not work together. Students from outside CSE should pick partners who are CSE students, or similar. In each pair, at least one student should be good at writing code in multiple programming languages.

For each assignment, each team should turn in a brief joint report. Each report should be single-spaced and include figures, tables, and citations as appropriate. Do not include appendices or listings of code. Grades will be based entirely on the joint reports. Reports should be concise, more like memos than like full papers. Reports will be graded using this rubric.

Students can choose which software to use. The recommended package, called R, is completely free and open-source. RStudio is an easy-to-use free interactive environment that makes R comparable to Matlab (which is also recommended, but is not free). The New York Times had a good article about R. The Rattle package for R is an interactive data mining environment that can be used by non-programmers. The RapidMiner package is a 100% Java alternative data mining environment. It is easier than R for non-programmers to use, but R is more flexible and scales better to large datasets.

Starting on the second Tuesday of classes, there will be a seven-minute quiz at the start of each class meeting. The first few minutes of each class will be for answering questions and returning previous work. The purpose of the quiz is to encourage participants to arrive on time, and to review the material from the previous class meeting. You will do each quiz jointly with one partner. Talking is encouraged, and overhearing other students is fine. After each quiz, there will be a brief class discussion about what the correct answer is.

The lowest two quiz scores will be dropped. A missed quiz counts as one of the dropped scores. Quizzes and the final exam will be open-notes. This means that you may bring your own notes, a printed copy of the class lecture notes, and copies of your own previous assignments and quizzes. Since no book is required for this class, please do not bring books. Electronic devices are also not allowed, except a simple calculator and an e-book reader for lecture notes.



Most recently updated on June 6, 2013 by Charles Elkan, elkan@cs.ucsd.edu