DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
UNIVERSITY OF CALIFORNIA, SAN DIEGO


CSE 291: Data mining and predictive analytics

Spring 2009



Please ask all questions about CSE 291 on the discussion board at http://www.quicktopic.com/43/H/JSyZtuKXRkRM



CSE 291 is a graduate lecture course devoted to current methods for data mining and predictive analytics.  The course is open to UCSD students and to outside participants.  No previous background in machine learning is necessary, but all participants should be comfortable writing programs and following arguments that use basic calculus and linear algebra.  Although the course will use some mathematics, the focus will be on useful skills more than on theoretical foundations.

If you want to learn about techniques for text mining, customer management, website optimization, and related applications, then you should take this course.  The course will meet once a week in the evening for ten weeks.  Meetings will be on Tuesdays from 6:30pm to 9pm in the CSE building room 2154, beginning on Tuesday March 31.

A tentative list of topics to be covered is:

Date        Topic or title
March 31    Classifier learning, linear regression, data cleaning and recoding, demo of Rapidminer
April 7     Hands-on Rapidminer tutorial with Aditya Menon
April 14    Support vector machines
April 21    Learning when one class is rare
April 28    Making optimal decisions
May 5       Learning despite missing labels

Lecture notes, quizzes, and assignments are all available in one PDF document.  To read the document, replace "291/index.html" by "291/dm.pdf" in the URL of this page.  (No link is provided, in order to dissuade search engines from indexing the incomplete lecture notes.)

The instructor is Charles Elkan (Professor), whose office is in the CSE building, room 4134.  Feel free to send email to arrange an appointment, or telephone (619) 379-9852. 

ORGANIZATION

Each week there will be a hands-on assignment due in class the next week.  (Exception: the first assignment will be due on April 14.)  For each assignment, you should turn in a very brief report.  Each report should be approximately two pages, single-spaced with one-inch margins, including figures and tables as appropriate.  Assignments will include links to datasets.  If you want to use a different dataset, discuss this with the instructor as soon as possible.

You should do each assignment in a team of two.  You are free to keep the same partner for multiple assignments, or to switch.  You should look for intellectual diversity in whom you choose as a partner.  Specifically, people from the same company should not work together.  Students from outside CSE should pick partners who are CSE students, or similar.  Undergraduates and MBA students should pick partners from different degree programs.

The first five minutes of each class will be for answering questions.  Starting on April 14, there will then be a ten-minute quiz at 6:35pm in each class meeting.  Each quiz will be individual but open-book.  The purpose of the quiz is to encourage participants to arrive on time, and to review the material from the previous class meeting.  The lowest quiz score will be dropped.  If you miss a quiz, that quiz will count as your dropped score.


RESOURCES

(1) The course will run in parallel with a data mining contest for students sponsored by Fair Isaac, with cash prizes.

(2) Lecture notes will be provided by the instructor for each topic.  The course will not follow a textbook, and no specific book is required.  See here for some suggested books.

(3) Students will have choices about what software to use.  Two recommended packages are completely free and open-source.  One, which is called R, is comparable in some ways to Matlab (which is recommended also, but is not free).  The New York Times had a good article recently about R.  The other, Rapidminer, is a complete, high-quality interactive data mining environment written entirely in Java.

(4) Off-campus participants who drive will need to purchase parking permits, but we will make an effort to ensure that convenient parking spaces are always available.


REGISTRATION

Participants should expect to do some programming and to follow mathematical arguments based on elementary calculus, linear algebra, and probability theory.  Any recent course in statistics or machine learning should provide enough mathematical background, and many other courses may be sufficient also.  Everyone who is interested, regardless of prior experience, is welcome to contact the instructor to discuss participation.

Participants may take the course for a letter grade, or for a satisfactory/unsatisfactory (S/U) grade.  All participants should register for exactly four units, and are expected to do the assignments and quizzes.  The final exam will be take-home and open-book.  There will not be any midterm.

UCSD students should register for CSE 291, section id 651588.  Community members should register through UCSD Extension concurrent enrollment.


Most recently updated on May 5, 2009 by Charles Elkan, elkan@cs.ucsd.edu