DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
UNIVERSITY OF CALIFORNIA, SAN DIEGO
CSE 291: Data mining and predictive analytics
Spring 2009
CSE 291 is a graduate lecture course devoted to current methods
for data mining and predictive analytics. The course is open to UCSD students and to outside
participants. No previous background in machine learning is
necessary, but all participants should be comfortable writing programs and
following arguments that use basic calculus and linear algebra.
Although the course will use some mathematics, the focus will be on useful skills more than on
theoretical foundations.
If you want to learn
about techniques for text mining, customer management, website
optimization, and related applications, then you should take
this course. A tentative list of topics to be covered is:
- foundational probability and statistics
- cleaning and recoding data, dealing with dirty data
- support vector machines (SVMs)
- detecting and preventing overfitting
- making optimal decisions based on predictions
- sensitivity/specificity tradeoffs
- learning from positive and unlabeled examples
- evaluating data mining tools
- text mining: getting useful knowledge from document collections
- predicting customer preferences
- recommender systems
- collaborative filtering
- maximizing the value of customers
- social network analytics
- data-driven optimization of websites
The course will meet once a week in the evening for ten weeks. Meetings will be on Tuesdays from 6:30pm to 9:00pm in
room 2154 of the CSE building, beginning on Tuesday, March 31.
| date | topic or title |
| March 31 | Classifier learning, linear regression, data cleaning and recoding, demo of Rapidminer |
| April 7 | Hands-on Rapidminer tutorial with Aditya Menon |
| April 14 | Support vector machines |
| April 21 | Learning when one class is rare |
| April 28 | Making optimal decisions |
| May 5 | Learning despite missing labels |
Lecture notes, quizzes, and assignments are all available in one PDF
document. To read the document, replace "291/index.html" by
"291/dm.pdf" in the URL of this page. (No link is provided, in
order to dissuade search engines from indexing the incomplete lecture
notes.)
The instructor is Charles
Elkan (Professor), whose office is in the CSE building, room 4134. Feel free to send email
to arrange an appointment, or telephone (619) 379-9852.
ORGANIZATION
Each week there will be a hands-on assignment due in class the next
week. (Exception: the first assignment will be due on April 14.)
For each assignment, you should turn in a very brief report.
Each report should be approximately two pages, single-spaced with
one-inch margins, including figures and tables as appropriate.
Assignments will include links to datasets. If you want to
use a different dataset, discuss this with the instructor as soon as
possible.
You should do each assignment in a team of two. You are free to
keep the same partner for multiple assignments, or to switch. You
should look for intellectual diversity when choosing a partner.
Specifically, people from the same company should not work
together. Students from outside CSE should pick partners who are
CSE students, or similar. Undergraduates and MBA students should
pick partners from different degree programs.
Starting on April 14, each class meeting will include a ten-minute quiz
at 6:35pm. The first five minutes of class, from 6:30pm to 6:35pm, will be for
answering questions. Each quiz will be individual but open-book.
The purpose of the quiz is to encourage participants to arrive on
time, and to review the material from the previous class meeting.
The lowest quiz score will be dropped. If you miss a quiz,
that quiz will count as your dropped score.
RESOURCES
(1) The course will run in parallel with a data mining contest for students, sponsored by Fair Isaac, that offers cash
prizes.
(2) Lecture notes will be provided by the instructor for each topic.
The course will not follow a textbook, and no specific book is
required. A list of suggested books is linked from this page.
(3) Students will have choices about what software to use. Two recommended packages are
completely free and open-source. The first, R,
is comparable in some ways to Matlab (which is also recommended, but is
not free). The New York Times recently published a good article about R. The second, Rapidminer, is a complete, high-quality interactive data mining environment written entirely in Java.
(4) Off-campus participants who drive will need to purchase parking
permits, but we will make an effort to ensure that convenient parking
spaces are always available.
REGISTRATION
Participants should expect to do some
programming and to follow mathematical arguments based on elementary
calculus, linear algebra, and probability theory.
Any recent course in statistics or machine learning should
provide enough mathematical background, and many other courses may be
sufficient also. Everyone who is
interested, regardless of prior experience, is welcome to contact the
instructor to discuss participation.
Participants may take the course for a letter grade, or for a
satisfactory/unsatisfactory (S/U) grade. All participants should
register for exactly four units, and are
expected to do the assignments and quizzes. The final exam will
be take-home and open-book. There will not be any midterm.
UCSD students should register for CSE
291, section id 651588. Community members should register through
UCSD Extension concurrent enrollment.
Most recently updated on May 5, 2009 by Charles Elkan, elkan@cs.ucsd.edu