DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
UNIVERSITY OF CALIFORNIA, SAN DIEGO


CSE 291: Web-scale information retrieval and data mining

Spring 2008



Please use this discussion board to ask questions about CSE 291, especially questions about the projects.

CSE 291 is a graduate seminar devoted to recent research on methods from machine learning, information retrieval, and data mining that are useful for analyzing web-scale
datasets.  If you want to learn about the algorithms that underlie many of the applications provided by Google, Yahoo, and their competitors, then you should take this course.

A partial list of specific topics to be covered is:
The course will include lectures and class meetings where a student gives a talk presenting a recent technical paper in detail.  Class meetings will be on Tuesdays and Thursdays from 3:30pm to 4:50pm in the CSE building room 2154, beginning on Tuesday April 1.

date topic or title speaker
reading or handout
slides
April 1 Overview list of possible papers
April 3 Improving the k-means algorithm Sanjoy Dasgupta www.cs.ucsd.edu/~dasgupta/291
April 8 Learning without negative examples Charles Elkan
April 9 Wednesday 2pm: Special lecture Yuri Lifshits web algorithms
April 10 The k-medoids problem Sanjoy Dasgupta www.cs.ucsd.edu/~dasgupta/291
April 15 WebIC: A Complete Web Recommender System Russ Greiner
April 17 Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights Matt Rodriguez http://public.research.att.com/~yehuda/pubs/BellKorIcdm07.pdf
April 22 Adversarial Classification Aditya Menon http://www.cs.washington.edu/homes/pedrod/papers/kdd04.pdf slides
April 24 Google News Personalization: Scalable Online Collaborative Filtering Jerry Fu http://www.www2007.org/papers/paper570.pdf slides
April 29 Automatic Identification of User Goals in Web Search Emiran Curtmola www2005.org/cdrom/docs/p391.pdf slides
May 1 Semantic Annotation and Retrieval of Music and Sound Effects
Doug Turnbull http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4432652
May 6 Clustering Billions of Images with Large Scale Nearest Neighbor Search Dafna Bitton http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4118757 slides
May 8
A Web-Based Kernel Function for Measuring the Similarity of Short Text Snippets
Jostein Oysad
http://www2006.org/programme/item.php?id=3069
May 13 On the Gravity Recommendation System Matt Rodriguez www.cs.uic.edu/~liub/KDD-cup-2007/proceedings/gravity-Tikk.pdf slides
May 15 Estimation of the Click Volume by Large Scale Regression Analysis Aditya Menon http://yury.name/papers/lifshits2007click.pdf
May 20 PageRank for Product Image Search Dafna Bitton http://www.www2008.org/papers/fp506.html slides
May 22 The Lane’s Gifts v. Google Report Jostein Oysad http://www.cis.upenn.edu/~mkearns/teaching/SponsoredSearch/Tuzhilin_Report.pdf slides
May 23 The Economics of Internet Search Hal Varian http://www.calit2.net/events/popup.php?id=1328
May 27 Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO Jerry Fu http://exp-platform.com/Documents/GuideControlledExperiments.pdf
May 29
June 3
June 5

Each student who takes the course for four units will do a term project following specific guidelines.  The project should be at the frontier of current research, and closely inspired by at least one of the papers discussed in the class.  Many projects will be devoted to reproducing the results of a selected recent high-quality published paper.  Doing this has very high educational value and provides an excellent starting point for subsequent high-quality original research. 

Project reports will be evaluated using these grading criteria.  There is a schedule for handing in a detailed project proposal, a draft project report, and then the final report.  The seminar will have no final exam.  Letter grades will be based mostly on the final project report, but the presentations, participation in class and in the web-based discussions, and the intermediate project deliverables are all important also.

The instructor is Charles Elkan (Professor), whose office is in the CSE building, room 4134.  Feel free to send email to arrange an appointment, or telephone (858) 534-8897. 

Note:  The seminar runs in parallel with a data mining contest sponsored by Fair Isaac, with cash prizes.


SOURCES OF RELEVANT PAPERS

Papers will be recent technical articles, mainly from high-quality conferences and workshops.  Students may choose articles from this list of relevant papers, in consultation with the instructor.  Suggestions of other relevant papers are very welcome.  These papers may be published in many venues.  The following conferences and workshops have a high density of interesting papers.
Relevant papers can also be found through the 2006 University of Pennsylvania seminar on sponsored search, and in this list of papers written by people at Google.  Yahoo has a similar list.  Also see the website on web algorithms of Yuri Lifshits.

PRESENTATIONS

The schedule of presentations will be determined as early as possible.  Students should choose a date first, and then agree with the instructor about a paper to present.  To find ideas, students can look at this list of possible papers and contact the instructor.   Each presentation will cover a single paper (or at most two), to ensure that it is explained and discussed in sufficient depth.  If you want to change your presentation date, please arrange a swap with another student and notify the instructor at least two weeks in advance.

The procedure for each student presentation is as follows:
Each presentation should be prepared using Powerpoint or similar software, and should consist of at least 40 slides.  You must write out all important equations, and copy all important diagrams, charts, and tables from the paper into your slides.  Presentations will be evaluated, in a friendly way but with high standards, using a feedback form.

We have a web-based discussion area.  For each paper, each student is expected to contribute at least one thoughtful message to the discussion, before the presentation.  A message may ask an interesting question, point out a strength or weakness of the paper, or answer a question asked by someone else.


REGISTRATION

The prerequisite for this seminar is at least one graduate-level course (at UCSD or elsewhere) in machine learning or a related area.  Appropriate courses at UCSD include CSE 250B (Principles of AI: Learning) and CSE 291 (Probabilistic Methods in AI and Machine Learning).  Many other courses may be appropriate also.  Anyone who is interested, regardless of experience, is welcome to contact the instructor to discuss participation.

Students may take the seminar for a letter grade for four units, or for one or two units, satisfactory/unsatisfactory (S/U): For four units, a student should register for CSE 291, section id 620781 for a letter grade.  For one or two units, a student should register for the instructor's CSE 293, section id 620800. 


Most recently updated on May 27, 2008 by Charles Elkan, elkan@cs.ucsd.edu