CSE 234: Data Systems for Machine Learning (In-Person Edition)

Lectures: Tue/Thu 12:30-1:50pm PT @ CENTR 105

Discussions: Mon 7:00-8:00pm PT @ CENTR 212 (this slot will be used only twice)

Instructor: Arun Kumar

  • Email: arunkk [at] eng.ucsd.edu

  • Office Hours: Thu 2:00-3:00pm PT @ 3218 CSE

Teaching Assistants:

  • Yuhao Zhang

    • Email: yuz870 [at] eng.ucsd.edu

    • Office Hours: Thu 10:30-11:30am PT @ 3230 CSE

    • Tasks: Project logistics, quizzes, exams

  • Tara Mirmira

    • Email: tmirmira [at] eng.ucsd.edu

    • Office Hours: Tue 2:00-3:00pm PT @ Zoom only (link posted on Piazza / Canvas)

    • Tasks: Paper reviews

Piazza: CSE 234 (Requires access code posted on Canvas)

Course Goals and Content

This is a research-based course on data systems for machine learning (ML), at the intersection of the fields of ML/AI, data management, and systems. Such systems power modern data science applications on large and complex datasets, including enterprise analytics, recommendation systems, and social media analytics. Students will learn about the landscape and evolution of such systems and the latest research. This is a lecture-driven course with quizzes, exams, and paper reviewing components for evaluation. It is primarily tailored for MS students, PhD students, and advanced undergraduates interested in the state of the art of systems for scalable data science and ML engineering.

This course will cover key systems topics spanning the whole lifecycle of ML-based data analytics, including data sourcing and preparation for ML, programming models and systems for scalable ML model building, and systems for faster ML deployment. Emerging topics such as governance, explanation, and ethics of ML systems will likely be covered too. A major component of this course is reviewing cutting edge research papers from recent top conferences on these topics. See the course schedule page for the entire list of topics, as well as the paper reading list.

Course Format and In-Person Modality Instructions

  • The class meets 2 times a week for 80-minute lectures.

    • All lectures will be in-person only during the lecture slot. The lectures will also be automatically podcast and available online for asynchronous viewing.

    • Attending the lectures and discussions is not mandatory but highly encouraged.

    • The discussion slot will be used only twice, once before each exam.

    • As per UCSD's pandemic-related policies, everyone must properly wear a mask or other approved face covering for the entire class duration.

    • Familiarize yourself with Canvas for this course. You are also encouraged to join the class Piazza page (see link above). Follow all class announcements via Piazza or Canvas Discussions.

  • This class has two pathways for learning evaluations: exams-based and project-based.

    • Both pathways share the following components: paper reviewing and surprise quizzes.

    • The exams-based pathway will have a midterm exam and a cumulative final exam (more details below).

    • Projects can be either a small research project or a comprehensive technical survey project.

    • You must decide your pathway by 6:00pm PT Wed, Sep 29 and inform the instructor by submitting this Google Form. A one-time change is allowed from the project-based to the exams-based pathway if you request it before the midterm exam.

  • Projects:

    • The instructor will suggest a bunch of suitable project topics. You are also welcome to propose your own topic, as long as it is relevant for the course.

    • Research projects will ideally lay the groundwork for a publication at a top research conference or workshop.

    • Survey projects must provide a comprehensive analysis of a topic beyond just summarizing the papers as a laundry list.

    • All projects must be done in teams of 2. You can find your own partner or ask the TA to assign you a random partner.

    • All teams will have short weekly meetings with the instructor at a mutually feasible meeting slot (in-person or via Zoom) to discuss progress and questions.

    • Project performance will be assessed solely by the instructor. The main criteria for evaluation are diligence, technical depth, and independence; for the research projects, technical creativity is a bonus criterion.

    • All projects conclude with a final written report and a short live presentation to the class. Project reports can be 6-12 pages long and must use the ACM SIG proceedings LaTeX template. The deadline for emailing the report is EOD Thursday, Dec 9. The talks will be held in the last week of classes. More tips and evaluation criteria for the reports and talks will be released in due course.

  • Midterm and final exams:

    • The exams will consist primarily of multiple-choice questions (MCQs). Quantitative or longer problems may appear, but you may only need to select the final answer. Some questions will offer partial credit.

    • The guideline for time per question is at most 1 minute per point. The points for each question will be calibrated accordingly.

    • If you miss an exam, you will get no credit for it, unless you notify the instructor in advance with a certifiable medical or emergency reason and receive a makeup exam slot.

    • The exams are closed notes/books/Web, and you should neither give nor receive help from anyone by any means.

  • 5 in-class surprise quizzes.

    • The quizzes will be spread throughout the quarter and may have different lengths. At the end of the class, only your 4 best scores will be used for grading.

    • Each quiz will have multiple-choice questions (MCQs). Quantitative or longer problems may appear, but only the final answer is needed. No partial credit.

    • If you miss a quiz, you will get no credit by default. If you miss a quiz due to a pre-notified and certifiable medical or emergency reason, that quiz will be waived for you and this component will be reweighted accordingly.

    • The quizzes are open books/notes/Web (unlike the exams). The only requirement is you should neither give nor receive help from anyone by any means.

    • This is a no-fault component, i.e., the better of the two grades, with this component and without it (rest rescaled accordingly), will be used for your overall grade.

  • I will also release ungraded exercises on the exercises page throughout the quarter. These questions will act as practice for the graded exams and surprise quizzes.

  • 8 paper reviews; submitted via Google Forms.

    • Each week will have a paper assigned for review along with a deadline. At the end of the class, only your 7 best scores will be used for grading.

    • Discussion with your peers of the papers assigned for review is acceptable. But the final submitted reviews must be entirely your own.

    • If you submit multiple entries per review, only the latest review will be evaluated.

    • I will discuss the papers’ content in class, including the extra readings listed.

    • Resources for how to read and evaluate research papers: Keshav's Writeup and Mitzenmacher's Writeup.

    • The TA will evaluate your reviews with the following 3-point criteria:

      • Pertinence: Does your review demonstrate that you actually read the whole paper and know what it is about?

      • Thoroughness: Have you covered both the major strong points and the major limitations correctly?

      • Exposition: Is your review constructive, well written, and easy to read?

Prerequisites

  • A course on ML algorithms (e.g., CSE 151) is absolutely necessary.

  • A course on either database systems (e.g., CSE 132C) or operating systems (e.g., CSE 120) is also necessary.

  • The above courses could have been taken at UCSD or elsewhere.

  • Substantial project or industrial experience can be substituted for prior coursework, subject to the instructor's consent. Email the instructor if you would like to enroll but are unsure if you satisfy the prerequisites.

Suggested Textbooks

  • Recommended: Data Management in Machine Learning Systems, by Matthias Boehm, Arun Kumar, and Jun Yang (Free ebook via UCSD VPN).

  • Additional (optional) for background/foundations on the respective component areas:

    • Machine Learning, by Tom Mitchell (McGraw Hill).

    • Deep Learning, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville (MIT Press)

    • Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke (McGraw Hill)

    • Operating Systems: Three Easy Pieces, by Remzi and Andrea Arpaci-Dusseau (Free ebook).

Exam Dates

  • Midterm Exam: Thursday, Nov 4, 12:30pm to 1:50pm PT in class (CENTR 105).

  • Cumulative Final Exam: Friday, Dec 10, 11:30am to 2:30pm PT @ Room TBD.

Grading

Exams-based pathway:

  • Paper Reviews: 28% (7 x 4%)

  • Surprise Quizzes: 12% (4 x 3%); no-fault component

  • Midterm Exam: 20%

  • Cumulative Final Exam: 40%

Project-based pathway:

  • Paper Reviews: 28% (7 x 4%)

  • Surprise Quizzes: 12% (4 x 3%); no-fault component

  • Project Performance: 40%

  • Final Report: 10%

  • Final Presentation: 10%
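
As a rough illustration of how the weights listed above could combine into a total score, here is a minimal Python sketch. It is not the official grading procedure. It assumes every component score is expressed as a fraction between 0 and 1, that "rest rescaled accordingly" for the no-fault quiz component means scaling the remaining 88% up to 100%, and it ignores waived quizzes. The function names (best_k_mean, total_score) and the (weight, fraction) input format are hypothetical.

```python
def best_k_mean(scores, k):
    """Average of the k highest scores; each score is a fraction in [0, 1]."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / k


def total_score(review_scores, quiz_scores, other_components):
    """Total out of 100 for either pathway.

    other_components is a list of (weight_percent, fraction) pairs, e.g.
    [(20, midterm), (40, final)] for the exams-based pathway or
    [(40, performance), (10, report), (10, talk)] for the project-based one.
    """
    review_part = 28 * best_k_mean(review_scores, 7)   # best 7 of 8 reviews
    quiz_part = 12 * best_k_mean(quiz_scores, 4)       # best 4 of 5 quizzes
    other_part = sum(weight * frac for weight, frac in other_components)

    with_quizzes = review_part + quiz_part + other_part
    # Assumed no-fault rule: drop the 12% quiz weight, rescale the other 88%.
    without_quizzes = (review_part + other_part) * 100 / 88
    return max(with_quizzes, without_quizzes)


if __name__ == "__main__":
    reviews = [1.0] * 7 + [0.0]            # one missed review is dropped
    quizzes = [0.5, 0.5, 0.5, 0.5, 0.0]    # weak quizzes trigger the no-fault rule
    exams = [(20, 0.8), (40, 0.75)]
    print(round(total_score(reviews, quizzes, exams), 2))
```

In this example the quizzes would pull the total down, so under the assumed rescaling the score without the quiz component is the one that counts.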

Cutoffs

The grading scheme is a hybrid of absolute and relative grading. The absolute cutoffs are based on your absolute total score. The relative bins are based on your position in the total score distribution of the class. The better grade among the two (absolute-based and relative-based) will be your final grade.

Grade   Absolute Cutoff (>=)   Relative Bin (Use strictest)
A+      92                     Highest 10%
A       85                     Next 15% (10-25)
A-      80                     Next 15% (25-40)
B+      75                     Next 15% (40-55)
B       70                     Next 15% (55-70)
B-      65                     Next 5% (70-75)
C+      60                     Next 5% (75-80)
C       55                     Next 5% (80-85)
C-      50                     Next 5% (85-90)
D       45                     Next 5% (90-95)
F       < 45                   Lowest 5%


Example: Suppose the total score is 82 and the percentile is 43. The relative grade is B, while the absolute grade is A-. The final grade then is A-.
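
The hybrid rule can be sketched in Python as follows. This is only an illustration under stated assumptions, not the instructor's actual procedure: it assumes "percentile" means the percentage of the class scoring below you (so percentile 43 corresponds to the top 57%, which lands in the 55-70 bin), and it reads "Use strictest" as assigning the lower grade when a position falls exactly on a bin boundary. All names in the sketch are hypothetical.

```python
GRADES = ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D", "F"]
# Minimum total score for A+ through D; anything lower is an F.
ABSOLUTE_CUTOFFS = [92, 85, 80, 75, 70, 65, 60, 55, 50, 45]
# Upper edge of each relative bin (percent from the top) for A+ through D.
RELATIVE_BOUNDS = [10, 25, 40, 55, 70, 75, 80, 85, 90, 95]


def absolute_grade(score):
    """Grade from the absolute cutoffs (score >= cutoff)."""
    for grade, cutoff in zip(GRADES, ABSOLUTE_CUTOFFS):
        if score >= cutoff:
            return grade
    return "F"


def relative_grade(percentile):
    """Grade from the relative bins; percentile = percent of class below you."""
    top_percent = 100 - percentile
    for grade, bound in zip(GRADES, RELATIVE_BOUNDS):
        # Assumption: a position exactly on a boundary gets the lower grade.
        if top_percent < bound:
            return grade
    return "F"


def final_grade(score, percentile):
    """The better of the absolute-based and relative-based grades."""
    candidates = (absolute_grade(score), relative_grade(percentile))
    return min(candidates, key=GRADES.index)


if __name__ == "__main__":
    print(absolute_grade(82), relative_grade(43), final_grade(82, 43))
```

Run on the example above (score 82, percentile 43), the sketch gives an absolute grade of A-, a relative grade of B, and a final grade of A-.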

Non-Letter Grade Options: You have the option of taking this course for a non-letter grade. As per the CSE department's guidelines, the policy for P in a P/F option is a pass-equivalent letter grade, i.e., D or better; the policy for S in an S/U option is a letter grade of B- or better.

Classroom Rules

  • No late days for submitting the paper reviews. Plan your work well up front accordingly.

  • Students are encouraged to ask questions and participate in the discussion during the live lecture and also on Piazza. Please raise your hand before speaking and the instructor will call on you to speak.

  • Please review all UCSD policies on pandemic-related public health and safety on this website. In particular, all are required to wear a proper mask indoors, including during lectures.

  • Please review UCSD's honor code and policies and procedures on academic integrity on this website. If plagiarism is detected in your paper reviews and/or exams, or if any other form of academic integrity violation is identified, the University authorities will be notified so that appropriate disciplinary action can be taken. You will also receive a 0 for that component of your score, and your grade will be downgraded substantially.

  • Please review UCSD's principles of community and our commitment to creating an inclusive learning environment on this website.

  • Harassment or intimidation of any form against any student will not be tolerated in class or on Piazza. Please review UCSD's policies on dealing with harassment and discrimination on this website.