CSE 291D / 234: Data Systems for Machine Learning (Online-Only Edition)

Lectures: Tue/Thu 1:00-2:20pm PT @ Zoom only (link posted on Piazza page)

Instructor: Arun Kumar

  • Email: arunkk [at] eng.ucsd.edu

  • Office Hours: Thu 2:30-3:30pm PT @ Zoom/phone only (link posted on Piazza page)

Teaching Assistant: Htut Khine Win

  • Email: hhtaywin [at] ucsd.edu

  • Office Hours: Mon 12:00-1:00pm PT @ Zoom/phone only (link posted on Piazza page)

Piazza: CSE 291D/234 (Requires access code posted on Canvas)

Announcements

  • New! I will hold extra OHs ahead of Quiz 3: Monday, 11/23 9:00am to 10:00am PT. My OHs on 11/26 are moved up to 11/25 11:00am to 12:00pm PT.

  • New! Exercise 4 has been released.

  • New! Quiz 3 will be on Monday, 11/23, with the time window 2:30pm to 8:30pm PT. It is worth 15 points and has a time limit of 25 min.

Course Goals and Content

This is a research-based course on data systems for machine learning (ML), at the intersection of the fields of ML/AI, data management, and systems. Such systems power modern data science applications on large and complex datasets, including enterprise analytics, recommendation systems, and social media analytics. Students will learn about the landscape and evolution of such systems and the latest research. This is a lecture-driven course with quizzes, exams, and paper reviewing components for evaluation. It is primarily tailored for MS students, PhD students, and advanced undergraduates interested in the state of the art of systems for scalable data science and ML engineering.

This course will cover key systems topics spanning the whole lifecycle of ML-based data analytics, including data sourcing and preparation for ML, programming models and systems for scalable ML model building, and systems for faster ML deployment. Emerging topics such as governance, explanation, and ethics of ML systems will likely be covered too. A major component of this course is reviewing cutting edge research papers from recent top conferences on these topics. See the course schedule page for the entire list of topics, as well as the paper reading list.

Course Format and Online-only Modality Instructions

  • The class meets 2 times a week for 80-minute lectures.

    • All lectures will be held via a Zoom video conference call. I will play a recorded video of my lecture. You can interrupt to ask questions live during this call. Major Q&A from lectures may be summarized in a Piazza post. All lecture videos will also be made available online for asynchronous viewing by students. The links will be posted on the schedule page.

    • You must join the class Piazza page (see link above) and follow class announcements and discussions. You must familiarize yourself with Canvas for this course. You are also highly encouraged to install and familiarize yourself with Zoom, which is UCSD's recommended video conferencing software.

    • Students are NOT required to have webcams. But microphones are highly encouraged. All Zoom meetings can be joined via phone as well.

  • 4 short online quizzes on Canvas.

    • Each quiz will typically be up to 25 min long (NB: not 15 min!). It will consist primarily of multiple choice questions (MCQ). Quantitative or longer problems may appear, but you may only need to select the final answer. Partial credit is possible if the answer is explained in detail.

    • The guideline for time per question is a max of 45sec to 1min per point. The points of each question will be calibrated accordingly.

    • The quizzes will be available on Canvas for a fixed time window (e.g., 6 hours). You must take the quiz within this time window; note that the time limit still applies once you start.

    • If you fail to take a quiz, you will get no credit by default. If you miss a quiz due to a pre-notified and certifiable medical or emergency reason, that quiz will be waived for you and your score will be reweighted accordingly.

    • The quizzes are open book/notes/Web. The only requirement is that you neither give help to nor receive help from anyone by any means.

  • 2 exams; the second exam is not cumulative.

    • These will also be delivered as Canvas Quizzes, exactly like the quizzes above. Each exam will be 80 min long, and its time window will be longer than that of a quiz.

    • If you miss an exam, you will get no credit for it unless you notify the instructor in advance with a certifiable medical or emergency reason.

    • Both exams are also open book/notes/Web. The only requirement is that you neither give help to nor receive help from anyone by any means.

  • 9 paper reviews; submitted via Google Forms.

    • Each week will have 1 or 2 papers assigned for review along with a deadline.

    • Discussion with your peers of the papers assigned for review is acceptable. But the final submitted reviews must be entirely your own.

    • At the end of the class, only your 8 best scores will be used for grading.

    • If you submit multiple entries per review, only the latest review will be evaluated.

    • I will likely discuss the papers’ content in class, including the extra readings listed.

    • Resources for how to read and evaluate research papers: Keshav's Writeup and Mitzenmacher's Writeup.

    • The TA will evaluate your reviews with the following 3-point criteria:

      • Pertinence: Does your review demonstrate that you actually read the whole paper and know what it is about?

      • Thoroughness: Have you covered the major strong and weak points correctly?

      • Exposition: Is your review constructive, well written, and easy to read?

  • I will also release ungraded exercises on the exercises page throughout the quarter. These questions will act as practice for the graded quizzes and exams.

Prerequisites

  • A course on ML algorithms (e.g., CSE 151) is absolutely necessary.

  • A course on either database systems (e.g., CSE 132C) or operating systems (e.g., CSE 120) is also necessary.

  • The above courses could have been taken at UCSD or elsewhere.

  • Email the instructor if you would like to enroll but are unsure if you satisfy the prerequisites.

Suggested Textbooks

  • Recommended: Data Management in Machine Learning Systems, by Matthias Boehm, Arun Kumar, and Jun Yang (Free ebook via UCSD VPN).

  • Additional (optional) for background/foundations on the respective component areas:

    • Machine Learning, by Tom Mitchell (McGraw Hill).

    • Deep Learning, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville (MIT Press).

    • Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke (McGraw Hill).

    • Operating Systems: Three Easy Pieces, by Remzi and Andrea Arpaci-Dusseau (Free ebook).

Exam Dates

  • Exam 1: Thursday, 11/5; preferred slot: 1:00pm to 2:20pm PT; time window 12:01am PT 11/5 to 11:59pm PT 11/5

  • Exam 2: Saturday, 12/12; preferred slot: 11:30am to 12:50pm PT; time window 11:30am PT 12/12 to 11:29am PT 12/13

Grading

  • Quizzes: 24% (4 x 6%)

  • Exam 1: 26%

  • Exam 2: 26%

  • Paper Reviews: 24% (8 x 3%)
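
As a rough illustration of how the weights above combine, here is a minimal Python sketch. The function name and the convention that every component score is a fraction between 0 and 1 are assumptions made for this example, not part of the course tooling. It uses the best 8 paper reviews, as stated in the course format, and does not model the reweighting applied when a quiz is waived.

```python
def total_score(quizzes, exam1, exam2, reviews):
    """Weighted total out of 100 (hypothetical helper, not official tooling).

    Assumption: every score is a fraction in [0, 1].
    quizzes: 4 quiz scores, each worth 6%.
    exam1, exam2: exam scores, each worth 26%.
    reviews: up to 9 review scores; only the best 8 count, each worth 3%.
    """
    if len(quizzes) != 4:
        raise ValueError("expected exactly 4 quiz scores")
    if len(reviews) > 9:
        raise ValueError("expected at most 9 review scores")
    # Keep the 8 highest review scores; a missing review simply adds nothing.
    best_reviews = sorted(reviews, reverse=True)[:8]
    return (
        6.0 * sum(quizzes)
        + 26.0 * exam1
        + 26.0 * exam2
        + 3.0 * sum(best_reviews)
    )


if __name__ == "__main__":
    # Full marks everywhere, with one of the 9 reviews skipped (scored 0).
    print(total_score([1, 1, 1, 1], 1, 1, [1] * 8 + [0]))
```

Because only the best 8 reviews count, skipping or scoring poorly on one review does not lower the total in this sketch.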

Cutoffs

Since this is the very first non-project and online-only edition of this course, the grading scheme is a hybrid of absolute and relative grading to mitigate the "cold start" issue. The absolute cutoffs are based on your absolute total score. The relative bins are based on your position in the total score distribution of the class. The better of the two grades (absolute-based and relative-based) will be your final grade.

Grade   Absolute Cutoff (>=)   Relative Bin (Use strictest)
A+      92                     Highest 10%
A       85                     Next 15% (10-25)
A-      80                     Next 15% (25-40)
B+      75                     Next 15% (40-55)
B       70                     Next 15% (55-70)
B-      65                     Next 5% (70-75)
C+      60                     Next 5% (75-80)
C       55                     Next 5% (80-85)
C-      50                     Next 5% (85-90)
D       45                     Next 5% (90-95)
F       < 45                   Lowest 5%


Example: Suppose the total score is 82 and the percentile is 43, i.e., the position is 57% from the top of the class. The relative grade is B (the 55-70 bin), while the absolute grade is A- (at least 80). The final grade then is A-.
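
The hybrid rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the official grading script: the function names are hypothetical, the relative position is given as a percentage measured from the top of the class (so a percentile of 43 corresponds to 57), and a position exactly on a bin boundary is placed in the lower bin, which is one possible reading of "Use strictest".

```python
GRADES = ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D", "F"]
# Absolute cutoffs (>=) for A+ through D; anything below 45 is an F.
ABS_CUTOFFS = [92, 85, 80, 75, 70, 65, 60, 55, 50, 45]
# Upper edges of the relative bins for A+ through D, in percent from the top.
REL_EDGES = [10, 25, 40, 55, 70, 75, 80, 85, 90, 95]


def absolute_grade(total):
    """Letter grade from the absolute total score (0 to 100)."""
    for grade, cutoff in zip(GRADES, ABS_CUTOFFS):
        if total >= cutoff:
            return grade
    return "F"


def relative_grade(top_percent):
    """Letter grade from the position in the class, in percent from the top.

    Assumption: a position exactly on a bin edge goes to the lower bin.
    """
    for grade, edge in zip(GRADES, REL_EDGES):
        if top_percent < edge:
            return grade
    return "F"


def final_grade(total, top_percent):
    """The better of the absolute-based and relative-based grades."""
    candidates = [absolute_grade(total), relative_grade(top_percent)]
    # GRADES is ordered best to worst, so the smaller index is the better grade.
    return min(candidates, key=GRADES.index)


if __name__ == "__main__":
    # The syllabus example: total 82, percentile 43 (57% from the top).
    print(absolute_grade(82), relative_grade(57), final_grade(82, 57))
```

Under these assumptions, the example above (total score 82, percentile 43) yields an absolute grade of A-, a relative grade of B, and a final grade of A-.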

Non-Letter Grade Options: You have the option of taking this course for a non-letter grade. As per the CSE department's guidelines, the policy for P in a P/F option is a pass-equivalent letter grade, i.e., D or better; the policy for S in an S/U option is a letter grade of B- or better.

Classroom Rules

  • No late days for submitting the paper reviews. Partial credit is possible as per the TA's assessment. Plan your work well ahead accordingly.

  • Students are encouraged to ask questions and participate in the discussion during the live lecture slot and on Piazza. Enter your name in the Zoom chat or click "raise your hand" on Zoom; the instructor will pause and ask you to speak or type your question.

  • Please review UCSD's honor code and its policies and procedures on academic integrity. If plagiarism is detected in your paper reviews, or if we detect collusion on the graded quizzes or exams, or if any other form of academic integrity violation is identified, University authorities will be notified for appropriate disciplinary action to be taken. You will also get 0 for that component of your score and your grade will be lowered substantially.

  • Harassment or intimidation of any form against any student will not be tolerated during the calls or on Piazza.

  • In the rare event of Zoombombing during a live lecture, I will end that session and immediately announce a new link on Piazza to resume that lecture.