CSE 234: Data Systems for Machine Learning

Lectures: MWF 3-3:50pm PT @ York 2622

Instructor: Arun Kumar

  • Email: arunkk [at] eng.ucsd.edu

  • Office Hours: Wed 4-5pm PT @ 3218 CSE

Teaching Assistants:

  • Pratik Ratadiya

    • Email: pratadiy [at] ucsd.edu

    • Office Hours: Fri 11am-12pm PT @ Zoom

  • Soham Pachpande

    • Email: spachpan [at] ucsd.edu

    • Office Hours: Thu 4-5pm PT @ Zoom

  • Yuhao Zhang

    • Email: yuz870 [at] eng.ucsd.edu

    • Office Hours: Wed 1-2pm PT @ 3230 CSE

Piazza: CSE 234

Announcements

  • The introductory lecture is on Mon, Jan 9.

Course Goals and Content

This is a research-based course on data systems for machine learning (ML), at the intersection of the fields of ML/AI, data management, and systems. Such systems power modern data science applications on large and complex datasets, including enterprise analytics, recommendation systems, and social media analytics. Students will learn about the landscape and evolution of such systems and the latest research. This is a lecture-driven course with quizzes, exams, and paper reviewing components for evaluation. It is primarily tailored for MS students, PhD students, and advanced undergraduates interested in the state of the art of systems for scalable data science and ML engineering.

This course will cover key systems topics spanning the whole lifecycle of ML-based data analytics, including programming models and systems for scalable ML model building, data sourcing and preparation for ML, ML platforms and governance issues, and issues in ML deployment and MLOps. A major component of this course is reviewing cutting-edge research papers from recent top conferences on these topics. See the course schedule page for the entire list of topics, as well as the paper reading list.

Course Format and Instructions

  • Lectures and Discussions:

    • The class meets 3 times a week for 50-minute lectures.

    • All lectures will be held in person only. The lectures will be automatically podcast and available online for asynchronous viewing.

    • The discussion slot will be used only twice, once before each exam for a review discussion.

    • Attending the lectures and discussions is not mandatory but highly encouraged.

    • Familiarize yourself with this course website and Piazza. All class announcements and asynchronous discussions will be on Piazza.

  • 2 Quizzes and 2 Exams:

    • This course has two progress quizzes, a midterm exam, and a cumulative final exam. All of them will be held in person only on pre-announced dates.

    • The exams will have primarily multiple choice questions (MCQ). Quantitative or essay questions may exist, but only the final answer may need to be selected. Some questions may have partial credit. The quizzes will have only MCQ.

    • The guideline for time per question is a maximum of 1 minute per point. The points of each question will be calibrated accordingly.

    • If you miss a quiz or an exam, you will get no credit for it unless you notify the instructor in advance with a certifiable medical or emergency reason and receive a makeup exam slot.

    • Both the quizzes and exams are closed notes/books/Web. For all of them, you should neither give nor receive help from anyone by any means.

  • 9 Paper Reviews:

    • Each week will have a paper assigned for review via Google Forms along with a deadline.

    • At the end of the class, only your 8 best scores will be used for grading.

    • Discussion with your peers over the papers assigned for review is acceptable. But the final submitted reviews must be entirely your own.

    • If you submit multiple entries per review, only the latest review will be evaluated.

    • I will discuss the papers’ content in class, including the extra readings listed.

    • Resources for how to read and evaluate research papers: Keshav's Writeup and Mitzenmacher's Writeup.

    • The TAs will evaluate your reviews with the following 3-point criteria:

      • Pertinence: Does your review demonstrate that you actually read the whole paper and know what it is about?

      • Thoroughness: Have you covered both the major strong points and the major limitations correctly?

      • Exposition: Is your review constructive, well written, and easy to read?

  • 9 Peer Instruction Activities:

    • They will be held live in class using iClicker, spread randomly across the quarter.

    • Each activity will have 2 multiple choice questions (MCQ). Quantitative problems may exist, but only the final answer will need to be selected. No partial credit.

    • For each question, you must first answer individually. Then you can discuss the question with your neighbor(s). After that, you can answer the question again.

    • These activities are open books/notes/Web.

    • Grading is based on earnest participation in the whole activity.

    • If you miss an activity, you will get no credit for it unless you notify the instructor in advance with a university-approved reason.

    • You can miss up to 1 activity out of the 9 without losing credit.

    • Make sure to bring your clicker to every lecture. If you happen to forget it one day, submit your written answers on a sheet and hand it to me right after that class. Out-of-band submissions later will not be accepted.

    • You are allowed to possess only your own clicker. Using someone else's clicker is an academic integrity violation that will entail serious consequences as listed below.

  • There will be 2 Peer Evaluation activities delivered via Canvas only. These will be related to the invited industry guest lectures. I will announce more details in due course.

  • I will release ungraded exercises on the exercises page throughout the quarter. These questions will act as practice for the quizzes and exams.

  • The discussion slot will be used only twice, for a review discussion before each exam.

  • NB: Unlike the prior editions of this course, this edition does not offer a project-based pathway, primarily due to the size of the class.

Prerequisites

  • A course on ML algorithms (e.g., CSE 151) is absolutely necessary.

  • A course on either database systems (e.g., CSE 132C) or operating systems (e.g., CSE 120) is also necessary.

  • The above courses could have been taken at UCSD or elsewhere.

  • DSC 102 suffices as a prerequisite for both of the above aspects.

  • Substantial project or industrial experience on relevant topics can be substituted for prior coursework, subject to the instructor's consent. Email the instructor if you would like to enroll but are unsure if you satisfy the prerequisites.

Suggested Textbooks

  • Recommended: Data Management in Machine Learning Systems, by Matthias Boehm, Arun Kumar, and Jun Yang (Free ebook via UCSD VPN).

  • Additional (optional) for background/foundations on the respective component areas:

    • Machine Learning, by Tom Mitchell (McGraw Hill).

    • Deep Learning, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville (MIT Press).

    • Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke (McGraw Hill).

    • Operating Systems: Three Easy Pieces, by Remzi and Andrea Arpaci-Dusseau (Free ebook).

Exam Dates

  • Quiz 1: Wed, Feb 8, in class.

  • Midterm Exam: Fri, Feb 17, 3-3:50pm PT in class.

  • Quiz 2: Fri, Mar 10, in class.

  • Cumulative Final Exam: Wed, Mar 22, 3-6pm PT in class (York 2622).

Grading

  • Paper Reviews: 24% (8 x 3%)

  • Quizzes: 10% (2 x 5%)

  • Midterm Exam: 15%

  • Cumulative Final Exam: 40%

  • Peer Instruction Activities: 8% (8 x 1%)

  • Peer Evaluation Activities: 3% (2 x 1.5%)
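
The weights above can be combined into a total score out of 100. The following is only a minimal sketch of that arithmetic, not the official grade computation. It assumes every individual score has already been normalized to a fraction between 0 and 1, and the names `WEIGHTS`, `best_k_mean`, and `total_score` are hypothetical helpers introduced for illustration.

```python
# Component weights in percentage points, copied from the Grading list.
WEIGHTS = {
    "paper_reviews": 24.0,     # 8 x 3%
    "quizzes": 10.0,           # 2 x 5%
    "midterm": 15.0,
    "final": 40.0,
    "peer_instruction": 8.0,   # 8 x 1%
    "peer_evaluation": 3.0,    # 2 x 1.5%
}


def best_k_mean(scores, k):
    """Average of the k highest scores; if fewer than k are given, the rest count as 0."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / k


def total_score(reviews, quizzes, midterm, final, peer_instruction, peer_evaluation):
    """Weighted total out of 100. Assumption: every input score is a fraction in [0, 1]."""
    fractions = {
        "paper_reviews": best_k_mean(reviews, 8),              # 8 best of 9 reviews
        "quizzes": sum(quizzes) / 2,
        "midterm": midterm,
        "final": final,
        "peer_instruction": best_k_mean(peer_instruction, 8),  # 1 of 9 may be missed
        "peer_evaluation": sum(peer_evaluation) / 2,
    }
    return sum(WEIGHTS[name] * fractions[name] for name in WEIGHTS)


# A student with full marks who skipped one review and one clicker activity.
example = total_score(
    reviews=[1.0] * 8 + [0.0],
    quizzes=[1.0, 1.0],
    midterm=1.0,
    final=1.0,
    peer_instruction=[1.0] * 8 + [0.0],
    peer_evaluation=[1.0, 1.0],
)
print(example)
```

Under these assumptions, dropping the lowest paper review and one missed peer instruction activity leaves the total unchanged, which is the intent of the "8 best of 9" rules.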

Cutoffs

The grading scheme is a hybrid of absolute and relative grading. The absolute cutoffs are based on your absolute total score. The relative bins are based on your position in the total score distribution of the class. The better grade among the two (absolute-based and relative-based) will be your final grade.

Grade   Absolute Cutoff (>=)   Relative Bin (Use strictest)
A+      92                     Highest 10%
A       85                     Next 15% (10-25)
A-      80                     Next 15% (25-40)
B+      75                     Next 15% (40-55)
B       70                     Next 15% (55-70)
B-      65                     Next 5% (70-75)
C+      60                     Next 5% (75-80)
C       55                     Next 5% (80-85)
C-      50                     Next 5% (85-90)
D       45                     Next 5% (90-95)
F       < 45                   Lowest 5%


Example: Suppose the total score is 82 and the percentile is 43, i.e., the score sits 57% down from the top of the class distribution. The relative grade is B (the 55-70 bin), while the absolute grade is A- (82 >= 80). The final grade then is A-.
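
The hybrid rule can also be written out as a short sketch. This is an illustration of the table and example above, not the official grading script. It assumes "percentile" means the percentage of the class scoring below you, so the position from the top is 100 minus the percentile, and it reads "Use strictest" as placing a position exactly on a bin boundary into the lower grade. The function names are hypothetical.

```python
# Absolute cutoffs (total score >= cutoff), best grade first.
ABSOLUTE_CUTOFFS = [
    ("A+", 92), ("A", 85), ("A-", 80), ("B+", 75), ("B", 70),
    ("B-", 65), ("C+", 60), ("C", 55), ("C-", 50), ("D", 45),
]

# Upper edge of each relative bin, measured in percent from the top of the class.
RELATIVE_BINS = [
    ("A+", 10), ("A", 25), ("A-", 40), ("B+", 55), ("B", 70),
    ("B-", 75), ("C+", 80), ("C", 85), ("C-", 90), ("D", 95), ("F", 100),
]

GRADE_ORDER = [grade for grade, _ in RELATIVE_BINS]  # best to worst


def absolute_grade(total):
    """Grade from the absolute cutoffs alone."""
    for grade, cutoff in ABSOLUTE_CUTOFFS:
        if total >= cutoff:
            return grade
    return "F"


def relative_grade(percentile):
    """Grade from the relative bins alone.

    Assumption: percentile = percent of the class scoring below this student.
    Assumption: a position exactly on a boundary falls into the lower grade.
    """
    from_top = 100 - percentile
    for grade, upper_edge in RELATIVE_BINS:
        if from_top < upper_edge:
            return grade
    return "F"


def final_grade(total, percentile):
    """The better of the absolute-based and relative-based grades."""
    candidates = [absolute_grade(total), relative_grade(percentile)]
    return min(candidates, key=GRADE_ORDER.index)


print(absolute_grade(82), relative_grade(43), final_grade(82, 43))
```

With a total score of 82 and a percentile of 43, the sketch reproduces the worked example: an absolute grade of A-, a relative grade of B, and a final grade of A-.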

Non-Letter Grade Options: You have the option of taking this course for a non-letter grade. As per the CSE department's guidelines, the policy for P in a P/F option is a pass-equivalent letter grade, i.e., D or better; the policy for S in an S/U option is a letter grade of B- or better.

Classroom Rules

  • No late days for submitting the paper reviews. Plan your work well up front accordingly.

  • Students are encouraged to ask questions and participate in the discussions in class and also on Piazza. Please raise your hand before speaking and the instructor will call on you to speak.

  • Please review UCSD's honor code and policies and procedures on academic integrity on this website. If plagiarism is detected in your paper reviews and/or exams, or if any other form of academic integrity violation is identified, the University authorities will be notified for appropriate disciplinary action to be taken. You will also get 0 for that component of your score and get downgraded substantially.

  • Please review UCSD's principles of community and our commitment to creating an inclusive learning environment on this website.

  • Harassment or intimidation of any form against any student will not be tolerated in class or on Piazza. Please review UCSD's policies on dealing with harassment and discrimination on this website.