DSC 102: Systems for Scalable Analytics

Administrivia

Instructor: Arun Kumar; Office: 3218 CSE; Office Hours: Thu 2:00-3:00pm

Lectures: Tue/Thu 12:30-1:50pm; PCYNH 106

Discussions: Fri 8:00-8:50am; CENTR 115

TAs:

  • Supun Nakandala (snakanda [at] eng.ucsd.edu); Office Hours: Tue 3:30-4:00pm and Wed 2:30-3:00pm at 3232 CSE

  • Vraj Shah (vps002 [at] eng.ucsd.edu); Office Hours: Mon 9:30-10:00am at 3217 CSE and Thu 10:00-10:30am at 3109 CSE

  • Yuhao Zhang (yuz870 [at] eng.ucsd.edu); Office Hours: Thu 2:00-2:30pm and Fri 1:30-2:00pm at 3230 CSE

Piazza: DSC 102

Announcements

  • New! Exercise 3 has been released on the course documents page.

  • New! The TA will hold a discussion on the PA1 solution and common mistakes in the discussion slot on Friday, 02/21.

  • PA2 has been released on the schedule page. The due date is Wednesday, 03/11, 11:59pm PT. Absolutely no late days (even with penalty), i.e., you will get zero for PA2 if you do not submit on time.

Course Overview and Content

This course covers the principles of computing systems and tools for scaling data analytics to large datasets. Scalable analytics systems are a central part of modern data science in numerous application domains spanning enterprise business intelligence, Web search, e-commerce, social media, natural and social sciences, healthcare, digital humanities, e-governance, Internet of Things, and more.

Topics include computer organization, memory hierarchy, basics of operating systems, scalable and parallel computing, cloud computing, design and use of parallel dataflow systems, and the use of deep learning tools. It will cover how relational algebra, SQL, linear algebra, and more general dataflow operations in such systems can be used to perform data preparation and feature engineering for machine learning (ML) at scale, how to scale ML training, how to perform ML model selection and deployment at scale, and how to handle data heterogeneity. It will also introduce the implementation of such data systems and touch upon the latest research in this space.

A major component of this course is hands-on Python programming to implement data exploration, data preparation, and model selection pipelines on large real-world data using scalable analytics tools and cloud resources.

Course Format

  • The class meets 2 times a week for 80-minute lectures. All lectures are mandatory. While lecture slides will be made available on this webpage, additional content might be discussed in class.

  • There are 2 programming assignments. Students can work on them either individually or in teams of 2. Students should email their team decisions to the TAs before 11:59pm PT Tuesday 01/14. All remaining students will be randomly paired up by the TAs. There are no late days for the programming assignments; plan your work accordingly.

  • This course will have one in-class midterm exam and one cumulative final exam. If you miss an exam, you will get no credit for it unless you pre-notify the instructor with a certifiable medical or emergency reason; in such cases, your grade will be based on a proportional reweighting of the other components.

  • You are required to bring your iClicker to every lecture. Make sure to register your clicker on Canvas before the second lecture.

  • There will be 7 short in-class surprise quizzes on random lecture dates to help you revise the material. Each quiz will be only 7 minutes long and will have 7 multiple-choice questions that need to be answered using iClickers. The grading is binary: if you answer at least 4 of the questions correctly, you get full credit for that quiz; otherwise, you get no credit for that quiz. If you are absent, you get no credit by default. If you miss a quiz due to an absence that was pre-notified with a certifiable medical or emergency reason, that quiz will be discounted for you and your score for this component will be reweighted accordingly. At the end of the class, only your 5 best quiz scores will be used for grading.

  • The discussion slots will be used by the TAs to give talks about the programming assignments. Drop by these talks to learn about the assignments in depth and ask questions about them.

Pre-requisites

  • DSC 100 (Introduction to Data Management); or substantial practical experience with scalable data systems and ML, subject to the consent of the instructor.

  • Proficiency in Python programming.

Suggested Textbooks

  • Computer Organization and Design (5th edition), by David Patterson and John Hennessy (aka the "CompOrg Book").

  • Operating Systems: Three Easy Pieces, by Remzi and Andrea Arpaci-Dusseau (aka the "Comet Book").

  • Database Management Systems (3rd edition), by Raghu Ramakrishnan and Johannes Gehrke (aka the "Cow Book").

  • Spark: The Definitive Guide (1st edition), by Bill Chambers and Matei Zaharia (aka the "Spark Book").

  • Data Management in Machine Learning Systems, by Matthias Boehm, Arun Kumar, and Jun Yang (aka the "MLSys Book").

Grading

Components

  • Midterm Exam: 20%

  • Programming Assignments: 10% + 25%

  • Surprise Quizzes: 5%

  • Cumulative Final Exam: 40%
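As a rough illustration of how the component weights above combine, here is a minimal Python sketch. It is not the official grading script. The names `WEIGHTS`, `quiz_fraction`, and `total_score` are hypothetical, the keys `pa_small` and `pa_large` are placeholders because the list does not say which assignment carries 10% and which 25%, and the sketch assumes each component is expressed as a fraction of its maximum. The quiz rule (full credit for at least 4 of 7 correct, best 5 of 7 quizzes counted) is taken from the Course Format section; the reweighting for excused absences is not modeled.

```python
# Component weights in percent, as listed above. The two programming
# assignment keys are placeholder names (assumption).
WEIGHTS = {"midterm": 20, "pa_small": 10, "pa_large": 25, "quizzes": 5, "final": 40}


def quiz_fraction(correct_counts, pass_mark=4, best_n=5):
    """Fraction of the quiz component earned.

    correct_counts holds the number of correct answers on each quiz;
    a missed quiz counts as 0. Each quiz is graded as 1 (full credit)
    or 0 (no credit), and only the best_n quiz scores are kept.
    """
    credits = sorted((1 if c >= pass_mark else 0 for c in correct_counts), reverse=True)
    return sum(credits[:best_n]) / best_n


def total_score(fractions):
    """Weighted total out of 100, given a fraction in [0, 1] per component."""
    return sum(WEIGHTS[name] * fractions[name] for name in WEIGHTS)


# Made-up numbers, purely for illustration.
example = {
    "midterm": 0.80,
    "pa_small": 0.90,
    "pa_large": 1.00,
    "quizzes": quiz_fraction([7, 4, 3, 0, 5, 6, 2]),  # four quizzes pass
    "final": 0.75,
}
print(round(total_score(example), 2))
```

With these made-up inputs, four of the seven quizzes earn credit, so the quiz component contributes 4 of its 5 points and the total comes to about 84 out of 100.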

Cutoffs

Since this is the very first edition of this course, the grading scheme is a hybrid of absolute and relative grading to mitigate the "cold start" issue. The absolute cutoffs are based on your absolute total score. The relative bins are based on your position in the total score distribution of the class. The better of the two grades (absolute-based and relative-based) will be your final grade. The cutoffs listed below offer a minimum guarantee on your grade; some thresholds might be lowered slightly later by the instructor but they will not be raised.

Grade   Absolute Cutoff (>=)   Relative Bin (Use strictest)
A+      95                     Highest 5%
A       90                     Next 10% (5-15)
A-      85                     Next 15% (15-30)
B+      80                     Next 15% (30-45)
B       75                     Next 15% (45-60)
B-      70                     Next 15% (60-75)
C+      65                     Next 5% (75-80)
C       60                     Next 5% (80-85)
C-      55                     Next 5% (85-90)
D       50                     Next 5% (90-95)
F       < 50                   Lowest 5%


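Example: Suppose the total score is 82 and the percentile is 33, i.e., the score sits 67% down from the top of the class distribution. The absolute grade is B+ (82 >= 80), while the relative grade is B- (the 60-75 bin). The final grade is then B+, the better of the two.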
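The hybrid rule above can be sketched in Python as follows. This is only an illustrative sketch, not the instructor's actual procedure. The function names are hypothetical. It assumes "percentile" means the percentage of the class scoring below you, so the position from the top is 100 minus the percentile (the reading under which the example above gives B-). It also assumes that a position exactly on a bin boundary falls into the lower grade, which is one possible reading of "Use strictest".

```python
# Grades from best to worst.
GRADE_ORDER = ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D", "F"]

# (grade, minimum total score), from the Absolute Cutoff column.
ABSOLUTE_CUTOFFS = [
    ("A+", 95), ("A", 90), ("A-", 85), ("B+", 80), ("B", 75),
    ("B-", 70), ("C+", 65), ("C", 60), ("C-", 55), ("D", 50),
]

# (grade, upper edge of the bin in percent from the top), from the
# Relative Bin column.
RELATIVE_BINS = [
    ("A+", 5), ("A", 15), ("A-", 30), ("B+", 45), ("B", 60),
    ("B-", 75), ("C+", 80), ("C", 85), ("C-", 90), ("D", 95),
]


def absolute_grade(total_score):
    """Grade from the absolute cutoffs alone."""
    for grade, cutoff in ABSOLUTE_CUTOFFS:
        if total_score >= cutoff:
            return grade
    return "F"


def relative_grade(percentile):
    """Grade from the relative bins alone.

    Assumption: percentile is the percent of the class scoring below,
    and boundary positions go to the lower grade.
    """
    from_top = 100 - percentile
    for grade, upper_edge in RELATIVE_BINS:
        if from_top < upper_edge:
            return grade
    return "F"


def final_grade(total_score, percentile):
    """The better of the absolute-based and relative-based grades."""
    candidates = [absolute_grade(total_score), relative_grade(percentile)]
    return min(candidates, key=GRADE_ORDER.index)


# The worked example above: total score 82, percentile 33.
print(absolute_grade(82), relative_grade(33), final_grade(82, 33))
```

Because the final grade is the better of the two, a student is never hurt by the class distribution: a low percentile cannot pull the grade below what the absolute cutoffs guarantee.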

Exam Dates

  • Midterm Exam: Thursday, 02/06, in class

  • Cumulative Final Exam: Tuesday, 03/17, 11:30am to 2:30pm, Room TBD

Classroom Rules

  • You are encouraged to ask questions and participate in class discussions. Please raise your hand before asking questions or speaking during the lectures.

  • Harassment or intimidation of any form against other students will not be tolerated in class.

  • If plagiarism is detected in your code or if cheating is detected during an exam, the University authorities will be notified immediately for appropriate disciplinary action to be taken. You will also get zero on that entire component of your grade.