DSC 102: Systems for Scalable Analytics (Online-Only Edition)

Administrivia

Lectures: Tue/Thu 11:00am-12:20pm PT on Zoom (link posted on Canvas)

Instructor: Arun Kumar; Office: 3218 CSE; Office Hours: Thu 1:00-2:00pm on Zoom (link posted on Canvas)

Discussions: Mon 1:00-1:50pm PT on Zoom (occasionally; link posted on Canvas)

TAs:

  • Side Li (s7li [at] eng.ucsd.edu)

  • Umesh Singla (usingla [at] ucsd.edu)

  • Subrato Chakravorty (suchakra [at] eng.ucsd.edu)

  • The TA office hours in this course are not spread evenly across the quarter; they are concentrated around deadlines. See the detailed schedules on the TA OHs page.

Piazza: DSC 102 (access code posted on Canvas)

Announcements

  • New! The final review discussion on Friday, 3/12 has been moved to 11:00am-1:00pm PT. The Quiz 4 time window is 2:00pm to 11:59pm PT that day.

  • New! Exercise 4 and its answers have been released on the exercises page.

  • New! The PA2 deadline has been extended by 24 hours to Wednesday, 3/10 at 11:59pm PT.

  • New! Peer Activities 4 and 5 have been merged and released. See the details on Canvas/Piazza.

  • The dates/times/windows for the remaining Quizzes and Peer Activities, as well as Discussion, Review, and extra OHs have been announced on Canvas/Piazza. Make a careful note of all of the deadlines.

Course Goals and Content

This course covers the principles of computing systems and tools for scaling data analytics to large datasets. Scalable analytics systems are a central part of modern data science in numerous application domains spanning enterprise business intelligence, Web search, e-commerce, social media, natural and social sciences, healthcare, digital humanities, e-governance, Internet of Things, and more.

Topics include computer organization, memory hierarchy, basics of operating systems, scalable and parallel computing, cloud computing, design and use of parallel dataflow systems (MapReduce/Hadoop and Spark), machine learning systems, and the use of deep learning tools. It will cover how relational algebra, SQL, linear algebra, and more general dataflow operations in such systems can be used to perform data preparation and feature engineering for machine learning (ML) at scale, how to scale ML training, how to perform ML model selection and deployment at scale, and how to handle data heterogeneity. It will also introduce the implementation of such data systems and touch upon the latest research in this space.

A major component of this course is hands-on Python programming to implement data exploration, data preparation, and model selection pipelines on large real-world data using scalable analytics tools and cloud resources, both the Amazon Web Services (AWS) public cloud and SDSC's private cloud.

Course Format and Online-only Modality Instructions

  • The class meets 2 times a week for 80-minute lectures.

    • All lectures will be held via Zoom video conference calls. I will lecture live and record a video that will later be posted to the course Canvas page. You can interrupt to ask questions live during the call.

    • Attendance of live lectures is not mandatory. All lecture videos will be available on Canvas for asynchronous viewing by students. However, you are highly encouraged to join the live lectures to participate in the in-class discussions and other interactive activities.

    • All asynchronous discussions and questions will be handled via Canvas Discussions.

    • Students are NOT required to have webcams, but microphones are highly encouraged. All Zoom meetings can be joined via phone as well.

  • 3 programming assignments (PAs).

    • Students can work on the PAs either individually or in teams of 2. Students should email their team decisions to the TAs before 11:59pm PT on Tuesday, 01/12. All remaining students will be randomly paired up by the TAs.

    • See the PA schedule and details on the PA schedule page.

    • There are no late days for the programming assignments; plan your work accordingly.

    • Your (team's) code submission must be entirely your (team's) own. The PA schedule page offers more guidance on what level of discussion outside your team is allowed.

  • 4 short online quizzes on Canvas.

    • Each quiz will typically be up to 25 min long and will consist primarily of multiple choice questions (MCQ). There may be quantitative or longer problems, but you may only need to select the final answer. Partial credit will be possible for some questions.

    • The quizzes will be available on Canvas for a fixed time window. You must take the quiz within this time window; note that the time limit still applies.

    • If you fail to take a quiz, you will get no credit by default. If you miss a quiz due to a pre-notified and certifiable medical or emergency reason, that quiz will be waived for you and your score will be reweighted accordingly.

    • The quizzes are open books/notes/Web. The only requirement is that you neither give help to nor receive help from anyone by any means.

  • This course will have 1 in-class midterm exam and 1 cumulative final exam.

    • If you miss an exam, you will get no credit for it unless you pre-notify the instructor with a certifiable medical or emergency reason; in such cases, your grade will be based on a proportional reweighting of the other components.

    • Both exams are open books/notes/Web. The only requirement is that you neither give help to nor receive help from anyone by any means.

  • There will be a few peer-oriented activities in class or on Canvas that will require you to evaluate some specific essay submissions of your peers. I will give more details about these activities in due course.

  • The discussion slots will be used by the TAs to give talks about the programming assignments. I might also use them for review discussions.

  • I will also release ungraded exercises on the exercises page throughout the quarter. These questions will act as practice for the graded quizzes and exams.

Pre-requisites

  • DSC 100 (Introduction to Data Management); or substantial practical experience with scalable data systems and ML, subject to the consent of the instructor.

  • Proficiency in Python programming.

Suggested Textbooks

  • Computer Organization and Design (5th edition), by David Patterson and John Hennessy (aka the "CompOrg Book").

  • Operating Systems: Three Easy Pieces, by Remzi and Andrea Arpaci-Dusseau (aka the "Comet Book").

  • Database Management Systems (3rd edition), by Raghu Ramakrishnan and Johannes Gehrke (aka the "Cow Book").

  • Spark: The Definitive Guide (1st edition), by Bill Chambers and Matei Zaharia (aka the "Spark Book").

  • Data Management in Machine Learning Systems, by Matthias Boehm, Arun Kumar, and Jun Yang (aka the "MLSys Book").

Grading

Components

  • Midterm Exam: 15%

  • Programming Assignments: 7% + 14% + 14%

  • Quizzes: 20% (4 x 5%)

  • Cumulative Final Exam: 25%

  • Peer Evaluation Activities: 5%
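As a rough illustration of how these weights combine, here is a minimal Python sketch of the total score. The component names (`pa_first`, `quiz_1`, and so on) are hypothetical labels, and the treatment of a waived component (dropping its weight and rescaling the rest proportionally) is only one assumed reading of the "reweighting" mentioned above, not the official formula.

```python
# Component weights in percent of the total, copied from the list above.
# The key names are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "midterm": 15,
    "pa_first": 7,
    "pa_second": 14,
    "pa_third": 14,
    "quiz_1": 5,
    "quiz_2": 5,
    "quiz_3": 5,
    "quiz_4": 5,
    "final": 25,
    "peer": 5,
}


def total_score(scores, waived=()):
    """Return the weighted total on a 0-100 scale.

    scores: component name -> score in percent (0-100).
            A component missing from `scores` counts as 0.
    waived: names of excused components. ASSUMPTION: their weight is
            removed and the remaining weights are rescaled proportionally.
    """
    active = {name: w for name, w in WEIGHTS.items() if name not in waived}
    weight_sum = sum(active.values())
    earned = sum(w * scores.get(name, 0.0) for name, w in active.items())
    return earned / weight_sum


# Usage: 90% on everything except a 60% midterm.
scores = {name: 90 for name in WEIGHTS}
scores["midterm"] = 60
print(total_score(scores))                     # 85.5
print(total_score(scores, waived={"quiz_1"}))  # quiz_1 excused, rest rescaled
```

Under this reading, a waived quiz neither helps nor hurts on its own; the remaining components simply count for slightly more.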

Cutoffs

Since this is my first online-only edition of this course, the grading scheme is a hybrid of absolute and relative grading to mitigate the "cold start" issue. The absolute cutoffs are based on your absolute total score. The relative bins are based on your position in the total score distribution of the class. The better grade among the two (absolute-based and relative-based) will be your final grade.

Grade   Absolute Cutoff (>=)   Relative Bin (Use strictest)
A+      95                     Highest 5%
A       90                     Next 10% (5-15)
A-      85                     Next 15% (15-30)
B+      80                     Next 15% (30-45)
B       75                     Next 15% (45-60)
B-      70                     Next 15% (60-75)
C+      65                     Next 5% (75-80)
C       60                     Next 5% (80-85)
C-      55                     Next 5% (85-90)
D       50                     Next 5% (90-95)
F       < 50                   Lowest 5%


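Example: Suppose the total score is 82 and the percentile is 33, i.e., the score sits 67% of the way down from the top of the class. That position falls in the 60-75 relative bin, so the relative grade is B-, while the absolute grade is B+ (since 82 >= 80). The final grade then is B+.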
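The hybrid rule can be sketched in a few lines of Python. This is an illustration only, under two assumptions that the table does not spell out: the percentile is the percentage of the class scoring below you (so the position from the top is 100 minus the percentile), and a position exactly on a bin boundary falls into the lower bin (one reading of "Use strictest"). The function names are hypothetical.

```python
# Grades from best to worst, with the cutoffs from the table above.
GRADES = ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D", "F"]
# Minimum total score for each grade except F.
ABS_CUTOFFS = [95, 90, 85, 80, 75, 70, 65, 60, 55, 50]
# Upper edge of each relative bin except F, in percent from the top.
REL_UPPER = [5, 15, 30, 45, 60, 75, 80, 85, 90, 95]


def absolute_grade(total):
    """Grade from the absolute cutoffs (total on a 0-100 scale)."""
    for grade, cutoff in zip(GRADES, ABS_CUTOFFS):
        if total >= cutoff:
            return grade
    return "F"


def relative_grade(percentile):
    """Grade from the relative bins.

    ASSUMPTION: `percentile` is the percent of the class scoring below
    you, and a position exactly on a boundary gets the lower bin.
    """
    from_top = 100 - percentile
    for grade, upper in zip(GRADES, REL_UPPER):
        if from_top < upper:
            return grade
    return "F"


def final_grade(total, percentile):
    """Return the better of the absolute and relative grades."""
    candidates = (absolute_grade(total), relative_grade(percentile))
    return min(candidates, key=GRADES.index)


# The worked example above: total 82, percentile 33.
print(absolute_grade(82))   # B+
print(relative_grade(33))   # B-
print(final_grade(82, 33))  # B+
```

Because the final grade is the better of the two, a strong class distribution can never pull a grade below what the absolute cutoffs give.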

Non-Letter Grade Options: You have the option of taking this course for a non-letter grade. A P in the P/F option requires a letter grade of C- or better; an S in the S/U option requires a letter grade of B- or better.

Exam Dates

  • Midterm Exam: Tuesday, 2/9; time limit: 80min + 20min grace time; time window: 12:01am PT 2/9 to 11:59pm PT 2/9

  • Cumulative Final Exam: Thursday, 3/18; time limit: 180min + 30min grace time; time window: 12:01am PT 3/18 to 11:59pm PT 3/18

Classroom Rules

  • No late days for submitting the programming assignments or assigned peer activities. Plan your work well in advance.

  • You are encouraged to ask questions and participate in the discussion during the live lecture slot and on Canvas. Type your name in the Zoom chat or click "raise your hand" on Zoom; the instructor will pause and ask you to speak or type your question.

  • Please review UCSD's honor code and its policies and procedures on academic integrity. If plagiarism is detected in your code, or if we detect collusion on the graded quizzes or exams, or if any other form of academic integrity violation is identified, you will receive a zero for that component of your score and your grade will be lowered substantially. I will also notify the University authorities for appropriate disciplinary action to be taken, up to and including expulsion from the University.

  • Harassment or intimidation of any form against any student will not be tolerated during the calls or on the discussion forum.

  • In the rare event of Zoombombing during a live lecture, I will end that session and immediately announce a new link on Canvas to resume that lecture.