DSC 102: Systems for Scalable Analytics

!!! As per UCSD regulations, this class will be online-only (Zoom-based) for the first four weeks and then in-person only afterward. !!!

Administrivia

Lectures: MWF 1:00-1:50pm PT at WLH 2005

Instructor: Arun Kumar

  • Office Hours: Wed 2:00-3:00pm PT at 3218 CSE and Zoom

Discussions: Fri 4:00-4:50pm PT at MANDE B-210 or Zoom (only occasionally)

TAs:

  • Pradyumna Sridhara (prsridha [at] ucsd.edu)

  • Umesh Singla (usingla [at] ucsd.edu)

  • Vignesh Nandakumar (vnandakumar [at] ucsd.edu)

  • The TA office hours in this course are distributed non-uniformly. See the detailed schedules on the TA OHs page.

Piazza: DSC 102 (access code posted on Canvas)

Course Goals and Content

This course covers the principles of computing systems and tools for scaling data analytics to large datasets. Scalable analytics systems are a central part of modern data science in numerous application domains spanning enterprise business intelligence, Web search, e-commerce, social media, natural and social sciences, healthcare, digital humanities, e-governance, Internet of Things, and more.

Topics include computer organization, memory hierarchy, basics of operating systems, scalable and parallel computing, cloud computing, design and use of parallel dataflow systems (MapReduce/Hadoop and Spark), machine learning systems, and the use of deep learning tools. It will cover how relational algebra, SQL, linear algebra, and more general dataflow operations in such systems can be used to perform data preparation and feature engineering for machine learning (ML) at scale, how to scale ML training, how to perform ML model selection and deployment at scale, and how to handle data heterogeneity. It will also introduce the implementation of such data systems and touch upon the latest research in this space.

A major component of this course is hands-on Python programming to implement data exploration, data preparation, and model selection pipelines on large real-world data using scalable analytics tools and cloud resources, both Amazon Web Services (AWS) public cloud and SDSC's private cloud.

Course Format and Mixed Modality Instructions

  • The class meets 3 times a week for 50-minute lectures.

    • All lectures will be delivered live. As per UCSD regulations, the first four weeks will be virtual (Zoom-only). All subsequent lectures will be in-person only.

    • All Zoom recordings from the first four weeks will be posted to Canvas Media Gallery. All in-person lectures will be automatically podcast afterward.

    • Attending lectures live is NOT mandatory. But you are highly encouraged to join them live to participate in the in-class discussions, surprise quizzes, and other interactive activities.

    • Students are NOT required to have webcams. But microphones are highly encouraged. All Zoom meetings can be joined via phone as well.

    • We will use Piazza for asynchronous discussions and questions. Canvas Discussions is okay too.

  • 3 programming assignments (PAs).

    • Students can work on the PAs either individually or in teams of 2. Students should email their team decisions to the TAs before 11:59pm PT Monday 01/10. All remaining students will be randomly paired up by the TAs.

    • See the PA schedule and details on the PA schedule page.

    • There are no late days for the programming assignments; plan your work accordingly.

    • Your (team's) code submission must be entirely your (team's) own. The PA schedule page offers more guidance on what level of discussion outside your team is allowed.

  • Midterm exam and cumulative final exam.

    • These will be held as Canvas Quizzes only (formerly in person only). The dates and time windows are listed below.

    • The exams will consist primarily of multiple-choice questions (MCQs). Quantitative/longer problems may appear, but you may only need to select the final answer. Some questions will carry partial credit.

    • The guideline for time per question is a max of 1 minute per point. The points of each question will be calibrated accordingly.

    • If you miss an exam, you will get no credit for it, unless you notify the instructor in advance with a university-approved reason and receive a makeup exam slot.

    • The exams are open book/notes/Web. The only requirement is that you neither give help to nor receive help from anyone by any means.

  • 6 surprise quizzes with Google Forms / iClicker.

    • The quizzes will be spread throughout the quarter. At the end, only your 5 best scores will be used for grading.

    • The quizzes held in the first four weeks of remote instruction will be via Google Forms. The quizzes held in person will use iClicker.

    • Each quiz will have multiple-choice questions (MCQs). Quantitative/longer problems may appear, but only the final answer is needed. No partial credit.

    • If you miss a quiz, you will get no credit by default. If you miss a quiz due to a pre-notified university-approved reason, that quiz will be waived for you and this component will be rescaled accordingly.

    • The quizzes are also open book/notes/Web. The only requirement is that you neither give help to nor receive help from anyone by any means.

    • This is a no-fault component, i.e., the better of the two grades, with this component and without it (rest rescaled accordingly), will be used for your overall grade.

  • I will also release ungraded exercises on the exercises page throughout the quarter. These questions will act as practice for the graded exams and surprise quizzes.

  • The discussion slots will be used by the TAs to give talks about the programming assignments. I might also use them for review discussions before the two exams.

Pre-requisites

  • DSC 100 (Introduction to Data Management); or substantial practical experience with scalable data systems and ML, subject to the consent of the instructor.

  • Proficiency in Python programming.

Suggested Textbooks

  • Computer Organization and Design (5th edition), by David Patterson and John Hennessy (aka the "CompOrg Book").

  • Operating Systems: Three Easy Pieces, by Remzi and Andrea Arpaci-Dusseau (aka the "Comet Book").

  • Database Management Systems (3rd edition), by Raghu Ramakrishnan and Johannes Gehrke (aka the "Cow Book").

  • Spark: The Definitive Guide (1st edition), by Bill Chambers and Matei Zaharia (aka the "Spark Book").

  • Data Management in Machine Learning Systems, by Matthias Boehm, Arun Kumar, and Jun Yang (aka the "MLSys Book").

Grading

Components

  • Programming Assignments: 7% + 14% + 14%

  • Surprise Quizzes: 15% (5 x 3%)

  • Midterm Exam: 15%

  • Cumulative Final Exam: 35%
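To make the arithmetic concrete, here is a minimal Python sketch of how the component weights, the best-5-of-6 quiz rule, and the no-fault quiz rule could combine into a total score. It is an illustration under stated assumptions, not the official grading script: the function name `total_score`, the representation of every score as a fraction between 0 and 1, and the reading of "rest rescaled accordingly" as scaling the remaining 85% up to 100 are all assumptions. Waived quizzes are not modeled.

```python
# Component weights in percent, taken from the Components list.
PA_WEIGHTS = (7, 14, 14)
QUIZ_WEIGHT = 15
MIDTERM_WEIGHT = 15
FINAL_WEIGHT = 35


def total_score(pa_fracs, quiz_fracs, midterm_frac, final_frac):
    """Return an overall score out of 100.

    Every input is a fraction in [0, 1] (an assumed representation).
    """
    # Only the 5 best quiz scores count, each worth 3%.
    best_five = sorted(quiz_fracs, reverse=True)[:5]
    quiz_points = sum(best_five) * (QUIZ_WEIGHT / 5)

    # Points earned on everything except the quizzes.
    rest_points = (
        sum(w * f for w, f in zip(PA_WEIGHTS, pa_fracs))
        + MIDTERM_WEIGHT * midterm_frac
        + FINAL_WEIGHT * final_frac
    )
    rest_weight = sum(PA_WEIGHTS) + MIDTERM_WEIGHT + FINAL_WEIGHT  # 85

    with_quizzes = rest_points + quiz_points
    # Assumption: dropping the quizzes rescales the other 85% up to 100.
    without_quizzes = rest_points * 100 / rest_weight

    # No-fault rule: keep whichever total is better.
    return max(with_quizzes, without_quizzes)
```

For instance, a student who scores 50% on every PA and exam but answers five of the six quizzes perfectly would keep the quiz component under this sketch, since it raises the total above the rescaled quiz-free score.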

Cutoffs

The grading scheme is a hybrid of absolute and relative grading. The absolute cutoffs are based on your absolute total score. The relative bins are based on your position in the total score distribution of the class. The better grade among the two (absolute-based and relative-based) will be your final grade.

Grade   Absolute Cutoff (>=)   Relative Bin (Use strictest)
A+      95                     Highest 5%
A       90                     Next 10% (5-15)
A-      85                     Next 15% (15-30)
B+      80                     Next 15% (30-45)
B       75                     Next 15% (45-60)
B-      70                     Next 15% (60-75)
C+      65                     Next 5% (75-80)
C       60                     Next 5% (80-85)
C-      55                     Next 5% (85-90)
D       50                     Next 5% (90-95)
F       < 50                   Lowest 5%


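Example: Suppose the total score is 82 and the percentile is 33, i.e., the score falls 67% of the way down from the top of the class, which is in the 60-75 relative bin. So, the relative grade is B-, while the absolute grade is B+. The final grade then is B+.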
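The hybrid rule can be sketched in a few lines of Python. This is only an illustration, not the official grading code. The function names are hypothetical. The percentile is assumed to mean the percentage of the class scoring lower, which is the reading consistent with the example. "Use strictest" is assumed to mean that a position exactly on a bin boundary falls into the lower grade.

```python
# (letter, absolute cutoff, upper edge of the relative bin in % from the top),
# copied from the cutoff table; the best grade comes first.
GRADE_TABLE = [
    ("A+", 95, 5), ("A", 90, 15), ("A-", 85, 30), ("B+", 80, 45),
    ("B", 75, 60), ("B-", 70, 75), ("C+", 65, 80), ("C", 60, 85),
    ("C-", 55, 90), ("D", 50, 95), ("F", 0, 100),
]
ORDER = [letter for letter, _, _ in GRADE_TABLE]


def absolute_grade(score):
    """Best letter whose absolute cutoff the total score meets."""
    for letter, cutoff, _ in GRADE_TABLE:
        if score >= cutoff:
            return letter
    return "F"


def relative_grade(percentile):
    """Letter for a percentile, assumed to be the % of the class scoring lower."""
    from_top = 100 - percentile
    for letter, _, upper in GRADE_TABLE:
        # Strict comparison: a boundary position gets the lower grade (assumption).
        if from_top < upper:
            return letter
    return "F"


def final_grade(score, percentile):
    """The better of the absolute-based and relative-based grades."""
    candidates = (absolute_grade(score), relative_grade(percentile))
    return min(candidates, key=ORDER.index)


print(final_grade(82, 33))  # B+
```

Under these assumptions, the call reproduces the example: a total of 82 gives B+ on the absolute scale, percentile 33 gives B- on the relative scale, and the better of the two is B+.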

Non-Letter Grade Options: You have the option of taking this course for a non-letter grade. A P in the P/F option requires a letter grade of C- or better; an S in the S/U option requires a letter grade of B- or better.

Exam Dates

  • Midterm Exam: Wednesday, February 9; time window: 10:00am-10:00pm PT; time limit: 70 minutes

  • Cumulative Final Exam: Friday, March 18; time window: 12:01am-11:59pm PT; time limit: 4 hours

Classroom Rules

  • No late days for submitting the programming assignments. No extensions on the exam time windows. Plan your work well up front accordingly.

  • Students are encouraged to ask questions and participate in the discussion during the live lecture and also on Piazza. Please raise your hand before speaking and the instructor will call on you to speak.

  • Please review all UCSD policies on pandemic-related public health and safety on this website. In particular, all are required to wear a proper mask indoors, including during lectures and OHs.

  • Please review UCSD's honor code and policies and procedures on academic integrity here. If plagiarism is detected in your code, or if we detect collusion on the graded quizzes or exams, or if any other form of academic integrity violation is identified, you will get zero for that component of your score and get downgraded substantially. I will also notify the University authorities for appropriate disciplinary action to be taken, up to and including expulsion from the University.

  • Please review UCSD's principles of community and our commitment to creating an inclusive learning environment on this website.

  • Harassment or intimidation of any form against any student will not be tolerated in class or on Piazza. Please review UCSD's policies on dealing with harassment and discrimination on this website.

  • In the rare event of a Zoombombing during a live lecture, I will end that session and immediately announce a new link to resume that lecture.