DSC 102: Systems for Scalable Analytics

Administrivia

Lectures: MWF 3:00-3:50pm PT at CENTR 109

Instructor: Arun Kumar

  • Office Hours: Wed 4:00-5:00pm PT at 3218 CSE

Discussions: Mon 5:00-5:50pm PT at CENTR 109 (only occasionally)

TAs:

  • Aniruddha Das (andas [at] ucsd.edu)

  • Areeb Syed (aas050 [at] ucsd.edu)

  • Trevor Tuttle (tjtuttle [at] ucsd.edu)

  • The TA office hours are distributed non-uniformly. See the detailed schedules on the TA OHs page.

Piazza: DSC 102 (access code posted on Canvas)

Course Goals and Content

This course covers the principles of computing systems and tools for scaling data analytics to large datasets. Scalable analytics systems are a central part of modern data science in numerous application domains spanning enterprise business intelligence, Web search, e-commerce, social media, natural and social sciences, healthcare, digital humanities, e-governance, Internet of Things, and more.

Topics include basics of computer organization, memory hierarchy, operating systems, and cloud computing; principles of scalable and parallel data-intensive computing; design and use of parallel dataflow systems (MapReduce/Hadoop and Spark); and scaling of end-to-end machine learning (ML) workloads. It will cover how relational algebra, SQL, linear algebra, and more general dataflow operations in such systems can be used to perform data preparation and feature engineering for ML at scale, how to scale ML model building, and how to handle data heterogeneity.

A major component of this course is hands-on Python programming to implement data exploration, data preparation, and model selection pipelines on large real-world data using scalable analytics tools and cloud resources, both Amazon Web Services (AWS) public cloud and SDSC's private cloud.

Course Format

  • The class meets 3 times a week for 50-minute lectures in person.

    • All lectures will be automatically podcast here afterward.

    • Attending the lectures is not mandatory. However, the Peer Instruction activities, which involve discussing questions with your peers, are held in class only (details below). There will be other interactive activities as well.

    • We will use Piazza for asynchronous discussions and questions.

  • 3 Programming Assignments (PAs).

    • See the PAs page for the PA schedule and details.

    • There are no late days for the PAs. Plan your work accordingly.

  • 12 Peer Instruction activities via iClickers.

    • They will be held live in class using iClicker, spread randomly across the quarter.

    • Each activity will have 2 multiple-choice questions (MCQs). Quantitative problems may appear, but you will only need to select the final answer. There is no partial credit.

    • For each question, you must first answer individually. Then you can discuss the question with your neighbor(s). After that, you can answer the question again.

    • These activities are also open books/notes/electronics/Web.

    • Grading is based on earnest participation in the whole activity.

    • If you miss an activity, you will get no credit for it unless you notify the instructor in advance with a university-approved reason.

    • You can miss up to 2 activities out of the 12 without losing credit.

    • Make sure to bring your clicker to every lecture. If you happen to forget it one day, submit your written answers on a sheet.

    • You are allowed to possess only your own clicker. Using someone else's clicker is an academic integrity violation that will entail serious consequences as listed below.

  • Midterm exam and cumulative final exam.

    • The midterm exam will be held in person only. The final exam will be held as a Canvas Quiz only. The dates and logistics are listed below.

    • The exams will have primarily multiple-choice questions (MCQs). Quantitative/longer problems will exist, but you may only need to select the final answer. Some questions will have partial credit.

    • The guideline for time per question is at most 1 minute per point. The points of each question will be calibrated accordingly.

    • If you miss an exam, you will get no credit for it unless you notify the instructor in advance with a university-approved reason and receive a makeup exam slot.

    • The midterm exam is closed books/notes/electronics/Web. You are allowed to keep with you two A4-sized sheets (four sides) with any content you want.

    • The final exam is open books/Web/etc. The only requirement is that you neither give nor receive help from anyone by any means.

  • There will be 3 extra credit Peer Evaluation activities delivered via Canvas only. I will announce more details on these in due course.

  • I will release ungraded exercises on the exercises page throughout the quarter. These questions will act as practice for the graded exams and surprise quizzes.

  • The discussion slots will be used by the TAs to give talks about the PAs. I might also use them for review discussions before the two exams.

Pre-requisites

  • DSC 100 (Introduction to Data Management); or substantial practical experience with scalable data systems and ML algorithms, subject to the consent of the instructor.

  • Proficiency in Python programming.

Suggested Textbooks

  • Computer Organization and Design (5th edition), by David Patterson and John Hennessy (aka the "CompOrg Book").

  • Operating Systems: Three Easy Pieces, by Remzi and Andrea Arpaci-Dusseau (aka the "Comet Book").

  • Database Management Systems (3rd edition), by Raghu Ramakrishnan and Johannes Gehrke (aka the "Cow Book").

  • Spark: The Definitive Guide (1st edition), by Bill Chambers and Matei Zaharia (aka the "Spark Book").

  • Data Management in Machine Learning Systems, by Matthias Boehm, Arun Kumar, and Jun Yang (aka the "MLSys Book").

Grading

Components

  • Programming Assignments: 8% + 16% + 16%

  • Midterm Exam: 15%

  • Cumulative Final Exam: 35%

  • In-class Peer Instruction Activities: 10%

  • Extra Credit Peer Evaluation Activities: 4% (likely)
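The weights above can be combined into a total score as in the following minimal sketch. It assumes that each component is scored as a fraction between 0 and 1, that the total is a simple weighted sum, and that the extra credit is added on top of the 100 regular points; none of these details are stated above. The names `WEIGHTS`, `EXTRA_CREDIT_WEIGHT`, and `total_score` are hypothetical.

```python
# Component weights in percent, copied from the list above.
WEIGHTS = {
    "pa1": 8,
    "pa2": 16,
    "pa3": 16,
    "midterm": 15,
    "final": 35,
    "peer_instruction": 10,
}

# Listed as "likely" above, so treat this value as tentative.
EXTRA_CREDIT_WEIGHT = 4


def total_score(fractions, extra_credit_fraction=0.0):
    """Return the weighted total score in percent.

    `fractions` maps each component name in WEIGHTS to the fraction of
    that component earned, in [0, 1]. Assumption: the total is a plain
    weighted sum, with extra credit added on top.
    """
    missing = set(WEIGHTS) - set(fractions)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    base = sum(WEIGHTS[name] * fractions[name] for name in WEIGHTS)
    return base + EXTRA_CREDIT_WEIGHT * extra_credit_fraction


if __name__ == "__main__":
    perfect = {name: 1.0 for name in WEIGHTS}
    print(total_score(perfect))        # regular components only
    print(total_score(perfect, 1.0))   # with full extra credit
```

Under these assumptions the regular components sum to 100, so full marks with full extra credit would give 104.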

Cutoffs

The grading scheme is a hybrid of absolute and relative grading. The absolute cutoffs are based on your absolute total score. The relative bins are based on your position in the total score distribution of the class. The better grade among the two (absolute-based and relative-based) will be your final grade.

Grade   Absolute Cutoff (>=)   Relative Bin (Use strictest)
A+      95                     Highest 5%
A       90                     Next 10% (5-15)
A-      85                     Next 15% (15-30)
B+      80                     Next 15% (30-45)
B       75                     Next 15% (45-60)
B-      70                     Next 15% (60-75)
C+      65                     Next 5% (75-80)
C       60                     Next 5% (80-85)
C-      55                     Next 5% (85-90)
D       50                     Next 5% (90-95)
F       < 50                   Lowest 5%


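Example: Suppose the total score is 82 and the percentile is 33, i.e., the score is 67% from the top of the class distribution and falls in the 60-75 bin. So, the relative grade is B-, while the absolute grade is B+ (since 82 >= 80). The final grade then is B+.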
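The hybrid scheme and the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the official grade calculation: it assumes the percentile is measured from the bottom of the class (so the position from the top is 100 minus the percentile, which is the reading consistent with the example), and it reads "Use strictest" as placing a position exactly on a bin boundary in the lower bin. All function and variable names are hypothetical.

```python
GRADES = ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D", "F"]

# Absolute cutoffs (score >= cutoff) for A+ through D; below 50 is F.
ABSOLUTE_CUTOFFS = [95, 90, 85, 80, 75, 70, 65, 60, 55, 50]

# Upper edges of the relative bins for A+ through D, as a position
# from the top of the class in percent; the lowest 5% is F.
RELATIVE_EDGES = [5, 15, 30, 45, 60, 75, 80, 85, 90, 95]


def absolute_grade(score):
    """Grade from the absolute total score."""
    for grade, cutoff in zip(GRADES, ABSOLUTE_CUTOFFS):
        if score >= cutoff:
            return grade
    return "F"


def relative_grade(percentile):
    """Grade from the class percentile.

    Assumption: percentile counts from the bottom, so the position from
    the top is 100 - percentile. Assumption: a position exactly on a
    boundary goes to the lower bin (one reading of "Use strictest").
    """
    top_position = 100 - percentile
    for grade, edge in zip(GRADES, RELATIVE_EDGES):
        if top_position < edge:
            return grade
    return "F"


def final_grade(score, percentile):
    """The better of the absolute-based and relative-based grades."""
    candidates = [absolute_grade(score), relative_grade(percentile)]
    return min(candidates, key=GRADES.index)  # lower index = better grade


if __name__ == "__main__":
    print(absolute_grade(82))     # B+
    print(relative_grade(33))     # B-
    print(final_grade(82, 33))    # B+
```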

Non-Letter Grade Options: You have the option of taking this course for a non-letter grade. In the P/F option, a P requires a letter grade of C- or better. In the S/U option, an S requires a letter grade of B- or better.
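As a small illustration of the non-letter grade policy above, the following sketch maps a letter grade to its non-letter outcome. The function name `non_letter_grade` and the option strings `"P/F"` and `"S/U"` are hypothetical labels chosen for this example.

```python
GRADES = ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D", "F"]

# For each option: (minimum letter grade to pass, passing mark, failing mark).
OPTIONS = {
    "P/F": ("C-", "P", "F"),
    "S/U": ("B-", "S", "U"),
}


def non_letter_grade(letter, option):
    """Convert a letter grade to the outcome under the given option."""
    minimum, passing, failing = OPTIONS[option]
    # A lower index in GRADES means a better grade.
    if GRADES.index(letter) <= GRADES.index(minimum):
        return passing
    return failing


if __name__ == "__main__":
    print(non_letter_grade("C-", "P/F"))  # P
    print(non_letter_grade("C+", "S/U"))  # U
```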

Exam Dates

  • Midterm Exam: Wed, Nov 2; in class, i.e., 3:00-3:50pm CENTR 109

  • Cumulative Final Exam: Fri, Dec 9; time window: 3:00-9:00pm PT; time limit: 4 hours

Classroom Rules

  • No late days for submitting the PAs. No extensions on the final exam time window. Plan all your work well in advance accordingly.

  • Students are encouraged to ask questions and participate in discussions in class and on Piazza. Please raise your hand before speaking and the instructor will call on you to speak.

  • Please review all UCSD policies on pandemic-related public health and safety on this website. In particular, all are required to wear a proper mask indoors, including during lectures and OHs.

  • Please review UCSD's honor code and policies and procedures on academic integrity here. If plagiarism is detected in your code, or if we detect collusion on the graded quizzes or exams, or if you are found to be using someone else's clicker, or if any other form of academic integrity violation is identified, you will get zero for that component of your score and your grade will be downgraded substantially. I will also notify the University authorities for appropriate disciplinary action to be taken, up to and including expulsion from the University.

  • Please review UCSD's principles of community and our commitment to creating an inclusive learning environment on this website.

  • Harassment, discrimination, or intimidation of any form against any student will not be tolerated in class or on Piazza. Please review UCSD's policies on dealing with harassment and discrimination on this website.