DSC 102: Systems for Scalable Analytics
!!! This website is archived. Please see the website of the latest edition of this course among the links listed here. !!!
Administrivia
Instructor: Arun Kumar; Office: 3218 CSE; Office Hours: Thu 2:00-3:00pm
Lectures: Tue/Thu 12:30-1:50pm; PCYNH 106
Discussions: Fri 8:00-8:50am; CENTR 115
TAs:
Supun Nakandala (snakanda [at] eng.ucsd.edu); Office Hours: See announcement below
Vraj Shah (vps002 [at] eng.ucsd.edu); Office Hours: Closed
Yuhao Zhang (yuz870 [at] eng.ucsd.edu); Office Hours: See announcement below
Piazza: DSC 102
Announcements
New! Final exam scores and answers have been released on Canvas. A summary with statistics on the class performance on the final exam, comparison with the midterm exam, and a histogram of tentative grades is provided in this short video: MP4 Video
New! PA 2 solution discussion slides and video have been released on the schedule page. Grades will be posted on Canvas by 3/16.
New! Supun and Yuhao will hold their last set of office hours related to PA 2 matters as given below. Zoom links will be given in their Piazza post.
Thu (3/19) 1:00-2:00pm: Supun
Thu (3/19) 2:00-2:30pm: Yuhao
Fri (3/20) 1:30-2:00pm: Yuhao
Course Overview and Content
This course covers the principles of computing systems and tools for scaling data analytics to large datasets.
Scalable analytics systems are a central part of modern data science in numerous application domains spanning
enterprise business intelligence, Web search, e-commerce, social media, natural and social sciences, healthcare,
digital humanities, e-governance, Internet of Things, and more.
Topics include computer organization, memory hierarchy, basics of operating systems, scalable and parallel
computing, cloud computing, design and use of parallel dataflow systems, and the use of deep learning tools.
It will cover how relational algebra, SQL, linear algebra, and more general dataflow operations in such systems
can be used to perform data preparation and feature engineering for machine learning (ML) at scale, how to scale
ML training, how to perform ML model selection and deployment at scale, and how to handle data heterogeneity.
It will also introduce the implementation of such data systems and touch upon the latest research in this space.
A major component of this course is hands-on Python programming to implement data exploration, data preparation,
and model selection pipelines on large real-world data using scalable analytics tools and cloud resources.
Course Format
This course will have one in-class midterm exam and one cumulative final exam.
If you miss an exam, you will get no credit for it unless you pre-notify the instructor with a certifiable medical
or emergency reason; in such cases, your grade will be based on a proportional reweighting of the other components.
There will be 7 short in-class surprise quizzes on random lecture dates to help you revise the material.
Each quiz will be only 7 minutes long and will have 7 multiple-choice questions that need to be answered using
iClickers.
The grading is binary: if you answer at least 4 of the questions correctly, you get full credit for that quiz;
otherwise, you get no credit for that quiz. If you are absent, you get no credit by default.
If you miss a quiz due to an absence that was pre-notified with a certifiable medical or emergency reason, that
quiz will be discounted for you and your score for this component will be reweighted accordingly.
At the end of the class, only your 5 best quiz scores will be used for grading.
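The quiz rules above can be sketched in a few lines of Python. This is only an illustration, not the official grading script: the function names are hypothetical, and the reweighting for excused absences is left out because its exact formula is not given here.

```python
def quiz_credit(num_correct, attended=True):
    """Binary credit for one quiz: 1 if at least 4 of the 7 answers are correct, else 0."""
    if not attended:
        return 0  # an absence earns no credit by default
    return 1 if num_correct >= 4 else 0


def quiz_component(credits, best_n=5):
    """Fraction of full quiz credit earned, counting only the best_n quizzes."""
    best = sorted(credits, reverse=True)[:best_n]
    return sum(best) / best_n


# Hypothetical student: number of correct answers on each of the 7 quizzes.
correct_counts = [7, 3, 4, 0, 6, 5, 2]
credits = [quiz_credit(c) for c in correct_counts]
component = quiz_component(credits)
```

In this made-up case, four quizzes earn credit, so the best five contribute 4 out of 5.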
Pre-requisites
DSC 100 (Introduction to Data Management); or substantial practical experience with scalable data systems and ML,
subject to the consent of the instructor.
Proficiency in Python programming.
Suggested Textbooks
Computer Organization and Design (5th edition), by David Patterson and John Hennessy (aka the "CompOrg Book").
Operating Systems: Three Easy Pieces, by Remzi and Andrea Arpaci-Dusseau (aka the "Comet Book").
Database Management Systems (3rd edition), by Raghu Ramakrishnan and Johannes Gehrke (aka the "Cow Book").
Spark: The Definitive Guide (1st edition), by Bill Chambers and Matei Zaharia (aka the "Spark Book").
Data Management in Machine Learning Systems, by Matthias Boehm, Arun Kumar, and Jun Yang (aka the "MLSys Book").
Grading
Components
Cutoffs
Since this is the very first edition of this course, the grading scheme is a hybrid of absolute and relative grading
to mitigate the "cold start" issue. The absolute cutoffs are based on your absolute total score.
The relative bins are based on your position in the total score distribution of the class.
The better grade among the two (absolute-based and relative-based) will be your final grade.
The cutoffs listed below offer a minimum guarantee on your grade; some thresholds might be lowered slightly later
by the instructor but they will not be raised.
Grade | Absolute Cutoff (>=) | Relative Bin (Use strictest) |
A+ | 95 | Highest 5% |
A | 90 | Next 10% (5-15) |
A- | 85 | Next 15% (15-30) |
B+ | 80 | Next 15% (30-45) |
B | 75 | Next 15% (45-60) |
B- | 70 | Next 15% (60-75) |
C+ | 65 | Next 5% (75-80) |
C | 60 | Next 5% (80-85) |
C- | 55 | Next 5% (85-90) |
D | 50 | Next 5% (90-95) |
F | < 50 | Lowest 5% |
Example: Suppose the total score is 82 and the percentile is 33, which places the score about 67% from the top of the class. The relative grade is then B- (the 60-75 bin), while the absolute grade is B+ (score >= 80). The final grade is therefore B+.
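The hybrid scheme can be sketched as a small Python lookup. This is a hedged illustration rather than the instructor's actual procedure: the function names are hypothetical, the percentile is assumed to count from the bottom of the class (the reading consistent with the example), and treating each bin's upper edge as inclusive is also an assumption.

```python
GRADES = ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D", "F"]
# Minimum total score for each grade above F, from the cutoff table.
ABS_CUTOFFS = [95, 90, 85, 80, 75, 70, 65, 60, 55, 50]
# Upper edge of each relative bin, measured in percent from the top of the class.
REL_UPPER = [5, 15, 30, 45, 60, 75, 80, 85, 90, 95]


def absolute_grade(score):
    """Grade from the absolute cutoffs alone."""
    for grade, cutoff in zip(GRADES, ABS_CUTOFFS):
        if score >= cutoff:
            return grade
    return "F"


def relative_grade(percentile):
    """Grade from the relative bins; percentile is assumed to count from the bottom."""
    from_top = 100 - percentile
    for grade, upper in zip(GRADES, REL_UPPER):
        if from_top <= upper:  # inclusive upper edge is an assumption
            return grade
    return "F"


def final_grade(score, percentile):
    """The better of the absolute-based and relative-based grades."""
    candidates = [absolute_grade(score), relative_grade(percentile)]
    return min(candidates, key=GRADES.index)  # earlier in GRADES means better


example = final_grade(82, 33)
```

For the example's inputs, a score of 82 at percentile 33, the sketch gives B+ on the absolute scale, B- on the relative scale, and B+ overall.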
Exam Dates
Midterm Exam: Thursday, 02/06, in class
Cumulative Final Exam: Tuesday, 03/17, 11:30am to 2:30pm, Room TBD
Classroom Rules
You are encouraged to ask questions and participate in class discussions.
Please raise your hand before asking questions or speaking during the lectures.
Harassment or intimidation of any form against other students will not be tolerated in class.
If plagiarism is detected in your code or if cheating is detected during an exam,
the University authorities will be notified immediately for appropriate disciplinary action to be taken.
You will also get zero on that entire component of your grade.