DSC 102: Systems for Scalable Analytics
Administrivia
Lectures: MWF 3:00-3:50pm PT at CENTR 109
Instructor: Arun Kumar
Discussions: Mon 5:00-5:50pm PT at CENTR 109 (only occasionally)
TAs:
Aniruddha Das (andas [at] ucsd.edu)
Areeb Syed (aas050 [at] ucsd.edu)
Trevor Tuttle (tjtuttle [at] ucsd.edu)
The TA office hours are distributed non-uniformly. See the detailed schedules on the TA OHs page.
Piazza: DSC 102 (access code posted on Canvas)
Course Goals and Content
This course covers the principles of computing systems and tools for scaling data analytics to large datasets.
Scalable analytics systems are a central part of modern data science in numerous application domains spanning
enterprise business intelligence, Web search, e-commerce, social media, natural and social sciences, healthcare,
digital humanities, e-governance, Internet of Things, and more.
Topics include basics of computer organization, memory hierarchy, operating systems, and cloud computing;
principles of scalable and parallel data-intensive computing;
design and use of parallel dataflow systems (MapReduce/Hadoop and Spark);
and scaling of end-to-end machine learning (ML) workloads.
It will cover how relational algebra, SQL, linear algebra, and more general dataflow operations in such systems
can be used to perform data preparation and feature engineering for ML at scale, how to scale
ML model building, and how to handle data heterogeneity.
A major component of this course is hands-on Python programming to implement data exploration, data preparation,
and model selection pipelines on large real-world data using scalable analytics tools and cloud resources, both
Amazon Web Services (AWS) public cloud and SDSC's private cloud.
Course Format
Pre-requisites
DSC 100 (Introduction to Data Management); or substantial practical experience with
scalable data systems and ML algorithms, subject to the consent of the instructor.
Proficiency in Python programming.
Suggested Textbooks
Computer Organization and Design (5th edition), by David Patterson and John Hennessy (aka the "CompOrg Book").
Operating Systems: Three Easy Pieces, by Remzi and Andrea Arpaci-Dusseau (aka the "Comet Book").
Database Management Systems (3rd edition), by Raghu Ramakrishnan and Johannes Gehrke (aka the "Cow Book").
Spark: The Definitive Guide (1st edition), by Bill Chambers and Matei Zaharia (aka the "Spark Book").
Data Management in Machine Learning Systems, by Matthias Boehm, Arun Kumar, and Jun Yang (aka the "MLSys Book").
Grading
Components
Programming Assignments: 8% + 16% + 16%
Midterm Exam: 15%
Cumulative Final Exam: 35%
In-class Peer Instruction Activities: 10%
Extra Credit Peer Evaluation Activities: 4% (likely)
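As a rough illustration of how the component weights above combine into a total score, here is a minimal Python sketch. The component names (`pa1`, `pa2`, `pa3`, and so on), the mapping of the 8% weight to the first assignment, and the idea that extra credit is simply added on top of the 100 base points are all assumptions made for this example; the syllabus does not specify them.

```python
# Weights (in percentage points) taken from the Components list.
# The keys are hypothetical names chosen for this sketch.
WEIGHTS = {
    "pa1": 8,
    "pa2": 16,
    "pa3": 16,
    "midterm": 15,
    "final": 35,
    "peer_instruction": 10,
}
EXTRA_CREDIT_WEIGHT = 4  # listed as "likely", so treat as tentative


def total_score(fractions, extra_credit_fraction=0.0):
    """Combine per-component scores (each a fraction in [0, 1]) into a total.

    Assumption: extra credit is added on top of the base 100 points.
    """
    base = sum(WEIGHTS[name] * fractions[name] for name in WEIGHTS)
    return base + EXTRA_CREDIT_WEIGHT * extra_credit_fraction


if __name__ == "__main__":
    perfect = {name: 1.0 for name in WEIGHTS}
    print(total_score(perfect))       # base points only
    print(total_score(perfect, 1.0))  # base points plus full extra credit
```

The base weights sum to 100, so a total above 100 is only possible through the extra credit component under the assumption stated above.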
Cutoffs
The grading scheme is a hybrid of absolute and relative grading.
The absolute cutoffs are based on your absolute total score.
The relative bins are based on your position in the total score distribution of the class.
The better of the two grades (absolute-based and relative-based) will be your final grade.
Grade | Absolute Cutoff (>=) | Relative Bin (Use strictest) |
A+ | 95 | Highest 5% |
A | 90 | Next 10% (5-15) |
A- | 85 | Next 15% (15-30) |
B+ | 80 | Next 15% (30-45) |
B | 75 | Next 15% (45-60) |
B- | 70 | Next 15% (60-75) |
C+ | 65 | Next 5% (75-80) |
C | 60 | Next 5% (80-85) |
C- | 55 | Next 5% (85-90) |
D | 50 | Next 5% (90-95) |
F | < 50 | Lowest 5% |
Example: Suppose your total score is 82 and your percentile is 33, i.e., you rank 67% from the top of the class. The relative grade is then B- (the 60-75 bin), while the absolute grade is B+ (since 82 >= 80). The final grade is therefore B+.
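The hybrid scheme can be sketched in a few lines of Python. This is only an illustration, not the official grading code. It assumes that "percentile" means the percent of the class scoring below you (so the rank from the top is 100 minus the percentile, which is the reading consistent with the example above), and it interprets "Use strictest" as placing a rank that falls exactly on a bin edge into the lower bin. The function and constant names are hypothetical.

```python
# Grades from best to worst, with the cutoffs from the table.
GRADES = ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D", "F"]
ABSOLUTE_CUTOFFS = [95, 90, 85, 80, 75, 70, 65, 60, 55, 50]   # F is below 50
RELATIVE_BIN_EDGES = [5, 15, 30, 45, 60, 75, 80, 85, 90, 95]  # F is the lowest 5%


def absolute_grade(total_score):
    """Return the first grade whose absolute cutoff the score meets."""
    for grade, cutoff in zip(GRADES, ABSOLUTE_CUTOFFS):
        if total_score >= cutoff:
            return grade
    return "F"


def relative_grade(percentile):
    """Return the grade for a percentile (percent of class scoring below you).

    Assumption: a rank exactly on a bin edge falls into the lower bin.
    """
    rank_from_top = 100 - percentile
    for grade, upper_edge in zip(GRADES, RELATIVE_BIN_EDGES):
        if rank_from_top < upper_edge:
            return grade
    return "F"


def final_grade(total_score, percentile):
    """Return the better of the absolute-based and relative-based grades."""
    candidates = [absolute_grade(total_score), relative_grade(percentile)]
    return min(candidates, key=GRADES.index)  # lower index = better grade


if __name__ == "__main__":
    print(final_grade(82, 33))  # B+
```

With a total score of 82 and a percentile of 33, the sketch yields B+ (absolute) and B- (relative), so the final grade is B+, matching the example.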
Non-Letter Grade Options: You have the option of taking this course for a non-letter grade.
A P in the P/F option requires a letter grade of C- or better;
an S in the S/U option requires a letter grade of B- or better.
Exam Dates
Midterm Exam: Wed, Nov 2; in class, i.e., 3:00-3:50pm CENTR 109
Cumulative Final Exam: Fri, Dec 9; time window: 3:00-9:00pm PT; time limit: 4 hours
Classroom Rules
Please review UCSD's honor code and its policies and procedures on academic integrity.
If plagiarism is detected in your code, if we detect collusion on the graded quizzes or exams,
if you are found to be using someone else's clicker,
or if any other form of academic integrity violation is identified, you will receive a zero for
that component of your score and your course grade will be lowered substantially. I will also notify the University
authorities so that appropriate disciplinary action can be taken, up to and including
expulsion from the University.
Harassment, discrimination, or intimidation of any form against any student will not be
tolerated in class or on Piazza. Please review UCSD's policies on dealing with harassment and
discrimination.