DSC 102: Systems for Scalable Analytics (Online-Only Edition)
!!! This website is archived. Please see the website of the latest edition of this course among the links listed here. !!!
Administrivia
Lectures: Tue/Thu 11:00am-12:20pm PT on Zoom (link posted on Canvas)
Instructor: Arun Kumar; Office: 3218 CSE; Office Hours: Thu 1:00-2:00pm on Zoom (link posted on Canvas)
Discussions: Mon 1:00-1:50pm PT on Zoom (occasionally; link posted on Canvas)
TAs:
Piazza: DSC 102 (access code posted on Canvas)
Course Goals and Content
This course covers the principles of computing systems and tools for scaling data analytics to large datasets. Scalable analytics systems are a central part of modern data science in numerous application domains, including enterprise business intelligence, Web search, e-commerce, social media, natural and social sciences, healthcare, digital humanities, e-governance, and the Internet of Things. Topics include computer organization, memory hierarchy, basics of operating systems, scalable and parallel computing, cloud computing, design and use of parallel dataflow systems (MapReduce/Hadoop and Spark), machine learning systems, and the use of deep learning tools.
The course will cover how relational algebra, SQL, linear algebra, and more general dataflow operations in such systems can be used to perform data preparation and feature engineering for machine learning (ML) at scale, how to scale ML training, how to perform ML model selection and deployment at scale, and how to handle data heterogeneity. It will also introduce the implementation of such data systems and touch upon the latest research in this space. A major component of this course is hands-on Python programming to implement data exploration, data preparation, and model selection pipelines on large real-world data using scalable analytics tools and cloud resources, on both the Amazon Web Services (AWS) public cloud and SDSC's private cloud.
Course Format and Online-only Modality Instructions
Pre-requisites
Suggested Textbooks
Grading
Components
Cutoffs
Since this is my first online-only edition of this course, the grading scheme is a hybrid of absolute and relative grading to mitigate the "cold start" issue. The absolute cutoffs are based on your absolute total score. The relative bins are based on your position in the total score distribution of the class. The better of the two grades (absolute-based and relative-based) will be your final grade.
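To make the hybrid rule described above concrete, here is a minimal Python sketch of how a "better of two" grade could be computed. Every number and name in it is an assumption made purely for illustration: the cutoff values, the percentile bins, the set of letter grades, and the helper names (`grade_from_table`, `percentile_rank`, `final_grade`) are hypothetical placeholders, not the course's actual cutoffs or grading code.

```python
# Letter grades ordered from best to worst; a lower index means a better grade.
# HYPOTHETICAL simplified scale (no +/- grades), for illustration only.
GRADE_ORDER = ["A", "B", "C", "D", "F"]

# HYPOTHETICAL absolute cutoffs: (minimum total score in percent, grade).
# Placeholder values, NOT the course's real cutoffs.
ABSOLUTE_CUTOFFS = [(90.0, "A"), (80.0, "B"), (70.0, "C"), (60.0, "D"), (0.0, "F")]

# HYPOTHETICAL relative bins: (minimum percentile rank in the class, grade).
# Placeholder values, NOT the course's real bins.
RELATIVE_BINS = [(75.0, "A"), (50.0, "B"), (25.0, "C"), (10.0, "D"), (0.0, "F")]


def grade_from_table(value, table):
    """Return the grade of the first row whose minimum is <= value."""
    for minimum, grade in table:
        if value >= minimum:
            return grade
    return table[-1][1]


def percentile_rank(score, all_scores):
    """Percentage of class scores that are at or below the given score."""
    at_or_below = sum(1 for s in all_scores if s <= score)
    return 100.0 * at_or_below / len(all_scores)


def final_grade(score, all_scores):
    """Take the better of the absolute-based and relative-based grades."""
    absolute = grade_from_table(score, ABSOLUTE_CUTOFFS)
    relative = grade_from_table(percentile_rank(score, all_scores), RELATIVE_BINS)
    # min() with GRADE_ORDER.index as the key picks the grade nearest "A".
    return min(absolute, relative, key=GRADE_ORDER.index)


if __name__ == "__main__":
    class_scores = [55, 62, 68, 74, 81, 88, 93, 97]  # made-up total scores
    for s in class_scores:
        print(s, final_grade(s, class_scores))
```

Under these placeholder values, a student can only gain from the relative bins: the relative grade matters when it is higher than the absolute one, and the absolute grade acts as a floor when the class as a whole scores well.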
Non-Letter Grade Options: You have the option of taking this course for a non-letter grade. Under the P/F option, a P requires a letter grade of C- or better; under the S/U option, an S requires a letter grade of B- or better.
Exam Dates
Classroom Rules