Dependability is critically important for today's systems and applications.
Computer hardware, software, disks, networks, and configurations are all
subject to faults, which can eventually manifest as visible failures and cause
damage. This CSE291 course will cover topics ranging from classic
fault-tolerant computing and error detection techniques for hardware and
software faults to failure diagnosis and recovery in today's data centers. The
format will include lectures by me, student presentations and discussions, and
a couple of invited speakers from industry, such as Splunk and Google, on
relevant topics. In the class project, you will inject various faults, such as
network failures and node crashes, into popular open source software of your
choice (e.g., ZooKeeper, HDFS, MySQL) to see how many issues you can expose in
that software.
Class hours: Tue/Thu 2-3:20pm, Lectures: ZOOM
Instructor: Prof. Yuanyuan Zhou
Office hours: Tue/Thu 3:30-4:30, ZOOM
Graduate Course Assistants: Yudong Wu
Course Project: Canvas
Class Schedule and Videos: Canvas
Textbook: No textbook. The course will use technical conference and journal
papers. You are expected to get the papers from IEEE Xplore or the ACM Digital
Library.
Reference Books:
1. I. Koren and C. Mani Krishna, Fault-tolerant Systems, 1st edition, 2007,
   Morgan Kaufmann.
2. D. P. Siewiorek and R. S. Swarz, Reliable Computer Systems - Design and
   Evaluation, 3rd edition, 1998, A.K. Peters, Limited.
3. D. K. Pradhan, ed., Fault Tolerant Computer System Design, 1st edition,
   1996, Prentice-Hall.
4. K. Trivedi, Probability and Statistics with Reliability, Queuing and
   Computer Science Applications, 2nd edition, 2001, John Wiley & Sons.
Grade Allocation:
· Course project: 50%
  o Milestone 1: 10%
  o Milestone 2 (In-class proposal presentation): 5%
  o Milestone 3: 10%
  o Milestone 4 (Final presentation): 5%
  o Final report: 15%
  o Submission of list of issues: 5%
· Class paper presentation: 30%
· Quizzes: 20% (3 quizzes; the top 2 scores count)
Schedule:
Date | Format | Topics
Jan 5th | Lecture | Class overview, intro
Jan 7th | Lecture | Basic Concepts, Taxonomy
Jan 12th | Lecture | Failure Characteristic Studies
Jan 14th | Lecture | Error coding and detection
Jan 19th | Lecture | Human Errors
Jan 21st | Student Presentation | Fault injection techniques and tools
 | Student Presentation | Why Do Computers Stop And What Can Be Done About It?
Jan 28th | Student Presentation | SIFT: Design and analysis of a fault-tolerant computer for aircraft control
Feb 2nd | Student Presentation | A Survey of Rollback-Recovery Protocols in Message-Passing Systems
Feb 4th | Student Presentation | Implementing fault-tolerant services using the state machine approach: a tutorial
Feb 9th | Student Presentation |
Feb 11th | Project Proposal | Project proposals (10 min each project)
Feb 16th | Student Presentation |
Feb 18th | Student Presentation | Recovery-oriented computing (ROC): Motivation, definition, techniques, and case studies
Feb 23rd | Student Presentation | Enhancing Server Availability and Security Through Failure-Oblivious Computing
Feb 25th | Student Presentation |
March 2nd | Student Presentation | Systems approaches to tackling configuration errors: A survey
March 4th | TBD | TBD
March 9th | Project Final presentations |
March 11th | Project Final presentations |