Reading and Schedule

Below are the tentative schedule and reading list for this course.

Date Reading Lead
1/7 Course Introduction and the History of Virtualization
Yiying
Slides
1/9 Background and Virtualization Overview
Comet Book Chapter on Virtual Machine Monitors
Additional Readings

  1. Formal Requirements for Virtualizable Third Generation Architectures (Comm ACM 1974)
  2. Disco: Running Commodity Operating Systems on Scalable Multiprocessors (TOCS'97)
  3. Scale and Performance in the Denali Isolation Kernel

Yiying
Slides
1/14 Virtualizing CPU
A Comparison of Software and Hardware Techniques for x86 Virtualization (ASPLOS'06)
Questions

  1. Why is x86 un-virtualizable with trap-and-emulate? Give one example.
  2. How are jump instructions translated?
  3. With hardware virtualization extensions (e.g., Intel VT), do we still need binary translation? Why or why not?

Additional Readings

  1. The Evolution of an x86 Virtual Machine Monitor
  2. Software Techniques for Avoiding Hardware Virtualization Exits
  3. Embra: Fast and Flexible Machine Simulation
  4. Fast Dynamic Binary Translation for the Kernel
  5. Enabling Intel Virtualization Technology Features and Benefits

Yiying
Slides
1/16 Virtualizing Memory
Memory Resource Management in VMware ESX Server (OSDI'02)
Questions

  1. What is the double paging problem and what causes it?
  2. Would a malicious guest OS (or a buggy one) be able to access memory that it has swapped out during ballooning? Why or why not?
  3. What is the benefit of keeping a "hint" entry for each scanned (but unshared) page, as compared to not maintaining anything for the page?

Additional Readings

  1. Difference Engine: Harnessing Memory Redundancy in Virtual Machines
  2. Performance Evaluation of Intel EPT Hardware Assist

Yiying
Slides
1/21 Virtualizing I/O
Virtualizing I/O Devices on VMware Workstation's Hosted Virtual Machine Monitor (ATC'01)
Questions

  1. What is the downside of virtualizing I/O with a hosted architecture (VMware Workstation)?
  2. What is the minimum number of context switches (including VM/VMM switches and VMM/host switches) involved in a guest OS network send and in a receive?
  3. Name two optimization techniques that could reduce the number of world switches during network operations.

Additional Readings

  1. virtio: Towards a De-Facto Standard For Virtual I/O Devices
  2. vIC: Interrupt Coalescing for Virtual Machine Storage Device IO
  3. ELI: Bare-Metal Performance for I/O Virtualization

Yiying
Slides
1/23 Container Basics
Understanding and Hardening Linux Containers (mainly Ch 2 to Ch 5; you can ignore many of the details in these chapters. Read Ch 1 for more background on virtualization. Read other chapters if you are interested in security.)
Questions

  1. What types of isolation do Linux containers achieve?
  2. Can one Linux container affect the performance of another Linux container on the same machine (i.e., performance isolation)? Why or why not?
  3. Why do you think containers are less "secure" than virtual machines?

Additional Readings

  1. LXC/LXD
  2. Docker
  3. Understanding Security Implications of Using Containers in the Cloud
  4. Container Security: Issues, Challenges, and the Road Ahead
  5. Slacker: Fast Distribution with Lazy Docker Containers

Yiying
Slides
1/28 Kubernetes and gVisor
Kubernetes and gVisor, Quiz 1
Questions

  1. What is a Kubernetes Pod? How do you think it is useful in container orchestration?
  2. What does Kubernetes use etcd for? Why is having a consistent, atomic key-value store important for Kubernetes' control plane?
  3. Vulnerabilities in the Linux kernel make it unsafe for containers to call Linux system calls. How does gVisor solve this problem?

Additional Readings

  1. Borg, Omega, and Kubernetes (Google)
  2. The True Cost of Containing: A gVisor Case Study
  3. Container Isolation at Scale (Introducing gVisor) - Dawn Chen & Zhengyu He, Google
  4. Nabla Containers

Yiying
Slides
1/30 Serverless Computing Overview
Cloud Programming Simplified: A Berkeley View on Serverless Computing
Questions

  1. Current datacenters use containers as the host to run serverless functions. Do you think that is a good approach? Why or why not?
  2. Today's serverless functions are stateless. How do you think different functions can share data and communicate?
  3. Can you think of any security threats of serverless computing? Bonus points if you can outline a real threat/attack.

Additional Readings

  1. Amazon Lambda
  2. Google Cloud Functions
  3. Azure Functions
  4. Serverless Computing: Current Trends and Open Problems
  5. Occupy the Cloud: Distributed Computing for the 99% (PyWren)
  6. Serverless Computing: One Step Forward, Two Steps Back

Yiying
Slides
Lihao
Slides
2/4 Serverless Computing Advanced
Pocket: Elastic Ephemeral Storage for Serverless Analytics (OSDI'18), Debate on Serverless
Questions

  1. Why isn't using existing in-memory key-value stores such as Redis and Memcached a good option for storing ephemeral data in serverless computing?
  2. How does Pocket balance storage load?
  3. Do you think Pocket solves all the problems of managing state in serverless computing? If not, what do you think are the remaining problems?

Additional Readings

  1. SAND: Towards High-Performance Serverless Computing
  2. Encoding, Fast and Slow: Low-Latency Video Processing Using Thousands of Tiny Threads
  3. A Case for Serverless Machine Learning
  4. Archipelago: A Scalable Low-Latency Serverless Platform
  5. Cloudburst: Stateful Functions-as-a-Service

Yiying
Slides
2/6 Library OS
Unikernels: Library Operating Systems for the Cloud (ASPLOS'13)
Questions

  1. Name one benefit and one drawback of compiling a single-image VM.
  2. A unikernel runs in a single address space. Give one example of how this design helps improve performance.
  3. Comparing gVisor and Unikernels, which one do you think is more secure and which is more lightweight?

Additional Readings

  1. Unikernels as Processes
  2. Rethinking the Library OS from the Top-Down
  3. Mirage OS
  4. Nabla Containers
  5. ClickOS and the Art of Network Function Virtualization
  6. Libra: a library operating system for a JVM in a virtualized execution environment
  7. Exokernel: an operating system architecture for application-level resource management

Yiying
Slides
2/11 Para-Virtualization
Xen and the Art of Virtualization (SOSP'03)
Questions

  1. Why can Xen allow guest OS system call handlers to be accessed directly (without any ring-0 Xen involvement) but not the guest page fault handler?
  2. What's the benefit of using asynchronous event notifications from Xen to a VM?
  3. What goals of Xen are not valid or less valid in today's cloud environments?

Additional Readings

  1. Understanding Full Virtualization, Paravirtualization, and Hardware Assist
  2. Safe Hardware Access with the Xen Virtual Machine Monitor
  3. Optimizing Network Virtualization in Xen
  4. Measuring CPU Overhead for I/O Processing in the Xen Virtual Machine Monitor
  5. Breaking Up is Hard to Do: Security and Functionality in a Commodity Hypervisor (SOSP'11)

Xinyuan, Tianrui, Xiaohan
Slides
2/13 Light VM
My VM is Lighter (and Safer) than your Container (SOSP'17), Quiz 2
Questions

  1. What are the benefits of LightVM (Tinyx) over Unikernels?
  2. Why does device creation dominate VM creation when there are fewer concurrent VMs, while XenStore interactions dominate VM creation when there are more concurrent VMs?
  3. In what case is guest OS modification required to run with LightVM?

Additional Readings

  1. OSv - Optimizing the OS for VMs
  2. Kata Containers
  3. Jitsu: Just-In-Time Summoning of Unikernels

Yiying
Slides
2/18 KVM and Dune
KVM, Dune: Safe User-level Access to Privileged CPU Features (OSDI'12)
* It is OK if you only understand the very high-level ideas of Dune
Questions

  1. Why does Linux need to notify KVM when it preempts a process that has guest state or when it flushes the PTEs of KVM guests?
  2. How is a Dune process different from a regular VM?

Additional Readings

  1. KVM Documentation

Yiying
Slides
2/20 VM Migration and Replication
Live Migration of Virtual Machines (NSDI'05)
Questions

  1. How does the system proposed in this paper determine when to end the pre-copy phase and perform a stop-and-copy?
  2. What do you think are the pros and cons of managed migration and self migration?
  3. Can you think of at least one new challenge and one new opportunity when designing a system for live container migration?

Additional Readings

  1. The Design and Implementation of Zap: A System for Migrating Computing Environments (OSDI'02)
  2. XvMotion: Unified Virtual Machine Migration over Long Distance (ATC'14)
  3. VMware vMotion
  4. Remus: High Availability via Asynchronous Virtual Machine Replication (NSDI'08)
  5. RemusDB: Transparent High Availability for Database Systems

Yiying
Slides
2/25 Security
When Virtual is Harder than Real: Security Challenges in Virtual Machine Based Computing Environments (HotOS'05)
Secure Container Isolation: Problem Statement & Solution Space
Questions

  1. Can you think of a drawback of enforcing security mechanisms at the hypervisor level (compared to at the guest OS or above)?
  2. Choose one type in the "Attack Surfaces" section of the Google doc (the last section), and describe how sandboxing can help defend against that type of attack.
  3. Name one vulnerability of containers that cannot be addressed by sandboxing.

Additional Readings

  1. When Virtual Is Better Than Real (HotOS'01)
  2. Secure Pods: Sandboxing workloads in Kubernetes
  3. TrustVisor: Efficient TCB Reduction and Attestation
  4. SecVisor: A Tiny Hypervisor to Provide Lifetime Kernel Code Integrity for Commodity OSes (SOSP'07)
  5. Breaking Up is Hard to Do: Security and Functionality in a Commodity Hypervisor (SOSP'11)
  6. InkTag: Secure Applications on an Untrusted Operating System (ASPLOS'13)
  7. Overshadow: A Virtualization-Based Approach to Retrofitting Protection in Commodity Operating Systems
  8. VirtuOS: An Operating System with Kernel Virtualization
  9. SCONE: Secure Linux Containers with Intel SGX
  10. Understanding Security Implications of Using Containers in the Cloud (ATC'17)
  11. Container Security: Issues, Challenges, and the Road Ahead

Ian
Slides
Yiying
Slides
2/27 Virtualizing non-CPU Processors (Accelerators)
A Full GPU Virtualization Solution with Mediated Pass-Through (ATC'14)
Sharing, Protection and Compatibility for Reconfigurable Fabric with AmorphOS (OSDI'18)
* It is OK to only understand the very high-level ideas of AmorphOS.
Questions

  1. Why does gVirt use a fixed and large (16ms) time slice to schedule virtual GPUs?
  2. Name at least one place where virtualizing GPU memory is different from virtualizing CPU memory.
  3. What makes it hard to space-multiplex an FPGA? (Bonus point for answering an additional question: "What makes it hard to time-multiplex (dynamically schedule) an FPGA?")

Additional Readings

  1. Accelerating & Optimizing HPC/ML on vSphere Leveraging NVIDIA GPU (2019/02 talk)
  2. GPUvm: Why Not Virtualizing GPUs at the Hypervisor? (ATC'14)
  3. PTask: Operating System Abstractions To Manage GPUs as Compute Devices (SOSP'11)

Yiying
Slides
Yizhou Shan
Slides
3/3 Amazon Nitro
Amazon Nitro (esp. the video talk on that page)
Questions

  1. With Amazon Nitro, virtualization functions are mostly offloaded to hardware. Do we still need a hypervisor (or an OS)? Can everything just run in user space and interact with Nitro cards directly?
  2. Can you think of a drawback of offloading tasks to hardware (i.e., Nitro's approach)?

Vansh, Kunlin
Slides
3/5 Amazon Firecracker
Firecracker: Lightweight Virtualization for Serverless Applications (NSDI'20)
Questions

  1. What is the benefit of Firecracker over gVisor in terms of the specific goals Amazon has for its cloud production environments?
  2. What mechanism(s) allow Firecracker to run thousands of MicroVMs on the same machine (with a 10x-20x oversubscription rate)?
  3. Why do you think Firecracker (when deployed to power AWS Lambda) runs one process (one slot) in one MicroVM?

Additional Readings

  1. Amazon Firecracker Git repo
  2. Nabla Containers
  3. Kata Containers

Yiying
Slides
3/10 Course Summary
Hints for Computer System Design -- Butler Lampson
Questions

Read the "Hints for Computer System Design" paper and summarize what you have learned over the course. Feel free to comment on anything else about the course.

Yiying
Slides
3/12 Project Presentations