Readings and Schedule

Date | Reading and Questions | Slides
9/26 Course Introduction and Generative AI Overview
Slides
9/29 AI Agents Overview
What is an AI agent? Definition, Examples, and Types and ReAct: Synergizing Reasoning and Acting in Language Models (ICLR '23)
Additional Readings

  1. AI agents — what they are, and how they’ll change the way we work
  2. Understanding AI Agents: How They Work, Types, and Practical Applications

Slides
10/1 Transformer Overview and LLM Performance Analysis
LLM Inference Performance Engineering: Best Practices
Questions

  1. What are the key performance metrics to look out for in LLM inference?
  2. What would be the important metrics to evaluate if you were to run a language model on your PC/tablet?

Additional Readings

  1. A Queueing Theoretic Perspective on Low-Latency LLM Inference with Variable Token Length
  2. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
  3. TensorRT-LLM

Slides
10/3 Distributed Machine Learning Training
PipeDream: Generalized Pipeline Parallelism for DNN Training (SOSP'19)
Questions

  1. By making the pipeline smoother (fewer pipeline bubbles), what tradeoff does PipeDream make? That is, in what aspect is GPipe better than PipeDream?
  2. What type of parallelism do you think is most widely adopted in practice? Why?

Additional Readings

  1. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
  2. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
  3. Memory-Efficient Pipeline-Parallel DNN Training (PipeDream-2BW)
  4. Scaling Distributed Machine Learning with the Parameter Server
  5. Optimization of Collective Communication Operations in MPICH

Slides
10/6 Distributed Machine Learning Training in Production (guest lecture by Haozheng Fan, AWS)
HLAT: High-quality Large Language Model Pre-trained on AWS Trainium (IEEE Big Data 2024)
Questions

  1. How does HLAT determine its parallelism strategies and system optimizations, and how might these trade-offs evolve as model size or cluster scale increases?
  2. Which stage or component of pre-training do you consider most critical in practice, and why?

Additional Readings

  1. Distributed training of large language models on AWS Trainium (SoCC'24)
  2. GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints (SOSP'23)
  3. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (OSDI'22)
  4. DeepSeek-V3 Technical Report
  5. s1: Simple test-time scaling

Slides
10/8 Large Language Model Inference - Pt. 1 Intro and Iterative Scheduling
Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI'22)
Questions

  1. List one benefit and one drawback of iterative scheduling (as compared to request-level scheduling).
  2. How would you decide the batch size (number of requests) for an LLM inference system you are building? List all the factors you would consider.

Additional Readings

  1. ServerlessLLM: Low-Latency Serverless Inference for Large Language Models
  2. Fairness in Serving Large Language Models
  3. Can Scheduling Overhead Dominate LLM Inference Performance? A Study of CPU Scheduling Overhead on Two Popular LLM Inference Systems

Slides
10/10 Large Language Model Inference - Pt. 2 PagedAttention
Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP'23)
Questions

  1. List at least two reasons why GPU memory for KV cache is wasted without PagedAttention.
  2. What are the tradeoffs of using larger/smaller block sizes in PagedAttention?

Additional Readings

  1. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
  2. Efficiently Programming Large Language Models using SGLang
  3. MemShare: Memory Efficient Inference for Large Reasoning Models through KV Cache Reuse
  4. KVLINK: Accelerating Large Language Models via Efficient KV Cache Reuse

Slides
10/13 Speculative Decoding (guest lecture by Reyna Abhyankar, TogetherAI)
Fast Inference from Transformers via Speculative Decoding (ICML '23)
Questions

  1. List two drawbacks of speculative decoding.
  2. In an agentic setting where LLMs are repeatedly called, can you think of a way to speculate output tokens without using a draft model (Mq in the paper)?

Additional Readings

  1. SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification (ASPLOS '24)
  2. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
  3. EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
  4. SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications
  5. Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling

Slides
10/15 GPU Hardware and Megakernel (guest lecture by Zhiyuan Guo, Cornell)
Compiling LLMs into a MegaKernel: A Path to Low-Latency Inference (and links in the blog posts as needed)
Questions

  1. What are the pros and cons of a megakernel compared to traditional, more fine-grained GPU kernels?
  2. Do you think it's feasible to build a megakernel that incorporates speculative decoding? How?

Additional Readings

  1. Mirage: A Multi-Level Superoptimizer for Tensor Programs
  2. We Bought the Whole GPU, So We're Damn Well Going to Use the Whole GPU
  3. ARK: GPU-driven Code Execution for Distributed Deep Learning
  4. FlashDMoE: Fast Distributed MoE in a Single Kernel
  5. Inside NVIDIA GPUs: Anatomy of high performance matmul kernels
  6. POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
  7. LithOS: An Operating System for Efficient Machine Learning on GPUs

Slides
10/17 Distributed Large Language Model Serving - Pt. 1 Prefix Caching
Preble: Efficient Distributed Prompt Scheduling for LLM Serving (ICLR'25)
Questions

  1. What will happen if all requests are scheduled based only on where their matched prefix caches reside?
  2. What does Preble do when many requests share the same prefix (i.e., highly skewed workloads)?

Additional Readings

  1. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving (OSDI'23)
  2. Ray Serve: Scalable and Programmable Serving

Slides
10/20 Distributed Large Language Model Serving - Pt. 2 Prefill-Decode Disaggregation (guest lecture by Junda Chen, UCSD)
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving (OSDI '24)
Questions

  1. How is DistServe designed to mitigate the overhead of KV cache migration?
  2. Do you think DistServe can scale to a very large cluster, say 100/1k/10k GPUs for serving? Why or why not?

Additional Readings

  1. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
  2. Prefill-Decode Aggregation or Disaggregation? Unifying Both for Goodput-Optimized LLM Serving
  3. Splitwise: Efficient generative LLM inference using phase splitting
  4. Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
  5. POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
  6. Nexus: Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving
  7. LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference

Slides
10/22 Distributed Large Language Model Serving - Pt. 3 Long Context Serving
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models (PODC '24)
Questions

  1. Can DeepSpeed-Ulysses work when the number of GPUs is higher than the number of heads? Why or why not?
  2. What is the latency impact for one transformer layer caused by DeepSpeed-Ulysses?

Additional Readings

  1. Sequence Parallelism: Long Sequence Training from System Perspective
  2. Tensor Parallelism and Sequence Parallelism: Detailed Analysis
  3. Introducing Context Parallelism
  4. Ring Attention with Blockwise Transformers for Near-Infinite Context
  5. DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training
  6. LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism

Slides
10/24 LoRA Serving
dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving
Questions

  1. What are the pros and cons of merged and unmerged adapters in LoRA inference?
  2. Can Algorithm 2 in the paper lead to request starvation? If so, how does dLoRA solve it?

Additional Readings

  1. LoRA: Low-Rank Adaptation of Large Language Models
  2. Punica: Multi-Tenant LoRA Serving
  3. S-LoRA: Serving Thousands of Concurrent LoRA Adapters
  4. AIBrix: Towards Scalable, Cost-Effective Large Language Model Inference Infrastructure

Slides
10/27 Augmented LLM and Their Serving
InferCept: Efficient Intercept Support for Augmented Large Language Model Inference (ICML'24)
Questions

  1. What are the three ways of dealing with KV cache when a model calls an API?
  2. Why does InferCept consider running requests that are not currently calling tools when calculating GPU memory waste?

Additional Readings

  1. Toolformer: Language Models Can Teach Themselves to Use Tools
  2. KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows
  3. Asynchronous LLM Function Calling

Slides
10/29 Mixture-of-Expert Serving (guest lecture by Ruidong Zhu, Peking University)
MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism (SIGCOMM '25)
Questions

  1. Why can disaggregating attention and FFN improve GPU utilization in MoE serving?
  2. Does the ping-pong pipeline impose any inherent constraints on the model architecture? If yes, what problems will arise when such constraints are not satisfied?

Additional Readings

  1. fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving

Slides
10/31 Reasoning Models
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning and Demystifying Delays in Reasoning: A Pilot Temporal and Token Analysis of Reasoning Systems
Questions

  1. What are the main advantages and risks of using this purely RL-based approach (without supervised fine-tuning) to incentivize chain-of-thought reasoning in LLMs?
  2. What implications do the study's results have for designing efficient large reasoning systems? Do you have any proposals for improving request latency?

Additional Readings

  1. Resa: Transparent Reasoning Models via SAEs
  2. Reinforcement Learning for Reasoning in Large Language Models with One Training Example
  3. The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
  4. ARM: Adaptive Reasoning Model
  5. Hierarchical Reasoning Model

Slides
11/3 Agent Tooling
What are Tools? and Introducing the Model Context Protocol
Questions

  1. For tool-based AI systems, do you believe we should have more autonomy (more agent-like) or less autonomy (more workflow-like)? Find a use case to justify your answer.
  2. Can you think of some challenges of benchmarking and ensuring robustness in agents that depend on external tools?

Additional Readings

  1. Announcing the Agent2Agent Protocol (A2A)
  2. Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution
  3. An LLM Compiler for Parallel Function Calling

Slides
11/5 Reinforcement Learning Infrastructure (guest lecture by Vikranth Srivatsa, UCSD and TogetherAI)
HybridFlow: A Flexible and Efficient RLHF Framework (verl)
Questions

  1. How is the LLM inference (rollout) phase in RL training different from LLM serving (for end users)? List at least two differences.
  2. Can you think of a way to improve the LLM inference phase in RL training?

Additional Readings

  1. FastGRPO: Accelerating Policy Optimization via Concurrency-aware Speculative Decoding and Online Draft Learning
  2. SPEC-RL: Accelerating On-Policy Reinforcement Learning via Speculative Rollouts
  3. StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation
  4. History Rhymes: Accelerating LLM Reinforcement Learning with RhymeRL
  5. Kimi k1.5: Scaling Reinforcement Learning with LLMs

Slides
11/7 Agent Memory and Context
MemGPT: Towards LLMs as Operating Systems
Questions

  1. What's the memory-hierarchy architecture proposed in MemGPT? What are the tradeoffs this architecture introduces in terms of latency?
  2. How would you redesign the memory hierarchy for throughput-oriented tasks?

Additional Readings

  1. DeepSeek-OCR: Contexts Optical Compression
  2. SagaLLM: Context Management, Validation, and Transaction Guarantees for Multi-Agent LLM Planning
  3. Git Context Controller: Manage the Context of LLM-based Agents like Git
  4. A Comprehensive Guide to Context Engineering for AI Agents
  5. Recursive Language Models
  6. A-Mem: Agentic Memory for LLM Agents

Slides
11/10 AI Workflow and Agent Autotuning
Cognify: Supercharging Gen-AI Workflows With Hierarchical Autotuning (KDD '25)
Questions

  1. What are the pros and cons of Cognify compared to RL-based agent tuning methods?
  2. How does Cognify decide the size (number of samples) of each layer? Do you think that's reasonable?

Additional Readings

  1. GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
  2. Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs
  3. Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
  4. Language Agents as Optimizable Graphs
  5. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

Slides
11/12 Retrieval Augmented Generation (guest lecture by Siddhant Ray, University of Chicago)
METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation (SOSP '25)
Questions

  1. How does the data from METIS's LLM profiler inform the adaptation mechanism's scheduling decisions at query time?
  2. Consider an agent that must perform both RAG and tool use. How could the core principles of METIS be extended to create a "quality-aware scheduler" for this type of agent?

Additional Readings

  1. Quake: Adaptive Indexing for Vector Search (OSDI '25)
  2. A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge
  3. HedraRAG: Coordinating LLM Generation and Database Retrieval in Heterogeneous RAG Serving (SOSP '25)
  4. Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
  5. Seven Failure Points When Engineering a Retrieval Augmented Generation System

Slides
11/14 Agent Use Cases (guest lecture by Zhenwen Shao, Director of Data Science, ML, and AI, Johnson & Johnson)
SimpleDoc: Multi-Modal Document Understanding with Dual-Cue Page Retrieval and Iterative Refinement
Questions

  1. What are the trade-offs of this dual-index approach versus relying on a single, powerful multi-modal embedding for retrieval?
  2. Discuss the system-level implications of the proposed design on: (a) offline indexing cost and complexity, (b) online query latency, and (c) handling new documents that have not been fully processed.

Additional Readings

  1. Deep Research Agents: A Systematic Examination And Roadmap
  2. Cline Autonomous Coding Agent

11/17 Agent Evaluation and Benchmarks
Survey on Evaluation of LLM-based Agents
Questions

  1. How might an agent's effective use of tools (e.g., code execution, web search) potentially obscure the evaluation of its true underlying reasoning or planning abilities?
  2. How would you design an interactive evaluation for computer-use agents?

Additional Readings

  1. Evaluation and Benchmarking of LLM Agents: A Survey
  2. AgentBench: Evaluating LLMs as Agents
  3. Establishing Best Practices for Building Rigorous Agentic Benchmarks
  4. Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation

Slides
11/19 Computer-Use Agents
OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents (CUA Workshop '25)
Questions

  1. What is the fundamental limitation of using human efficiency as the "gold standard" for an AI agent?
  2. Can you think of a way to reduce the end-to-end CUA latency?

Additional Readings

  1. A Case for Declarative LLM-friendly Interfaces for Improved Efficiency of Computer-Use Agents
  2. The Unreasonable Effectiveness of Scaling Agents for Computer Use
  3. GTA1: GUI Test-time Scaling Agent

Slides
11/21 Multi-Agent Systems (guest lecture by Guohao Li, Founder and CEO, CamelAI and EigentAI)
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society (NeurIPS '23)
Questions

  1. How does the CAMEL framework's role-playing structure mitigate the instability issues of multi-agent systems?
  2. How do you think multi-agent systems can scale (to more agents/more roles)? Any key challenges you can think of?

Additional Readings

  1. Why Do Multi-Agent LLM Systems Fail?
  2. ChatDev: Communicative Agents for Software Development
  3. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Slides
11/24 Agent Infra in Production (guest lecture by Panpan Xu, Principal Applied Scientist, AWS Bedrock)
Amazon Bedrock AgentCore
Questions

  1. Can you think of some new challenges or opportunities in serving agents in a serverless way, compared to traditional serverless services like AWS Lambda?
  2. What do you think are the key requirements for agent hosting?

Additional Readings

  1. SALT: Step-level Advantage Assignment for Long-horizon Agents via Trajectory Graph
  2. Azure AI Foundry
  3. Strands documentation
  4. AgentRL training
  5. Self-Challenging Language Model Agents

11/26 AgentOps and Agent Security
Security of AI Agents
Questions

  1. How do stateful, action-oriented vulnerabilities render traditional LLM defenses (like standard alignment or prompt filtering) insufficient?
  2. Can you think of an example where a particular agent we've introduced in the course has a vulnerability?

Additional Readings

  1. AI Agents Are Here. So Are the Threats

Slides
12/1 KV Cache as the New AI Memory Abstraction (guest lecture by Prof. Junchen Jiang, University of Chicago)
CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
Questions

  1. How does CacheBlend efficiently combine retrieved KV caches with the precomputed KV cache of a large text chunk without recomputing attention for the entire sequence?
  2. How does Selective Recomputation decide which tokens to recompute versus which to blend?

Slides
12/3 No Class. Work on your project!
12/5 Entrepreneurship in AI Infra and Agents and Course Summary
Hints for Computer System Design - Butler Lampson
Questions

Read the "Hints for Computer System Design" paper and summarize what you have learned over the course. Feel free to comment on anything else about the course.

Slides