CSE 121: Operating Systems, Architecture and Implementation
Fall 2003
Discussion Notes: 12/01

I. Administration
  - Project 3 deadline: 12/04
  - Final Exam Review Session: Friday 12/05, 10:30-11:50 in 2321 HSS.

II. Microkernels: L4

"On microkernel construction"
Jochen Liedtke (1995)

Motivation:
-----------
Why use microkernels?
  + We want more reliability and portability.
  - We lose performance.

Arguments against microkernels fell into 2 categories:
  1) Abstraction too low-level: extra context-switch overhead.
  2) Abstraction too high-level (the exokernel argument); Mach was slow (an implementation of Mach-Linux was 50% slower than Linux).

Goal:
-----
Liedtke believed that poor performance was the result of poor implementation, not something fundamental to microkernel architectures. He wanted to build a fast microkernel.

What are the minimal functions of a kernel?
-------------------------------------------
The kernel should minimally provide protection and communication. How are we going to provide these? Address spaces and IPC.

Q: What were the bottlenecks of existing microkernel systems?
A: IPC and inefficient context switching.

Liedtke identified 3 major pieces a microkernel must support:
  1) Address Spaces
  2) Threads
  3) IPC

1) Address Spaces: the kernel implements only 3 mechanisms:
   a) mapping: the owner of an address space allows a recipient to map a set of its pages.
   b) flushing: the owner of an address space removes its mappings of a set of pages from all other address spaces that received them (directly or indirectly) from it; its own mappings remain.
   c) granting: removes the mapping of pages from the owner and maps them into the recipient.
   - Now user-level memory managers and pagers can implement the virtual memory subsystem.
   Q: Why is the grant operation required?

2) Threads: have a register set, $IP, $SP, state, and address space information.
   Do we need them? We need some fundamental unit of execution, and threads serve as that.
   What do threads need?

3a) IPC: some form of communication between threads.
3b) UIDs: identification of threads.
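To make the three address-space mechanisms concrete, here is a minimal toy model in Python. It is a sketch under stated assumptions, not L4's actual interface: the class and method names, the `(space, page)` keys, and the `install_root` bootstrap (standing in for an initial pager that owns physical memory) are all hypothetical. Each mapping remembers the mapping it was derived from, so a flush can recursively revoke everything downstream, while a grant moves a page so that the recipient takes the granter's place.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Key = Tuple[int, int]  # hypothetical key: (address-space id, virtual page number)


@dataclass
class Mapping:
    frame: int              # physical frame backing this page
    source: Optional[Key]   # mapping this one was derived from; None = root


class AddressSpaces:
    """Toy model of recursive address spaces (map / grant / flush)."""

    def __init__(self) -> None:
        self.pages: Dict[Key, Mapping] = {}

    def install_root(self, space: int, page: int, frame: int) -> None:
        # Hypothetical bootstrap: an initial space owns a physical frame outright.
        self.pages[(space, page)] = Mapping(frame, None)

    def map(self, src: Key, dst: Key) -> None:
        # The owner shares the page with the recipient and keeps its own mapping.
        if src not in self.pages or dst in self.pages:
            raise ValueError("source must be mapped and destination free")
        self.pages[dst] = Mapping(self.pages[src].frame, src)

    def grant(self, src: Key, dst: Key) -> None:
        # The page moves: the recipient takes the granter's place in the
        # derivation tree, and the granter loses the page.
        if src not in self.pages or dst in self.pages:
            raise ValueError("source must be mapped and destination free")
        self.pages[dst] = self.pages.pop(src)
        for m in self.pages.values():  # re-parent pages derived from the granter
            if m.source == src:
                m.source = dst

    def flush(self, owner: Key) -> None:
        # Remove every mapping derived (directly or indirectly) from `owner`;
        # the owner's own mapping is left untouched.
        children = [k for k, m in self.pages.items() if m.source == owner]
        for child in children:
            self.flush(child)
            del self.pages[child]


spaces = AddressSpaces()
spaces.install_root(0, 0, frame=42)  # space 0 acts as the initial pager
spaces.map((0, 0), (1, 3))           # the pager maps the page into space 1
spaces.map((1, 3), (2, 5))           # space 1 re-maps it into space 2
spaces.grant((2, 5), (3, 1))         # space 2 gives its copy to space 3
print(sorted(spaces.pages))          # [(0, 0), (1, 3), (3, 1)]
spaces.flush((0, 0))                 # the pager revokes the page everywhere
print(sorted(spaces.pages))          # [(0, 0)]
```

After the grant, space 2 no longer holds the page, yet space 3's copy is still revoked by the pager's flush, because it inherited space 2's position in the derivation chain.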
Performance Results
-------------------
1) Kernel-User Switching: kernel-user mode switches are not a conceptual problem but an implementation one. In this position paper, they perform this rough analysis:
   - The measured cost of a kernel call is 900 cycles.
   - Analysis shows that the lower bound on a kernel call is:
         71 cycles for entering the kernel
       + 36 cycles for returning to user mode
       = 107 cycles
   - Is the almost 90% overhead ((900 - 107) / 900, about 88%) necessary? Most of the cycles are due to TLB and cache misses.
   - L3 achieved an implementation with only 57 cycles of overhead.
   Conclusion: kernel-user switches can be implemented much more efficiently.

   What happened when they actually implemented it [Liedtke97]? Performance of the getpid() system call:
       Lower bound:                     82 cycles
       Native Linux:                   223 cycles
       L4 Linux:                       526 cycles
       L4 Linux + trampoline code:     733 cycles
       kMach-Linux:                   2050 cycles
       uMach-Linux:                  14710 cycles
   (kMach/uMach: Mach-based Linux with the Linux server running in kernel mode/user mode.)

2) Thread Switching: IPC can be implemented fast enough to handle hardware interrupts that are normally handled by the kernel.
   Observation: most of the overhead is due to TLB misses and flushes.
   Solution: use a tagged TLB.
   Note: they couldn't actually rely on the existence of a tagged TLB (it is hardware dependent), so they approximated one using segment registers; switching between small address spaces then requires only reloading segment registers instead of flushing the TLB.

3) Memory: Mach's memory subsystem performance can be improved.
   Observations: Mach's MCPI (memory cycles per instruction) was much higher than Ultrix's and a major source of performance loss. Mach's MCPI was mainly due to more capacity cache misses; the large working set of the microkernel is at fault.
   Solution: reduce the working set size of the microkernel. How?
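The effect of tagging in item 2 can be illustrated with a toy TLB that counts misses while two address spaces alternate, as in an IPC ping-pong. This is a hypothetical sketch: the `TLB` class, the `ping_pong` workload, and the page-table layout are invented for illustration, capacity limits and replacement are ignored, and it does not model the segment-register approximation L4 used on hardware without a tagged TLB.

```python
from typing import Dict, Tuple


class TLB:
    """Toy TLB mapping (tag, virtual page) -> physical frame.

    Untagged: every address-space switch must flush all entries.
    Tagged: entries carry the owning address-space id and survive switches.
    """

    def __init__(self, tagged: bool) -> None:
        self.tagged = tagged
        self.entries: Dict[Tuple[int, int], int] = {}
        self.current = 0
        self.misses = 0

    def switch_to(self, space: int) -> None:
        self.current = space
        if not self.tagged:
            self.entries.clear()  # untagged entries would be stale in the new space

    def translate(self, vpage: int, page_table: Dict[Tuple[int, int], int]) -> int:
        key = (self.current if self.tagged else 0, vpage)
        if key not in self.entries:
            self.misses += 1  # refill from the (hypothetical) page table
            self.entries[key] = page_table[(self.current, vpage)]
        return self.entries[key]


def ping_pong(tagged: bool, rounds: int = 100, pages: int = 4) -> int:
    # Hypothetical workload: spaces 1 and 2 take turns, and each touches
    # the same `pages` virtual pages on every turn.
    page_table = {(s, v): 100 * s + v for s in (1, 2) for v in range(pages)}
    tlb = TLB(tagged)
    for _ in range(rounds):
        for space in (1, 2):
            tlb.switch_to(space)
            for v in range(pages):
                tlb.translate(v, page_table)
    return tlb.misses


print(ping_pong(tagged=False), ping_pong(tagged=True))  # 800 8
```

Without tags, every switch discards the working set, so misses grow with the number of switches; with tags, only the first touch of each page misses.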
Results: Hard to say; look at some macrobenchmarks.
A macrobenchmark [Liedtke97]: real time for compiling the Linux server

   System           Time       Overhead vs. Linux
   Linux            476 s       -
   L4 Linux         506 s       +6.3%
   L4 Linux + tc    509 s       +6.9%
   kMach-Linux      555 s      +16.6%
   uMach-Linux      605 s      +27.1%
   (tc = trampoline code)

Conclusion: for most practical applications, L4 imposes a 5-10% overhead over monolithic Linux.

III. Exokernels

"Exterminate All Operating System Abstractions"
Dawson Engler, M. Frans Kaashoek (1995)

Motivation:
-----------

Goal:
-----
"The operating system is basically hardware masquerading as software: it cannot be changed, all applications must use it, and the information it hides cannot be recovered."

The operating system should only multiplex physical resources and _not_ abstract them.

Why is abstracting the physical resources bad?
  1) Poor reliability: abstracting resources involves a lot of complex code and decreases the reliability of the system.
  2) Poor adaptability: the OS is large and tied to all applications, so it does not support change well.
  3) Poor performance: needless and sometimes redundant OS abstractions consume resources and harm performance. This is the end-to-end argument (e.g., how can the OS optimize without knowing which cases to optimize for?).
  4) Poor flexibility: applications that want their own abstractions must build them on top of the OS's high-level abstractions, which is too expensive. Abstraction usually makes it impossible to access the raw device interfaces one is interested in.
  Note: see "End-to-End Arguments in System Design" by J.H. Saltzer, D.P. Reed, and D.D. Clark if you don't understand the last two points.

What should the operating system provide?
  The exokernel exports a hardware interface to allocate, deallocate, and multiplex physical resources.
  [Draw picture on board, Fig. 1 in Kaashoek97sosp]
  1) Address Spaces: only provides a few bootstrapping page-table mappings.
  2) Processes: the exokernel only provides an address space, exception program counters, and the ability to call prologue and epilogue code for time-slicing. The application defines the prologue and epilogue code.
  3) IPC: just a transfer of control to another protection domain. Applications define what needs to be passed back and forth.

What might this buy you?
  1) Reliability: fewer abstractions -> less complexity to manage.
  2) Adaptability: modifications are pushed to the application level.
  3) Performance: allows application-specific optimizations.
  4) Flexibility: more powerful abstractions can be supported.

Results: implemented the Cheetah web server, 10x faster for small document transfers. Why?
  - Merged file cache and retransmission pool: removes many block copies, since data is transmitted over the wire directly from the file cache.
  - Knowledge-based packet merging: piggy-backs ACKs to reduce overhead for small file transfers.
  - HTML-based file grouping: used C-FFS (co-locating FFS), which performs well for small files.

Problems?
  - Portability.
  - Protection (can't trust that all applications are bug-free).
  - Can unmodified applications run alongside modified applications?
  - Will this decentralized scheduling work?
  - How difficult is it to write applications now?

IV. Exokernels vs. Microkernels
  Q1: What's the difference?
  Q2: What is the same?