Evaluations

lsui (lsui@cs.ucsd.edu)
Mon, 08 May 2000 17:21:33 -0700

A. Why Aren't Operating Systems Getting Faster As Fast as Hardware?

John Ousterhout

Main point:
Discusses how OS performance is not keeping up with hardware speedups
(especially on RISC machines).

Hardware and OS platform:
Used ten hardware configurations, both CISC and RISC, rated at
roughly 0.9 to 20 MIPS. The MicroVAX II, the slowest machine, is used
as the reference point.
Used six operating systems, including Mach, Sprite, and several UNIX
variants.

Benchmarks:

RISC machines were generally faster than CISC machines on an absolute
scale, but not as fast as their MIPS ratings would suggest.

Eight benchmarks were run.
Kernel entry-exit: cost of entering and leaving the OS kernel,
measured by invoking the getpid kernel call. The RISC machines were
not as fast here as their MIPS ratings would predict.
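The kernel entry-exit measurement can be sketched as follows in Python
(the paper's benchmarks were not written this way; the function name and
iteration count are my own, and some libc versions cache getpid(), which
would understate the cost):

```python
import os
import time

def time_getpid(iters=100_000):
    """Return average seconds per os.getpid() call, i.e. roughly the
    cost of one kernel entry-exit (assuming libc does not cache it)."""
    start = time.perf_counter()
    for _ in range(iters):
        os.getpid()
    return (time.perf_counter() - start) / iters

if __name__ == "__main__":
    print(f"getpid: {time_getpid() * 1e6:.3f} us/call")
```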
Context switching: cost of context switching plus processing small
pipe reads and writes.
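A minimal sketch of the pipe-based context-switch measurement (Unix-only;
the fork-based echo child is my choice of mechanism, not the paper's
exact code):

```python
import os
import time

def pipe_pingpong(rounds=1000):
    """Bounce one byte between parent and child over two pipes; each
    round trip forces two context switches plus two small pipe
    reads/writes. Returns average seconds per round trip."""
    p2c_r, p2c_w = os.pipe()   # parent -> child
    c2p_r, c2p_w = os.pipe()   # child -> parent
    pid = os.fork()
    if pid == 0:               # child: echo every byte back
        for _ in range(rounds):
            os.write(c2p_w, os.read(p2c_r, 1))
        os._exit(0)
    start = time.perf_counter()
    for _ in range(rounds):
        os.write(p2c_w, b"x")
        os.read(c2p_r, 1)
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)
    return elapsed / rounds
```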
Select: use select to check the readability of pipe(s).
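The select benchmark can be sketched like this (checking one
already-readable pipe with a zero timeout; parameters are my own):

```python
import os
import select
import time

def time_select(iters=10_000):
    """Average cost of one select() readiness check on a single pipe
    whose read end is already readable."""
    r, w = os.pipe()
    os.write(w, b"x")                        # make r readable
    start = time.perf_counter()
    for _ in range(iters):
        ready, _, _ = select.select([r], [], [], 0)
    elapsed = time.perf_counter() - start
    os.close(r)
    os.close(w)
    return elapsed / iters
```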
Block copy: cost of transferring large blocks of data within memory.
The faster the processor, the smaller the relative performance gain,
for both CISC and RISC machines. RISC machines in particular do not
scale well on memory-intensive operations.
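A rough block-copy sketch (buffer size and repetition count are
arbitrary; a bytearray slice assignment is a bulk memory copy, so the
result approximates memory bandwidth rather than CPU speed):

```python
import time

def copy_bandwidth(size=16 * 1024 * 1024, reps=10):
    """Copy `size` bytes `reps` times between two buffers; return the
    achieved bandwidth in MB/s."""
    src = bytearray(size)
    dst = bytearray(size)
    start = time.perf_counter()
    for _ in range(reps):
        dst[:] = src           # bulk, memcpy-like copy
    elapsed = time.perf_counter() - start
    return size * reps / elapsed / 1e6
```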
Read from file cache: cost of entering the kernel and copying data
from the kernel's file cache to a buffer in user space. A write-back
cache performs better than a write-through cache here, since the
receiving buffer is likely to stay in the cache.
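The file-cache read measurement might look like this (the file is
written once, then reread so later reads are served from the OS file
cache; sizes and names are my own):

```python
import os
import tempfile
import time

def cached_read_bandwidth(size=1024 * 1024, reps=50):
    """Reread one file repeatedly so the data comes from the kernel's
    file cache; return MB/s for the cached reads."""
    fd, path = tempfile.mkstemp()
    os.write(fd, b"\0" * size)
    start = time.perf_counter()
    for _ in range(reps):
        os.lseek(fd, 0, os.SEEK_SET)
        os.read(fd, size)      # kernel-cache -> user-buffer copy
    elapsed = time.perf_counter() - start
    os.close(fd)
    os.unlink(path)
    return size * reps / elapsed / 1e6
```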
Andrew benchmark: a larger benchmark that copies files (from local
disk in one configuration, from a remote disk in another) and then
compiles them. Faster machines generally show smaller relative
speedups than slower ones, and MIPS-relative speeds vary more from OS
to OS than from machine to machine. Sprite is faster than the other
OSs in the local case, and particularly in the remote case.
Open-close: repeatedly open and close a single file. Sprite is
slower than the other OSs because of the cost of server communication.
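The open-close loop as a sketch (one preexisting file, repeatedly
opened and closed; iteration count is mine):

```python
import os
import tempfile
import time

def time_open_close(iters=5000):
    """Average seconds per open+close of a single existing file."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    start = time.perf_counter()
    for _ in range(iters):
        os.close(os.open(path, os.O_RDONLY))
    elapsed = time.perf_counter() - start
    os.unlink(path)
    return elapsed / iters
```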
Create-delete: open->write->close->open->read->close->delete.
Sprite wins because of its policy for short-lived files: data goes to
disk only after it has lived at least 30 seconds. Other OSs force
more data to disk (sometimes it must first be transferred over the
network to the file server), which increases the overhead.
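The create-delete cycle can be sketched as follows (file size and
counts are my own; on a delayed-write system like Sprite, such
short-lived data would never reach the disk):

```python
import os
import tempfile
import time

def create_delete(iters=500, size=10 * 1024):
    """One cycle = open/write/close, open/read/close, delete.
    Returns average seconds per cycle."""
    d = tempfile.mkdtemp()
    path = os.path.join(d, "scratch")
    data = b"\0" * size
    start = time.perf_counter()
    for _ in range(iters):
        with open(path, "wb") as f:
            f.write(data)
        with open(path, "rb") as f:
            f.read()
        os.unlink(path)
    elapsed = time.perf_counter() - start
    os.rmdir(d)
    return elapsed / iters
```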

Conclusions:

Hardware issues: memory bandwidth needs to keep up with CPU speed;
otherwise some classes of applications will be limited by memory
performance.
Context switching is about 2x more expensive on RISC than on CISC,
but as long as the relative degradation stays constant it is not too
serious a problem.
Software issues:
Decouple file system performance from disk performance; otherwise,
file-intensive programs cannot fully benefit from faster CPUs.
The author believes that the assumptions in NFS (statelessness,
write-through-on-close) represent a fundamental performance limitation.

B. The Interaction of Architecture and Operating System Design

Main point: Architectural and OS design trends have not been mutually
aware, resulting in OS performance that is poorer than application
code performance on contemporary RISCs.

Current Trends:
Move to RISC; implementation of the architecture made visible to
higher-level software; migration of function from hardware to
software; OSs migrating from a monolithic kernel structure to modular
microkernels with multiple OS threads and address spaces, relying more
and more on IPC (RPC) for communication.
OS designers have optimized performance until they hit hardware
limits; they do not optimize designs for the trends and
characteristics of the architecture.

Primitive OS system functions:
Dramatis personae:
Null system call - enter null C kernel procedure with
interrupts (re)enabled, then return.
Trap - user program takes fault (prot. violation), vector to
null C kernel procedure, then return. Requires save/restore of
unpreserved registers.
PTE change - VA->PTE conversion, update PTE with protection
info, update TLB.
Context switch - once in kernel, time to switch process
context, including address space.
Discovered that application code performance exceeds OS performance
(for RISCs relative to a CVAX) by a factor of 2-3 in relative speed,
and that modern RISC OSs take more instructions to do these functions
than older CISC OSs.

IPC/RPC:
Central to modern OS structure and performance because of
modularity and distributability.
Cross-machine communication is limited by the OS overhead of data
copying, interrupt processing, thread management, and stub functions
such as data marshalling/unmarshalling, NOT by network transfer time.
(Argument for AM.) Increasing CPU speed therefore won't help as much
as it could.
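The software-overhead point can be illustrated with a local round trip,
where wire time is essentially zero and everything measured is OS
overhead (a sketch with my own names; single-threaded, relying on
socket buffering to hold each message):

```python
import socket
import time

def local_rpc_overhead(payload=64, rounds=2000):
    """Round-trip a small message over a local socketpair; with no
    network in the path, nearly all of the per-round cost is OS/software
    overhead (syscalls, copies, buffer management)."""
    a, b = socket.socketpair()
    msg = b"\0" * payload
    start = time.perf_counter()
    for _ in range(rounds):
        a.sendall(msg)
        b.sendall(b.recv(payload))   # "server" side echoes
        a.recv(payload)
    elapsed = time.perf_counter() - start
    a.close()
    b.close()
    return elapsed / rounds
```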
System Calls and Interrupts:
Three components of a system call:
1) kernel entry/exit
2) call preparation
3) C call/return
VM:
Increased address spaces, copy-on-write optimizations,
distributed shared memory, RVM, etc.
Memory management is difficult on current RISCs, as the pipeline
structure is often exposed - the OS must examine pipeline state and
emulate instructions to maintain correctness on restart after a
page fault. The FPU pipeline is a hairy, ugly beast.
Fault management often difficult because information is hidden
from OS, such as address that caused fault. OS must interpret faulting
instruction to deduce it.
TLB management is typically done by the OS. TLBs without process-ID
tags must be purged on context switch, including context switches
within the kernel.
Virtually addressed caches obviate a virtual->physical
translation before cache lookup, but now caches must be flushed on
context switches - potentially disastrous for low-level OS functions.

Threads:
Usual issues of user-level vs. system-level threads.
Architecture can impact user-level threads, most crucially on
thread context switches. Many registers may need to be saved/loaded.
SPARC: register windows complicate matters again - on average 3
windows are saved/restored per thread context switch. Also, the
current window pointer is privileged, so a kernel trap is paid on
every thread context switch.
Why large register sets? They assume procedure calls are more
frequent than context switches - less true for today's parallel
systems that make heavy use of threads.
Kernel-level threads cause decreased TLB effectiveness because
of thread context switches in different address spaces.
Atomic lock instructions are missing from some RISC instruction
sets, requiring either a kernel trap or an expensive synchronization
algorithm.

Behaviour of OSs and Apps:
Low-level primitives (trap, context switch, ...) are used
frequently, and much more frequently (by 2-4 orders of magnitude) in
modern OSs than in older ones.
TLB misses are similarly much more frequent in modern OSs.