Reproducible User-Level Simulation of Multi-Threaded Workloads

Cristiano Pereira

UC San Diego Technical Report CS2007-0907, September 2007


As the complexity of processors increases, it becomes harder for designers to understand the non-trivial and many times non-intuitive interactions among the micro-architecture internal structures. Understanding these interactions is important because it helps pinpoint bottlenecks, enabling designers to reason about sources of performance loss and improve their next generation of processors. To help designers understand these interactions in current and, more importantly, in future generation designs, designers make heavy use of computer architecture detailed simulation. These simulators model the behavior of the processor on a per-cycle basis, allowing designers to look at very detailed trade-offs. Building and maintaining these simulators is a large and complicated task. In addition, recent trends in designing micro-architectures with multiple cores in the same chip brings new challenges that affect the way simulation results should be compared. This dissertation focuses on techniques to help build and maintain simulators, as well as techniques to improve the way architects evaluate design choices using simulation.

Existing user-level simulators require manual hand coding for the emulation of each and every possible system effect (e.g., system call, interrupt, DMA transfer) that can impact the application.s execution. Developing such an emulator for a given operating system is a tedious exercise, and it can also be costly to maintain it to support newer versions of that operating system. Furthermore, porting the emulator to a completely different operating system might involve building it all together from scratch. The first contribution of this dissertation is a technique to automatically capture the system effects to an application. The system effects are captured in logs and then used to guide achitecture simulation. By using the proposed technique, the complexity of implementing and maintaining user-level simulators is greatly reduced. In addition, the technique guarantees deterministic simulation on uni-processor systems.

As multi-core processors become main stream, techniques to address efficient simulation of multi-threaded workloads are needed. Simulation of multithreaded workloads on multi-core systems suffer from non-determinism across runs in different architecture configurations. If the execution paths between two simulation runs of the same benchmark, with the same input, are too different, the simulation results cannot be used to compare the configurations. The other contributions of this dissertation focus on techniques to efficiently collect simulation checkpoints for multi-threaded workloads. It extends the previous technique to efficiently collect logs for uni-processor simulation. Using these checkpoints, multi-threaded simulation in multi-core systems becomes deterministic. The deterministic simulation results in stalls that would not naturally occur in execution. This dissertation proposes techniques that allow one to accurately compare performance across architecture configurations in the presence of these stalls.