Sampled Simulation for Multithreaded Processors

Michael Van Biesbrouck

UC San Diego Technical Report CS2007-XXXX, September 2007


Microarchitectural simulation of multithreaded architectures with shared resources, such as simultaneous multithreading (SMT) cores and multi-core processors with shared caches, is time-consuming and the results of simulation may be difficult to interpret. It is time-consuming because modern benchmarks run for hundreds of billions (or even trillions) of instructions, and accurate multi-core and SMT simulation requires higher-detail models than single-threaded simulation. The statistics collected when two programs execute together can be difficult to interpret because the programs both exhibit independent phase behavior and affect each other's execution. Starting one program slightly later than during the original execution will change the phases that execute together and thus change the effects that the programs have on each other.

Accurate sampled simulation requires accurate sample collection. We evaluate techniques to improve sampling accuracy and performance, both for single-threaded and multithreaded simulation. These techniques include warming the CPU with detailed execution, storing cache state, and minimizing the size of checkpoints. Previous work showed that single-program performance can be accurately estimated by dividing execution into phases and simulating only representative samples from each phase. We demonstrate that the juxtaposition of phases (a 'co-phase') from a pair of programs behaves similarly to a single-threaded phase. Furthermore, simulating all possible co-phases allows analysis of all distinct SMT behaviors. This comprehensive knowledge of program interactions can be combined with the sequence of phases executed by each program to reconstruct the combined execution of the programs from any given starting point. Given the short samples, the set of executions from all possible starting oof the programs. This removes the problem of interpreting the results of small numbers of experiments.
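
The reconstruction step described above can be illustrated with a minimal sketch. It is not the report's implementation and rests on several simplifying assumptions: each program's phase sequence is given as one phase ID per fixed-length interval of instructions, each co-phase has a single pair of per-thread IPC values measured from a short sample, and the second program restarts when it reaches the end of its trace. All names (`reconstruct_cycles`, `CoPhaseTable`, `interval_len`, `start_b`) are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical co-phase table: (phase of A, phase of B) -> (IPC of A, IPC of B),
# assumed to have been measured once per co-phase from a short detailed sample.
CoPhaseTable = Dict[Tuple[int, int], Tuple[float, float]]


def reconstruct_cycles(trace_a: List[int], trace_b: List[int],
                       table: CoPhaseTable, interval_len: int,
                       start_b: int = 0) -> float:
    """Estimate the cycles needed for program A to finish while co-running with B.

    trace_a and trace_b hold one phase ID per interval of interval_len
    instructions. B starts at interval start_b and wraps around when it
    reaches the end of its trace (an assumption of this sketch).
    """
    idx_a, idx_b = 0, start_b % len(trace_b)
    rem_a = rem_b = float(interval_len)  # instructions left in current interval
    cycles = 0.0
    while idx_a < len(trace_a):
        ipc_a, ipc_b = table[(trace_a[idx_a], trace_b[idx_b])]
        # Time until each program reaches its next interval boundary.
        t_a, t_b = rem_a / ipc_a, rem_b / ipc_b
        dt = min(t_a, t_b)
        cycles += dt
        if t_a <= t_b:   # A crosses into its next interval
            idx_a, rem_a = idx_a + 1, float(interval_len)
        else:            # A makes partial progress within its interval
            rem_a -= dt * ipc_a
        if t_b <= t_a:   # B crosses into its next interval (with wrap-around)
            idx_b, rem_b = (idx_b + 1) % len(trace_b), float(interval_len)
        else:
            rem_b -= dt * ipc_b
    return cycles


def cycles_for_all_offsets(trace_a: List[int], trace_b: List[int],
                           table: CoPhaseTable, interval_len: int) -> List[float]:
    """Reconstruct A's execution time for every interval-aligned start of B."""
    return [reconstruct_cycles(trace_a, trace_b, table, interval_len, s)
            for s in range(len(trace_b))]
```

Because each reconstruction only walks the two phase sequences and looks up stored per-co-phase rates, repeating it for every starting offset requires no further detailed simulation under these assumptions.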

Finally, we propose three methods that use the co-phase techniques to summarize the behavior of all possible interactions within a suite of benchmarks. We reduce the scale of this problem using Principal Components Analysis, allowing our techniques to scale to large numbers of benchmarks and concentrate simulation on the most significant behaviors.