Understanding the cycle-level behavior of a processor running an application is crucial to modern computer architecture research. To gain this understanding, architects typically employ detailed cycle-level simulators. Unfortunately, this level of detail comes at the cost of simulation speed, and simulating the full execution of an industry-standard benchmark on even the fastest simulator can take weeks or months to complete. Offline phase analysis provides an accurate and efficient workaround for this problem.
Choosing Simulation Points
To address this problem we created the SimPoint tool to choose simulation points intelligently using our offline phase classification algorithm. SimPoint calculates phases for a program/input pair, chooses a single representative from each phase, and estimates the behavior of the remaining intervals by performing detailed program analysis or simulation only on that chosen representative. We choose this representative for each phase (which is a cluster of intervals) by finding the interval closest to the cluster's center (centroid). This selected interval for a phase is called a simulation point for that phase. We then perform detailed simulation at the simulation points and weight each point's results by the size (number of intervals) of its cluster.
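The centroid-based selection described above can be expressed as a short sketch. The feature vectors, cluster labels, and toy data below are hypothetical stand-ins; in SimPoint the vectors come from the offline phase-classification step (e.g., basic block vectors):

```python
import numpy as np

def choose_simulation_points(vectors, labels):
    """For each cluster, return (interval_index, weight), where the
    interval is the one closest to the cluster centroid and the weight
    is that cluster's fraction of all intervals."""
    points = {}
    n = len(vectors)
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        centroid = vectors[members].mean(axis=0)
        dists = np.linalg.norm(vectors[members] - centroid, axis=1)
        rep = members[np.argmin(dists)]          # interval nearest the centroid
        points[c] = (int(rep), len(members) / n)  # simulation point + weight
    return points

# Toy example: 6 intervals forming 2 well-separated phases.
vecs = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                 [5.0, 5.0], [5.1, 5.0], [5.0, 5.2]])
labs = np.array([0, 0, 0, 1, 1, 1])
sps = choose_simulation_points(vecs, labs)
```

Each phase then contributes one detailed-simulation run, and the per-point metrics are combined using the returned weights.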
The Figure above shows the complete execution of gzip and the three phases (red, blue and yellow) our Off-Line Phase Classification algorithm found. As described above, the last step of the SimPoint algorithm is to take this phase clustering and choose the interval closest to the centroid of each phase as that phase's simulation point. In the above example, SP1, SP2, and SP3 represent the three points chosen by this approach. By simulating only these three carefully chosen samples, SimPoint significantly reduces simulation time while still providing an accurate characterization of the full program.
The above Figure shows simulation accuracy results using SimPoint for the SPEC 2000 programs, comparing them to the complete execution of these programs and to fast-forwarding through 1 billion instructions before starting simulation. As described earlier, we choose one simulation point for each cluster, so for gzip, which our off-line analysis labeled as having 3 clusters, we simulated 300 million instructions. Despite simulating this limited number of instructions, our method had only a 4 percent error for gzip. For these results, SimPoint is run with a limit on the number of clusters to create, so it chooses fewer than 10 hundred-million-instruction intervals to simulate. The median results are for the complete SPEC 2000 benchmark suite.
For the non-SimPoint results, we ran a simulation for the same number of instructions as the SimPoint data to provide a fair comparison. The above Figure shows that starting simulation at the program's beginning results in a median error of 58 percent when compared to the full simulation, whereas blindly fast-forwarding for 1 billion instructions results in a median 23 percent IPC error. When using the clustering algorithm to create multiple simulation points, we saw a median IPC error of 2 percent and an average IPC error of 3 percent. In comparison to random-sampling approaches, SimPoint achieves similar error rates but requires five times less fast-forwarding time.
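The whole-program estimate behind these error numbers is a weighted average: each simulation point's measured IPC is weighted by its cluster's share of the intervals. The IPC values and the "true" full-simulation result below are made-up numbers for illustration only:

```python
def estimate_ipc(point_ipcs, weights):
    """Combine per-simulation-point IPCs into a whole-program estimate,
    weighting each by its cluster's fraction of all intervals."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(ipc * w for ipc, w in zip(point_ipcs, weights))

# Hypothetical three-phase program: per-point IPCs and cluster weights.
est = estimate_ipc([1.8, 0.9, 1.2], [0.5, 0.3, 0.2])   # weighted average, ~1.41

true_ipc = 1.45                                  # hypothetical full-run IPC
rel_error = abs(est - true_ipc) / true_ipc       # a few percent, as in the results above
```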
One of the questions we encounter is how accurate these simulation points are across different architecture configurations. Since these points were chosen independently of the architecture, we have found that they can be used accurately when making trade-off decisions while searching a design space.
The absolute error of a program/input run on one hardware configuration is not as important as tracking the change in metrics across different architecture configurations. We found that SimPoint accurately tracks the relative change in hardware metrics across several different architecture configurations. Using the same set of simulation points from SimPoint results in a bias error that is consistent (not random) across the different configurations.
These simulation points, which are independent of the underlying architecture, are representative of the full program's execution when varying the architecture parameters. The key insight for why this works is that the original SimPoint algorithm for clustering, choosing a cluster, and picking the simulation points is based only on the code usage over the program's execution, not on any architecture metrics.
To examine the independence of our simulation points from the underlying architecture, we used the simulation points for the Off-Line Phase Classification algorithm with 1 million instruction intervals and max K set to 300. For the program/input runs we examine, we performed full program simulations varying the memory hierarchy, and for every run we used the same set of simulation points when calculating the SimPoint estimates. We varied the configurations and the latencies of the L1 and L2 caches. As we increased the size and associativity of the caches we increased their latency to model the effect of architecture scaling. We chose L1 instruction and data cache sizes between 4 KBytes and 64 KBytes, varied the associativity between direct mapped and 4-way associative, and varied their latency from 1 to 3 cycles. At the same time we varied the size of the unified L2 cache from 500 KBytes to 2 MBytes, its associativity from 4-way to 8-way associative, and its latency from 10 to 40 cycles.
The above Figure shows the results across the 19 different architecture configurations we examined for gcc-166 from the SPEC 2000 benchmark suite. The left y-axis represents the performance in Instructions Per Cycle and the x-axis represents different memory configurations from the baseline architecture. The right y-axis shows the miss rates for the data cache and unified L2 cache, and the L2 miss rate is a local miss rate. For each metric, two lines are shown, one for the true metric from the complete detailed simulation for every configuration, and the second for the estimated metric using our simulation points. The configurations on the x-axis are sorted by the IPC of the full run.
When performing this analysis across various programs in the SPEC 2000 benchmark suite, the simulation points from using SimPoint create a bias in the metrics, and this bias is consistent and always in the same direction across the different configurations for a given program/input run. This is true for IPC and the cache miss rates. The absolute error of one configuration is not as important as tracking the change in a metric across different architecture configurations. These results show that simulation points, which are chosen by only looking at code usage, track the relative change in metrics between configurations. When there is a change between two configurations, that change is clearly seen using SimPoint, and even small changes are accurately tracked.
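The "consistent bias" property can be checked mechanically: for a given program/input, the per-configuration errors should all have the same sign, and the estimates should rank the configurations in the same order as the true metrics. The IPC values below are hypothetical, chosen only to illustrate the check:

```python
# Hypothetical true vs. SimPoint-estimated IPC across four memory configurations.
true_ipc = [1.10, 1.25, 1.40, 1.55]
est_ipc  = [1.05, 1.19, 1.33, 1.48]

# Consistent bias: every configuration's error has the same sign.
bias = [e - t for e, t in zip(est_ipc, true_ipc)]
consistent_bias = all(b < 0 for b in bias) or all(b > 0 for b in bias)

# Relative change is tracked if the estimates order the configurations
# the same way the true metrics do.
n = len(true_ipc)
true_order = sorted(range(n), key=lambda i: true_ipc[i])
est_order = sorted(range(n), key=lambda i: est_ipc[i])
tracks_changes = (true_order == est_order)
```

A consistent bias cancels out when comparing two configurations, which is why design-space trade-off decisions remain accurate even when the absolute estimate is slightly off.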
In the original SimPoint algorithm the goal was to pick a single simulation point from each cluster that best represents all of the intervals in that cluster. This may pick simulation points that are at the very end of a program's execution. If the simulator supports checkpointing, then simulation can be started very quickly at a point at the end of the program. But, for simulation environments that do not support checkpointing, it can require up to several days to fast-forward to the latter part of execution to reach a late simulation point.
The goal of Early SimPoint is to find simulation points that are earlier in the program's execution that still accurately represent the overall execution of the program. These early simulation points can then be used to significantly reduce the time spent fast-forwarding to reach all of the simulation points for a program.
Choosing a Clustering
The Early SimPoint algorithm focuses on choosing a clustering that is both representative of the program's execution and has some feasible simulation points early in the program for all clusters. This might not be achievable for all programs, since an important phase of execution may only appear at the end of execution. We therefore still give priority in our algorithm to ensure that the clustering groups together intervals of execution that are similar to one another. Once the early clustering is chosen, we pick representative simulation points early in the execution from all the clusters.
Similar to the BIC score, we introduce the EarlySP score, which measures the goodness of fit of a clustering but is also weighted by how early in execution the last cluster starts. We score each clustering produced by our off-line algorithm and then pick the smallest k that achieves at least 90% of the spread between the largest and smallest EarlySP scores. The threshold can be set higher if tighter clusters are desired, at the cost of having more simulation points.
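The "smallest k reaching 90% of the score spread" rule can be sketched as follows. The score values below are hypothetical; in Early SimPoint they would be the actual EarlySP scores computed for each candidate clustering:

```python
def pick_k(scores, threshold=0.90):
    """scores: dict mapping k (number of clusters) -> EarlySP score,
    higher being better. Returns the smallest k whose score reaches
    `threshold` of the spread between the min and max scores."""
    lo, hi = min(scores.values()), max(scores.values())
    cut = lo + threshold * (hi - lo)
    return min(k for k, s in scores.items() if s >= cut)

# Hypothetical EarlySP scores for k = 1..5.
k = pick_k({1: 10.0, 2: 40.0, 3: 85.0, 4: 95.0, 5: 100.0})  # -> 4
```

Raising `threshold` toward 1.0 pushes the choice toward larger k (tighter clusters, more simulation points), matching the trade-off described above.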
Picking Early Simulation Points
After Early SimPoint picks a clustering using the EarlySP score, we determine a cutoff point in the program's execution and we consider picking simulation points only from the start of the program to this cutoff point. The cutoff point for a program/input is determined by first picking an early simulation point for the cluster that starts the latest in execution. No intervals of execution after this cutoff point will be considered for simulation points. This bounds the amount of overhead due to fast-forwarding.
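A highly simplified sketch of the cutoff idea follows. Here each cluster's earliest interval stands in for its early simulation point; the real algorithm weighs how representative each early interval is of its cluster rather than simply taking the first occurrence, and the interval sequence below is made up:

```python
def pick_early_points(labels):
    """labels[i] is the cluster of interval i, in execution order.
    Returns (one early interval per cluster, cutoff index), where the
    cutoff is set by the cluster that starts latest in execution."""
    first_seen = {}
    for i, c in enumerate(labels):
        first_seen.setdefault(c, i)   # earliest interval of each cluster
    # The latest-starting cluster determines the cutoff; no interval
    # after it is considered, bounding the fast-forwarding overhead.
    cutoff = max(first_seen.values())
    return first_seen, cutoff

# Toy phase sequence for a run of 7 intervals across 3 clusters.
points, cutoff = pick_early_points([0, 0, 1, 0, 2, 1, 2])
```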
The above Figure shows the three early simulation points (ESP1, ESP2, ESP3) chosen using the Early SimPoint algorithm in comparison to the original simulation points (SP1, SP2, SP3). The line at ESP3 shows how far into the execution of gzip Early SimPoint has to fast-forward to reach the last simulation point. The graph shows that Early SimPoint would reduce fast-forwarding time by half compared to the original SimPoint algorithm, which has to fast-forward up to the line at SP3. These savings come at a marginal loss in accuracy, a trade-off that may be well worthwhile for some researchers.
Given a set of simulation points derived from SimPoint, a user may want to know the confidence and error for this set of simulation points. We introduce the Variance SimPoint algorithm as a new technique for estimating what the expected error is for a given level of statistical confidence for a single set of simulation points and clustering. More importantly, Variance SimPoint can be used to guide how many simulation points should be chosen, to reach a given level of confidence and error bound. See the PACT 2003 paper for the details of the algorithm.