Phase Analysis
 

The way a program’s execution changes over time is not totally random; in fact, it often falls into repeating behaviors, called phases. Automatically identifying this phase behavior is the goal of our research and is key to unlocking many new optimizations. We define a phase as a set of intervals (or slices in time) within a program’s execution that have similar behavior, regardless of temporal adjacency. Recent research has shown that it is indeed possible to accurately identify and predict these phases, capturing meaningful repeating behavior in a program’s execution.

The key observation for phase recognition is that any program metric is a direct function of the way a program traverses the code during execution. We can find this phase behavior and classify it by examining only the ratios in which different regions of code are being executed over time. We can simply and quickly collect this information using basic block vector profiles for off-line classification or through dynamic branch profiling for online classification.  In addition, accurately capturing phase behavior through the computation of a single metric, independent of the underlying architectural details, means that it is possible to use phase information to guide many optimizations and policy decisions without duplicating phase detection mechanisms for each optimization.  

A good high-level overview of our approach to phase classification can be found in:

Timothy Sherwood, Erez Perelman, Greg Hamerly, Suleyman Sair, and Brad Calder, Discovering and Exploiting Program Phases, IEEE Micro: Micro's Top Picks from Computer Architecture Conferences, December 2003

The following is a set of definitions we will use in describing phase classification and using SimPoint to find simulation points.

* An interval is a section of continuous execution (a slice in time) within a program. For the results presented here, we chose intervals of the same size, as measured by the number of instructions executed within an interval (either 1, 10, or 100 million instructions). We are currently exploring the use of variable-sized intervals.

* A phase is a set of intervals within a program’s execution that have similar behavior, regardless of temporal adjacency. In this way, a phase can reoccur multiple times throughout the program’s execution.

* Phase classification breaks a program’s intervals of execution on a given input into phases with similar behavior. This phase behavior is for a specific program binary running a specific input (a binary-input pair).

* Similarity defines how close the behavior of two intervals is, as measured across some set of metrics. Well-formed phases should have intervals with similar behavior across architecture metrics (such as IPC, cache misses, and branch mispredictions) and have similar working sets.

* The similarity metric is independent of hardware metrics; we use it to judge the similarity of intervals when performing phase classification.

* A phase change is a noticeable and sudden change in program behavior (similarity) over time, as when going from one phase to another.

Phase Observations:  The above Figure shows the time-varying behavior over the full execution of gzip with the graphic-ref input for several architectural metrics.  In this context, a phase is a set of execution segments whose behavior is similar to one another.  In this plot the phases have been color coded, and there are 5 major observations to be made:

  1. Three major phases can be seen, which we have colored red, blue, and yellow.

  2. The behavior of each architectural metric shifts in unison with the other metrics.

  3. Phases range over large sections of execution (tens of billions of instructions).

  4. The behavior between phases can vary significantly (the blue phase is very different from the yellow or red phases).

  5. There is strong structure and repetition for all the phases.

 

 SimPoint's Off-Line Phase Classification

At a high level, our approach automatically finds simulation points of a program by clustering an interval-partitioned code profile of the full execution, and then picking a sample from each cluster and weighting it by the size of the cluster.  The key insight to this approach is that the entire analysis uses only the code that executes and is independent of architectural parameters.  The entire method can be broken down into 4 steps:

  1. Basic Block Vector Analysis - architecture-independent code profiling.

  2. Random Projection - reduces the dimensionality of the data.

  3. Phase Classification using K-Means Clustering - classifies all intervals into a set of phases, where similar intervals are placed in the same phase.

  4. Picking Simulation Points - the final step, which finds a good phase representation of the execution and then chooses a sample from each phase to form the simulation points.

 

Step 1: Basic Block Vector Analysis

To concisely capture information about how a program changes its behavior over time, we developed the Basic Block Vector (BBV). A basic block is a section of code executed from start to finish, with one entry and one exit. We use the frequencies with which basic blocks execute as the metric for comparing sections of the application’s execution. The intuition behind this is that program behavior at a given time directly relates to the code executing during that interval, and basic block distributions provide us with this information. A program, when run for any interval of time, will execute each basic block a certain number of times. Knowing this information provides a fingerprint for that interval of execution and shows where the application is spending its time in the code. The basic idea is that the basic block distributions for two intervals are fingerprints that indicate the similarity between the intervals. If the fingerprints are similar, then the two intervals spend about the same amount of time in the same code, and the performance of those two intervals should be similar.

More formally, a BBV is a one-dimensional array with one element for each static basic block in the program. During each interval, the number of times program execution enters each basic block is counted and recorded in the BBV, weighted by the number of instructions in the basic block. Therefore, each element in the array is the count of entries into a basic block multiplied by the number of instructions in that basic block. The BBV is then normalized by dividing each element by the sum of all the elements in the vector for that interval.
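
As a rough illustration, the following is a minimal sketch of how a normalized BBV might be built for one interval, assuming a profiler has already supplied per-block entry counts and instruction counts (the function and variable names are illustrative, not part of the SimPoint tool):

    def build_bbv(block_counts, block_sizes):
        """Build a normalized Basic Block Vector for one interval.

        block_counts[i] is the number of times static basic block i was
        entered during the interval; block_sizes[i] is the number of
        instructions in that basic block.
        """
        # Weight each block's entry count by its instruction count, so the
        # vector reflects where instructions were spent, not just entries.
        weighted = [count * size for count, size in zip(block_counts, block_sizes)]
        total = sum(weighted)
        # Normalize so the elements sum to 1, making intervals comparable.
        return [w / total for w in weighted] if total else weighted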

 

   

The above Figure shows a hypothetical program segment during three different execution instances.  There are 5 basic blocks (A, B, C, D, E), each represented by a box, and the black edges are control flow edges between the blocks.  The colored arrow is the execution path, with the colored number next to each block quantifying how many times that block was executed.  For simplicity, each basic block is assumed to have the same number of instructions.

Shown above are the Basic Block Vectors for each of the three different intervals of execution, along with the distance between Intervals 1 and 2 and the distance between Intervals 2 and 3.  The distance between basic block vectors correlates with how similar the intervals of execution are.  We measure this similarity between basic block vectors using the Manhattan Distance, which is computed by summing the absolute values of the element-wise differences of two vectors.  A small distance means the vectors are similar and more likely to be part of the same phase.  A large distance means the vectors are not similar and represent different phases of execution.  As you can see in the above example, Intervals 2 and 3 have a large distance, whereas Intervals 1 and 2 have a small distance and are therefore classified into the same phase, since they are similar.  Note that before taking the difference between two vectors, each vector is normalized so its elements sum to one (this is not shown in the above example).
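
A minimal sketch of this comparison, assuming both vectors are already normalized BBVs of the same length (the small example vectors below are made up for illustration, not taken from the Figure):

    def manhattan_distance(bbv_a, bbv_b):
        """Sum of the absolute element-wise differences of two normalized BBVs.

        Because each normalized BBV sums to 1, the distance lies in [0, 2]:
        0 means identical code usage, 2 means completely disjoint code usage.
        """
        return sum(abs(a - b) for a, b in zip(bbv_a, bbv_b))

    # Toy vectors for illustration only.
    interval_1 = [0.5, 0.3, 0.2, 0.0]
    interval_2 = [0.4, 0.4, 0.2, 0.0]
    interval_3 = [0.0, 0.1, 0.1, 0.8]
    print(manhattan_distance(interval_1, interval_2))  # small: likely same phase
    print(manhattan_distance(interval_2, interval_3))  # large: different phases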

We use a basic block similarity matrix to visually inspect the effectiveness of using BBVs in determining the similarities among intervals. The similarity matrix is the upper triangle of an N × N matrix, where N is the number of intervals in the program’s execution. An entry at (x, y) in the matrix represents the Manhattan distance (similarity) between the BBVs at intervals x and y.  The Figure below shows the similarity matrices for the two example programs, gzip and gcc. The matrix’s diagonal represents the program’s execution over time from start to completion.

 

The above shows the basic block similarity matrices for gzip-graphic.  The matrix diagonal represents a program’s execution to completion with units in 100s of millions of instructions. The darker the points, the more similar the intervals (the Manhattan distance is closer to 0); the lighter the points, the more different the intervals (the Manhattan distance is closer to 2).

To interpret the graph, consider points along the diagonal axis. Each point is perfectly similar to itself, so all the points on the diagonal are dark. Starting from a given point on the diagonal, you can compare how that point relates to its neighbors forward and backward in execution by tracing horizontally or vertically. To compare a given interval x with interval x + n, simply start at point (x, x) on the graph and trace horizontally to the right to (x, x + n).  In the similarity matrices for gcc and gzip, you can see large dark blocks, which indicate that there are repeating behaviors in the program.  Large triangular blocks that run along the diagonal axis indicate stable regions where the program behavior is not changing over time.  Rectangular dark blocks that occur off the diagonal axis indicate recurring behaviors, where a behavior that occurs later in execution has been seen sometime in the past.  When compared with the metrics shown in the first Figure above for gzip, it can be seen that the repeating nature of the program is captured (this is most clear for gzip) by examining only the code that is executed.  This motivates a technique to capture these patterns automatically.
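
A minimal sketch of how such a similarity matrix could be computed from the per-interval BBVs (the plotting step is omitted, and the function name is illustrative):

    def similarity_matrix(bbvs):
        """Upper-triangular matrix of Manhattan distances between interval BBVs.

        Entry (x, y) with x <= y holds the distance between intervals x and y;
        smaller values (darker points when plotted) mean more similar intervals.
        """
        n = len(bbvs)
        matrix = [[0.0] * n for _ in range(n)]
        for x in range(n):
            for y in range(x, n):
                matrix[x][y] = sum(abs(a - b) for a, b in zip(bbvs[x], bbvs[y]))
        return matrix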

 

Step 2: Random Projection

Our basic block vectors have a dimensionality equal to the number of static basic blocks in a program, which ranges from the thousands to over 100,000 for the SPEC benchmarks.  Since clustering algorithms generally suffer from high-dimensional data, it is essential to reduce the dimensionality before we cluster the data.  Random projection is an effective technique for reducing the dimensionality of the vectors from hundreds of thousands of dimensions down to only tens.  The operation involves a matrix multiplication and preserves the structure within the vectors needed for clustering.  Figure 6 shows the effect of randomly projecting gcc-166 from its original dimensionality of more than 80,000 down to only 15.  Some of the contrast between execution segments is lost, but the overall structure is still preserved.  We find that 15 dimensions are sufficient for accurately finding the phases in a program.
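
A minimal sketch of this reduction using NumPy; drawing the projection matrix from a unit Gaussian is one common choice and an assumption made here for illustration, not necessarily the exact projection matrix SimPoint uses:

    import numpy as np

    def random_project(bbvs, target_dims=15, seed=0):
        """Project BBVs down to target_dims dimensions.

        bbvs is a (num_intervals, num_basic_blocks) array of normalized BBVs.
        The relative distances between intervals are approximately preserved,
        which is all the clustering step needs.
        """
        rng = np.random.default_rng(seed)
        num_blocks = bbvs.shape[1]
        # Each entry of the projection matrix is an independent random draw.
        projection = rng.standard_normal((num_blocks, target_dims))
        return bbvs @ projection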

 

Step 3: Off-Line Phase Classification Using K-Means Clustering

BBVs provide a compact and representative summary of the program’s behavior for each interval of execution. By examining the similarity between them, it is clear that there exists a high-level pattern within each program’s execution. To use this behavior, it is necessary to have an automated way of extracting the phase information from programs. Clustering algorithms have proven useful in breaking the complete program execution into smaller groups (phases) that have similar BBVs.  Because BBVs relate to the program’s overall performance, BBV-based grouping results in phases that are similar not only in their basic block distributions but also in every other metric measured, including overall performance. In addition, you can gather BBVs quickly because they require only the counting of basic block execution frequencies.

The goal of clustering is to divide a set of points into groups such that points within each group are similar (by some metric, often distance) and points in different groups are dissimilar. A well-known clustering algorithm, k-means, can accurately break program behavior into phases. Random linear projection reduces the dimensionality of the input data without disturbing the underlying similarity information; it is a useful technique for speeding up the execution of k-means. One serious drawback of the k-means algorithm is that it requires the value k (the number of clusters) as input. To address this problem, we run the algorithm for several values of k and use a score to guide our final choice of k. The following steps summarize our algorithm at a high level:

  1. Profile the basic blocks executed in each program, breaking the program up into contiguous intervals of size N (for example, 1 million, 10 million, or 100 million instructions). Generate and normalize a BBV for every interval.

  2. Reduce the dimensionality of the BBV data to 15 dimensions using random linear projection. The advantage of performing clustering on projected data is that it significantly accelerates the k-means algorithm and reduces the memory requirements by several orders of magnitude.

  3. Run the k-means algorithm on the reduced-dimension data with values of k from 1 to M, where M is the maximum number of phases to use. Each run of k-means produces a clustering, which is a partition of the data into k phases/clusters. During this clustering, k-means compares the similarity of intervals, grouping them together into phases. Each run of k-means begins with a random initialization step, which requires a random seed.

  4. To compare and evaluate the clusters formed for different k, we use the Bayesian Information Criterion (BIC) as a measure of the goodness of fit of a clustering within a dataset. More formally, BIC is an approximation to the probability of the clustering given the data that has been clustered. Thus, the larger the BIC score, the higher the probability that the clustering is a good fit to the data. For each clustering (k from 1 to M), score the clustering’s fitness using the formulation given by Pelleg and Moore.

  5. Choose the phase clustering with the smallest k, such that its BIC score is at least X percent of the best score. The clustering k chosen is the final grouping of intervals into phases. For the results presented later, we used an 80 percent threshold and M = 10.

These steps provide a grouping of intervals into phases. The k-means algorithm groups similar intervals together based on the BBV similarity metric using the Euclidean distance. We then choose a final grouping of phases from the different options based on how well formed the phases are, as measured by the BIC metric.
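
A minimal sketch of steps 3 through 5 using scikit-learn's KMeans. The BIC scoring below is a simplified spherical-Gaussian approximation in the spirit of Pelleg and Moore's formulation, and the threshold rule normalizes scores over their observed range; both are assumptions for illustration rather than SimPoint's exact implementation:

    import numpy as np
    from sklearn.cluster import KMeans

    def bic_score(data, labels, centers):
        """Simplified BIC for one k-means clustering (larger is better)."""
        n, d = data.shape
        k = centers.shape[0]
        # Pooled spherical variance of points around their assigned centers.
        variance = max(np.sum((data - centers[labels]) ** 2) / max(n - k, 1), 1e-12)
        log_likelihood = 0.0
        for i in range(k):
            n_i = np.sum(labels == i)
            if n_i == 0:
                continue
            log_likelihood += (n_i * np.log(n_i / n)
                               - 0.5 * n_i * d * np.log(2 * np.pi * variance)
                               - 0.5 * (n_i - 1) * d)
        num_params = k * (d + 1)  # k centers of d dimensions plus k mixing weights
        return log_likelihood - 0.5 * num_params * np.log(n)

    def choose_phases(projected, max_k=10, threshold=0.8, seed=0):
        """Cluster with k = 1..max_k, then pick the smallest k whose
        (range-normalized) BIC score is at least `threshold` of the best."""
        runs = []
        for k in range(1, max_k + 1):
            km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(projected)
            runs.append((k, km.labels_, bic_score(projected, km.labels_, km.cluster_centers_)))
        scores = [score for _, _, score in runs]
        lo, hi = min(scores), max(scores)
        for k, labels, score in runs:
            normalized = (score - lo) / (hi - lo) if hi > lo else 1.0
            if normalized >= threshold:
                return k, labels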

 

The Figure above shows the phases discovered in gcc-166 after we clustered the data into 7 clusters.  Each color depicts a cluster/phase.  It is worth noting how similar the architectural metrics are to each other within the same phase, even though the phases were formed by examining only the frequency with which the code paths were executed.

Next: Picking Simulation Points 

 

