Workshop on Explicitly Parallel Instruction Computing Architectures
and Compiler Techonology, December 2001.
The Explicitly Parallel Instruction Computing (EPIC) architecture has been put forth as a viable architecture for achieving the instruction level parallelism (ILP) needed to keep increasing future processor performance. The IA-64 Itanium processor is an example of an EPIC architecture. An EPIC architecture issues wide instructions, similar to a VLIW architecture, where each instruction contains many operations.
One of the new features of the EPIC architecture is support for predicated execution, where each operation is guarded by one of the predicate registers available in the architecture. An operation is committed only if the value of its guarding predicate is true.
One advantage of predicated execution comes from predication's ability to combine several smaller basic blocks into one larger region. This provides a larger pool from which to draw instruction level parallelism (ILP) for EPIC architectures. Another advantage of predicated execution is that it can eliminate hard-to-predict branches by translating them into predicate defines, which do not need to be predicted. This comes at the cost of executing both paths following the branch as if it were a single path.
Choi et al. recently performed a study, where they reported that only 7% of cycles are spent due to branch mispredictions for the SPEC 2000 integer benchmarks (without using if-conversion). This is partially due to the in-order Itanium processor stalling because of memory latencies, which end up shadowing the stalls due to branch mispredictions. As this type of EPIC architecture progresses and memory latencies are better hidden, the stalls due to branch mispredictions will have a much larger impact.
The goal of our research is to use predicated execution to see how low we can make the branch misprediction rate by removing all of the the hard-to-predict branches, not caring about the increase in executed code created via predication. We first examine how to create predicated regions to remove hard-to-predict branches. In order to aggressively form predicated regions around these hard-to-predict branches, we had to leave unbiased, but originally predictable, branches (conditionals, unconditionals, and returns) inside predicated regions. We call branches left inside predicated regions region branches.
The creation of predicated sequences (region formation) based on removing a hard-to-predict branch can have a negative impact on the predictability of these region branches. Region branch execution is now predicated on a register defined by a predicate compare definition that was added in order to remove the hard-to-predict branch. These region branches will need to be predicted during fetch more frequently than they were in the original, non-predicated code (i.e. a region branch will be fetched both when its guarding predicate will be true and when it will be false). This can cause what we call misprediction migration, where the poorly predictable pattern of a hard-to-predict branch that was eliminated due to predication is merely migrated to a region branch. In addition, direct branches (e.g., unconditionals and returns) that are left in the region are also affected by misprediction migration. Before the region was formed, these region branches were accurately predictable as taken. After region formation, they now need to be predicted as either taken or not-taken when guarding predicate is TRUE, and should always be predicated as not-taken when the guarding predicate is FALSE. Because of misprediction migration, we found little improvement in branch misprediction rate for some programs when using a traditional branch predictor for region formation targeted at hard-to-predict branches.
In this paper, we examine two new branch predictor optimizations. First is a new branch prediction optimization called Squash False Path (Squash-FP) that attempts to know a branch's guarding predicate value as it is being fetched, and, if it is false, then the branch is predicted as not-taken. The goal of this predictor is to correctly predict the region branches that are on the false path as not-taken.
The second predictor we examine adds predicate information into the global history register. We examine the Predicate Global Update Branch Predictor (PGU) architecture that incorporates predicate information into the global history to try and improve the performance of region-branches that benefit from correlation. The PGU predictor updates the global history with the predicate result when the predicate defining instruction is resolved. This can allow region branches to benefit from this history correlation when making their prediction from the global prediction table.
Region branches only benefit from these two branch prediction architectures if they are scheduled far enough apart from their predicate definitions. Therefore, we examine the benefit of rescheduling the predicated region to move the region-based branches as far away as possible from their predicate defining instructions. This attempts to increase the cases where a predicate define can be resolved before the branch is fetched, so it can be used to form the prediction for the branch.