Hardware Optimizations Enabled by a Decoupled Fetch Architecture

Glenn Reinman

UC San Diego Technical Report CS2001-676, June 2001


In the pursuit of instruction-level parallelism, significant demands are placed on a processor's instruction delivery mechanism. In order to provide the performance necessary to meet future processor execution targets, the instruction delivery mechanism must scale with the execution core. Attaining these targets is a challenging task due to I-cache misses, branch mispredictions, and taken branches in the instruction stream. Moreover, there are a number of hardware scaling issues such as wire latency, clock scaling, and energy dissipation that can impact processor design.

To address these issues, this thesis presents a fetch architecture that decouples the branch predictor from the instruction fetch unit. A Fetch Target Queue (FTQ) is inserted between the branch predictor and instruction cache. This allows the branch predictor to run far in advance of the address currently being fetched by the instruction cache. The decoupling enables a number of architectural optimizations including multi-level branch predictor design and fetch directed instruction prefetching.

A multi-level branch predictor design consists of a small first level predictor that can scale well to future technology sizes and larger higher level predictors that can provide capacity for accurate branch prediction.

Fetch directed instruction cache prefetching uses the stream of fetch addresses contained in the FTQ to guide instruction cache prefetching. By following the predicted fetch path, this technique provides more accurate prefetching than simply following a sequential fetch path.

Fetch directed prefetching using a contemporary set-associative instruction cache has some complexity and energy dissipation concerns. Set-associative caches provide a great deal of performance benefit, but dissipate a large amount of energy by blindly driving a number of associative ways. By decoupling the tag and data components of the instruction cache, a complexity effective and energy efficient scheme for fetch directed instruction cache prefetching can be enabled.

This thesis explores the decoupled front-end design and these related optimizations, and suggests future research directions.