UC San Diego Technical Report CS2003-753, June 2003
CPU speeds double approximately every eighteen months, while main memory speeds double only about every ten years. These diverging rates imply an impending "Memory Wall," in which memory accesses dominate program performance. An effective way to reduce the observed latency of memory accesses is data prefetching: predicting which data the processor will use in the near future and bringing that data into storage locations closer to the execution engine (e.g., caches or dedicated prefetch buffers).
Meanwhile, applications are constantly becoming more complex and demanding, a trend reflected in their use of increasingly varied data structures and language constructs. While caches are extremely successful in handling accesses with good locality, the remaining accesses result in costly cache misses. We therefore performed a study that maps the miss stream to several fundamental access types inherent to programming constructs. The results show that pointer-based applications with accesses to linked data structures present the biggest challenge to the memory subsystem; consequently, they have the greatest potential for speedup when their adverse effects on program performance are removed by data prefetching. For these reasons, in this thesis we focus on data prefetching techniques targeting pointer-based applications with irregular access patterns.
With a better understanding of the kinds of accesses that cause cache misses, we have developed three prefetching techniques that together handle all of these cases. The first enhances traditional stream buffer designs by following an address prediction stream instead of a fixed stride as originally proposed, improving the stream buffer's ability to target pointer-based programs.
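To make the idea concrete, the following is a minimal sketch of a predictor-directed stream buffer fill, under illustrative assumptions: a classic stream buffer would fetch addr, addr+stride, addr+2*stride, and so on, while here the fill addresses come from a last-successor prediction table so the buffer can follow pointer chains. The table size, hash function, and buffer depth are hypothetical choices of this sketch, not parameters from the report.

```c
#include <stdint.h>

#define TABLE_SIZE   256   /* illustrative predictor size */
#define BUFFER_DEPTH 4     /* entries per stream buffer */

/* Last-successor table: for each (hashed) address, the address that
 * followed it most recently in the miss stream. */
static uint32_t next_addr[TABLE_SIZE];

static unsigned hash_addr(uint32_t addr) { return (addr >> 2) % TABLE_SIZE; }

/* Train: record that 'succ' followed 'addr' in the miss stream. */
void train(uint32_t addr, uint32_t succ) {
    next_addr[hash_addr(addr)] = succ;
}

/* On a miss to 'addr', fill the stream buffer by chaining predictions
 * instead of adding a fixed stride.  Returns the number of prefetch
 * addresses generated. */
int fill_stream_buffer(uint32_t addr, uint32_t out[BUFFER_DEPTH]) {
    int n = 0;
    uint32_t a = addr;
    while (n < BUFFER_DEPTH) {
        uint32_t pred = next_addr[hash_addr(a)];
        if (pred == 0)      /* no prediction available: stop filling */
            break;
        out[n++] = pred;
        a = pred;           /* follow the predicted chain */
    }
    return n;
}
```

Because the fill follows whatever the predictor learned, the same hardware covers both strided streams and recurrent pointer traversals.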
We also explore utilizing a hardware pointer cache to predict the values of pointer loads, breaking the long dependence chains caused by recurrent pointer accesses. This allows the execution of dependent instructions to overlap with the actual memory access, providing significant benefits.
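A minimal sketch of the pointer-cache idea follows, with assumed sizes and indexing: the cache maps the address of a pointer load to the pointer value last seen there, so dependents can issue speculatively with the predicted value while the real load is still in flight (a mismatch at load completion would force recovery, which this sketch does not model).

```c
#include <stdint.h>
#include <stdbool.h>

#define PC_ENTRIES 128   /* illustrative cache size */

typedef struct {
    uint64_t tag;    /* address of the pointer load */
    uint64_t value;  /* pointer value last seen at that address */
    bool     valid;
} PCEntry;

static PCEntry pcache[PC_ENTRIES];

static unsigned pc_index(uint64_t addr) { return (addr >> 3) % PC_ENTRIES; }

/* Record the value produced by a committed pointer load at 'addr'. */
void pc_update(uint64_t addr, uint64_t value) {
    unsigned i = pc_index(addr);
    pcache[i].tag   = addr;
    pcache[i].value = value;
    pcache[i].valid = true;
}

/* Predict the value a pointer load from 'addr' will return.  On a hit,
 * instructions depending on the load can execute with *pred before the
 * actual memory access completes. */
bool pc_predict(uint64_t addr, uint64_t *pred) {
    unsigned i = pc_index(addr);
    if (pcache[i].valid && pcache[i].tag == addr) {
        *pred = pcache[i].value;
        return true;
    }
    return false;
}
```

The payoff comes on recurrent traversals (e.g., p = p->next in a loop): each predicted value lets the next load in the chain start immediately instead of serializing on the previous miss.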
Another technique, Speculative Precomputation, prefetches data ahead of its use by running a shortened version of the program on a spare multithreaded hardware context. The final scheme we propose builds upon Speculative Precomputation, using the pointer cache to supply values for the pointer loads in speculative threads. Furthermore, we add control-flow instructions to the speculative threads to guide them down the correct traversal path when multiple next transitions are possible. Our results show considerable benefit when these techniques are utilized together, as they complement each other.
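The "shortened version of the program" can be pictured as a backward slice of the traversal loop that keeps only the address-generating instructions. The sketch below makes an assumption for illustration: the compiler prefetch builtin stands in for the spare hardware context that would actually run the slice ahead of the main thread, and the node layout and work per node are invented.

```c
#include <stddef.h>

typedef struct Node {
    struct Node *next;
    int key;
    int payload[14];   /* filler so each node spans a cache line or more */
} Node;

/* Main computation: the full traversal with its per-node work. */
int traverse(Node *head) {
    int sum = 0;
    for (Node *n = head; n != NULL; n = n->next)
        sum += n->key * n->payload[0];   /* stand-in for real work */
    return sum;
}

/* Speculative slice: the same loop stripped to the instructions that
 * generate addresses.  In the proposed scheme a slice like this runs on
 * a spare SMT context ahead of the main thread; here the prefetch
 * builtin merely hints the hardware (an assumption of this sketch). */
void prefetch_slice(Node *head, int depth) {
    for (Node *n = head; n != NULL && depth-- > 0; n = n->next)
        __builtin_prefetch(n->next, 0 /* read */, 1 /* low locality */);
}
```

Because the slice omits the per-node computation, it naturally runs ahead of the main thread; the added control-flow instructions mentioned above would appear in the slice as the branches that select which of several next pointers to follow.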