CSE 240A -- Principles in Computer Architecture

Projects

Project 1 -- Branch Predictor Contest. Due Nov 15 midnight. 

This project is based on the Championship Branch Predictor contest, sponsored by JILP, and which took place at the 2004 MICRO conference.  Don't worry if you can't follow that link (it seems to have some problems from the UCSD domain), because we'll reproduce, and alter, the rules here.  Each contestant will write a branch predictor that will consume a trace of branches (generated from real execution).  It will be based on a standard interface, provided by Leo and Matt DeVuyst.  The predictor will have a budget, namely (32K + 256) bits (not bytes).  That encompasses all storage (all tables, plus GHR, if necessary -- but not PC) used for the predictor.  That may not be the amount of storage your predictor simulator uses, however -- for example, you may implement 2-bit predictors with a table of ints, in which case the simulator will use more memory -- that's okay, we're only concerned about the memory used by the simulated branch predictor.  The output of your simulator will show number of branches, number of taken branches, and mispredict rate per branch (as a percent).  Your grade will depend almost exclusively on the last item, however. 

Leo has provided a complete interface to the traces, etc., significantly minimizing the amount of work you have to do.  So even if you don't like C/C++, I guarantee it is easier to use this than to write your own predictor from scratch.  You'll simply provide three functions, as described in the README file.

Like the original contest, we will use two sets of traces -- one to allow you to test the predictor, and another to evaluate.  But the actual traces we use to constitute the test and final sets is different than the original contest.  The test traces will be available to you.  We will run your predictor on the final traces.  You will turn in a full report, including your final code, by Nov 15.  We will determine the winners by which predictors run best on the final traces.

Points for the project will go something like this.  50, 48, 46, 44, 42 points for 1st through 5th place.  40 points for any predictor above the high standard, and 35 for any predictor above the low standard.  Lower points for predictors below standard.  Much lower for buggy predictors.  The standard is established like this:  Leo will code two predictors, an Alpha 21264-like predictor (within the size constraints), and a bi-mode predictor.  The better of those predictors, overall, will represent the high standard you have to beat, and the other will represent the low standard.  Keep in mind that beating these predictors for the test traces does not guarantee success on the final traces, so make sure you are well ahead.

Rules.  1.  write your own code.  Don't look at code from others, in class or outside of classes, or from a previous class.  There are all kinds of places to get branch predictor code, if you wanted to.  Don't.  2.  Do not reference the predictors from the contest (either the code or the descriptions).  However, there are a whole set of predictors cited below that should prove helpful.  3.  Use C or C++, ideally on cse machines.  However, as long as you can get access to a unix account, and use the framework from Leo, you should be okay.  4.  Turn in the code and a short report, and mail them to Leo (more detail in the README with the files).  The report has only two parts.  (a) a table with all your results, including arithmetic means over all traces.  (b) a description of the predictor -- there may be bonus points for originality (meaning, it doesn't match any predictor I know) if you finish in the top group.  Include citations of any predictor you are based on.

Hints.  1.  Code the predictors you need to beat -- don't guess.  2.  Make sure you beat them by a good margin.  

You can find the test traces at ~tullsen/240proj1fa05/traces, the source code at ~tullsen/240proj1fa05/src, and a README at ~tullsen/240proj1fa05/.

Alternatively, you can download all the files from here

References.

McFarling, Combining Branch Predictors, WRL TN-36.  Good description of local, correlating, gshare, and combining predictors.

Kessler, The Alpha 21264 Microprocessor, IEEE Micro, 1999.  Uses variant of McFarling's combining (tournament) predictor.

A. Eden, and T. Mudge. The YAGS branch predictor. 31th Ann. IEEE/ACM Symp. Microarchitecture (MICRO-31), Dec. 1998.  Good description of a variety of anti-aliasing predictors (predictors that still work with large working sets of branches).  Don't take their word that theirs is best without testing it out...

C.-C. Lee, I.-C. Chen, and T. Mudge. The bi-mode branch predictor. 30th Ann. IEEE/ACM Symp. Microarchitecture (MICRO-30), Dec. 1997.  Although the previous had a good description of bi-mode, this is the original.

A. Seznec, S. Felix, V. Krishnan, Y. Sazeides. "Design trade-offs for the Alpha EV8 conditional branch predictor", in : Proceedings of the 29th International Symposium on Computer Architecture, May, 2002.  Take a deep breath before you enter...

 


Project Number 2. Due Thursday, Dec 8, 11:59 pm. In this project, you will create a simple cache simulator, and evaluate some simple cache optimizations. Your cache simulator will be configurable. It will take arguments of the form:

cache -a 2 -b 32 -s 64

That indicates that the cache is 64 KB total size (-s 64), has a blocksize of 32 bytes (-b 32), and associativity two (-a 2). The simulator should work with any reasonable parameters (it's okay, though, if it breaks for nonsensical values), even though you'll only use it for a few in this project.

You will evaluate three options to make a baseline cache appear more associative. The first is a victim cache of 8 entries, fully associative. The second is a victim cache of 16 entries, 4-way set-associative (so that the access can still be done in a cycle). The third is a pseudo-associative cache (as a replacement to a baseline direct-mapped cache). You will indicate these on the command line as simply (-v1 or -v2 or -pa). You'll implement the following policies:

For the victim cache, each set is implemented as a FIFO (because a hit always results in the line being removed from the victim cache, this implements an LRU algorithm). Thus, any line evicted from the L1 cache is placed in the victim cache. A hit in the victim cache results in the hit line being transferred to the L1 cache. That goes for both loads and stores. Access to the victim cache takes 1 extra cycle. So an L1 hit takes no extra cycles, a victim cache hit takes 1 cycle, and access to the L2 and main memory takes 1 cycle longer in a system with a victim cache than in one without.

For the pseudo-associative cache, the alternate entry (the index with the highest index bit flipped) is checked on a miss. A hit to the alternate entry requires an extra cycle. A hit in the alternate entry results in the two entries being swapped, so that the next access to the same line would be a fast access. A miss causes an eviction of whichever of the two lines is LRU. The new data is always placed in the fast index, so if the alternate index was evicted, the line in the fast index will need to be moved to the alternate index.. So a regular hit takes no extra cycles, a pseudohit takes 1 cycle, and access to the L2 and main memory takes 1 cycle longer in a system with a pseudo-associative cache than in one without.

You will compare these three options (well, four, really, because it is possible but unlikely that the baseline without optimization is faster, because of the reduced latency to the L2 cache) for each of these cache baselines. Thus, you will identify three different "best" configurations, one for each baseline. For all baseline caches, you will have a 256 KB, 4-way set-associative L2 cache with 25 cycle access, and main memory with 300 cycle access. The blocksize of L2 is the same as the L1. Access times are from one cache to the next, so a load that goes to memory takes 325 cycles with no optimization, and 326 cycles if we have a victim or pseudo-associative cache.

The baselines are: (a) 32 KB L1, direct-mapped, 32-byte cache blocks. (b) 64 KB L1, direct-mapped, 128-byte cache blocks. (c) 64 KB L1, 2-way set-associative, 64-byte blocks. Note that pseudo-associativity is more convoluted on a 2-way SA cache, so only consider victim caches in this case.

Other assumptions:

Assume write-allocate caches.

Assume you stall for stores just like loads - not realistic, but it makes the simulation much easier.

Assume the cpu maintains a CPI of 1 except for load and store stalls. Assume the cpu stalls immediately on all misses.

Your report will give average hit rates for L1, L2, victim, regular and pseudohits, as appropriate for all configurations, as well as cpi. Identify the best configuration for each baseline, based on performance (of course). Run for all traces in the traces directory, and use the arithmetic mean of the CPIs.

You will turn in your report and code electronically again.

You'll find the code, traces, and further instructions (a README file) in ~tullsen/240aproj2fa05.

Partial FAQ from last year

1. Can I assign statically declared arrays or must they be dynamic?

Since you don't know what the maximum number of entries or the maximum associativity we may ask of your simulator is, you should probably use dynamic memory allocation. However, you are not confined to malloc. Another possible data structure is the C++ vector class.

3. How many read and write ports can we assume?

You may assume an infinite number of read and write ports on the caches and main memory.

5. How should we calculate CPI?

You can assume that the instruction in the last sample of each tracefile is the last instruction to be executed (so the last instruction executed is always a load or a store since the samples are only loads or stores).