Homework #1, due January 18:
Homework #2, due January 27:
add r5, r1, 1000(r10)
sub r7, r1, r5
store r5, 2000(r10)
add r1, r2, r3
add r8, r7, 1000(r1)
beq r8, r4, label
P3: Consider the following code:
dadd r5, r1, r2
lw r7, 1000(r5)
sw r7, 2000(r10)
dadd r9, r7, r8
beqz r9, label
assuming the MIPS pipeline from figure A.18, and the forwarding paths from figure A.23, what forwarding paths needed to support the above code are not included in this figure? Describe the extra paths needed.
Homework #3, due February 24.
Homework #4, due March 3.
Homework #5, due March 10.
Project number one, branch predictor contest: Due Feb 17 midnight. This project is based on the Championship Branch Predictor contest, sponsored by JILP, and which took place at the last MICRO conference. Don't worry if you can't follow that link (it seems to have some problems from the UCSD domain), because we'll reproduce, and possibly alter, the rules here. Each contestant will write a branch predictor that will consume a trace of branches (generated from real execution). It will be based on a standard interface, provided by Matt. The predictor will have a budget, namely (64K + 256) bits (not bytes). That encompasses all storage (all tables, plus GHR, if necessary -- but not PC) used for the predictor. That may not be the amount of storage your predictor uses, however -- for example, you may implement 2-bit predictors with a table of ints, in which case the simulator will use more memory -- that's okay, we're only concerned about the memory used by the simulated branch predictor. The output of your simulator will show number of branches, number of taken branches, and mispredict rate per 1000 instructions. Your grade will depend almost exclusively on the last item, however.
Matt has provided a complete interface to the traces, etc., significantly minimizing the amount of work you have to do. So even if you don't like C/C++, I guarantee it is easier to use this than to write your own predictor from scratch. You'll simply provide three functions, as described in the README file.
Like the original contest, we will use two sets of traces -- one to allow you to test the predictor, and another to evaluate. But the actual traces we use to constitute the test and final sets is slightly different. The test traces will be available to you. We will run your predictor on the final traces. You will turn in a full report, including your final code, by Feb 17. We will determine the winners by which predictors run best on the final traces.
Points for the project will go something like this. 50, 48, 46, 44, 42 points for 1st through 5th place. 40 points for any predictor above standard. Lower points for predictors below standard. Much lower for buggy predictors. The standard is established like this: Matt will code a gshare predictor (obvious dimensions, given constraints) and a Alpha 21264-like predictor (not so obvious, but within constraints). The better of those predictors, overall, will represent the standard you have to beat. Keep in mind that beating these predictors for the test traces does not guarantee success on the final traces, so make sure you are well ahead. Anyone who scores among the top contenders of the real contest (without using their predictors) gets bonus points.
Rules. 1. write your own code. Don't look at code from others, in class or outside of classes. There are all kinds of places to get branch predictor code, if you wanted to. Don't. 2. Do not reference the predictors from the contest (either the code or the descriptions). However, there are a whole set of predictors cited below that should prove helpful. 3. Use C or C++, ideally on cse machines. But not everyone has an account yet. Hopefully, we'll have those soon. However, as long as you can get access to a unix account, and use the framework from Matt, you should be okay. 4. Turn in the code and a short report, and mail them to Matt (more detail in the README with the files). The report has only two parts. (a) a table with all your results, including arithmetic means over all traces. (b) a description of the predictor -- there may be bonus points for originality (meaning, it doesn't match any predictor I know) if you finish in the top group. Include citations of any predictor you are based on.
Hints. 1. Code the predictors you need to beat -- don't guess. 2. Make sure you beat them by a good margin.
You can find the test traces at ~tullsen/240proj1/traces, the source code at ~tullsen/240proj1/src, and a README at ~tullsen/240proj1/.
Alternatively, you can download all the files from here.
References.
McFarling, Combining Branch Predictors, WRL TN-36. Good description of local, correlating, gshare, and combining predictors.
Kessler, The Alpha 21264 Microprocessor, IEEE Micro, 1999. Uses variant of McFarling's combining (tournament) predictor.
A. Eden, and T. Mudge. The YAGS branch predictor. 31th Ann. IEEE/ACM Symp. Microarchitecture (MICRO-31), Dec. 1998. Good description of a variety of anti-aliasing predictors (predictors that still work with large working sets of branches). Don't take their word that theirs is best without testing it out...
A. Seznec, S. Felix, V. Krishnan, Y. Sazeides. "Design trade-offs on the EV8 branch predictor", in : Proceedings of the 29th International Symposium on Computer Architecture, May, 2002. Take a deep breath before you enter...
Project Number 2. Due Wednesday, March 16, midnight. In this project, you will create a simple cache simulator, and evaluate some simple cache optimizations. Your cache simulator will be configurable. It will take arguments of the form:
cache -a 2 -b 32 -s 64
That indicates that the cache is 64 KB total size (-s 64), has a blocksize of 32 bytes (-b 32), and associativity two (-a 2). The simulator should work with any reasonable parameters (it's okay, though, if it breaks for nonsensical values), even though you'll only use it for a few in this project.
You will evaluate three options to make a baseline cache appear more associative. The first is a victim cache of 8 entries, fully associative. The second is a victim cache of 16 entries, 4-way set-associative (so that the access can still be done in a cycle). The third is a pseudo-associative cache (as a replacement to a baseline direct-mapped cache). You will indicate these on the command line as simply (-v1 or -v2 or -pa). You'll implement the following policies:
For the victim cache, each set is implemented as a FIFO (because a hit always results in the line being removed from the victim cache, this implements an LRU algorithm). Thus, any line evicted from the L1 cache is placed in the victim cache. A hit in the victim cache results in the hit line being transferred to the L1 cache. That goes for both loads and stores. Access to the victim cache takes 1 extra cycle. So an L1 hit takes no extra cycles, a victim cache hit takes 1 cycle, and access to the L2 and main memory takes 1 cycle longer in a system with a victim cache than in one without.
For the pseudo-associative cache, the alternate entry (the index with the highest index bit flipped) is checked on a miss. A hit to the alternate entry requires an extra cycle. A hit in the alternate entry results in the two entries being swapped, so that the next access to the same line would be a fast access. A miss causes an eviction of whichever of the two lines is LRU. The new data is always placed in the fast index, so if the alternate index was evicted, the line in the fast index will need to be moved to the alternate index.. So a regular hit takes no extra cycles, a pseudohit takes 1 cycle, and access to the L2 and main memory takes 1 cycle longer in a system with a pseudo-associative cache than in one without.
You will compare these three options (well, four, really, because it is possible but unlikely that the baseline without optimization is faster, because of the reduced latency to the L2 cache) for each of these cache baselines. Thus, you will identify three different "best" configurations, one for each baseline. For all baseline caches, you will have a 256 KB, 4-way set-associative L2 cache with 25 cycle access, and main memory with 300 cycle access. The blocksize of L2 is the same as the L1. Access times are from one cache to the next, so a load that goes to memory takes 325 cycles with no optimization, and 326 cycles if we have a victim or pseudo-associative cache.
The baselines are:
(a) 32 KB L1, direct-mapped, 32-byte cache blocks.
(b) 64 KB L1, direct-mapped, 128-byte cache blocks.
(c) 64 KB L1, 2-way set-associative, 64-byte blocks. Note that pseudo-associativity is more convoluted on a 2-way SA cache, so only consider victim caches in this case.
Other assumptions.
Assume write-allocate caches.
Assume you never stall for stores. However, you do need to account for lines brought into the cache by stores. For example, if I do a store at cycle 100 and it misses in all caches, I know the line will be in the L1 cache in cycle 425. If I get a subsequent load to that line in cycle 150, I cannot simulate that as a regular hit (we typically would call this a partial hit), because we obviously cannot read the data from the cache yet. Thus, I would force the load to stall 275 cycles until the data was available.
Assume the cpu maintains a CPI of 1 except for load stalls. Assume the cpu stalls immediately on all misses (or partial hits).
Matt pointed out a subtlety you'll need to deal with. When you allocate a block for a store, you allow the CPU to proceed while the store happens in the background (including a possible access to L2, etc.). The subtlety is whether you are allowed to evict that block in the meantime. We'll assume that you cannot. Thus, a load can stall either both because it wants to access a block in transit or because it wants to evict a block in transit. In a 2-way set-associative cache, we will always evict the LRU block, even if it means we have to stall. Also notice that this means that we can stall on a store if the store would evict a block in transit -- that is the only reason we'd stall for a store.
To keep it simple, assume you cannot service a load to a block in transit (initiated by a store) until it returns with the data, even if the access is to the same word in the block.
Your report will give average hit rates for L1, L2, victim, regular and pseudohits, as appropriate for all configurations, as well as cpi. Identify the best configuration for each baseline, base on performance (of course). Run for all traces in the traces directory, and use the arithmetic mean of the CPIs.
You will turn in your report and code electronically again.
You'll find the code, traces, and further instructions in ~tullsen/240proj2. Alternatively, you can download all the files here.
H&P 1.3, 2.1, 2.6, 2.19, A.2, A.4, A.6, 3.1, 3.2, 3.3, 3.7, 3.10, 3.17, 3.18, 3.20, 3.21, 4.3, 4.5, 4.7, 4.9c, 5.5, 5.7, 5.12, 5.13, 5.16
Sections to read carefully: 1.5-1.6, 2.1-2.3, 2.7, 2.9, 2.10, A.1-A.3, A.5-A.6, 3.1-3.7, 4.3, 4.5, 5.1-5.7
Sections to read a little less carefully: 2.5, 2.11, 2.12, A.4, A.8, 3.8-3.10, 4.1-4.2, 4.4, 4.7, 5.10, 5.13
Caveats:
Pay attention to everything presented in class, whether in the book or not.
"Read less carefully" does not mean do not read/study, nor that it is unimportant.
The problems above do not give full, or balanced, coverage of the important material -- it just reflects which of the unassigned problems I felt were useful.