CSE 148 -- Advanced Processor Architecture Design Project

 

Basic Course Information:

Class Goals

In this class, you will design an advanced general-purpose processor that executes the MIPS ISA.  Your goal is to create a processor that executes a few target benchmarks as quickly as possible.  Everyone will start with the same baseline design (which executes the MIPS ISA correctly).  Because the benchmarks are fixed, instruction count is relatively fixed.  Therefore, your primary avenues for improvement are CPI and cycle time, and mostly CPI.
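The trade-off between CPI and cycle time follows from the usual performance equation (execution time = instruction count x CPI x cycle time). The short Python sketch below is only an illustration of that arithmetic: the function names and all of the numbers are hypothetical and are not measurements of the baseline design.

```python
def execution_time(instruction_count, cpi, cycle_time_ns):
    """Execution time in ns: instructions x cycles/instruction x ns/cycle."""
    return instruction_count * cpi * cycle_time_ns


def speedup(baseline, optimized):
    """Both arguments are (instruction_count, cpi, cycle_time_ns) tuples."""
    return execution_time(*baseline) / execution_time(*optimized)


# Hypothetical numbers, for illustration only.
baseline = (1_000_000, 2.0, 20.0)
optimized = (1_000_000, 1.25, 25.0)   # lower CPI, but a slower clock

print(f"speedup = {speedup(baseline, optimized):.2f}x")
```

With these made-up values the CPI gain outweighs the longer cycle time; an optimization that lengthens the clock period more than it reduces CPI would show a speedup below 1.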

You will work in groups of two.  You will be expected to implement 4 architectural optimizations;  beyond that, you will be judged primarily on the design's novelty and performance, as well as your analysis of the design.  We expect your choices to be driven more by performance than ease of implementation.

In addition, each group may also do a service project, possibly in lieu of one optimization. We'll discuss that further in class, and as service projects present themselves. Each group will do two topic presentations.  The presentations should mostly be completed during the first half of the quarter, and the service project (if you do one) preferably earlier rather than later.

The grading scheme will be as follows:

 Design Project & Analysis    70 %
 Design Reviews               15 %
 Topic Presentations          15 %

(the following is not correct yet!)

All of the information you need for the class will be available through Piazza. Sign up for the class here; after that, you can access the class here. This will include the Verilog infrastructure, suggestions for service projects, suggestions for optimizations, schedules for presentations, etc.

Infrastructure

We will be using the Quartus II Web Edition software, which is free to download.  It can be found at the Altera download site.

 


Design Projects 

Your design will be graded on three factors:

 Completion of at least 4 Optimizations                                       50 %
 Performance (as achieved by the optimizations)                               20 %
 Analysis and Novelty of Design; expressed through Paper and Presentation     30 %


To be successful in this class, you will need not only to build something that is cool and high performance, but also to measure it and communicate the novel elements of your design to the rest of the class.  To do so, you will need to write an 8-page paper (along the lines of an IEEE Micro magazine paper), and give a 20-minute talk (along the lines of a Hot Chips presentation) that conveys these elements.  We will also have student design reviews every other week that demonstrate your progress in completing the design. We will have specific milestones/deadlines for completion of the optimizations.

It is worth thinking ahead about the graphs and results you would like to produce at the end of the quarter. A great design is of lesser value if we cannot quantify how much it has improved. An ideal design can switch any optimization on and off, and select them independently. In addition, it can change the parameters for the optimization (e.g., size of branch predictor tables) trivially. But this will not always be possible (e.g., out-of-order execution).
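One way to plan those graphs is to enumerate every on/off combination of your optimizations up front, so that each result can be attributed to a specific configuration. The sketch below assumes four independently switchable optimizations; the names in OPTIMIZATIONS are hypothetical placeholders, and how a configuration is actually applied to the Verilog design is left open.

```python
from itertools import product

# Hypothetical optimization names; substitute the ones your group implements.
OPTIMIZATIONS = ["branch_prediction", "victim_cache", "stream_buffer", "superscalar"]


def all_configurations(names):
    """Yield one dict per on/off combination of the named optimizations."""
    for flags in product([False, True], repeat=len(names)):
        yield dict(zip(names, flags))


configs = list(all_configurations(OPTIMIZATIONS))
print(len(configs))      # number of combinations to simulate
print(configs[0])        # everything off: the baseline
print(configs[-1])       # everything on: the full design
```

Four independent switches give 16 runs per benchmark, which is a useful number to keep in mind when budgeting simulation time.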

Service Projects 

See the Piazza page.

Optimizations

The following optimizations are candidates for implementation in your design project (but not the only possibilities!):
0.  Caches (both instruction and data, or a unified cache).  This year, basic instruction and data caches are part of the baseline design, but you will be expected to tune associativity, size, block size, etc. to optimize performance. However, this is not one of your 4 optimizations. You could also do an off-chip cache, I suppose, since the DE2 board has SRAM. That would be a full optimization.

1.  Cache optimizations (victim cache, pseudo-associative, …)

2.  A lockup-free cache can service hits (and possibly misses) while waiting for a miss to return from memory.  With a lockup-free cache, the pipeline should stall on the use of data, not on a load miss.

3.  Superscalar execution.  Fetch, decode, execute multiple instructions per cycle.

4.  Superpipelining.  Run the clock with a cycle time that is roughly half that of the baseline (e.g., half of the ALU stage delay).

5.  Branch prediction

6.  Register renaming.  Is this useful without out-of-order execution?

7.  Out-of-order execution.  Instructions issue to the execution units in an order different from the order in which they are fetched.

8.  Multithreading.  One pipeline with multiple program counters.  Instructions from multiple threads are mixed or interleaved on the pipeline.

9.  Multicore.  Multiple CPUs (pipelines) connected via a bus or interconnection network. This one is not interesting unless running a parallel program with communication between threads -- and right now our benchmarks don't support that.

10.  Hardware prefetching (stream buffer).  Build a support hardware unit that observes the cache miss stream, recognizes patterns, and begins prefetching future misses. 

11.  Multi-path execution.  On some low-confidence branches, execute both targets of the branch.

12.  Runahead execution.  On a load miss, keep executing the instruction stream (just dropping stalled instructions).  This may cause a future miss to be initiated.  When the original load completes, you must recover back to the state following the load (similar to a branch mispredict recovery).

13.  Value prediction.  Identify instructions with predictable outcomes.  If the instruction is stalled, provide the predicted outcome and proceed.  Must be able to recover. 

14.  Instruction Reuse.  A similar technique to value prediction is instruction reuse: if an instruction executes with the same inputs as a previous instance, provide the same output as before.  Instruction reuse is non-speculative, but it only helps with multiple-cycle operations (which we don't have) or if you can predict multiple instructions at once.

15.  Other ideas: feel free to ask us or do something unquestionably cool. 
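For item 0, it can help to estimate how associativity, size, and block size affect the miss rate before changing the hardware. The following is a minimal software model, not the baseline cache: it assumes LRU replacement, ignores writes and timing, and the class name, parameter names, and address trace are all hypothetical.

```python
from collections import OrderedDict


class Cache:
    """Set-associative cache with LRU replacement; counts hits and misses only."""

    def __init__(self, size_bytes, block_bytes, ways):
        self.block_bytes = block_bytes
        self.ways = ways
        self.num_sets = size_bytes // (block_bytes * ways)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        block = address // self.block_bytes
        index = block % self.num_sets
        tag = block // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)       # mark as most recently used
            self.hits += 1
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)    # evict the least recently used line
        lines[tag] = True
        self.misses += 1
        return False


def miss_rate(trace, **params):
    """Run a list of byte addresses through a fresh cache built from params."""
    cache = Cache(**params)
    for address in trace:
        cache.access(address)
    return cache.misses / len(trace)


# Hypothetical trace: two addresses that map to the same set when direct-mapped.
trace = [0, 1024] * 8
print(miss_rate(trace, size_bytes=1024, block_bytes=16, ways=1))
print(miss_rate(trace, size_bytes=1024, block_bytes=16, ways=2))
```

In this contrived trace the direct-mapped configuration misses on every access, while the 2-way configuration misses only on the first touch of each block. Real benchmark traces will be far less clear-cut, which is why measuring matters.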

More ideas, as they come, will show up on Piazza.
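As an example of the kind of behavioral model you might write before committing to hardware, here is a sketch of one common form of branch prediction (item 5): a table of 2-bit saturating counters indexed by low-order PC bits. This is an assumption about one possible design, not a description of the course infrastructure; the class name, the table_entries parameter, and the branch trace are hypothetical.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by low-order PC bits."""

    def __init__(self, table_entries=1024):
        self.entries = table_entries           # assumed to be a power of two
        self.counters = [1] * table_entries    # start at weakly not-taken

    def _index(self, pc):
        return (pc >> 2) % self.entries        # MIPS instructions are 4 bytes

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2   # True means predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor, outcomes):
    """outcomes is a list of (pc, taken) pairs in program order."""
    correct = 0
    for pc, taken in outcomes:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(outcomes)


# Hypothetical loop branch: taken 9 times, then not taken, repeated 10 times.
loop = [(0x400100, t) for _ in range(10) for t in [True] * 9 + [False]]
print(accuracy(TwoBitPredictor(table_entries=256), loop))
```

Because table_entries is a constructor argument, the same model can be rerun at several table sizes, which mirrors the parameter sweeps suggested in the Design Projects section.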

Design Reviews

Every other Thursday will be devoted to design reviews.  You will be expected to make progress each time.  These design reviews will be part of your grade, so doing most of the work at the end of the quarter will not be a reasonable strategy.

Topic Presentations

Each group must make two presentations, which must be completed in the first half of the quarter.  Possible topics include the optimizations listed above.  The presentation should take most of a class period.  You should present the topic (usually based on a seminal paper or article), and then sketch out a preliminary design/hardware approach.  Students not presenting will be expected to have read the discussed papers.

References for Topic Presentations:

See the Piazza page.

Schedule

Will be posted on Piazza.  Please check it regularly.