- Patterson & Hennessy, "Computer Organization and Design -- The Hardware/Software Interface", Morgan Kaufmann, Fourth Edition.
In this class, you will design an advanced general-purpose processor that executes the MIPS ISA. Your goal is to create a processor that executes a few target benchmarks as quickly as possible. Everyone will start with the same baseline design (which executes the MIPS ISA correctly). Because the benchmarks are fixed, instruction count is relatively fixed. Therefore, your primary domains for improvement are CPI and cycle time, and mostly CPI.
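The reasoning above follows the usual performance equation: execution time = instruction count × CPI × cycle time. A minimal sketch of that arithmetic is below. All numbers (instruction count, CPI values, cycle time) are hypothetical and are not measurements of the baseline design.

```python
def execution_time_ns(instruction_count, cpi, cycle_time_ns):
    """Execution time in nanoseconds: instruction count x CPI x cycle time."""
    return instruction_count * cpi * cycle_time_ns


def speedup(baseline_ns, optimized_ns):
    """How many times faster the optimized design runs the same benchmark."""
    return baseline_ns / optimized_ns


# Hypothetical benchmark: the instruction count is fixed by the program.
INSTRUCTION_COUNT = 1_000_000

# Hypothetical baseline vs. a design that lowers CPI at the same clock.
baseline = execution_time_ns(INSTRUCTION_COUNT, cpi=2.0, cycle_time_ns=20.0)
optimized = execution_time_ns(INSTRUCTION_COUNT, cpi=1.25, cycle_time_ns=20.0)

print(f"baseline:  {baseline:.0f} ns")
print(f"optimized: {optimized:.0f} ns")
print(f"speedup:   {speedup(baseline, optimized):.2f}x")
```

With the instruction count held constant, any speedup has to come from the CPI term, the cycle time term, or both; an optimization that lowers CPI but lengthens the cycle can still be a net loss.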
You will work in groups of one or two. You will be expected to implement a minimum of 5 architectural optimizations; beyond that, you will be judged primarily on the design's novelty and performance, as well as your analysis of the design. We expect your choices to be driven more by performance than by ease of implementation.
In addition, each group will also do at least one service project and one topic presentation. The presentation should be completed during the first half of the quarter, and the service project preferably earlier rather than later.
The grading scheme will be as follows:
Design Project & Analysis | 65%
Service Project, Contribution to Class Success | 20%
Topic Presentation | 15%
All of the information you need for the class will be available through Moodle (https://csemoodle.ucsd.edu). That will include Verilog infrastructure, code compilation infrastructure, suggestions for service projects, suggestions for optimizations, schedules for presentations, etc.
Altera has donated 10 Altera DE2 development boards for the class! To program the boards, we will be using the Quartus II web edition software, which is free to download. It can be found at:
http://www.altera.com/products/software/quartus-ii/web-edition/qts-we-index.html
Your design will be graded on three factors:
Completion of at least 5 Optimizations (biweekly design reviews) | 50%
Performance (as achieved by the optimizations) | 20%
Analysis and Novelty of Design, expressed through Paper and Presentation | 30%
To be successful in this class, you will need not only to build something that is cool and high performance, but also to measure it and communicate the novel elements of your design to the rest of the class. To do so, you will need to write an 8-page paper (along the lines of an IEEE Micro magazine paper) and give a 20-minute talk (along the lines of a HOTCHIPS presentation) that conveys these elements. We will also have student design reviews every other week, in which you demonstrate your progress in completing the design.
see the Moodle page.
The following optimizations are candidates for implementation in your design project:
1. Caches (both instruction and data, or a unified cache). Could also do an off-chip cache, since the DE2 board has SRAM.
2. Cache optimizations (victim cache, pseudo-associative, …)
3. A lockup-free cache can service hits (and possibly misses) while waiting for a miss to return from memory. With a lockup-free cache, the pipeline should stall on the use of data, not on a load miss.
4. Superscalar execution. Fetch, decode, execute multiple instructions per cycle.
5. Superpipelining. Run the clock with a cycle time that is roughly half that of the baseline (e.g., half of the ALU stage delay).
6. Branch prediction.
7. Speculative execution. Allow the pipeline to execute well beyond unresolved branches. This requires checkpointing (at least some) processor state at each branch, and a two-phase commit (write intermediate results to a buffer or pseudo register file).
8. Register renaming. Is this useful without out-of-order execution?
9. Out-of-order execution. Instructions issue to the execution units in an order different from the order in which they are fetched.
10. Multithreading. One pipeline with multiple program counters. Instructions from multiple threads are mixed or interleaved on the pipeline.
11. Multicore. Multiple CPUs (pipelines) connected via a bus or interconnection network.
12. Hardware prefetching (stream buffer). Build a support hardware unit that observes the cache miss stream, recognizes patterns, and begins prefetching future misses.
13. Multi-path execution. On some low-confidence branches, execute both targets of the branch.
14. Runahead execution. On a load miss, keep executing the instruction stream (just dropping stalled instructions). This may cause a future miss to be initiated. When the original load completes, you must recover back to the state following the load (similar to a branch mispredict recovery).
15. Value prediction. Identify instructions with predictable outcomes. If the instruction is stalled, provide the predicted outcome and proceed. Must be able to recover. A similar technique is instruction reuse – if an instruction executes with the same inputs as a previous instance, provide the same output as before. The latter is non-speculative, but only helps with multiple-cycle operations.
16. Other ideas: feel free to ask us or do something unquestionably cool.
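Before committing an optimization to Verilog, it can help to estimate its benefit with a small software model. As an illustration for item 6 (branch prediction), the sketch below models a table of 2-bit saturating counters, one common prediction scheme. It is not part of the course infrastructure: the class name, table size, PC-based indexing, and the branch trace are all hypothetical choices made for this example.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by low-order PC bits.

    Counter values 0-1 predict not taken; values 2-3 predict taken.
    """

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        # Start every entry at 1 (weakly not taken); an arbitrary choice.
        self.counters = [1] * (1 << index_bits)

    def _index(self, pc):
        # MIPS instructions are word-aligned, so drop the two low bits.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor, trace):
    """Fraction of branches in trace predicted correctly.

    trace is a sequence of (pc, taken) pairs in program order.
    """
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(trace)


# Hypothetical trace: one loop branch at a made-up PC, taken 9 times and
# then not taken once, with the whole loop executed 10 times.
LOOP_BRANCH_PC = 0x00400100
trace = [(LOOP_BRANCH_PC, i % 10 != 9) for i in range(100)]

print(f"accuracy: {accuracy(TwoBitPredictor(), trace):.2f}")
```

On this trace the model mispredicts the very first branch and each loop exit, which is the behavior a 2-bit counter is meant to have: a single not-taken outcome does not flip the prediction for the next loop entry. Combining a misprediction rate like this with the branch penalty of your pipeline gives a rough CPI estimate for the optimization.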
More ideas, as they come, will show up on Moodle.
Every other Thursday will be devoted to design reviews. You will be expected to make progress each time. These design reviews will be part of your grade, so doing most of the work at the end of the quarter will not be a reasonable strategy.
Each group must make a presentation, which must be completed in the first half of the quarter. Possible topics include the optimizations listed above. The presentation should take 20-30 minutes. You should present the topic (usually based on a seminal paper or article), and then sketch out a preliminary design/hardware approach. Students not presenting will be expected to have read the discussed papers.
References for Topic Presentations:
see the Moodle page.
Schedule
Will be posted on Moodle. Please check it regularly.