cse141L Lab 5: 5-Stage MIPS Processor


Changelog

May 6 In order to synchronize more effectively with the 141 class, we have extended this lab to be 3 weeks instead of just 2. Keep in mind, however, that this extra week is going to be borrowed from Lab 6, so if you plan on doing Lab 6, it would probably be in your best interest to finish Lab 5 before the deadline.
May 6 You will be doing a demo for this lab. The signup link will be emailed to you. Please sign up promptly. Before your demo, have your setup ready to go so that we can keep the process speedy for the people going after you.
May 28 For the demo, please download the demo test cases from here. To speed up the demo, I would recommend naming all of the file with "test" prefix and storing them in a separate directory and replacing the files during the simulation, as Modelsim will reload the files upon restarting the simulation.

Due: May 28

Overview

In Lab 5, you will speedup your single-cycle MIPS processor by adding pipeline stages. Specifically, you will be adding 5 pipeline stages to your design. At the end of this lab, you will be able to see your processor run faster and measure speedup using benchmark applications which you can compile using the MIPS cross-compiler toolchain, and simulate your design in Modelsim.

This lab (and the next one) will require much more work than the previous labs and will probably take longer than you expect, so start early. We advise you to add pipeline stages one at a time and have them working. Your grades will depend on the functional working of the processor. For example, a 4-stage working MIPS processor will get better grades than a 5-stage processor that fails to run the benchmarks. The key is

Adding stages one at a time is optional. You can also code 5 stages at the same time

General Notes

5-Stage MIPS Pipeline

To add pipeline stages you need to modify the datapath and add support for stall logic. You don't need to make any changes to the control logic. We will not be providing you a schematic for this lab. You do not have to implement load delay slots in your processor. Also you don't have to implement forwarding logic.

Before you starting coding your pipeline stages

  1. Identify the data path inputs and outputs of each stage of your piepline. For example, the ID/EX stage needs to register Read_Data1, Read_Data2, Sign Extendend Immediate, PC_next and the Instruction. You will need to know the instruction to implement the stall logic.
  2. Classify the control signals according to the stages they connect. Remember that the book does not include all the control signals.
  3. Write down the list of possible hazards with the instructions implemented so far. For example, there are no structural hazards in your design.

Stall Logic

The stall logic or the hazard detection logic is a combinatorial module. It takes as input the opcode and read/write register address of the instruction begin executed in each piepline stage. When a hazard is detected, the stall signal must be asserted that pushes bubbles through the pipeline. A bubble is essential executing a NOP instruction. We will run through an example of RAW data hazard.

Consider the following two instructions

INST-1: add $(10), $(11), $(12)

INST-2: add $(13), $(10), $(10)

INST-2 needs to read from register $(10) the value written by INST-1. To detect this hazard Lets look at another example when its a Control hazard.

INST-1: BNE $(10), $(11) offset

INST-2: NOP

INST-3: ADD ($10), $(11), $(12)

The outcome of the branch instruction is only known at the output of EX stage. If the instructions continue to execute through the pipeline, INST_2 and INST_3 will be executed even if the branch is taken. The stall logic needs to identify control dependency in the instructions and assert the stall signal until the outcome of the control is known.

Branch Delay Slots

In Lab 4, you did not implement branch delay slots as they are not needed in a single-cycle processor. However, in a single-issue pipelined processor, branch delay slots are useful if one has a compiler that can fill them with independent instructions, as it enables one to reduce the number of stalls. Because the cross compiler you have is capable of filling the branch delay slots, we ask that you implement branch delay slots for this lab. The only major change that this imposes is the implementation of jal and jalr, as you will now need to store pc+8 instead of pc+4 into the destination register. To enable your compiler to fill branch delay slots, remove the "-fno-delayed-branch" flag from the CCOPTS variable in your Makefile.

Question 1: Update your schematic from Lab 4 or draw a new schematic that shows the additional logic you will need to add to your processor to implement the pipeline registers. Label the input and output signals of each pipeline stage. You must draw this schematic rather than use the tools to infer it. You will greatly appreciate having this schematic as you complete you design. Be sure to include the names of your wires and their widths.
Question 2: Create a table that outlines the when the stall signal needs to be asserted. In the first column, include a row for each type instruction. Label the top of each other column by the type of instruction(s). Fill in the table which indicates when the stall signal is asserted for them.
Instruction Type I1/I2 R-Type (excluding branch and jump) Branch Instructions Jump Instructions I-Type Instructions
R-Type (excluding branch and jump)
I2.rs==I1.rd or I2.rt == I1.rd      
Branch Instruction
       
Jump Instruction
       
I-Type Instruction
       

A Warning on Stalls

As a reminder, the book assumes that the instruction and data memories are asynchronous. However, the FPGAs that we are targeting cannot support asynchronous ROMs of the kind of size we need. Therefore, we have made a few small changes to the design in the book. Namely, the instruction memory is indexed with the pc_next (the current PC+4) value. You did this in Lab 2, so don't worry about it if you don't remember. Things probably wouldn't be working if you didn't do this.

Here's the problem: when a hazard is detected in the Decode stage and the stall is asserted, the PC should be stalled (i.e., the output value should remain the same while the stall is asserted), but the value of pc_next is now two instructions ahead of the instruction in the Decode stage. Therefore, when the stall is finally deasserted, the instruction that is registered in the IF/ID pipeline register is the instruction after what you really want. In short, you sometimes skip an instruction.

In terms of a solution, you are free to implement whatever you want to mitigate this problem. For example, you might want to add logic to change which address you address the instruction memory with. You might want to have some kind of stall signal in the instruction memory. You might want to do some math to the pc value to adjust for this discrepancy. There are many solutions.

Register-File Forwarding

The book shows that writes to the register file occur in the first half of the Write Back stage so a (data-)dependent instruction can read the value that is being written back. However, you most likely did not implement your register file to handle this as we did not explicitly tell you to do this. Therefore, this is actually a data hazard and you cannot expect this to happen in your processors.

There are two solutions to this problem. The first is fairly straight-forward: if you discover a data dependency, stall until the value is written back to the register file. This is easy because you are already going to have to do this, just make sure you wait enough cycles.

The second solution is how the book implies it to be done. It is called "register file forwarding", in which "the read gets the value of the write in that clock cycle" (page 367, Figure 4.53 description). We came up with some pseudocode for implementing this in your register file module. Something like this would accomplish "register file forwarding" for the second read port:

always @* begin
...
        if (write_en && write_addr == read_addr2) begin
                read_data2 = write_data_in;
        end else begin
               read_data2 = registers[read_addr2];
        end
...
end

This could be extended for the first read port as well. The idea is that, if you are doing a write in a clock cycle as indicated by the write_en signal being asserted, you can compare the write_addr with the read_addr2 and just forward the results to the read_data2 port. If not, then it is just a normal read from the register file.

Of course, in the lab instructions, we said that you did not have to implement any forwarding. However, if you do not implement register-file forwarding, you cannot overlap the Write Back and Decode stages of dependent instructions.

Testing The Whole Processor

(Optional)In order to simulate the programs you will create for this lab, you should follow the instructions from above. Namely, you should remember to set the INIT_PROGRAM parameter for the instruction ROM and data memory.

Cross-Compiler

A cross-compiler has been created for you and is located on ieng6.ucsd.edu. Many of you already have accounts to this machine, so you can use your regular username and password. If you do not have an account, please contact us on the staff mailing list.

After you have logged into the ieng6.ucsd.edu, run the following commands:

$ prep cs141s
$ mkdir cse141L
$ cd cse141L
$ wget http://cseweb.ucsd.edu/classes/sp13/cse141L-a/Media/lab5/lab5-files.zip
$ unzip lab5-files.zip
$ ls

This will download and extract the files you will need for this lab. Here is a listing of the important files. Take a look around at what is going on in each file.

The instructions for using these programs is the same from Lab 4.

Unit Testing

We will test each type of hazard in the pipelined design. If you found groups of instruction that have similar instruction format and produce the same hazard, it is sufficient to test one pair of instructions from the group. You will want to write your test programs in assembly in this stage to guarantee that the compiler produces the correct commands.

You may write programs that test a single instruction or that test multiple instructions, but be sure to test every instruction (including the instructions from the previous lab). Don't forget to add your programs to the Makefile as described in the previous section. If you do not see the memh files that are prefixed with your program name after running 'make', it is probably because you forgot to add the correct targets to the Makefile.

Question 3: Simulate your processor for instructions that produce data and control hazard. In the report, include the waveform that shows all of the internal signals in your processor module. If the stall logic is common between multiple instructions then show only one waverform for those group of instructions

When you are satisfied that your hazard unit is behaving correctly, its time to move on to testing larger programs.

Calculating Cycles Per Instruction (CPI)

To calculate Cycles Per Instruction (CPI), you need to determine how many cycles your application took to run and divide it by the number of useful instruction (non-bubbles) that were retired. Calculating the number of cycles an application takes to run is straight-forward. Determining how many useful instructions were retired in that time block is a little harder.

In the single-cycle pipeline case, the CPI was 1.0 because you always retired one instruction per cycle. However, because of stalls and bubbles, you may retire a bubble or useful instruction every cycle. You can count the number of bubbles by including a counter in your processor module that counts how many cycles your stall signal(s) were asserted. Then, you can subtract the total number of cycles by the number of bubbles to give you the number of useful instructions that were retired. Your CPI should be >= 1.0.

There are other ways to determine the CPI, but we wanted to give you one idea.

Question 4: Run the "fib.c" program contained in the zip file we provided. If you have a faster implemention of fib.c from lab4, use that to measure the CPI. Using the number of clock cycles from simulation, measure the CPI of your pipelined design. Is it different from the CPI of Single Cycle Processor?

Benchmarking

We will use gcd application to benchmark the processor in this lab.

Question 5: Run the "gcd" program contained in the zip file we provided. If you do not see the memh files, you will need to run 'make' to generate them. Simulate your processor for 400us and put the output from the messages panel (usually at the bottom of the modelsim window) in your report. Using the number of clock cycles from simulation, measure the CPI of your pipelined design.

Wrap-up

Question 6: Once your processor is working correctly, run the whole synthesis and place and route flow and include the Fmax (and minimum period in nanoseconds), Total logic elements, and the Total registers statistics from your design.
Question 7: First, try to guess what the critical path in your design is. Say where your critical path starts, where it goes to, and where it ends. For example, you may say that the critical path is from the output of the PC register, through the adder, and back to the register. Now, try to use the tools to identify the critical path in your design. Instructions are on the Wiki. If your initial guess was wrong, try to explain why the path the tools report is correct. Think in terms of the amount of logic that is happening in a path, how many gates a wire might be connecting to, etc.

Deliverable

  • Submit your lab as two files (the project PDF and a zip file) at ted.ucsd.edu. The link to upload the file is available under "Assignments"->"Lab 5: 5-Stage MIPS". The deadline for submitting your report is Tuesday, May 28th by 3:00PM . Do not send softcopies to TAs email ids.
  • The zip file needs to:
    • Be named cse141L-lab4-LastName-FirstName.zip with your last name and first name substituted for LastName and FirstName, respectively.
    • Include a single-pdf report that answers all of the questions found in the lab description - including questions, verilog source code, graphs, screen-shots, etc. There are many tools out there capable of integrating text and graphics and producing PDF files (OpenOffice does a pretty good job).
    • Your project source code and quartus project files.
    • Your source for the test programs you made.
  • You will have a lab interview for this lab. Please have your assignment set up and ready to go before your meeting in order to keep the process speedy. Sign-up using the google doc link that was emailed to you.

Due: May 28