Due: May 28
In Lab 5, you will speed up your single-cycle MIPS processor by pipelining it. Specifically, you will add 5 pipeline stages to your design. At the end of this lab, you will be able to see your processor run faster and measure the speedup using benchmark applications, which you compile with the MIPS cross-compiler toolchain and simulate on your design in ModelSim.
This lab (and the next one) will require much more work than the previous labs and will probably take longer than you expect, so start early. We advise you to add pipeline stages one at a time and get each one working before adding the next. Your grade will depend on whether the processor functions correctly. For example, a working 4-stage MIPS processor will earn a better grade than a 5-stage processor that fails to run the benchmarks. The key is
To add pipeline stages, you need to modify the datapath and add stall logic. You don't need to make any changes to the control logic. We will not be providing a schematic for this lab. You do not have to implement load delay slots or forwarding logic in your processor.
Before you start coding your pipeline stages
The stall logic (also called the hazard detection logic) is a combinational module. It takes as input the opcode and the read/write register addresses of the instruction being executed in each pipeline stage. When a hazard is detected, the stall signal must be asserted, which pushes bubbles through the pipeline. A bubble is essentially a NOP instruction moving through the pipeline. We will run through an example of a RAW data hazard.
Consider the following two instructions:
INST-1: add $(10), $(11), $(12)
INST-2: add $(13), $(10), $(10)
INST-2 needs to read from register $(10) the value written by INST-1. To detect this hazard
INST-1: BNE $(10), $(11) offset
INST-3: ADD $(10), $(11), $(12)
The outcome of the branch instruction is known only at the output of the EX stage. If the instructions continue to execute through the pipeline, INST-2 and INST-3 will be executed even if the branch is taken. The stall logic needs to identify the control dependency between the instructions and assert the stall signal until the outcome of the branch is known.
In Lab 4, you did not implement branch delay slots as they are not needed in a single-cycle processor. However, in a single-issue pipelined processor, branch delay slots are useful if one has a compiler that can fill them with independent instructions, as it enables one to reduce the number of stalls. Because the cross compiler you have is capable of filling the branch delay slots, we ask that you implement branch delay slots for this lab. The only major change that this imposes is the implementation of jal and jalr, as you will now need to store pc+8 instead of pc+4 into the destination register. To enable your compiler to fill branch delay slots, remove the "-fno-delayed-branch" flag from the CCOPTS variable in your Makefile.
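The pc+8 requirement can be sketched in Verilog as shown here. This is only an illustration: the module name `link_addr` and its port names are hypothetical, and it assumes `pc` holds the address of the jal or jalr instruction itself rather than an already incremented value.

```verilog
// Hypothetical link-address computation for jal/jalr with a branch delay slot.
// Assumption: pc is the address of the jal/jalr instruction itself.
module link_addr (
    input  wire [31:0] pc,         // address of the jal/jalr instruction
    output wire [31:0] link_value  // value written to the destination register
);
    // pc + 4 holds the delay-slot instruction, which always executes,
    // so the callee must return to the instruction after it: pc + 8.
    assign link_value = pc + 32'd8;
endmodule
```

If your datapath only has pc_next (PC + 4) available at the point where the link value is produced, adding 4 to that value gives the same result.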
| Instruction Type I1/I2 | R-Type (excluding branch and jump) | Branch Instructions | Jump Instructions | I-Type Instructions |
| I2.rs == I1.rd or I2.rt == I1.rd |
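The condition I2.rs == I1.rd or I2.rt == I1.rd could be written as a small combinational check along the following lines. This is a minimal sketch, not a required interface: the module name `raw_hazard_detect` and all port names are hypothetical, and it assumes that I2 (the instruction in Decode) reads both rs and rt and that I1 (an older instruction still in the pipeline) writes rd. Skipping writes to register $0 is an extra assumption that relies on $0 being hardwired to zero in MIPS.

```verilog
// Hypothetical RAW hazard check between the instruction in Decode (I2)
// and one older instruction (I1) that has not yet written back.
module raw_hazard_detect (
    input  wire [4:0] i2_rs,        // first source register of I2
    input  wire [4:0] i2_rt,        // second source register of I2
    input  wire [4:0] i1_rd,        // destination register of I1
    input  wire       i1_reg_write, // 1 if I1 will write the register file
    output wire       stall         // 1 if I2 must wait for I1
);
    // Assumption: $0 always reads as zero, so a write to it
    // never creates a real dependency.
    wire writes_real_reg = i1_reg_write && (i1_rd != 5'd0);

    // Stall when either source of I2 matches the destination of I1.
    assign stall = writes_real_reg &&
                   ((i2_rs == i1_rd) || (i2_rt == i1_rd));
endmodule
```

For the example pair above, i1_rd is 10 and both i2_rs and i2_rt are 10, so stall is asserted. In a full design you would likely need one such comparison for each later pipeline stage that can still hold an unwritten result, with the individual stall outputs ORed together.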
As a reminder, the book assumes that the instruction and data memories are asynchronous. However, the FPGAs that we are targeting cannot support asynchronous ROMs of the size we need. Therefore, we have made a few small changes to the design in the book. Namely, the instruction memory is indexed with the pc_next value (the current PC + 4). You did this in Lab 2, so don't worry if you don't remember it; things probably wouldn't be working if you hadn't done it.
Here's the problem: when a hazard is detected in the Decode stage and the stall is asserted, the PC should be stalled (i.e., the output value should remain the same while the stall is asserted), but the value of pc_next is now two instructions ahead of the instruction in the Decode stage. Therefore, when the stall is finally deasserted, the instruction that is registered in the IF/ID pipeline register is the instruction after what you really want. In short, you sometimes skip an instruction.
You are free to implement whatever solution you want to mitigate this problem. For example, you might add logic to change which address is presented to the instruction memory, add some kind of stall signal to the instruction memory, or do some math on the pc value to adjust for this discrepancy. There are many solutions.
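As one illustration of the first option (changing the address presented to the instruction memory), the sketch below selects between the held PC and pc_next. The module name `imem_addr_select` and its ports are hypothetical, and it assumes a synchronous instruction memory that registers its output on the same clock edge that updates the PC. Whether this is sufficient depends on how the rest of your fetch stage is built.

```verilog
// Hypothetical address select for a synchronous instruction memory.
// Assumption: the memory registers its output on the same clock edge
// that updates the PC register.
module imem_addr_select (
    input  wire [31:0] pc,        // registered PC, held while stall is asserted
    input  wire [31:0] pc_next,   // PC + 4, the normal fetch address
    input  wire        stall,     // asserted by the hazard detection logic
    output wire [31:0] imem_addr  // address presented to the instruction memory
);
    // While stalled, re-read the word at the held PC instead of advancing,
    // so the memory output still matches the PC when the stall is released.
    assign imem_addr = stall ? pc : pc_next;
endmodule
```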
The book shows that writes to the register file occur in the first half of the Write Back stage so a (data-)dependent instruction can read the value that is being written back. However, you most likely did not implement your register file to handle this as we did not explicitly tell you to do this. Therefore, this is actually a data hazard and you cannot expect this to happen in your processors.
There are two solutions to this problem. The first is fairly straightforward: if you discover a data dependency, stall until the value is written back to the register file. This is easy because you already have to stall for data hazards; just make sure you wait enough cycles.
The second solution is how the book implies it to be done. It is called "register file forwarding", in which "the read gets the value of the write in that clock cycle" (page 367, Figure 4.53 description). We came up with some pseudocode for implementing this in your register file module. Something like this would accomplish "register file forwarding" for the second read port:
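always @* begin
    // Forward the value being written this cycle if it targets the register being read.
    if (write_en && write_addr == read_addr2) begin
        read_data2 = write_data_in;
    end else begin
        // Otherwise, perform a normal read from the register file.
        read_data2 = registers[read_addr2];
    end
end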
This could be extended for the first read port as well. The idea is that, if you are doing a write in a clock cycle as indicated by the write_en signal being asserted, you can compare the write_addr with the read_addr2 and just forward the results to the read_data2 port. If not, then it is just a normal read from the register file.
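Extending the idea to both read ports might look like the following register file sketch. The module name `regfile_fwd` and the `clk` and `read_addr1`/`read_data1` ports are hypothetical additions; the other signal names follow the pseudocode above. Treating register $0 as a hardwired zero is an assumption about your existing register file, so adapt or drop that part if yours handles $0 differently.

```verilog
// Hypothetical 32 x 32-bit register file with forwarding on both read ports.
module regfile_fwd (
    input  wire        clk,
    input  wire        write_en,
    input  wire [4:0]  write_addr,
    input  wire [31:0] write_data_in,
    input  wire [4:0]  read_addr1,
    input  wire [4:0]  read_addr2,
    output wire [31:0] read_data1,
    output wire [31:0] read_data2
);
    reg [31:0] registers [0:31];

    // Synchronous write. Assumption: $0 is never written.
    always @(posedge clk) begin
        if (write_en && (write_addr != 5'd0))
            registers[write_addr] <= write_data_in;
    end

    // Combinational reads: $0 reads as zero, a matching in-flight write
    // is forwarded, and everything else is a normal array read.
    assign read_data1 = (read_addr1 == 5'd0)                   ? 32'd0 :
                        (write_en && write_addr == read_addr1) ? write_data_in :
                                                                 registers[read_addr1];

    assign read_data2 = (read_addr2 == 5'd0)                   ? 32'd0 :
                        (write_en && write_addr == read_addr2) ? write_data_in :
                                                                 registers[read_addr2];
endmodule
```

The reads use continuous assignments instead of an always @* block; the behavior is intended to match the pseudocode, and either style should work.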
Of course, in the lab instructions, we said that you did not have to implement any forwarding. However, if you do not implement register-file forwarding, you cannot overlap the Write Back and Decode stages of dependent instructions.
(Optional) In order to simulate the programs you will create for this lab, you should follow the instructions from above. Namely, remember to set the INIT_PROGRAM parameter for the instruction ROM and data memory.
A cross-compiler has been created for you and is located on ieng6.ucsd.edu. Many of you already have accounts on this machine, so you can use your regular username and password. If you do not have an account, please contact us on the staff mailing list.
After you have logged into ieng6.ucsd.edu, run the following commands:
$ prep cs141s
This will download and extract the files you will need for this lab. Here is a listing of the important files. Take a look around at what is going on in each file.
The instructions for using these programs are the same as in Lab 4.
We will test each type of hazard in the pipelined design. If you find groups of instructions that have a similar instruction format and produce the same hazard, it is sufficient to test one pair of instructions from the group. You will want to write your test programs in assembly at this stage to guarantee that the compiler produces exactly the instructions you intend.
You may write programs that test a single instruction or that test multiple instructions, but be sure to test every instruction (including the instructions from the previous lab). Don't forget to add your programs to the Makefile as described in the previous section. If you do not see the memh files that are prefixed with your program name after running 'make', it is probably because you forgot to add the correct targets to the Makefile.
When you are satisfied that your hazard unit is behaving correctly, it's time to move on to testing larger programs.
To calculate Cycles Per Instruction (CPI), you need to determine how many cycles your application took to run and divide that by the number of useful instructions (non-bubbles) that were retired. Calculating the number of cycles an application takes to run is straightforward. Determining how many useful instructions were retired in that time is a little harder.
In the single-cycle case, the CPI was 1.0 because you always retired one instruction per cycle. In the pipelined processor, however, stalls mean that each cycle retires either a bubble or a useful instruction. You can count the number of bubbles by including a counter in your processor module that counts how many cycles your stall signal(s) were asserted. Then, subtract the number of bubbles from the total number of cycles to get the number of useful instructions that were retired. Your CPI should be >= 1.0.
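The counter approach could be sketched as follows. The module name `cpi_counters` and its ports are hypothetical, and the sketch assumes a single stall signal that is asserted for exactly one cycle per bubble. If you have several stall signals, you would need to combine them so that each bubble is counted once.

```verilog
// Hypothetical cycle and bubble counters for measuring CPI in simulation.
// Assumption: stall is high for exactly one cycle per inserted bubble.
module cpi_counters (
    input  wire        clk,
    input  wire        reset,        // synchronous, active high
    input  wire        stall,        // asserted when a bubble is inserted
    output reg  [31:0] cycle_count,  // total cycles since reset
    output reg  [31:0] bubble_count  // cycles in which stall was asserted
);
    always @(posedge clk) begin
        if (reset) begin
            cycle_count  <= 32'd0;
            bubble_count <= 32'd0;
        end else begin
            cycle_count <= cycle_count + 32'd1;
            if (stall)
                bubble_count <= bubble_count + 32'd1;
        end
    end
endmodule
```

With these two values, CPI = cycle_count / (cycle_count - bubble_count). For example, a run of 1200 cycles with 200 bubbles retires 1000 useful instructions, giving a CPI of 1.2.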
There are other ways to determine the CPI, but we wanted to give you one idea.
We will use the gcd application to benchmark the processor in this lab.