Added some clarifying information about the specific workings of the dmem and imem. A mini-section has been added that addresses the communication protocol behind these two modules.
As a result of the above change, the state machine now has a new state (DMEM Read). More information is available in the state machine image and in the table below it.
After some deliberation, we've decided that another (fairly simple) construct needs to be added to the Datapath that was
not explicitly mentioned in the last lab. This is described below in the "Datapath Addition" section. Hopefully this
change should make your lives much easier. This also includes a signal addition to the control module (named
We've tried to mark all of the relevant changes in red text.
Added a small change to the Datapath PDF which should fix a small bug manifested in the changes that were posted yesterday.
Also clarified the intended input placement of the instruction MUX on the Datapath PDF. Here's a summary:
The due date for lab 3 has been shifted to April 29th.
A few design changes have been made to simplify (and clean up) the TrivialScalar design. The following changes are the most notable:
Due: April 29
In Lab 3 you will extend the design demonstrated in Lab 2 to implement the control unit for TrivialScalar. As you probably realized throughout Lab 2, it wasn't very easy to test each individual module completely; however, with the addition of Lab 3 (the control unit), testing will be much more straightforward.
As in Lab 2, our aim for this lab is to help each of you to learn/master the skills necessary to succeed in the coming labs. Please use one of the many resources you have available to you (the TAs, WebBoard, and your classmates) if you become stuck.
NOTE: You must complete Lab 3 in the same groups you completed Lab 2 with.
Please remember that you must conform to the class coding standards. They are available here. If you find a bug in them (e.g., they are causing you to do something horribly ugly), please let the professor or one of the TAs know.
The greatest common divisor example is available here. Create a new project and import all the files. If you simulate using gdb_tbw.tbw, it will run a short test. This example demonstrates the coding standards for the datapath and the modules therein.
Here are some notes about how IO should work in TrivialScalar. You should read through this before starting the implementation below. Please note that all of the signals shown below (in the diagrams) should be held for an entire cycle.
On a READ instruction, the in_req signal should be asserted for an entire cycle (as shown in the following figure), and the processor should block until the IO device responds. The IO device may take any number of cycles to respond to the request, but the earliest it may respond is the following cycle. The figure below demonstrates the request signal being asserted and the IO device responding in the third cycle following the initial request by placing data on the in_data line and asserting the in_ack line for an entire cycle. The data is only valid while the in_ack line is asserted.
WRITE instructions work similarly: the out_req signal is asserted while the data to be passed to the IO device is placed on the out_data bus (as shown in the following figure). The IO device may respond any number of cycles later by asserting the out_ack wire. As with a READ instruction, the earliest time for a response is the following cycle. The figure below demonstrates the protocol for communicating with the IO device. The processor must stall until the IO device has completed the request, which the device signals with a cycle-long assertion of its acknowledge wire.
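The two handshakes described above can be modelled in simulation with one small responder. This is only a sketch for a test bench, not the IO device used for grading: the module name ack_after_delay, the DELAY parameter, and the internal registers are all hypothetical. It assumes the request is held for exactly one cycle, as the protocol requires.

```verilog
// Simulation-only sketch: answers a one-cycle request with a one-cycle
// acknowledge DELAY cycles later (DELAY >= 1). All names are hypothetical.
module ack_after_delay #(parameter DELAY = 3) (
    input      clk,
    input      reset,
    input      req,   // in_req or out_req from the processor
    output reg ack    // in_ack or out_ack back to the processor
);
    reg     pending;  // a request is waiting for its response
    integer count;    // clock edges left before the acknowledge

    always @(posedge clk or posedge reset) begin
        if (reset) begin
            ack     <= 1'b0;
            pending <= 1'b0;
            count   <= 0;
        end else begin
            ack <= 1'b0;                 // acknowledge lasts a single cycle
            if (pending) begin
                if (count == 0) begin
                    ack     <= 1'b1;
                    pending <= 1'b0;
                end else begin
                    count <= count - 1;
                end
            end else if (req) begin
                if (DELAY <= 1) begin
                    ack <= 1'b1;         // earliest legal response: next cycle
                end else begin
                    pending <= 1'b1;
                    count   <= DELAY - 2;
                end
            end
        end
    end
endmodule
```

With DELAY set to 3, a request made in one cycle is acknowledged in the third cycle after it. For the read side, the test bench would still need to drive in_data during the cycle in which in_ack is high.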
Here is some insight as to how the memory modules work in TrivialScalar. Some more contextual information is available in the table below.
First, it is important to note that the dmem module in TrivialScalar is positive-edge triggered, whereas the instruction memory is asynchronous. What this means to you, as the architect, is that a read or a write in the dmem can only occur at a positive edge of the clock signal. All reads through the imem will still occur in a single cycle. The implication is that a read of a memory location in the dmem requires an entire cycle to show up on its output bus. The direct impact of this observation is pointed out below and in the following section.
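The difference between the two read styles can be seen in a small behavioral sketch. This is not the generated memory core; the module name, port names, and the 8-bit widths are assumptions made for illustration only.

```verilog
// Behavioral sketch contrasting a clocked read with an asynchronous read.
// Names and widths are hypothetical; the real memories are generated cores.
module mem_read_sketch (
    input            clk,
    input            write_en,
    input      [7:0] addr,
    input      [7:0] w_data,
    output reg [7:0] sync_r_data,   // dmem-like: updates only on a clock edge
    output     [7:0] async_r_data   // imem-like: follows addr combinationally
);
    reg [7:0] mem [0:255];

    // Clocked port: both the write and the read happen at the positive
    // edge, so read data appears one cycle after addr is presented.
    // On a write, sync_r_data picks up the old contents of mem[addr].
    always @(posedge clk) begin
        if (write_en)
            mem[addr] <= w_data;
        sync_r_data <= mem[addr];
    end

    // Asynchronous port: the output changes as soon as addr changes.
    assign async_r_data = mem[addr];
endmodule
```

In a waveform, async_r_data changes in the same cycle as addr, while sync_r_data lags by one clock edge. That lag is the reason a load needs an extra cycle.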
For an LD instruction, the following should take place: the read_write_req signal should be asserted, the proper input r_data should be selected via the reg_sel signal, and the regfile_write_en signal should be asserted at the proper time. An access to the dmem takes a full cycle, so the LD operation takes a total of two cycles. This is part of the state machine design detailed in the next section, and is emphasized throughout that section.
For an ST instruction, the protocol is more straightforward. In particular, dmem_write_en should be asserted in order for the dmem to latch its inputs at addr. Although this operation takes time, just as the LD operation does, we have eliminated the extra propagation delay between the dmem and the register file, so we don't have to stall for an extra cycle. The value will have been stored into the dmem by the following cycle.
In an effort to clean up the Datapath design, we ask that you please remove any additions from between the
imem and the
The main focus of this lab is creating a single module: control. We'll go ahead and call it control.v.
// Parameters:
//   INST_WIDTH -> Width of an instruction
module control #(parameter INST_WIDTH = 16)
(
    // Global clock
    input clk,
    // Global reset
    input reset,

    // Control-Datapath interface
    // Specifies which PC should be used next cycle
    output [1:0] run_stall_reset,
    // The entirety of the current instruction from the Decode unit
    input [INST_WIDTH - 1 : 0] inst,
    // Write enable for the dmem
    output dmem_write_en,
    // Request for a read or write from the dmem
    output read_write_req,
    // Selects the data to be written to the register file
    output reg_sel,
    // Write enable for the register file
    output regfile_write_en,
    // Specifies the op code for the ALU unit
    output op_code,

    // The control interface for IO
    // More details to follow
    input in_ack,
    input out_ack,
    output out_req,
    output in_req
);
We'll be designing the control unit as a state machine. If this is unfamiliar to you, consult Verilog Design Examples by Krste Asanovic. If the concept of state machines is still foreign, please talk to one of the TAs.
The above image shows the different states that the control module (and therefore the processor) can be in. The table below explains the different states:
||This state is representative of the normal processor execution, where each PC should advance by one each cycle. In this state, the processor is not looking for any particular signals from the outside world. Be aware that this state will have to perform some preparation when moving from one state to the next (like stalling the processor for a cycle).|
Although we might like for a load from the
This state is in effect when the control module encounters a
This state is similar to the
This state just signifies when the processor has seen a
It is important to note the many different ways that this state machine can be designed. For more information on our requirements
for the design of this state machine, consult the document from last lab:
Verilog Design Examples by Krste Asanovic.
What I'm referencing here is the state machine design detailed throughout the above PDF. Your control module should be similar
in concept to the GCD state machine (except somewhat more complicated). Don't forget to use only three
initialize your signals on every cycle to avoid repeated statements, and to comply with the class Verilog coding standards throughout
the design process.
Below are some tips for the generation of each signal. I won't give away much information, but just enough to get you started.
The run_stall_reset signal chooses which of the three signals will propagate to become the Next PC. This signal might be particularly tricky because the diagram isn't entirely specific about which inputs to the MUX generate which outputs. To be more specific, the ordering of the signals entering the PC MUX shouldn't matter in terms of design rules. If you or your group feel that you can generate a more efficient control implementation by switching around which signal enters which terminal of the MUX, please go ahead and do so. Your implementation must continue to adhere to the other aspects of the specified design. Be aware that this signal (among others) depends greatly on the instruction being executed and the next_state the control unit will enter.
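As a rough illustration of what a select ordering looks like, here is a sketch of a three-input PC MUX. The Datapath PDF remains the authority: the encoding below, the SEL_* names, the PC_WIDTH parameter, and the assumption that the three candidates are PC + 1, the current PC, and zero are all hypothetical.

```verilog
// Sketch of a PC MUX driven by run_stall_reset.
// The select encoding and the three candidate values are assumptions.
module pc_mux_sketch #(parameter PC_WIDTH = 8) (
    input      [1:0]            run_stall_reset,
    input      [PC_WIDTH - 1:0] pc,
    output reg [PC_WIDTH - 1:0] next_pc
);
    localparam SEL_RUN   = 2'd0;  // advance to the next instruction
    localparam SEL_STALL = 2'd1;  // hold the current PC
    localparam SEL_RESET = 2'd2;  // return to address zero

    always @(*) begin
        case (run_stall_reset)
            SEL_RUN:   next_pc = pc + 1'b1;
            SEL_STALL: next_pc = pc;
            SEL_RESET: next_pc = {PC_WIDTH{1'b0}};
            default:   next_pc = {PC_WIDTH{1'b0}};  // unused encoding
        endcase
    end
endmodule
```

Choosing an encoding that lines up with your state encoding can reduce the amount of logic needed in the control module.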
On another note, be aware that your reset signals should be asynchronous throughout the "stateful" parts of your processor. This
allows for your
always blocks to trigger on the edge of a reset signal instead of relying on your reset signal to stay
high until a positive clock edge is generated.
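A minimal sketch of a state register with an asynchronous reset is shown below. The module name, the 2-bit state width, and the STATE_RESET value are placeholders rather than part of the required design.

```verilog
// Sketch of a state register with an asynchronous, active-high reset.
// The state width and reset encoding are hypothetical.
module state_reg_sketch (
    input            clk,
    input            reset,
    input      [1:0] next_state,
    output reg [1:0] state
);
    localparam STATE_RESET = 2'b00;

    // Listing reset in the sensitivity list lets the block fire on the
    // rising edge of reset, without waiting for a clock edge.
    always @(posedge clk or posedge reset) begin
        if (reset)
            state <= STATE_RESET;
        else
            state <= next_state;
    end
endmodule
```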
Question 1: Make a table of all of the control signals that are generated by the Control unit and specify if each signal depends on the state of the processor, doesn't depend on the state of the processor, or a little of both.
This signal should just be asserted when it is time to write to the
This signal should just specify when there is either a read or a write occurring at the
This signal is similar to the run_stall_reset signal in that it chooses from multiple inputs through the use of a MUX. In this case, it is also recommended that you choose an ordering for the MUX inputs that simplifies your Control code.
This is just the write enable for the register file. Beware that for one particular instruction, the cycle in which you assert this signal is crucial for proper operation of the instruction and the processor.
Question 2: Which instruction is referenced at the end of the last paragraph and why would we not want to assert it immediately?
This signal should just specify which operation we would like to execute in the ALU.
Please refer to the IO explanation above for the operation of these signals.
Now that we have all the individual pieces designed and functioning, we can go ahead and put together our processor. We'd like to make a high-level file called processor.v and instantiate and connect our two large modules together. Here's the interface for that:
module processor
(
    // Global clock
    input clk,
    // Global reset
    input reset,

    // The data interface for IO
    input [7:0] in_data,
    output [7:0] out_data,

    // The control interface for IO
    input in_ack,
    input out_ack,
    output out_req,
    output in_req
);
For starters, go ahead and test your TrivialScalar implementation with the provided test program and testbench.
Here's a link to a test program in (*.coe) format that you can use to verify
your processor implementation. This should just be attached to your
imem module, just as the
memory generation tutorial specifies. You can change the properties of a generated memory by double clicking
on the module (
imem in this case) and bringing up the Block Memory Generator. On
page 3 of 4, you'll have the option of changing the init file (*.coe file). This is where you'll want
to specify this new file.
Question 3: Provide a translation of the provided *.coe file into assembly.
In addition to this file, you'll need to stimulate the inputs specific to this *.coe file (speaking in terms of the IO module). Here's a *.zip which contains the test bench files that you can use in conjunction with the above test.coe file.
Question 4: How many cycles after the first IO-based instruction does the IO device respond? Please provide a screenshot of a waveform showing this.
After using the provided *.coe file and test bench above, we'd like you to generate your own set of testing tools.
We would like each group to write a test program that takes two inputs from the IO module, operates on them, and then sends the result back down the IO path. Assuming the two inputs are X and Y (received in that order from the IO unit), the function we'd like you to perform is ((X + Y) * Y - 3). We'd like you to do this three times in your code (remember that this ISA doesn't have a branch, so no loops!), writing your result back to the IO device after each computation. Please generate your own test bench that provides your processor with the proper IO responses. I would recommend varying the delay of the IO responses so that each of the three iterations has some difference in timing, as this is most likely how we'll be testing this.
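When checking waveforms, it can help to have the expected result computed for you. The sketch below is one way to do that in simulation. It assumes the arithmetic wraps at 8 bits, matching the width of in_data and out_data; whether your ALU behaves the same way depends on your datapath. The module and function names are hypothetical.

```verilog
// Simulation-only sketch: computes ((X + Y) * Y - 3) with 8-bit wraparound.
module expected_result_sketch;
    function [7:0] expected;
        input [7:0] x;
        input [7:0] y;
        begin
            expected = (x + y) * y - 8'd3;
        end
    endfunction

    initial begin
        // (2 + 3) * 3 - 3 = 12
        $display("X=2, Y=3 -> %0d", expected(8'd2, 8'd3));
    end
endmodule
```

For X = 2 and Y = 3, this prints 12, which is the value you would expect to see on out_data for that pair of inputs.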
Note that this test bench will test your IO instructions, but you'll have to verify visually that your protocol matches the one listed in this write-up (since this test bench cannot examine and report on the protocol itself).
Question 5: As described in the paragraph above, create your own *.coe file with the above parameters and a test bench waveform file (*.tbw, along with its counterpart *.tfw file) to accompany it. Feel free to use a more loosely timed clock for this portion (although your propagation time shouldn't be above 10ns, ideally). Provide both the contents of the *.coe file and the assembly you derived it from.
Now that your processor should be completely working (and these results will really only be valid in such a case), you should go ahead and perform some performance-based analysis on your design. You should use the static timing report generation tool to accomplish this. You will need to change a few settings in order to get the results that are required of you:
Question 6: Go ahead and implement your design and generate the above mentioned timing report. What is the maximum achievable frequency of your design? (Hint: The critical path is the path with the longest delay. Shortening this path increases your maximum frequency.) Describe your critical path through the circuit (e.g. PC_Mux->PC_DFF->etc...).
Use localparams to represent different states and typical inputs/outputs, so that your code isn't inundated with 4'b0000, 4'b0001, etc. with no specific meaning tied to those values (this is similar to using enums or #defines in C/C++).
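As a small sketch of this style, the fragment below names its state encodings instead of using raw literals. Only the DMEM Read state is taken from this write-up; the other name, the encodings, and the module itself are hypothetical.

```verilog
// Sketch: named state encodings instead of bare literals.
// State names (other than DMEM Read) and encodings are assumptions.
module localparam_sketch (
    input      [1:0] state,
    output reg       waiting_on_dmem
);
    localparam STATE_RUN       = 2'd0;
    localparam STATE_DMEM_READ = 2'd1;

    always @(*) begin
        // Reads as intent rather than as a comparison with 2'b01.
        waiting_on_dmem = (state == STATE_DMEM_READ);
    end
endmodule
```

If an encoding changes later, only the localparam line needs to be edited.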
NOTE: The Xilinx Core Generator may not function correctly in Linux. In our experience, it produces errors late in the generation process. You can either brave it in Linux, or you can use the Windows version (in the lab, if need be). If you've had success with the Core Generator in Linux, please let one of the TAs know.