Spring
2006 CSE 141L
Project/Computer
Architecture
Get started early!
Do not optimize too early- just try getting something working first then optimize.
Since text parsing is outside the scope of CSE141L (covered by CSE131), we have provided a C/C++ and Java lexer that you can use. These tools are completely optional, so feel free to develop your own text processing tool if you wish. The provided code only helps you process a very simple assembly grammar with some debug messages, so you may need to enhance the grammar, and you will definitely need to provide the higher level semantic processing for your instruction set. The C/C++/Java tools run on Linux.
Some notes about building lexers are found here.
Test Cases- (May
2, 2006)
Look in the CSE141L Spring 2005 webpage. There are some test cases there that you can recycle.
Turnin Page is
Up- (May 7, 2006)
Please provide instructions on how to compile and run your assembler and iss in a "readme.txt" file. Include this in your turnin file.
Lab
3
A lot of people are running into trouble with huge numbers of pins when doing the lab3 datapath. So if this is happening to you, you're not alone.
Solutions to stop the madness (?):
1) Partition your design to minimize wiring across symbol interfaces. (Of course)
2) TA built bus port connector primitives - There are two 8-bit wide, bus port primitives in a public library, providing input and output functionality. The library is located in: /home/linux/ieng6/cs141s/public/lab3_public_library.clf. Look for IN8 and OUT8. More info below.
3) Build your own bus port connector. This is highly complex, but doable. Links on how to do this provided below, or get your TA to demo making a custom port.
4) "Just suck it up", and get lots of practice laying out wires. According to a past CSE141L TA, this approach has lower risk than messing around with busses. He also suggests learning the CTRL key pin naming trick described on page 53 of your Capilano "Logicwork5" book.
Take your pick!
TA build bus port connector
primitives
There is an example of how to use these bus port connectors to get bus pins on a symbol view of a circuit called PASSTHRU in the library. Click through the PASSTHRU symbol to see the attached circuit. The circuit demonstrates the purpose of using the bus port connectors to tell the symbol editor that you want to use a bus pin. One important restriction with bus port connectors is that signals inside the port/pin are connected by name, and if the signal is not connected you will get an error (that annoying low severity error). Fortunately this restriction is easily avoided by using the Bus Pin Info pop-up menu to map the desired signals, as demonstrated in the following two screen captures, between the pin name and bus name. A second very important restriction is naming your bus ports connector instances. You can see in the first screen capture below that both connectors are named.
Bus Pin Info

Matching signals - You must make sure that the signal's pin name maps appropriately to the bus name.

Use a probe to verify.connectivity, or hex keypad and display.
probe
![]()

Custom Bus Port Connectors
Here are some links to LogicWork5 manuals and general instructions that also describe building a custom bus port connector.
Look at page 10:
http://www.rose-hulman.edu/Class/ee/yoder/ece130/LogicWorks Tutorials.pdf
Capilano help on creating bus port connectors for LogicWork3:
http://www.capilano.com/tech_notes/portconns.html
Here's a LogicWorks5 reference, that often times is better than the Capilano book.
http://www.cs.utsa.edu/~xsu/cs2511s05/LogicWorks5_%20Reference.pdf
The PDF are useful references in general, so I've put a copy in the cs141s/public directory.
Lab 4
Here's the follow up to the D-MEM initialization from lecture that you
will need for lab4. First we create a RAM using the memory wizard. Make sure to
select "RAM- Random Access Memory"
Memory wizard button


It will ask some configuration questions, and store it in a library.
Drag that RAM from the parts palette, and place on circuit window. Now that
you've created a RAM, highlight the RAM and click on the memory wizard again,
and select "Edit selected device" button.

The following window asks if you want to provide some initial input to the RAM. You can enter it manually, or read that input from a file.
It couldn't be any simpler!
For this lab, we want you to simulate a direct-mapped instruction cache
in the ISS for performance modeling. The end goal of this mini-lab is to get a
good understanding of how caches help performance, and how one sizes a cache
for a given application.
Cache Characteristics
Your cache is direct mapped. This greatly simplifies your cache design, as every memory location will be hashed to a single memory location. While this increases conflict misses, it means no additional search through associative sets is necessary.
Building Your Cache
We suggest that you build the cache performance modeling and not cache functional modeling, because it does not add to your functionality, and may introduce bugs. Bottom line is your cache must simulate hit or miss effects with the appropriate parameters, and everything else is superfluous. (Sorry if there was confusion, as this was being debated earlier)
Start by creating an array of structs that represents the cache block. At most there will be 256 of them, but will likely be significantly fewer. The struct will keep track of the address tag, the validity of the block, and the block's data. You want to start off cache simulation by initializing all blocks to be invalid.
Here one part of the data structures you will need, as described in lecture:
struct cache_set {
bool valid;
unsigned char tag;
unsigned char data[4]; // unused, but presents max block size of data
};
You want to consider how lookup will find the cache block, and the data within the block. From your P&H book you know that the address used to lookup the cache block is divided into the bits used for tag, index, and block offset. Note all address bits are used to locate the byte of data in memory.
|
1. Block offset - Position of the byte in cache block as the cache block may span multiple bytes. Cache block lookup ignores offset these bytes. Only afterwards, once a block is found, does the cache use the offset to index into the block. |
|
2. Index - These address bits select the cache set. |
|
3. Tag - Any remaining bits goes to tag. After the set is found, the address tag is matched against the cache tag to make sure the data corresponding to the address is present in the cache. |
How do you compute the number of bit positions used? You need some parameters fixed by the dimensions of your cache.
|
1. Block offset |
Block_offset_size = log2(block size in bytes) |
|
2. Index |
Index_size = log2(set size) |
|
3. Tag |
Tag_size = addr_size - Block_offset_size - Index_size |
So given the above parameters and an address, you will need functions or macros to compute out the block offset, index, and tag.
You also now need to implement a check to see if you have a hit or miss in your cache. That check is very simply the tag and validity check described as first step in the FETCH diagram.
Each cache configuration will have a different organization of the address bits, so for each you will have to figure out size of the tag, index, and block offset. The following example diagram describes how the address bits are used:

The example's address space organization has a one bit block-offset at
address[0], three bit index at address[3:1], and four bit tag at address[7:4].
This indicates a direct mapped cache of eight sets of blocks, where each cache
block spans two bytes. The data capacity of this cache is 16 bytes.
There are three steps to instruction fetch:
|
1. Use the index to find the appropriate cache set. Do a tag lookup to verify if the correct block is found by matching the tags, and make sure the data is valid. |
|
2. If successful, use the index to find the block in the data array. |
|
3. Rotate and mask out the data byte from the block, using the block offset. |
During your simulated cache lookup you probably want to compute the tag, index, and block-offset values.
For performance modeling you only need to do the first step of the load.

If the lookup misses in the cache, you must replace the indexed cache line with corresponding block from instruction memory. Make sure to update the tag, and mark the valid bit to true. The initial miss takes ImemWaitState latency. If the block spans more than one byte, the cache continues to fetch data from memory using "burst mode" at some improved (lower) latency.
|
1. Compulsory |
Misses when
block is not in cache. |
|
2. Capacity |
Cache cannot contain all the blocks in program execution, has to evict, and that block gets executed again. |
|
3. Conflict |
Repeated access of two different addresses aliasing to same block, causing evictions and misses, even though another less used block is a better candidate for eviction. This is a common problem for direct-mapped caches. |
The speedup graph in lecture was done to normalize the data so it may appear in one slide. It is unnecessary to display speedup, as run-time is sufficient.
Example for burst mode:
Your PC is 0xA. You fetch 0xA instruction, but miss I-Cache (instruction cache) at address 0xA.
Assume your cache has 4 block, 2 bytes each (8B capacity).
I-Cache takes 10 cycles to get address 0xA. Then because the block is 2 bytes, it needs to also get the next address and takes 5 more cycles to get address 0xB using burst mode. After that it returns to processing the next instruction.
Lets assume you don't have a branch at 0xA, so the next instruction is 0xB. Instruction at 0xB will definitely hit in the cache, and is why you have blocks bigger than 1 byte, and is why you do burst mode. Say you did have a taken-branch at 0xA that goes to 0x3. Did the increased block size help you? No and the extra latency in burst mode is wasted overhead. Consequently there is a trade off in increasing the block size.
1)
> 1)
For the reset logic, our current design is that when reset is 1, all
> the
internal state registers and general-purpose registers are set to 0,
>
while our programs don't take this advantage and don't assume the register
> be
any initialized values at the start. However, in the spec, it says that
> we
should not reset the contents of the general-purpose registers to a
>
known value. So, do we need to change our logic to avoid clear the
>
general-purpose registers ' value?
Avoid a known value that allows a short cut in the operation of your program. Use your intuition as to what a short cut is. You can reset general purpose registers to 0's or 1's as long as its consistent. Same with internal registers, though if you need some other value for say a control register, you can do that, so long as you justify this in your report. PC is set to 0.
2)
> For
the instruction Memory, the spec said we need to create three
> PROM
files.
> Does
it mean we need to have three different top-level CPU schematics for
> the
three programs?
> Or
we only need one top-level CPU schematic but we will replace the PROM
> in
the schematic when showing different program during the demo?
> Or
we need to place all three PROM in the same schematic and use a 3:1 MUX
> to choose which program we demo?
It doesn't imply doing the first option, and I would not do that, since you will have three data-paths to change if you find a bug. You can mux the output of the three PROM using a program selector signal. You could do your second option also.
3) For
the cache in ISS
> (a)
the spec said when there is a hit, the time to fetch
>
instruction will be one clock. Does it mean the cycle for a
>
non-ld/st instruction will be 1 cycle? or the wait state will be
> 1, so that the cycle is 1 + 1 = 2 cycles?
Keep in mind, that Instructions are data too. Every time you fetch the instruction, it will access the instruction-cache. If it misses, you take wait-state latency to simulate the time a real machine would take to access main memory (1+10 cycles). If it hits, there is no instruction cache penalty (1 cycles) but you still take any additional penalty for executing load/store instruction.
>
(b) I am not sure the meaning of the
burst mode. Does it mean
> when
we have a miss in the cache, the cache need to read data
> from
the memory, and the first instruction during this process
> has
10 wait state, and the following instructions in the same
> cache line has 5 wait state ( while if the first instruction is a
You are basically right.
Blocks maybe 1, 2 or 4 bytes. When you miss in the cache, it predicts that you will access the next instruction also, which is usually true. This is why brining in additional data on cache miss is beneficial. So for block size 2 or 4, it brings in another 1 or 3 instructions. We assume this is a blocking cache, and the processor will not start fetching the next instruction until it is done handling the current cache miss.
Burst mode applies to blocks of greater than 1 byte. The 5 cycle wait state is still faster than 10 cycles that the cache otherwise would have taken if it did not have burst mode.
4)
> Hi, I still do not fully understand the way miss/hit
works. If I have
> an instruction 0xA as the example in the Hints page,
when there's a
> miss, I am suppose to store the data of 0xA into the
cache. If the block
> size is 2 or 4 bytes, I need to get the address of 0xB or/and 0xC 0xD
For 2 B block as you point out you need: 0xA 0xB
For 4 B block actually you need: 0x8 0x9 0xA 0xB
This is because the data from main memory must cache block
aligned i.e. block starts when: addr mod 4 = 0 for 4B
Blocks.
The reason for the alignment restriction is that this allows
us to use simple shift and mask because everything is power of 2.
For performance modeling, you do not have to worry about this too much since you just want to know when a block missed, and how long it takes to fill the block.
5)
> as well as their corresponding data and store into
the cache? In another words, if block size is 2 bytes,
> I get both data of 0xA and 0xB and store into cache?
And at this moment, I am suppose to count the 10 cycles
> for 0xA as well as 5 cycles for 0xB? And when we run
next instruction, if it is 0xB, then it will count
> as 1 cycle and directly get the data from the cache?
Correct!
6)
> Also, in the
case of if my offset is based on the least-significant
> bit, the very last bit, then if I fetch instruction
0xB with no
> instruction 0xA called previously, then the offset
bit will be 1 which
> I will need to store the data in the offset 1. Does
this mean we count
> 0xB as 10 cycles and then get the data of address 0xA
(which is
> supposedly in offset 0) and count it as 5 cycles?
Please clarify,
> thanks,
Subtle difference. For 0xB, you lookup the block at 0xA (assuming 2B block), and miss. You then take 10 cycles for 0xA and 5 for 0xB for the block fill. This is because the cache blocks are addressed starting at the block alignment, and when you have to fill data, you start from that address and continue sequentially. There is a separate structure that takes care of gets the block offset after the block is obtained from the cache..