Multiple Execution Units, VLIW vs. Superscalar, Caches

March 3, 2004

from last week...

suppose i have a standard 5-stage pipeline where branches and jumps are resolved in the execute stage. 20% of my instructions are branches, and they're taken 60% of the time. 5% of my instructions are return instructions [return instructions are not considered branches].

now suppose i add a pattern history table that predicts correctly 80% of the time, a btb that hits 70% of the time, and a return address stack that predicts correctly 90% of the time. assume i have a "unified" pht and btb - the pht is accessed only on btb hits.

what's the average cpi of my system?
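
here's a minimal sketch of one way to work the numbers. the penalty values and the btb-miss policy are assumptions i'm adding, not part of the problem statement: i'm assuming a wrong fetch costs 2 bubble cycles [branches resolve in execute, so the instructions in IF and ID get flushed], and that on a btb miss we simply fall through [predict not-taken].

# a minimal sketch, under the assumptions above. if you assume a different
# penalty or a different btb-miss policy, you'll get a different answer.

penalty = 2           # assumed: flush IF and ID when the fetch was wrong

f_branch = 0.20       # fraction of instructions that are branches
p_taken  = 0.60       # fraction of branches that are taken
f_return = 0.05       # fraction of instructions that are returns

btb_hit  = 0.70       # btb hit rate
pht_ok   = 0.80       # pht accuracy [only consulted on a btb hit]
ras_ok   = 0.90       # ras accuracy

# branches: on a btb hit the pht predicts, so we pay when it's wrong;
# on a btb miss we assume fall-through, so we pay when the branch is taken.
branch_stalls = f_branch * (btb_hit * (1 - pht_ok) + (1 - btb_hit) * p_taken) * penalty

# returns: we pay whenever the ras is wrong.
return_stalls = f_return * (1 - ras_ok) * penalty

cpi = 1.0 + branch_stalls + return_stalls
print(round(cpi, 3))  # 1 + 0.2*(0.7*0.2 + 0.3*0.6)*2 + 0.05*0.1*2 = 1.138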

multiple execution units

when we introduce multiple execution units into our pipeline, we end up with a strange-looking pipeline that includes several "sub-pipelines" for execution. the examples in lecture have one execution stage for integer operations, seven execution stages for fp multiplies, and four execution stages for fp additions.

when we have a pipeline with multiple execution units, a number of problems appear that we didn't have to deal with before: structural hazards and write-after-write dependencies.

we can have structural hazards if two instructions finish execution at the same time. this means that both instructions will try to write to the register file at the same time, and that's a problem because the register file has only one write port - we can only write one register per clock cycle.

we have write-after-write dependencies when we have a slow instruction [such as a fp multiply] and a fast instruction [such as a load] that both write to the same register. if the slow instruction comes first in program order, the fast instruction can reach writeback before the slow one does, so we need to make sure the writes still happen in the correct order.
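
here's a small sketch of why both hazards show up, using the latencies quoted above [1 cycle for integer operations, 7 for fp multiplies, 4 for fp additions]. the model is a simplification i'm assuming for illustration: an instruction fetched in cycle t spends one cycle in decode, then its execution stages, then writes the register file, with no separate mem stage.

# a small sketch under the simplified model described above.

exec_latency = {"int": 1, "fp_add": 4, "fp_mul": 7}

def writeback_cycle(fetch_cycle, kind):
    # IF, ID, then the execution stages, then WB
    return fetch_cycle + 1 + exec_latency[kind] + 1

# structural hazard: two instructions reach writeback in the same cycle,
# and both want the single register file write port.
print(writeback_cycle(1, "fp_add"), writeback_cycle(4, "int"))   # 7 7

# write-after-write: a slow fp multiply fetched first reaches writeback
# *after* a fast instruction fetched later. if both target the same
# register, the hardware must keep the writes in program order.
print(writeback_cycle(1, "fp_mul"), writeback_cycle(2, "int"))   # 10 5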

draw a pipeline timing diagram that shows how the following code executes on our multiple-execution-unit cpu. how many cycles does it take to execute this code? when can we forward data? when do we have to stall? [add.d is fp addition]. for this problem, pretend that the floating point instructions use the same registers as the integer instructions.

add.d $1, $2, $3
and   $2, $1, $3
lw    $1, 4($5)
or    $7, $5, $6

why don't we have to worry about write-after-read dependencies?

vliw vs. superscalar

one of the great debates in computer architecture is static vs. dynamic. "static" typically means "let's make our compiler take care of this", while "dynamic" typically means "let's build some hardware that takes care of this".

each side has its advantages and disadvantages. the compiler approach has the benefit of time: a compiler can spend all day analyzing the heck out of a piece of code. however, the conclusions that a compiler can reach are limited, because it doesn't know what the values of all the variables will be when the program is actually run.

as you can imagine, if we go for the hardware approach, we get the other end of the stick. there is a limit on the amount of analysis we can do in hardware, because our resources are much more limited. on the other hand, we can analyze the program when it actually runs, so we have complete knowledge of all the program's variables.

vliw approaches typically fall under the "static" category, where the compiler does all the work. superscalar approaches typically fall under the "dynamic" category, where special hardware on the processor does all the work. consider the following code sequence:

sw $7, 4($2)
lw $1, 8($5)

suppose we have a dual pipeline where we can run two memory operations in parallel [but only if they have no dependencies, of course]. are there dependencies between these two instructions? well, it depends on the values of $5 and $2. if $5 is 0 and $2 is 4, then both instructions refer to memory address 8, so they depend on each other: we must run the store before the load.

in a vliw approach, our compiler decides which instructions are safe to run in parallel. there's no way our compiler can tell for sure if there is a dependence here. so we must stay on the safe side, and dictate that the store must always run before the load. if this were a bigger piece of code, we could analyze the code and try to build a proof that shows there is no dependence. [modern parallelizing compilers actually do this!]

if we decide on a superscalar approach, we have a piece of hardware on our processor that decides whether we can run instructions in parallel. the problem is easier here, because the dependence check happens as the code is run, so we know what the values of $2 and $5 actually are. this means that we will always know whether it is safe to run these two instructions in parallel.
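
here's a tiny sketch of the check the dynamic hardware gets to make and the static compiler does not: with the register values in hand, just compare the two effective addresses. the function name is made up, and i'm ignoring partial overlaps for simplicity.

# a tiny sketch of a run-time dependence check for the sw/lw pair above.

def mem_ops_conflict(val_r2, val_r5):
    store_addr = val_r2 + 4          # sw $7, 4($2)
    load_addr  = val_r5 + 8          # lw $1, 8($5)
    return store_addr == load_addr   # ignoring partial word overlap

print(mem_ops_conflict(val_r2=4, val_r5=0))    # True  -> run the store first
print(mem_ops_conflict(val_r2=4, val_r5=100))  # False -> safe to run in parallel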

hopefully you see some of the tradeoffs involved. dynamic approaches have more program information available to them, but the amount of resources available for analysis is very limited. for example, if we want our superscalar processor to search the code for independent instructions, things start to get really hairy. static approaches have less program information available to them, but they can spend lots of resources on analysis. for example, it's relatively easy for a compiler to search the code for independent instructions.

caches

caching is natural. think about this: where do you keep your textbooks? i can think of a number of places: on my desk, in my backpack, on my bookshelf, or at my mom's house.

there are several interesting trends here. first, suppose i want a textbook. it's pretty obvious that the "access time" increases as we move from desk to mom's house: i can get to a book on my desk pretty fast, but it'll take me a while if i want a book at mom's house. another interesting trend is capacity. my desk can't hold very many books; but my mom's house can hold a whole lot of books. another interesting trend is importance: if i really need a book, it'll be on my desk. if i don't care about a book very much, it gets dumped at mom's house. :)

computers work the same way. there's the hard drive, which holds gigabytes of information, but transfer times are measured in milliseconds. there's ram, which holds megabytes of information, but transfer times are measured in nanoseconds. then there's the processor's cache, which holds kilobytes of information, but transfer times are measured in clock cycles. finally there are registers, which hold 128 bytes [32 registers, 4 bytes each] of information, but transfer times are almost instantaneous.

at the "high levels" [my desk/registers], capacity is small, but access time is fast. at the "low levels" [mom's house/hard drive], capacity is big, but access time is slow. the trick is, of course, to keep frequently used items at high levels.

how do we decide what to keep at the high levels? this is where "locality" comes in. you can draw all kinds of analogies with textbooks, but i'll let you use your imagination. there are two types of locality that concern us: temporal [if we just used a piece of data, we'll probably use it again soon] and spatial [if we just used a piece of data, we'll probably use its neighbors soon].
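
here's a quick sketch of what the two kinds of locality look like in code, using a plain python list as a stand-in for memory. the numbers here are made up.

data = list(range(1024))   # stand-in for an array in memory

total = 0
for i in range(len(data)):
    total += data[i]       # spatial locality: we walk through consecutive
                           # addresses, so each access's neighbors are likely
                           # already in the same cache block

for trial in range(100):
    total += data[0]       # temporal locality: the same address over and over,
                           # so after the first miss it stays in the cache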

your typical cache has three parameters: number of sets, associativity, and block size. these parameters are related by the following formula:

cache_size = num_sets * associativity * block_size
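
here's a quick numeric check of the formula, with made-up parameters:

num_sets      = 128
associativity = 4
block_size    = 32    # bytes

cache_size = num_sets * associativity * block_size
print(cache_size)     # 16384 bytes = 16 kilobytes

# the formula also works in reverse: a 16 kilobyte, 4-way cache with
# 32-byte blocks must have 16384 / (4 * 32) = 128 sets.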

your typical cache can be visualized like this:

tag data tag data tag data tag data
... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ...

the number of sets determines the number of rows in the cache: if we have 4 sets, then there are 4 rows. the associativity determines the number of columns: we add one tag column and one data column for each way of associativity. if our associativity is 4, then there are 8 columns: 4 tag columns and 4 data columns.

the block size determines how much stuff we can fit in each "data" cell. if our block size is 4 bytes, then there are four bytes of information in each cell in each "data" column.

how do we access caches? the lowest log2(block_size) bits determine our block offset - that is, which byte in the block we're referring to. the next lowest log2(num_sets) bits determine which row of the cache we need to search. the remaining bits are the tag - we look at all the tags in the selected row of the cache; if we find a matching tag, we've found the data we're looking for.
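
here's a minimal sketch of that split in python. it assumes the number of sets and the block size are powers of two; the function name and the example numbers are made up.

from math import log2

def split_address(addr, num_sets, block_size):
    offset_bits = int(log2(block_size))
    index_bits  = int(log2(num_sets))
    offset = addr & (block_size - 1)                  # which byte in the block
    index  = (addr >> offset_bits) & (num_sets - 1)   # which row to search
    tag    = addr >> (offset_bits + index_bits)       # compared against stored tags
    return tag, index, offset

# e.g. an 8-bit address with 4 sets and 4-byte blocks:
print(split_address(0b00110110, num_sets=4, block_size=4))   # (3, 1, 2)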

note that we have to check all the tags in the selected row of the cache. this is what makes high-associativity caches more powerful: there's more than one place to put the data in each row of the cache. if we have two pieces of important data that happen to map to the same row of the cache, we can store both of them. the downside of high-associativity caches is complexity - the more tags we have to check, the more complicated the tag-checking logic becomes. high-associativity caches also burn a lot of power on tag checks.

suppose i have the following cache: 4 sets, 2-way set associative [associativity=2], with 1-byte blocks. how much data can i store in my cache?

suppose addresses in my machine are 8 bits long, and i'm using the cache just described. how many bits are used for the block offset? the index? how many bits for the tags?
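
here's a short sketch you can use to check your arithmetic for this cache [4 sets, 2-way, 1-byte blocks, 8-bit addresses, as stated above]:

from math import log2

num_sets, associativity, block_size, addr_bits = 4, 2, 1, 8

cache_size  = num_sets * associativity * block_size   # bytes of data
offset_bits = int(log2(block_size))
index_bits  = int(log2(num_sets))
tag_bits    = addr_bits - offset_bits - index_bits

print(cache_size, offset_bits, index_bits, tag_bits)   # 8 0 2 6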

suppose my cache contains the following values [all values in decimal]:

tag   data   tag   data
 10     25    32     48
 59     62    60    180
 18      7     6      5
  4      3     2      1

suppose i access the following addresses [in binary]. does the data i'm looking for exist in my cache? if it does, what data do i get? [there's a little python sketch after the list if you want to check yourself.]

00010011
00111100
00011010
00011101
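
here's one way to check yourself. note the assumption: i'm reading the rows of the table above as sets 0 through 3 from top to bottom, which the picture doesn't say explicitly.

cache = [                      # (tag, data) pairs, one list per set
    [(10, 25), (32, 48)],      # set 0
    [(59, 62), (60, 180)],     # set 1
    [(18, 7),  (6, 5)],        # set 2
    [(4, 3),   (2, 1)],        # set 3
]

def lookup(addr, num_sets=4):
    index = addr % num_sets    # 1-byte blocks, so there are no offset bits
    tag   = addr // num_sets
    for stored_tag, data in cache[index]:
        if stored_tag == tag:
            return data        # hit
    return None                # miss

for addr in (0b00010011, 0b00111100, 0b00011010, 0b00011101):
    print(format(addr, "08b"), lookup(addr))
# hit with data 3, miss, hit with data 5, miss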