Performance, Instruction Sets, Arithmetic

January 14 2004

amdahl's law

don't let amdahl's law scare you - you already know it. consider this math problem:

we're baking a cake. it takes one person a total of 80 minutes: there are 50 minutes of baking time; the remaining minutes are preparation time. if you have a team of assistants to help you, you can finish preparations in one sixth of the time. how long will it take you to bake the cake with a team of assistants?

this problem should be easy [if not, we need to talk]. but if you take this problem and switch some phrases around, you end up with the same problem that appeared on quiz 1. you know how to do this! don't let the big words scare you. :)
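
here's the cake problem as a small python sketch, just to show that amdahl's law is the same arithmetic. the function name new_time and its parameter names are made up for this example; fractions are used so the numbers stay exact.

```python
from fractions import Fraction

def new_time(old_time, affected_time, speedup_of_affected):
    """time after speeding up only the affected part; the rest is unchanged."""
    unaffected_time = old_time - affected_time
    return unaffected_time + Fraction(affected_time) / speedup_of_affected

# cake numbers from the problem above: 80 minutes total, 50 of them baking,
# so 30 minutes of preparation that the team finishes in one sixth of the time
old = 80
new = new_time(old, affected_time=80 - 50, speedup_of_affected=6)
overall_speedup = Fraction(old) / new

print(new)              # 55 (minutes with assistants)
print(overall_speedup)  # 16/11, i.e. old time / new time
```

notice that a 6x speedup on the preparation gives less than 1.5x overall, because the baking time doesn't change.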

let's try a more difficult problem:

suppose 50% of all instructions executed are memory instructions, and you have some crazy ideas that will result in speedups for memory instructions. but your advisor tells you that working on these ideas won't be worthwhile unless you can get an overall speedup of at least 1.5. how much speedup must you get from memory instructions to make your ideas worthwhile?

don't forget that speedup is old_execution_time / new_execution_time.
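
if you want to check your answer, here is a sketch that turns the question around: given the fraction of the old execution time that gets faster and the overall speedup you want, it solves for the speedup needed on that fraction. the name required_part_speedup is hypothetical, and the sketch assumes the fraction is a fraction of execution time [for the problem above, that means assuming every instruction takes the same time]. the demo uses the cake numbers, not the problem above.

```python
from fractions import Fraction

def required_part_speedup(fraction, overall_speedup):
    """speedup needed on `fraction` of the old time to reach `overall_speedup`.

    returns None if the target is out of reach even when the affected
    part takes no time at all.
    """
    fraction = Fraction(fraction)
    # new_time / old_time = (1 - fraction) + fraction / part_speedup
    remaining = 1 / Fraction(overall_speedup) - (1 - fraction)
    if remaining <= 0:
        return None
    return fraction / remaining

# sanity check with the cake: preparation is 30/80 of the old time and the
# cake goes from 80 minutes to 55, so we should get the "one sixth" factor back
print(required_part_speedup(Fraction(30, 80), Fraction(80, 55)))  # 6
```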

here's another problem:

you've purchased a math coprocessor for your computer ["i remember when..."]. the box says that installing it will improve the performance of floating point instructions by 10x, and that it will improve overall performance by 5x for "most programs". what percentage of "most programs" must be floating point instructions if we want 5x overall speedup?
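
this kind of question solves the same equation for a different unknown. a sketch, with the hypothetical name required_fraction, again assuming the fraction is a fraction of the old execution time and again checked against the cake rather than the coprocessor:

```python
from fractions import Fraction

def required_fraction(part_speedup, overall_speedup):
    """fraction of the old time that must be sped up by `part_speedup`
    to reach `overall_speedup`."""
    part_speedup = Fraction(part_speedup)
    overall_speedup = Fraction(overall_speedup)
    # 1 / overall = (1 - f) + f / part   =>   f = (1 - 1/overall) / (1 - 1/part)
    return (1 - 1 / overall_speedup) / (1 - 1 / part_speedup)

# cake check: a 6x faster preparation and an 80/55 overall speedup
# should give back the 30/80 of the time that was preparation
print(required_fraction(6, Fraction(80, 55)))  # 3/8
```

a result greater than 1 means the overall speedup you asked for is impossible with that part speedup.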

execution time

it's a good idea to review your si prefixes [giga=10^9, mega=10^6, micro=10^-6, nano=10^-9, etc].

it's also a good idea to remember what the following units are:

name                     measured in...
execution time           seconds
CPI                      cycles / instruction
IPC                      instructions / cycle
clock rate = frequency   cycles / second = hertz
cycle time               seconds / cycle

if you remember these things, solving problems involving execution time becomes very much like those annoying conversion problems you had to do back in high school. for example:

how many seconds in an hour? well, there are 60 minutes in an hour, and 60 seconds in a minute. writing this out, we see that:

 1 hour   60 minutes   60 seconds
        * ---------- * ---------- = 3600 seconds
            1 hour      1 minute

notice how the units cancel out nicely: the word 'hour' appears once in the numerator and once in the denominator; the same is true of the word 'minute'. if we cancel out these units, we are left with just the word 'seconds' in the numerator. this means we're doing it right. :)

back to architecture. how long will it take to execute one billion instructions on a 100 MHz processor with an average of 2 CPI?

 1e9 instructions     2 cycles       1 second
                  * ------------- * ---------- = 20 seconds
                    1 instruction   1e8 cycles

note how the units cancel out nicely.
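
the same unit-cancelling calculation as a python sketch. the function name execution_time and its parameter names are made up for this example:

```python
def execution_time(instructions, cpi, clock_rate_hz):
    """seconds = instructions * (cycles / instruction) * (seconds / cycle)."""
    cycles = instructions * cpi       # instructions cancel, cycles remain
    return cycles / clock_rate_hz     # cycles cancel, seconds remain

# one billion instructions, 2 CPI, 100 MHz
print(execution_time(10**9, 2, 100 * 10**6))  # 20.0
```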

try this one:

how many instructions per second can we execute with our 2 CPI, 100 MHz processor?

if we increase clock rate to 200 MHz, but we also increase CPI to 3, what is the overall speedup?


stack: instructions pop operands off the stack, operate on them, and push the results back on the stack. the jvm [java virtual machine] is a stack machine. it's very easy to generate code for a stack machine - but it's tricky to design the hardware.

accumulator: a machine with one register. operations operate on the value in the accumulator register, and write their results to the accumulator register. this results in a very simple machine - but it will take lots of instructions to get anything done.

register-memory: operations read operands from registers or memory, and write the results to registers or memory. with a machine like this, a lot can be done in a few instructions. but implementing these complex instructions in hardware can be difficult [which tends to increase CPI].

load/store: operations read and write only registers. explicit load and store instructions are needed to read from and write to memory. it takes more instructions to get things done, but implementing simple instructions in hardware is easier [which tends to decrease CPI].
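
to make the stack style concrete, here is a toy stack machine in python. the instruction names [push, add, mul] and the tuple encoding are invented for this sketch; they are not jvm bytecodes.

```python
def run_stack(program):
    """toy stack machine: operands are popped off the stack, results pushed back."""
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(args[0])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown instruction: {op}")
    return stack.pop()

# (2 + 3) * 4: no instruction names a register or an address,
# which is why generating code for a stack machine is easy
stack_program = [("push", 2), ("push", 3), ("add",), ("push", 4), ("mul",)]
print(run_stack(stack_program))  # 20
```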



we increase pc by 4 when we move to the next instruction. the target of a branch is pc+offset*4. what's the significance of these 4's?

jumps are direct, but the jump target field is only 26 bits. addresses are 32 bits - where do the remaining 6 bits come from?

what's wrong with the following instruction, and how do we fix it?

addi $1, $0, 1048576

what is sign extension, and when do we need it?
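
as a hint for the last question, here is a sketch of what sign extension does to a 16-bit value. the helper names sign_extend and to_32_bit_pattern are made up for this example; python integers have no fixed width, so the 32-bit pattern is produced with a mask.

```python
def sign_extend(value, bits):
    """interpret the low `bits` bits of `value` as a two's complement number."""
    value &= (1 << bits) - 1              # keep only the low `bits` bits
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value

def to_32_bit_pattern(value):
    """the 32-bit pattern a register would hold for `value`."""
    return value & 0xFFFFFFFF

print(sign_extend(0x7FFF, 16))                          # 32767
print(sign_extend(0xFFFC, 16))                          # -4
print(hex(to_32_bit_pattern(sign_extend(0xFFFC, 16))))  # 0xfffffffc
```

the top bit of the 16-bit value is copied into all of the new upper bits, so the value stays the same when it gets wider.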

ieee floating point representation

floating point numbers are represented as +/- 1.m * 2^(e-127), where m is the mantissa, and e is the exponent. the first bit is the sign bit. 8 bits are used for the exponent, and 23 bits for the mantissa.

floating point numbers are always normalized to 1.something * ..., so the leading one is assumed and is not represented.

to decode a floating point number:

  1. write out all 32 bits in binary. given 0x40c80000, we get 0100 0000 1100 1000 0000 ...
  2. locate the sign bit, the exponent, and the mantissa. our sign bit is 0, our exponent is 10000001, and our mantissa is 10010000000...
  3. write out the mantissa in binary. don't forget about the leading 1 that is not represented. our mantissa is 1001, so we write down 1.1001000000
  4. convert the exponent to decimal. the exponent is excess-127, so you need to subtract 127 from it. our exponent is 129, subtracting 127 we get 2.
  5. the number computed in the last step tells you how far you need to shift the binary point in the mantissa, and in which direction. positive means right, negative means left. we shift the binary point in the mantissa right two places, and we get 110.01
  6. after you've shifted the binary point, convert the mantissa to decimal. 110.01 binary is 6.25 decimal
  7. write down the negative sign if the sign bit was 1. our sign bit was zero, so we're done
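
the decoding steps above can be sketched in python. the name decode_float32 is made up for this example, and the sketch only handles normalized numbers [it ignores zero, denormals, infinity and NaN]. the standard struct module is used as a cross-check.

```python
import struct

def decode_float32(bits):
    """decode a normalized 32-bit ieee pattern by hand."""
    sign = bits >> 31                  # first bit
    exponent = (bits >> 23) & 0xFF     # next 8 bits, excess-127
    mantissa = bits & 0x7FFFFF         # last 23 bits
    significand = 1 + mantissa / 2**23           # put back the leading 1
    value = significand * 2.0 ** (exponent - 127)  # shift the binary point
    return -value if sign else value

bits = 0x40C80000
print(decode_float32(bits))                              # 6.25
print(struct.unpack(">f", struct.pack(">I", bits))[0])   # 6.25
```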

to encode a floating point number:

  1. write out the number in binary. 6.25 becomes 110.01
  2. count how many times you need to shift the binary point to get 1.something. if you need to shift right, negate your count. so given 110.01, we need to shift two places to the left to get 1.1001
  3. take the shifted binary number, and drop the leading 1. this is your mantissa. we had 1.1001, so we get 1001.
  4. add 127 to your count. we shifted 2, so we get 129
  5. convert this number to binary. 129 is 10000001
  6. write down sign bit, exponent, and mantissa. encode in hex if desired. we have 0 10000001 1001000000... which becomes 0x40c80000
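
and the encoding steps as a python sketch. the name encode_float32 is made up for this example. it assumes a nonzero number that fits the normalized range, and it truncates extra mantissa bits instead of rounding them, so it can differ from real hardware in the last bit. math.frexp does the "count the shifts" step, and struct is used as a cross-check.

```python
import math
import struct

def encode_float32(x):
    """encode a nonzero, normalized-range number as a 32-bit ieee pattern."""
    sign = 1 if x < 0 else 0
    # math.frexp returns (m, e) with abs(x) == m * 2**e and 0.5 <= m < 1
    m, e = math.frexp(abs(x))
    significand, shift = m * 2, e - 1         # rescale to 1.something
    exponent = shift + 127                    # excess-127
    mantissa = int((significand - 1) * 2**23) # drop the leading 1, keep 23 bits
    return (sign << 31) | (exponent << 23) | mantissa

print(hex(encode_float32(6.25)))                            # 0x40c80000
print(hex(struct.unpack(">I", struct.pack(">f", 6.25))[0])) # 0x40c80000
```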


decode the following numbers from ieee fp representation: 0x40550000 0xc328a000

encode the following decimal numbers in ieee fp representation: 9.5 -0.1875