Engineering Stuff to Read - Henry Petroski + To Engineer is Human + The Evolution of Everyday Things + Design Paradigms + The Pencil - Mario Salvadori + Why Buildings Stand Up + Why Buildings Fall Down (I am not kidding!) - Others + Microsoft Secrets (Cusumano) + XP Explained + Code Complete + Programming Pearls OPTIMIZATION (IMPROVEMENT, Chapters 9, 10) - Want to make programmer's elegant, inefficiently coded solution really fast + but can't change algorithm + remove instructions + find cheaper implementations of instructions - Better compiler optimize than programmer (except for changing alg) + Code remains understandable + Faster + Optimizations evolve with code + Most opts work best at lower levels - Options (must try all to get most benefit) + find idioms in programmer's code and remove inefficiencies - e.g., repeated expressions + make good use of registers by keeping frequently used variables there + reorder instructions to avoid memory access stalls + exploit odd features of architecture - The desire for high-performance of translated code adds complexities + optimizing in several ways + need global information about program: many passes over code - e.g., repeated expressions, uses + pointers, function calls make more difficult + compiler will run slower, less likely correct - Use IL, BB's, CFG as basis for language-independent improvements Data Flow Analysis - Motivation: Questions we'd like to answer for optimization + How soon is this value needed again? Should I keep it in a register or write it out to memory so that I can keep or more frequently used value in that register? + Is this computation redundant with any others? Would it be beneficial to keep this result in a register instead of recomputing later? a[i] + a[j] ... a[i] + a[k] ... - Could search the control flow graph to answer each question as it arises + Very expensive, as search may be quite long + However, since want to ask same basic question for many optimizations we can combine our questions and compute all at once, sharing the cost - Basic concept: fact generation and propagation + Fact is created (generated) by a statement (becomes ``live'') + Fact is propagated to subsequent statements until ``killed'' fact: expression available in var --------------------------------- t1 := b * c create fact: {b * c available in t1} t2 := y + t1 {(b * c, t1)} t3 := b * c {(b * c, t1), (y + t1, t2)} t4 := w + t3 {(b * c, t1), (y + t1, t2), (b * c, t3)} t5 := t2 + t4 ... x := t5 Apply common sub-expression elimination... t1 := b * c t2 := y + t1 t3 := t1 t4 := w + t3 t5 := t2 + t4 x := t5 Apply copy propagation, followed by useless exprs... t1 := b * c t2 := y + t1 t4 := w + t1 t5 := t2 + t4 x := t5 + Killed by statement that makes fact potentially false after that point t1 := b * c create fact: {b * c available in t1} c := y + t1 {(b * c, t1)} t3 := b * c {(y + t1, t2)} t4 := w + t3 {(y + t1, t2), (b * c, t3)} t5 := t2 + t4 ... x := t5 + Facts also killed by uncertainties of control flow t1 := b * c create fact: {b * c available in t1} ifnot t0 goto L1 {(b * c, t1)} c := 0 {(b * c, t1)} goto L2 {} L1: d := 0 {(b * c, t1)} L2: t2 := y + t1 {} t3 := b * c {(y + t1, t2)} t4 := w + t3 {(y + t1, t2), (b * c, t3)} t5 := t2 + t4 ... x := t5 + Because of loops, we need to propagate facts several times until no new facts are generated - back edge has no facts attached to it first time - quadratic worst-case, but normally 5 times through does it | /->V--+ | BB | \ | | --+ | V + Data flow analysis can be made very fast - Identify questions statically (not made up on fly) - Come up with efficient representation (bit sets are common) + true-false assertion about a statement, statement is rep'd as num that can be used as an index into a bit-array + small, fast set operations + how would we do this conversion for available exprs? - Simple delete-insert sequence for each statement in BB Out[S] = (In[S] - Kill[S]) U Gen[S] - intersect (union) sets coming into BB from outside + ``positive'' facts are intersected: only true if true on both + ``negative'' facts are unioned: possibly true if true on one In[B] = /\ Out[pred(B)] + Can make even faster using BB/CFG distinction - Precompute aggregate effect of entire block, then treat as a statement + Gen[B] = Out[S_last], starting with In[S_first] null + Kill[B] = invert Out[S_last], starting with In[S_first] unity - instead of visiting each statement, you visit each BB: 5 times savings Out[B] = (In[B] - Kill[B]) U Gen[B] + Complications: - pointers (don't know what location is being accessed or updated) - function calls (CFG best for within function) - function pointers (don't even know what function is being called) - normal solution is to pretend worst happened, killing all opt. - super-aggressive approaches + compute aliases created by pointers + inline functions + propagate data flow sets through functions Architecture-independent optimizations - These are optimimations that would be performed on IL code before code generation (as inadvertently shown above :) - Common subexpression elimination (CSE): removing redundant computations + Caused by array indices, redundant code - Useless exprs: removing computations whose results are not used + Often created by other optimizations, like CSE - Dead code elimination: code that is not executed (if false then ...) - Copy propagation: when x := t1, replace x refs with t1 refs - Algebraic simplification: x + 1 - 1 == x + rules must hold for finite-precision; beware round-off, etc. - Reduction in strength: x * 2 == x << 1 - Induction variable elimination w/strength reduction: VAR a : array N of INTEGER; ... FOR i := 1 TO N DO a[i * 6] := a[i] + 5; END; becomes: t1 := 6 FOR i := 1 TO N DO a[t1] := a[i] + 5; t1 := t1 + 6; END; can do even more at assembly level (i is really i * 4), since can include all byte arithmetic/addressing (int is 4 bytes): t1 := 24 FOR fouri := 4 TO 4*N by 4 DO ... t1 := t1 + 24; END; - Invariant expressions and code motion FOR i := 1 TO N DO a[i * 6] := a[i] + y * 5; END; becomes t1 := y * 5; FOR i := 1 TO N DO a[i * 6] := a[i] + t1; END; - Assertion propagation: similar to common subexpressions FOR i := 1 TO N DO a[i] := a[i] + y*5 END; Each of these array refs has a bounds check on it. If one succeeds, the other must, too. By propagating the fact that the first has succeeded (you don't reach the second one if the first one fails), you can avoid the second test. This is not completely unlike CSE (i.e., the check is redundant). FOR i := 1 TO N DO if !boundscheck(a,i) then exit("bounds error a[i]"); t1 := a[i] t2 := y*5 t3 := t1 + t2 if !boundscheck(a,i) then exit("bounds error a[i]"); // redundant, a[i] := t3 // i unchanged END; - These optimizations can be done: + Locally (within BB) + Globally (CFG for a function) + Functions can be inlined to get better results, but bloat, hurt cache + Peep-hole optimization: sliding window over IL or assembler + On-the-fly (doesn't really work) - 80/20 rules of optimization + Over 50% of performance improvement can be achieved in local opts; these are also easier to implement (50/10 rule? ;) + 80% of instructions executed are in 20% of the code (due to loops); focusing optimizations in inner loops means optimization is faster yet 80% effective. - Strategies + An optimization should be a small, atomic unit, so can be combined in any order, easily debugged, reused. + Optimizations should be used frequently, or they are likely to be buggy. Avoid really unlikely cases (80/20 again). - Complications (as noted before) + Function calls + Pointers + Debugging (optimization obscures relation between assembly and source) Instruction Scheduling - RISC machines separate memory operations from rest of instructions - Allows most non-memory instructions to execute in 5 cycles cycles - Memory operations take 5 cycles; multiplies and divides, longer - However, if adjacent operations are unrelated, they can be overlapped + fetch, decode, execute, store (e.g.) + one-cycle cost + also cannot do two fetches, etc., at a time (hazards) - To get best performance, then, need to reorder instructions for performance + In isolation, get stalls/bubbles ld x, r1 f d e e s ld y, r2 f d e e s add r1, r2, r3 f d e s st r3, z f d e e s + Any ideas? Consider if these instructions are in front ld a, r4 ld b, r5 sub r4, r5, r6 st r6, c + Then can interleave instructions; no stalls/bubbles! ld a, r4 f d e e s ld b, r5 f d e e s ld x, r1 f d e e s ld y, r2 f d e e s sub r4, r5, r6 f d e s # structural hazard on s? st r6, c f d e e s add r1, r2, r3 f d e s # structural hazard on s? st r3, z f d e e s + Note that this all assumes cache hits + Can expect to see 50% performance improvement over naive - Any ideas how to implement instruction scheduling in a compiler? + Simple idea is just ``move loads back'' without creating errors - A more sophisticated approach + Have instructions in memory + Compute data dependence graph, hazard dependences + Reorder instructions to satisfy partial order of graph, but move dependent instructions apart + Heuristic algorithm - examine all members of graph with no (unscheduled) predecessors - schedule one that + can cause stalls with successors + has most successors (will allow scheduling most succs. in future) + longest path to end (on critical path to end) - delete from graph and add its 0-parent successors to ready list Global Register Allocation - As noted above, memory operations are expensive, especially if have option of not doing memory op at all - Would prefer to never have to load or store data - Registers in short supply, though, so must share with all variables in program. Who gets a register and who doesn't? Ideas? + Could just take optimistic/greedy approach - let variable loaded into register stay in register until register is needed for something else x := y + y --> ld [fp-4], r1 --> ld [fp-4], r1 ld [fp-4], r2 add r1, r1, r3 add r1, r2, r3 st r3, [fp-8] st r3, [fp-8] - Can achieve this by having a VarSTO ``remember'' what register it is in r = smartGetReg(opndSTO); // getReg records that opnd has r and // vice versa - When getReg(STO) can't find a free register, it forces some STO to give up it's register + which one? + one that is least recently used (LRU) + this is a crude estimate of what variable is least likely to be used in the future // builds on simple getreg and freereg Register smartGetReg(VarSTO var) { // getReg returns 0 if no registers to allocate if (!(reg = varFile.varsReg(var))) { if (!(reg = getReg()) { ovar = varfile.oldestVar(); // rem LRU var/reg from file oreg = varFile.varsReg(ovar); varFile.remVar(ovar); freeReg(oreg); reg = getReg(); // guaranteed to be the same } varfile.put(var,reg); emitRegLoad(var, reg); // DOES PREEMPTIVE LOAD!! } varfile.markUsed(var); // updates LRU ordering return reg; } + we now also need to take care of freeing up registers when a scope is exited and all the local variables go away + can get even fancier and do ``lazy stores'' (at the point of doing the free) - Another, more aggressive approach is global register allocation + Look at (likely) access frequency: frequent accesses -> greater savings - density of accesses (static) - loops (dynamic) + Allocating registers is like scheduling room (except that we schedule all classes; no one has to sit on the lawn ;) - Can't have two classes in room at same time - Many more classes than rooms (but can modify class times) - Even if classes overlap a little, will never reside in register (well) + Suppose we have the following program, and 3 GP registers - we do this on 3-address code as part of the process of mapping it to machine code - might end up with hybrid code: half IL, half machine x y w z [1] z := 1 [2] x := 2 * z | [3] y := 3 * z | | | [4] w := x + y | | | [5] z := y + z | | | [6] x := y * w | | | try: r1 r2 r3 ? or: r1 r2 r1 r3 - How do we avoid bad choices or trying lots of alternatives + Current algorithm of choice is graph coloring (heurstical) - Compute lifetimes of each variable (as shown) - Make each variable a node in a graph - Connect two nodes with edge if their lifetimes overlap + edge means ``cannot be in same register'' x w |\/| |/\| y--z - Algorithm: 1. Select a node 2. Look at all its neighbors 3. Use any register they are not using; record register in node 4. Remove node from candidate set 5. Repeat - Can still get in situation above; need to prioritize choices. Ideas? + Nodes with #edges < #regs are trivial to color, so ignore - note that x and w have only 2 edges, so below threshold, so schedule y and z are candidates + Schedule vars that have highest reference density first - y has slightly higher density so schedule that first - With register assignment given above, this is what the final code would look like: [1] r3 := 1 [2] r1 := 2 * r3 [3] r2 := 3 * r3 [4] r1 := r1 + r2 ; x and w ``share'' r2 [5] r3 := r2 + r3 [6] r1 := r2 * r1 + Everything is done in registers here because we started with computing everything from the constant value 1. If instead we started with an ``unknown'' variable (e.g., a parameter passed in by a call), then we'd start by reading from memory. - Note that if the 3-addr code is too different from machine code, this doesn't work, may need more or fewer registers. If that's the case, then must do this process on assembler