These notes are structured according to the topics listed in the comps syllabus. Approximate degree of completion: Comparative prog lang: 0 Language constructs: 95% Lexical and Syntactic analysis: 90% Semantics and functional prog: 0 Code Generation & runtime I: 95% Code Generation & runtime II: 90% My analysis of past questions indicates that they are drawn from three overlapping areas, in order of decreasing frequency: 1) compiler theory and practice 2) survey of real programming languages 3) PL theory I have (little or) nothing to say about the second category; if you have programmed in FORTRAN, PASCAL, C, C++, Ada, and maybe a few others, great. If (like me) you've never used some of these, it's probably too late to do much about it. Fortunately, questions having to do with these languages usually overlap into categories 1 and 3, so a solid grounding in theory ought to enable satisfactory answers "in principle" to any questions involving details of fact. My sources for PL theory are Tennent, Reade and Elkan (notes from last winter). I recommend re-reading Elkan's notes, which are saved in this directory as "elkan.lectures". For compilers I used the Dragon book and several papers covered in Griswold's compiler course last spring. COMPARATIVE PROGRAMMING LANGUAGES ================================== good luck. maybe we can come up with a table for this. LANGUAGE CONSTRUCTS ========================================== This section might be called the theory of practical programming languages: it encompasses the conceptual underpinnings of how everyday PLs are structured. The following notes are mostly from Tennent (T), with some of Elkan's lecture notes mixed in. All syllabus topics are mentioned somewhere. T Chapter 2: Syntax---------------------------- Some Definitions: Literal = an elementary expression whose value is defined by the language and cannot be changed by the programmer. Identifier = a symbolic name that can be programmer defined Command = an expression that requests an "irreversible" change in the current computational state Binding = a localized association of an identifier to a value (or, more accurately, a location) Scope = the region of program text in which a binding is effective Environment = a record of identifier bindings Store = a record of the effects of assignments (Note that a command yields a new store, while a definition yields a new environment.) Procedure Invocation = syntactic construct consisting usually of procedure name and some actual parameters Domain = a set of values which can be arguments to or results of a procedure (a "data type") Actual Parameter = the expression syntactically occuring in a parameter position in a procedure invocation; evaluates to an argument (value) Formal Parameter = the identifier occuring in a parameter position in a procedure abstraction; references in the body refer to the corresponding argument (somehow...) Procedure Abstraction = an expression whose value is a procedure that can be applied; probably requires delayed evaluation of some parts (such as formal parameters); informally, a "procedure definition" Sequencer = essentially, a goto Subtleties: - In PASCAL, a declaration yeilds both a new environment and store since it requires storage to be allocated for the variable declared. - Prog. lang. "procedures" are distinguished from mathematical "functions" because the latter are merely mappings from a domain to a codomain, while a procedure is dependent upon and may alter (side effect) computational state. - Use of sequencers (gotos) makes it very difficult to give a formal semantics for a programming language, since it's hard to say what the effect on environment or store is when expressions that may contain gotos are combined in a control construct. Abstract Syntax: From a semantics point of view, the abstract syntax of a language is simply a listing of all of the possible forms for each of the syntactic classes. It does not specify which strings are well-formed program texts, nor their phrase structures. It is pretty much just a list of the elementary classes and of the structure of non-elementary classes. According to Elkan, the abstract syntax "ignores lexical details"; it is typically tree structured. Abstract syntax is what the parser "discovers" as it processes the input text, discarding details (like brackets, keywords and semicolons) to create an intermediate (tree-like) representation. Concrete Syntax: Does indicate which strings are well formed, and what their phrase structures are. A concrete syntax can be given as a grammar (in BNF or similar Chomsky-like form), or as a syntax diagram. A syntactic description is _ambiguous_ if, for any text, it specifies more than one phrase structure. Unfortunately, there is no general method for testing whether a syntax is ambiguous, although it can be proven in particular cases. Note that BNF and similar syntax diagrams are Context Free; it is difficult to specify the context sensitive aspects of programming languages, and that is where many of the differences between PASCAL implementations arise. T Chapter 3: Data-------------------------------- In PL theory, data elements are considered to be elements of sets called domains. Domain theory is part of mathematical foundation theory, and therefore difficult and only marginally relevant: i.e., it is very difficult to mathematically construct a model of the domains in common use, and few people care. Generally, some primitive domains corresponding to primitive datatypes like integer, char and boolean are assumed, then more complex domains can be constructed through product, sum and function definition: Product of Domains: the cartesian cross-product, i.e. an ordered tuple of elements from the corresponding domains. Example: type complex = Complex of int * int Projection Function: accesses an element of a domain product (e.g., a record element) Sum of Domains: a union type, tupled with a tag that identifies which subtype is present Example: type canine = Dog | Wolf Injection Function: constructor for a sum of domains type Function domains are indicated by "A -> B" where A and B are the input and output domain, respectively. Note that domains can be recursively defined, e.g. datatype intlist = nil | Cons of int * intlist In practice a recursive domain can be problematic, since it implies the datatype may require an infinite amount of storage space. For this reason recursive structure definitions are not allowed in PASCAL or C, and pointers are used instead. LISP and other languages automatically handle pointers to allow recursive definitions. T Chapter 4: Storage--------------------------------------- Remember that evaluation of a program takes place with respect to an environment and a store. The storage model of conventional PLs is a collection of locations into which primitive values can be placed. The environment maps identifiers to locations, and the store maps locations to values. Whenever a binding is created for a program scope, the environment is augmented and a new location is established. Aliasing = when two or more identifiers are bound to the same location L-Value = a location; the value of an identifier when considered as the target of an assignment R-Value = the contents of a location; the "ordinary" value of an identifier Selective updating: when a single element of a compound structure (e.g. an array) can be updated, the structure must be modelled as a structure of locations (i.e. one location per primitive value). Increment op: this operation treats its argument as both an l-value and an r-value. Multiple targets: if L1 := L2 := ... := E is permitted, all arguments but the last are treated as l-values, and the semantics can vary depending on evaluation order and parameter passing semantics. Pointers are a mechanism for making locations "first class" in the language. They allow locations to be parts of complex values, passed as arguments, etc. Storage management issues raised by the presence of pointers make (correct) programming more difficult: allocation & freeing, sharing, update-in-place. Storage insecurities: the language (and/or program) should provide for new locations to be initialized before they can be read; the ability to reference a location that has been deallocated is called a dangling reference, and can cause strange errors. Factoid: FORTRAN (always? often?) statically allocates locations for local variables so that their values persist between procedure invocations; another reason why recursion is not permitted. T Chapter 5: Control----------------------------- Sequential Composition = two or more commands are executed in sequence, passing along the modified environment and store. The semicolon is the composition operator in PASCAL: C1; C2. Selective Composition = basically, the if-then-else construct, of which the case statement is a variant. Iterative Composition = any kind of loop. A definite iteration is one in which the number of iterations is determined beforehand (e.g. a for loop), in contrast with an indefinite iteration. In general, there are many possibilities for looping, and no single control structure captures them all. Algol68 provides this loop control structure: for I from E1 by E2 to E3 while E4 do C od where any part can be omitted, except do C od. In general, control structures are used to compose commands, but there is no reason why the same structures can't be used for expressions, especially where expressions can affect state. Non-determinate selection is also a possibility for the semantics of a selective composition structure (i.e. pick any one of the commands whose guard is satisfied). T Chapter 6: Binding---------------------------------- Scope = textual region in which a binding is effective Extent = temporal period during which a binding is effective Binding Occurrance = use of an identifier that establishes a binding Applied Occurrance = any other use (a reference) Syntactic Binding = one established in a syntactically detectable way Implicit Binding = In PASCAL the command with L do C establishes bindings for the field names of the type of L within the scope of C. Default Binding = In FORTRAN, some identifiers are bound by default to numeric type values (i.e. they can be used without declaration) Overloading = binding of multiple values (locations) to a name, that are differentiated by type and selected by context of use; can be overloaded operators, or use of function names for return value storage in PASCAL. Free identifier = an identifier is free in a construct if it is not bound in/by the construct. External bindings: an occurance can be both binding and applied if it serves to extend the scope of an externally established binding. T Chapter 7: Procedural Abstraction----------------------------- Procedural abstraction is provided in PASCAL by the procedure and function definitions, in LISP by lambda and defun, etc. The necessary parts are (as in the lambda calculus) - a formal parameter part (possibly empty) - a body: a construct whose interpretation is deferred until the resulting procedure is invoked There are two models for binding free identifiers in a procedural abstract: - static binding: the binding is determined bye the static scope, i.e. with reference to the nearest textual binding scope within which it lies; it does not change during program execution - dynamic binding: the binding is determined by the invocation environment, i.e. the most recently established binding of the identifier; can change during program execution Note that in a language with dynamic scoping, the execution of a procedure can alter not only the store but also the environment. The Principle of Abstraction says that the body of a procedure abstraction can be any semantically meaningful syntactic class, including an l-value. (Though it may not always be useful, for instance to abstract a numeric literal.) From Elkan: FUNCTION VALUES ARE CLOSURES Definition: a free variable is a variable that is not a formal parameter. Definition: a closure is a pair consisting of an expression and an environment providing values for all free variables of the expression. Note that it makes sense to talk about the closure of an expression that is not an abstraction, but most closures are of abstractions. STATIC VERSUS DYNAMIC SCOPING In a nutshell, at "definition time", when an abstraction is evaluated to give a "function value": - under static scoping the function value is a closure which saves the current environment - under dynamic scoping, only the abstraction expression is saved. At "use time", when the function value is applied: - under static scoping the saved environment is used - under dynamic scoping the current environment is used DISADVANTAGES OF DYNAMIC SCOPING Let's say "compile time" and "run time" as conventional words for what is better called "definition time" and "use time". Dynamic scoping requires keeping knowledge of names (i.e. actual identifiers) at runtime. They refer to values in the most recently entered runtime scope. Alpha-conversion is the rule that the meaning of an expression is unchanged if the name of a bound variable is changed everywhere. Dynamic scoping violates alpha-conversion: if the parameter name x in the abstraction for times was changed to z, both static and dynamic scoping would give 20. T Chapter 8: Parameters---------------------------------- The basic options for parameter passing are: Call by Value = parameters are evaluated to an r-value at invocation time, which is then stored in a new location bound to the formal parameter Call by Reference = the mechanism of PASCAL var parameters whereby the actual parameter must evaluate to an l-value (location) that is then bound to the formal parameter. In essence, the formal parameter becomes an alias for a pre-existing location. Call by Value-Result = the mechanism of Ada (supposedly) whereby a new location is allocated for the binding of formal parameters, but upon termination of the procedure the contents of those locations are copied into the locations of the actual parameters Call by Name (substitution) = the bizarre mechanism of ALGOL 60 that survives only as an academic curiosity that you need to know. From Elkan's notes: Algol 60 has a special parameter-passing mode called "call by name". Consider the pseudo-Pascal summation function function sum (lo,hi:int; name i:int; name f:real); var s: real := 0; begin for i := lo to hi do s := s + f return s end; Here the actual arguments for i and f should be evaluated again each time the formal parameter is used, so we can write for example sum(1,n,j,a[j]*b[j]) to compute the dot-product of two vectors a and b. Call-by-name is always implemented using closures. The summation function is compiled as function sum (lo,hi:int; name i:void->loc int; name f:void->real); var s: real := 0; begin for i() := lo to hi do s := s + f() return s end; Here "loc int" means "location that can hold an integer", i.e. pointer to an integer. "void" is the special type that has only one member value, designated (). The use of closures to implement function parameters ensures that actual arguments are always evaluated in the context of the calling environment. The call is translated to sum(1, n, < fn () => j, { j -> locj } >, < fn () => a[j]*b[j], { a |-> loca, b |-> locb, j |-> locj } > ) Closures into which actual name arguments are converted are called thunks. One minor complication: some thunks must return an l-value or an r-value. Aside: this illustrates the Principle of Abstraction. Factoid: some PL/I and Fortran processors apparantly use call by reference when the actual parameter is an l-expression of the same type as the formal parameter, and call by value otherwise. Tennent is suspicious of this haphazard practice. T Chapter 9: Definitions and Blocks--------------------------- All Tennent really has to say here is that blocks and block definition scopes are essentially equivalent to procedure abstractions (which is not surprising, when one considers functional languages). He then notes some possibilities for a set of definitions associated with a block structure: Mutually Recursive Definitions = a sequence of definitions which can use each other in their bodies Sequential Definitions = only prior definitions can appear on the right-hand side (in the body) Simultaneous Definitions = all are evaluated in the same environment (prior to establishment of any) Tennent then discusses a mechanism for making private definitions (which actually shows up as one of the comps questions), and suggests that the Principle of Abstraction anticipates OO objects as procedures whose bodies are definitions. T Chapter 10: Jumps------------------------------------ "The continuation for some computation is whatever comes after it, expressed as a function of the results expected for that computation." Consider C1; C2 The continuation for the execution of C1 starts with an execution of C2 relative to the store resulting from the execution of C1. The execution of C2 inherits the continuation of the whole construct as its continuation. The notion of a continuation provides a means for reasoning about the effect of jumps in control flow; otherwise it's very difficult to say anything about constructs which essentially behave like a goto. I once was involved with a lisp system that allowed continuations to be captured in a closure, and returned to. T Chapter 11: Concurrent Processes----------------------- Now you can see why the last chapter dealt with continuations. T Chapter 12: Types--------------------------------------- Domain Incompatibility = applying an operation to a value not in its domain. Domain incompatibilities can be handled by (some mixture of) the following three methods. - Domain Testing: an operation might only be valid on a sub-domain of a primitive datatype. In that case domain testing ought to be done for safety (for example, to catch uninitialized values). - Coercion: when an argument is domain incompatible, either an error can be signaled, or a coercion can be performed. Signaling is troublesome to programmers, but coercion is dangerous in that it might give rise to subtle errors. - Type Checking: can be done dynamically, or better yet, statically, to improve efficiency of compiled code and maybe clarity of program. But, static typing does imply a more complex and restrictive program syntax. Type Equivalence: deciding when two types are the same. - structural equivalence: if two types have the same number, type and construction of components, they are the same type. - occurrence equivalence: two types are the same only if they are expressed by the same occurrence of a non-trivial type expression (i.e. only if they are defined by the same instance of a definition) Polymorphism: generally an operator is only defined for one domain, but when it is desireable to define similar functions over several domains the ability to define a polymorphic funcition is of benefit. - parametric polymorphism: the list functions in ML can be applied to lists of anything, their domain type can be systematically made specific. - ad hoc polymorphism: having a few fixed alternative types, also called "overloading" New Types: When a language supports type definitions, we can distinguish two kinds, as follows. - Concrete datatype: a new representation. In PASCAL a record, in C a struct, in ML a datatype. The representation usually comes with constructor and access functions and sometimes a predicate to identify instances. A concrete datatype abstracts only some representation details (e.g. pointers and integers, etc.) - Abstract datatype: a collection of interface functions that provides a datatype while shielding its concrete representation. "Concrete types emphasise the form (structure) of values as composite objects built with constructors. In constrast, abstract types emphasise the logical behavior of values." (Reade p226) An abstract datatype is typically a module that provides function definitions while abstracting the concrete representation. From Elkan: An abstract data type introduces a new type indirectly, by giving a set of operations that can produce and operate on a new set of values. The new set of values is characterized by specifying the relationships between the operations introduced. THE DEFINITION OF AN ADT An ADT has two parts: a specification part and an implementation part. A specification has two parts: a signature and a semantic part. An implementation has two parts: a concrete datatype and a signature implementation. Informally: - the signature specifies constructor/destructor function types - the semantics part specifies relationships between these - the concrete datatype specifies the actual representation of values - the implementation satisfies the signature and semantics part. AN EXAMPLE ADT FOR ORDERED LISTS OF INTEGERS First here's a signature: signature ordintlistsig = sig type ordintlist val onil: ordintlist val ohd: ordintlist -> int val otl: ordintlist -> ordintlist val null: ordintlist -> bool val insert: int -> ordintlist -> ordintlist val sort: int list -> ordintlist end; Next here is a sketch of a semantic part null nil = true; forall i:int x:ordintlist null (insert i x) = false forall i,j:int x:ordintlist insert(i,insert(j,x)) = insert(j,insert(i,x)) ... The semantic part is conceptually important but real-world programming languages do not require it. There are two reasons for this - verification is not practically feasible - the semantics of specification languages is an open question (Joe Goguen has done a lot of research on this) Finally, here's an appropriate (incomplete) implementation: structure ordintliststruct : ordintlistsig = struct datatype ordintlist = Ordered of int list fun doinsert (n: int) [] = [n] | doinsert n (a::x) = if n <= a then n::(a::x) else a::(doinsert n x) val onil = (Ordered []) fun ohd (Ordered (a::x)) = ... fun otl (Ordered (a::x)) = ... fun null (Ordered []) = true | null (Ordered (a::x)) = false fun insert n (Ordered x) = Ordered (doinsert n x) fun sort x = (Ordered x) end Notice that the "sort" function is implemented as a "no-op". This implies that the "ohd" and "otl" functions must have non-trivial implementations. Note: having never used an OO language myself, these notes are weak on OO concepts. INHERITANCE (also from elkan) An ADT is a collection of values defined implicitly by a set of primitive operations whose implementation is hidden from the user. So is a class. The main difference is that objects are almost always mutable values. From a programming language (as opposed to programming methodology) point of view, the major innovation in OOP is inheritance. A child class that inherits from a parent class is also called a derived class. In order for inheritance to work properly, identifiers must resolve to methods and variables of the child, where they exist, in other words they must be dynamically scoped. OOP KEYWORDS (1) "Object identity". Even if two objects have the same internal value, they are still different objects, just like two variables are different even if temporarily they have the same value. (2) "Encapsulation". The state of an object is private to the object: the only way the state can be changed or observed is by calling one of the object's methods. (3) "Overloading". Two objects of different classes can have a method with the same name. If these methods perform conceptually similar operations, the caller of the method does not need to know the exact class of the called object. This is a programmer-defined ad hoc polymorphism. LEXICAL AND SYNTACTIC ANALYSIS ==================================== In the real world, these tasks are usually handled by lex and yacc and similar tools. So probably what one ought to understand is how these two tools work (in theory, not how to write their input specifications, which I don't know). Lexical Analysis::: The lexical analyzer is the first phase of a compiler. It gobbles up a continuous string of characters (the raw source file) and spits tokens out the other end. Tokens are the linguistic elements in which the source language's concrete syntax is defined (identifiers, keywords, special symbols, etc.) As luck would have it, the "lexical syntax" of tokens is almost always describable by regular expressions, so tokens can be identified by a pattern matching Deterministic Finite Automaton, which is exactly how a typical lexical analyzer is structured. A lexical analyzer is sometimes called a scanner or a lexer. Theoretically, this is probably the most totally-solved part of a compiler. Real world details: the lexical analyzer usually doesn't run amok over the input file churning out a spray of tokens, but responds to explicit calls from the parser to supply one token at a time. The token is usually entered in the symbol table by the lexer, then every compiler phase from the parser onwards deals exclusively with symbols (pointers to symbol table entries). The lexer is usually responsible for discarding any extraneous lexical clutter from the input file such as comments and extra whitespace, and eliminating case distinctions in case-insensitive languages (e.g. FORTRAN). Buffering: because the lexer does real character i/o to read the file, it can be a bit slow and buffering tricks to reduce i/o are useful. For example, a buffer twice the size of a disk block can be used, with a head and tail pointer marking the region of interest. As that region moves forward, half the buffer can be refreshed from the disk in one shot. Specification of tokens: tokens can be described by regular expressions. Remember that regular expressions are closed under union and concatenation, so a token specified by a regular expression can then be used in the specification of other tokens. Lex appears to allow the programmer to make use of this trick. Hence the specification input to lex is essentially a set of regular expressions describing each token type, although it includes regular definitions to make the task easier, and auxillary routines to be called when a token has been discovered. A token type might have only one instance, like "begin", or many possible instances, like a literal floating point number. In the latter case, the auxillary routine will make sure that the actual number gets entered in the symbol table, although a generic token type like "float-literal" will be delivered to the parser. Finite automata: an NFA that recognizes the specified set of regular expressions can be described by a transition diagram that is easy to mechanically generate from the specification (well, a table equivalent to the diagram). Each arrow in the transition diagram corresponds to reading one character from the input, and following the edge labeled by that character. Using the standard algorithm (substitution of new states for sets of reachable states), the NFA can be converted to a DFA that recognizes the same language. Or a two-stack machine that simulates the NFA can be used. The tradeoffs are as follows: building the NFA can be done in time and space linear in the regular expression specification, O(|r|), then scanning the input can be done in time O(|r| * |x|), where |x| is the length of the input. Faster scanning time (linear: O(|x|)) can be achieved by converting the NFA to a DFA, but its worst case size (and hence construction time) can be exponential. For a real compiler, of course we want the DFA. But for an interactively programable lexer for some kind of user interface, the interpreted NFA might be better. Optimizations: algorithms exist to go directly to a DFA without building the NFA, to minimize the number of states of the DFA, and to represent the DFA in space smaller than the full 2D table (when it's sparse). I can't imagine we need to know these algorithms. Syntax Analysis::: The lexer is called by the parser which accepts a sequence of tokens and builds an abstract syntax tree. (Conceptually; it may just generate a parse tree which is reduced to an AST by another phase, or it may directly generate some kind of linear three-address code. In fact, for a simple language with no optimization, the parser can be designed to output executable code. But in any case, it essentially discovers the abstract syntactic structure that guides the "syntax directed" generation of the compiler's intermediate program representation.) As luck would have it, the concrete syntax of most programming languages can be described by a context free language, so a parser is typically just a CFG recognizing, stack-state automaton. As the automaton recognizes each syntactic construct, a "semantic action" defined by the compiler writer is executed to build part of the abstract syntax tree, generate IL code, or whatever. Review of CFG theory: A CFG consists of a set of productions, which are all of the form A -> beta where A is a member of the set of non-terminals and beta is a string of 0 or more terminals and/or non-terminals. (For 0, we use the distinguished terminal epsilon...) Clearly the relevant sets of terminals and non-terminals must be defined, and also there is a special non-terminal 'S' used as the start symbol. A Derivation of a string of terminals is a demonstration of a sequence of productions in the grammar, beginning with S, that expands to that string. A derivation involves choosing, at each step, which non-terminal to replace, and which production to replace it with (there can be more than one for any non-terminal). One derivation algorithm is to always replace the leftmost non-terminal, another is to always replace the rightmost. A Sentential Form is a sentance (possibly containing nonterminals) that can be derived from the start symbol. A Parse Tree is a graphical representation of a derivation that omits information about replacement order. A CFG is Ambiguous if more than one (distinct) parse tree exists for some string in the language. Ambiguity can result when operator binding precedence rules are not incorporated. Example ambiguous grammar: stmt -> if expr then stmt | if expr then stmt else stmt | other disambiguated version: stmt -> matched | unmatched matched -> if expr then matched else matched | other unmatched -> if expr then stmt | if expr then matched else unmatched Try parsing the string if E1 then if E2 then S1 else S2 to see how these two grammars differ. Left Recursion: a grammar is left-recursive if there exists a derivation A =>+ A alpha for some string alpha. Top down parsing methods cannot handle left recursion, but the following technique can be used to eliminate left-recursion. A -> A alpha | beta can be replaced by productions A -> beta A' A' -> alpha A' | epsilon Left Factoring: In order to do top-down parsing it is necessary to be able to decide which production to use to expand a non-terminal on the basis of the first token seen. Algorithm: - For each nonterminal A find the longest prefix alpha common to two or more of its alternatives. If alpha is not epsilon, i.e., there is a nontrivial common prefix, replace all the A productions A -> alpha beta1 | ... | alpha beta_n | gamma by A -> alpha A' | gamma A' -> beta1 | ... | beta_n until no two alternatives for a nonterminal have a common prefix. Top Down Parsing:: Recursive Descent Parsing: Predictive Parsing: are special cases of one another, as indicated. Top-down parsing can be viewed as an attempt to find a left-most derivation; recursive descent is a TDP technique that may involve backtracking; predictive parsing is the special case of recursive descent where backtracking is never needed. The basic recursive descent techique seems to be: maintain a pointer to the input string; start with a parse tree consisting only of the start symbol; expand the leftmost non-terminal in the parse tree into something begining with the same non-terminal as the input & advance the input pointer; when the result doesn't match the input, or can't be expanded to match, back up to the last untried possibility and try again. Clearly, the requirement for a predictive parser is that, looking at the next input symbol and the leftmost non-terminal in the tree, we can know which one production ought to be used to replace it. An LL(1) grammar is one for which a simple, predictive, top-down parser can be constructed automatically. LL(1) stands for Left-to-right, Leftmost, 1-token lookahead: the input is parsed left to right, the leftmost non-terminal is always expanded, and at any time the decision of what to do next is taken on the basis of the top symbol on the stack (the recursive descent state) and the next token in the input. In order to be LL(1) a grammar must be unambiguous and non-left-recursive. But those aren't sufficient conditions: I think left-factoring is the technique that handles the rest of the cases. The ultimate definition is whether the parsing table has any multiply defined entries. Designing a recursive-descent parser: In general, the parser is a subroutine that takes a grammar symbol and a next input pointer as arguments, tries to match the grammar symbol to the input, and either reports success and the updated input pointer, or failure to its caller. It calls itself whenever a production involves any non-terminals on the right-hand side. The body of the routine is a big case statement with a clause for each production. (This is very rough, I haven't really done it myself.) Backing up occurs when the subcall reports failure, so the routine can try an alternative. Using transition diagrams (syntax diagrams): Each grammar production can be represented as a finite-state diagram, where the edges connecting states are labeled by the terminals and nonterminals on the righthand side (see Dragon book). The transition diagrams can be used to aid in design of a recursive descent parser, or they can be directly interpreted by a general RD parser program. In that case, assuming no ambiguity, the parser follows this algorithm: Begin in the start state of the transition diagram of the grammar start symbol. When the final state of that diagram is reached, parsing is complete. Look at the next input symbol, and take the diagram edge corresponding to it, if one exists; if not, take an edge labeled by a nonterminal, and recursively call the parser with the start state of the diagram corresponding to that nonterminal, without consuming any input. Return to the caller when that diagram has been completely traversed, possibly by recurring on others. Etc. Of course, a stack can be used instead of recursion. Designing a predictive parser: An easy way is just to write out the transition diagrams, then look at them. If there is never more than one choice of what to do on a given input in any diagram, then a predictive parser can be built by the above methods. (If epsilon transitions exist, they are ok if they are alternatives to terminals, and are interpreted as to be taken only if the next input token doesn't match the terminal.) A standard LL(1) table driven parser can be built using a table constructed by calculating the FIRST and FOLLOW sets for the grammar. If this table (rows are labeled by nonterminals, columns by terminals + '$') has no more than one action in any square, then the grammar is LL(1). The FIRST and FOLLOW sets of each nonterminal are calculated as follows. 0) for each production of the form A -> b where b is a terminal or epsilon, b is in FIRST(A) 1) for each production of the form A -> Y1 Y2 ... Yk (FIRST(Y1) - epsilon) is a subset of FIRST(A) if epsilon is in FIRST(Y1) then (FIRST(Y2) - epsilon) is a subset of FIRST(A) if epsilon is in FIRST(Y1) and FIRST(Y2) then (FIRST(Y3) - epsilon) is a subset of FIRST(A) ... if epsilon is in FIRST of all Y1...Yk, then epsilon is in FIRST(A) 0) $ is in FOLLOW(S) 1) if A appears in a production of the form B -> alpha A beta then FIRST(beta) - epsilon is in FOLLOW(A) if epsilon is in FIRST(beta) then FOLLOW(B) is in FOLLOW(A) 2) if A appears in a production of the form B -> alpha A then FOLLOW(B) is in FOLLOW(A) Given these two sets, the table is filled in according by iterating through the productions. For each production of the form A -> alpha place that production in M[A,a] iff a is in FIRST(alpha). If epsilon is in FIRST(alpha), place the production in M[A,b] for each member b of FOLLOW(A). The result is the transition table for a stack-based, top-down, predictive parser. Bottom-Up Parsing:: The three varieties of bottom-up parser are SLR, LALR(k), and LR(k), the first two of which are really just special cases of LR parsers. The LR stands for Left-to-right scanning to discover a Rightmost derivation; these parsers are called "bottom-up" even though they work only on a prefix of the total input, like a top-down parser, because they work left-to-right to find a rightmost derivation *in reverse*. - "There is a significant difference between LL and LR grammars. For a grammar to be LR(k), we must be able to recognize the occurrence of the right side of a production, having seen all of what is derived from that right side with k input symbols of lookahead. This requirement is far less stringent than that for LL(k) grammars where we must be able to recognize the use of a production seeing only the first k symbols of what its right side derives. Thus LR grammars can describe more languages than LL grammars." (p 221) - "Intuitively, in order for a grammar to be LR it is sufficient that a left-to-right shift-reduce parser be able to recognize handles when they appear on top of the stack."(p 220) - an ambiguous grammar can never be LR. Right. So the key idea in LR parsing is that at each step we move a token from the input stream to the stack until we recognize that the righthand side of some production lies on the stack, at which point we pop the whole RHS and push the LHS nonterminal. To keep track of what is on the stack, a state is pushed after each token, so token/state/token/state... pairs are interleaved on the stack, with each state symbol essentially encoding what lies below it on the stack. Thus the parser can decide what to do at each point by consulting the state stored at the top of the stack, the next input token (or several...), and a precomputed table that encodes what action to take in each contingency. Pushing an input token and a new state onto the stack is known as "shifting" (state & input), while replacing a RHS string on the top of the stack with a single non-terminal is known as "reducing". Hence the bottom-up LR parsing method is also known as shift-reduce parsing. All good things come at a price: LR parsers can handle a wider class of grammars than LL parsers, but a non-trivial LR parser table is almost impossible to derive by hand. An SLR (Simple LR) table is easiest to generate, but can't handle some common gramatical structures. An LALR(1) (1 token Look Ahead LR) table isn't too much harder, and can handle most programming language constructs, so that's the most common type generated by tools (such as yacc). An LR(k) parser can handle the widest class of grammars, but the table can be much bigger than an LALR(1) table for the same grammar (several hundred states versus several thousand; size of parser table is proportional to states * tokens). Construction of an SLR table begins with construction of the LR(0) canonical item sets. Construction of an LALR table begins with construction of the LR(1) canonical item sets, which are a bit more complex. The "items" in these sets are productions in the grammar annotated with partially recognized RHS and (possibly) look-ahead characters to correspond to handles that might be seen during parsing. For now, I've decided to draw the line at learning how to build an SLR table, and forget about the other tables. These notes are no substitution for studying the Dragon book. Algorithm: Start by augmenting grammar G to obtain G' by adding a new production S' -> S where S is the start state. Def: the Closure of a set of items I is formed by adding new members to I by the following rule. - if [A -> alpha . B beta] is a member of I, and B -> gamma is a production in G', then add the item [B -> . gamma] to closure(I). Intuitively, the closure gets filled out with every item whose RHS corresponds to something that might appear next. Item set closures are used as the States of the generated parser. Def: the Goto operation computes state transitions. goto(I,X) = I' where I' is the closure of the set of items [A -> alpha X . beta] such that [A -> alpha . X beta] is in I. Intuitively, it is the state that would be reached by recognizing (reducing) the construct represented by X (which can be a terminal or nonterminal). 1) Start with I0 = closure([S' -> S]). 2) For every set of items I and each grammar symbol X such that goto(I,X) is not empty, add goto(I,X) to the canonical collection. The SLR table can now be constructed from this collection of item sets. Label the rows of the table with the canonical item set numbers 0...n, which are the states. Label the columns with the symbols of the (unaugmented) grammar, plus the distinguished symbol '$' that denotes end of input. Arrange all terminals to the left of and all nonterminals. Entries in terminal (+$) columns are shift/reduce actions to be taken when looking at that token in the input, while entries in nonterminal columns are goto actions to be taken after a reduce depending on the resulting nonterminal. Actions are computed by the rules: - if [A -> alpha . a beta] is in Ii and goto(Ii,a) = Ij, then action[i,a] is "shift j" - if [A -> alpha .] is in Ii, then action[i,b] is "reduce A -> alpha" for all b in Follow(A) (except if A = S'). - if [S' -> S] is in Ii, then action[i,$] is "accept". The goto columns are filled by the rule if goto(Ii,A) = Ij, then goto[i,A] = j. If there is are any multiply defined actions, then the parser is said to have a conflict. The SLR parser works by starting in state I0, then shifting input to the stack and reducing it according to the shift, reduce and goto's defined, until it "accept"s its input, or an undefined entry is called for, in which case a syntactic input error has been detected. When does a conflict arise in a shift-reduce SLR parsing table? When within one of the canonical item sets there is more than one production with some non-terminal b indicating a shift action (to different states, a shift/shift conflict), or there is only one, but there is also a reduce action where the follow set of the production includes b (a shift/reduce conflict). (A reduce/reduce conflict is also possible.) (Not too hard to see, with a little practice.) A section in the book points out that sometimes ambiguous grammars (which give rise to shift/reduce conflicts) are convienient, and their parser tables can have the conflicts eliminated by simply picking one of the conflicting actions (reduce the higher-precedence production) without changing the grammar. Note that yacc automatically resolves such conflicts in the order of actions listed by the programmer, (picking a shift over a reduce, in case of a shift/reduce conflict), and provides a mechanism for the programmer to override the default resolution behavior. Error Handling:: "Compared with operator-precedence parsers, the design of specific error-handling routines for an LR parser is relatively easy. In particular, we do not have to worry about faulty reductions; any reduction called for by an LR parser is surely correct. Thus we may fill in each blank entry in the action field with a pointer to an error routine that will take an appropriate action selected by the compiler designer." Syntax-Directed Translation::: The following is a rather abreviated treatment that summarizes results, but not techniques. Essentially, syntax directed translation is the scheme by which a source language is transformed according to actions whose definitions correspond to the syntactic structure of the source. In concrete terms, given an automaton that recognizes a context-free grammar, the recognition states can be augmented with actions to be taken to do any of these kinds of common tasks: - generate a parse (or abstract syntax) tree - reduce the abstract syntax to a semantically equivalent form (eg. evaluate expressions) - construct a semantically equivalent alternate representation (eg. generate intermediate code) Many parser generators allow grammar definitions to be augmented with _Syntax Directed Definitions_ which allow the attributes of symbols appearing on either the left of right of a production to be defined in either a synthesised or inherited fashion. - Synthesised attribute: one which depends only on attributes of children of the node, in the abstract syntax tree. The synthesised attributes of a non-terminal can be determined as soon as the attributes of all symbols on the production RHS have been set. - a definition is S-Attributed if it uses only synthesised attributes. It is easy to see how an S-attributed definition lends itself naturally to implementation by a bottom-up parser. - Inherited attribute: one which depends only on attributes of the parent and/or siblings of the node. Note that a subclass of inherited attributes are those that depend only on siblings to the left, ie. nodes preceding in a depth-first-traversal of the syntax tree. - An L-Attributed definition is one whose inherited attribute definitions fall in that subclass and hence can be implemented by an automaton that recognizes constructs in DFS order. "Although it is always possible to rewrite a syntax-directed definition to use only synthesized attributes, it is often more natural to use syntax-directed definitions with inherited attributes." (p 283) Nitty gritty part of chapter 5 shows that L-Attributed translation schemes (more general than S-attributed) can be implemented by both top-down (predictive) and bottom up (LL(1) and LR(1)) parsers. Apparantly only a subclass of LR(1) is practical...but what else is new. Factoid: declarations in languages like C tend to be more naturally expressed as L-attributed definitions, while ordinary expressions can usually be handled by S-attributed definitions. SEMANTICS AND FUNCTIONAL PROGRAMMING ============================== To be drawn from Elkan, Reade and chapter 13 of Tennent. Skip for now, since maybe someone else will do a good job. CODE GENERATION AND RUNTIME SYSTEMS I ============================= This and the following section are mostly just dumps of my (becoming more) foggy recollections of how it all works, at a fairly high level. If you notice any errors, I'd appreciate hearing. Symbol Tables:: Typically all tokens are inserted in the symbol table by the lexer, then pointers to symbols are handled by the rest of the compiler, though it is possible for symbols to be inserted by any phase (temporary variables, pseudo-registers, etc.) The symbol table encodes not the text of symbol names, but also any attributes they may incidentally have, eg. lexical class (keyword, variable, integer, float), and if a variable its type, scope, storage requirements, associated register, etc. The symbol table acts kind of like a library of attribute information used in common by all phases of the compiler. The typical implementation is a hash table, of either bucket-chaining or open-insertion type. Code Generation:: The form and details depend on the intermediate representation, and how much optimization is to be done. In a minimal compiler, it is possible to generate (lousy) object code directly from the abstract syntax tree, in a single syntax-directed pass (down and up, to get control structures right; think DFS). In such a case not much can be done with register allocation. The basic algorithm is to start at the leaves of expressions, generating code that evaluates them and stores the results in temporaries, then work your way up the tree, retrieving the values of operator arguments from temporaries back into registers, doing the operation, then dumping back into (new) temporaries. The major issue, perhaps, is instruction selection which is difficult for CISC and simple for RISC (because there's usually only one choice). The dragon book spends some space showing how crude register allocation can be done on an AST intermediate representation, by doing a pre-pass that calculates how many temporaries are needed at each node, and picking those that will reside in registers before doing the code generation. For example, registers might be liberally assigned to values needed at leaves, then values generated above might be spilled to temporaries when available registers run ought. A little better analysis might allow reuse of registers when value lifetimes don't overlap, and making sure that subtrees leave their results in different registers for use by the parent. Note that register allocation is not the last word in optimization, especially when the compiler doesn't do anything else, just an example of something that is typically done late in a compiler and difficult to do in a syntax-directed control structure. Intermediate Representations: In a relatively simple compiler, likely a tree of some kind (an AST, of course). Early PASCAL and C compilers seemed to use this kind of IL, and the dragon book spends a lot of space describing baroque algorithms for doing optimization and code generation on trees. Trees are advantageous when space is tight and you want the whole compiler to run in one or two passes. It's amazing how much you can squeeze into a couple of tree-traversing passes if you don't care about how unintelligible your compiler becomes. In a contemporary compiler, the IL is much more likely to be some kind of three-address code. (A linear sequence of instructions, each of which consists of an operator and 0 to 3 arguments, no more than one of which can be a location to be updated.) 3-address code can be straightforwardly generated in a syntax-directed fashion, by extravagently generating new temporaries whereever needed, then optimized by subsequent phases to eliminate unnecessary temporaries (and instructions), and map the rest to registers and real storage locations. Simple Register Allocation: First it should be noted that register allocation is only worthwhile if you have some registers to allocate. In the old days there were a lot of machines with very few general purpose registers, so there wasn't much point; instead the focus was on complex instructions whose arguments were memory locations. The simplest scheme is no allocation: for each instruction the arguments (if they must be in registers) are loaded into the same 2 or 3 registers, the result is computed into the same place, then stored back into memory. The next step up in sophistication is to allocate registers on the fly (in the compiler), in numeric or least-recently-referenced order, keeping track of what is in each register, so it can be reused without reloading if the opportunity comes up. This scheme doesn't require any prior use analysis. Optimization:: The following optimizations are all optional, in the sense that they are not supposed to change the meaning of (correct) code, and probably the code generator will work even if they are left out. There are versions of most of these that work on a tree-structured IL, but it is easier to assume a 3-address code implementation with basic-block analysis. Basic Blocks: A basic block is a sequence of instructions where once control enters, it is guaranteed to to continue execution through the last one. So, in a sequence of TAC, each label or instruction following a branch begins a BB, and each branch or instruction preceding a label ends a BB. If TAC is in a list or array, a BB can be represented by a head and tail pointer (index), or a head and an offset. Common Subexpression Elimination: Sometimes two or more variables are assigned equivalent values, e.g. a := b * c; ... d := b * c; In such a case, if it can be proven that the values of b and c don't change between these two statements, then the second instance of (b * c) can be replaced by (a). In general, arbitrarily large subexpressions can be computed once, and the result substituted, whenever none of the variables involved in the expression occurs on the LHS of an assignment between the two statements concerned (and they are both in the same basic block...). Common subX's can be recognized by building a DAG for a BB that represents each variable as a node in the dag, with interior nodes representing arithmetic operations, and with as much shared substructure as possible. Warning: when aliasing is possible (i.e. two variables refer to the same location), or when array updates occur, some kinds of analysis like the above don't work, or need extra care. Copy Propagation: When common subX and other optimizations are performed, the resulting code sometimes contains a lot of trivial assigments like f := g; these are "copies". The idea behind copy propagation is to substitute uses of g for subsequent uses of f (until f is assigned a different value), so that the copy instruction can be eliminated. Algebraic Identities: (Also known as strength reduction in some contexts.) Sometimes it is possible to substitute a weaker (i.e. easier to compute) operation for the one specified by the programmer. For example, multiplication by a power of 2 can be replaced by a shift, addition of 1 by an increment (on some architectures), multiplication by 1 can be eliminated, etc. Most trivial instances of reducible algebraic identities can be discovered and converted by a peephole optimizer, which slides a window of 1 to several instructions down the code, looking for patterns that match its action rules. Functional languages allow much greater opportunity for transformation based on algebraic identities, since they have referential transparancy and thus support a much richer set of identities. Loop optimization: The basic idea with loops is: 1. get things out of the loop 2. take advantage of the loop control to simplify operations inside The rule of thumb is that an average loop executes about 10 times, so instructions that precede a loop are only 1/10 the cost of those inside. Also, any optimizations that can be done inside the loop are 10 times more valuable than anything done outside. Therefore, when thinking cost/benefit, it's worthwhile to put most of your effort into optimizing the most deeply nested loops, and only worry about the rest if you're really being thorough. - Code Motion If an expression within a loop does not reference any variable whose value may change within the loop, then that expression is a "loop invariant" and can be moved before the loop and computed only once (substituting the var holding the result for its instance(s)). - Induction variables and strength reduction Any variable whose value systematically changes each time the loop iterates is called an induction variable (well...usually just those whose value increments or decrements by a constant), for example the variable i in for (i:=1, i<10, i++) {...} When induction variables are identified, strength reduction can often be done on expressions mentioning them. For example, the array reference a[i*j] would naively require a multiplication, but if we know that i is an induction variable and j is a loop invariant, this reference could be replaced by t := 0; loop: t := t + j; ... a[t]; - Identification of loops Loops are found by first doing basic block analysis, then flow analysis, and looking for dominators. BB d dominates n if every path from the top of the graph to n passes through d. The entry block of a loop dominates all other blocks in the loop. This is actually the definition of a loop; if there is no one dominator, there is no loop (even if the flow graph is cyclic). Runtime Organization:: Hand-in-glove with the compiler is the runtime environment; generated code creates some of it (e.g. stack frames) and interfaces with the services it provides (e.g. heap management). The responsibilities of the runtime are: - static and dynamic storage organization - flow of control mechanisms - library routines Static allocation: Constants, etc. can be laid out in some fixed area of storage and refered to with absolute addresses or offsets to some register that points to the head of the static area. Dynamic allocation: Stacks: Just about every language uses a runtime stack, if for no other reason, to record the chain of procedure calls, however it's also a convenient place to allocate dynamic local storage, making recursion possible. The stack usually grows and shrinks linearly from one end of the address space, but in languages with bizarre control structures it's possible for stack frames to be heap allocated and garbage collected. - an invocation of a procedure results in the creation of an Activation Record on that stack, also known as a Call Frame. The dragon book identifies these common fields: - temporary values (undeclared local variables) - local variables (scalar values, arrays maybe up to some limit) - saved status (contents of registers, etc. when proc was called; this status must be restored upon return; sometimes a proc is required to save its own status before calling, rather than the status of its caller) - access link (similar to the the "display" of pascal; provides access to nonlocal data in other activation records; not required for fortran, but something like it is required for any language with nested procedures, or dynamic scoping) - control link (link to caller) - actual parameters (argument values or whatever; may be passed in registers rather than call frame) - return value (often returned in a register) A Display is an array of pointers to activation records. Conceptually, if procedure p is at nesting depth j, then elements 1 though j-1 of display d point to the most recent activations of the procedures enclosing p. The size of display needed can be precomputed and statically allocated (the maximum nesting depth). Then the display pointers can be kept up to date as activation records are built and popped. The display provides faster access to non-local storage (2 pointers to any frame, rather than up to j-1). Dynamic binding: Deep Access uses control links (rather than access links, since they are the same in this case) to search for the most recent binding: it may go deep. Shallow Access always keeps the current value accessible from some fixed place in static storage. Heap: Any language that allows data objects to be explicitly allocated and deallocated independent of procedure invocations needs some kind of runtime heap. Some languages like Lisp provide heap storage transparantly, and automatically arrange for recovery of unreachable storage through garbage collection. Others, like C, require the programmer to explicitly allocate & deallocate. Garbage collection is a complex subject. The possibilities include variations on the general strategies Mark and Copy (heap space is divided into two; mark all live objects then copy to head of second space when first is exhausted) and Reference Counting (when count goes to 0, return to heap). CODE GENERATION AND RUNTIME SYSTEMS II ============================ In today's RISC world optimization is critical for a compiler, especially things like loop optimization, register allocation and instruction scheduling that can reduce run time by hefty constant factors. A serious optimizing compiler will probably be structured so that optimizations are as independent of one another, and the front-end and back-end, as possible. This allows each optimization to be tweaked individually without breaking the others, and the optimization core to be reused for different source languages and target machines. Also, as memory is cheap and computers are faster, we're more willing to let a compiler be a monster program that grinds through multiple phases. Data Flow Analysis:: The key to a lot of optimizations is correct understanding of control and data flow dependencies, allowing instructions to be reordered, combined & eliminated. In general, dependency relations within a program are often represented as a DAG. Data flow info can be calculated globally, on a per-statement basis (expensive). More practically, it can be calculated locally (for a single procedure). For some purposes (eg. register allocation) basic-block granularity is sufficient while for others (eg. instruction scheduling) instruction granularity is necessary. Control Flow Graph: Is not too hard to figure out. Once TAC has been segmented into basic blocks, a graph can be built out of the BBs by connecting them with a directed edge whenever there is a branch (or sequential fall-through) connecting two of them. That graph then encodes the control flow, and can be used to find loops (by looking for dominators among the BBs), and to calculate data flow. Gen, Kill and Point: - Gen: a definition (assigment) of a variable. The gen is the statement (instruction). - Kill: a re-definition; all definitions that reached the point before the redefinition are said to be killed. - Point: a place immediately before or after an instruction. Reaching Definitions: A definition of a variable is an instruction that sets it (an assignment), or that might set it (a procedure call, an array or pointer assignment or a possible alias that can't be analyzed more accurately). A definition is said to reach a point in a program if there is a path from the definition to the point without an unambiguous kill of the definition. Note that multiple definitions of a variable may reach the same point if there are branches in control flow, or ambiguous assignments. Note on ambiguous assignments: when the location identified by a pointer is assigned to, as in the C statement *p := v; unless we can prove otherwise, we must make the conservative assumption that any (every) variable is assigned to. Similarly, when an array assignment is made, the conservative assumption is that every element has been changed. When another procedure is called, every bit of memory that it can access... Hence, "we allow paths that may never be traversed in any execution of the progrm, and we allow definitions to pass through ambiguous definitions of the same variable." (p 611) Suppose, for simplicity, that a control flow graph has been constructed for a program such that each instruction is a node. Moreover, every instruction that sets a variable is given a unique number, identifying it as a definition. Reaching definitions can then be determined by calculating the gen, kill, in and out sets for every statement in the graph. Each of these sets is a set of definition numbers. The sets are calculated according to the following equations. I. For an individual statement S consisting of definition d, that defines a variable "a", gen[S] = {d} kill[S] = D_a - {d} out[S] = gen[S] U (in[S] - kill[S]) Note that d kills _all_ other definitions of variable a. II. For a compound S consisting of two sequential statements S1 and S2, gen[S] = gen[S2] U (gen[S1] - kill[S2]) kill[S] = kill[S2] U (kill[S1] - gen[S2]) in[S1] = in[S] in[S2] = out[S1] out[S] = out[S2] III. For a compound S consisting of two parallel statements gen[S] = gen[S1] U gen[S2] kill[S] = kill[S1] int kill[S2] in[S1] = in[S] in[S2] = in[S] out[S] = out[S1] U out[S2] IV. For S consisting of S1 looping back into itself, gen[S] = gen[S1] kill[S] = kill[S1] in[S1] = in[S] U out[S1] out[S] = out[S1] Note that these are inductive defintions based on control flow; actual solution may require iteration (especially for loops) until contents of sets become stable. Also, they reflect conservative assumptions (ie. they don't worry about when branches are taken). Algorithm: depends on representation. Basically, gen and kill are calculated bottom up (from individual statements), while in and out are calculated top down. A simple method: - construct a basic block control flow graph - calculate gen and kill for each block using rules I and II above. - iterate over all blocks, until no change happens, setting in and out by the rules: in[B] = out[P] out[B] = gen[B] U (in[B] - kill[B]) Use-Definition (UD) chains: A list for each use of a variable of all definitions of that variable that reach that use. Available Expressions: An expression x+y is available at point p if every path leading to p computes x+y (presumably storing the value somewhere) and does not assign to either x or y, unless it then recomputes x+y. AE can be calculated analygous to reaching defs. Live Variables: Given a variable v and a point p, could the value of v be used at some point subsequent to p? If so, v is live at p. (Needed for register allocation.) Can be computed similarly to reaching defs, by altering a couple of details. Let in[B] be all variables live at the entry to B, and out[B] be all variables live at the exit of B. Let def[B] be the variables definitely assigned a value before any use in B, and let use[B] be the variables that may be used before being defined in B. in[B] = use[B] U (out[B] - def[B]) out[B] = in[S] Graph Coloring Register Allocation:: Simple in principle, complex in detail. Begin by determining the lifetime of values. In low-level TAC, the relevant values are variables (declared and temporaries). The lifetimes are determined by live-variable analysis at the appropriate scope and granularity. Since the algorithm is at least O(N^2), analysis per procedure, allocating registers at the basic block granularity is the usual method. Each variable thus has a live range. - Construct a graph in which the nodes are variable live ranges, and there is an edge between two nodes iff their ranges intersect (temporily). This is an Interference Graph. - "Color" the graph, assigning a different color to any two nodes connected by an edge, using the fewest colors possible. - If you used fewer than R colors, where R is the number of general purpose registers available, great! go ahead and allocate the corresponding register to each live range. - Otherwise, allocate as many registers as you can, giving priority to live ranges in deeply nested loops, and which contain a lot of references (as opposed to those which are merely long but don't represent a lot of use), then prepare to spill the rest (ie. shuffle variable values in and out of memory before and after use & def). - If you're aggressive, you can try to split live ranges to reduce their interference and get a graph that can be colored with fewer colors, though it will take more time to do so. Backend Generators:: Why not reuse your compiler for lots of target machines? Why not make the code generator table driven, much like the parser on the front end? This is the inspiration for backend generators. The plan is to supply a description of the target instruction set and the intermediate language, and allow a general purpose tool to kick out the code generator, and maybe instruction scheduling and register allocation as well, since these two optimizations are often performed on the (semi-) final code. Unfortunately, real machine architectures can't be captured by any theory as elegant as Context Free Grammars, and all of the tools like this seem to be full of semi-special-purpose mechanisms that don't necessarily generalize to the next new machine. Nonetheless, this approach has enjoyed some success. The GNU C compiler includes a backend generator that has been used to port it to many architectures. Runtime Organization:: Garbage collection: see previous section. Object representation: I have no idea. Use some kind of structure with pointers to closures implementing the methods, I suppose. I think this is how CLOS (common lisp object system) does it. Closures: You've got to have code (easy) and environment, which is a little harder, and it all depends on the language. Bindings might be stored with the closure (if their values are fixed at closure creation time), or you might have to use pointers to whereever the bindings are (in the stack, or elsewhere).