CSE 230: Principles of Programming Languages
Notes for Chapter 2 of Sethi (syntax, grammar, expressions)

Sethi first defines context free grammars using the notation of Backus-Naur form (BNF), and later also discusses extended BNF (EBNF) and syntax charts. These three formalisms are equivalent, in the sense that exactly the same set of languages can be defined with any one of them as with the others. There are several other formalisms that are also equivalent to these, including context free grammars in the sense of Chomsky, which replace the symbol ::= by -> and do not use the | symbol. There are also kinds of grammar, including context sensitive grammars and Post systems, that are more expressive than context free grammars, in the sense that the set of languages that can be defined using them is strictly larger than the set that can be defined using context free grammars; and there are also formalisms that are less expressive than context free grammars, including regular expressions.

Perhaps it will help to be really formal about some of this, since Sethi is not very formal about BNF. So here we go:-

Mathematically, a language is a set of (finite) strings, where a string is a finite sequence of letters drawn from another set, usually called the alphabet of the language. Formally, we write A* for the set of all finite strings over the alphabet A, including the empty string, which we will usually write [] - but beware, because many other notations are used for it, including the Greek letter lambda, the Greek letter epsilon, and, in Sethi, <empty>.

For programming, context free languages are especially important, and as remarked above, BNF is just one kind of grammar for defining these languages. Here is Chomsky's original version: A context free grammar or CFG is a 4-tuple (N,T,P,S), where: N is a finite set of nonterminal symbols, which serve as names for the grammatical categories; T is a finite set of terminal symbols (the ones that will end up in the sentences of the language), such that T and N are disjoint; P is a finite set of productions or grammatical rules, each of which has the form

     N0 -> w0 N1 w1 N2 ... wn-1 Nn wn 
where each Ni is a non-terminal, and each wi is a string of terminals (possibly empty); and finally, S is a particular nonterminal, called the start symbol. The language defined by (N,T,P,S) is the set of all terminal strings that can be derived using the rules, starting from S. The formal definition of derivation is a little complex, and is given in the next paragraph. If G is a grammar, then its language is usually denoted L(G).
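
For concreteness, here is a minimal sketch in Haskell of how such a 4-tuple might be represented; the names Sym, CFG, prods and start are our own, chosen for this illustration, and the sets N and T are left implicit in the type parameters:

    -- Grammar symbols: either a nonterminal or a terminal.
    data Sym n t = NT n | TM t  deriving (Eq, Show)

    -- A CFG (N,T,P,S): we keep just the productions P and the
    -- start symbol S, since N and T are implicit in the types.
    data CFG n t = CFG { prods :: [(n, [Sym n t])], start :: n }

    -- Example: the productions E -> 0, E -> 1, E -> (E + E).
    g :: CFG Char Char
    g = CFG [ ('E', [TM '0'])
            , ('E', [TM '1'])
            , ('E', [TM '(', NT 'E', TM '+', NT 'E', TM ')']) ]
            'E'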

Given a CFG G = (N,T,P,S), let X be a nonterminal, p a production X -> r in P, and e a string over the symbols of G, that is, e in (N U T)*. Then we write

     e => e'  with p 
to indicate that e' results from replacing some single occurrence of X in e by r, and we may also omit "with p" when the above holds for some p in P. Next, we write
     e =*=> e' 
when e = e0 => e1 => ... => en = e' for some n >= 0. Finally, we can now define
     L(G) = { t | S =*=> t and t in T* } . 
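
Continuing the sketch above, a single derivation step e => e' can be computed by replacing one occurrence of a nonterminal; the function step below (again, a name of our own, for illustration only) returns all strings reachable from e in one step:

    import Data.List (inits, tails)

    -- One derivation step: replace a single occurrence of a
    -- nonterminal in e by the right-hand side of one of its rules.
    step :: Eq n => CFG n t -> [Sym n t] -> [[Sym n t]]
    step g e =
      [ pre ++ rhs ++ post
      | (pre, NT x : post) <- zip (inits e) (tails e)
      , (lhs, rhs) <- prods g
      , lhs == x ]

For example, step g [NT 'E'] yields the three right-hand sides of the grammar g above; iterating step from [NT (start g)] and collecting the all-terminal strings enumerates L(G).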

Regular Expressions were invented by the logician Stephen Kleene to describe a certain simple class of languages, the regular languages, which we now know are also those defined by regular grammars and accepted by finite state automata; the "*" operation in regular expressions is called the "Kleene star" in his honor. We can give the following BNF grammar for regular expressions:

    <rexp> ::= Ø
    <rexp> ::= (<rexp> <rexp>)
    <rexp> ::= (<rexp> + <rexp>)
    <rexp> ::= <rexp>*
    <rexp> ::= a                  for each a in A
where <rexp> is the nonterminal for regular expressions. In the first line, "Ø" is a Danish letter, standing for the empty set, as usual in mathematics. The second line has an implicit binary infix operator, usually called juxtaposition or concatenation. The last line is really an abbreviation for a finite number of lines, one for each "letter" (or terminal symbol) in the alphabet A. It is usual to leave out the parentheses when they can be inferred from the associativity of + or of juxtaposition, or by assuming that juxtaposition binds more tightly than +. For a simple example of a regular expression,
    (a b)(a b)* 
defines the language consisting of all finite nonempty strings of the form ab ab ... ab. Here is another example,
    (a* b)(a* b)* + (b* a)(b* a)* 
defining the language of nonempty strings of b's with some number (possibly zero) of a's in front of each b, and of nonempty strings of a's with some number (possibly zero) of b's in front of each a.
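
As a sketch of how this grammar might be rendered as a data type, here is a Haskell version over the alphabet of characters; the constructor names are our own:

    -- Abstract syntax of regular expressions, mirroring the BNF above.
    data RExp = Empty            -- Ø, the empty set
              | Cat  RExp RExp   -- juxtaposition (concatenation)
              | Plus RExp RExp   -- the + of the grammar
              | Star RExp        -- Kleene star
              | Letter Char      -- one case for each letter a in A
      deriving Show

    -- The first example above, (a b)(a b)*:
    ex1 :: RExp
    ex1 = Cat ab (Star ab)  where ab = Cat (Letter 'a') (Letter 'b')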

Many computer scientists' ideas about natural language seem to have been influenced by concepts like those above, which take a formal view along the lines of Noam Chomsky, rather than that of linguists who study what real people actually write and say. An important point is that in natural language, context determines whether or not something makes sense; formal syntax and semantics are very far from adequate, and indeed, the distinction among syntax, semantics and pragmatics does not hold up under close examination. On the other hand, the formal linguists' way of looking at syntax and semantics works well for programming languages, because we can define things to work that way, and because traditionally, programming language designers want to achieve as much independence from context as possible (though this might change for future computer systems!).

It follows that the above definition of "language" is not suitable for natural language, where context determines whether something is acceptable or not, and where even then, there may be degrees of acceptability, and these may vary among different subcommunities, and even individuals, who speak a given language; moreover, all of this changes (usually slowly) over time. For example, "yeah" is listed as a word in some dictionaries but not in others; perhaps it will gradually become more and more accepted, as have many other words over the course of time. At the level of syntax, English as spoken in black Harlem (a neighborhood of New York City) differs in some significant ways from "standard English" (if there is such a thing), in its lexicon, its syntax, and its pronunciation; this dialect has been studied under the name "Black English" by William Labov and others, who have shown that in some ways it is more coherent than so called "standard English".

It may be interesting to note that the first (known) grammar was given by Panini for Sanskrit, a sacred language of ancient India, more than 2,500 years ago. This work included very sophisticated components for phonetics, lexicon, and syntax, and was in many ways similar to modern context free grammars. The motivation for this work was to preserve the ancient sacred texts of Hinduism.

If we were to denote the empty set by "{}" and the empty string by nothing at all, then there would be no way to tell the difference between the empty set and the set containing just the empty string. So this is a bad idea, and it helps to explain why we use [] or maybe the Greek letter epsilon for the empty string, and Ø for the empty set. I am afraid you will have to get used to there being a lot of notational variation in mathematics, just as there is a lot of notational variation for programming languages; in fact, I am afraid that notational variation will be an ongoing part of the life of every computer science professional for the foreseeable future, so that developing a tolerance for it should be part of your professional training.

A favorite question of philosophers is "What is meaning?" and a favorite way to try to answer it is to provide a denotation for some class of sentences; this approach is called denotational semantics. As one simple example, we can define the "meanings" or denotations of regular expressions with a denotation function, which maps each expression to the language that it defines. If we let R be the set of regular expressions over a finite alphabet A, then a denotation for R is given by a function

   [[_]] : R -> P(A*) , 
where A* indicates the set of all finite strings over A and P(A*) indicates the "power set" of A*, i.e., the set of all subsets of A* (another notation uses exponentiation, so that 2 ** X indicates the set of all subsets of X). Then we can actually define this denotation function recursively as follows:
    [[Ø]]      =  Ø
    [[E E']]   = [[E]] º [[E']]
    [[E + E']] = [[E]] U [[E']]
    [[E*]]     = [[E]]*
    [[a]]      = {a}              for each a in A
where the operations º and * on sets of strings are defined as follows:
     L º L' = { uv | u in L  and  v in L' }  
     L* = { [] } U { u1 u2 ... un | ui in L, n > 0 }  
Using all these equations, we can compute the denotation of any regular expression, just by applying them recursively until all [[_]] brackets have been eliminated. Notice that this semantics is compositional, in the sense that the denotation of each compound expression is "composed from" (i.e., computed from) the denotations of its parts. For example,
     [[ (a b) (a b)* ]] = [[ (a b) ]] [[ (a b)* ]]
                        = ([[ a ]] [[ b ]]) [[ (a b) ]]* 
                        = ({ a } { b }) ({ a } { b })*
                        = { ab } { ab }* 
                         = { (ab)^n | n > 0 } 
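
Since the language of a regular expression is usually infinite, a program can only compute a finite approximation of this denotation function. Here is a sketch in Haskell, reusing the RExp type and the example ex1 from above (the name denote is ours), which computes all strings of length at most k in the denotation:

    import Data.List (nub)

    denote :: Int -> RExp -> [String]
    denote k r = nub (filter ((<= k) . length) (go r))
      where
        go Empty       = []
        go (Letter a)  = [[a]]
        go (Plus e e') = go e ++ go e'
        go (Cat e e')  = [ u ++ v | u <- go e, v <- go e'
                         , length u + length v <= k ]
        -- for the star, concatenate at most k factors, since each
        -- nonempty factor contributes at least one letter
        go (Star e)    = concat (take (k + 1) (iterate extend [""]))
          where extend ws = [ u ++ v | u <- go e, v <- ws
                            , length u + length v <= k ]

For instance, denote 4 ex1 gives ["ab","abab"], the cut-off at length 4 of the language { (ab)^n | n > 0 } computed above.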

Although there are clever ways to give a compositional semantics the property that the meaning of a part depends on the context within which it occurs, that is, on the other parts of the whole that contains it, a mathematical semantics for a programming language is not going to give us what we usually intuitively think of as the meaning of programs written in it. For example, the fact that a certain FORTRAN program computes a certain formula does not tell us that it gives the estimated yield of a certain variety of corn under various weather conditions, let alone that this figure is only accurate under certain assumptions about the soil, and even then it is only accurate to within about 5 percent. Yet all this is an important part of the meaning of the program.

For another example, just a little more complex, here is a context free grammar G for a very simple class of expressions:

    E -> 0
    E -> 1
    E -> (E + E)
    E -> (E * E) 
where N = {E} with E as the start symbol, and T = {0, 1, (, ), +, *}. We can define a denotation function
   [[_]] : L(G) -> NAT 
for these expressions, where NAT is the set of natural numbers, with the following:
    [[ 0 ]] = 0
    [[ 1 ]] = 1
    [[ (E + E') ]] = [[ E ]] + [[ E' ]]
    [[ (E * E') ]] = [[ E ]] * [[ E' ]]
From this, we can compute, for example, that
    [[ ((1 + 1) * (1 + 1)) ]] = 4  
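
Here is this denotation function as a Haskell sketch, with an abstract syntax for the expressions (the names Exp and eval are ours):

    -- Abstract syntax for the grammar E -> 0 | 1 | (E + E) | (E * E).
    data Exp = Zero | One | Add Exp Exp | Mul Exp Exp  deriving Show

    -- The denotation function [[_]] : L(G) -> NAT.
    eval :: Exp -> Integer
    eval Zero      = 0
    eval One       = 1
    eval (Add e f) = eval e + eval f
    eval (Mul e f) = eval e * eval f

    -- [[ ((1 + 1) * (1 + 1)) ]]:
    -- eval (Mul (Add One One) (Add One One))  ==  4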

Attribute grammars were invented by Don Knuth. They provide another way to give meanings to expressions. Although they can be pretty difficult to understand in general, the special case of grammars with just "synthesized attributes" is much easier: it can be explained without too much notation, and illustrated rather simply. For example, we can define the values of expressions over the above grammar. But first, it is convenient to add to the grammar one non-terminal, S, so that now N = {S, E}, and one more production, S -> E, with S as the new start symbol. The attribute grammar has just one (synthesized) attribute, val, and associates the following equations with the productions of the grammar:

    S  -> E            S.val  = E.val
    E  -> 0            E.val  = 0
    E  -> 1            E.val  = 1
    E1 -> (E2 + E3)    E1.val = E2.val + E3.val
    E1 -> (E2 * E3)    E1.val = E2.val * E3.val 
Then a parse tree for the expression (1 + 1) * (1 + 0) looks as follows
              S
              |
              E
            / | \
           /  |  \
          /   *   \
         /         \
        E           E
      / | \       / | \
     E  +  E     E  +  E
     |     |     |     |
     1     1     1     0  
where for simplicity the parentheses are left out. Then the synthesized attribute will percolate up the tree, by applying the equations from the bottom up, producing the values that are shown here
              2
              |
              2
            / | \
           /  |  \
          /   *   \
         /         \
        2           1
      / | \       / | \
     1  +  1     1  +  0  
where we see that the final value is 2.
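
The percolation can be made explicit in a little Haskell sketch that decorates every node of the parse tree with its val attribute, as in the second picture; it reuses the Exp type above, and the names Decorated, decorate and dec are ours:

    -- A tree of attribute values, one for each node of the parse tree.
    data Decorated = Dec Integer [Decorated]  deriving Show

    decorate :: Exp -> Decorated
    decorate Zero      = Dec 0 []
    decorate One       = Dec 1 []
    decorate (Add e f) = dec (+) e f   -- E1.val = E2.val + E3.val
    decorate (Mul e f) = dec (*) e f   -- E1.val = E2.val * E3.val

    dec :: (Integer -> Integer -> Integer) -> Exp -> Exp -> Decorated
    dec op e f = Dec (v `op` w) [l, r]
      where l@(Dec v _) = decorate e
            r@(Dec w _) = decorate f

Then decorate (Mul (Add One One) (Add One Zero)) reproduces the decorated tree above, with 2 at the root.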

For another illustration, we can do the semantics of binary numerals; the grammar here is

    B -> D
    B -> DB
    D -> 0
    D -> 1 
and the equations associated with these four rules are (with each rule repeated at the left, for readability)
    B  -> D       B.pos  = 0              B.val  = D.val
    B1 -> D B2    B1.pos = B2.pos + 1     B1.val = D.val * (2 ** B1.pos) + B2.val
    D  -> 0       D.val  = 0
    D  -> 1       D.val  = 1 
It is now a good exercise to compute the value of the binary number 1010 by first writing its parse tree and then computing how the values of the two (synthesized) attributes percolate up the tree.
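
For checking your answer, here is a Haskell sketch that computes the two synthesized attributes in one bottom-up pass over the parse tree; the names B, D and attrs are ours:

    data D = O | I               -- D -> 0  |  D -> 1
    data B = Last D | More D B   -- B -> D  |  B -> D B

    dval :: D -> Integer
    dval O = 0
    dval I = 1

    -- attrs returns the pair (B.pos, B.val).
    attrs :: B -> (Integer, Integer)
    attrs (Last d)   = (0, dval d)                        -- B.pos = 0, B.val = D.val
    attrs (More d b) = (p + 1, dval d * 2 ^ (p + 1) + v)  -- B1.pos = B2.pos + 1, etc.
      where (p, v) = attrs b

    -- the binary number 1010:
    -- attrs (More I (More O (More I (Last O))))  ==  (3, 10)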

We will see later that the parse trees of a context free grammar form a very nice little algebra, such that the values of synthesized attributes are given by a (unique) homomorphism into another algebra of all possible values.

