CSE 131A -- Fall 2000
Compiler Project Part II -- Syntax Analysis
Due: Midnight on Sunday, 11/12/00

The Assignment

In this assignment we will be interfacing our lexical analyzer from the last project with a syntax analyzer (better known as a parser). The parser will be built using a parser builder, either yacc (for C++ users) or CUP (for Java users). (bison, a GNU version of yacc, is allowable too -- for our purposes it is virtually identical to yacc). Specifically, your tasks are:

1.1. Grammar Modifications

The grammar file that we will supply to you (oberon.y or oberon.cup) is taken from an Oberon language specification document. It is essentially a complete grammar, but it may require some modifications in order to be completely compatible with the parser builders. You must rewrite the offending rules to remove the conflicts. For full credit, no grammar errors (conflicts) must be reported when your grammar file is run through yacc/CUP. Note that the yacc constructs %left, %right, and %prec are not allowed in this assignment.

1.2 Parser/Lexer Interface

1.2.1 C/C++

yacc produces a C file (compatible with C++) called y.tab.c, which contains a single function yyparse(), which is the parser itself. yyparse() expects the lexer function to be named yylex(). Your task is to extend modify your lexer so that it properly transmits token types and values to the parser. You will need to change your main() as well, replacing the instantiation of the LexerTester() object with (most likely) the instantiation of a parser object. The LexerTester object can be removed, it will not be used in this or any future project.

1.2.2 Java

CUP outputs a file called parser.java, containing the declaration for the parser class. The parser expects the lexer function to be called scan(). It is probably best to derive a new class that extends parser and redefined scan() to call your lexer GetToken(). Your application class (specifically, main()) will likely want to instantiate an instance of your parser class and then call parser.parse(). The LexerTester object can be removed, it will not be used in this or any future project.

1.3 Parser Output

Once the above is done, your program should now accept legal Oberon programs and reject illegal programs. If the input is a legal Oberon program, your program should print (to standard output):

    Parsing: success.
and if the input is illegal, it should print

    Parsing: failure.
Note that the latter message should be printed even if your parser correctly recovers from errors and finishes. In addition, your parser should output the error messages specified in section 1.4.

1.4 Error Messages/Error Recovery

Normally, a basic parser will stop processing when a syntax error is detected. If this occurs in your program, you should print:

    Error: "file", line N: syntax error near "xxx".
where xxx is the last lexeme returned by the lexer. However, this is usually not desired -- rather, we would like a parser to print a more descriptive error message, and continue processing. So, for this assignment, we will add some error recovery. Specifically, you must modify your parser so that it prints an appropriate error message, and continues to parse, for each of the following errors: Remember that error recovery means printing the appropriate error message (and NOT the default "syntax error" message), and continuing to parse (looking for more errors).

1.5 Main Statements Parsing

The current Oberon grammar specifies that an entire program consist of one module (a module is vaguely like a class). There is no provision for multiple modules, or non-module procedures, or even a main program. Thus, we wish to extend the Oberon language to allow for this. In other words, you are to add rules to the grammar to allow inputs specified as follows.

A legal Oberon input should consist of a program. A program should consist of the string PROGRAM, followed by an identifier (the program name), followed by an optional declarations section followed by the main statements section The declarations section is optional and consists of zero or more declarations. A declaration is either a constant declaration, variable declaration, a module declaration, a type declaration, or a procedure declaration (including forward declarations). These declarations can appear in any order within a declarations section. Note that all of the declarations currently end in a semicolon except the module declaration, which ends in a dot. This convention should be kept in this new declarations section.

Following the declarations section is the main statements section. This section consists of the keyword BEGIN, followed by zero or more statements (as currently found in the grammar), followed by the END keyword, followed by a dot. Nothing can follow the main section in a legal input. Note that the string PROGRAM should now be treated as a keyword (it can no longer be an identifier -- your lexer must be modified to accomodate this requirement).

1.6 Procedure Prototype Output

If the -p option is found on the command line, we wish to print (to standard output) information about procedures that are parsed. The -p option can be either before or after the input file on the command line. for each procedure declaration found, you are to output the following information:


Here is an example output:

Compute             "testfile" (5)      INTEGER, INTEGER -> BOOLEAN
Stack.Push          "testfile" (20)     Object -> (nothing)
Stack.Pop           "testfile" (32)     (none) -> Object


1.7 I/O Statements

Some languages define input/output statements as part of the grammar (for example, Pascal), while others leave them to external libraries (C++). The Oberon reference does not mention I/O in detail and the original grammar does not support it. Therefore, we wish to extend the grammar to be able to parse input/output statements.

The following specify the syntax for input and output statements. We are patterning our syntax after C++-style streams. We will require three new keywords for these statements, these being CIN, COUT, and NL (which stands for newline), and two new punctuation tokens, which we will call T_LTLT (for the two-character sequence <<) and T_GTGT (for >>). Both of these constructs are statements, which mean they should be valid anywhere other Oberon statements are valid.

The input statement consists of the T_CIN keyword, followed by the T_GTGT token, followed by a designator (the designator is as currently defined in the grammar). The output statement consists of the T_COUT keyword, followed by a list of output pairs, each pair consisting of the T_LTLT token followed by either an expression (from the grammar) or the T_NL keyword. At least one such pair must follow T_COUT.

Examples:

CIN >> x

COUT << x << y
COUT << "hello, " << strName << NL
COUT << NL
COUT << "x + y = " << NL << x + y