CSE 131A -- Fall 2000
Compiler Project Part II -- Syntax Analysis
Due: Midnight on Sunday, 11/12/00
The Assignment
In this assignment we will be interfacing our lexical
analyzer from the last project with a syntax analyzer
(better known as a parser).
The parser will be built using a parser builder, either
yacc (for C++ users) or CUP (for Java users).
(bison, a GNU version of yacc, is allowable too -- for
our purposes it is virtually identical to yacc).
Specifically, your tasks are:
-
Run your Oberon grammar file (supplied by us) through
yacc/CUP to
produce a parser.
-
Interface your parser with your lexer, producing a program that will
accept syntactically legal Oberon programs (with
the additions specified below) and reject illegal programs.
-
Add error recovery actions to your parser to handle specific
illegal inputs, as specified below.
-
Add productions to parse the main statements of an Oberon program.
-
Output prototype information (parameters, return type) for procedures.
-
Add productions to parse I/O statements.
1.1. Grammar Modifications
The grammar file that we will supply to you (oberon.y
or oberon.cup)
is taken from an Oberon language specification document.
It is essentially a complete grammar, but it may require
some modifications in order to be completely compatible with
the parser builders.
You must rewrite the offending rules to remove the conflicts.
For full credit, no grammar errors (conflicts) must be reported
when your grammar file is run through yacc/CUP.
Note that the yacc constructs
%left,
%right, and
%prec are not allowed in this assignment.
1.2 Parser/Lexer Interface
1.2.1 C/C++
yacc produces a C file (compatible with C++) called
y.tab.c, which contains a single function yyparse(),
which is the parser itself.
yyparse() expects the lexer function to be named
yylex().
Your task is to extend modify your lexer so that it properly transmits
token types and values to the parser.
You will need to change your main() as well,
replacing the instantiation of the LexerTester() object
with (most likely) the instantiation of a parser object.
The LexerTester object can be removed, it will not be used
in this or any future project.
1.2.2 Java
CUP outputs a file called parser.java, containing the
declaration for the parser class.
The parser expects the lexer function to be called scan().
It is probably best to derive a new class that extends parser
and redefined scan() to call your lexer GetToken().
Your application class (specifically, main()) will
likely want to instantiate an instance of your parser class
and then call parser.parse().
The LexerTester object can be removed, it will not be used
in this or any future project.
1.3 Parser Output
Once the above is done, your
program should now accept legal Oberon
programs and reject illegal programs.
If the input is a legal Oberon program, your program should print
(to standard output):
Parsing: success.
and if the input is illegal, it should print
Parsing: failure.
Note that the latter message should be printed even if your parser
correctly recovers from errors and finishes.
In addition, your parser should output the error messages
specified in section 1.4.
1.4 Error Messages/Error Recovery
Normally, a basic parser will stop processing
when a syntax error is detected.
If this occurs in your program, you should print:
Error: "file", line N: syntax error near "xxx".
where xxx is the last lexeme returned by the lexer.
However, this is usually not
desired -- rather, we would like a parser to print a
more descriptive error
message, and continue processing.
So, for this assignment,
we will add some error recovery.
Specifically, you must modify your parser so that it
prints an appropriate error message, and continues to parse,
for each of the following errors:
-
Constant declarations with a missing T_EQU token.
The error message shall be:
Error: "file", line N: missing := in constant declaration.
-
IF statements with a missing THEN keyword.
(only concern yourself with the initial THEN in the IF part,
not the THENs in the ELSIF parts.)
The error message shall be:
Error: "file", line N: missing THEN in IF statement.
-
Variable declarations that are missing a single comma.
The error message shall be:
Error: "file", line N: missing comma in identifier list.
-
Procedure declarations that contain a return specifier,
but one that is missing the colon.
The error message shall be:
Error: "file", line N: missing : in procedure declaration.
Remember that error recovery means printing the appropriate
error message (and NOT the default "syntax error" message),
and continuing to parse (looking for more errors).
1.5 Main Statements Parsing
The current Oberon grammar specifies that an entire program consist
of one module (a module is vaguely like a class).
There is no provision for multiple modules, or
non-module procedures, or even
a main program.
Thus, we wish to extend the Oberon language to allow for this.
In other words, you are to add rules to the grammar to allow inputs
specified as follows.
A legal Oberon input should consist of a
program.
A program should consist of the string PROGRAM,
followed by an identifier (the program name), followed by an optional
declarations section
followed by the
main statements section
The
declarations section
is optional and consists of zero or more declarations.
A declaration is either a
constant declaration,
variable declaration,
a module declaration, a type declaration, or a
procedure declaration (including forward declarations).
These declarations can appear in any order within a declarations
section.
Note that all of the declarations currently end in a semicolon except
the module declaration, which ends in a dot.
This convention should be kept in this new declarations section.
Following the declarations section is the
main statements section.
This section consists of the
keyword BEGIN,
followed by zero or more statements (as currently found in the
grammar), followed by the END keyword, followed by a dot.
Nothing can follow the main section in a legal input.
Note that the string PROGRAM should now be treated as a keyword
(it can no longer be an identifier -- your lexer must be modified to
accomodate this requirement).
1.6 Procedure Prototype Output
If the -p option is found on the command line, we wish to
print (to standard output) information about procedures that are
parsed.
The -p option can be either before or after the input file on the command line.
for each procedure declaration found,
you are to output the following
information:
-
The procedure name.
If the procedure is declared inside a module, output the
module name, followed by a dot, followed by a procedure name
(like ModuleName.ProcedureName), otherwise output just the
procedure name.
Pad (left-justify) the output to 20 spaces.
-
The current file name (same format as for error messages),
followed by a space, followed by the current line number in
parentheses.
Pad to 20 spaces.
-
The parameter types (NOT the names), separated by commas.
If the procedure takes no parameters, output the string
(none).
There should be a single space following each comma.
-
output an arrow (->),
followed by the procedure's return type.
If the procedure returns nothing, output the string
(nothing).
There should be a single space on either side of the arrow.
Here is an example output:
Compute "testfile" (5) INTEGER, INTEGER -> BOOLEAN
Stack.Push "testfile" (20) Object -> (nothing)
Stack.Pop "testfile" (32) (none) -> Object
1.7 I/O Statements
Some languages define input/output statements as part of the grammar
(for example, Pascal), while others leave them to external libraries
(C++).
The Oberon reference does not mention I/O in detail and the original
grammar does not support it.
Therefore, we wish to extend the grammar to be able to parse input/output
statements.
The following specify the syntax for input and output
statements.
We are patterning our syntax after C++-style streams.
We will require three new keywords for these statements, these being
CIN, COUT, and NL (which stands for newline),
and two new punctuation tokens, which we will call
T_LTLT
(for the two-character sequence <<)
and
T_GTGT
(for >>).
Both of these constructs are statements, which mean they should be valid
anywhere other Oberon statements are valid.
The input statement consists of the T_CIN keyword, followed by the
T_GTGT token, followed by a designator (the designator is as currently
defined in the grammar).
The output statement consists of the T_COUT keyword, followed by a list of
output pairs,
each pair consisting of the T_LTLT token followed by either an
expression (from the grammar) or the T_NL keyword.
At least one such pair must follow T_COUT.
Examples:
CIN >> x
COUT << x << y
COUT << "hello, " << strName << NL
COUT << NL
COUT << "x + y = " << NL << x + y