CSE 131A -- Fall 2000
Compiler Project #1 -- Lexical Analysis
Due: Midnight on Sunday, October 15, 2000

For this assignment, we will be building the lexical analysis component of a compiler, often referred to as a lexer. A lexer performs three basic functions:

  1. Recognize character sequences in the input stream that match token patterns (such sequences are called lexemes), and pass internal representations of lexemes (called tokens) on for later processing. Examples of tokens are keywords (such as IF, ELSE, RETURN, etc.), punctuation symbols (such as semicolons, periods, and commas, as well as two-character symbols like assignment ( := ) or greater-than-or-equal ( >= )), literals (such as 44, 7.0, or "Hello"), and user-defined identifiers. The Oberon Token Specification specifies the complete set of Oberon tokens and their patterns.
  2. Discard input characters that are not part of any lexeme. (In other words, ignore characters that we don't care about.) This includes comments, whitespace characters (spaces, tabs, newlines), and any illegal characters that may be read.
  3. Report any lexical errors encountered.
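
The three functions above can be sketched in C++ roughly as follows. This is an illustration only, not the required design: the names Kind, SketchToken, and scanAll are hypothetical, only a handful of token kinds are handled, keywords are not separated from identifiers, and comment skipping is left out. The real token codes come from the Oberon Token Specification and the provided files.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds, for illustration only.
enum class Kind { Ident, IntLit, Assign, GreaterEq, Greater, Colon, Semicolon };

struct SketchToken {
    Kind kind;
    std::string lexeme;
};

// Scans `input` and returns the tokens found. Whitespace is discarded, and
// each illegal character is counted in `errors` and then skipped.
std::vector<SketchToken> scanAll(const std::string& input, int& errors) {
    std::vector<SketchToken> out;
    errors = 0;
    std::size_t i = 0;
    while (i < input.size()) {
        unsigned char c = static_cast<unsigned char>(input[i]);
        if (std::isspace(c)) { ++i; continue; }   // function 2: discard
        std::size_t start = i;
        if (std::isalpha(c)) {
            // Identifier (a full lexer would also check for keywords here).
            while (i < input.size() &&
                   std::isalnum(static_cast<unsigned char>(input[i]))) ++i;
            out.push_back({Kind::Ident, input.substr(start, i - start)});
        } else if (std::isdigit(c)) {
            while (i < input.size() &&
                   std::isdigit(static_cast<unsigned char>(input[i]))) ++i;
            out.push_back({Kind::IntLit, input.substr(start, i - start)});
        } else if (c == ':' || c == '>') {
            // One character of lookahead separates ":" from ":=" and ">" from ">=".
            bool twoChar = (i + 1 < input.size() && input[i + 1] == '=');
            i += twoChar ? 2 : 1;
            Kind k = (c == ':') ? (twoChar ? Kind::Assign : Kind::Colon)
                                : (twoChar ? Kind::GreaterEq : Kind::Greater);
            out.push_back({k, input.substr(start, i - start)});
        } else if (c == ';') {
            ++i;
            out.push_back({Kind::Semicolon, ";"});
        } else {
            ++errors;   // function 3: a real lexer would print an error here
            ++i;        // then skip the illegal character
        }
        assert(i > start);   // every branch must consume at least one character
    }
    return out;
}
```

For the input `x := 44 >= y;  @`, this sketch yields six tokens and counts one illegal character. Note that it scans a whole string at once, whereas the assignment asks for one token per call to GetToken().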

Specification

The lexer must be implemented as a class called Lexer which contains a public method called GetToken(). This method returns an object of type Token.

C/C++

The specification for GetToken is:

Token* GetToken ()

The return value is either a pointer to a Token structure, which contains a code (token type) and a lexeme field, or NULL if the end of file has been reached. The caller of GetToken() is responsible for freeing every Token pointer returned to it.
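
The caller's side of this contract might look like the sketch below. StubToken and StubLexer are hypothetical stand-ins, since the real Token structure comes from Token.hpp and the real Lexer is yours to write; the field names code and lexeme are assumed from the description above. The sketch also assumes tokens are allocated with new, so they are released with delete; if your lexer allocates differently, the release must match.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the structure in Token.hpp (field names are assumptions).
struct StubToken {
    int code;
    std::string lexeme;
};

// Stand-in lexer that hands out a fixed list of lexemes, then NULL.
class StubLexer {
public:
    explicit StubLexer(std::vector<std::string> lexemes)
        : lexemes_(std::move(lexemes)) {}

    StubToken* GetToken() {
        if (next_ >= lexemes_.size()) return NULL;   // end of input
        StubToken* t = new StubToken;
        t->code = 0;                                 // placeholder code
        t->lexeme = lexemes_[next_++];
        return t;
    }

private:
    std::vector<std::string> lexemes_;
    std::size_t next_ = 0;
};

// Reads tokens until GetToken() returns NULL, freeing each one.
// Returns the number of tokens seen.
std::size_t drainTokens(StubLexer& lexer) {
    std::size_t count = 0;
    while (StubToken* tok = lexer.GetToken()) {
        assert(tok->lexeme.size() > 0);   // a lexeme covers at least one character
        ++count;
        delete tok;                       // the caller owns the token
    }
    return count;
}
```

The loop condition doubles as the end-of-file test, so no token is leaked and NULL is never dereferenced.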

We provide three files -- Token.hpp, containing the Token structure definition, y.tab.h, containing the valid Token values to be returned, and LexerTester.cpp, a test program for your lexer. In addition to implementing the Lexer class, you should write a main program that instantiates a Lexer object, then instantiates a LexerTester object, and then calls the LexerTester's Run method.

Special note: for reasons that will be made clear in project 2, the Lexer object must be global (declared outside of main()).

Java

The specification for GetToken is:

Token GetToken ()

We provide three files -- sym.java, containing the valid token codes, Token.java, containing the Token class to be returned by GetToken(), and LexerTester.java, a test program for your lexer. You should implement a program class (say, Oberon), whose main() method instantiates a Lexer instance and then passes that object to a newly instantiated LexerTester instance (as shown in LexerTester.java).

Notes

Error Messages

All error messages should be printed to standard output, not to standard error. This bears repeating: all error messages should be printed to standard output, not standard error. All error messages must be one of the following two forms:

Error, "file", line N: <message>
Error, (stdin), line N: <message>

depending on whether the lexer is currently reading from a file or from standard input. Note that you must keep track of the line number (most likely, this would be a private member of the Lexer class).
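
One way to produce both forms from a single helper is sketched below. The function names formatError and reportError are hypothetical, as is the convention that an empty file name means the lexer is reading from standard input; line numbers are assumed to start at 1.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

// Builds one error line in the required format. An empty `file` is this
// sketch's (assumed) way of saying "reading from standard input".
std::string formatError(const std::string& file, int line,
                        const std::string& message) {
    assert(line >= 1);   // line numbers are assumed to start at 1
    std::ostringstream out;
    out << "Error, ";
    if (file.empty()) {
        out << "(stdin)";
    } else {
        out << '"' << file << '"';
    }
    out << ", line " << line << ": " << message;
    return out.str();
}

// Prints the error to standard output (std::cout), not std::cerr.
void reportError(const std::string& file, int line,
                 const std::string& message) {
    std::cout << formatError(file, line, message) << std::endl;
}
```

Keeping the formatting separate from the printing makes the exact output easy to check in a unit test before it is wired into the Lexer class.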

<message> is one of the following:

For the last three cases, both the INCLUDE and the offending token are discarded, and the lexer should continue looking for a valid token. Note that this means that the string INCLUDE can never be returned by the lexer as any token.

Turnin Instructions

Finalized turnin instructions will be posted at a later date. You will eventually turn in your source code (not your executables) to us electronically, and we will build and run your project. Therefore, it is very important that you adhere to the following.