Oberon Tokens Specification

This document specifies the set of tokens that our lexer must recognize. All token values are defined in either y.tab.h or Token.java.

Oberon Keywords

The following is the complete list of Oberon reserved words (also known as keywords). For each of these, the GetToken value is the name of the keyword prefixed with T_, as shown below. Note that keywords are case-sensitive: only the all-uppercase spellings qualify as keywords.
    ARRAY         (T_ARRAY)
    BEGIN         (T_BEGIN)
    BY            (T_BY)
    CASE          (etc.)
    CONST
    DIV
    DO
    ELSE
    ELSIF
    END
    EXIT
    FOR
    IF
    IMPORT
    IN
    IS
    LOOP
    MOD
    MODULE
    NIL
    OF
    OR
    POINTER
    PROCEDURE
    RECORD
    REPEAT
    RETURN
    THEN
    TO
    TYPE
    UNTIL
    VAR
    WHILE
    WITH
Oberon calls the following "predeclared identifiers", but for our purposes they are equivalent to keywords:
    BOOLEAN
    CHAR
    FALSE
    INTEGER
    NEW
    REAL
    TRUE
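
As a rough illustration of the rule above, the following Java sketch classifies a word that has already been scanned as letters and digits. The class name KeywordTable and the method tokenNameFor are hypothetical, and token names are returned as strings because the actual values live in Token.java. It also assumes that the predeclared identifiers receive the same T_ prefix as the reserved words.

```java
import java.util.Set;

final class KeywordTable {
    // Reserved words followed by the predeclared identifiers listed above.
    private static final Set<String> KEYWORDS = Set.of(
        "ARRAY", "BEGIN", "BY", "CASE", "CONST", "DIV", "DO", "ELSE", "ELSIF",
        "END", "EXIT", "FOR", "IF", "IMPORT", "IN", "IS", "LOOP", "MOD",
        "MODULE", "NIL", "OF", "OR", "POINTER", "PROCEDURE", "RECORD",
        "REPEAT", "RETURN", "THEN", "TO", "TYPE", "UNTIL", "VAR", "WHILE",
        "WITH",
        "BOOLEAN", "CHAR", "FALSE", "INTEGER", "NEW", "REAL", "TRUE");

    // The lookup is case-sensitive, so "begin" is an ordinary identifier.
    // Assumption: predeclared identifiers get the T_ prefix too.
    static String tokenNameFor(String word) {
        return KEYWORDS.contains(word) ? "T_" + word : "T_ID";
    }
}
```

Doing the lookup after the word has been scanned keeps the keyword list in one place and makes the case-sensitivity rule a plain string comparison.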

Punctuation Tokens

The following are the punctuation symbols that our lexer must recognize, each followed by its token value. Any printable character that does not appear in this list (such as @ or $) is considered an illegal character.

    &     T_AMPERSAND
    ^     T_ARROW
    :=    T_ASSIGN
    |     T_BAR
    :     T_COLON
    ,     T_COMMA
    ..    T_DOTDOT
    .     T_DOT
    =     T_EQU
    >     T_GT
    >=    T_GTE
    {     T_LBRACE
    [     T_LBRACKET
    (     T_LPAREN
    <     T_LT
    <=    T_LTE
    -     T_MINUS
    #     T_NEQ
    +     T_PLUS
    }     T_RBRACE
    ]     T_RBRACKET
    )     T_RPAREN
    ;     T_SEMI
    ~     T_TILDE
    /     T_SLASH
    *     T_STAR
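
Several symbols in the table are prefixes of others (: and :=, . and .., > and >=, < and <=). The spec does not say how to resolve this. The Java sketch below assumes the usual longest-match rule and tries the two-character symbols first. The class name PunctuationScanner and the method match are hypothetical, and token names are returned as strings rather than the values from Token.java.

```java
final class PunctuationScanner {
    // Returns the token name of the punctuation symbol starting at index pos,
    // or null if the character there is not in the table (an illegal
    // character, or the start of some other kind of token).
    static String match(String src, int pos) {
        if (pos >= src.length()) {
            return null;
        }
        // Assumption: longest match wins, so ":=" is never split into ":" "=".
        if (pos + 1 < src.length()) {
            switch (src.substring(pos, pos + 2)) {
                case ":=": return "T_ASSIGN";
                case "..": return "T_DOTDOT";
                case ">=": return "T_GTE";
                case "<=": return "T_LTE";
                default: break;
            }
        }
        switch (src.charAt(pos)) {
            case '&': return "T_AMPERSAND";
            case '^': return "T_ARROW";
            case '|': return "T_BAR";
            case ':': return "T_COLON";
            case ',': return "T_COMMA";
            case '.': return "T_DOT";
            case '=': return "T_EQU";
            case '>': return "T_GT";
            case '{': return "T_LBRACE";
            case '[': return "T_LBRACKET";
            case '(': return "T_LPAREN";
            case '<': return "T_LT";
            case '-': return "T_MINUS";
            case '#': return "T_NEQ";
            case '+': return "T_PLUS";
            case '}': return "T_RBRACE";
            case ']': return "T_RBRACKET";
            case ')': return "T_RPAREN";
            case ';': return "T_SEMI";
            case '~': return "T_TILDE";
            case '/': return "T_SLASH";
            case '*': return "T_STAR";
            default:  return null;
        }
    }
}
```

A caller would advance by two characters after T_ASSIGN, T_DOTDOT, T_GTE, or T_LTE, and by one character otherwise.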

Special Tokens

The following are the five special tokens -- their patterns are described below.

T_ID

A sequence of up to 40 letters and digits, starting with a letter. (Unlike many other languages, the underscore character cannot be part of an identifier and is in fact an illegal character.)



T_INT_LITERAL

A sequence of up to 10 digits, OR a sequence of up to 10 hex digits (0-9, uppercase A-F) followed by the character H.
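
The pattern above can be written as a regular expression. The Java sketch below checks whether a complete lexeme has the shape of an integer literal. The class name IntLiteralPattern is hypothetical, and the sketch assumes that at least one digit is required, which the spec implies but does not state. It does not address how a lexer should tell a hex literal made only of letters (such as ABH) apart from an identifier, since the spec is silent on that.

```java
import java.util.regex.Pattern;

final class IntLiteralPattern {
    // Either 1 to 10 decimal digits, or 1 to 10 hex digits followed by H.
    // Assumption: an empty digit sequence is not a literal.
    private static final Pattern INT_LITERAL =
        Pattern.compile("[0-9]{1,10}|[0-9A-F]{1,10}H");

    // True if the whole lexeme matches the T_INT_LITERAL pattern.
    static boolean matches(String lexeme) {
        return INT_LITERAL.matcher(lexeme).matches();
    }
}
```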

T_STR_LITERAL

A sequence of up to 80 characters, surrounded by either single or double quotes. The closing quote must be the same as the opening quote. Single quotes may appear inside a double-quoted string, and vice versa. The surrounding quote characters do not count toward the length of the literal: the longest legal string literal is 80 characters plus the two surrounding quote characters. In all cases (whether or not the literal is too long), the surrounding quotes are not to be returned as part of the lexeme.
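
The quoting and length rules can be sketched in Java as follows. The class name StringLiteral and its methods are hypothetical. Returning null for an unterminated literal is an assumption made only for illustration, since the spec does not say how that case is reported.

```java
final class StringLiteral {
    static final int MAX_LENGTH = 80;

    // Scans the literal whose opening quote (' or ") is at index start.
    // Returns the lexeme WITHOUT the surrounding quotes, or null if the
    // matching closing quote is missing (assumed behavior, not from the spec).
    static String lexemeAt(String src, int start) {
        char quote = src.charAt(start);
        // Only the same kind of quote closes the literal, so the other kind
        // may appear freely inside it.
        int end = src.indexOf(quote, start + 1);
        return end < 0 ? null : src.substring(start + 1, end);
    }

    // The quotes are already stripped, so the limit applies to the lexeme alone.
    static boolean isTooLong(String lexeme) {
        return lexeme.length() > MAX_LENGTH;
    }
}
```

For example, lexemeAt applied to the text "it's" (with its double quotes) at index 0 yields the four-character lexeme it's.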

T_REAL_LITERAL

A real literal consists of a mantissa, optionally followed by an exponent. A mantissa is a sequence of 1 to 9 digits and a period (leading zeroes are discarded and do not count toward the limit). The period may appear in the middle or at the end of the mantissa, but NOT at the beginning (this follows the Oberon spec). (Note that a single period by itself is not a legal mantissa; it is a punctuation token.) The exponent, if present, consists of an uppercase E or D, an optional plus or minus sign, and up to 3 digits.
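
The Java sketch below shows one possible reading of this pattern. The class name RealLiteralPattern is hypothetical. It assumes that "leading zeroes" means zeroes at the start of the digits before the period, that the lower bound of one digit is checked on the text as written (so 0. is accepted), and that the exponent needs at least one digit. Other readings are possible.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RealLiteralPattern {
    // group 1: digits before the period (at least one, so ".5" is rejected)
    // group 2: digits after the period (may be empty, so "5." is accepted)
    // group 3: optional exponent: E or D, optional sign, 1 to 3 digits
    private static final Pattern SHAPE =
        Pattern.compile("([0-9]+)\\.([0-9]*)([ED][+-]?[0-9]{1,3})?");

    // True if the whole lexeme matches the T_REAL_LITERAL pattern under the
    // assumptions stated in the lead-in.
    static boolean matches(String lexeme) {
        Matcher m = SHAPE.matcher(lexeme);
        if (!m.matches()) {
            return false;
        }
        // Leading zeroes of the integer part do not count toward the limit.
        String integerPart = m.group(1).replaceFirst("^0+", "");
        int mantissaDigits = integerPart.length() + m.group(2).length();
        return mantissaDigits <= 9;
    }
}
```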



T_CHAR_LITERAL

A sequence of up to 3 hex digits, followed by the letter X.