Spring 2006        CSE 141L

 

Project/Computer Architecture

 

 


 

Lexer Suggested Code

 

This covers how to build an assembly file scanner. Feel free to use it, or not to use it- using it does not help or hinder your grade. Even if you do, you will need to add all of the semantic analysis that understands your assembly format. The following description on lexing is very high level. If you need more information follow these links:

 

GNU C/C++ flex

 

Java version jflex

 

Though the description is written for GNU C/C++ flex tool, the Java version jflex is almost identical, and you should not have any trouble adapting the description for Java.

 

 


 FLEX

 

A parser is a tool that reads in a string of text, and tries to match a set of rules to the string. Since this is required for many applications, there are compilers to generate parsers from a concise description of the text matching rules. They are differentiated by feature set. The open source flex and jflex tools are examples of lexical analysis scanners. A lexical analyzer means that it can understand languages recognized by a finite state machine, which will be sufficient for your assembler. The input file for flex has the form:

 

definitions
%%
rules
%%
user code

 

Definitions allow one to assign regular expressions to some name, for use in the rules section. One can also place regular expression in the rules too. The following are examples of definitions, where the name DIGIT is assigned to a regular expression matching one number digit, and the name ID is assigned to a regular expression that matches variables that start with a letter followed by any combination of letters and numbers. There is a description of regular expressions in the following section.

 

DIGIT

[0-9]

ID

[a-z][a-z0-9]*

 

 

Rules have the form

Pattern

Action

 

where pattern is a regular expression or name, followed by the action which is some code to execute if the pattern matches. The following example on the name ID match, prints what it finds, and returns a number 5. This return value tells the caller to the lexer that it found an ID token. If several patterns match a string input, flex takes the longest match. If they are still equivalent, it uses the first matching rule in the rules specification.

 

{ID}

{ printf("found a ID token %s\n", yytext); return 5; }

is equivalent to 

[a-z][a-z0-9]*

{ printf("found a ID token %s\n", yytext); return 5; }

 

The lexer is called up from another module via the interface "int yylex();". Yylex contains the code for rules. It assumes that the input text file to read is passed through "FILE* yyin". When the lexer reaches the end-of-file of yyin, yylex returns a 0.

 

The rules section has other features such as supporting predicating the string matching, input stream look ahead, and manipulating the input stream. We will not cover this.

 

When flex runs on the source file, it generates C code (compatible with C++) to perform the parsing. Code in the user code section is copied verbatim to the output file. Similarly code found in the definitions and rules section surrounded "{%" and "%}" or as indented text is copied verbatim to the output file. This generated output of flex is by default called lex.yy.c, though this can be specified from the command line. In our example file, another module calls up the scanner in lex.yy.c, to perform analysis (dump out strings found).

 


Regular Expression

 

Regular expressions are patterns that a string can match. You can specify exact match, or more general patterns. Here are some important examples snarfed from the flex manual:

 

x

match the character `x'

.

any character (byte) except newline

[xyz]

a "character class"; in this case, the pattern matches either an `x', a `y', or a `z'

[abj-oZ]

a "character class" with a range in it; matches an `a', a `b', any letter from `j' through `o', or a `Z'

[^A-Z]

a "negated character class", i.e., any character but those in the class. In this case, any character EXCEPT an uppercase letter.

[^A-Z\n]

any character EXCEPT an uppercase letter or a newline

r*

zero or more r's, where r is any regular expression

r+

one or more r's

{name}

the expansion of the "name" definition (see above)

r|s

either an r or an s

 

 


How to run

The source code is located on the following pages:

C/C++

Makefile, sample_lex.l, scanner.C, token.h

 

Java uses jflex

Makefile, sample_flex.flex, Jscanner.java, Token_type.java

 

Test input: test.s

 

The java flex compiler jflex is not installed on the ACS computers. You either have to add the path /home/linux/ieng6/cs141s/public/jflex to your PATH, or use the full path /home/linux/ieng6/cs141s/public/jflex/jflex.

Download code to a directory. To build, run make on the respective Makefile.

> make

To run C/C++, run as

> scanner <input text file>

To run Java, run as

> java Jscanner <input text file>