Overview

Compilers and interpreters take as input programs in string form. Lexing and parsing are the first two steps towards converting this string input into a program in the language that can then be executed or interpreted. Lexing is the process that takes the entire string input and breaks it into smaller chunks, or tokens. Tokens are often separated by spaces, newlines, operators, and other characters like ;, (, and }. Once a string has been tokenized into a sequence of tokens, the parsing process takes the sequence and groups tokens together according to the rules defined by a grammar.

Implementations of lexers and parsers for different languages have a lot similarities. To avoid duplicating effort, tools called lexer-generators and parser-generators allow the compiler writer to focus on the interesting details of his/her language -- namely, the syntax and semantics -- while taking care of the boilerplate code (think fold).

Used for writing compilers and interpreters in OCaml, ocamllex is a lexer-generator and ocamlyacc a parser-generator. The compiler writer implements two specification files in particular formats that ocamllex and ocamlyacc process, and they generate pure OCaml code that can be executed to lex and parse strings.

We will look at an example lexer and parser for a simple language of binary expressions and assignments. Run ./build.sh to build everything.

beParser.mly

In this file we define the different kinds of tokens for our language, the grammar rules that define legal statements in our language, and how to process the stream of tokens when groups of them match a grammar rule. ocamlyacc will read this file and then generate an OCaml file called beParser.ml, which will be the complete parser for our language.

The code between %{ and %} gets copied verbatim into beParser.ml, so it is pure OCaml code. In this case, we just open up the BeAst module that defines the abstract syntax for our language of binary expressions and assignments.

Each %token line defines a kind of token. Note that multiple tokens can be defined on the same line, and you can define a token to carry along with it data of a particular type using <...>. Take a look at beParser.ml to see the OCaml datatype that corresponds to these token definitions.

The %start and %type lines define which of the rules (defined later in the file) should be used as the top-most rule when trying to parse an input string. The remainder of the file after the %% defines the grammar rules. For each rule in the grammar, there is a corresponding { and }. What goes inside there is the OCaml "expression" that gets returned when the particular rule is matched. It is normal OCaml code, with the exception of things like $1 and $2, which correspond to the values returned by the first and second subparts of the rule. For example, when some sequence of tokens matches the fourth atom rule, the expression matched by expr between two parentheses is referred to as $2.

For more complete documentation about ocamlyacc, see http://caml.inria.fr/pub/docs/manual-ocaml-4.00/manual026.html.

beLexer.mll

In this file we define how characters in the input string should be broken up into the tokens defined in beParser.mly. The code within the { and } at the top is pure OCaml code that gets copied verbatim into a file called beLexer.ml. Notice that we open the BeParser module (which will be generated by ocamlyacc so that the token datatype is in scope).

The rest of the file defines how to convert ASCII characters into tokens. We choose the name token for our definition, but we very well could have named in anything. Whatever name we choose, however, we will be the entrypoint to the lexer; look at the file beLexer.ml and look for a function called token.

The syntax in this section of the specification should be readable because it is very close to pure OCaml. The input string is matched against a series of string literals and regular expressions. As soon as one of the patterns matches the input string, the OCaml code within the corresponding { and } is added to the list of tokens, and the token function is recursively invoked on what remains of the input string. Note that in the first rule, lexbuf is the implicit name of the input string, so token lexbuf is a recursive call to the lexer that does not add any tokens to the list whenever any whitespace is matched in the input string.

For more complete documentation about ocamllex, see http://caml.inria.fr/pub/docs/manual-ocaml-4.00/manual026.html.

Putting it all together

After running ocamllex beLexer.ml and ocamlyacc beParser.mly, the modules BeLexer and BeParser can now be used to lex and parse strings and interpret them as statements in our little language. The parse_string function in beMain.ml shows how to use the generated lexer and parser to process a string.

Run ./be.top, which is an OCaml top-level that already has loaded all the modules we need (look at build.sh to see how be.top is created). We can now process strings in the language:
% ./be.top 
        Objective Caml version 3.10.0
# BeMain.token_list_of_string "true || false";;
- : BeParser.token list = [BeParser.TRUE; BeParser.OR; BeParser.FALSE]

# BeMain.token_list_of_string "(look ma no hands)";;
- : BeParser.token list =
[BeParser.LPAREN; BeParser.Id "look"; BeParser.Id "ma"; BeParser.Id "no"; BeParser.Id "hands"; BeParser.RPAREN]

# BeMain.parse_string "(true || false) && x";;
- : BeAst.bexpr = BeAst.And (BeAst.Or (BeAst.Const true, BeAst.Const false), BeAst.Var "x")

# BeMain.parse_string "     true || (false && x)";;
- : BeAst.bexpr = BeAst.Or (BeAst.Const true, BeAst.And (BeAst.Const false, BeAst.Var "x"))

# BeMain.parse_string "look ma no hands";;
Exception: Parsing.Parse_error.