```
> {-# LANGUAGE LambdaCase #-}
> import Data.Char
> import Data.Functor
> import Control.Monad
```

Before we continue, a word from our sponsors:

` **Don't Fear Monads**`

They are simply an (extremely versatile) abstraction, like `map`

or `fold`

.

A parser is a piece of software that takes a raw `String`

(or sequence of bytes) and returns some structured object, for example, a list of options, an XML tree or JSON object, a program’s Abstract Syntax Tree and so on. Parsing is one of the most basic computational tasks. *Every* serious software system has a parser tucked away somewhere inside, for example

```
System Parses
-------------- ------------------------------
Shell Scripts Command-line options
Browsers HTML
Games Level descriptors
Routers Packets
```

`!--`

- Shell Scripts (command-line options)
- Web Browsers (duh!)
- Games (level descriptors)
- Routers (packets)
- etc

–>

(Indeed I defy you to find any serious system that *does not* do some parsing somewhere!)

The simplest and most accurate way to think of a parser is as a function

`type Parser = String -> StructuredObject`

The usual way to build a parser is by specifying a grammar and using a parser generator (eg yacc, bison, antlr) to create the actual parsing function. While elegant, one major limitation of the grammar based approach is its lack of modularity. For example, suppose I have two kinds of primitive values `Thingy`

and `Whatsit`

.

```
Thingy : rule { action }
;
Whatsit : rule { action }
;
```

If you want a parser for *sequences of* `Thingy`

and `Whatsit`

we have to painstakingly duplicate the rules as

```
Thingies : Thingy Thingies { ... }
EmptyThingy { ... }
;
Whatsits : Whatsit Whatsits { ... }
EmptyWhatsit { ... }
;
```

This makes sub-parsers hard to reuse. Next, we will see how to *compose* mini-parsers for sub-values to get bigger parsers for complex values.

To do so, we will generalize the above parser type a little bit, by noting that a (sub-)parser need not (indeed, will not) consume consume *all* of its input, and so we can simply have the parser return the unconsumed input

`type Parser = String -> (StructuredObject, String) `

Of course, it would be silly to have different types for parsers for different kinds of objects, and so we can make it a parameterized type

`type Parser a = String -> (a, String) `

One last generalization: the parser could return multiple results, for example, we may want to parse the string

`"2 - 3 - 4"`

either as

`Minus (Minus 2 3) 4`

or as

`Minus 2 (Minus 3 4)`

So, we can have our parsers return a *list* of possible results (where the empty list corresponds to a failure to parse.)

`> newtype Parser a = P (String -> [(a, String)])`

The above is simply the parser (*cough* action) the actual parsing is done by

`> doParse (P p) s = p s`

Lets build some parsers!

Recall

`newtype Parser a = P (String -> [(a, String)])`

Which of the following is a valid single-character-parser that returns the **first** `Char`

from a string (if one exists.)

```
-- a
oneChar = P $ \cs -> head cs
-- b
oneChar = P $ \case -> {[] -> [('', []) | c:cs -> (c, cs)}
-- c
oneChar = P $ \cs -> (head cs, tail cs)
-- d
oneChar = P $ \cs -> [(head cs, tail cs)]
-- e
oneChar = P $ \case -> { [] -> [] | cs -> [(head cs, tail cs)]}
```

```
```

Yes, we can!

```
> oneChar :: Parser Char
> oneChar = P (\cs -> case cs of
> c:cs' -> [(c, cs')]
> _ -> [])
```

Lets run the parser

```
ghci> doParse oneChar "hey!"
[('h',"ey!")]
ghci> doParse oneChar ""
[]
```

Now we can write another parser that grabs a **pair** of `Char`

values

```
twoChar :: Parser (Char, Char)
twoChar = P (\cs -> case cs of
c1:c2:cs' -> [((c1, c2), cs')]
_ -> [])
```

Lets run the parser

```
ghci> doParse twoChar "hey!"
[(('h', 'e'), "y!")]
ghci> doParse twoChar "h"
[]
```

Recall

```
twoChar :: Parser (Char, Char)
twoChar = P (\cs -> case cs of
c1:c2:cs' -> [((c1, c2), cs')]
_ -> [])
```

Suppose we had some `foo`

such that behaved **identically** to `twoChar`

.

`twoChar' = foo oneChar oneChar `

What must the type of `foo`

be?

`Parser (Char, Char)`

`Parser Char -> Parser (Char, Char)`

`Parser a -> Parser a -> Parser (a, a)`

`Parser a -> Parser b -> Parser (a, b)`

`Parser a -> Parser (a, a)`

```
```

Indeed, `foo`

is a **parser combinator** that takes two parsers and returns a new parser that returns a pair of values:

```
pairP :: Parser a -> Parser b -> Parser (a, b)
pairP p1 p2 = P (\cs ->
[((x,y), cs'') | (x, cs' ) <- doParse p1 cs,
(y, cs'') <- doParse p2 cs']
)
```

Now we can more cleanly write:

`> twoChar = pairP oneChar oneChar `

which would run like this

```
ghci> doParse twoChar "hey!"
[(('h','e'), "y!")]
```

**EXERCISE:** Can you explain why we get the following behavior?

```
ghci> doParse twoChar "h"
[]
```

Now we could keep doing this, but often to go forward, it is helpful to step back and take a look at the bigger picture.

Here’s the the **type** of a parser

`newtype Parser a = P (String -> [(a, String)])`

it should remind you of something else, remember this?

`type ST a = S (State -> (a, State))`

*(drumroll…)*

Indeed, a parser, like a state transformer, is a monad! if you squint just the right way.

We need to define the `return`

and `>>=`

functions.

The bind is a bit tricky, but we just saw it above!

```
:type bindP
bindP :: Parser a -> (a -> Parser b) -> Parser b
```

so, we need to suck the `a`

values out of the first parser and invoke the second parser with them on the remaining part of the string.

returnP :: a -> Parser a returnP x = P $ -> [(x, cs)]

Recall

```
doParse :: Parser a -> String -> [(a, String)]
doParse (P p) str = p str
```

Consider the function `bindP`

:

```
bindP :: Parser a -> (a -> Parser b) -> Parser b
bindP p1 fp2 = P $ \cs -> [(y, cs'') | (x, cs') <- undefined -- 1
, (y, cs'') <- undefined -- 2
]
```

What shall we fill in for the two `undefined`

to get the code to typecheck?

`p1 cs`

and`fp2 x cs`

`doParse p1 cs`

and`doParse (fp2 x) cs'`

`p1 cs`

and`fp2 x cs'`

`doParse p1 cs`

and`doParse (fp2 x) cs`

`doParse p1 cs`

and`doParse fp2 x cs'`

```
```

Indeed, we can define the `bindP`

function for `Parser`

s as:

```
> bindP p1 fp2 = P $ \cs -> [(y, cs'') | (x, cs') <- doParse p1 cs
> , (y, cs'') <- doParse (fp2 x) cs']
```

See how we suck the `a`

values out of the first parser (by running `doParse`

) and invoke the second parser on each possible `a`

(and the remaining string) to obtain the final `b`

and remainder string tuples.

The `return`

is very simple, we can let the types guide us

```
:type returnP
returnP :: a -> Parser a
```

which means we must ignore the input string and just return the input element

`> returnP x = P (\cs -> [(x, cs)])`

Armed with those, we can officially brand parsers as monads

```
> instance Monad Parser where
> (>>=) = bindP
> return = returnP
```

This is going to make things really sweet…

Since parsers are monads, we can write a bunch of **high-level combinators** for composing smaller parsers into bigger ones.

For example, we can use our beloved `do`

notation to rewrite `pairP`

as

```
> pairP :: Parser a -> Parser b -> Parser (a, b)
> pairP px py = do x <- px
> y <- py
> return (x, y)
```

shockingly, exactly like the `pairs`

function from here.

Next, lets flex our monadic parsing muscles and write some new parsers. It will be helpful to have a a *failure* parser that always goes down in flames, that is, returns `[]`

– **no** successful parses.

`> failP = P $ const []`

Seems a little silly to write the above, but its helpful to build up richer parsers like the following which parses a `Char`

*if* it satisfies a predicate `p`

```
> satP :: (Char -> Bool) -> Parser Char
> satP p = do c <- oneChar
> if p c then return c else failP
```

we can write some simple parsers for particular characters

`> lowercaseP = satP isAsciiLower`

```
ghci> doParse (satP ('h' ==)) "mugatu"
[]
ghci> doParse (satP ('h' ==)) "hello"
[('h',"ello")]
```

The following parse alphabet and numeric characters respectively

```
> alphaChar = satP isAlpha
> digitChar = satP isDigit
```

and this little fellow returns the first digit in a string as an `Int`

```
> digitInt = do c <- digitChar
> return ((read [c]) :: Int)
```

which works like so

```
ghci> doParse digitInt "92"
[(9,"2")]
ghci> doParse digitInt "cat"
[]
```

Finally, this parser will parse only a particular `Char`

passed in as input

`> char c = satP (== c)`

**EXERCISE:** Write a function `strP :: String -> Parser String`

such that `strP s`

parses **exactly** the string `s`

and nothing else, that is,

```
ghci> dogeP = strP "doge"
ghci> doParse dogeP "dogerel"
[("doge", "rel")]
ghci> doParse dogeP "doggoneit"
[]
```

mapM f [] = return [] mapM f (x:xs) = do y <- f x ys <- mapM f xs return (y:ys)

strP = mapM char

f = foo . bar

` cs :: [Char] `

map char cs :: [Parser Char]

strP :: [Char] -> Parser [Char] strP [] = return “” strP (c:cs) = char c >> strP cs >> return (c:cs)

sequence :: [Parser Char] -> Parser [Char]

Next, lets write a combinator that takes two sub-parsers and **non-deterministically chooses** between them.

`> chooseP :: Parser a -> Parser a -> Parser a`

That is, we want `chooseP p1 p2`

to return a succesful parse if *either* `p1`

or `p2`

succeeds.

We can use `chooseP`

to build a parser that returns either an alphabet or a numeric character

`> alphaNumChar = alphaChar `chooseP` digitChar`

After defining the above, we should get something like:

```
ghci> doParse alphaNumChar "cat"
[('c', "at")]
ghci> doParse alphaNumChar "2cat"
[('2', "cat")]
ghci> doParse alphaNumChar "230"
[('2', "30")]
```

How would we go about encoding **choice** in our parsers?

```
-- a
p1 `chooseP` p2 = do xs <- p1
ys <- p2
return (x1 ++ x2)
-- b
p1 `chooseP` p2 = do xs <- p1
case xs of
[] -> p2
_ -> return xs
-- c
p1 `chooseP` p2 = P $ \cs -> doParse p1 cs ++ doParse p2 cs
-- d
p1 `chooseP` p2 = P $ \cs -> case doParse p1 cs of
[] -> doParse p2 cs
rs -> rs
```

```
```

`> p1 `chooseP` p2 = P (\cs -> doParse p1 cs ++ doParse p2 cs)`

Thus, what is even nicer is that if *both* parsers succeed, you end up with all the results.

Here’s a parser that grabs `n`

characters from the input

```
> grabn :: Int -> Parser String
> grabn n
> | n <= 0 = return ""
> | otherwise = do c <- oneChar
> cs <- grabn (n-1)
> return (c:cs)
```

**DO IN CLASS** How would you nuke the nasty recursion from `grabn`

?

Lets now use our choice combinator to define:

`> foo = grabn 2 `chooseP` grabn 4`

What does the following evaluate to?

`ghci> doParse foo "mickeymouse"`

`[]`

`[("mi","ckeymouse")]`

`[("mick","eymouse")]`

`[("mi","ckeymouse"),("mick","eymouse")]`

`[("mick","eymouse"), ("mi","ckeymouse")]`

```
```

and only one result if thats possible

```
ghci> doParse grab2or4 "mic"
[("mi","c")]
ghci> doParse grab2or4 "m"
[]
```

Even with the rudimentary parsers we have at our disposal, we can start doing some rather interesting things. For example, here is a little calculator. First, we parse the operation

```
> intOp = plus `chooseP` minus `chooseP` times `chooseP` divide
> where
> plus = char '+' >> return (+)
> minus = char '-' >> return (-)
> times = char '*' >> return (*)
> divide = char '/' >> return div
```

**DO IN CLASS** Can you guess the type of the above parser?

Next, we can parse the expression

```
> calc = do x <- digitInt
> op <- intOp
> y <- digitInt
> return $ x `op` y
```

which, when run, will both parse and calculate

```
ghci> doParse calc "8/2"
[(4,"")]
ghci> doParse calc "8+2cat"
[(10,"cat")]
ghci> doParse calc "8/2cat"
[(4,"cat")]
ghci> doParse calc "8-2cat"
[(6,"cat")]
ghci> doParse calc "8*2cat"
[(16,"cat")]
```

What does the following return:

`ghci> doParse calc "99bottles"`

- Type error
`[]`

`[(9, "9bottles")]`

`[(99, "bottles")]`

- Run-time exception

To start parsing interesting things, we need to add recursion to our combinators. For example, its all very well to parse individual characters (as in `char`

above) but it would a lot more swell if we could grab particular `String`

tokens.

Lets try to write it!

`> string :: String -> Parser String`

```
string "" = return ""
string (c:cs) = do char c
string cs
return (c:cs)
```

**DO IN CLASS** Ewww! Is that explicit recursion ?! Lets try again (can you spot the pattern)

`> string = undefined -- fill this in`

Much better!

```
ghci> doParse (string "mic") "mickeyMouse"
[("mic","keyMouse")]
ghci> doParse (string "mic") "donald duck"
[]
```

Ok, I guess that wasn’t really recursive then after all!

Lets try again.

Lets write a combinator that takes a parser `p`

that returns an `a`

and returns a parser that returns *many* `a`

values. That is, it keeps grabbing as many `a`

values as it can and returns them as a `[a]`

.

```
> manyP :: Parser a -> Parser [a]
> manyP p = many1 `chooseP` many0
> where
> many0 = return []
> many1 = do x <- p
> xs <- manyP p
> return (x:xs)
```

But beware! The above can yield *many* results

```
ghci> doParse (manyP digitInt) "123a"
[([], "123a"), ([1], "23a"),([1, 2], "3a"),([1, 2, 3], "a")]
```

which is simply all the possible ways to extract sequences of integers from the input string.

Often we want a single result, not a set of results. For example, the more intuitive behavior of `many`

would be to return the maximal sequence of elements and not *all* the prefixes.

To do so, we need a *deterministic* choice combinator

(#%)

```
> firstChooseP :: Parser a -> Parser a -> Parser a
> firstChooseP p1 p2 = P $ \cs -> case doParse (p1 `chooseP` p2) cs of
> [] -> []
> x:_ -> [x]
```

```
> (<|>) :: Parser a -> Parser a -> Parser a
> p1 <|> p2 = P $ \cs -> case doParse (p1 `chooseP` p2) cs of
> [] -> []
> x:_ -> [x]
>
```

The above runs choice parser but returns only the first result. Now, we can revisit the `manyP`

combinator and ensure that it returns a single, maximal sequence

```
> mmanyP :: Parser a -> Parser [a]
> mmanyP p = mmany1 <|> mmany0
> where
> mmany0 = return []
> mmany1 = do x <- p
> xs <- mmanyP p
> return (x:xs)
```

**DO IN CLASS** Wait a minute! What exactly is the difference between the above and the original `manyP`

? How do you explain this:

```
ghci> doParse (manyP digitInt) "123a"
[([1,2,3],"a"),([1,2],"3a"),([1],"23a"),([],"123a")]
ghci> doParse (mmanyP digitInt) "123a"
[([1,2,3],"a")]
```

Lets use the above to write a parser that will return an entire integer (not just a single digit.)

```
oneInt :: Parser Integer
oneInt = do xs <- mmanyP digitChar
return $ ((read xs) :: Integer)
```

bob :: (a -> b) -> Parser a -> (Parser b) bob f p = do x <- p return $ f x

*Aside*, can you spot the pattern above? We took the parser `mmanyP digitChar`

and simply converted its output using the `read`

function. This is a recurring theme, and the type of what we did gives us a clue

`(a -> b) -> Parser a -> Parser b`

Aha! a lot like `map`

. Indeed, there is a generalized version of `map`

that we have seen before (`lift1`

) and we bottle up the pattern by declaring `Parser`

to be an instance of the `Functor`

typeclass

```
> instance Functor Parser where
> fmap f p = do x <- p
> return (f x)
```

after which we can rewrite

```
> oneInt :: Parser Int
> oneInt = read `fmap` mmanyP digitChar
```

Lets take it for a spin

```
ghci> doParse oneInt "123a"
[(123, "a")]
```

Lets use the above to build a small calculator, that parses and evaluates arithmetic expressions. In essence, an expression is either binary operand applied to two sub-expressions or an integer. We can state this as

```
> calc0 :: Parser Int
> calc0 = binExp <|> oneInt
> where
> binExp = do x <- oneInt
> o <- intOp
> y <- calc0
> return $ x `o` y
```

This works pretty well!

```
ghci> doParse calc0 "1+2+33"
[(36,"")]
ghci> doParse calc0 "11+22-33"
[(0,"")]
```

but things get a bit strange with minus

```
ghci> doParse calc0 "11+22-33+45"
[(-45,"")]
```

Huh? Well, if you look back at the code, you’ll realize the above was parsed as

`11 + ( 22 - (33 + 45))`

because in each `binExp`

we require the left operand to be an integer. In other words, we are assuming that each operator is *right associative* hence the above result.

I wonder if we can try to fix it just by flipping the order

```
> calc1 :: Parser Int
> calc1 = binExp <|> oneInt
> where
> binExp = do x <- calc1
> o <- intOp
> y <- oneInt
> return $ x `o` y
```

```
> calc1' :: Parser Int
> calc1' = oneInt <|> binExp
> where
> binExp = do x <- calc1'
> o <- intOp
> y <- oneInt
> return $ x `o` y
```

What does the following evaluate to?

`ghci> doParse calc1' "11+22-33+45"`

`[( 11 , "+22-33+45")]`

`[( 33 , "-33+45")]`

`[( 0, "+45")]`

`[( 45 , "")]`

- None of the above

```
```

Indeed, there is a bug here … can you figure it out?

**Hint:** what will the following return?

`ghci> doParse calc1 "2+2"`

Even worse, we have no precedence, and so

```
ghci> doParse calc0 "10*2+100"
[(1020,"")]
```

as the string is parsed as

`10 * (2 + 100)`

```
> sumE1 = chain addOp prodE1
> prodE1 = chain mulOp factorE1
> factorE1 = parenP sumE1 <|> oneInt
```

```
> chainl op base = base >>= bob
> where
> bob x = grab x <|> return x
> grab x = do o <- op
> y <- base
> bob $ x `o` y
```

```
> sumE1 = prodE1 >>= bob
> where
> bob x = grab x <|> return x
> grab x = do o <- addOp
> y <- prodE1
> bob $ x `o` y
>
> prodE1 = factorE1 >>= bob
> where
> bob x = grab x <|> return x
> grab x = do o <- mulOp
> y <- factorE1
> bob $ x `o` y
```

We can add both associativity and precedence, by stratifying the parser into different levels. Here, lets split our operations into addition-

```
> addOp = plus `chooseP` minus
> where
> plus = char '+' >> return (+)
> minus = char '-' >> return (-)
```

and multiplication-precedence.

```
> mulOp = times `chooseP` divide
> where
> times = char '*' >> return (*)
> divide = char '/' >> return div
```

Now, we can stratify our language into (mutually recursive) sub-languages, where each top-level expression is parsed as a **sum-of-products**

```
> sumE = addE <|> prodE
> where
> addE = do x <- prodE
> o <- addOp
> y <- sumE
> return $ x `o` y
>
> prodE = mulE <|> factorE
> where
> mulE = do x <- factorE
> o <- mulOp
> y <- prodE
> return $ x `o` y
>
> factorE = parenP sumE <|> oneInt
```

We can run this

```
ghci> doParse sumE "10*2+100"
[(120,"")]
ghci> doParse sumE "10*(2+100)"
[(1020,"")]
```

Do you understand why the first parse returned `120`

? What would happen if we *swapped* the order of `prodE`

and `sumE`

in the body of `addE`

(or `factorE`

and `prodE`

in the body of `prodE`

) ? Why?

Recall that in the above,

```
factorE :: Parser Int
factorE = parenP sumE <|> oneInt
```

What is the type of `parenP`

?

`Parser Int`

`Parser a -> Parser a`

`a -> Parser a`

`Parser a -> a`

`Parser Int -> Parser a`

```
```

Lets write `parenP`

```
> parenP p = do char '('
> x <- p
> char ')'
> return x
```

There is not much point gloating about combinators if we are going to write code like the above – the bodies of `sumE`

and `prodE`

are almost identical!

Lets take a closer look at them. In essence, a `sumE`

is of the form

`prodE + < prodE + < prodE + ... < prodE >>>`

that is, we keep chaining together `prodE`

values and adding them for as long as we can. Similarly a `prodE`

is of the form

`factorE * < factorE * < factorE * ... < factorE >>>`

where we keep chaining `factorE`

values and multiplying them for as long as we can. There is something unpleasant about the above: the addition operators are right-associative

```
ghci> doParse sumE "10-1-1"
[(10,"")]
```

Ugh! I hope you understand why: its because the above was parsed as `10 - (1 - 1)`

(right associative) and not `(10 - 1) - 1`

(left associative). You might be tempted to fix that simply by flipping the order of `prodE`

and `sumE`

```
sumE = addE <|> prodE
where
addE = do x <- sumE
o <- addOp
y <- prodE
return $ x `o` y
```

but this would prove disastrous. Can you see why?

The parser for `sumE`

directly (recursively) calls itself **without consuming any input!** Thus, it goes off the deep end and never comes back. Instead, we want to make sure we keep consuming `prodE`

values and adding them up (rather like fold) and so we could do

```
> sumE1 = prodE1 >>= addE1
> where
> addE1 x = grab x <|> return x
> grab x = do o <- addOp
> y <- prodE1
> addE1 $ x `o` y
>
> prodE1 = factorE1 >>= mulE1
> where
> mulE1 x = grab x <|> return x
> grab x = do o <- mulOp
> y <- factorE1
> mulE1 $ x `o` y
>
> factorE1 = parenP sumE1 <|> oneInt
```

It is easy to check that the above is indeed left associative.

```
ghci> doParse sumE1 "10-1-1"
[(8,"")]
```

and it is also very easy to spot and bottle the chaining computation pattern: the only differences are the *base* parser (`prodE1`

vs `factorE1`

) and the binary operation (`addOp`

vs `mulOp`

). We simply make those parameters to our *chain-left* combinator

```
> p `chainl` pop = p >>= rest
> where
> rest x = grab x <|> return x
> grab x = do o <- pop
> y <- p
> rest $ x `o` y
```

after which we can rewrite the grammar in three lines

```
> sumE2 = prodE2 `chainl` addOp
> prodE2 = factorE2 `chainl` mulOp
> factorE2 = parenP sumE2 <|> oneInt
```

```
ghci> doParse sumE2 "10-1-1"
[(8,"")]
ghci> doParse sumE2 "10*2+1"
[(21,"")]
ghci> doParse sumE2 "10+2*1"
[(12,"")]
```

That concludes our in-class exploration of monadic parsing. This is merely the tip of the iceberg. Though parsing is a very old problem, and has been studied since the dawn of computing, we saw how monads bring a fresh perspective which have recently been transferred from Haskell to many other languages. There have been several exciting recent papers on the subject, that you can explore on your own. Finally, Haskell comes with several parser combinator libraries including Parsec which you will play around with in HW2.