# Regular Expressions

Regural expressions are easily defined as an abstract data type in haskell:

``````data Rex = Empty
| Epsilon
| Sym Char
| Union Rex Rex
| Concat Rex Rex
| Star Rex``````

There are six cases, corresponding to different types of regular expressions. Notice that this is a recursive (inductive) type definition. Using this user-defined type we can represent arbitrary regular expressions as haskell values of type `Rex`. For example, the expression a(a + b)* is written as

``regex0 = Concat (Sym 'a') (Star (Union (Sym 'a') (Sym 'b'))) :: Rex``

In order to make regular expressions easier to type, we can introduce infix operators for the union and concatenation operations:

``````(##) = Concat  -- concatenation of two expressions
(%%) = Union   -- union of two expressions

regex1 = Sym 'a' ## Star (Sym 'a' %% Sym 'b') -- all strings starting with a
regex2 = Sym 'b' ## Star (Sym 'a' %% Sym 'b') -- all strings starting with b``````

We used the operators `##` and `%%` to avoid conflicts with other predefined haskell symbols, but you can redefine the operators using other names.

## Converting Regular Expressions to Automata

Regular expressions can be converted into finite automata. This is done by giving an NFA construction for each type of regular expression. We begin by giving NFAs for the empty set, the emtpy string and the single character regular expressions:

``````emptyNFA :: NFA () -- NFA for the empty language
emptyNFA = ([()],[],delta,(),[]) where
delta () _ = []

epsilonNFA :: NFA () -- NFA for the empty string language
epsilonNFA = ([()],[],delta,(),[()]) where
delta () _ = []

symbolNFA :: Char -> NFA Int -- NFA for a single character language
symbolNFA a = ([0,1],[a],delta,0,) where
delta 0 a = 
delta _ _ = []``````

For the other operations, we need to implement NFA constructions showing that regular languages are closed under concatenation, union and star. In this constructions, it is often useful to add a new state to an existing automaton. This is easily done using the following user-defined type:

``````data New st =  New | Old st
deriving (Eq,Ord,Show)``````

A value of type `New a` is either a value of the old type `Old x` (for any `x :: a`), or a new value `New`. Alternatively, we could have also used the haskell standard type `Maybe a`, with constructors `Nothing` and `Just a`. But defining a new type with different names makes our automata more readable. The following constructions correspond to Theorem 1.45, 1.47 and 1.49 in the textbook, with minor differences. Notice how we first define an NFA with ϵ-transitions, and then we remove the ϵ-transitions using `convertNFAeps2NFA`.

``````concatNFA :: (Ord st1, Ord st2) => NFA st1 -> NFA st2 -> NFA (Either st1 st2)
concatNFA (qs1,sigma1,delta1,s1,fs1) (qs2,sigma2,delta2,s2,fs2) =
convertNFAeps2NFA (qs,sigma,delta,s,fs) where
qs = [ Left q | q<-qs1] ++ [ Right q | q<-qs2]
sigma = nub (sigma1 ++ sigma2) -- union of the alphabets, with duplicates removed
s = Left s1
fs = map Right fs2
delta (Left q) (Just a)  = -- transitions from the first  NFA
[Left p  | p <- delta1 q a]
delta (Right q) (Just a) = -- transitions from the second NFA
[Right p | p <- delta2 q a]
delta (Left q) Nothing =   -- epsilon transitions from first to second NFA
if (elem q fs1) then [Right s2] else []
delta _ _ = []             -- no other transition

unionNFA :: (Ord st1, Ord st2) => NFA st1 -> NFA st2 -> NFA (New (Either st1 st2))
unionNFA (qs1,sigma1,delta1,q1,fs1) (qs2,sigma2,delta2,q2,fs2) =
convertNFAeps2NFA (qs,sigma,delta,New,fs) where
qs = [New] ++ [Old(Left q) | q<-qs1] ++ [Old(Right q) | q<-qs2]
sigma = nub (sigma1 ++ sigma2)
fs = [Old(Left q) | q<-fs1] ++ [Old(Right q) | q<-fs2]
delta (Old (Left q)) (Just a) = [Old (Left p) | p<-delta1 q a] -- transitions from the first NFA
delta (Old(Right q)) (Just a) = [Old(Right p) | p<-delta2 q a] -- transitions from the second NFA
delta q Nothing = -- epsilon transitions from start state
if (q == New) then [Old(Left q1), Old(Right q2)] else []
delta _ _ = []    -- no other transition

starNFA :: (Ord st) => NFA st -> NFA (New st)
starNFA (qs,sigma,delta,s,fs) =
convertNFAeps2NFA (qs1,sigma,delta1,New,fs1) where
qs1 = [New] ++ [Old q | q <- qs]
fs1 = [New] ++ [Old q | q <- fs]
delta1 (Old q) (Just a) = [Old p | p <- delta q a] -- transitions from original NFA
delta1 New x = -- epsilon transition from the new to the old start state
if (x == Nothing) then [Old s] else []
delta1 (Old q) _ | elem q fs = [Old s]
delta1 _ _ = []``````

All that remains to do is to combine the above constructions into a (recursive) function `convertRex2NFA :: Regex -> NFA st` that turns regular expressions into equivalent NFAs. However, here we face a small programming problem: what type of states `st` shall we use for the output automaton? Notice how this type depends on the input regular expression. So, no matter which type we choose, a naive implementation of the `convertRex2NFA` function would not type-check. This problem could be solved using certain advanced features of the haskell language, like existential types. But in order to stick to a small subset of haskell, we simply rename the states of the output automaton to integers using the function `intNFA :: NFA st -> NFA Int` defined at the end of NFA.hs. This way, the output of `convertRex2NFA` is always an automaton with states of integer type.

``````convertRex2NFA :: Regex -> NFA Int
convertRex2NFA Empty = intNFA (emptyNFA)
convertRex2NFA Epsilon = intNFA (epsilonNFA)
convertRex2NFA (Sym a) = intNFA (symbolNFA a)
convertRex2NFA (Union r1 r2) = intNFA (unionNFA (convertRex2NFA r1) (convertRex2NFA r2))
convertRex2NFA (Concat r1 r2) = intNFA (concatNFA (convertRex2NFA r1) (convertRex2NFA r2))
convertRex2NFA (Star r) = intNFA (starNFA (convertRex2NFA r))``````

Finally, apply the conversion procedure to our example regular expression, and look at the result using JFLAP:

``````> ghci RegEx.hs JFF.hs
RegEx> let nfa1 = convertNFA2NFAeps (convertRex2NFA regex1)
RegEx> writeFile (JFF.writeNFA nfa1) "nfa1.jff"
RegEx> :q
> java -jar JFLAP.jar nfa1.jff``````

All functions defined in these notes can be found in the RegEx.hs module.