Regular Expressions

Regural expressions are easily defined as an abstract data type in haskell:

data Rex = Empty
         | Epsilon
	     | Sym Char
		 | Union Rex Rex
		 | Concat Rex Rex
		 | Star Rex

There are six cases, corresponding to different types of regular expressions. Notice that this is a recursive (inductive) type definition. Using this user-defined type we can represent arbitrary regular expressions as haskell values of type Rex. For example, the expression a(a + b)* is written as

regex0 = Concat (Sym 'a') (Star (Union (Sym 'a') (Sym 'b'))) :: Rex

In order to make regular expressions easier to type, we can introduce infix operators for the union and concatenation operations:

(##) = Concat  -- concatenation of two expressions
(%%) = Union   -- union of two expressions

regex1 = Sym 'a' ## Star (Sym 'a' %% Sym 'b') -- all strings starting with a
regex2 = Sym 'b' ## Star (Sym 'a' %% Sym 'b') -- all strings starting with b

We used the operators ## and %% to avoid conflicts with other predefined haskell symbols, but you can redefine the operators using other names.

Converting Regular Expressions to Automata

Regular expressions can be converted into finite automata. This is done by giving an NFA construction for each type of regular expression. We begin by giving NFAs for the empty set, the emtpy string and the single character regular expressions:

emptyNFA :: NFA () -- NFA for the empty language
emptyNFA = ([()],[],delta,(),[]) where
  delta () _ = []

epsilonNFA :: NFA () -- NFA for the empty string language
epsilonNFA = ([()],[],delta,(),[()]) where
  delta () _ = []

symbolNFA :: Char -> NFA Int -- NFA for a single character language
symbolNFA a = ([0,1],[a],delta,0,[1]) where
  delta 0 a = [1]
  delta _ _ = []

For the other operations, we need to implement NFA constructions showing that regular languages are closed under concatenation, union and star. In this constructions, it is often useful to add a new state to an existing automaton. This is easily done using the following user-defined type:

data New st =  New | Old st
  deriving (Eq,Ord,Show)

A value of type New a is either a value of the old type Old x (for any x :: a), or a new value New. Alternatively, we could have also used the haskell standard type Maybe a, with constructors Nothing and Just a. But defining a new type with different names makes our automata more readable. The following constructions correspond to Theorem 1.45, 1.47 and 1.49 in the textbook, with minor differences. Notice how we first define an NFA with ϵ-transitions, and then we remove the ϵ-transitions using convertNFAeps2NFA.

concatNFA :: (Ord st1, Ord st2) => NFA st1 -> NFA st2 -> NFA (Either st1 st2) 
concatNFA (qs1,sigma1,delta1,s1,fs1) (qs2,sigma2,delta2,s2,fs2) =
  convertNFAeps2NFA (qs,sigma,delta,s,fs) where
    qs = [ Left q | q<-qs1] ++ [ Right q | q<-qs2]
    sigma = nub (sigma1 ++ sigma2) -- union of the alphabets, with duplicates removed
    s = Left s1
    fs = map Right fs2
    delta (Left q) (Just a)  = -- transitions from the first  NFA
      [Left p  | p <- delta1 q a] 
    delta (Right q) (Just a) = -- transitions from the second NFA
      [Right p | p <- delta2 q a] 
    delta (Left q) Nothing =   -- epsilon transitions from first to second NFA
      if (elem q fs1) then [Right s2] else []
    delta _ _ = []             -- no other transition

unionNFA :: (Ord st1, Ord st2) => NFA st1 -> NFA st2 -> NFA (New (Either st1 st2))
unionNFA (qs1,sigma1,delta1,q1,fs1) (qs2,sigma2,delta2,q2,fs2) =
  convertNFAeps2NFA (qs,sigma,delta,New,fs) where
    qs = [New] ++ [Old(Left q) | q<-qs1] ++ [Old(Right q) | q<-qs2]
    sigma = nub (sigma1 ++ sigma2)
    fs = [Old(Left q) | q<-fs1] ++ [Old(Right q) | q<-fs2]
    delta (Old (Left q)) (Just a) = [Old (Left p) | p<-delta1 q a] -- transitions from the first NFA
    delta (Old(Right q)) (Just a) = [Old(Right p) | p<-delta2 q a] -- transitions from the second NFA
    delta q Nothing = -- epsilon transitions from start state
      if (q == New) then [Old(Left q1), Old(Right q2)] else []  
    delta _ _ = []    -- no other transition

starNFA :: (Ord st) => NFA st -> NFA (New st)
starNFA (qs,sigma,delta,s,fs) =
  convertNFAeps2NFA (qs1,sigma,delta1,New,fs1) where
    qs1 = [New] ++ [Old q | q <- qs]
    fs1 = [New] ++ [Old q | q <- fs]
    delta1 (Old q) (Just a) = [Old p | p <- delta q a] -- transitions from original NFA
    delta1 New x = -- epsilon transition from the new to the old start state
      if (x == Nothing) then [Old s] else [] 
    delta1 (Old q) _ | elem q fs = [Old s]
    delta1 _ _ = []

All that remains to do is to combine the above constructions into a (recursive) function convertRex2NFA :: Regex -> NFA st that turns regular expressions into equivalent NFAs. However, here we face a small programming problem: what type of states st shall we use for the output automaton? Notice how this type depends on the input regular expression. So, no matter which type we choose, a naive implementation of the convertRex2NFA function would not type-check. This problem could be solved using certain advanced features of the haskell language, like existential types. But in order to stick to a small subset of haskell, we simply rename the states of the output automaton to integers using the function intNFA :: NFA st -> NFA Int defined at the end of NFA.hs. This way, the output of convertRex2NFA is always an automaton with states of integer type.

convertRex2NFA :: Regex -> NFA Int
convertRex2NFA Empty = intNFA (emptyNFA)
convertRex2NFA Epsilon = intNFA (epsilonNFA)
convertRex2NFA (Sym a) = intNFA (symbolNFA a)
convertRex2NFA (Union r1 r2) = intNFA (unionNFA (convertRex2NFA r1) (convertRex2NFA r2))
convertRex2NFA (Concat r1 r2) = intNFA (concatNFA (convertRex2NFA r1) (convertRex2NFA r2))
convertRex2NFA (Star r) = intNFA (starNFA (convertRex2NFA r))

Finally, apply the conversion procedure to our example regular expression, and look at the result using JFLAP:

> ghci RegEx.hs JFF.hs
RegEx> let nfa1 = convertNFA2NFAeps (convertRex2NFA regex1)
RegEx> writeFile (JFF.writeNFA nfa1) "nfa1.jff"
RegEx> :q
> java -jar JFLAP.jar nfa1.jff

All functions defined in these notes can be found in the RegEx.hs module.