Sunday, February 3, 2008

From Regexps to Regexp Patterns, still functional

Here now is the long overdue follow-up to "RegExp but no state machine". Remember, we were talking about regular expressions: state machines come to mind, and we hurry to the cabinet with the imperative chainsaw, but it turns out that some scalpel-like pure functions can be used to cut the problem as well.

Binding Variables


Quite often, we are not just interested in the boolean answer to the question "did this input match the regexp?". We also want to know the parts that matched parts of the regexp, as in "if the file ends in .txt, put its name without the extension into variable x". In other words, we want to bind variables to parts of the input, depending on how the input matched. Regexps that contain variables deserve to get a beautiful and original name: let us call them regexp patterns.

You want to see an example that is not buried in natural language? We can do that... I will use Scala syntax, but keep in mind that regexp patterns in Scala are
undergoing a redesign - these example are just here to explain regexp patterns.

 
def just_the_name(name:Seq[Char]) = name match {
case Seq(filename @ _*, '.', 't', 'x', 't') => filename
}


Matching regexp patterns is a bit more involved than matching regexps. That is not apparent from the above, tame example, but what about something like this



def is_backup(name:Seq[Char]) = name match {
case Seq(_*, bak @ (() | ('.','b','a','k'))) => !bak.isEmpty
}


There are two ways that the input name can match the regexp patterns, a fact that is cleverly used to return a boolean to the caller. When there are many or-patterns and "wildcard_star" patterns, the number of possibilities can grow very large.

Back to Business


Maybe regexp patterns will be resurrected soon in Scala, for now, let us focus on the problem and its functional solution. We will write a Scala program that matches regular expression patterns.

Here is code to represent regexp patterns. It uses the RegExp classes defined before (a complete listing that you can compile in one go is provided at the end).

abstract class Pat[A]
// v is the variable (represented as Int), w the prefix that was
// already seen, r the remaining part we yet have to match
case class PVar[A](v:Int, word:RegExp.Word[A], re:RegExp[A]) extends Pat[A] {
override def toString = "x"+v+"@"+re.toString
}
case class PPair[A](left:Pat[A], right:Pat[A]) extends Pat[A]
case class PChoice[A](left:Pat[A], right:Pat[A]) extends Pat[A]
case class PatNil[A] extends Pat[A]
case class PatCons[B](hd:B, tl:List[B]) extends Pat[B]


As the comment says, these pattern representation is interesting: it does not only represent regexp patterns, but also partial matches of inputs. When we write down a regexp pattern, the "word" field of PVar will be empty. During the matching, as input becomes available, the different possibilities of going through the input will correspond to different instances of Pat.

Let us try how that would look for the is_backup example. To keep things short, we turn the pattern into x@'A', y@(()|'B')). We match the input A against the pattern, and see how we would go about this. We freely add the empty sequence () where it is appropriate.


Is the first letter of A consumed by x@'A', y@(()|'B'))? Yes, and it is bound to x.
Is () consumed by y@(()|'B'))? Yes, and nothing is bound y and we are done.



Now the same with AB


Is the first letter of AB consumed by x@'A', y@(()|'B'))? Yes, and it is bound to x.
Is the first letter of B consumed by y@(()|'B'))? Yes, and B is bound y.
Is () consumed by ()? yes and we are done.


So we want to step character by character through the input, and
keep track of the bindings by rewriting our patterns -- this is were the partial derivatives come into play. A function pdPat will rewrite a pattern with a partial derivative (the partDeriv function was explained before, in "RegExp but no state machine".)


def pdPat[A](pat:Pat[A],c:A):Pat[A] =
pat match {
case PVar(x, w, r) => PVar(x, w:::List(c), partDeriv(r, c))
case PPair(p1, p2) =>
val pn = PPair(pdPat(p1, c), p2)
if (acceptsEmpty(strip(p1)))
PChoice(pn, PPair(mkEmpPat(p1), pdPat(p2, c)))
else
pn
case PChoice(p1, p2) => PChoice(pdPat(p1, c), pdPat(p2, c))
}


In short, this function takes a regexp pattern and a character. It goes through the regexp pattern and constructs the partial derivative of its underlying regexp. When it needs to keep track of a binding (case PVar), it does so by appending the character at the end. It uses the functions mkEmpPat (which means matching against empty input and makes the regexps disappear or fail in a regexp pattern, but keeps the bindings that were computed before) and strip, which just gives us the underlying regexp and ignores all variable binding stuff that might be going on.

Exercise: write a function that computes the result of accepts_empty(strip(p)) directly, without constructing a regexp first.


The Full Monty


Here now, without further ado, the full example (including stuff from the first part). As always, it might be instructive to run to try out with your own examples, execute step-wise in your IDE debugger (Eclipse and Netbeans), or, the classic, insert lots of println statements.

Note that different possibilities of feeding an input character to a choice are tracked by return a list of environments. The "longest match" and the "shortest match" policies can be recovered by selecting a special element of the list of environments that is the end result of running the patMatch function.


object RegExpPat {

// from part 1

type Word[A] = List[A]

def matchRegExp[A](re:RegExp[A], w:Word[A]): Boolean =
w match {
case List() => acceptsEmpty(re)
case c::w1 => matchRegExp(partDeriv(re, c), w1)
}

def acceptsEmpty[A](re:RegExp[A]): Boolean =
re match {
case Phi() => false
case Empty() => true
case Choice(r1,r2) => acceptsEmpty(r1) || acceptsEmpty(r2)
case Seq(r1,r2) => acceptsEmpty(r1) && acceptsEmpty(r2)
case Star(r) => true
case L(_) => false
}

def partDeriv[A](re:RegExp[A], c:A): RegExp[A] = {
(re, c) match {
case (Phi(), _) => Phi()
case (Empty(), _) => Phi()
case (L(c1), c2) if (c1 == c2) => Empty()
case (L(c1), _) => Phi()
case (Choice(r1,r2), c) => Choice(partDeriv(r1,c),partDeriv(r2,c))
case (Seq(r1,r2), c) =>
val rn = Seq(partDeriv(r1,c),r2)
if (acceptsEmpty(r1)) Choice(rn, partDeriv(r2,c)) else rn
case (rs @ Star(r), c) => Seq(partDeriv(r,c),rs)
}
}

abstract class RegExp[A]

case class Phi[A]() extends RegExp[A]
case class Empty[A]() extends RegExp[A]
case class L[A](letter:A) extends RegExp[A]
case class Choice[A](left:RegExp[A], right:RegExp[A]) extends RegExp[A]
case class Seq[A](left:RegExp[A], right:RegExp[A]) extends RegExp[A]
case class Star[A](exp:RegExp[A]) extends RegExp[A]

// part 2

abstract class Pat[A]
// vx is the variable (represented as Int), w the prefix that was
// already seen, r the remaining part we yet have to match
case class PVar[A](v:Int, word:Word[A], re:RegExp[A]) extends Pat[A] {
override def toString = "x"+v+"@"+re.toString
}
case class PPair[A](left:Pat[A], right:Pat[A]) extends Pat[A]
case class PChoice[A](left:Pat[A], right:Pat[A]) extends Pat[A]
case class PatNil[A] extends Pat[A]
case class PatCons[B](hd:B, tl:List[B]) extends Pat[B]

// combinator
def pat_var[A](n: Int, r:RegExp[A]) = PVar(n,List(),r)

// binding
type Env[a] = List[(Int,Word[a])]

def patMatch[A](pat:Pat[A], word:Word[A]):List[Env[A]] = {
(pat,word) match {
case (p, c::w) =>
patMatch(pdPat(pat, c), w)
case (PVar(x,w,r),List()) =>
if (acceptsEmpty(r)) List(List((x,w))) else List()
case (PChoice(p1,p2), List()) =>
patMatch(p1, List()):::patMatch(p2, List())
case (PPair(p1,p2), List()) =>
for(xs <- patMatch(p1, List()); ys <- patMatch(p2, List())) yield xs:::ys
}
}

def longPatMatch[A](pat:Pat[A], word:Word[A]):Option[Env[A]] =
patMatch(pat,word) match {
case env::_ => Some(env)
case _ => None
}

def shortPatMatch[A](pat:Pat[A], word:Word[A]):Option[Env[A]] =
patMatch(pat,word) match {
case List() => None
case xs => Some(xs.last)
}

def pdPat[A](pat:Pat[A],c:A):Pat[A] =
pat match {
case PVar(x, w, r) => PVar(x, w:::List(c), partDeriv(r, c))
case PPair(p1, p2) =>
val pn = PPair(pdPat(p1, c), p2)
if (acceptsEmpty(strip(p1)))
PChoice(pn, PPair(mkEmpPat(p1), pdPat(p2, c)))
else
pn
case PChoice(p1, p2) => PChoice(pdPat(p1, c), pdPat(p2, c))
}

def strip[A](pat:Pat[A]):RegExp[A] =
pat match {
case PVar(_,w,r) => r
case PPair(p1,p2) => Seq(strip(p1),strip(p2))
case PChoice(p1,p2) => Choice (strip(p1),strip(p2))
}

def mkEmpPat[A](pat:Pat[A]):Pat[A] =
pat match {
case PVar(x,w,r) =>
if (acceptsEmpty(r))
PVar(x, w, Empty[A]())
else
PVar(x, w, Phi[A]())
case PPair(p1,p2) => PPair(mkEmpPat(p1), mkEmpPat(p2))
case PChoice(p1,p2) => PChoice (mkEmpPat(p1), mkEmpPat(p2))
}

// example for testing
val the_choice = Choice(Empty(), L('B'))

val ex = PPair(PVar(1, List(), Star(L('A'))), PVar(2, List(), the_choice))

def main(args:Array[String]) {
println(longPatMatch(ex, List('A')))
println(longPatMatch(ex, List('A','B')))
}
}


Oh, did I mention that there is no mutable state involved?

Credit: The functional implementation described here follows Kenny Zhuo Ming Lu and Martin Sulzmann. A toast to them! Several people, including myself, found similar solutions, but they are the heroic ones that implemented and proved it all.

3 comments:

jimmy wilson said...

Like me it will be helpful for others too thanks for sharing

Leather Biker Jacket Clothing

Frandis Charles said...

You have great blogging skills which surely will help me out in my future blogs, Thanks for sharing it keep posting blogs like that.

Hank Moody Jacket Style

Harish Yes said...

Thanks a lot for sharing this article. i Will check back later for more of your posts. packers and movers bangalore packers and movers bangalore marathahalli packers and movers marathahalli packers and movers hsr layout packers and movers btm layout packers and movers bommanahalli packers and movers whitefield packers and movers koramangala packers and movers jp nagar