Friday, December 7, 2007

Regexp but no state machine

Let me tell you a story about pure functional programming and the Scala language.

Functional programming led to amazingly beautiful things like the
Curry-Howard correspondence, MapReduce and the lift web framework.

I want to show you a simple program that matches regular expressions (regexps), which demonstrates rather nicely what makes functional programming so appealing. You probably know what regexps are and that they can be turned into state machines and so on...

Let's forget about that for a moment. Let us start from scratch.

Well, not entirely from scratch. In the language of theory, we say that a *word* w is either the empty word, a single *letter* c, or the concatenation of two words w1, w2. A regular expression then describes a set of words that all have a particular structure in common.

In order to write down regexps in Scala, we need to define the building blocks. And since we would like to keep the door open to matching things other than strings of characters, these building blocks should be generic. Here is the code.


abstract class RegExp[A]

case class Phi[A]() extends RegExp[A]
case class Empty[A]() extends RegExp[A]
case class L[A](letter:A) extends RegExp[A]
case class Choice[A](left:RegExp[A], right:RegExp[A]) extends RegExp[A]
case class Seq[A](left:RegExp[A], right:RegExp[A]) extends RegExp[A]
case class Star[A](exp:RegExp[A]) extends RegExp[A]


We can now directly write things like Choice(L('a'),Star(L('b'))). Here is what these expressions mean:

Phi matches nothing
Empty matches the empty word
L(c) matches the single letter c
Seq(r1,r2) matches all words w that can be split into w1,w2 such that w1 matches r1 and w2 matches r2
Choice(r1,r2) matches all words w that match either r1 or r2 (or both)
Star(r) matches all words w1,...,wn (including the empty word, for n = 0) where each of the wi matches r.
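
Just to illustrate, here are a couple of expressions built from these constructors, together with the words they describe (the value names are mine, added for this example):


// 'a', or any number (possibly zero) of 'b's
val aOrBs: RegExp[Char] = Choice(L('a'), Star(L('b')))

// "", "ab", "abab", ...
val abStar: RegExp[Char] = Star(Seq(L('a'), L('b')))

// thanks to the type parameter, letters need not be characters:
// this matches exactly the two-element word 1,2
val oneTwo: RegExp[Int] = Seq(L(1), L(2))
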

Now how can we write code that follows these definitions? Our first try might be a recursive function accepts that takes a regexp and a word and returns a boolean. Whenever a regexp contains another regexp, we could then call accepts recursively, with suitable new input. Most rules seem easy: for Choice we can try to match the input against r1 and, if that doesn't work out, try r2. However, for Seq and Star we would have to cut the word into pieces and match the pieces against the regexps, possibly backtracking and trying out other ways to cut the word into pieces. That is quite inefficient, as the sketch below shows.
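
To see where the cost comes from, here is a rough sketch of such a naive backtracking matcher, written against the case classes above (this sketch is mine; the original post does not spell the naive version out):


// A naive backtracking matcher. Seq and Star have to try every way of
// splitting the word, which leads to exponential behaviour in the worst case.
def naiveAccepts[A](re: RegExp[A], w: List[A]): Boolean =
  re match {
    case Phi()          => false
    case Empty()        => w.isEmpty
    case L(c)           => w == List(c)
    case Choice(r1, r2) => naiveAccepts(r1, w) || naiveAccepts(r2, w)
    case Seq(r1, r2)    =>
      // try every cut of w into a prefix for r1 and a suffix for r2
      (0 to w.length).exists(i =>
        naiveAccepts(r1, w.take(i)) && naiveAccepts(r2, w.drop(i)))
    case Star(r)        =>
      // the empty word, or a non-empty prefix matching r followed by more Star(r)
      w.isEmpty ||
      (1 to w.length).exists(i =>
        naiveAccepts(r, w.take(i)) && naiveAccepts(Star(r), w.drop(i)))
  }
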

It turns out there is a better way: a word can obviously only match if each of its letters is matched by some part of the regular expression. So why don't we take each letter and try to "consume" it with some part of the regular expression? If we can play this game until no letters are left, the word is accepted. The key to making this work is to represent the "remaining" expression. But hey, don't we have everything there to represent expressions? We can just build the remaining expression as we go.

Here is the full program. Check out the partDeriv function, which computes the 'derivative' of a regular expression with respect to an input letter.

object Main {
  abstract class RegExp[A]
  case class Phi[A]() extends RegExp[A]
  case class Empty[A]() extends RegExp[A]
  case class L[A](letter:A) extends RegExp[A]
  case class Choice[A](left:RegExp[A], right:RegExp[A]) extends RegExp[A]
  case class Seq[A](left:RegExp[A], right:RegExp[A]) extends RegExp[A]
  case class Star[A](exp:RegExp[A]) extends RegExp[A]

  type Word[A] = List[A]

  // consume the word letter by letter; accept if the remaining expression
  // accepts the empty word once all letters are gone
  def matchRegExp[A](re:RegExp[A], w:Word[A]): Boolean =
    w match {
      case List() => acceptsEmpty(re)
      case c::w1  => matchRegExp(partDeriv(re, c), w1)
    }

  // does this expression accept the empty word?
  def acceptsEmpty[A](re:RegExp[A]): Boolean =
    re match {
      case Phi()         => false
      case Empty()       => true
      case Choice(r1,r2) => acceptsEmpty(r1) || acceptsEmpty(r2)
      case Seq(r1,r2)    => acceptsEmpty(r1) && acceptsEmpty(r2)
      case Star(r)       => true
      case L(_)          => false
    }

  // the 'derivative' of re with respect to the letter c: an expression that
  // matches exactly the remainders of words matched by re that start with c
  def partDeriv[A](re:RegExp[A], c:A): RegExp[A] =
    (re, c) match {
      case (Phi(), _)                => Phi()
      case (Empty(), _)              => Phi()
      case (L(c1), c2) if (c1 == c2) => Empty()
      case (L(c1), _)                => Phi()
      case (Choice(r1,r2), c)        => Choice(partDeriv(r1,c), partDeriv(r2,c))
      case (Seq(r1,r2), c)           =>
        val rn = Seq(partDeriv(r1,c), r2)
        if (acceptsEmpty(r1)) Choice(rn, partDeriv(r2,c)) else rn
      case (rs @ Star(r), c)         => Seq(partDeriv(r,c), rs)
    }

  def main(args:Array[String]): Unit = {
    println(matchRegExp(Choice(L('a'),Star(L('b'))), List('b','b','b')))
  }
}


What do you know? Pattern matching and recursive calls. Not a single imperative update. It doesn't even use anonymous functions. The program looks remarkably similar in Haskell.

I could now go and try to explain why the partDeriv function does what it does, but you wouldn't learn anything from it. There are two things that help in understanding why this works: one is calculating what the function does "by hand" on sample inputs (feel free to insert debug print statements and run it if you are too lazy for that). And, if that is not enough, read the theorem and proof in the 1964 paper by Janusz A. Brzozowski, "Derivatives of Regular Expressions", JACM 11:4, pp. 481-494.
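
For example, here is how the first derivative step works out for the sample input from main (a trace I did by hand, suitable for pasting into the REPL after compiling Main; the intermediate expression is exactly what partDeriv returns, without any simplification):


import Main._

val r  = Choice(L('a'), Star(L('b')))   // "a", or any number of 'b's
val r1 = partDeriv(r, 'b')
// r1 == Choice(Phi(), Seq(Empty(), Star(L('b')))):
// the L('a') branch is now dead (Phi), the Star branch has consumed one 'b'
val r2 = partDeriv(r1, 'b')
val r3 = partDeriv(r2, 'b')
println(acceptsEmpty(r3))               // prints true: "bbb" is accepted
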

What it comes down to is that some computational problems have a structure to them that permits elegant solutions. David Pollak summed it up nicely when he said about Scala, "what it does is, it reduces code to its essence". Not every problem will have an elegant purely functional solution, but conciseness and clarity have their place in every software engineering project. Now enjoy finding the essence of your code.

8 comments:

David Pollak said...

How does this compare to the O(n) palindrome search described in http://johanjeuring.blogspot.com/2007/08/finding-palindromes.html ?

Andrew Dalke said...

I first came across derivatives of regular expressions when I learned about RELAX-NG. It makes sense for XML schemas because they are relatively simple.

Modern regexp engines are more complicated, and are well-known for doing things that cannot be expressed as a regular expression. One example is backreferences, which make it possible to match strings which are not of prime length. I don't think that's possible here.

Other difficulties are: support for different locales at run-time (\w is locale dependent), warnings/errors for cases like "(a*)*" which can potentially match the empty string forever, and the distinction between greedy and non-greedy matches.

There are also performance issues: when searching for a regexp match somewhere in a string, if the regexp starts with a constant text string, the implementation could use a specialized sublinear string search algorithm.

I think you'll even find that group capture is hard to do using this approach. Consider matching with "(.*A.*)(.*B.*)(.*C.*)". I think, and this is only through intuition, that you'll end up having to keep track of all possible match groups using some sort of tree structure. When you consume 1 character at a time you don't know until the end which of the possible solutions to "xAxxCxBxxAxxBxCCCBA" will match.

So while you're absolutely right that regular expressions are easy to interpret this way, the modern use of "regular expressions" (which aren't regular) is not.

Burak Emir said...

Andrew, since the problem described above is matching, it is going to be hard for me to convince you that it is in fact possible to do better than all the backreferences, groups and whatnot from the Emacs/Posix heritage. For that, one would define regexp patterns (that contain variables). XHaskell is all that - which I find more elegant than the approaches to variable binding in regexp patterns that you mention.

BTW, you can safely try matching Star(Star(L('A'))) against empty or non-empty strings; it won't loop.

Neel Krishnaswami said...

Hi Burak,

In Brzozowski's paper, he notes that if you quotient by the idempotence, commutativity, and associativity of Choice, then there are a finite number of derivatives. This lets you see his algorithm as lazy DFA construction.

Now, this is a little painful to code up directly in ML, because the natural way of encoding idempotent/commutative/associative choice is with a set, and that means that you need a mutually recursive type between the definition of a regexp and sets of regexps, so that Choice can have the type choice : set(regexp) -> regexp.

That is, we want a kind of type/Set module recursion that's difficult to express in ML.

This seems like a natural fit for Scala, though, and I'd be interested in seeing how to do it.

Andrew Dalke said...

I looked at the XHaskell reference you mentioned. Everything I read there agrees with my statement that this approach works for regular expressions but does not hold for the non-regular extensions present in most modern (perl5-like) engines.

It's easy to convince me. I made a very clear statement that the regexp pattern that matches strings of non-prime length is not possible with this approach. Implement it, along with the matched substring with length equal to a factor of the full string's length, and I'm sold.

Burak Emir said...

Andrew, I tried to be careful in mentioning which issues can be addressed using functional style regexp matching. I did not intend to imply that non-regular matching (like Perl) would be possible with derivatives - I admit I never thought much about how one could solve this.

I do appreciate that you seem to be saying that regexp matching happens in context, that existing solutions have evolved to accommodate all sorts of context (search and replace in text, for instance), and that applications typically need matching functionality that is beyond regular.

But I am not aiming to convince anybody that they should do text operations with the algorithm above. The post is not about applications at all, but about expressing things.

It's like you are saying, cars have four wheels, your "car" only has two. But I happen to like motorbikes, even if driving in the rain is inconvenient. In sunny weather, it is more fun to ride a bike. Plus you can pass between the cars in stop-and-go traffic :)

I am nevertheless certain that throwing math at more practical examples of perl-like text processing may help to solve those problems elegantly (derivatives for LL parsing?), too. I am concerned with the proper way to express things, not with the fastest possible implementation. The proper way for prime-length matching may just be a program in a general-purpose language that calls isPrime...

Burak Emir said...

hey neelk, the Berry-Sethi construction (also called position automata, or Glushkov automata) is somewhat related to the idea of derivatives (continuations through the expression); I actually use that in a stateful translation of regexps. Have a look in the sources of the Scala standard library, under scala.util :) All the fanciness of Scala's type system is used in order to formulate the algorithm without committing to a particular type of "letters".
