Fog Creek Software
Discussion Board




Non-halting regular expressions?

I was using regular expressions to parse out certain patterns in a large body of code.  When the program hit a particular module, it would just hang as if it was stuck in an infinite loop.

After lots of frustration I eventually traced it to a call to the find() method in Sun's java.util.regex.Matcher class where it was hanging. It was nothing to do with my code.

The string that made the regex get stuck was only about 1K long, and the regex wasn't particularly long or complex. It did have back references, which can cause the processing to take longer, but it still got stuck after I removed the back references.

Ultimately I had to change the regex to stop it from getting stuck.  The modified regex was a bit less exacting that what I really wanted, so I had to add some code to do look at the groups to make sure it matched what I really wanted.

From automata theory I remember that all context-free languages are decidable, which would include regular expressions.  Although the modern-day regex classes probably do provide matching abilities beyond what a mathematically strict regular expression can do.

Are there some combination of strings and regexes that would legitimately fail to halt in a modern-day regex utility? Or is it a bug in Sun's Matcher class?  Are there are just some strings that can send the regex object into an exponential or high-polynomial loop, even without back references?

T. Norman
Saturday, June 19, 2004

Perhaps give it a try in the regex coach: http://www.weitz.de/regex-coach/

Rhys Keepence
Saturday, June 19, 2004

What are you parsing? I've seen many people try to use a regular expression when it would be more appropriate to use a parsing engine, they're different tools for different tasks.

Tom H
Saturday, June 19, 2004

Sounds like a bug.


Saturday, June 19, 2004

"I've seen many people try to use a regular expression when it would be more appropriate to use a parsing engine, they're different tools for different tasks."

That's beside the point. The point is whether you can write a regex with today's regex engines and provide it with a short string (under 2K) that it would cause it to loop forever or for some exponentially long time, with such behavior being a necessary mathematical limitation of the engines, rather than being a bug.

T. Norman
Saturday, June 19, 2004

I don't know how that particular regular expression implementation works - but if it's the process of building an NFA from the regular expression and then constructing a DFA to process the text with, you can *in theory* get a very big DFA, as the states in the latter belong to the power set of the former. 

So perhaps the NFA to DFA conversion was wallowing in a lot of states (and memory to allocate them in). Have a look at the Dragon book if you want to persue this possibility.

This rarely ever happens in practice, and I would believe that most industrial strength implementations would detect this and either report and error or switch to on-the-fly NFA matching (again, see the Dragon book). Any Java implementors care to comment?

Regards,

Gerard
Saturday, June 19, 2004

In this case the Pattern and Matcher were built successfully, and it didn't have a problem when other strings were passed to the Matcher.  It just locked up when a particular string was passed to it.

T. Norman
Saturday, June 19, 2004

Try checking memory usage as well as time.

Also try feeding in initial segments of the string to find where you get the infinite loop (or exponential time).

Pathological cases are known.

Why don't you post the regexp and maybe someone can spot the problem and or suggest a different way to write the expression.

frustrated
Saturday, June 19, 2004

Or, for that matter, you could give your example to Sun so they might fix it in a later release.

Julian
Saturday, June 19, 2004

Most regex implementations today do not build a DFA / NFA -- especially those that offer backreferences (which are not "regular" at all).

And because they do NOT build DFAs and NFAs, it's very simple to construct pathological cases - e.g., "((a*)(a*))+b" can take exponentially long to decide that aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac is not in the language using the matching techniques commonly in use.

Ori Berger
Sunday, June 20, 2004

Ori Berger:

"Most regex implementations today do not build a DFA / NFA -- especially those that offer backreferences (which are not "regular" at all)."

Tell us more...

Cheers,

Gerard
Sunday, June 20, 2004

Ori Berger, in the words of Donald Trump ...

You're Hired!

The expression that gave trouble was very much like what you described. It contained two sub-expressions together that were similar, but with one that would match a subset of the other.  The real thing was more complex and specific, but in essence it had something like this:

(foo[a-z]+)(foo[a-z0-9]+)

With the first being a subset of the other, the "greedy" matching mechanism probably caused them to eat up each other, which caused an exponential loop to backtrack and resolve the conflicts.

The problem was resolved by removing the first one and then a couple extra lines of code to look at the eventual match afterwards to see if it satisfied what I was looking for.

T. Norman
Sunday, June 20, 2004

T.Norman wrote:

"(foo[a-z]+)(foo[a-z0-9]+)

With the first being a subset of the other, the "greedy" matching mechanism probably caused them to eat up each other, which caused an exponential loop to backtrack and resolve the conflicts."

Problem solved (and a tip of the hat to Ori Berger), but for the benefit of my own education, how does this cause a problem?

Am I right in thinking that current regex implementations are using backtracking with successive expansion on encountering a Kleene closure and suchlike - or even doing character-by-character LL parsing with backtracking?

Or what, exactly?

Cheers,

Gerard
Monday, June 21, 2004

See "Mastering Regular Expressions", 2nd Ed, Jeffrey Friedl, O'Reilly.

Explains in complete detail how regex engines work, how to craft a good regex, and how to avoid cases that can loop forever.

Ian
Monday, June 21, 2004

Ian,

Thanks for the book reference - I'll keep a lookout for it. To be honest, I'll take a peek at the implementation of Boost's regex class template in the meantime, now my curiosity is piqued.

Regards,

Gerard
Tuesday, June 22, 2004

*  Recent Topics

*  Fog Creek Home