Fog Creek Software
Discussion Board




Joel on Software

Regex Question

I have text like the following:

begin 1
...
begin 2
...
end 2
...
begin 3
end 3
...
end 1

I want to generate HTML DIVs that obey the nesting, so I need to match each begin tag with its correct end tag. I guess what I'm looking for is the ability to use $1 in the match, like:

text = Regex.Replace(text, @"begin (\d{1,5})(.*?)end $1", "<div>$2</div>");

The idea is that the match on the "end" part has knowledge about the match on the "begin" part.
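A note on syntax for the idea above: in .NET patterns a backreference to the first group is written \1, while $1 is only understood in the replacement string. The following is a minimal sketch, not a tested solution for real data. It assumes each block number is used only once, and WrapNumberedBlocks is a hypothetical helper name. One Replace call rewrites only the outermost pairs it finds, so the sketch repeats the call until the text stops changing.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: rewrites "begin N ... end N" pairs as nested <div>s.
// Assumes every block number appears on exactly one begin/end pair.
static string WrapNumberedBlocks(string text)
{
    // \1 (not $1) refers back to the number captured after "begin".
    // \b keeps "end 1" from matching the start of "end 12".
    // Singleline lets "." run across line breaks.
    var pair = new Regex(@"\bbegin (\d{1,5})\b(.*?)\bend \1\b", RegexOptions.Singleline);

    string previous;
    do
    {
        previous = text;
        // Each pass rewrites only the outermost pairs still present,
        // so keep going until a pass changes nothing.
        text = pair.Replace(text, "<div>$2</div>");
    } while (text != previous);

    return text;
}

Console.WriteLine(WrapNumberedBlocks("begin 1\nA\nbegin 2\nB\nend 2\nend 1"));
```

If numbers can repeat inside a block of the same number, the lazy match will pair the wrong tags, and a stack-based approach is the safer choice.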

steved
Thursday, March 18, 2004

As far as I know you cannot do that with a RegExp.

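In the end, a classical regular expression describes what is called a 'regular language', which is weaker than a 'context-free grammar': the matcher is a finite automaton with no memory of how many constructs are currently open, so it cannot pair each begin with its own end at arbitrary nesting depth.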

I may be wrong, but if that is the case, would someone please enlighten me so I can go back to school and sue my languages and automata teacher.

I think you will have to build a small parser that uses a stack to keep track of the level of nested constructs it has found.
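The stack idea can be sketched roughly as follows. This is only an illustration under some assumptions: each tag sits on its own line in the form "begin N" or "end N", and ConvertToDivs is a hypothetical helper name, not something from a library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: turns "begin N" / "end N" lines into nested <div>s.
// Assumes every tag is alone on its line.
static string ConvertToDivs(string text)
{
    var open = new Stack<string>();      // numbers of the blocks still open
    var output = new List<string>();
    var tag = new Regex(@"^\s*(begin|end) (\d{1,5})\s*$");

    foreach (string line in text.Split('\n'))
    {
        Match m = tag.Match(line);
        if (!m.Success)
        {
            output.Add(line);            // ordinary content passes through
            continue;
        }

        string number = m.Groups[2].Value;
        if (m.Groups[1].Value == "begin")
        {
            open.Push(number);
            output.Add("<div>");
        }
        else
        {
            // An "end" must close the most recently opened block.
            if (open.Count == 0 || open.Peek() != number)
                throw new FormatException("Unexpected end " + number);
            open.Pop();
            output.Add("</div>");
        }
    }

    if (open.Count > 0)
        throw new FormatException("Missing end " + open.Peek());
    return string.Join("\n", output);
}

Console.WriteLine(ConvertToDivs("begin 1\nA\nbegin 2\nB\nend 2\nend 1"));
```

Unlike a pure regex replacement, this version can report mismatched or missing end tags instead of silently leaving them in the output.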

.NET Developer
Friday, March 19, 2004

This is similar to the classic problem of matching balanced parentheses -- such as matching the parenthetical groups in "( () ( () ) )."

.NET Developer is right that you normally can't do this with regexes. However, the .NET Framework regex library adds support for this through a construct it calls a balancing group definition. This should allow you to match something like

begin

  begin
  end

  begin
    begin
    end
  end

end

You could also capture the trailing numbers for each group.

The MSDN documentation on this feature is pretty sketchy, but it's covered in Dan Appleman's ebook "Regular Expressions with .Net" (pp. 44-50).
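For readers without the ebook, here is a rough sketch of what a balancing group pattern for begin/end might look like. It is an illustration rather than the ebook's example: the group name depth and the sample text are made up, and the pattern only checks that begins and ends balance, not that their numbers agree.

```csharp
using System;
using System.Text.RegularExpressions;

var balanced = new Regex(@"
    \bbegin\b                        # the outermost begin
    (?>
        (?<depth> \bbegin\b )        # nested begin: push onto the depth stack
      | (?<-depth> \bend\b )         # nested end: pop, fails when the stack is empty
      | (?! \bbegin\b | \bend\b ) .  # any other character
    )*
    (?(depth)(?!))                   # refuse to match while a begin is still open
    \bend\b                          # the end that closes the outermost begin
    ", RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);

string text = "x begin begin end begin begin end end end y";
Match m = balanced.Match(text);
Console.WriteLine(m.Success ? m.Value : "no balanced block");
```

The match covers the text from the first begin through the end that closes it, leaving the surrounding x and y out. Converting the inner blocks would still take a second step, such as applying the pattern again to the matched content.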

Robert Jacobson
Sunday, March 21, 2004

Luckily, the answer to your question is in the sample chapter of O'Reilly's regex book:
http://www.oreilly.com/catalog/regex2/chapter/ch09.pdf

Duncan Smart
Monday, March 22, 2004
