Fog Creek Software
Discussion Board




I hate regular expressions

I hate regular expressions.

Here are some references for regular expressions:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpgenref/html/cpconregularexpressionslanguageelements.asp

http://www.regular-expressions.info/

http://etext.lib.virginia.edu/helpsheets/regex.html


Couldn't they invent a simple, easy to remember, easy to understand, yet powerful syntax?

That's what people in CS have struggled for many years to invent, and they have succeeded pretty well with languages like Pascal, Java, Modula-2, C#, Python, and so on.

But with regular expressions... no... you have to use ()\_^*+/[]{} in a cryptic "soup of special characters".

A novice in regular expressions looks at such an expression, and doesn't know which characters are special characters, which are escaped characters, etc.

With regular expressions, what you don't know CAN bite you in the behind, and VERY LIKELY WILL, sometimes very badly.


Is there any alternative to regular expressions?

A library which offers the same powerful matching capabilities, but with a sane syntax designed for humans.

Mike
Tuesday, November 25, 2003

You mean

WITH FOO AS VARIABLE TO SUBSTITUTE BEGIN
  IF CONTAINS BAR AT ANY POSITION THEN
    SUBSTITUTE IT WITH "BAZ";
    REPEAT FOR ALL;
  END IF;
END WITH

as opposed to

$foo =~ s/bar/baz/g
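For comparison, the same global substitution in Python's standard re module (the sample string here is mine) sits somewhere between the two extremes:

```python
import re

foo = "bar none, bar again"
foo = re.sub("bar", "baz", foo)  # replace every occurrence of "bar" with "baz"
print(foo)  # baz none, baz again
```

Still terse, but at least the function name says what it does.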



I can imagine that thousands of developers have been waiting for nothing more than a better (because more Pascal-like)
regular expression syntax.

But I don't count myself among them.

Ignore my ignorance
Tuesday, November 25, 2003

I guess it's possible to write a parser that uses real words as regex symbols rather than cryptic backslashes.
But then again, there is the very English-like Pascal and the very, well, nothing-human-like C, and we all know which is the more successful.
I think programmers need to be obscure to feel good about themselves and their skills.
Besides, we are all lazy, and that means we want to type as few characters as we possibly can.
And another thing: if I remember correctly, the use of regexps for search-and-replace tasks was initiated on Unix platforms. The users of those tend to be REALLY cryptic, and enjoy it to the greatest extent.

Eli Golovinsky
Tuesday, November 25, 2003

Hey! I don't mean like that.

I mean something humanly readable.

Let's take this simple "Hello, World" program:

void main(void) {
    printf("Hello, world");
}

What happens if I rewrite it in the spirit of regular expressions?

v m(){
    \p("Hello, world");[]
}

Nice, eh?!

Mike
Tuesday, November 25, 2003

As for C being more successful than other languages in my list, I agree, except for Java.

Let's face it, C was a masterpiece years ago, when you needed fast access to the memory and hardware.

Now, the game is a lot different - Java, Python, and other languages are the winners now.

Also - C does have a readable, easy to understand syntax.

It is very intuitive that blocks are placed between braces.

However, this:

[ab]{3}, which matches "a" or "b" three times in a row, is not intuitive at all.
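To be fair, the cryptic form is at least precise; a quick Python sketch of exactly what [ab]{3} accepts (the anchors are mine, added so the whole string must match):

```python
import re

# [ab]{3}: three characters, each either 'a' or 'b'
pattern = re.compile(r"^[ab]{3}$")

print(bool(pattern.match("aba")))  # True
print(bool(pattern.match("abc")))  # False: 'c' is not in [ab]
print(bool(pattern.match("ab")))   # False: only two characters
```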

Mike
Tuesday, November 25, 2003

How about this:

regular expression:

[ab]{3}

human readable expression:

('a' OR 'b') 3 TIMES

What is wrong with the syntax above, which looks a bit like Forth?

Of course, it isn't as easy for the computer to parse, but it's a lot easier to remember.

Mike
Tuesday, November 25, 2003

How about something more SQL like?

Replace IN foo ALL substring('%bar%') WITH 'baz';

MyVar = Extract From foo substring('%crlf'); //get the first line.

The syntax would still have all those weird wildcards for defining the substrings..

Eric DeBois
Tuesday, November 25, 2003

Good thread.

I'm so used to regexps -- and you guys have probably never even seen the Emacs regexps, which are even wackier -- that I never really thought about alternatives. I know Larry Wall has some thoughts about regexp cleanup for Perl 6, but no serious thoughts of replacing them.

Band-aid replacement: just choose readable names to replace the cryptic symbols.

For example:

^\(abc\)+.foobar$

Would become:

<regexp>
  <start/>
  <group repeat="one-or-more">abc</group>
  <any/>
  <literal>foobar</literal>
  <end/>
</regexp>

I'm sure it could be done, but dunno if it's really any better.
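For reference, in Perl/Python-style syntax the Emacs pattern above loses its backslashes and becomes ^(abc)+.foobar$; a quick Python sketch of what it matches (the test strings are mine):

```python
import re

# Emacs ^\(abc\)+.foobar$ with the escaped group markers as bare parens
pattern = re.compile(r"^(abc)+.foobar$")

print(bool(pattern.match("abcXfoobar")))     # True: one "abc", any char, then "foobar"
print(bool(pattern.match("abcabcXfoobar")))  # True: the group repeats
print(bool(pattern.match("abXfoobar")))      # False: "ab" alone doesn't complete the group
```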

Portabella
Tuesday, November 25, 2003

You would really like perl then.

Malta
Tuesday, November 25, 2003

You can also use s-expressions.  Something like:
(BEGIN-LINE [at-least-one "abc"] Any-char "foobar" END-LINE)

You can add syntactic sugar and tasteful names; this is just a 10-sec example.  The nice thing is that it's a light, flexible syntax.  Text can stay in text, metachars and control stuff can look like symbols, yet it remains readable.

Tayssir John Gabbour
Tuesday, November 25, 2003

BTW, "text can stay in text" meant that you don't have to mix metachars and normal text inside a string.  The two are separated, but not jarringly so.

Tayssir John Gabbour
Tuesday, November 25, 2003

I write a bunch of simple little interpreters for fun (or sometimes at work to make tiny control languages) and I think I know what the original poster wants.  The trouble understanding regular expressions in plain text is the stupid "soup of special characters". 

Think about it.  When we study regular expressions, our textbooks get around this by having special characters that truly *are* special characters.  The star operator, for instance, is typeset as a superscript star, and it is not used in the language itself - it's only a meta-character.  Concatenation is usually that little circle, and union is the big "U"-looking thing.  Also, when they're actually talking about the letter 'x', they'll make it bold or something, so as to distinguish it from when they're talking about the variable x.

The way textbooks do it, it's just much, much easier for us human beings to process.

bob
Tuesday, November 25, 2003

A problem with your "easy to understand" syntax is that it breaks down once you start creating more complex expressions. Once you learn the syntax, such expressions aren't very difficult to create, as you're just stringing together atoms -- [a-z] is an atom, just like a{3} or \s*.

Not only that, but the "easy to understand" syntax gets in the way of people who know what they're doing. Programming languages shouldn't cripple people with experience; such languages wind up going the way of the dinosaur and become toys for people to learn with, but not much else.

I don't mean to be a snob, but we're programmers, aren't we? This kind of stuff is just syntax, and IME any syntax can be learned with enough practice.

Chris Winters
Tuesday, November 25, 2003

The evolution of programming languages, starting with FORTRAN, shows the contrary: people prefer languages that are BOTH easy to understand and concise.

Whenever concision gets in the way of ease of understanding, people, more often than not, choose ease of understanding.

Any CS program worth its salt will tell its students rules like:

A. name your objects, variables and methods well - go for explicit names, and avoid cryptic names

B. write the program in a way which is easy to understand and maintain later

It seems that the designers of regular expressions completely disobeyed rule A above.

Mike
Tuesday, November 25, 2003

If you don't like my 10 second s-expression example, there are easy mitigating strategies.  For example, since you're not mixing up metachars, you can easily create aliases:
(^ [+ "abc"] . "foobar" $)

Even easier, we could use postfix syntax, and really look like old-style regexps:
(^ ["abc" +] . "foobar" $)
instead of:
"^(abc)+.foobar$"

You can keep finding things you don't like about s-expressions, and I can fix them for you.  Old-style regexps mix things up, so I can't fix things as easily.  In Java, I remember usually having to put four "leaning sticks" together just for simple old-style regexps!  I can read them, but that doesn't mean I like masochism.  Sometimes good syntax helps write quality software.

Tayssir John Gabbour
Tuesday, November 25, 2003

Well, ok, the s-expression standard at http://theory.lcs.mit.edu/~rivest/sexp.txt doesn't want me using weird punctuation alone to stand for tokens.  I posted that too quickly, I'll need some spare time to make nice tokens for you.  If it's actually a good thing to do so.

Tayssir John Gabbour
Tuesday, November 25, 2003

Ok, my lisp environment was fine with everything except '.' .  So we can call it 'Any' or replace it with another char or small char sequence.  The other ones (*,^,$,+) were absolutely fine.

Tayssir John Gabbour
Tuesday, November 25, 2003

I love REs. You just have to learn the language; then you are set. It's worth the effort.

Sometimes we change the river.
Sometimes the river changes us.

son of parnas
Tuesday, November 25, 2003

I think regular expressions are OK as they are, and I don't think I have the "survivor" mentality of someone who wants to exclude others based on obscure knowledge.

Here's why: Regexps are cryptic because they pack a tremendous amount of meaning into a few characters. If regular expressions were easy to understand, they would be very verbose. They are a cipher for a specialized form of algorithm expression - they are a sort of programming language. 

How many end users have the need to search for certain sets of characters or exclusions in certain positions of incoming character strings? Not very many. If this were a widespread need, I think that a "friendlier" notation would have emerged over the years.

Regexp syntax is reasonably standardized and is very, very old. That's a big deal. That alone implies that there are thousands if not millions of examples and many programming languages and applications that use regular expressions.

The simplest form of a regular expression is simply the text being searched for (modulo escaping of the 'special' characters, of course).
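That escaping is mechanical enough that libraries automate it; in Python, for instance, re.escape turns arbitrary literal text into a safe pattern (the sample strings here are mine):

```python
import re

query = "cost (USD): 3.50+"  # literal text full of regex metacharacters
pattern = re.escape(query)   # backslash-escapes ( ) . + and friends

print(bool(re.search(pattern, "total cost (USD): 3.50+ per unit")))  # True
```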

I just don't see the problem. The only "gap" I see is that nobody has ever developed tools that can translate regexps into a verbal statement of the expression's meaning. This would be useful as a debugging tool.

Bored Bystander
Tuesday, November 25, 2003

http://www.scsh.net/docu/post/sre.html

proposes a syntax for regular expressions using s-expressions (the things out of which programs are built in Lisp-like languages). Instead of [ab]{3} you'd have

    (= 3 (or "a" "b"))

or

    (= 3 ("ab"))

The problem this solves isn't exactly the problem the original poster complained about: the notation is still cryptic at first sight. (Especially if you aren't familiar with parenthesized-prefix notations to begin with.) However, the *structure* is more apparent, which in my experience is a more important problem with traditional REs. It really doesn't take very long to learn what all the metacharacters mean, but disentangling a complicated RE is painful. A bit like reading one of those half-page sentences some 19th-century authors were keen on: the words aren't a problem, but it taxes your concentration and short-term memory more than it feels like it should :-).

Gareth McCaughan
Tuesday, November 25, 2003

BB: while reading this thread, I thought it might be an interesting exercise to write a GUI tool for regular expressions.  Not wanting to reinvent the wheel, I looked for similar programs and found quite a few.

A few of them that look interesting are at:
http://www.sellsbrothers.com/tools/#regexd
http://weitz.de/regex-coach/

Also, ActiveState's Komodo IDE evidently has a "Regular Expression (Rx) Toolkit" that has a GUI interface. http://www.activestate.com/Products/Komodo/more_information.plex

RocketJeff
Tuesday, November 25, 2003

Mike,

I used to do a lot of regular expression work. The syntax for regular expressions may be hard to read at first, but after some experience with them I found them easy to work with.

Before you can work effectively with regular expressions, you need to read the instructions.

In mathematics, would you prefer to replace '+' with PLUS and '-' with MINUS just so you don't need to know math symbols?

Regular expressions have a tremendous amount of power due to all of the special operators. These operators aren't that hard to remember, and they make reading regular expressions much easier for people who are familiar with the rules.

NathanJ
Tuesday, November 25, 2003

Try the "Regulator" by Roy Osherove...

http://weblogs.asp.net/rosherove/posts/33126.aspx

Kentasy
Tuesday, November 25, 2003

I think that the regular expression syntax is very natural from the point of view of the person implementing the regex interpreter.  I think that the special operators ("[],(),{},+,*,.,$,etc") are so short exactly because they're intended to be used so frequently.

It's like that in more 'traditional' areas of mathematics too.  For example, instead of saying "d/dx" you could say "lim dx -> 0 ((f(x + dx) - f(x)) / dx)".  But that uses special operators too, and if you want to make it look like English it winds up being a very long paragraph involving epsilons and deltas and so on.

I don't think that something like (matches "abcdefefef" (list (match-literal "abcd") (match-1-n "ef"))) is simpler and more 'natural' than "abcd(ef)+", and it certainly wouldn't make the maintenance job easier!
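For the record, the terse version does exactly what it says; a quick Python check of K's example:

```python
import re

# "abcd" followed by one or more repetitions of "ef"
print(bool(re.fullmatch(r"abcd(ef)+", "abcdefefef")))  # True
print(bool(re.fullmatch(r"abcd(ef)+", "abcd")))        # False: (ef)+ needs at least one "ef"
```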

K
Tuesday, November 25, 2003

A lot of times, for me at least, all that's really missing is a sentence or two of comments describing what some regular expression is supposed to do, to set the context for interpreting the expression.  I also like certain syntax variations that allow the writer of a regular expression to group sub-expressions in the same way we use parentheses to group mathematical subexpressions, even when it's not strictly necessary.

TTT
Tuesday, November 25, 2003

REBOL uses grammars (similar to EBNF) instead of REs, so it's handy for quick, cross-platform and cross-network parse operations.

For example, the following easily extracts text - it prints out all of the HREF links on the JOS home page in the REBOL shell console:

page: read http://www.joelonsoftware.com

parse/all page [any
  [thru {href="} copy data to {"} (print data)]
]
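For comparison, the regex version of the same extraction in Python would be something like the following (the URL fetch is omitted, and the sample HTML string is mine):

```python
import re

# stand-in for the page contents; read the real URL in practice
page = '<a href="http://example.com/a">A</a> <a href="/b.html">B</a>'

# same idea as the parse rule: skip to href=", copy up to the closing quote
for link in re.findall(r'href="([^"]*)"', page):
    print(link)
```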

So that's a simple example of readable data extraction. You can also change or modify the matched data; just put any code operations in the parens. Docs on REBOL's parse dialect are found at:
http://www.rebol.com/docs/core23/rebolcore-15.html

REBOL may not meet RE's full pattern-matching power, but it is a quick and readable alternative.

One strength of the grammar-based parsing approach is that you can create a domain-specific dialect that better describes a problem domain, such as:

pub all new .htm to my blog
every 3 hr check JOS for update
download all pdf at foo.com
show all stocks in portfolio foo +- 5% change
list names in db 1 - 3 where email ends in @yahoo.com

Most of us are good at using programming languages, but perhaps not very experienced (including myself) at designing our own dialects. It's an idea with possibilities, though.

Edoc
Tuesday, November 25, 2003

Just script up a parser plugin in your IDE to color-code regexes so that your escaped characters are visibly distinct from your functional characters. No biggie.

Tony Chang
Tuesday, November 25, 2003

K made a good point, which I agree with.
As I see it, the only problem with regexes as they currently stand is that they are unreadable, and the only reason they are unreadable is the necessary use of escape characters. Colorizing them helps a lot.

Tony Chang
Tuesday, November 25, 2003

What about optimization?  Most non-trivial regular expressions can be written more than one way.  Generally I've seen that engines don't do a particularly good job of optimizing the regexp before execution, and a tool that finds the most efficient form of a given regexp would be very handy.

MR
Tuesday, November 25, 2003

So come up with something better and publish a white paper.

www.MarkTAW.com
Tuesday, November 25, 2003

Well if you're going to go the route of syntax coloring, you might as well use an extended character set to represent special operators instead.  The IDE could do the mapping between your special character set and ASCII (so you don't have to rewrite the tokenizer for your language).

The comment about Rebol is valuable I think, but won't really work as a replacement for regular expressions.  EBNF parser-generators (which is essentially what he's describing) build up structure from tokens that are defined by regular expressions at the lowest level, and they use regular expression idioms for defining token relationships.

For example:

expression ::= (number operator)+ number
operator ::= [+\-*/]
number ::= [0-9]+

(for a simple grammar that allows you to read flat arithmetic expressions containing integers)
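As a sketch of how those rules bottom out in regexes, here is the token layer of that grammar in Python (the names and the anchoring are mine, not from any particular parser generator):

```python
import re

NUMBER = r"[0-9]+"
OPERATOR = r"[+\-*/]"
# expression ::= (number operator)+ number
EXPRESSION = re.compile(rf"(?:{NUMBER}{OPERATOR})+{NUMBER}$")

print(bool(EXPRESSION.match("1+2*34")))  # True: number op number op number
print(bool(EXPRESSION.match("1+")))      # False: no final number
```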

For C++ programmers, check out the Spirit parser framework (part of the boost library now) to do that kind of thing (without Rebol semantics).

K
Tuesday, November 25, 2003

Well, in Perl there is a flag that can add whitespace and comments to regular expressions. Not too many people use it unless they are really aiming for readability.
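Python inherited the same idea as re.VERBOSE, which makes whitespace and # comments insignificant inside the pattern; a small sketch (the phone-number pattern is just an illustration):

```python
import re

phone = re.compile(r"""
    (\d{3})   # area code
    [-\s]?    # optional separator
    (\d{4})   # line number
    """, re.VERBOSE)

m = phone.match("555-0199")
print(m.groups())  # ('555', '0199')
```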

In any case, I agree that regular expressions are hard to read, but I think a more verbose syntax would make programmers very annoyed. It's like asking a programmer to write the COBOL:

add x to y giving z

instead of the C:

z=x+y;

This reminds me in a way of the Icon string-matching constructs, which were like regular expressions only with very explicit operations (look for this character, look for the second occurrence of this string, etc.).

Shlomi Fish
Tuesday, November 25, 2003

Maybe I've missed K's point, but I don't see why one couldn't use the often simpler grammars approach in cases where readability/maintainability is important.

I'm no REBOL guru, yet I am able to do in Rebol much of the garden-variety web work (pattern matching, search & replace, etc.) that would normally send me into Perl territory-- and this is no dig against Perl.

At the end of the day, I only care about whether or not I can get what I need done in a manner that suits my needs. If it's an RE approach great, if it's another that's swell too.

Edoc
Tuesday, November 25, 2003

For Perl 6, they are working on an entirely new, more powerful, and more readable regex syntax:

http://www.perl.com/pub/a/2002/06/04/apo5.html?page=2

It all seems like good ideas to me.

Almost Anonymous
Tuesday, November 25, 2003

Were regular expressions designed for use in code? I'm not sure. Their conciseness and poor readability implies they weren't. But for doing searches (or, better yet, search-and-replace) in text editors or via grep, this conciseness is a definite virtue.

Insert half smiley here.
Tuesday, November 25, 2003

I love regexps... I don't fully understand them, and they're always surprising me, but I love them.

Jack of all
Tuesday, November 25, 2003

Hi,

I wrote an article comparing Rebol Parse rules and Regular expressions here: http://www.compkarori.com/vanilla/display/PARSE-Versus-Regexs

I would be glad if anyone added his notes.

Ladislav
Wednesday, November 26, 2003

The second worst problem I have with regular expressions is the implementation inconsistencies - for example, a regexp ^(a?|b*)$

perl: just works.

DreamWeaver will probably not match it, for no reason that I can discern. Some things will match, giving me enough confidence to try something a little more complex; then it just spits the dummy and won't match anything more. At which point I copy the regexp and do it with perl.

grep ... *won't* do what you'd expect: it doesn't know about '|'; you need egrep for that. (Guess which grep is programmed into my fingers, though.)

And Emacs. Well. In Emacs that would be written as the much more sightly ^\(a?\|b*\)$. Also, Emacs has had a tendency to match possibly-empty patterns twice - though I can't seem to reproduce that in my current version (21.2.1).

There also don't seem to be consistent rules for control characters - I've yet to work out how to match a tab character in sed.
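In Python's re, at least, the control-character escapes mirror the string-literal ones, so a tab is just \t (a sketch; this says nothing about sed's rules):

```python
import re

line = "name\tvalue"
parts = re.split(r"\t", line)  # \t means a literal tab in the pattern
print(parts)  # ['name', 'value']
```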

The worst problem I have with regexps? Reading them. This is mitigated by commenting the hell out of them.

Still - they're better than any alternative I've seen.

Steve P
Wednesday, November 26, 2003
