Fog Creek Software
Discussion Board




Parser generators--does anyone use them?

If you had to write some general-purpose parsing code, code to, say, parse some MIME fields or an *.ini file into memory, would you go ahead and use a parser generator like Bison, or would you do it all by hand?

Does the language matter? I.e., would you use a parser generator with C or C++ but not with Perl?

Jesus to Kimo: Take Steroids
Saturday, July 31, 2004

Use them?  I make them! :)

Yes they're still important even if you use Perl.  Look on CPAN and you'll find plenty of LL(k), LALR, and General LR parser generator projects.  Perl makes it easy for the programmer to define regular grammars (ie: in regular expressions), but as anybody will tell you, the set of languages defined by regular grammars is a proper subset of the languages defined by the context-free grammars.

For very simple text formats (like Lisp S-expressions), I typically implement a parser in a compact finite-state automaton (or push-down automaton) -- and I've got a simple framework for making those too.  For parsing programming languages or set constraints (e.g.: SQL), a CFG-based parser is invaluable.

If you're using C++, you might try boost's LL(k) parser generator library ("Spirit").  If the grammar you have in mind is inseparably left-recursive or ambiguous, you'll need to look at other types of parser generators.  The General LR parsers (e.g.: Earley, Tomita) will parse any CFG but they can be less efficient than necessary.  I've written a version of Earley's parser as my 'catch-all' parser and am working on some grammar analysis stuff to conditionally return LL(k) or LALR parsers if possible.

Kalani
Saturday, July 31, 2004


For a think like INI files, i wouldn't bother with YACC; i guess it is harder to debug YACC rules (shift reduct conflicts; insert proper errror handling; bother with C languager bindings etc..) than to write some simple parsing code.

For more complex grammars there is Visual Parse++ - an IDE that makes debugging of rewriting/production rules easy.

Michael Moser
Saturday, July 31, 2004

We have used boost "spirit" extensivly, it's a great tool.

Craig
Sunday, August 01, 2004

All the time.

Simple files -- throw it through a suitable Lexer. More complex stuff I use YACC for.

I've got all the fixes sorted out for putting multiple parsers into the same object code and so on... Have a whole bundle of C++ classes for supporting the parser so the action parts of the statements are minimal. Once you've got a good toolkit to work with, it's all superb.

Oh -- the YACC errors get a lot easier to fix with experience, BTW. You end up with a sort of mental model of what sets it off whinging and you can build the grammar to avoid them.

Katie Lucas
Sunday, August 01, 2004

Make sure you aren't reinventing the wheel, things like ini and MIME parsers are commonly available:

http://www.python.org/doc/current/lib/module-ConfigParser.html

http://www.python.org/doc/current/lib/module-email.html

Anony Coward
Sunday, August 01, 2004

Antlr (which can generate Java and C++ parsers) comes with grammers for Java, XML, HTML, etc.

And for lex/yacc (or flex/bison), grammers are available for just about anything.

dot who has tutored
Sunday, August 01, 2004

I used YACC to parse email addresses.

Worked out very well, it translated almost directly from the RFC, and the resulting code handled ALL possible internet email addresses, which is impossible with regexes. Would have been a lot harder without the tool.

Chris Tavares
Sunday, August 01, 2004

Is it easy to generate a "clean" C++ parser ? I am generally not very enthusiastic about Yacc-generated code. Should I have a look at spirit then ?

Pakter
Sunday, August 01, 2004

Yes, boost's Spirit library lets you declare grammars directly in C++ code (ie: it's not a code-generator).  There are some minor differences between Spirit syntax and traditional generative grammar syntax (as necessitated by using C++ syntax).  Adjoining non-terminals are specified with "<<" instead of a space.  The Kleene star is a prefix operator instead of a suffix operator.

Spirit's also an LL(k) parser, and I don't think that it does much in the way of left-recursive factoring or anything.  So if you have a rule like E ::= E op E | ... you'll force it to overflow the stack.

ANTLR seems to generate good code, but I haven't used it very much.

Kalani
Monday, August 02, 2004

I had to write a parser of logical expressions for a project. I was given a run-around for a few weeks, then the mgmt said that they don't want me to use a third-party parser generator (I was leaning towards CUPS for Java, but ANTLR was also considered) for "legal reasons". Licenses, you see, scared them -- though the official list of libraries approved for use in the product included several libraries with exactly the same license.

I ended up writing a parser by hand (and included some snide comments about why it's kind of not the most efficient thing, etc :)

DEBEDb
Wednesday, August 04, 2004

We actually used javacc (a java-oriented parser generator) as a key component in developing our PDF text extraction library for Java ( http://snowtide.com/home/PDFTextStream/ ).  It was really the best way to formalize the functioning of various parsing routines, allowing us to concentrate on **what** we were parsing, and not **how**.  That's a huge advantage.

Now, we did end up making some significant changes to the way certain files were generated, just for performance reasons.  But that's just an implementation detail -- using javacc to begin with was a huge win for us.

Chas Emerick
Monday, August 23, 2004

*  Recent Topics

*  Fog Creek Home