Fog Creek Software
Discussion Board




Building Text Parsers With C

Anyone have any suggestions on how? Namely, how to build parsers in C for standard formats like tab-delimited fields, CSV, or *.ini files, and how to build event-stream parsers or tree-based parsers for custom formats. Any tips and/or suggested reading?

Vans Davis
Wednesday, March 10, 2004

Yea -- don't reinvent the wheel.

Ron
Wednesday, March 10, 2004

I think the standard way is using the (in)famous tools yacc & lex. Lex generates the tokenizer and yacc generates the parser from a grammar.

I have never used them, but they are serious tools for building text parsing and processing engines in C.

http://dinosaur.compilertools.net/

.NET Developer
Wednesday, March 10, 2004

If you are actually interested in writing a parser, as an exercise of its own, I suggest looking at some parser generators, such as lemon (should be available at http://www.hwaci.com , as I recall).

As far as very simple formats go, if you don't need anything too complex, you may be able to get by with the copy algorithm, istream_iterator, and back_inserter if you have the STL available.

I highly recommend using the STL and C++ standard library. I just find that working with char*s in C and the C standard library is a pain in the ass, if you want good error checking and to avoid memory leaks.

If you just want to get something done, and are not interested in writing a parser itself, I suggest you consider just using XML or a readily available ini file library. There are a ton, for all languages. Many languages and toolkits provide a higher-level config facility, such as Java's Preferences API, or the Windows registry.

Mike Swieton
Wednesday, March 10, 2004

I'm not reinventing the wheel. My C application has to parse a somewhat-complicated custom format comprised of elements from other, more popular formats (like CSV), so I've been looking for texts, tutorials, and good examples to emulate.

Vans Davis
Wednesday, March 10, 2004

For simple things like CSV or column-based formats, using lex and yacc or their equivalent is really overkill.

If you are open to using C++, I suggest you look up the regular expression library at http://www.boost.org/libs/regex/doc/index.html.

Using this library you could very quickly roll your own parser.

Code Monkey
Wednesday, March 10, 2004


You probably want to look at TinyXML

http://www.grinninglizard.com/tinyxml/

Miffo
Wednesday, March 10, 2004

CSV format parsed by a C program?

strtok(), man.

Anything more complex? I'd probably look on Freshmeat, Sourceforge, and just do Google searches. It's amazing what people will donate for free to the public domain.

Lex and Yacc would be my last choice for such simple parsing requirements simply because someone's probably written one already and is proud to make it public domain.

But if you *must*, Lex would be more than adequate for creating code that runs through CSV or .ini file formats.
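
For what it's worth, the strtok() suggestion above might look something like the sketch below. The function name split_simple and its parameters are made up for illustration; only strtok() itself is standard C. It is only suitable for the simplest input: strtok() treats a run of delimiters as one, so an empty field such as the middle one in "a,,c" silently disappears, and it knows nothing about quoting.

```c
#include <string.h>

/* Hypothetical helper: split `line` in place on any character in `delims`,
 * storing up to `max_fields` token pointers in `fields`.
 * Returns the number of tokens stored.
 *
 * Caveats of strtok(): it overwrites delimiters in `line` with '\0',
 * keeps hidden static state (so it is not reentrant), and must not be
 * called on a string literal. */
size_t split_simple(char *line, const char *delims,
                    char *fields[], size_t max_fields)
{
    size_t n = 0;

    for (char *tok = strtok(line, delims);
         tok != NULL && n < max_fields;
         tok = strtok(NULL, delims)) {
        fields[n++] = tok;
    }
    return n;
}
```

Called on a writable buffer holding "alpha<TAB>beta<TAB>gamma" with "\t" as the delimiter set, this stores three pointers into the buffer. Called on "a,,c" with ",", it stores only two, which is one reason it falls short for real CSV.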

Bored Bystander
Wednesday, March 10, 2004

In a few weeks, I need to write a parser in C++ for RTF and HTML.  I had considered building something entirely custom.  However, I do have some experience with Lex and Yacc.  Should I try an existing tool or roll my own?

BTW, the STL is not available on the platform I'm using and interfacing with normal C/C++ code is generally a pain.

Almost Anonymous
Wednesday, March 10, 2004


Perl is so great when it comes to parsing. I never looked at the yacc & lex combo but they seem very complicated. Perhaps I am wrong. Please correct me if I am.

Is there a C/C++ based solution (some kinda library) out there you know about/use that has the power and simplicity of perl parsing without actually embedding perl?

entell
Thursday, March 11, 2004

I'll second using perl as a parser. Check out the Parse::RecDescent module for instance http://www.perl.com/pub/a/2001/06/13/recdecent.html

Matthew Lock
Thursday, March 11, 2004


Do a web search for recursive descent and 'railroad diagrams'.

Writing this stuff is heaps of fun.

But not really necessary for CSV files.

braid_ged
Thursday, March 11, 2004

I've written an Earley parser generator in C++, which has good performance characteristics (given fairly large sentences).  I enjoy using mine much more than an LL(k) parser framework like boost's "Spirit" or some such thing, especially because I *always* define left-recursive grammar rules and the gymnastics that have to be done to refactor such grammars (if possible) to work with an LL(k) parser can cause a lot of frustration.

Also, it's breadth-first, so no back-tracking is necessary (you don't have to know the whole contents of the stream to parse before starting to parse).

Plus LALR and LL(k) parsers are so 70s.

K
Thursday, March 11, 2004

For complex stuff, the following work well with C++:

ANTLR: http://www.antlr.org/

boost::spirit: http://www.boost.org/

Lex and yacc are not OO; they can be made to work but will not provide as elegant a solution.

For small jobs such as CSV (a little more work than strtok() can do if you really want to get it right) it's worth learning how to hand-code recursive-descent parsers.
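
A hand-coded recursive-descent parser for one CSV record could be sketched roughly as follows, with one function per grammar rule. The grammar, the struct csv_cursor type, and every function name here are assumptions made for illustration, not anything from a library. It parses a single line (no trailing newline) in place, treats "" inside a quoted field as an escaped quote, and does not handle quoted fields that span lines.

```c
/* Assumed grammar for this sketch:
 *   record := field { ',' field }
 *   field  := quoted | bare
 *   quoted := '"' { char | '""' } '"'
 *   bare   := { any char except ',' }
 */

/* `in` is the read position; `out` is where unescaped text is written.
 * `out` never gets ahead of `in`, so the line can be rewritten in place. */
struct csv_cursor {
    char *in;
    char *out;
};

/* quoted: returns 0 on success, -1 if the closing quote is missing. */
int parse_quoted(struct csv_cursor *c)
{
    c->in++;                           /* skip opening quote */
    for (;;) {
        if (*c->in == '\0')
            return -1;                 /* unterminated quoted field */
        if (*c->in == '"') {
            if (c->in[1] == '"') {     /* "" is an escaped quote */
                *c->out++ = '"';
                c->in += 2;
                continue;
            }
            c->in++;                   /* skip closing quote */
            return 0;
        }
        *c->out++ = *c->in++;
    }
}

/* bare: copy everything up to the next comma or end of line. */
void parse_bare(struct csv_cursor *c)
{
    while (*c->in != '\0' && *c->in != ',')
        *c->out++ = *c->in++;
}

/* field: choose the rule from the first character. */
int parse_field(struct csv_cursor *c)
{
    if (*c->in == '"')
        return parse_quoted(c);
    parse_bare(c);
    return 0;
}

/* record: fills `fields` with pointers into `line`.
 * Returns the number of fields, or -1 on a syntax error or overflow. */
int parse_record(char *line, char *fields[], int max_fields)
{
    struct csv_cursor c = { line, line };
    int n = 0;

    for (;;) {
        if (n == max_fields)
            return -1;                 /* caller's array is too small */
        fields[n++] = c.out;           /* field text is written from here */
        if (parse_field(&c) != 0)
            return -1;
        char delim = *c.in;            /* read before terminating: out may equal in */
        if (delim != ',' && delim != '\0')
            return -1;                 /* junk after a closing quote */
        if (delim == ',')
            c.in++;
        *c.out++ = '\0';               /* terminate the field just copied */
        if (delim == '\0')
            return n;
    }
}
```

Given a writable buffer and an array of char pointers, parse_record() leaves each pointer aimed at a NUL-terminated, unescaped field inside the original buffer, so no allocation is needed. Empty fields are kept, unlike with strtok().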

I'm starting to find that for real programming language design (in my case VHDL and Verilog), the bitch is not parsing (easy due to the use of parser generators), but getting the semantic analysis right - making sure that all required semantic checks are performed and that your compiler will not crash given invalid input.

David Jones
Thursday, March 11, 2004


I third:  Do it in perl.

If you have a c compiler on your platform, you can do it in perl.

Matt H.
Thursday, March 11, 2004

hehe,

http://www.cs.brandeis.edu/~mairson/poems/node1.html

K
Thursday, March 11, 2004

"If you have a c compiler on your platform, you can do it in perl."

Not true.  You need a C compiler and a pretty reasonable implementation of POSIX.  I have the former but not the latter.

Almost Anonymous
Thursday, March 11, 2004

Bored, I'll assume your suggestion of strtok() wasn't serious.

CSV can contain embedded commas in string literals. The following line has four fields, but a strtok() scan with comma separators would find six.

1,2,Test,"Another, see, here?"
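
A small sketch of the difference, assuming balanced double quotes; both function names are invented for this example. The first counts fields the way a plain strtok() scan would, the second ignores commas that appear inside quotes.

```c
#include <string.h>

/* Count tokens the way a naive comma-separated strtok() scan would.
 * Modifies `line`, so it needs a writable buffer. */
size_t count_with_strtok(char *line)
{
    size_t n = 0;

    for (char *tok = strtok(line, ","); tok != NULL; tok = strtok(NULL, ","))
        n++;
    return n;
}

/* Count fields while treating commas inside double quotes as data.
 * Assumes quotes are balanced; a doubled quote ("") toggles the flag
 * twice and so leaves it unchanged. */
size_t count_quote_aware(const char *line)
{
    size_t n = 1;                      /* a line always has at least one field */
    int in_quotes = 0;

    for (const char *p = line; *p != '\0'; p++) {
        if (*p == '"')
            in_quotes = !in_quotes;
        else if (*p == ',' && !in_quotes)
            n++;                       /* only unquoted commas separate fields */
    }
    return n;
}
```

On the line above, count_quote_aware() reports four fields and count_with_strtok() reports six. Call the quote-aware version first if both are run on the same buffer, since strtok() overwrites the commas.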

HeWhoMustBeConfused
Thursday, March 11, 2004

Some people above recommended Perl. If you've never used Perl, then I suggest you learn Python instead. There is a CSV library built into Python (called csv).

I have used Perl for many years, and I forced myself to learn Python. I will never use Perl again. Python does everything Perl can do and much more.

W
Monday, April 19, 2004

The question was, "...how to build parsers in C...?" Suggesting other languages is a bit like answering the question, "How do I get to X by car?" with, "Take the train."

Mark
Friday, April 23, 2004
