Fog Creek Software
Discussion Board




To yacc or not to yacc

Whether tis nobler in the mind to suffer the slings and arrows of outrageous parsing.. Er. Sorry.

Well, I want to implement a simple SGML/XML parser (no validation, no DTD handling) that would allow me to derive XML processors in an event-driven manner (like SAX). It will be done in C/C++ because I don't like Java.

Now I know there are tools like Expat and Xerces available, but I'd rather do it on my own if possible.

So I've been trying to research the feasibility of implementing this in the simplest manner possible, and questions come to mind that I'm not sure I have answers to:

- Can XML's grammar be expressed as LALR(1) so as to enable processing by flex/bison?
- Would implementing a recursive descent parser by hand be a better choice?

Warren Henning
Thursday, June 19, 2003

No yacc.

I seriously doubt that XML syntax can be expressed in yacc's parsing model anyway.  Not sure.

Even the phrase "recursive descent parser" connotes more than you need.  Parsing XML just isn't the same as parsing a language.

If you must reinvent the wheel, I recommend writing it by hand. 

Better yet, use expat and use the time you saved to go see a good movie.

Eric W. Sink
Thursday, June 19, 2003

How is using yacc considered "doing it on your own" and using a real XML parser isn't?  This sounds vaguely like ... "I want to drive in a few nails to this piece of wood, but I don't want to use a hammer.  Will a screwdriver do?"

Alyosha`
Thursday, June 19, 2003



Why not use a SAX implementation?

I don't see why you want to use yacc/lex?

James Nicoll
Thursday, June 19, 2003

The point was to have a project to help me learn compilers. XML seemed simple enough but would still require me to apply important knowledge.

Warren Henning
Thursday, June 19, 2003

I may be weird, but of all the compilers I've written in my life, I've never felt that yacc or lex was actually easier than just hand coding one. (As long as you know the basics of parsing and lexing, and you understand that you have to implement functions like GetNextToken and PushBackToken.)

Joel Spolsky
Thursday, June 19, 2003

Well... I just was trying to think of a good summer project (i.e. one that could be done by an 18 year old with no formal training such as myself in about 1-1.5 months). Anyone got any ideas? I was just thinking something with compilers, and then I hit upon the simpler idea of using XML. I really have no idea what I'm doing...

Warren Henning
Thursday, June 19, 2003

The original goal for XML was for something that a CS Grad student could implement in a reasonable amount of time. However, when you add namespaces, schema validation, etc., apparently a reasonable about of time is now "a couple of years."

If you're just looking for a way to experiment with parsing, XML might be the way to go. You might also want to consider something a little higher level, though.

For example, what about an SVG parser? SVG is Scalable Vector Graphics, and is an XML representation of vectored graphics, kinda like flash.

Now, doesn't the XML parser do this for you? Well, no, the XML parser handles the angle brackets, but you still need code to figure out what the actual semantics of the XML is.

You could do this with a parser. Use expat or another SAX parser, and treat the SAX events as tokens to feed into your parser.

Hope this helps,

-Chris

Chris Tavares
Thursday, June 19, 2003

My suggestion: write a library to read configuration files. You know, unix style, with # as comments, with sections and pairs, similar to .ini files. Make it very simple. This kind of stuff can be made using a very simple state machine (sounds complicated, but believe me, it is not, it's basically a switch statement inside a loop.)

Leonardo Herrera
Thursday, June 19, 2003

Warren, your ambition and self-starter motivation is inspiring, and I think you've got a great idea for a summer project. 

I'd encourage you to write from scratch: you'll learn a lot more IMO. Also, don't be too stubborn to look at existing solutions when you get stuck.  It's true that there's a lot of satisfaction and pride in solving a tough problem.  But if your goal is to _learn_, then you would do well to see how other people have solved the same problem.

-Thomas

Thomas
Thursday, June 19, 2003

Here is a list of tasks that gnu wants done.

http://www.gnu.org/prep/tasks_toc.html

Compiling work:

http://gcc.gnu.org/projects/beginner.html

Good luck, most of them look difficult.

--
ee

eclectic_echidna
Thursday, June 19, 2003

If you are looking to learn compilers then here a few books that I have read:

Garbage Collection: Jones, Lins

How Debuggers Work: Rosenberg

Advance Compiler Design Implementation: Muchnick

Writing Compilers and Interpreters: Mak

Compilers: Principles, Techniques and Tools: Aho, et. al

I would recommend trying to write a simple BASIC or PASCAL interpreter or even a simple expression parser.

Here's a link to an expression analyzer that I wrote, I got up to parsing/interpreting IF THEN ELSE END IF and then... I went to see a good movie!

- http://www.sswltd.com/downloads/expressionanalyzer.zip -

My $0.02.

Dave B.
Thursday, June 19, 2003

DTDs have been proved to be transformable to LL(1) and LALR(1) grammars, so it is certainly feasible to write a simple XML parser with yacc/lex. However, all the crud that has accumulated in the standard over the years make it kind of moot not to use a home-grown parser for anything other than learning projects.

XML could be a good project if you know the language (although it is a data language, so you have to have some kind of transformation in order to get anything exciting out at the other end), compared to learning another language first in order to parse it.

I don't know what you want to achieve by writing a sample parser in yacc/lex without any formal connection at all, other than learning by first-hand experience that yacc/lex are horrible tools to work with.

You don't say why you prefer C++ over Java, but I suspect you appreciate the ability to shoot yourself in the foot at your own liking. In that respect, you're in for hell of a ride when it comes to flex/bison etc. :-)

Roland Kaufmann
Thursday, June 19, 2003

This sounds like a great project for you, although I don't know that it will actually help you learn anything about compiler implementations. A compiler not only lexes and tokenizes and creates a syntactic tree of the source code, but it analyzes the semantics of the code in order to generate machine opcodes, which it can then optimize for the target platform. The code generation aspect of compiler theory is the interesting part, not the lexing/tokenizing/tree-building.

Having said that much, though, I think XML parsers are cool. In fact, I'm about 70% of the way through implementing an XML parser of my own. The reason I'm not using a pre-existing library is that I'm writing my parser in a language that doesn't yet have a DOM XML parser.

(It's the D programming language. You can read more about it here: http://www.digitalmars.com/d/ )

Anyhow, before you start, it would probably be a good idea to see how some other people are doing it. Take a look at the code for the xerces project (from the apache folks), and you'll hopefully get a good feel for how you DON'T want to do it.

When I started my parser, I actually wrote 500 lines of code before deciding that I was doing it wrong (I started out using a multi-pass regular-expression-matching parser). Then I sat down and actually _read_ the XML standard (here: http://www.w3.org/TR/REC-xml or, read the annotated version, here: http://www.xml.com/axml/testaxml.htm ) and the DOM Level 1 standard (here: http://www.w3.org/TR/1998/REC-DOM-Level-1-19981001/ ), and I got a much better idea for how I should really do it (a one-pass stack-based scanning finite state machine). The architecture that I've implemented is very similar to the PugXML project (which you can find here: http://www.codeproject.com/soap/pugxml.asp ).

You'd be wise to look at other implementations out there, and to actually do some design work before you get started. Or else, like me, you'll wirte 500 lines of code before realizing that you should have started off on a different foot.

Good luck with your project. And if you have any questions, you can email me.

Benji Smith
Thursday, June 19, 2003

By the way, my recommendation is: no, don't use yacc. The only thing you'll really learn about is: HOW TO USE YACC. If I'm not mistaken, what you really want to learn about is": HOW TO PARSE TEXT (LIKE A COMPILER). You won't learn that from using yacc.

Benji Smith
Thursday, June 19, 2003

I don't view writing an XML parser as "solving a tough problem". 

It is merely the implementation of a bunch of small tedious rules.  There's something to be said for doing it as an exercise in being able to focus while working on unexciting crap, but the actual guts of the thing isn't a "tough problem".

IMO.

Mister Fancypants
Thursday, June 19, 2003

Fancypants: depends on how skilled one is; I wrote a parser for a particular xml language (okay, my vocabulary isn't up to spec -- a set of documents with a specified set of tags) for an early programming class in college. In retrospect, there wasn't anything "hard" about it, but there was a lot of work which I didn't know how to do when I started, so it was a good experience.

As for a personal suggestion, a parser for a non-trivial something is probably a good thing to start with. Heck, come up with your own scripting language. Write a little drawing program and give it a scripting language then write the parser for the scripting language or what have you.

But XML would work too. :)

Steven C.
Thursday, June 19, 2003

Ah, I see -- you want to write code to learn.  Suddenly I do want to recommend writing a parser instead of using one that already exists.

But don't write an XML parser.  Parse something else like Java or C#.

Have fun with it.  Don't take the project too seriously.

Writing compilers and interpreters can be a good learning experience, but not many people ever make any money doing it.

And don't use yacc.

Eric W. Sink
Thursday, June 19, 2003

If you want a real challenge, try writing a compiler for Intercal. Then run screaming. :-)

A friend of mine actually had an InterCal compiler for .NET working; dunno if he ever released it or not.

Chris Tavares
Thursday, June 19, 2003

Have a look at the boost spirit parser.

http://boost.org/libs/spirit/index.html

punter
Thursday, June 19, 2003

I second the suggestion for boost::spirit. It's an amazing piece of technology, and it's very usable at the same time.
And the people on the mailing list are very supportive, should you run in trouble.

Phoenix
Friday, June 20, 2003

*  Recent Topics

*  Fog Creek Home