Fog Creek Software
Discussion Board




Custom XML Parser

Hi,

I'm wondering on how to go about writing a simple XML parser in .NET, one that doesn't use the XmlDocument class (DOM API) and relies only on (in-built, so to say low-level) file system and string parsing functions. Are there any examples or sample code that illustrates on how to go about building a custom XML parser on these lines. The main issue is that I'm not quite sure about how to render the XML file to an in-memory data structure of some sort using a custom parsing approach. Once I manage to build an in-memory hierarchical data structure from the XML data file, I can manipulate it further.

Any ideas or suggestions will be very helpful. Thanks!

RiceGuy
Saturday, August 07, 2004

I'll beat everyone else to the punch and ask the obvious question - why? What would drive you to reinvent solutions to the need that is so well catered to?

Dennis Forbes
Saturday, August 07, 2004

Don't reinvent the wheel -- google TinyXML.

Skagg
Saturday, August 07, 2004

I too was tempted to do that, instead of learning the .NET XML classes.

Instead, a suggestion is to use the .NET XmlNode class (use your XML to initialize an XmlNode instance), because perhaps XmlNode meets your requirements: it represents an in-memory hierarchical data structure that is buildable from an XML file, and which you can then manipulate further.

Christopher Wells
Saturday, August 07, 2004

@Dennis Frobes: "why?"

"Mind is a terrible thing to waste" ... well actually I think I'll learn a thing or two about XML/RSS parsing this way. No real-world objectives ... just practical.

RiceGuy
Saturday, August 07, 2004

or use one of the push-model readers. (SAX, the equivalent in .NET whose name I forget, etc)

or dig up Mark Pilgrim's feed parser (probably part of the feed validator @ feedvalidator.org). it can supposedly handle all sorts of weird exception cases. haven't looked at it but i imagine it's of good quality... python.

mb
Saturday, August 07, 2004

Um, how about perusing the MSDN documentation first? Or read Dino Esposito's "Applied XML Programming for .NET". What you want already exists.

The XmlReader/XmlWriter family of classes provides an excellent lightweight sequential XML parser. XMLReader even optionally supports validation, although that does slow down the reading quite a bit.

Chris Nahr
Saturday, August 07, 2004

Following up on Chris, XmlTextReader is pretty darn fast as compared to loading the DOM model as well.  I have used in successfully in a couple of pretty big projects. It's also how things get serialized / deserialized when you use a Xml serialization.  Good stuff.  Concur on the don't re-invent the wheel, btw.... this should do what you need.

<sigh/>
Saturday, August 07, 2004

I agree that you shouldn't reinvent the wheel for production, but if you're just doing it for the educational experience then there's nothing wrong with that.

In that respect, I've made an XML to S-expression parser (since I use S-expressions throughout my projects).  The mapping is generally something like:

<tag attr="value">Hello <tag2 attr2="value2"/> World!</tag>

to

(tag (attributes ((attr "value"))) (data ("Hello " (tag2 (attributes ((attr2 "value2"))) (data)) "World!")))

(It's not a lexical conversion, but the S-expression above is what you'd get it you dumped the in-memory tree to a file.)

It's very easy to do if you've got some system for defining push-down automata, or you've got a parser generator system that lets you define capturing groups.

Kalani
Saturday, August 07, 2004

Its for educational purposes. I know its reinventing the wheel but I was to learn more about parsing algorithms using .NET.

I'm thinking hard as to what logic would go for parsing the tags, elements and child nodes. Maybe looking for starting and ending notations (< and />) and create a custom data structure (parent-child) on this basis. But a reference would have helped me get a better idea. Any thoughts?

RiceGuy
Saturday, August 07, 2004

Really you ought to just go right to the theory of parsers (you'll want to parse things besides XML anyway -- that's par for the course in this business).  Here's a good article to read for some inspiration:

http://en.wikipedia.org/wiki/LR_parser

You might find it easier to implement an LL(k) parser than an LR parser, but whatever kind of parser you implement, you ought to understand context-free grammars.

Here's a good description of push-down automata (a model for CFG-based parsing):

http://web.uvic.ca/~ling48x/ling484/notes/pda.html

If you're going to make a specialized parser for XML by hand, it's a good idea to write down the grammar.  Here's the official formal grammar for XML:

http://www.galiel.net/el/study/XML_1_0_Grammar.html

It looks like that grammar doesn't require a backtracking parser with capturing groups (meaning that it doesn't automatically verify that every tag "<a>" ends with "</a>" -- that's left to the stage after syntactic analysis).

If you know C++, send me an email and I can show you a few tools I've come up with to make this sort of thing very easy to do.

Kalani
Saturday, August 07, 2004

Honestly, I've learned more about programming by reinventing wheels than any other technique. In fact, its probably made more sympathetic to using another library, as I often find myself understainding the issues involved that lead to choices that you may not appreciate on first reading the documentation.

Anyway, if you're interested in learning about parsing, two of the best resources I've found are the book "Parsing Techniques: A Practical Guide"(available as a pdf at http://www.cs.vu.nl/~dick/PTAPG.html ). And Jack Crenshaw's "Let's Build A Compiler" series of tutorials(also online at http://compilers.iecc.com/crenshaw/ ).

I also have more info on parsing on my webpage: http://www.titivillus.net/titivillus/thesis/

Matt Estes
Saturday, August 07, 2004

I've read Jack Crenshaw's series but I don't think that it's necessarily the best resource for understanding how to build parsers.  He builds an informal LL parser and sometimes he obscures the nature of what his parser does (e.g.: by saying "if you use a stack you're almost certainly overcomplicating the problem -- just let your language manage the stack for you!" but not looking more thoroughly at this detail can lead the reader to try his approach on grammars where it doesn't work at all).

Matt, have you chosen a specific thesis topic yet?  Maybe you ought to look at CMU's "link grammar" project for an example of a parser-generator that accepts a class of context-sensitive grammars.  You might also find the subject of probabilistic parsing to be interesting.  Check out the book "Foundations of Statistical Natural Language Processing" from MIT Press (there's a great probabilistic extention to Earley's algorithm that you ought to read if you haven't already).

Kalani
Saturday, August 07, 2004

*  Recent Topics

*  Fog Creek Home