Fog Creek Software
Discussion Board




Ripped from Slashdot:  Is XML too hard?

Slashdot has an interesting discussion on the difficulties of working with XML, with links to several articles by smart people discussing its shortcomings.  And while I think XML (or something like it) is a necessary thing, I also believe that working with XML is not nearly as easy as the people trying to sell you on it would lead you to believe.

And the reason for this is that XML by itself isn't all that useful.  It's all the OTHER crap that you have to learn IN ADDITION to XML that introduces the complexity.  So you want to parse your XML document?  Well, you've got to learn either DOM or SAX (or both).  So you want it to be validated?  Well, time to learn DTDs and Schemas.  So you want to turn it into another XML document with a different structure?  Well, time to learn XSLT.  The list goes on and on.  And many of the solutions for each of these shortcomings are non-intuitive, error-prone, or not generally applicable enough.

All that said, I think the idea of having a language/OS/vendor/system-neutral data representation scheme is worthwhile.  I guess the tools just need to catch up.

What are your opinions/experiences with XML?

Crimson
Tuesday, March 18, 2003

Finally, it's coming to light....

An XML document is nothing to create. Parsing it and doing something with it is the difficult part. I'm glad this is finally getting noticed.

TB Sheets
Tuesday, March 18, 2003

I've just started using Microsoft CRM, which uses an XML format for queries.  What a pain in the A$$.  Here is what it takes to do the equivalent of "SELECT contactid, firstname FROM contacts WHERE LastName='GXXX'" in this 'FetchXML':

(hoping this msg board will display it correctly)

<fetch mapping='logical'>
    <entity name='contact'>
        <attribute name='contactid'/>
        <attribute name='FirstName'/>
        <filter type='and'>
            <condition attribute='LastName' operator='eq' value='GXXX'/>
        </filter>
    </entity>
</fetch>

GiorgioG
Tuesday, March 18, 2003

Yes and no.

XML itself is simple. DTDs are relatively trivial as well (unless there are some really powerful esoteric features I've not seen discussed).

What's hard is using SAX to parse a file. My particular application sits fine with SAX callbacks and a stack, but I can imagine that many would not. I have to side with Bray on that one: SAX is extremely counterintuitive, and DOM is often not an option.

XML is great though. If only because it has established parsers (that is the sole reason my latest project used it).

Mike Swieton
Tuesday, March 18, 2003
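
To make the "SAX callbacks and a stack" approach in the post above concrete, here is a minimal sketch in Python using the standard xml.sax module. It is not Mike Swieton's code; the element names (contacts, contact, firstname) and the helper first_names are made up for illustration. The handler keeps a stack of open element names so that each callback can tell where in the document it is.

```python
import xml.sax


class ContactHandler(xml.sax.ContentHandler):
    """Collects the text of every <firstname> directly inside a <contact>."""

    def __init__(self):
        super().__init__()
        self.stack = []        # names of the currently open elements
        self.first_names = []  # results gathered so far
        self._text = []        # character data seen since the last tag

    def startElement(self, name, attrs):
        self.stack.append(name)
        self._text = []

    def characters(self, content):
        # SAX may deliver one run of text in several chunks.
        self._text.append(content)

    def endElement(self, name):
        # The stack tells us which element is closing and what its parent is.
        if self.stack[-2:] == ["contact", "firstname"]:
            self.first_names.append("".join(self._text).strip())
        self.stack.pop()
        self._text = []


def first_names(xml_text):
    """Hypothetical helper: parse a string and return the collected names."""
    handler = ContactHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.first_names


SAMPLE = (
    "<contacts>"
    "<contact><firstname>Ann</firstname></contact>"
    "<contact><firstname>Bob</firstname></contact>"
    "</contacts>"
)
```

Even for this small task the state is spread over three callbacks, which is the awkwardness several posters complain about.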

Yup, too bad MS and other vendors blessed it as the second coming.  Will it eventually go away?

Mike
Tuesday, March 18, 2003

I think XML at its heart is just a form of serialization. It is nice to have a standard one, though the world could probably use a standard binary one too. If it is harder than just serializing objects, then I think it's just a sign of immature tools. XML in .NET is pretty much transparent, at least for the common uses including SOAP, which is how it should be. You should never have to see < or >, or use an X-du-jour acronym, unless you are doing something very specific. Or unless you are stuck in an XML-poor environment, writing your own parser etc., ugh.

Robin Debreuil
Tuesday, March 18, 2003

There are ways other than SAX and DOM to parse XML. It's just that these two are the standard ways of doing it.

In the .NET Framework, System.Xml.XmlReader allows you to "seek" to a particular node and pull out the data. This approach is similar to SAX in that it doesn't require the whole document to be in memory, but it doesn't require callbacks either. This approach is perfectly fine for applications that serialize objects to XML - storing configuration info, SOAP, etc...

igor
Tuesday, March 18, 2003
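
The pull style igor describes for XmlReader has a rough analogue in Python's standard xml.dom.pulldom module. The sketch below is only an illustration of the idea, not .NET code; the config/setting document layout and the read_setting helper are assumptions. The caller loops over events and asks for the one node it cares about, with no callbacks involved.

```python
from xml.dom import pulldom


def read_setting(xml_text, wanted):
    """Return the text of <setting name="wanted">, or None if absent."""
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if (
            event == pulldom.START_ELEMENT
            and node.tagName == "setting"
            and node.getAttribute("name") == wanted
        ):
            # Only now is this one element's subtree read into memory.
            events.expandNode(node)
            return "".join(
                child.data
                for child in node.childNodes
                if child.nodeType == child.TEXT_NODE
            )
    return None


CONFIG = (
    "<config>"
    "<setting name='host'>example.org</setting>"
    "<setting name='port'>8080</setting>"
    "</config>"
)
```

The control flow stays in one function, which is the main practical difference from a SAX handler.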

XML has some nice features.  Mostly it's a good way to make file formats that somebody else might want to code against slightly easier to parse.

The problem is that DOM and SAX are both trying to be two-size-fits-all solutions for the problem.  They both work roughly OK if you are using them for what they were originally designed for, and pretty poorly for things they were not designed for from the start but are shoehorned into.

XSLT is actually quite nice, once you get the hang of it.  But you need to view it as a scripting language more than anything else.  It is a scripting language that is designed to be limited to the task at hand -- converting sets of similar XML documents into a new format without creating even the slightest hint of an exploitable hole.  It really could have been done as a Scheme-like language and made much less verbose, but this way, you don't need to write extra parsers; you can do everything with the DOM.  In most cases, it's probably easier to write it in a scripting language of your choice unless you need it to be used in an XSLT-friendly environment.

The same goes for most of the other standards.  They hope that you use that instead of rolling it yourself in the language of your choice and an XML parser.

And there's a whole lot of hype and other weirdness, just like every other buzzword technology.  DTDs and Schemas are complementary, not competitive, dammit.

flamebait sr.
Tuesday, March 18, 2003

I don't care; just hide the XML behind functions.  When did this become impossible?  When I abstracted away XML in an app that was "impossibly slow," I was able to add caching and other stuff.  And no one had to know they were using XML.

XML in code is like Hard Rock Cafe on a t-shirt -- advertising.

Tj
Tuesday, March 18, 2003

The original article /. was referring to basically boils down to this:

1) The XML DOM requires you to read the entire XML into memory before processing. This takes too many resources.

2) SAX (the other main competitor) is very hard to program to in real world situations due to its design (callbacks from nodes).

3) XML as text is too irregular to process with line/regex based tools, which are the tools of choice for quick text stream processing.

I would mostly agree with these statements. XML parsers and query engines these days are built around having everything in memory. If you want to do stream processing, you're pretty much stuck with SAX right now. And the callback-based model SAX uses really takes a lot of work to implement what are fairly simple things.

As for how to get the best of both worlds: I've got an idea, but I think I may save it for a magazine article. ;-)

Chris Tavares
Tuesday, March 18, 2003

How do you figure it's too irregular to process? At the root you've got <, >, </, and =""
If people can parse HTML with Regex (and they do), what's the big deal with a highly structured tag library like XML?

FWIW, I consider the best application of XML to be for transferring data, in which case you'll be reading the entire document in anyway...

Philo
Tuesday, March 18, 2003

You're forgetting about namespaces. An element can have a namespace prefix or not. The namespace prefix can be different but refer to the same actual namespace. But a regex based search will treat foo:MyElement and bar:MyElement as different, even if they are in fact the same according to the namespace declarations.

Not to mention the fact that line breaks are pretty much arbitrarily allowed in an XML file - naive tools tend to choke on stuff like this.

In a way, it's kinda funny, since XML isn't a regular language, and so you can't parse it with regexes anyway!

Chris Tavares
Tuesday, March 18, 2003
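
The namespace and line-break points above can be demonstrated in a few lines of Python. This is an illustrative sketch only; the urn:example:ns namespace and the document are made up. Two prefixes are bound to the same namespace, and one start tag is split across lines, which XML allows.

```python
import re
import xml.etree.ElementTree as ET

DOC = """<root xmlns:foo="urn:example:ns" xmlns:bar="urn:example:ns">
  <foo:MyElement>one</foo:MyElement>
  <bar:MyElement
      >two</bar:MyElement>
</root>"""

# A naive regex keyed on the prefix sees only one of the two elements.
regex_hits = re.findall(r"<foo:MyElement>(.*?)</foo:MyElement>", DOC)

# A namespace-aware parser resolves both prefixes to the same name.
root = ET.fromstring(DOC)
parser_hits = [el.text for el in root.iter("{urn:example:ns}MyElement")]

print(regex_hits)   # ['one']
print(parser_hits)  # ['one', 'two']
```

The regex could be patched for this particular document, but each such patch is another step toward writing a parser by hand.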

Proclaiming XML's imminent death is like calling for the end of ASCII text - oh, that happened and very few people noticed - now we have Unicode.

Bottom line - tools - why anyone is messing with raw XML is beyond me.

Get out of emacs, vi, notepad or whatever macho text editor you are using, get a real IDE, and get a life.

Long live silver bullets
Tuesday, March 18, 2003

Note that most things that are useful are not "regular".  Generally, any interesting file format won't be a regular grammar, at least as far as any regular expression searching system is concerned.

I'm not sure what the problem is.  If you want to grep through a file for something, XML isn't getting in your way.  And if you want to get data out of it, there is DOM, SAX, and a set of various alternatives for different languages.  You will get just about as far parsing Lisp or C-style syntax with regular expressions as you would parsing XML.

flamebait sr.
Tuesday, March 18, 2003

Two APIs Java programmers can use to make their XML processing a whole lot easier are JDOM (http://www.jdom.org/) and JXPath (http://jakarta.apache.org/commons/jxpath/).

JDOM was conceived because DOM and SAX didn't quite fit into the Java way of doing things. JXPath makes it easy to hop around an XML tree with XPaths.

Walter Rumsby
Wednesday, March 19, 2003
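
For readers who don't use Java, the "hop around an XML tree with XPaths" idea that Walter Rumsby mentions can be sketched with Python's standard ElementTree, which supports a limited XPath subset. This is not JDOM or JXPath code; the document and the id attribute are assumptions for the example.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<contacts>"
    "<contact id='1'><firstname>Ann</firstname></contact>"
    "<contact id='2'><firstname>Bob</firstname></contact>"
    "</contacts>"
)

root = ET.fromstring(DOC)

# One path expression replaces a hand-written walk over child nodes.
names = [el.text for el in root.findall(".//contact[@id='2']/firstname")]
all_names = [el.text for el in root.findall("./contact/firstname")]
```

Like DOM, this needs the whole tree in memory, so it addresses the convenience complaint rather than the memory one.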

Actually you can use XSLT to do the parsing for you.  Then DOM or SAX is irrelevant.

Most XSLT processors have the ability to call user functions for a matching XPath statement.  You can use those abilities to eliminate DOM or SAX parsers.

Nitin Bhide
Wednesday, March 19, 2003

I have always used my own tokenisers.  So code like "myxmldoc.next()" returns a token.  Tokens are not necessarily tags.  The types of token I've found useful are OPEN_TAG (includes name and, if specified, namespace), ATTRIBUTE (key value pair), CLOSE_TAG, TEXT etc.  The good bit about splitting the OPEN_TAG and CLOSE_TAG is that <hmm/> and <hmm></hmm> look exactly the same.

The nicest thing about tokenisers is that you don't need to coordinate state between several callbacks.

My tokenisers have evolved to have lots of useful methods for skipping stuff e.g. "myxmldoc.nextTag("hmm")" would skip all stuff until the next interesting tag.  Great equiv to what Tim Bray was doing with regex in the article pointed to in that slashdot article.

My tokeniser keeps a stack so that closes can be checked against opens; a trivial check that the XML is well-formed.  I have never yet bothered with making them validating, but it would be very possible.  I do have simple utility functions like "nextSwallowing("hmm")" which would return the next tag after next, ensuring the skipped tag was a <hmm> and causing an exception if it wasn't.  Etc.  Great for hardcoding structure into the code.

I found that for simple XML parsing, the code in my method body actually looks pretty much like the XML, even the indentation!  Very easy to read.  Another thing Mr Bray was complaining about is mitigated, I think.

So I suggest people consider tokenisers instead of SAX, since IMHO they are far superior to use.

IIRC kxml (http://kxml.sourceforge.net) is a tokeniser too.  And funnily enough, they did that for the same reasons that drove me originally - good performance and low memory overhead for J2ME Java.. when I saw their code, I actually almost thought they were copying - the methods even had the same names!  Ah well, must have been obvious I guess ;-)

/me checks website and finds there is now a "pull" API yeah baby!!

Nice
Wednesday, March 19, 2003
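
A rough Python sketch of the tokeniser idea from the post above. It is not the poster's code and not kxml: the Tokeniser class, next_tag and all_tokens are hypothetical names, and for brevity this version lets the standard expat parser produce the whole token list up front rather than reading lazily, so it shows the calling style but not the low memory use. The token kinds follow the post: OPEN_TAG, ATTRIBUTE, CLOSE_TAG, TEXT.

```python
from collections import deque
import xml.parsers.expat

OPEN_TAG, ATTRIBUTE, CLOSE_TAG, TEXT = "OPEN_TAG", "ATTRIBUTE", "CLOSE_TAG", "TEXT"


class Tokeniser:
    def __init__(self, xml_text):
        self._tokens = deque()
        parser = xml.parsers.expat.ParserCreate()
        parser.buffer_text = True  # deliver each text run in one piece
        parser.StartElementHandler = self._open
        parser.EndElementHandler = self._close
        parser.CharacterDataHandler = self._text
        # expat raises ExpatError here if the input is not well-formed.
        parser.Parse(xml_text, True)

    def _open(self, name, attrs):
        self._tokens.append((OPEN_TAG, name))
        for key, value in attrs.items():
            self._tokens.append((ATTRIBUTE, (key, value)))

    def _close(self, name):
        self._tokens.append((CLOSE_TAG, name))

    def _text(self, data):
        if data.strip():  # drop whitespace-only text between tags
            self._tokens.append((TEXT, data))

    def next(self):
        """Return the next (kind, value) token, or None at the end."""
        return self._tokens.popleft() if self._tokens else None

    def next_tag(self, name):
        """Skip everything up to and including the next OPEN_TAG `name`."""
        token = self.next()
        while token is not None and token != (OPEN_TAG, name):
            token = self.next()
        return token


def all_tokens(xml_text):
    """Convenience for the example: drain a tokeniser into a list."""
    tok, out = Tokeniser(xml_text), []
    token = tok.next()
    while token is not None:
        out.append(token)
        token = tok.next()
    return out
```

With this, <hmm/> and <hmm></hmm> produce identical token sequences, and the calling code reads top to bottom instead of being split across callbacks.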

For an interesting take on this read
http://www.kuro5hin.org/story/2003/3/19/34653/8898#xmlstreams .

"It's quite amusing to me that a post which is really Tim Bray complaining about the crappy APIs for processing XML he is used to got so much traction on Slashdot and Jon Udell's blog as an indictment of XML.

The posting by Tim Bray was really an announcement that he is disconnected from the current landscape of XML technologies."

Just me (Sir to you)
Wednesday, March 19, 2003

XSLT is a parser?  I thought it was just an xml document...

apw
Wednesday, March 19, 2003

And a C program is just an ASCII file right?

Just me (Sir to you)
Wednesday, March 19, 2003

With Java:

JDOM and JXPATH to load and find nodes etc

and

**VELOCITY** (jakarta.org) to do the transformations. Who needs XSLT for most of the templating?

Phil
Wednesday, March 19, 2003
