Fog Creek Software
Discussion Board




Dynamically determining the XML structure

One problem I have had with XML parsing is that you have to know in advance the structure of the document you are going to parse. As a result, you usually end up hard-coding node names into the program.

So I was wondering: what is the use of DTDs and the IXMLDOMDocumentType object in the parser? Why have a DTD if you cannot use it in your program to dynamically obtain the structure of a document?

I hope I am making myself clear. For instance, I was given an XML file like this:


<?xml version="1.0"?>
<TEAM>
    Visual Basic Team
    <MEMBER>Sathyaish Chakravarthy
        <SKILL>VB 6.0</SKILL>
        <SKILL>VB.NET</SKILL>
        <SKILL>Win32 API (SDK)</SKILL>
        <SKILL>ADO.NET</SKILL>
        <SKILL>SOAP</SKILL>
        <SKILL>XML</SKILL>
        <SKILL>HTML</SKILL>
        <SKILL>C/C++</SKILL>
        <SKILL>HTML</SKILL>
        <SKILL>Cascading Style Sheets</SKILL>
        <SKILL>Microsoft BizTalk Server 2000</SKILL>
        <SKILL>MS SQL Server</SKILL>
        <SKILL>Sybase</SKILL>
        <SKILL>Oracle 8</SKILL>
        <SKILL>MS Access</SKILL>
        <SKILL>Seagate Crystal Reports</SKILL>
    </MEMBER>

    <MEMBER>Ajay Kumar
        <SKILL>C/C++</SKILL>
        <SKILL>Visual Basic 6.0</SKILL>
        <SKILL>C/C++</SKILL>
    </MEMBER>

    <MEMBER>Rishabh Agarwal
        <SKILL>C/C++</SKILL>
        <SKILL>AS 400</SKILL>
        <SKILL>Visual Basic 6.0</SKILL>
        <SKILL>Visual Basic 6.0</SKILL>
    </MEMBER>

</TEAM>


Even if I had a DTD for this document, *I* would still have to know that the document has the following hierarchy:


<!DOCTYPE TEAM [
<!ELEMENT TEAM (#PCDATA | MEMBER)*>
<!ELEMENT MEMBER (#PCDATA | SKILL)*>
<!ELEMENT SKILL (#PCDATA)>
]>


I was wondering what use the DTD is when it cannot tell the application that consumes the XML about the document's structure, so that the application could dynamically look for elements based on the structure declared in the DTD.
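For what it's worth, some parsers can hand the DTD's declarations to a program. Below is a minimal sketch, assuming Python's standard xml.parsers.expat module rather than the MSXML parser discussed here, and assuming the mixed-content declarations are written in the form (#PCDATA | MEMBER)*. The helper name declared_children and the shortened sample document are made up for illustration.

```python
import xml.parsers.expat

# Shortened sample document with an internal DTD subset (illustrative only).
DOC = """<?xml version="1.0"?>
<!DOCTYPE TEAM [
<!ELEMENT TEAM (#PCDATA | MEMBER)*>
<!ELEMENT MEMBER (#PCDATA | SKILL)*>
<!ELEMENT SKILL (#PCDATA)>
]>
<TEAM>Visual Basic Team<MEMBER>Ajay Kumar<SKILL>C/C++</SKILL></MEMBER></TEAM>"""


def declared_children(xml_text):
    """Map each element declared in the DTD to the child element names it allows."""
    result = {}

    def on_element_decl(name, model):
        # model is a tuple: (content type, quantifier, name, children)
        _type, _quant, _name, children = model
        result[name] = [child[2] for child in children]

    parser = xml.parsers.expat.ParserCreate()
    parser.ElementDeclHandler = on_element_decl
    parser.Parse(xml_text, True)
    return result


structure = declared_children(DOC)
for parent, children in structure.items():
    print(parent, "->", children)
```

This only reports which children each element may contain. The program still has to decide what a MEMBER or a SKILL means, so it does not remove the need to know the vocabulary in advance.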

Sathyaish Chakravarthy
Thursday, January 22, 2004

First of all, stop using DTDs. Learn XSD.

Secondly, yes you have to know the structure of the data. It works this way with databases as well - you have to know table names and field names. Even in a metadata solution, you need a table/field entry point.

Nature of the beast. There are ways to use self-defining data, but then you end up getting the user to make data-based decisions; you're just channeling the data back and forth between the user and the database.

Philo
Thursday, January 22, 2004

Why do you want to use DTD rather than XML Schema?

So is one of the following your scenario?
- you have a schema for the XML you want to load, and want to reject everything else
- you have XML and you want to know if it is valid according to your schema
- you have XML and want to find a schema for it

WildTiger
Thursday, January 22, 2004

No, it's just that I never learnt XSD, but I will be learning it as soon as I can. I was just curious. On the one hand, I could recursively drill down the hierarchy to show the XML DOM tree in, say, a treeview without knowing the structure, or use XPath with SelectSingleNode to read a particular node's contents without knowing its ancestors. But this would not guarantee that I am accessing the correct node at the correct level. For instance, a node <BBB> that I access with the XPath expression //BBB or //child::BBB could be anywhere within the document, and I won't necessarily have a guarantee as to which BBB I am reading.

Maybe there's joy in XSD I haven't discovered yet.
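A small sketch of both points, assuming Python's standard xml.etree.ElementTree in place of MSXML's SelectSingleNode. The AAA/BBB/CCC document and the outline helper are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with a BBB element at two different levels.
DOC = "<AAA><BBB>top</BBB><CCC><BBB>nested</BBB></CCC></AAA>"


def outline(element, depth=0):
    """Recursively list tag names, indented by depth, without knowing the structure."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(DOC)
print("\n".join(outline(root)))

# ".//BBB" matches BBB at any depth, so it cannot say which BBB you meant.
anywhere = [e.text for e in root.findall(".//BBB")]

# "./BBB" matches only the direct children of the root element.
direct = [e.text for e in root.findall("./BBB")]
```

The explicit path removes the ambiguity, but only because the path itself encodes knowledge of the structure.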

Sathyaish Chakravarthy
Thursday, January 22, 2004

Dear god no - stick with DTD or learn Relax NG. It won't solve your original problem (no schema technology will) but at least you'll be spared the horror that is W3C XML Schema.

To solve your problem, you'll need a generic data model (e.g. RDF graphs) instead of just a generic syntax like you have now.

matt.
Thursday, January 22, 2004

I'd just like to point out that people who use DTDs are not by definition morons. DTDs and XSDs do different things, they're not interchangeable. Try defining a macro in XML Schema... right.

Of course, the OP might still be better off with an XSD... but that's a case-by-case decision.

Chris Nahr
Thursday, January 22, 2004

Why would you want a macro in a document definition?

Philo
Thursday, January 22, 2004

To define character entities and text passages that can thus be conveniently referenced in the document. You can import entire files with a custom-defined entity. That's only possible with DTD, not with XSD.
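As a rough illustration of an internal entity used as a text macro, here is a sketch assuming Python's standard xml.etree.ElementTree, which expands entities declared in the internal DTD subset. The entity name vb is made up for the example.

```python
import xml.etree.ElementTree as ET

# The DTD declares an internal entity "vb"; the document body references it.
DOC = """<!DOCTYPE TEAM [
<!ENTITY vb "Visual Basic 6.0">
]>
<TEAM><MEMBER>Ajay Kumar<SKILL>&vb;</SKILL></MEMBER></TEAM>"""

root = ET.fromstring(DOC)

# The parser replaces &vb; with the declared text before the program sees it.
skill = root.find("./MEMBER/SKILL").text
print(skill)
```

External entities, which pull in whole files, are a separate matter, and many parsers restrict or disable them by default.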

DTD is also capable of handling syntax other than XML, although that doesn't apply here. (But it's the reason why the HTML standard is defined in DTD, not in XSD.)

Chris Nahr
Thursday, January 22, 2004

>but at least you'll be spared the horror that is W3C XML Schema

To be honest, I have a little secret I won't admit to. I find the whole XML thing a bit scary after the namespaces.

Sathyaish Chakravarthy
Thursday, January 22, 2004

If you're using Visual Studio .NET, you can have Visual Studio autogenerate a schema from an existing XML file. (Open an XML file in Visual Studio, then select the XML: Create Schema menu item.)

The resulting schema might need some massaging, but it's a good way to get started.

Robert Jacobson
Thursday, January 22, 2004

Wow! That was one cool thing I never knew. But then I hadn't used VS.NET very extensively. Thanks for the trick.

Sathyaish Chakravarthy
Thursday, January 22, 2004

"To define character entities and text passages that can thus be conveniently referenced in the document"

Isn't this mixing definition and content?

Philo
Thursday, January 22, 2004

>Isn't this mixing definition and content?

It is and yet it isn't. In the DTD, for external entities, you're not defining the content itself; in most cases you're just creating an alias for the location where the content resides, which is analogous to a declaration. For internal parsed entities, though, it does mix declaration with definition.

But come to think of it, I think this is the only useful thing a DTD allows you to do: declare general and character (internal and external, parsed and unparsed) entities so you can use them with entity references. Validation is useful only when constructing the document, not when consuming it.

Sathyaish Chakravarthy
Thursday, January 22, 2004

>Validation is not required for consuming documents
Garbage in, garbage out? How would you figure out whether your "out" is something useful, then?

XML by itself is not so useful, and I believe in most cases you know in advance what data you can parse. So you will not use "//bla", since you know the path to the node... Or did I miss your question completely, and you are trying to write an XML parser yourself?

WildTiger
Thursday, January 22, 2004

"Because in the DTD, for external entities, you're not defining the substance or content, you're just creating aliases for the location where the content resides in most cases"

Aaaaiiiieeeee!!!!!
I guess if DTD is the tool you feel is best suited in this case, fine, but man is this an abuse of what a document definition is about. For one thing, it's generally supposed to be portable, and pointing to an external location certainly destroys that concept.

Sorry - just completely alien to the way I've used schemas.

Philo

Samir
Thursday, January 22, 2004

The nice thing I have discovered about .NET is that if you have an XSD schema (which Visual Studio .NET can auto-generate from your XML, as mentioned above), you can run the xsd.exe utility to create a set of classes based on your schema.

This allows you to serialize and deserialize an entire object model from an XML file in about three lines of code. Personally, I think this is way cool.
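To give a feel for the idea of mapping XML onto typed objects, here is a hand-written sketch in Python. It is not what xsd.exe generates and does not use the .NET serializer. The Team and Member classes and the load_team function are hypothetical, modelled on the TEAM document from the first post.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Member:
    name: str
    skills: list = field(default_factory=list)


@dataclass
class Team:
    name: str
    members: list = field(default_factory=list)


def load_team(xml_text):
    """Build a Team object from a TEAM/MEMBER/SKILL document."""
    root = ET.fromstring(xml_text)
    members = [
        Member(
            name=(m.text or "").strip(),
            skills=[s.text for s in m.findall("SKILL")],
        )
        for m in root.findall("MEMBER")
    ]
    return Team(name=(root.text or "").strip(), members=members)


team = load_team(
    "<TEAM>Visual Basic Team"
    "<MEMBER>Ajay Kumar<SKILL>C/C++</SKILL><SKILL>Visual Basic 6.0</SKILL></MEMBER>"
    "</TEAM>"
)
print(team)
```

A generator such as xsd.exe saves you from writing the mapping function by hand, since it derives the classes from the schema.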

Better Than Being Unemployed...
Friday, January 23, 2004

You can also go the other way... generate an XSD from a class definition.  IMO, one of the best features of the Framework. 

(Shuddering at memories of my hand-rolled serialization routines in VB6...)

Robert Jacobson
Friday, January 23, 2004
