Fog Creek Software Discussion Board

XML sorting

Can anybody tell me how to sort XML data using the XML SDK?

KS
Wednesday, February 11, 2004

Here is how I do XML sorting on a web server, using three Microsoft.FreeThreadedXMLDOM objects:

1. Load the source XML  (lo_XMLIn)

2. Load an XSL stylesheet that does the sorting  (lo_XSL)  (see the xsl:sort example after the code)

3. Use the transformNodeToObject method to regenerate your source XML as a sorted output document  (lo_XMLOut)

' Create three free-threaded DOM documents: source, sorted output, and stylesheet
Set lo_XMLIn = Server.CreateObject("Microsoft.FreeThreadedXMLDOM")
Set lo_XMLOut = Server.CreateObject("Microsoft.FreeThreadedXMLDOM")
Set lo_XSL = Server.CreateObject("Microsoft.FreeThreadedXMLDOM")

' Load the source XML synchronously, validating as it parses
lo_XMLIn.async = False
lo_XMLIn.validateOnParse = True
lo_XMLIn.loadXML ls_XMLRecord

' Load the sorting stylesheet the same way
lo_XSL.async = False
lo_XSL.validateOnParse = True
lo_XSL.loadXML ls_XSL

' Apply the stylesheet; the sorted result lands in lo_XMLOut
lo_XMLIn.transformNodeToObject lo_XSL, lo_XMLOut

' Release the objects we no longer need
Set lo_XMLIn = Nothing
Set lo_XSL = Nothing

** lo_XMLOut is your sorted XML **
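For reference, the stylesheet in ls_XSL might look something like this. The element names here (books, book, title) are placeholders; the xsl:sort select attribute picks the sort key:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/books">
    <books>
      <xsl:for-each select="book">
        <xsl:sort select="title" data-type="text" order="ascending"/>
        <xsl:copy-of select="."/>
      </xsl:for-each>
    </books>
  </xsl:template>
</xsl:stylesheet>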

Bob Brinker
Wednesday, February 11, 2004

And if you haven't read it already, see one of my favorite Joel articles of all time:  http://www.joelonsoftware.com/articles/fog0000000319.html

Sam Livingston-Gray
Wednesday, February 11, 2004

+1 for XSL sucking to sort, munge, or do much of anything.  I hate XSL.

Sassy
Wednesday, February 11, 2004

I like Joel's article too, but I suspect his stated understanding of how relational databases work is incorrect.

Perhaps he just said it to strengthen the point he was making, but I do not think that many modern databases store table rows in a fixed fashion:

"In a relational database, every row in a table (e.g. the books table) is exactly the same length in bytes, and every field is always at a fixed offset from the beginning of the row."

If this were the case, we wouldn't have varchar, etc.

Steve Jones (UK)
Thursday, February 12, 2004

varchar is typically padded to maximum capacity, with a length field at the front.  So potentially-long varchars take lots of space even if they are empty.

True variable length data - called blobs or text or memo or whatever depending upon which db you ask -  is usually stored outside the table, with only a reference to it stored.

i like i
Thursday, February 12, 2004

Really?

I seem to remember digging around in the SQL Server extent allocation a few years ago and this was not the case. It stored a variable number of rows per data page, depending upon how many would fit.

When you did an insert or an update (which is often really a delete, then an insert), SQL would check the data page to see if the row would still fit. If not, it would split the page and redistribute the rows.

This was a few years ago now, but I'd be surprised if varchars were padded, as suggested.

Steve Jones (UK)
Thursday, February 12, 2004

Just checked SQL Server 2000 Books Online and it seems to support what I was saying:

"char[(n)] - Fixed-length non-Unicode character data with length of n bytes. n must be a value from 1 through 8,000. Storage size is n bytes."

"varchar[(n)] - Variable-length non-Unicode character data with length of n bytes. n must be a value from 1 through 8,000. Storage size is the actual length in bytes of the data entered, not n bytes."

Steve Jones (UK)
Thursday, February 12, 2004

Still, a varchar field in a record could be implemented as a pointer into a separate varchar dictionary or heap ... so that the *rest* of the record (consisting of fixed-length fields) could be fixed length.

Even if the varchar data is inline with (on the same page as) the rest of the record, the record could be prefixed with a "record length" field (as a Pascal string is), which means that you can still iterate through records with a single opcode, i.e. "ptr += (*ptr)" instead of "ptr += constant"
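A small sketch of that iteration idea (Python standing in for the pointer arithmetic; the record layout is made up for illustration):

import struct

# A toy page: each record is a 2-byte length prefix followed by
# that many bytes of payload.
page = b""
for payload in (b"alpha", b"bo", b"gamma!"):
    page += struct.pack("<H", len(payload)) + payload

# Walk the records by advancing by each stored length, the moral
# equivalent of "ptr += (*ptr)".
offset = 0
while offset < len(page):
    (length,) = struct.unpack_from("<H", page, offset)
    print(page[offset + 2 : offset + 2 + length])
    offset += 2 + length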

Christopher Wells
Thursday, February 12, 2004

Christopher,

As you say, it could be like that. This would yield better performance in some ways, but would generally require more I/Os to fetch data off the heap.

Indeed, you'd need as many as (n+1) I/Os per row, where n is the number of variable-length fields fetched. This is unlikely to scale too well.

As you suggest, I think that SQL Server does have a row length field for every row in a data page. The exception to this is, I believe, when all columns are fixed-length, in which case the row length would not be required, and we revert to what Joel was suggesting in his original article.

Steve Jones (UK)
Thursday, February 12, 2004

I stand corrected, at least on the products quoted.

However, I would guess that this puts varchars in the same bucket as blobs etc.  Fundamental records are still fixed length.

i like i
Thursday, February 12, 2004

Joel has a good point but he is a bit off with the SQL example.

The difference between varchar and char is not so much in parsing speed as in saving disk space. The downside of varchars is that they require a lot of on-disk reallocations as the rows fill up. The access speed is roughly equal, since once the row is found on disk a pointer offset is used to jump to the proper field in both cases. How is the row found? That is where indexes come to the rescue.

Varchars are quite different from blobs, type-wise and implementation-wise. For starters, blobs can't be indexed or used in constraints or joins, while you can do pretty much everything with varchars.

Going back to XML vs. SQL: if one drops the indexes, the SQL search will be as slow as the XML search, since SQL has to use a full table scan, which is not much different from parsing the XML file.

Conversely, if one adds an index to the XML elements, things will speed up substantially.

coresi
Thursday, February 12, 2004

...and you index XML how exactly?

Sam Livingston-Gray
Thursday, February 12, 2004

I hate XML.  Still, I think that in the average case a full-table scan would be faster than parsing XML, because the data is already parsed/verified and stored in a nice compact binary format.

Plus, if you're doing something like:
WHERE foo=123

(and we have relatively static row lengths) the DBMS can jump directly to the attribute 'foo' in the next tuple instead of going byte by byte to find the next row, which it would have to do in the XML case.
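A rough sketch of that fixed-offset jump (Python; the 12-byte row layout with a 4-byte int 'foo' is invented for illustration):

import struct

ROW = struct.Struct("<i8s")  # 4-byte int foo + 8-byte fixed-width name

table = b"".join(ROW.pack(n, b"name%03d" % n) for n in range(5))

# With fixed-size rows, row i's 'foo' sits at offset i * ROW.size,
# so "WHERE foo=123" never has to scan byte by byte the way an
# XML parser must.
def foo_at(buf, i):
    (foo,) = struct.unpack_from("<i", buf, i * ROW.size)
    return foo

matches = [i for i in range(5) if foo_at(table, i) == 3]
print(matches)  # [3]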

MR
Friday, February 13, 2004

"...and you index XML how exactly?"

I can give you a broad hint here: the same way DB engines are implemented. The fact that XML can be converted so easily to and from a relational database shows they have a lot in common. Nevertheless, my point is: one can't compare the access speed of a file with an index versus a file without an index.

This would be a good idea for a product. I know a few different groups in my company who would like to use XML for storing and manipulating huge amounts of data, but they don't because it is too slow. An indexing engine, plus a sort of XML fill factor to speed up inserts, would make them happy.
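In the same spirit, here is a minimal sketch of what "indexing XML the way DB engines do" could mean: one pass over the document builds a hash from a key to its element, so later lookups skip the full scan. The element and attribute names are hypothetical:

import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<books>"
    "<book id='1'><title>A</title></book>"
    "<book id='2'><title>B</title></book>"
    "</books>"
)

# Build the "index" once: key attribute -> element.
index = {book.get("id"): book for book in doc.iter("book")}

# Later lookups are hash lookups, not re-parses or full scans.
print(index["2"].findtext("title"))  # B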

coresi
Friday, February 13, 2004

Perhaps the next version of SQL Server (codename Yukon) will be helpful here.

I know that MS are doing a huge amount of work to try to get XML into their database.

The new version certainly has XML as a datatype, and I believe you can use XPath queries against it.

Steve Jones (UK)
Saturday, February 14, 2004
