Fog Creek Software
Discussion Board




Simple XML Compression

Hello,

I'm sending some rather large XML packets over UDP, and the speed is becoming a bit of an issue. I'd like to do some very simple compression, but I'd like to see if there's a better way out there.

The tree I'm sending consists of 10000+ 'parallel' children, each similar with 4 children. For various (good) reasons the tag names are rather verbose, and by my estimate ~95-98% of the data being sent is either tag names, or angle brackets. Seems there's very little entropy in the data, so good compression should be reachable. The data is just integers, with a few periods here and there (IP Addresses).

I've thought of quickly parsing the tree, and replacing the tag names with a single unique character, with a map table at the beginning. This would help a great deal, but I still would have roughly half the characters being angle brackets.
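A rough sketch of the tag-substitution idea above, in Python (the thread names no language, so that choice is an assumption). The sample element names, the `Z` envelope, and its `m` attribute are hypothetical; the code also assumes no namespaces, at most 26 distinct tag names, and that no original tag is already a single lowercase letter.

```python
import string
import xml.etree.ElementTree as ET

def shorten_tags(xml_text):
    """Rename each distinct tag to one letter and carry the map in an envelope."""
    root = ET.fromstring(xml_text)
    mapping = {}                                  # long tag name -> single letter
    for elem in root.iter():
        if elem.tag not in mapping:
            if len(mapping) >= len(string.ascii_lowercase):
                raise ValueError("more than 26 distinct tag names")
            mapping[elem.tag] = string.ascii_lowercase[len(mapping)]
        elem.tag = mapping[elem.tag]
    # Hypothetical envelope: <Z m="a=longName,b=otherName,..."> wraps the renamed tree
    table = ",".join(f"{short}={long}" for long, short in mapping.items())
    envelope = ET.Element("Z", {"m": table})
    envelope.append(root)
    return ET.tostring(envelope, encoding="unicode")

def restore_tags(short_text):
    """Read the map back from the envelope and restore the original tag names."""
    envelope = ET.fromstring(short_text)
    mapping = dict(pair.split("=") for pair in envelope.get("m").split(","))
    root = envelope[0]
    for elem in root.iter():
        elem.tag = mapping[elem.tag]
    return ET.tostring(root, encoding="unicode")

sample = ("<measurements>"
          "<measurement><sourceAddress>10.0.0.1</sourceAddress>"
          "<packetCount>42</packetCount></measurement>"
          "<measurement><sourceAddress>10.0.0.2</sourceAddress>"
          "<packetCount>7</packetCount></measurement>"
          "</measurements>")
short = shorten_tags(sample)
print(len(sample), len(short))
```

The result is still valid XML, but the angle brackets and closing tags remain, so the saving is bounded in the way the post describes.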

Is there an easy way to get any further compression while still keeping the contents valid XML? My XML and compression technology knowledge is limited, so please be gentle ;-)

As a possible alternative, is there a short algorithm for compressing a character data stream into only printable (viewable) characters? Off-hand I'd guess there are about 60-70 possibilities, with upper/lower case characters, digits, symbols, etc.
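One way to get a printable-only compressed stream, sketched in Python under the assumption that a standard library is acceptable: deflate the text with zlib, then Base64-encode the result. Base64 uses 64 characters plus `=` padding (A-Z, a-z, 0-9, `+`, `/`), none of which need escaping inside an XML element. The function names and the sample record are hypothetical.

```python
import base64
import zlib

def pack_printable(xml_text):
    """Deflate the text, then Base64-encode it so only printable ASCII remains."""
    compressed = zlib.compress(xml_text.encode("utf-8"), 9)
    return base64.b64encode(compressed).decode("ascii")

def unpack_printable(payload):
    """Reverse of pack_printable: Base64-decode, then inflate."""
    return zlib.decompress(base64.b64decode(payload)).decode("utf-8")

# Hypothetical log shaped like the one described: many near-identical records
record = ("<measurement><sourceAddress>10.0.0.1</sourceAddress>"
          "<packetCount>42</packetCount></measurement>")
document = "<measurements>" + record * 1000 + "</measurements>"

payload = pack_printable(document)
print(len(document), len(payload))
```

Real logs vary more than this repeated record, so the ratio printed here is optimistic; the point is only that the output stays printable.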

Edward
Tuesday, July 20, 2004

Some time back I investigated this (though not in great detail).

1. One simple way is to use gzip. I think you have to tweak the web server to enable compression on the server side. Most of the browsers support gzip compression/encoding. Then you don't have to do any coding, and no changes are required at the client side.

2. Another way is to use a simple XSLT. Replace tag names with one- or two-character sequences, then add an XSL stylesheet to convert the short names back to the real names. Again, this approach is supported by IE and Mozilla, so no changes are required at the client side.

3. Other options like bzip etc. require changes/add-ons at the client side.

Nitin Bhide
Tuesday, July 20, 2004

Do you really need the compressed version to be valid XML? There are readily available compression libraries for any language; I would think it would be not only easier but also better-performing if you used one of those rather than brewing your own.

Brad
Tuesday, July 20, 2004

Nitin, yes, I have thought about that, and I'm certain that would be the best compression. If more simple compression schemes don't work out, that would have to do.

Everything we do here is in XML, as far as communication goes. Up until this point, it hasn't been a problem, but in this situation I'm collecting logs from a networking experiment, and the transmission time is becoming an issue.

At the layer I'm working at, all the data is in printable characters (in XML), and I'd like to stay there if at all possible. If I could somehow get the data compressed by 85-90%, there wouldn't be a need for gzipping and the like.

Edward
Tuesday, July 20, 2004

Brad, agreed. I'm just really really hoping I can do it while maintaining valid XML code. The libraries that I'm involved with all have XML parameters for this, and using something different would just add complexity that I don't want to deal with.

Just hoping to find a "good enough" solution.

Edward
Tuesday, July 20, 2004

Edward, what do you perceive as the difference between
<xml>(compressed data)</xml> and
(compressed data)

GZip is used to compress RSS, Atom, HTML, and XHTML (probably half of the blogs you visit are compressed, and any commercial operation should be compressing).  The sending server compresses it, the receiving server decompresses it.  All compression/decompression happens just before/after transmission so the rest of the server is completely unaware that compression was utilized.

Sometimes what goes over the wire is very important; in this case you're creating congestion so you can have angle-bracket happiness.  But the wire doesn't care.

Lou
Tuesday, July 20, 2004

See the IBM developerWorks topics on XML compression

http://www-106.ibm.com/developerworks/xml/library/x-matters13.html
http://www-106.ibm.com/developerworks/webservices/library/ws-sqzsoap.html
...

or

http://www.research.att.com/sw/tools/xmill/

brgs,

Carlos

carloca
Tuesday, July 20, 2004

WBXML..?

i like i
Tuesday, July 20, 2004


Replacing node names with single unique characters should be somewhat effective.

If you can ensure that no quotes will be used, you can convert child elements to attributes named the same:

<test>
    <test1>number</test1>
    <test2>number2</test2>
</test>

Becomes:

<t t1="number" t2="number2" />

KC
Tuesday, July 20, 2004

Thanks for the suggestions, I think I can do something acceptable.

Edward
Tuesday, July 20, 2004

Guys - he's talking about sending custom UDP packets I think, so gzip encoding over HTTP really isn't gonna help. I expect you could still hack something together based on gzip though.

I personally think XML is horribly inappropriate for a lot of situations in which it's currently trendy to use it... and this, on the face of it, could be one. If you need an efficient binary protocol / representation for your data, then maybe you should design one, and generate the XML from that if/when it's actually necessary, not the other way round... ?
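To make the binary-first idea concrete, here is a minimal Python sketch using the standard `struct` module. The record layout (one IPv4 address plus three unsigned 32-bit integers) and the element names are assumptions for illustration; the thread only says each record has four children holding integers and IP addresses.

```python
import socket
import struct

# Hypothetical layout: 4-byte IPv4 address + three unsigned 32-bit integers,
# network byte order, no padding: 16 bytes per record.
RECORD = struct.Struct("!4sIII")

def pack_record(ip, packets, dropped, latency_us):
    """Encode one record as 16 bytes."""
    return RECORD.pack(socket.inet_aton(ip), packets, dropped, latency_us)

def record_to_xml(blob):
    """Generate the verbose XML form only when something actually needs it."""
    raw_ip, packets, dropped, latency_us = RECORD.unpack(blob)
    return ("<measurement>"
            f"<sourceAddress>{socket.inet_ntoa(raw_ip)}</sourceAddress>"
            f"<packetCount>{packets}</packetCount>"
            f"<droppedCount>{dropped}</droppedCount>"
            f"<latencyMicroseconds>{latency_us}</latencyMicroseconds>"
            "</measurement>")

blob = pack_record("10.0.0.1", 42, 7, 1500)
xml_form = record_to_xml(blob)
print(len(blob), len(xml_form))
```

The trade-off is the one the original poster wants to avoid: the receiving side has to know the layout, whereas XML is self-describing.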

Matt
Tuesday, July 20, 2004

Just use zlib's compress or compress2 function.  http://www.zlib.org/
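A sketch of that approach for the UDP case, using Python's `zlib` module (a wrapper around the same library) instead of calling the C `compress` function directly. The 1400-byte payload limit, the 4-byte chunk header, and the function names are assumptions; the sketch also assumes fewer than 65536 chunks and does not handle lost datagrams.

```python
import zlib

MAX_PAYLOAD = 1400  # assumed safe UDP payload size; tune to the real path MTU

def to_datagrams(xml_text, max_payload=MAX_PAYLOAD):
    """Compress once, then cut into numbered chunks that each fit one datagram."""
    compressed = zlib.compress(xml_text.encode("utf-8"))
    chunk_size = max_payload - 4          # 4-byte header: chunk index + chunk total
    chunks = [compressed[i:i + chunk_size]
              for i in range(0, len(compressed), chunk_size)]
    total = len(chunks)
    return [i.to_bytes(2, "big") + total.to_bytes(2, "big") + chunk
            for i, chunk in enumerate(chunks)]

def from_datagrams(datagrams):
    """Reassemble chunks received in any order and decompress them."""
    ordered = sorted(datagrams, key=lambda d: int.from_bytes(d[:2], "big"))
    total = int.from_bytes(ordered[0][2:4], "big")
    if len(ordered) != total:
        raise ValueError("missing datagrams")
    return zlib.decompress(b"".join(d[4:] for d in ordered)).decode("utf-8")
```

Each element of the returned list could then be passed to `socket.sendto` on a UDP socket; the receiver collects them and calls `from_datagrams`.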

It's virtually always better to use a standard compression library than to roll your own.  It's already debugged, it was designed by experts, and the guy on the receiving end won't be tempted to write his own broken implementation.

rob mayoff
Tuesday, July 20, 2004

This is akin to extending signal flags (bits of material stuck on sticks), with more material in order that the shortsighted general on the hill can get the message in a form he doesn't need his glasses for.

This may well be more efficient for the general, but the battle is likely to be over and lost before he finds out.

Simon Lucy
Tuesday, July 20, 2004

Convert the xml string to base 64, then transmit the data and decode it on the other end.

Pete
Tuesday, July 20, 2004

Pete, you can't be serious. Base64 encoding *increases* the size of the string by 33% or thereabouts.
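The 33% figure follows from Base64 emitting 4 output characters for every 3 input bytes. A quick Python check, using a hypothetical tag as filler:

```python
import base64

xml_bytes = b"<packetCount>42</packetCount>" * 300   # hypothetical sample data
encoded = base64.b64encode(xml_bytes)

# Every 3 input bytes become 4 output characters (plus padding on the last group)
ratio = len(encoded) / len(xml_bytes)
print(len(xml_bytes), len(encoded), round(ratio, 3))
```

Base64 only pays off when applied after a real compression step, to make binary output printable.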

Chris Tavares
Tuesday, July 20, 2004

http://www.research.att.com/sw/tools/xmill/

Matt
Tuesday, July 20, 2004
