Fog Creek Software
Discussion Board




binary xml - just say no

How many times now have you found the following:

1. A program has xml output of its files, claiming that with the new xml format, the program supports an 'open format' for your valuable data!

2. The xml files all look like the following:

<Document>
  <bin name="Data">
0C0000004D6963726F2054756E65720001000000010000000500000001000500        0000FEFFFFFF0C000000504D756C7469496E707574000000FEFFFFFF06000000
... etc - 300k or more of hexadecimal numbers
  </bin>
</Document>

--

This is so stupid - just save that darn thing as binary data already. Oh, but then you would not be "EELEETE" using the cool new open format of XML, which all the way cool people use.

Dennis Atkins
Sunday, March 28, 2004

Just to clarify - yes, it is very easy to get the binary data out; my beef is that this is not really a sensible use of xml -- if using xml, the way to do it would be to have the fields in the binary data each surrounded by xml tags indicating what sort of data it was and so forth. I have no beef with xml itself, I just think that having xml files that contain nothing but a single giant block of binary data encoded as hex makes absolutely no sense beyond allowing the marketing department to claim something is xml compatible.

Dennis Atkins
Sunday, March 28, 2004

At least the sales people can use XML in their presentations. Implementing export to binary-in-xml takes less than a day I guess ... so why not?

http://www.alexlechuck.com
Sunday, March 28, 2004

Looks like the MS Office XP XML format ...

T. Norman
Sunday, March 28, 2004

Scenario: you offer text xml for clients wanting 'standards' or running unknown systems, but also offer FASTER binary xml for clients needing speed (and willing to download or write their own compatible receiving software). What's wrong with that? The same middle tier offers both, depending on the client's preference in the standards vs. speed tradeoff. Seems a win-win to me!

Joe Hendricks
Sunday, March 28, 2004

MIME is probably a better choice for transporting binary data.

Dumping an arbitrary string of binary data into XML isn't safe because you may accidentally include characters that shouldn't be there (like ">") and make a total mess of it for the parser on the receiving end. But if you're trying to move the binary data "fast", why bother with the overhead of parsing an XML document in the first place?

Beth
Sunday, March 28, 2004

I need to save a waveform (e.g. an audio stream) in a file. I'll use XML to describe the stream (compression, type, date, etc). In this case what would you do with the data itself:

- store it in a separate file, outside of the XML file?
- store it as binary data (e.g. Hex-encoded) in the XML file
- store it in a more verbose but parsable format in the XML, e.g. <bin name="Data"> 12 0, 0, 0, 45, etc ...  </bin>


Sunday, March 28, 2004

>I'll use XML to describe the stream (compression, type, date, etc).

Use whatever facilities the binary stream already has (e.g. WMA, RA, etc.) and you have both the data and the little extra stuff already built in -- and it's in a format which doesn't require all the stupidity of XML.

MR
Sunday, March 28, 2004

It's streams of medical data. There are several storage standards for medical device signals already, including XML-based standards. XML is a Requirement: because it's so easily extensible (store new kinds of tags in it later), and because there are tools to handle it (e.g. save it in Oracle, query it, and so on). Yet this data has a "binary" component also. I realise this isn't the same as the OP's gripe, about XML that contains nothing but binary data.


Sunday, March 28, 2004

"Dumping an arbitrary string of binary data into XML isn't safe because you may accidentally include characters that shouldn't be there (like ">") and make a total mess of it for the parser on the receiving end."

That's why it's in hex format.  You won't get >; you'll get 3E.

Kyralessa
Monday, March 29, 2004
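
A minimal Python sketch of the two points quoted above, for anyone who wants to try it. The payload bytes and the variable names are made up for illustration; the element name comes from the original post. It shows that raw bytes pasted into an element are rejected by a parser, while the hex form contains nothing but 0-9 and A-F and round-trips cleanly.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload: a NUL byte plus characters that are markup in XML.
raw = b"\x00a>b<c&d"

# Pasting the raw bytes into an element gives the parser something it rejects.
try:
    ET.fromstring(b'<bin name="Data">' + raw + b"</bin>")
    raw_parses = True
except ET.ParseError:
    raw_parses = False

# Each byte becomes two hex digits, so only 0-9 and A-F reach the document.
encoded = raw.hex().upper()
document = '<bin name="Data">' + encoded + "</bin>"

print(raw_parses)                                          # False
print(encoded)                                             # 00613E623C632664
print(bytes.fromhex(ET.fromstring(document).text) == raw)  # True
```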

Could you have a file format with a valid XML header, then a null (or some other marker) followed by the raw binary contents?

Dan Maas
Monday, March 29, 2004

That's what CDATA is for.

Simon Lucy
Monday, March 29, 2004

The audio example above could easily be handled by encoding the metadata in an xml document, and the data itself in binary. These two parts could then be inserted in a MIME message, with the former referring to the latter (if necessary).

Using MIME is a great way to bundle data in different formats, and it's easy to add things such as compression and transfer encoding.

It does seem kind of strange to represent binary data in XML, but I suppose there could be a valid reason (apart from being able to add the word XML to presentations...).

cheers
/H

Henrik Sidebäck
Monday, March 29, 2004
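
A rough sketch of the MIME bundling idea above, using Python's standard email package. The metadata element names, the "samples" content ID and the stand-in byte string are all assumptions for illustration, not part of any real format. One part carries the XML description, a second part carries the binary data, and the XML refers to the binary part by its content ID.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Hypothetical metadata and payload; the element names are illustrative only.
metadata_xml = '<audio type="wav"><data href="cid:samples"/></audio>'
samples = bytes(range(256))  # stand-in for raw audio bytes

msg = EmailMessage()
msg.set_content(metadata_xml, subtype="xml")  # text/xml part describing the data
# Adding an attachment turns the message into multipart/mixed.
msg.add_attachment(samples, maintype="application", subtype="octet-stream",
                   cid="<samples>")

wire = msg.as_bytes()

# The receiving side parses the bundle and gets the original bytes back.
parsed = message_from_bytes(wire, policy=policy.default)
xml_part, bin_part = list(parsed.iter_parts())
assert bin_part.get_content() == samples
```

Note that this library base64-encodes the attachment by default, so the size overhead discussed elsewhere in the thread still applies unless a different transfer encoding is chosen.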

I know...

Why not have a binary file, but with a header?!

everything old is new again
Monday, March 29, 2004

In the case of an audio stream, clearly you should make the waveform readable. i.e.

<audio type="wav">
  <channel type="left front">
      <sample>0</sample>
      <sample>+2</sample>
      ...
  </channel>
</audio>

:-) jk. 

Seriously, though, if you're making a containment file structure to hold one or more binary pieces of data, XML is ideal and is the right choice in many cases.

Dennis Forbes
Monday, March 29, 2004

>Seriously, though, if you're making a containment file structure to hold one or more binary pieces of data, XML is ideal and is the right choice in many cases.

Why not just code it in the binary?  As I said before, the streams themselves hold that information -- storing it *again* in XML is redundant.

MR
Monday, March 29, 2004

> :-) jk

The FDA is doing something like:

<digits>-4 -13 -18 -18 -18 -17 -16 -16 -16 -16 -16 -17 -18 -18 -18 -17 -16 -16</digits>

http://www.hl7.org/V3AnnECG/foundationdocuments/itsxml/datatypes-its-xml.htm#dtimpl-SLIST

Christopher Wells
Monday, March 29, 2004
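
For reference, reading a whitespace-separated list like the one above back into numbers is short. This is only a sketch in Python with the sample element copied from the post; it does not follow the full HL7 datatype definition.

```python
import xml.etree.ElementTree as ET

# Sample element copied from the post above.
element = ET.fromstring(
    "<digits>-4 -13 -18 -18 -18 -17 -16 -16 -16 -16 -16 -17 -18 -18 -18 -17 -16 -16</digits>"
)

# Whitespace-separated integers; split() with no arguments handles any run of spaces.
samples = [int(token) for token in element.text.split()]
print(len(samples), samples[:3])  # 18 [-4, -13, -18]
```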

""" As I said before, the streams themselves hold that information -- storing it *again* in XML is redundant"""

Yeah but you get to put "XML" on your resume. 


Monday, March 29, 2004

"Why not just code it in the binary?"

Well my point is that truly binary data _remains_ binary, but if you can utilize the flexibility of XML for your containment file, and leverage the many XML tools and components available, then why not? Let's say that I'm making a format to store 16-bit audio data, so I carefully map out my binary format.

AUDIO.FMT

5 bytes header holding the data "SOUND"
X bytes null terminated holding song name
2 bytes holding the number of channels
4 bytes holding the sample rate
xx - interlaced 16-bit samples alternating between channels

On the other hand, some so-called XML resume padder decides that instead they're going to use XML. In this case SOUNDDATA is filled with PCM 16-bit samples.

<AUDIO Song="Oh Come All Ye XML Faithful">
  <CHANNEL NAME="Left Front" SAMPLERATE="44100"><SOUNDDATA>...</SOUNDDATA></CHANNEL>
  <CHANNEL NAME="Right Front" SAMPLERATE="44100"><SOUNDDATA>...</SOUNDDATA></CHANNEL>
</AUDIO>

And then one day I decide that I want to add closed captioning/bouncing ball functionality to this.  In XML this would be a no brainer and I can easily think of a non-breaking addition.

...<CAPTIONS><CAPTION OFFSET="00:01:27" DISPLAY="00:00:07">Hail ye XML God!</CAPTION></CAPTIONS>
</AUDIO>...

And so on. Brutally straightforward. If I wanted to add a DRM section or a video stream or any nature of metadata like who the author was, it's all trivial.

From a purely binary perspective obviously you can accommodate these needs, but you are FAR more likely to paint yourself into a corner (i.e. if there was a player that understood the first XML variant, it would be oblivious of the captioning in the second and that would be a non-breaking change. It is highly likely that binary file changes break existing apps in the first case).

So I guess my point is that even where your "XML" file is nothing more than a containment of a single binary-domain stream (images, sound, movies, etc), it still is justifiable because it offers metadata and flexibility that you don't get otherwise. There are downsides, such as the expansion of binary to a text-friendly format (i.e. base-64, bloating the binary by 33% or so), and the resources required for such free-form parsing, but it is a valid widget on one's toolbelt.

Dennis Forbes
Monday, March 29, 2004
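
The "33% or so" figure for base-64 is easy to check, and the same check shows why the hex dump in the original post is worse. A quick Python sketch follows; the 300,000-byte payload is just a made-up size.

```python
import base64
import binascii

payload = bytes(300_000)  # hypothetical 300,000-byte binary blob

as_base64 = base64.b64encode(payload)  # 4 output characters per 3 input bytes
as_hex = binascii.hexlify(payload)     # 2 output characters per input byte

print(len(as_base64) / len(payload))  # 1.3333333333333333
print(len(as_hex) / len(payload))     # 2.0
```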

Dennis,

I like your example of adding closed captioning to the audio. It illustrates a place where xml encoding of the data is appropriate. But that doesn't mean the actual sound data should be encoded inside the xml too. Taking an audio stream and using base-64 encoding to escape it within an xml document will add a ton of pointless overhead to the decoding of the sound data. It would make a lot more sense to just have two files:

music.mp3
music_captions.xml

If you have a player that knows how to read these closed-captioning files, it will know how to take the captioning information from the xml file and synch it with the audio data from the mp3 file.

Of course, that's just my opinion (except for the factual stuff).

Benji Smith
Monday, March 29, 2004
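
A sketch of what the player side of the two-file approach might look like, in Python. The contents of music_captions.xml are assumed here, reusing the CAPTION attributes from the earlier post, and to_seconds and caption_at are hypothetical helper names. The player would call something like caption_at with the current playback position of the mp3.

```python
import xml.etree.ElementTree as ET

# Hypothetical contents of music_captions.xml, reusing the CAPTION attributes
# from the earlier post (OFFSET = start time, DISPLAY = how long to show it).
captions_xml = """
<CAPTIONS>
  <CAPTION OFFSET="00:01:27" DISPLAY="00:00:07">Hail ye XML God!</CAPTION>
</CAPTIONS>
"""

def to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def caption_at(root, position):
    """Return the caption text visible at `position` seconds, or None."""
    for caption in root.iter("CAPTION"):
        start = to_seconds(caption.get("OFFSET"))
        if start <= position < start + to_seconds(caption.get("DISPLAY")):
            return caption.text
    return None

root = ET.fromstring(captions_xml)
print(caption_at(root, 90))  # Hail ye XML God!
print(caption_at(root, 10))  # None
```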

I think the XML standard could be easily extended to handle binary data.  My thought is just to prefix the file with the XML and after the last tag (XML must be completely enclosed by a start and end tag) just stream the binary data.

It should be a simple modification to XML parsers.  And then we can stop all this Base64 madness!  Just have the tags in the XML refer to the appropriate offset in the file:

<AUDIO Song="Oh Come All Ye XML Faithful">
  <CHANNEL NAME="Left Front" SAMPLERATE="44100"><SOUNDDATA START="56783" LENGTH="107383"/></CHANNEL>
  <CHANNEL NAME="Right Front" SAMPLERATE="44100"><SOUNDDATA START="164166" LENGTH="106570"/></CHANNEL>
</AUDIO>

Programs only concerned with reading out the meta data can just ignore the binary content of the file.

Almost Anonymous
Monday, March 29, 2004
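
A minimal Python sketch of the "XML first, raw bytes after the last tag" idea above. The pack and unpack helpers and the END_TAG marker are hypothetical names. Two choices here are assumptions of the sketch rather than part of the proposal: START is measured from the first byte after the XML (so the offsets do not depend on how long the XML turns out to be), and the reader splits the file at the root end tag itself, because a stock XML parser would reject the trailing binary.

```python
import xml.etree.ElementTree as ET

END_TAG = b"</AUDIO>"  # assumed marker: binary data starts right after the root end tag

def pack(channels):
    """channels: list of (name, raw_bytes). Returns XML header followed by raw data."""
    root = ET.Element("AUDIO")
    blob = b""
    for name, data in channels:
        channel = ET.SubElement(root, "CHANNEL", NAME=name)
        # START is measured from the first byte after the XML (a choice of this sketch).
        ET.SubElement(channel, "SOUNDDATA", START=str(len(blob)), LENGTH=str(len(data)))
        blob += data
    return ET.tostring(root) + blob

def unpack(packed):
    """Split at the root end tag, parse the XML, then slice out each stream."""
    split = packed.index(END_TAG) + len(END_TAG)
    root, blob = ET.fromstring(packed[:split]), packed[split:]
    result = {}
    for channel in root.iter("CHANNEL"):
        ref = channel.find("SOUNDDATA")
        start, length = int(ref.get("START")), int(ref.get("LENGTH"))
        result[channel.get("NAME")] = blob[start:start + length]
    return result

left, right = b"\x00\x01<>&\xff", b"\x10\x20\x30"
packed = pack([("Left Front", left), ("Right Front", right)])
assert unpack(packed) == {"Left Front": left, "Right Front": right}
```

A program that only wants the metadata can stop after parsing the XML portion and never touch the bytes that follow.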

"I think the XML standard could be easily extended to handle binary data...And then we can stop all this Base64 madness! "

We are in total agreement, and this is a huge issue with the use of XML as a "property bag". As a minor variant of your idea, removing the need for offsets in the actual XML, it could be as simple as multiple "streams" packaged in one single XML file (with the XML having a "FAT" of sorts, and then a default XML text block).

<xmlpackage>
  <FAT>
      <data name="!stream1" offset="0" length="1000" endian="big" />
      <data name="!stream2" offset="1000" length="1000" endian="big" />
  </FAT>
  <base>
      <audio>
          <channel name="left channel">[!stream1]</channel>
          <channel name="right channel">[!stream2]</channel>
      </audio>
  </base>
</xmlpackage>
$DATA$

Exactly as you mentioned, to the parser I should be able to open the channel element and shove a bunch of data in or pull it out. Indeed, when opening this file it should be transparent, opening the file as if it was from base and down, automatically handling the data blocks where necessary.

Dennis Forbes
Monday, March 29, 2004

"As a minor variant of your idea, removing the need for offsets in the actual XML..."

I think the FAT idea is a bit complicated (we don't need another SOAP!).  But I do like the idea of embedding it as you have done.  Maybe with a processing instruction:

<xmlpackage>
  <base>
      <audio>
          <channel name="left channel"><?bin 0:1000 ?></channel>
          <channel name="right channel"><?bin 1000:1000 ?></channel>
      </audio>
  </base>
</xmlpackage>
$DATA$

Almost Anonymous
Monday, March 29, 2004
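
Processing instructions like these can already be read with a stock DOM parser, so only the trailing data needs special handling. A small Python sketch follows; stream_refs is a hypothetical helper, and reading "0:1000" as offset:length is an assumption based on the offset/length pairs in the earlier FAT example.

```python
from xml.dom import minidom

# XML part only; in the proposal the raw bytes would follow the root end tag.
xml_text = """<xmlpackage><base><audio>
  <channel name="left channel"><?bin 0:1000 ?></channel>
  <channel name="right channel"><?bin 1000:1000 ?></channel>
</audio></base></xmlpackage>"""

def stream_refs(text):
    """Map each channel name to the (offset, length) named by its <?bin ?> instruction."""
    refs = {}
    for channel in minidom.parseString(text).getElementsByTagName("channel"):
        for node in channel.childNodes:
            if node.nodeType == node.PROCESSING_INSTRUCTION_NODE and node.target == "bin":
                offset, length = (int(n) for n in node.data.split(":"))
                refs[channel.getAttribute("name")] = (offset, length)
    return refs

print(stream_refs(xml_text))
# {'left channel': (0, 1000), 'right channel': (1000, 1000)}
```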

Anybody remember when men were men and weren't afraid of reading a header into a struct?  XML is a solution looking for a problem (with apologies to XML-RPC).

offset | size | description
0          2      marker (FFD8 hex)
2          2      image width in pixels
4          2      image height in pixels
...
...
...

sissy boys
Monday, March 29, 2004
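
For comparison, the header table above read with Python's struct module. This is a sketch only: read_header is a hypothetical name, the sample bytes are fabricated for the demo, and big-endian byte order is an assumption since the post does not say.

```python
import struct

# Layout from the table above: three 2-byte fields. Big-endian byte order is an
# assumption of this sketch; the post does not say which order is intended.
HEADER = struct.Struct(">HHH")

def read_header(data):
    """Unpack marker, width and height from the first 6 bytes."""
    marker, width, height = HEADER.unpack_from(data, 0)
    if marker != 0xFFD8:
        raise ValueError("unexpected marker: %#06x" % marker)
    return width, height

sample = HEADER.pack(0xFFD8, 640, 480) + b"...rest of file..."
print(read_header(sample))  # (640, 480)
```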

> Anybody remember when men were men and weren't afraid of reading a header into a struct?

Yes: you had to rebuild everything whenever you wanted to add a new element to the struct.

Christopher Wells
Monday, March 29, 2004

What if you add a new field to the xml record?  If you have a client reading the xml file, won't the client need programming to handle the extra field?  How is that different from adding a field to a binary file format? 

Maybe I just don't get xml (quite probable). 

mwv
Monday, March 29, 2004

>"If there was a player that understood the first XML variant, it would be oblivious of the captioning in the second and that would be a non-breaking change."

This could easily be accomplished in a binary format as well -- it is simply a matter of how you code your parser in either case.


>"then why not [use XML]?"

There are plenty of reasons why XML is bad and precious few (most are non-functional) reasons why it is "good".

You can't get metadata with a binary format?  Metadata was not invented with XML; *every* problem (that I'm aware of) that XML "solves" has been solved better somewhere else.

And mwv is correct -- you can change the format to your heart's content but the application will have to change as well in order to make use of it.

MR
Monday, March 29, 2004

"You can't get metadata with a binary format?"

The main advantage of XML is that it is human-readable and human-editable.  Binary formats are decidedly not.

Almost Anonymous
Monday, March 29, 2004

>"The main advantage of XML is that it is human-readable and human-editable.  Binary formats are decidedly not. "

Indeed -- those are the "non-functional" benefits I alluded to.  But, in a world of computer-to-computer transmission why does human-readability matter?

MR
Monday, March 29, 2004

The benefit to XML in these situations (changing data) is that you can add a minor change to the record and retain forward and backward compatibility.

Let's say I design version 1.0 of an XML schema for storing MP3 tag data.  It stores "artist," "album," "song name" and "length."  Later, in version 2.0, I decide that I want to add an additional field for "music genre."  Since it's in XML, it can be a non-breaking change:

(a) If a person is using version 1.0 of my app but encounters a song in 2.0 format, the app will just ignore the unknown "genre" tag. 

(b) If a person is using version 2.0 of my app but has a song in 1.0 format, it can easily just display "genre: unknown."

If you're using a binary format (or a proprietary text file format) instead, it will be difficult to mimic the forward compatibility from scenario (a).  Maybe you could encode the tag information at the end of the file, so the old player just ignores it, but that's kludgy.

You could get the backwards compatibility from scenario (b) more easily, but it requires some additional work with conditionals to determine how to load the file.  XML is a bit more graceful.

Robert Jacobson
Monday, March 29, 2004
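
Scenarios (a) and (b) above can be sketched in a few lines of Python. The element names and the read_v1/read_v2 functions are hypothetical, chosen to match the fields listed in the post; the point is only that a reader which looks fields up by name never sees the ones it does not ask for.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag files; element names follow the fields listed in the post.
V1_FILE = "<song><artist>A</artist><album>B</album><name>C</name><length>180</length></song>"
V2_FILE = ("<song><artist>A</artist><album>B</album><name>C</name>"
           "<length>180</length><genre>Jazz</genre></song>")

def read_v1(text):
    """Version 1.0 reader: asks only for the fields it knows, so extras are ignored."""
    song = ET.fromstring(text)
    return {field: song.findtext(field) for field in ("artist", "album", "name", "length")}

def read_v2(text):
    """Version 2.0 reader: same fields plus genre, with a default for old files."""
    song = ET.fromstring(text)
    tags = read_v1(text)
    tags["genre"] = song.findtext("genre", default="unknown")
    return tags

print(read_v1(V2_FILE))           # scenario (a): genre is silently skipped
print(read_v2(V1_FILE)["genre"])  # scenario (b): unknown
```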

Robert,

The analogy is not quite the same.  In your example, there are two assumptions that you're making:
1) the binary data is not read via a library
2) the binary data is not formatted in such a way as to make extensibility an option

To make it fair, you'd want to read your binary file with some sort of library much like XML is done.  This would do the same sort of massaging that XML libs do to XML data.

Further, it is possible (trivial, even) to construct a binary format which would achieve the intended result (adding some additional metadata).  Not only that, but provided you didn’t try and reproduce the same tags and encoding schemes as XML in your binary file it would be more efficient, too.

MR
Monday, March 29, 2004

"But, in a world of computer-to-computer transmission why does human-readability matter?"

Human-readable matters when you are actually trying to code up machine-to-machine interaction.  It helps to be able to actually see and understand the incoming and outgoing data.

"If you're using a binary format (or a proprietary text file format) instead, it will be difficult to mimic the forward compatibility from scenario (a)."

Binary formats do that all the time.  Just as XML can have variable-length records -- so can binary files.  As mentioned by another poster -- you could create a binary file format that is semantically similar to XML.

Almost Anonymous
Monday, March 29, 2004
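
One way a binary format gets the same skip-what-you-don't-know behaviour is a tag-plus-length record layout, the general idea behind chunked formats such as RIFF and PNG. A minimal Python sketch follows; the four-byte tags, the field widths and the function names are all invented for the example.

```python
import struct

CHUNK_HEADER = struct.Struct(">4sI")  # 4-byte tag, 4-byte big-endian payload length

def write_chunks(chunks):
    """chunks: list of (tag, payload) pairs, where tag is exactly 4 bytes."""
    return b"".join(CHUNK_HEADER.pack(tag, len(payload)) + payload
                    for tag, payload in chunks)

def read_chunks(data, known_tags):
    """Return payloads for known tags; the length field lets us skip the rest."""
    found, position = {}, 0
    while position < len(data):
        tag, length = CHUNK_HEADER.unpack_from(data, position)
        position += CHUNK_HEADER.size
        if tag in known_tags:
            found[tag] = data[position:position + length]
        position += length
    return found

# A "version 2" file with a genre chunk, read by a "version 1" reader.
v2_file = write_chunks([(b"ARTS", b"Some Artist"), (b"GENR", b"Jazz"), (b"NAME", b"A Song")])
print(read_chunks(v2_file, known_tags={b"ARTS", b"NAME"}))
# {b'ARTS': b'Some Artist', b'NAME': b'A Song'}
```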

>"Human-readable matters when you are actually trying to code up machine-to-machine interaction.  It helps to be able to actually see and understand the incoming and outgoing data."

Sure -- but in that case it is a small matter of coding a suitable translator program.  You could even embed it in your text editor of choice to make it seamless.

MR
Monday, March 29, 2004

"Sure -- but in that case it is a small matter of coding a suitable translator program."

Isn't that the chicken and the egg?  You mean I'm going to code up a translator program to translate the binary format so I can read it so that I can write a translator program so my application can read it?

Now, if you can get everyone to agree on a standard binary interchange format.  Then get them to build nice parser libraries for it.  Then from that, build graphical tree-like viewers/editors for it.  Then we could do away with XML...

I'm not saying it's a bad idea.  In fact, it's a great idea.  I'm no fan of XML.  But I don't think it'll happen.

Almost Anonymous
Monday, March 29, 2004

I fully appreciate that it's in fashion to be "anti-XML" in some camps, so rather than responding directly I'll be more general.

-XML is not the best thing since sliced bread. The contrarian crowd sets up this strawman -- that XML is irrelevant because it doesn't solve all problems -- because it's easy to knock down. To the "XML crowd" it's just another tool that makes some tasks easier (kinda like ASCII made transferring data between computers easier. I have no doubt that in the early days of ASCII there were those proclaiming "What does it matter? I have my own character translation service running between the EDC72 and the BAC44! ASCII is irrelevant!").

-Of _course_ you can model the same sort of expandable/versatile structure in your own `binary' format. XML didn't come from nowhere, and there were parsed hierarchical formats for decades before then. Having said that, there weren't many (because parsing is a non-trivial task), and of those that there were there was zero standardization. _Much_ more common was fixed "structures saved to disk", with all of the joys that that entails.

-One of the greatest benefits of XML is that there are a huge number of tools and libraries existing that can "understand" XML, parsing and translating documents, even when they have no understanding of the actual content. Inane chatter about how you can achieve the goals of XML touches upon the basic, versatile structure, but it ignores one of the primary reasons why many developers choose XML.

As an aside - I'd love to find the "átkíns" "©iális" aholes and tear them a new one. I'm getting about three dozen spams from these dirtbags per day. Email is borked.

Dennis Forbes
Monday, March 29, 2004

I agree with Almost Anonymous, the arguments against XML sound like "well, I could do it in a custom way" more efficiently.

And then instead of one common parser, I need to install and use your "library" for your format, another obscure library for Joe Blow's binary format, etc. Assuming there are docs on the web about it.

>>There are plenty of reasons why XML is bad and precious few (most are non-functional) reasons why it is "good".

Let's see...
- The format is readable and self-documenting (if done right)
- Tons of tools/libraries (msxml, jaxp, xml spy)
- Can be extended and is a standard. (See RSS for one reason why XML is better than a proprietary format) Look at how many 3rd party tools already exist for that...

Another example: if you use MS Infopath and point it to an XML doc or a web service, it will create a treenode of those elements, allowing me to drag & drop them on a form (using the element names as the labels by default). If I select an element within another one, it knows it is a one-to-many and lets me choose to create a list or repeating table, etc. And if I use an XML schema instead, all of the validations and constraints will be automatically utilized by the Infopath tool.

Another example: if mp3 used xml as its format for the track info, I could open it up in notepad and fix a song title w/o having to use iTunes or any other editor made for that purpose.

AEB
Monday, March 29, 2004

>And then instead of one common parser, I need to install
>and use your "library" for your format, another obscure
>library for Joe Blow's binary format, etc. Assuming there is
>docs on the web about it.

As opposed to having to write handlers for all these different schemas.  You may have one parser, but that's just another level of abstraction.

>- The format is readable and self-documenting (if done right)

If you have trouble reading file specs, you have no business being a programmer.

> Tons of tools/libraries (msxml, jaxp, xml spy)

Quantity != quality.

> Can be extended and is a standard. (See RSS for one
> reason why XML is better than a proprietary format)
> Look at how many 3rd party tools already exist for that...

Look at how many JPEG libraries there are.  Care to compare the quality?


Note:  I'm in favor of XML, it looks good on my resume.  Ain't nothin' special 'bout it though.

mwv
Monday, March 29, 2004

"You may have one parser, but that's just another level of abstraction."

You make it sound like that's a bad thing. 

"If you have trouble reading file specs, you have no business being a programmer."

If you are a programmer then you know that frequently you don't have the specs or, far more often, they are not correct. 

"Quantity != quality."

Quantity == Available in a large number of platforms / languages.  At this point, I'd venture to say that nearly all programming languages have an XML parser written for them.

Almost Anonymous
Monday, March 29, 2004

>> Ain't nothin' special 'bout it though.

Except that it won the popularity contest and that is enough. In the same way that JPEG has many libraries due to its popularity, XML's primary benefit is its broad support.

When ERP systems talk XML, when mainframes communicate via XML, when files are in XML, when every language has an XML library, then that becomes its main advantage. That day is here.

Same with TCP-IP, HTTP, HTML, SQL, mp3, zip.

It makes sense in some places, and it doesn't make sense in others (the example where the only data is binary).

AEB
Monday, March 29, 2004

Sure, XML is just another format for storing/transmitting data; it's not a panacea.  You can get the same end results through your own hand-rolled code or a custom library.  A custom format may be more efficient in terms of both time (potentially easier to parse) and space (fewer bytes wasted for the meta information.)  If those criteria are important for your application, you probably should use a binary format.

Personally, I have no particular desire to work on low-level plumbing details like this.  I'm working in C#/VB, and the .Net XML libraries usually get the job done with minimal fuss (plus they seem to be rock-solid.)  I also like that it's a standardized format, so the code should be more maintainable and I can use off-the-shelf editors.

My two cents.  Obviously YMMV....

Robert Jacobson
Monday, March 29, 2004

I'm not suggesting that the 'do it yourself' method is the correct answer to the problems we face in trying to get data from point A to point B.  I'm suggesting that a *better* standard should be adopted.

P.S. XML is only self documenting in the same way that this code sample is self-documenting:
surface_area = 4 * 3.14 * pow( r, 2 );

MR
Tuesday, March 30, 2004

Oh, and XML is really, really ill-suited for storing non-trivial data. 

MR
Tuesday, March 30, 2004

"Oh, and XML is really, really ill-suited for storing non-trivial data. "

Ah...one of those "I know because I'm in the big leagues". Let me try this brilliant technique on for size.

"C++ is really, really ill-suited for non-trivial applications"

"SQL Server is really, really ill-suited for non-trivial database needs"

"x86 is really, really ill-suited for non-trivial computational tasks"

"HTTP is really, really ill-suited for non-trivial content"

Wonderful. Such a masterful technique will surely smite all who oppose that which I hold dear, or who hold dear that with which I disagree.

.
Tuesday, March 30, 2004

By trivial I meant amounts -- e.g. although it's not good for data storage at all, you could probably get away with storing a tiny amount of data as XML and not get burned.  For any amounts of data that are not trivial (anything which would resemble a typical database), a DBMS is the way to go.

MR
Tuesday, March 30, 2004
