Fog Creek Software
Discussion Board




Simple vs. complicated text formats.

In protocol after protocol, I find text formats that start off well enough, but always get sabotaged with unneeded complexity.

For example, some network protocol might devise a format for meta information made up of name=value pairs, one per line:

name=value
other_name=other value

Simple, right? But then they decide to add an "email" field, and they do it like this:

email=username@domain <Real Name>

Then they add an encryption field, and do it like this:

encryption=format:key

Then they add a connection field, and do it like this:

connection=123.123.123.123:99 INET/IP4

...now you end up needing a custom parser for each field.
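
To make the cost concrete, here is a rough Python sketch of what a reader of the three fields above ends up writing. The function names and the dict layout are invented for illustration, and each field's grammar is guessed from its single example line, since no real spec is being quoted.

```python
import re

def parse_email(value):
    # Shape assumed from the example: "username@domain <Real Name>"
    m = re.fullmatch(r"(\S+)\s+<(.*)>", value)
    if m is None:
        raise ValueError(f"bad email field: {value!r}")
    return {"address": m.group(1), "real_name": m.group(2)}

def parse_encryption(value):
    # Shape assumed from the example: "format:key".
    # Split once so the key itself may contain ':'.
    fmt, key = value.split(":", 1)
    return {"format": fmt, "key": key}

def parse_connection(value):
    # Shape assumed from the example: "123.123.123.123:99 INET/IP4"
    endpoint, kinds = value.split(" ", 1)
    address, port = endpoint.rsplit(":", 1)
    connection_type, address_type = kinds.split("/", 1)
    return {
        "address": address,
        "port": int(port),
        "connection_type": connection_type,
        "address_type": address_type,
    }

# One hand-written parser per "special" field (hypothetical table).
FIELD_PARSERS = {
    "email": parse_email,
    "encryption": parse_encryption,
    "connection": parse_connection,
}

def parse_meta(text):
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _, value = line.partition("=")
        # Fields without a special parser stay plain strings.
        result[name] = FIELD_PARSERS.get(name, str)(value)
    return result
```

Every new field with its own little sub-syntax means another entry in that table and another function to write and test.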

The great thing about XML is the symmetry it affords to its users: elements, each with a name and optional attributes, composed of other elements. No, I'm not saying XML + XSLT + Schema + DTDs + whatever else is simpler than the above example, but just imagine the above rewritten as XML:

<meta>
  <email>
      <address>user@domain</address>
      <real-name>Real Name</real-name>
  </email>
  <encryption>
      <format>PGP</format>
      <key>some key</key>
  </encryption>
  <connection>
      <address>123.123.123.123</address>
      <port>99</port>
      <connection-type>INET</connection-type>
      <address-type>IP4</address-type>
  </connection>
</meta>
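
As a sketch of what that symmetry buys, one generic function can walk the whole document with no per-field code. This uses Python's standard xml.etree.ElementTree; the helper name element_to_dict and the nested-dict result are assumptions made for illustration, and the helper ignores attributes and repeated sibling tags.

```python
import xml.etree.ElementTree as ET

DOC = """\
<meta>
  <email>
      <address>user@domain</address>
      <real-name>Real Name</real-name>
  </email>
  <connection>
      <address>123.123.123.123</address>
      <port>99</port>
      <connection-type>INET</connection-type>
      <address-type>IP4</address-type>
  </connection>
</meta>
"""

def element_to_dict(element):
    """Turn an element into a string (leaf) or a dict of its children."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    # Simplification: a repeated child tag would overwrite the earlier one.
    return {child.tag: element_to_dict(child) for child in children}

meta = element_to_dict(ET.fromstring(DOC))
# meta["connection"]["port"] is the string "99"; nothing here knows
# what an email or a connection is.
```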

I guess what I'm complaining about is a lack of symmetry. If you start off with name=value pairs and need sub fields and go with a space or tab as the first sub field separator, then for God's sake, why not stay with that? Is it necessary to define all sorts of cute little formats with "<", ">", ":", "(" and ")" in them for each sub field as the mood strikes you?

Do these spec writers ever eat their own dog food? Have they considered how hard it will be to build an event stream-based or tree-based parser for their cute little formats?

The Iron-Fisted Hot Dog and Pretzel Baron of Fulton County
Saturday, April 3, 2004

If we define a field of a format as a name=value pair, I think the next step is not to create sub fields, as in name=value1:value2:value3, but to create a way to aggregate related fields together, like this:

[section1]
name1=value
name2=value

[section2]
name3=value
name4=value

Using your XML example:

[email]
address=user@domain
realname=Real Name

[encryption]
format=PGP
key=some key

[connection]
address=123.123.123.123
port=99
connectionType=INET
addressType=IP4
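
As a rough illustration, a sectioned layout like this can be read with an off-the-shelf INI reader and no field-specific code. The sketch below uses Python's standard configparser; the variable names are made up, and optionxform is overridden only because configparser lowercases key names by default.

```python
import configparser

TEXT = """\
[email]
address=user@domain
realname=Real Name

[encryption]
format=PGP
key=some key

[connection]
address=123.123.123.123
port=99
connectionType=INET
addressType=IP4
"""

parser = configparser.ConfigParser(interpolation=None)
parser.optionxform = str  # keep "connectionType" as written
parser.read_string(TEXT)

# One generic pass: every section becomes a plain dict of strings.
meta = {section: dict(parser[section]) for section in parser.sections()}
# meta["connection"]["connectionType"] is "INET"
```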


Using a sub field removes semantic information from the format and makes it harder to understand. For example:

windowDefaultColors=white:black

So is this white text on a black background or black text on a white background?
Well, we have to consult our text format definition, which now has to carry not only the syntax rules but also a lot of semantic information. The format is more expressive if we do this:

[windowDefaultColors]
background=white
foreground=black


The problem with many complicated formats is that the author fails to see that there's no need to change the basic building block to fit complicated data; what's needed is a way to give that data structure.

r-pt
Saturday, April 3, 2004

Yeah, format isn't really the issue.  I've seen some of the most horrible, kludgy solutions done with enormous and inconsistent XML files.  No file format makes the file designer sensible.

I for one think that the old INI files were generally easy to navigate, create, and modify.  I generally prefer that format to XML.  So I guess that is why MS got rid of them in favor of the unmaintainable registry.  Oh well.

--oren

Oren Miller
Saturday, April 3, 2004

The registry wins over INI files actually because it's HARD to parse.  The easy-to-parse INI files mean lots of programs try to manipulate them by hand, and mess them up in the process.  With the registry there is only one way in, the Windows API.

As for XML, the advantage over INI files (or almost any other format) is that good parsers are available in almost any language (C++, JavaScript, VBScript, Java, Perl, C#, etc...) and they do a lot more than your average ini file parser.  The Perl one for example will build arrays and hashes where appropriate.

Using DTD or XSLT they will validate your data for you.  Many of the parsers will also handle converting things to floats or enums instead of strings where appropriate.

The best parsers will also handle any language, since XML covers that as well.  Put Japanese in an .INI file and you're on your own.  In XML you just declare the encoding at the top (a Japanese encoding or Unicode) and your parser will handle the rest.

How are these lines parsed in an INI file?

the name=Gregg
name=gregg=man
place= Japan
zip = 12345

Is the first line valid?  Is the second line "name" eq "gregg=man" or "name=gregg" eq "man"?  Is the 3rd line place="Japan" or place="<space>Japan"?  Is the 4th "zip" or "zip<space>"?

XML covers all those cases.
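
For comparison, here is how one particular INI reader answers those four questions. This is Python's standard configparser with default settings; the [person] header is an assumption added only because that parser requires a section. The answers are this parser's conventions, not something the INI format itself defines, and another parser is free to decide differently.

```python
import configparser

SAMPLE = """\
[person]
the name=Gregg
name=gregg=man
place= Japan
zip = 12345
"""

parser = configparser.ConfigParser(interpolation=None)
parser.read_string(SAMPLE)
answers = dict(parser["person"])

# With default settings this parser:
#   accepts a key containing a space ("the name"),
#   splits at the first '=' (so "name" maps to "gregg=man"),
#   strips whitespace around keys and values.
for key, value in answers.items():
    print(repr(key), "->", repr(value))
```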

Gregg Tavares
Saturday, April 3, 2004

"imagine the above rewritten as XML"

Or appropriately structured plaintext, e.g.

email:
    address=user@domain
    real-name=Real Name
encryption:
    format=PGP
    key=some key
connection:
    address=123.123.123.123
    port=99
    connection-type=INET
    address-type=IP4
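
A layout like this needs very little machinery to read. The sketch below is one possible reader, under assumed rules that the example only implies: a group line ends with ":", a field line is indented and holds name=value, and there is only one level of nesting. The function name parse_indented is made up.

```python
def parse_indented(text):
    """Read 'group:' headers followed by indented name=value lines."""
    result = {}
    current = None
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        if raw[0].isspace():
            # Indented line: a field belonging to the current group.
            if current is None:
                raise ValueError(f"field outside any group: {raw!r}")
            name, sep, value = raw.strip().partition("=")
            if not sep:
                raise ValueError(f"expected name=value: {raw!r}")
            current[name] = value
        elif raw.rstrip().endswith(":"):
            # Unindented line ending in ':' opens a new group.
            current = result.setdefault(raw.rstrip()[:-1], {})
        else:
            raise ValueError(f"unrecognised line: {raw!r}")
    return result

SAMPLE = """\
email:
    address=user@domain
    real-name=Real Name
connection:
    address=123.123.123.123
    port=99
"""

meta = parse_indented(SAMPLE)
```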

has
Saturday, April 3, 2004

Plain text files work best only when there is a fixed field size. Then parsing becomes almost trivial. Problems arise when the field size varies and one has to add multiple loops and searches to the parser.
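
A small Python sketch of the fixed-size case: the layout below (field names and widths) is purely hypothetical, chosen to fit the connection example from earlier in the thread. Parsing is nothing but slicing at known offsets.

```python
# Hypothetical record layout: (field name, width in characters).
LAYOUT = [("address", 15), ("port", 5), ("connection_type", 4)]

def parse_fixed(line):
    """Cut a fixed-width record into fields; no searching for separators."""
    record, offset = {}, 0
    for name, width in LAYOUT:
        record[name] = line[offset:offset + width].rstrip()
        offset += width
    return record

# Build a record padded to the assumed widths, then read it back.
line = f"{'123.123.123.123':<15}{'99':<5}{'INET':<4}"
record = parse_fixed(line)
```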

Regards

Kaushik Janardhanan

KayJay
Saturday, April 3, 2004

email_address = name@domain
email_user_name = First Last

Every config file I've dealt with does just that, where did you get your example, Iron-Fisted ?

Egor
Saturday, April 3, 2004

"The registry wins over INI files actually because it's HARD to parse.  "

Good point. Never thought of that before.  It's like doctors using Latin (a "dead language"): society in general is unlikely to change the usage of the Latin words.  So medical terminology is "write protected". You don't have patients trying to misuse terminology.

Mr. Analogy
Saturday, April 3, 2004

> Every config file I've dealt with does just that, where did you get your example, Iron-Fisted ?

It wasn't a config file, it was samples from various network protocols I've had the pleasure of implementing where the designers clearly had their heads up in the clouds.

The Iron-Fisted Hot Dog and Pretzel Baron of Fulton County
Sunday, April 4, 2004

"where the designers clearly had their heads up in the clouds"

lol, you weren't picking on HTTP were you?

i like i
Sunday, April 4, 2004

No, I wasn't picking on HTTP. Actually, compared to some of the stuff I've dealt with, HTTP isn't too bad.

It just pisses me off that nonsense like this gets standardized. Several people here have shown other ways to convey the same information in a generic, structured manner, and if they can figure it out, why can't lofty spec writers with PhDs?

I can see crap like this coming out of the closed doors of the ISO, but with the IETF and the public input its process seeks out, it's strange that few people apparently ever challenge the wisdom, ease-of-use, or efficiency of these crazy formats.

The Iron-Fisted Hot Dog and Pretzel Baron of Fulton County
Sunday, April 4, 2004

> "With the registry there is only one way in, the Windows API."

>"good parsers are available ... and they do a lot more than your average ini file parser."

>"Using DTD or XSLT they will validate your data for you.  "

Sounds a lot like an RDBMS, which is better than XML as storage.

MR
Sunday, April 4, 2004

Which protocols? All the ones I've ever seen are spec'd using BNF.


Monday, April 5, 2004

> Which protocols? All the ones I've ever seen are spec'd using BNF.

I'm complaining about the actual protocols themselves, not the syntax used to define them in some RFC. For example, the MIME headers used by HTTP, not the BNF grammar used to define the MIME headers.

The Iron-Fisted Hot Dog and Pretzel Baron of Fulton County
Monday, April 5, 2004

"I can see crap like this coming out of the closed doors of the ISO, but with the IETF and the public input its process seeks out, it's strange that few people apparently ever challenge the wisdom, ease-of-use, or efficiency of these crazy formats."

Pointing out when the emperor's doddering around in the scud [again!] can be a risky game. It might well be the Right Thing To Do, but there'll also be a good few folk wanting to see your head on a spike after costing them so dear.

Do not upset happy fun emperor. Happy fun emperor knows best.

Why, he told me so himself!

has
Monday, April 5, 2004
