Fog Creek Software
g
Discussion Board




The problem with CSV files ?

Hi,

I read *somewhere* about how poorly the CSV (comma seperated values)  format is designed. (The goofiness of how it handles  imbedded "  and ,        etc.

But I CAN'T remember WHERE the heck I read it. (Tried googling and searching jos and can't find it).

Anyone know what I'm talking about?

I think it suggested better alternatives.

(I suppose that one alternative would be to use a less commonly used seperator character and not enclosing everything in quotes. 

e.g., 
some text" or whatever"  ;  next item; and another

Thanks!

Entrepreneur
Sunday, January 4, 2004

"The Art of Unix Programming"

Joel Spolsky
Sunday, January 4, 2004

The trick isn't in the delimiter but in how quotes and instances of the delimiter in the text are handled.  Building a parser for CSV files is relatively complicated and not necessarily a solved problem (the same applies to generating them). 

Lou
Sunday, January 4, 2004

"Building a parser for CSV files is relatively complicated and not necessarily a solved problem"

CSV is complicated? 

Rules for CSV:
* Each record is separated by a LF or CRLF
* Each record consists of a set of comma separated fields
* Leading/Trailing spaces of records are ignored
* You can embed CRLF, commas, or spaces in a field by delimiting the field with double-quotes.
* You can embed a double-quote in a field by using two in a row (e.g. "").

It's not hard to build a parser (and constructor) for CSV files.  I've built both and they handle any valid CSV file thrown at them.

Almost Anonymous
Sunday, January 4, 2004

Have you tried throwing a french locale CSV file at it?

Rhys Keepence
Monday, January 5, 2004

Nope.  But I was going to say that CSV is pretty good for most data unless you need multilanguage support.  Obviously as an ASCII-based format, it's got it's limitations.

I suppose you could encode CSV files w/ UTF-8 but then they wouldn't be CSV files anymore.  Better to go with XML.

Almost Anonymous
Monday, January 5, 2004

I've always prefered tab- to comma-separated files. It is much more rare to find tabs in data than commas.

HeWhoMustBeConfused
Monday, January 5, 2004

In case you didn't understand Mr. Keepence, most European countries, and many others, use the decimal comma instead of the decimal point.

This is one area where XML rocks!

Stephen Jones
Monday, January 5, 2004

Assuming you are using XML for certain purposes (like high-evel protocol packet envelops).. you'll hate how it deals with data containing special characters and sequences that are tainted:

* contains the same tags as the envelop
* contains bad characters
* ...

So you gotta either base64 them and ensure they are elements are CDATA them. Not very pretty either.

Li-fan Chen
Monday, January 5, 2004

The problem that all escape-based formats have is escaping escape characters.

I generally use wxDataXxputStreams to serialise data.

i like i
Monday, January 5, 2004

>Building a parser for CSV files is relatively
>complicated and not necessarily a solved
>problem (the same applies to generating them). 

As long as you know how the commas are escaped (for example, in word, add some commas, save as CSV, figure it out ...) you can transform the escaped characters into something wierd, do the convert, then transform them back.

Not -quite- trivial, but certainly easy.

split() is a friend of mine.

regards,

Matt H.
Monday, January 5, 2004

"Assuming you are using XML for certain purposes (like high-evel protocol packet envelops).. you'll hate how it deals with data containing special characters and sequences that are tainted."

The whole point of XML is to use an already coded parser and generator. It would be insanely stupid to generate XML "by hand" with code.

Brad Wilson (dotnetguy.techieswithcats.com)
Monday, January 5, 2004

Mr. Keepeace should use a properly written CSV parcer. Do not blame for format for it's implementors incompetence.

Mr Jack
Monday, January 5, 2004

The problem with XML is that it is a little too formal.
So binary data is inescapble. Anything else is fairly simple.

e.g. char code 0 or 7 is not possible; neither ^G (BEL) nor  are valid, so you have to base64 encode it.

thankfully many parsers will do this for you.

mb
Monday, January 5, 2004


I've seen people turn binary into HEX to get Binary into XML.

ugh.

I find it entertaining that after all that work to do it the 'right' way, no one has _EVER_ bothered to get the data out of binary.

ha.

Matt H.
Monday, January 5, 2004

To the first poster:
http://www.faqs.org/docs/artu/ch05s02.html#id2901882

Ik
Monday, January 5, 2004

I have always used, maybe naively, a pipe |.  With no problems so far.  Anyone see where I may run into trouble?


Monday, January 5, 2004

Glad to see someone brought this up. I too have suffered from people who believe that CSV is easy or that "hey, let's use a tab or a pipe since it's less likely to occur" ... and they don't even think of newlines in the data (of course all data must be allowed in the data). Someone posted the "robust definition" of a CSV above and that is what EVERYONE should use (if not using XML) ... I've had people just spit things out thinking I could just use a split (I always use industrical parsers as I only use things like split on local data I CREATE) and so they don't quote their data and yet they have newlines ... So please folks, just because you are delimiting data with commas or tabs etc. and rows with newlines, don't say you are producing CSV files or at least state your definition of CSV as there does not appear be to a one true CSV definition.

Me
Monday, January 5, 2004


CSV works well, use it all the time.  One hint is to always quote a field. this simplifies it alot.

i.e.

"big dog", "is a cool, but strange","dude that says "howdy"", "well?"

I've used huge records with random text of 64K in a field, no problem and the parcer is simple.

mike

mike
Monday, January 5, 2004

Time for the oringal poster to chime in.

We have used CSV files for 8 years or so. However, the code was a pain. You have to keep track of whether the next "  is the START or END of data, so you can't just search for the next delimiter. you have to parses EACH and EVERY character and ask:

1. Is is a "
2. Is it a starting " or an ending "
3. Is it part of a PAIR of "'s  (i.e., "" to escape out and allow an embedded "
4.  Oh, and is the character a  ","


Yes, it's a solvable problem. HOWEVER, why create a solution (to storing data) that is soo much more difficult to implement.

Thanks for the suggestion of the  |  PIPE. I was just looking on the keyboard last night to find an unsual character and that fits the bil. However, TAB-delimited is easier to edit with a spreadsheet. (So CREATING the files is easier).

Thanks for all the input.

Entrepreneur
Monday, January 5, 2004

Thanks to the poster above who posted the link the article I read.

Here's an explanation of why CSV sucks.

"In fact, the Microsoft version of CSV is a textbook example of how not to design a textual file format. Its problems begin with the case in which the separator character (in this case, a comma) is found inside a field. The Unix way would be to simply escape the separator with a backslash, and have a double escape represent a literal backslash. This design gives us a single special case (the escape character) to check for when parsing the file, and only a single action when the escape is found (treat the following character as a literal). The latter conveniently not only handles the separator character, but gives us a way to handle the escape character and newlines for free. CSV, on the other hand, encloses the entire field in double quotes if it contains the separator. If the field contains double quotes, it must also be enclosed in double quotes, and the individual double quotes in the field must themselves be repeated twice to indicate that they don't end the field.

The bad results of proliferating special cases are twofold. First, the complexity of the parser (and its vulnerability to bugs) is increased. Second, because the format rules are complex and underspecified, different implementations diverge in their handling of edge cases. Sometimes continuation lines are supported, by starting the last field of the line with an unterminated double quote — but only in some products! Microsoft has incompatible versions of CSV files between its own applications, and in some cases between different versions of the same application (Excel being the obvious example here).
"

from:
http://www.faqs.org/docs/artu/ch05s02.html

Entrepreneur
Monday, January 5, 2004

CSV parsing really isn't that difficult.  Inelegant -- sure, difficult -- no. 

I'm looking at some CSV parsing code that I wrote.  It's about 20 lines and took maybe a few minutes to implement plus fully test to catch the funky cases (such as , at the end of a line).  It isn't the most beautiful code I've ever seen but it has worked fine with everything thrown at it thus far.  I also generate some CSV files and have never had a complaint from those receiving them or a problem with the systems they're imported into.  These files are also frequently imported into Excel without problems. 

Using something like the Unix passwd format or XML would gain me nothing other than confusing the heck out of users who occassionally want to view or edit the data directly or who want to do imports or exports with Excel. 

The key is to use the right tool for the job.  You wouldn't say XML sucks because it's a bad format for image data.  In the same manner, CSV is great for some things and not so good for some others. 

SomeBody
Monday, January 5, 2004

The only problem with using XML to import into MS Office is that the importer needs to have Office XP minimum. The obvious thing to do is to send both a CSV file for older versions of Office, and an XML file for those with the latest.

Stephen Jones
Tuesday, January 6, 2004

Microsoft supports XML as a format to avoid implementing
self describing data formats, data formats with method sets.. I think XML is the closest they will get to data formats that don't need library based interpreters.. Its the equivalent of pl/sql in databases, but instead of PL/SQL there is a general file parser language and the the XML contains the semantics and actions to do what is needed to turn the XML into data structures.. As long as the parser/producer code is in the libraries, you will need more complex libraries to parse all the manipulations of the XML structures that develop.. Making the code more complex, and allowing Microsoft to control the market through the data formats.. If the data formats have their own methods, its harder to control them, because the code exists to read and make use of them as long as the data is around..

If you want to know, its the same as an object with all methods and data made public.. All that is required for the library is a way to decode the methods and execute them..

I keep wondering why people think XML is great and why they don't complete the solution by adding methods to the objects to make them self maintaining.. Its the way objects work.. XML just produces data structures with no brains..

Bill Gates
Wednesday, January 14, 2004

Oh and BTW, XML is a text format, and all text formats
are bad for image data.. That's why image data is stored in
binary format.. Its not a problem with XML, just a problem with who is using it..  But again, the only reason we are dependent on Microsoft is we continue to use data formats, why aren't we using objects (data formats with builtin methods).. All you need to do is limit the scope of the methods to the dataset, ala java style, eliminate
access to memory pointers and special stuff, make the language specific to handling data formats, binary or text based..  Also eliminate from the language the capability of the object knowing about its environment (so that the object can't tell what date it is, what version of software you are using, etc..) so the data can only do one thing, offer access to information in a meaningful format, forever, always.. And it has to have a understandable interface.. You don't have to understand the idosyncratic nature of every version of object because in order for them to retain usability they must implement a basic interface..

To control the market Microsoft would have to change the methods in every object.. It wouldn't be as simple as changing the data format. And if Microsoft chose not to use the method interface, people could play tricks on them, change the internal data formats at random so that their parsers fail them..

If in linux object/xml(text/binary) files were used for storing images, movies, etc.. Programs could be written to convert quicktime and avi files to this new format that is
self describing and self maintaining, self parsing.. Eventually the whole world would convert and no vendor could ever again leverage software usage based on a file format..

Bill Gates
Wednesday, January 14, 2004

How would you like to fund the creation of self maintaining XML based objects with methods.. The idea is basic, in the XML files, store methods in a language that is limited by maintaining data structures, it would not have direct access to memory or to execute code, but it could only manipulate data within the XML object.. This would
make the XML objects self maintaining, they would have to implement interfaces, the internal data structure could change independent of the interface.. For instance a movie object might look like this:

<movie>
      <framedata>
      // binary data here
      </framedata>
<movie interface>
<function frame(x)>
      // frame access function implemented here
      <expand_frames()>
      <return thisframe(x)>
</function>
<function expand_frames()>
      // special code to parse binary data and expand it
      // into a internal data structure..
      // accessible to the object's other functions..
</function>
<function thisframe(x)>
      // references the particular frame in file
      return this->frame[x]
</function>
<funtion storeframe(x,type standard_picturedata data)>
      <this->frame[x] = data>
</function>
<function compress()>
      // special code to compress all frames
</function>
<function selfstore()>
      // instructions to store one's self as an object.
</function>
</movie>

Subsequent calls to the object after it has been expanded,
would allow faster access to the frames.. Imagine that the language set for decompression could be expanded to handle decompression of data that makes use of wavelets, discrete cosine transforms, and the core technologies that apply to every kind fo audio and video format.. As well as text formats..

Some objects could even have functions to generate streaming content.. Like instead of having a particular server, you could ask the data to stream itself (example in a perl like syntax):

open (handler, xml_obj_sandbox(movieobj, "stream()"));
while (<handler>) {
      print $_;
}

Richard Stallman
Wednesday, January 14, 2004

*  Recent Topics

*  Fog Creek Home