Fog Creek Software
Discussion Board




Back to Basics

Joel Spolsky
Wednesday, December 12, 2001


Joel, this piece doesn't meet the high standard set by your previous stuff.  Entertaining, but did I just read that entire thing only to find out that it's an anti-XML essay?

You seem to be saying that people like XML because they are unaware of the performance impact of doing so.  Perhaps you're right.  Nonetheless,

1.  I'm aware.  I once wrote an entire C compiler, from scratch, by myself.  I quit when it could self-compile to the third generation flawlessly, and the project never went anywhere after that.  It was fun, and I learned a lot.  I'll never be capable of completely ignoring the realm of The Bytes.

2.  I like XML.  It's not the panacea that many people think it to be, but it's darn handy to have a common syntax framework that can be used for most situations wherein we formerly reinvented a new custom syntax every time.  I *know* that I sometimes use XML when a custom approach would be faster.  But I don't make this choice unless the impact is lost in some other inefficiency like network latency or disk seek time.  I'm making a conscious tradeoff of CPU time for programmer flexibility.  Moore's Law makes this kind of stuff possible.

If you ever find a need to understand the XML fan in your life, here's a tip or two on what makes us tick:

1.  We are in vehement agreement with you:  I would *never* use XML to store "lots" of data.  :-)  I suppose we might disagree on the scale of the word "lots".  Nonetheless, few people are seriously advocating the use of XML in place of relational databases.  I'm not one of them.

2.  There are a number of specifications built on top of XML which actually contribute to the perceived coolness of XML itself.  For example, the XML Schema work provides a standard way of describing metadata and data types.  SOAP is built on this kind of functionality.

3.  It seems to be an acquired taste.  Most people don't like XML until they've been immersed in it for a while.  I realize that this suggests either addiction or brainwashing, and I have no evidence to disclaim either.

Still, the end reaction to this piece is some combination of the following thoughts:

1.  Joel is older than we thought he was.

2.  Joel knows an awful lot about the low-level implications of string handling.

3.  Joel doesn't know a heck of a lot about XML, but he dislikes it anyway.

Cheers,

--
Eric W. Sink
Software Craftsman
http://www.ericsink.com/

Eric W. Sink
Wednesday, December 12, 2001

Good summary, Eric.

I'll also add that it gets a bit grating seeing inconsistencies in articles like this. Joel goes on at great length about byte-by-byte copying and appending of strings in C, and then says that moving one record in a relational database is a single assembly instruction.

Sorry. Where was all that indexed searching? Row locking? Variable-length rows with varchars, etc.? Suddenly it all seems a bit oversimplified, but it justifies his 'XML is strings is slow' argument.

I think a better piece would have talked about how choosing the right data structures, even at the lowest level, is an important problem that can affect the Big-O performance of your higher level coding in many ways.

Rob

Rob Mueller
Wednesday, December 12, 2001

I like the little gcc picture at the bottom.

Michael Pryor
Wednesday, December 12, 2001

By the way, Joel's imaginings of the inner workings of malloc and free are too simplified. It isn't the case that malloc sometimes has to go through and coalesce elements of the free list; you use boundary tags to amortize this coalescing over every free operation.

Joel also implies that there is no good reason to use C strings. Here's a good reason: operating on substrings. In C, you just pass a pointer into the parent string, and you're done. In Pascal, you get to blow O(n) cycles making a partial copy of the parent string. Both C and Pascal representations trade some O(n) operations for some constant-time ones; it is the mix of these operations that determines which representation should be used. Just because Pascal strings make excel scream doesn't mean they'll help your application any.
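
To make the substring point concrete, here is a minimal sketch (the path example is invented). Note that it works because the substring runs to the end of the parent string; a substring in the middle would need a pointer/length pair or a copy:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *path = "/usr/local/bin/cc";

    /* A "substring" that is just a pointer into the parent string:
       no allocation, no copying, O(1). */
    const char *name = strrchr(path, '/') + 1;

    printf("%s\n", name);   /* prints "cc" */
    return 0;
}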

Cheers,
Keith Adams
MTS VMware, Inc.

Keith Adams
Wednesday, December 12, 2001

I can appreciate that for you great programmers Joel's piece was rather lacking.

However, for those of us who really WANT to be great programmers it is great.

It partly explains why it is becoming so difficult to become a great programmer.

We stand here, diligently learning our craft, and ahead of us we see the people we want to follow.  People like yourselves.

The problem is that there's no path to follow.  We don't know how to get there.

Joel's article helps me to understand why.  We are missing a vital element in our education.  We've come this far with Java or Visual Basic and never had to touch a byte.

Ged Byrne
Wednesday, December 12, 2001

For the very same set of reasons that Joel might choose VB over, say, C++ for portions of CityDesk, I consciously choose XML over fixed-format intermediate data transport structures on a routine basis. But not without careful consideration.

That is to say, I am perfectly willing to incur the overhead of string intensive, memory wasting, processor clogging intermediate formats in order to gain benefits which simply can't be gotten any other way. Namely flexibility and getting it done faster (so long as performance is _acceptable_).

I think that Joel might be scolding those who need scolding, but perhaps the swath he carves is a bit too wide? It's been my experience that those who suggest (or use) XML do so for well-considered reasons, and not without an eye towards its drawbacks. Perhaps I've just been gifted with great co-workers. But I'd like to hope not.

--
Alex Russell
http://netWindows.org

Alex Russell
Wednesday, December 12, 2001

Hi,

I found your article very interesting and fun to read as usual, but I still don't understand your point against XML data storage.

Have a look at XML databases such as Software AG's Tamino, Excelon CIS, DBXML and so on. Those products or projects are using well known or brand new database structures and algorithms dedicated to XML storage and manipulation.

No one is pretending to solve the problem you give as an example by storing the XML string representation of the data in memory, and iterating over the bytes or parsing the string.

XML databases are built by combining a serialized XML interface/view over an optimized data store which does NOT store data as a serialized XML string. Iterating over elements is fast. You can build indexes for specific elements or data paths. You can extract only the relevant parts of big XML documents in no time. Etc. etc. I have seen big real-world applications built on top of XML databases that were actually appreciably faster than their equivalents on relational DBs.

Granted, raw XML data in string form has to be parsed to be inserted into the data store. But SQL INSERT requests also need to be parsed before data is inserted into relational databases.

To conclude, I agree with you that naive implementors could have the idea of storing XML data in string form, and that the result would be a total mess. But nobody said it has to be done that way. Trust people that work on the problem all day long, they've found satisfying solutions and will find more in the future.

Apart from that, I loved the C string discussion. We all too often forget these kinds of details; thanks for reminding us.

Best regards,

Nicolas Lehuen
Wednesday, December 12, 2001

Some points about XML vs. relational databases:
1. You generally don't re-read and re-parse the XML text for each operation. Instead the DOM is built once and stored in memory.
2. Which brings me to the second point: don't use XML if you need to save every change back to permanent storage, i.e. disk. Aside from all the operations required to generate the serialized representation, the disk operations will kill performance.
3. Indeed for data manipulation, XML is good for scenarios where all the data can be kept in memory. Relational databases (and other types as well) generally work hard at optimizing disk access (a lot more than memory access anyway).
4. You *can* do XML processing where not everything can (or should) be kept in memory, using a protocol like SAX. But that's intended for serial rather than random access to the data.
5. Databases are optimized for handling large amounts of data. For small amounts, they are far from optimal (which is why protocols like ADO introduce the concept of in-memory databases that can be synchronized with the backend.)
6. Relational databases (as Joel sort of mentions in his article) are optimized for data that has a fixed format (size). If your data isn't like that (that is, it's free-form) then you should look for some other solution (object database, XML?)
7. XML is really good when your data is hierarchical and is used in clusters.
8. Databases offer lots of services that have nothing to do with performance: managing concurrent access, transactions, access rights, etc.
In a previous project, I stored all the application configuration information in XML files. That made lots of sense given all the points I mentioned above. It also made this information readable and editable, which helped development.

Some points about strings and memory usage:
1. Given Joel's background as the "VB guy", it's funny he talked about Pascal strings rather than BSTRs (the BASIC strings used by VB and COM automation). Like Pascal strings, but with a 32-bit length instead of a single byte. Another optimization: the length value is placed *before* the location the pointer refers to, so BSTRs can be used like regular UNICODE strings by the C library functions (a sketch follows this list).
2. ASCIIZ strings have this cool attribute where their length isn't limited (OK, on older x86 machines you had problems when you tried to cross the 64K boundary). A 255-character limitation is really harsh.
3. Unlike a GC, malloc only coalesces unused memory, while a GC also moves allocated memory blocks around. And even that is a rare event given the amount of memory on current machines (unless you are writing servers)
4. Many memory allocation libraries, e.g. VC++, don't always align memory properly. This can result in a big performance hit when memory is used for stuff like floating point numbers.
5. You want really fast dynamic allocation? Use _alloca. The price: allocated stack memory is never reclaimed (well, it is when the thread dies) so your application's virtual memory footprint is apt to grow as a result. And the stack on win32 is limited to 2MB (per thread)
6. Most string libraries use reference counting and lazy allocation to optimize string manipulations (instead of copying a string, both variables point to the same memory and the count is incremented.) Obviously this technique also has performance penalties of its own.
7. Some really cool optimizations can be achieved using recursive template definitions (see the Blitz++ library at http://oonumerics.org/blitz/ and my own article (if you can find it) http://www.wd-mag.com/articles/1997/9706/9706toc.htm?topic=articles).
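
(Regarding point 1 above, here is a rough sketch of the BSTR-style layout, using plain char and invented names for brevity; real BSTRs hold UNICODE characters and store a 32-bit byte count. The idea is that the length sits just before the character data, while the pointer points at the characters themselves, so the string still works with ordinary C library calls:)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef char *bstr;

bstr bstr_alloc(const char *s)
{
    size_t len = strlen(s);
    /* layout: [32-bit length][characters...][NUL] */
    unsigned int *block = malloc(sizeof(unsigned int) + len + 1);
    if (block == NULL) return NULL;
    *block = (unsigned int)len;
    memcpy(block + 1, s, len + 1);
    return (bstr)(block + 1);   /* points at the characters */
}

unsigned int bstr_len(bstr p)
{
    return *((unsigned int *)p - 1);   /* O(1): read the prefix */
}

void bstr_free(bstr p)
{
    if (p) free((unsigned int *)p - 1);   /* free the real block */
}

int main(void)
{
    bstr s = bstr_alloc("Hello, BSTR");
    printf("%s has length %u\n", s, bstr_len(s));   /* works as a C string */
    bstr_free(s);
    return 0;
}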
    

Dan Shappir
Wednesday, December 12, 2001

In our team I insist on the rule: Optimize Last.
Doing optimization during feature implementation is a sure way to fail to get it done.
I can say that in our experience, in 99% of cases the performance bottlenecks were not in the places where we expected them to be.
XML or SQL - there is no difference. I remember developers who spent a lot of time doing various bit-level optimizations, and as a result their system was super-slow because of bad SQL queries.
So our team's solution to this problem is: don't worry about optimization until you get everything done. After that, don't guess - measure. There are a lot of good profilers available. I am always surprised by what profilers show me. And it usually takes a 1-2 line code change to make the system 10 times faster.
If our design is bad, we can always refactor it into something faster. Thank God we have those UnitTests.

Roman Eremin
Wednesday, December 12, 2001

Whatever the ins and outs of the XML vs. database stuff I think that Joel's deeper point is the one that really rings true for me.

Fundamentally, it is that programming as a discipline is all about levels. Primarily, levels of abstraction. The catch is that the abstraction is not always perfect. There are too many other factors that come into play: space, performance, etc. When those other factors start to crowd into the picture then it helps to be able to understand and dig into the stack to see what, at a deeper level, is actually going on.

Whatever the merits of the XML vs database example, I think that this is the deeper message, and I subscribe to it. It has been observed before (not by me) that a characteristic of good programmers is their ability to easily traverse many levels in their understanding and analysis of a problem. Without an understanding of the deeper levels one doesn't have the wherewithal to make the right judgement calls.

Raoul du Plessis
Wednesday, December 12, 2001

Joel,
nice article.  I've spent most of my recent years in high-level languages like Java, but recently I've had to resort to C/C++ for some aspects of my work.  I have some thoughts on what languages budding programmers should be taught (whilst trying to avoid a flame war).

I learnt the programming basics of memory allocation, ifs, and loops (and so on) in C as a teenager.  I actually understood these things only after going back through them at college.  I truly understood pointers about a year later.  I was taught OO programming in C++ but I only really "got it" when I taught myself Java.  And so on...

My point is that no particular language provides the right environment to learn every aspect of programming (and I think most people agree with this). 

BUT, we are moving into an age when, despite the current climate, there is an ever-increasing demand for computer programmers without an increasing supply of interested, capable and adept candidates.  Many of the people now programming are not programmers by nature.  They have little desire to learn to program beyond what they need to know to complete their current tasks.  Hence the proliferation of languages like VB, Java, PHP and Lingo.  I think it is worth remembering that whilst these languages hide many details which real programmers should, indeed need to, know, they enable more people to enter the field of programming at some level.

I've probably made my point badly so I'll summarise:  Not everyone who programs is a true programmer, hence not everyone who programs needs to understand pointers, memory allocation and strcat!

Jamie Lawrence
Wednesday, December 12, 2001

I went to a lot of trouble to forget all that stuff.

Cut it out. Please.

Tony McConnell
Wednesday, December 12, 2001

I've noticed two types of programmers.  One type prefers going to the metal, thinking in terms of bytes.  The second type prefers data structures, steering their interfaces towards mathematical ideas such as 'closure' and adaptability.

The first type of programmer always criticizes the second type for not knowing the basics.  "Your Lisp implementations hide so much complexity," they charge.  "You almost can't fail to write a fast program in C, but you have to do nonobvious things to make them fast in Lisp."

(I'm too young to know what people actually said, but that's what I gather from random flame wars.)

How does one counter that?  Well, you can't.  A good programmer has to be a combination of the two types, or they'll always be bitten by the particular disadvantages of one type.

That's why I don't agree with Joel, when he says that programmers shouldn't learn Java as their first language.  Programming isn't a hierarchy; it's a graph.  History is of major importance, but it does not require one to learn anything in any particular order.  A programmer's skill is measured by how she goes from node to node on that graph, treating each node as if it were where she began.

forgotten gentleman
Wednesday, December 12, 2001

Joel, thank you for writing this article.  I work for a consulting shop in the development department and most of what we do is build web-based apps using ASP/VB/SQL and most of the people I deal with on a day-to-day basis don't understand the first thing about architecture at this level.

Because of this, they don't understand its importance, and therefore can't justify the increased time it takes us to build applications that are "tuned".

I 100% agree that any serious programmer needs to understand the fundamentals just as any high-performance mechanic needs to understand the basic physics of the internal combustion engine.

Thank you!

Jason J. Gullickson
Wednesday, December 12, 2001

No decent malloc implementation works like Joel describes.  But when I looked at the C runtime source that was included with Visual C++ in about 1996, it did have a malloc that was almost that feeble.

ObPedantry:

Unix originated on a PDP-7.  It was later ported to a PDP-11, where C originated.

http://cm.bell-labs.com/cm/cs/who/dmr/hist.html

And there wasn't a "PDP-8 microprocessor" until years later.

http://www.faqs.org/faqs/dec-faq/pdp8/

David Wragg
Wednesday, December 12, 2001

OK, with respect to the opinions I posted above, I repent a little bit.  I suspect that in my earlier days as a programmer, I would have found this article to be excellent.

Joel's right about the value of understanding what's going on under the hood.  I remember the first time I realized that malloc is slow.  It's like the first time you read a newspaper article and disbelieve it.

Similarly, have you ever considered just how much work printf() does to accomplish its mission?

My favorite example:  Back when Win32 was new, I accidentally discovered that the thread-safe version of atof() was *hundreds* of times slower than a replacement function I wrote from scratch.  I didn't need all that mutex stuff anyway.

When you call a library function, you're placing your app in the hands of someone else.  Don't trust that person just because his API spec came from an ANSI or POSIX standard.  :-)

--
Eric W. Sink
Software Craftsman
http://www.ericsink.com/

Eric W. Sink
Wednesday, December 12, 2001


Excellent article. One remark about Pascal strings: the reader might get the impression that these are currently used in Delphi. They only exist for backward compatibility. Everyone uses AnsiStrings. Here are some features:

- Up to 2 GB in length
- Length stored in header
- Allows any characters incl binary data
- Compatible with null terminated strings
- Reference counted and dynamic allocation
- No buffer overruns

Here's a primer:
http://efd.home.mindspring.com/primer.htm

The result is that AnsiStrings lead to quite efficient, very readable code:

<Code Example>
// Declaration
  FirstName, LastName, FullName: string;

// Code
  FirstName := 'John';
  LastName  := 'Smith'; 
  FullName  := FirstName + ' ' + LastName;
</Code Example>

Yes there might be other solutions which might be slightly faster, but none comes close to the perfect code readability of AnsiStrings. And readable code is, arguably, the most important condition for efficient programming/debugging.

Jan Derk
Wednesday, December 12, 2001

I'm always intrigued by the articles you write on your page, though my software experience is markedly different from what I read about on your site.  I write real-time embedded applications, but like to peer into the desktop world on occasion to 'see how the other half lives', if you will.  It's certainly interesting to read that potentially wasting 50% of your allocated memory is acceptable.  I would be chewed to pieces in a design review if I tried to slide that one through, but I certainly understand we're working in different paradigms.

I found this article to be interesting, in particular, because of the apparent contradictions it presents.  Working at the byte and bit level every day gives me an appreciation for your topic, though I do get confused as to how it is overly concerning to you, at least as more than an intellectual exercise.  I frequently read articles on your site that lead me to believe that performance concerns, be they memory or execution speed, are important, but certainly not something to be dwelled upon.  Clean design implemented quickly seems to be the foundation of your software philosophy, which, while unarguably necessary in today's software market, tends not to favor optimization.  How malloc() does what it does seems to be academic, simply because, if you're using dynamic memory allocation, you have to use the routines the OS provides for allocating new memory blocks.  You can certainly attempt to optimize by allocating 2^n bytes per malloc() call, but if you're implementing such optimizations on the basis that when memory gets tight malloc re-organizes, it seems that memory usage is the over-riding concern.  Either that or you're allocating/deallocating too much. 

I realize it was just an example, but it seems from your previous articles that optimization, at least in the desktop world, should only be performed at the biggest bottlenecks in performance, and then only if they cause significant penalties. 

At any rate, your point was well taken; it just kind of tickled me that on a website that preaches delivery time over code bloat and performance, I would find an article that even mentioned assembly code.

Carl Sanders
Wednesday, December 12, 2001

Interesting article. The replies are interesting as well. As I think about it, though, I can't help but think that any concerns Joel has are offset by Moore's Law. In other words, good chip designers increase CPU performance faster than bad software designers can waste it.

Rick Schwartz
Wednesday, December 12, 2001

"Excel uses Pascal strings internally which is why strings in many places in Excel are limited to 255 bytes, and it's also one reason Excel is blazingly fast."

(1) It's not.  (2) Pascal vs. C strings probably has very little effect on its overall performance one way or the other.  The 255-character limit is much more significant.

"all those programmers are just lame-asses. They should have figured out how much memory to allocate. But really, C does not make this easy on you"

And what other language does?  If you're doing dynamic memory allocation you're doing dynamic memory allocation, regardless of language.

"Smart programmers minimize the potential distruption of malloc by always allocating blocks of memory that are powers of 2 in size. "

A lot of those "smart" programmers will then be surprised that their "optimization" has little or no effect.  The really smart programmers use lookaside lists or fixed-size allocators that carve large chunks of space out of the pool managed by the general-purpose allocator and then handle allocation/deallocation very efficiently.
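
For what it's worth, the fixed-size allocator idea looks roughly like this (a minimal sketch with invented names and sizes; a real one would grow the pool on demand, guard against double frees, and so on). The point is that both operations are O(1) list manipulations, with no searching and no coalescing:

#include <stdlib.h>

#define SLOT_SIZE  64     /* must be >= sizeof(void *) */
#define SLOT_COUNT 1024

typedef union slot {
    union slot *next;             /* used while the slot is free */
    char payload[SLOT_SIZE];      /* used while the slot is allocated */
} slot;

static slot pool[SLOT_COUNT];     /* the "large chunk" carved up front */
static slot *free_list;

void pool_init(void)              /* call once before the first pool_alloc */
{
    int i;
    for (i = 0; i < SLOT_COUNT - 1; i++)
        pool[i].next = &pool[i + 1];
    pool[SLOT_COUNT - 1].next = NULL;
    free_list = &pool[0];
}

void *pool_alloc(void)            /* O(1): pop the free list */
{
    slot *s = free_list;
    if (s != NULL)
        free_list = s->next;
    return s;
}

void pool_free(void *p)           /* O(1): push back onto the list */
{
    slot *s = p;
    s->next = free_list;
    free_list = s;
}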

"What this means to me is that you can't use XML if you need performance and have lots of data."

What this means to me is that you're overgeneralizing from a quite specific case.  Worse, you've concocted an XML case that's very unrealistic to make your point.


Your point about needing to know the basics is a good one.  I wish more people would think about the "big O" stuff when they're designing and choosing algorithms.  I particularly wish a lot of people knew more about issues like message or context-switch latency, so they wouldn't write applications that require a hundred round trips for even the simplest of operations.  Perhaps most importantly, I wish people would think about where/when performance matters and where/when it doesn't, so they wouldn't do stupid stuff.  An example of such stupidity might be someone who optimizes their string handling in a program that doesn't really do very much of it, and at the same time uses an inefficient algorithm for checking which cells need to be updated.

It's unfortunate that, in trying to make an important point, you do so with such flawed and/or cooked examples.  That really doesn't help.

Jeff Darcy
Wednesday, December 12, 2001

> In other words, good chip designers increase the rate of CPU performance faster than bad software designers can waste it.

Optimist ;-)

Fredrik Lundh
Wednesday, December 12, 2001

"As I think about it though, I can't help but think that any concerns Joel has are offset by Moores Law. In other words, good chip designers increase the rate of CPU performance faster than bad software desigers can waist it."

I strenuously disagree, based on my experience.  First of all, remember that Moore's Law isn't a law; it's an observation which just happens to hold true for now, and stating it's a law would be tantamount to stating that one can plan technological breakthroughs.  It appears to have enough steam to go on for, oh, say, fifteen more years, based on what we currently know.  Secondly, Moore's Law describes a geometric progression in performance.  Meanwhile, I can design polynomial-time algorithms all day to do useful things that will outstrip this, let alone exponential- and factorial-time algorithms.

Paul Brinkley
Wednesday, December 12, 2001

(Yeargh.  I erred; a polynomial-time algorithm would always eventually be dominated by Moore's Law, assuming Moore's Law were to hold up indefinitely.  The exp-time and fact-time algorithms would still dominate this in turn, however.  We now return you to your previously scheduled misinformation...)

Paul Brinkley
Wednesday, December 12, 2001

Wow, that was a lot of fun to read!

But was it really necessary to go through all that just to prove that SQL works faster than parsing/scanning hierarchical XML files? Everyone who has ever played around with this stuff knows that!

One point you didn't make is that the whole point of using XML is interoperability (I think): well, most RDBMSes can share data too, using some interchange format (XML, anyone?).

Michel Benevento
Wednesday, December 12, 2001

Thanks for this article.  I hadn't used XML directly until just a little bit ago, and I wasn't sure what the huge deal was (except for having a "standard" way to exchange data).  So many people seem to think that XML solves all their problems somehow.

I just went to a MSFT DevDays, and they had a partner speaker from Costa Rica there.  He made references to using XML as the method of storage in RAM for common programs, hardware acceleration for the XML DOM, and that perhaps DBs of the future would store everything as XML.

I thought that was quite silly-sounding, and I'm glad someone is showing the truth of this.  Oh, and I'm sure it will only get worse as people who don't know what they are doing start using Microsoft's (.NET) wonderful System.Data.Dataset, and start saving to XML all over the place...

Michael Giagnocavo
Wednesday, December 12, 2001

One thing you might have missed about XML:  The availability of parsers.  This afternoon I'm writing a program to take a part of a DB (staging) and be able to stick the data into another DB (production).  What'll I use to save the data?  XML.

Why?  Not for interop, since it is just between my two programs, but because .NET allows me to export and import to this format in the easiest way.  I don't really care what the actual format is, it could be pound-sign delimited for all I care.

Michael Giagnocavo
Wednesday, December 12, 2001

After reading all the messages on this topic, I am absolutely amazed at how many people -- most of whom are clearly very intelligent -- missed the point of Joel's article in a colossal fashion.  And if you don't believe me, then scroll up a bit and read all of the messages that attempt to contradict some technical aspect or another of the article.

Okay, so returning a record from a relational database isn't exactly one CPU instruction.  And Excel doesn't use Pascal strings precisely in the way that Joel described.  Fine.  But WHO CARES?  Not me.  Anyone who thinks that Joel's article was about the specific drawbacks of XML, for instance, is missing the point.  The point is that very low-level details can and do affect very high-level strategic software decisions, and in order to build quality software we must have a solid understanding of the basics.

Excellent article Joel, as usual.

Andy Shyne
Wednesday, December 12, 2001

Just in case you mislead some poor programmer into actually calling malloc() with powers of two, it should at least be noted that every malloced block has a bookkeeping overhead that would have to be taken into account, or you could actually end up making things much worse!
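
For illustration, suppose the allocator prepends an 8-byte header and rounds the total up to a power of two (an assumption for the sake of the example; real allocators differ):

#include <stdlib.h>

enum { ASSUMED_HEADER = 8 };   /* hypothetical bookkeeping overhead */

int main(void)
{
    void *a = malloc(4096);                   /* 4096 + 8 rounds up to 8192 */
    void *b = malloc(4096 - ASSUMED_HEADER);  /* 4088 + 8 fits in 4096 */

    free(a);
    free(b);
    return 0;
}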

Graham Wheeler
Wednesday, December 12, 2001

for (;;) free(malloc(0));

Save the whales, Free the mallocs
Wednesday, December 12, 2001

I don't think that learning C is a must for a programmer. In fact, in your article you are speaking about a math background, not software development. The study of string operations can be done without a computer at all. If you are talking about learning the basics, why not study Post and Turing machines, or write curious algorithms on paper? From my own experience, a man with the part of the brain that understands pointers can become a programmer without a computer.

Yakov Sirotkin
Wednesday, December 12, 2001

I don't think you went far enough down -- IMHO, programmers should learn a level of abstraction below what they are going to use.

I'm a C programmer by trade and find myself constantly thinking like my friendly local compiler. For me, it started when I cut my teeth on 68K assembler all those years ago at the tender age of 12.
When I hit University, and my first formal training, I grasped C easily. Pointers? - no problem. Pointers to pointers? - a bit of a mind bender, but nice enough.
I expect my lecturers will curse me for thinking of C++ objects as just being C structs with functions in them.

So now, when I'm cursing Perl for not having a switch() and having to write an if...elsif...else block instead, I always put the most likely option first -- why? Simple: fewer conditionals to be evaluated. And trust me, if you ever have to write ASM, you'll want to cut out every superfluous instruction...

Oh, just for the record scripting in XSLT is evil. Very evil.

Maybe I'm a mini-Joel, maybe I just have aspirations for doing things right first time round -- why make good software, when you can make great software, as my mangler says...

Rowland Shaw
Wednesday, December 12, 2001

ZzZz 

*snort*

oh! the article is over?

Tony
Wednesday, December 12, 2001

I'm impressed!

I was beginning to think that the lower (implementation) levels were beneath Joel, but clearly I was wrong.

I come from an embedded development environment and found the discussion of the low-level details of C string handling a pleasant trip down memory lane. I often feel like a dinosaur in today's world of C++, STL, Java, CORBA, XML, ... but your article reminds me, in much the same way that the Pragmatic Programmers' book (http://www.pragmaticprogrammer.com/ppbook/index.shtml) did, that the fundamentals cannot be ignored even though we often wish they could.

In contrast to embedded software, desktop software seems happy to throw CPU cycles at problems. We all know that this cannot continue. Furthermore, we all know that algorithm design still has a place, even in the world of the 2GHz CPU.

Now, several people have commented on Joel's side swipes at XML. I do not use XML in my day to day work so my comments may be less than informed on this topic but here goes anyway.

HTML is good at describing the logical structure of a document. Many people misuse HTML in an attempt to "force" the look of a document and, indeed, the inclusion of tags such as <b></b> and <i></i> confused this issue somewhat; but HTML was originally designed as a mechanism for describing the logical structure of a document, not its appearance. These days, we have Cascading Style Sheets to take HTML and apply the desired look-and-feel.

I think that XML is in a similar position to HTML. Whilst HTML is intended to describe the logical structure of a document, XML is designed to describe the logical structure of data. It is very good at this. This does not mean it is very efficient.

I think that XML is best suited to the import and export of data between programs, but that it is poorly suited to the internal save/retrieve needs of an application. Anyone building a database around XML as the internal format should think again.

Unfortunately, XML has been hugely over-hyped in the marketplace. This has resulted in some companies insisting on XML as a must-have for their next product when its presence within the product will add little to the capabilities of the product. However, it does allow the marketeers to tick another tick box on a "product feature chart". And, in some markets, that is an important differentiator.

As someone pointed out; as long as the amount of data being managed is small it doesn't make all that much difference how you store the data.

In summary, my thoughts on XML can be summarised as follows:
+ XML as an interchange format is important.
+ XML as an internal format is significantly less compelling.

Well, that's my 2 cents worth.

Gordon J Milne
http://www.geocities.com/gjmilne64

Gordon J Milne
Wednesday, December 12, 2001

Another useful way of dealing with potential buffer overflows is using functions like strncat that allow you to specify a maximum length and don't copy more than that number of characters into the destination string.
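
One caveat worth a small example (buffer sizes invented): strncat's count is the maximum number of characters to *append*, not the total size of the destination, and it always writes a terminating NUL. Hence the usual pattern:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char dest[16] = "Hello, ";
    const char *src = "world, how are you?";

    /* append at most (space remaining - 1 for the NUL) characters */
    strncat(dest, src, sizeof(dest) - strlen(dest) - 1);

    printf("%s\n", dest);   /* "Hello, world, h" -- truncated, but safe */
    return 0;
}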

Gareth Marshall
Wednesday, December 12, 2001

Andy Shyne sez:
"WHO CARES? Not me...The point is that very low-level details can and do affect very high-level strategic software decisions"

Yes, but it's *really* hard to make a point about how important it is to know how something works while simultaneously demonstrating that you don't.  It's a great point, an important point, but IMO a less-than-stellar article making that point.  Thirty-year-old sorting/searching examples make the point just as well, and the evolution of copy-avoidance techniques in TCP/IP stacks might make it better.  Jumping from string formats to XML vs. RDBMS - particularly since Joel clearly has an ax to grind regarding the latter - is not nearly as effective as a single "detective story" about identifying such low-level stupidity as the source of a real performance problem in a real product.  I know Joel has lots of such stories and can tell them well, and I love reading them.

Jeff Darcy
Wednesday, December 12, 2001

Whoa, everybody, including Joel. Complaining about XML being slow as a data storage medium is like saying a hammer is a poor tool for screwing in screws. Yes, it is. Where oh where did anyone say it was a good idea to screw in a screw with a hammer?

XML shines as a data _exchange_ format. It's that simple.

Seeing that Fog Creek's culture is pretty Microsoftish, I'd point to SOAP as an example of where data exchange using XML shines. In fact SOAP borders on being a kind of simple-minded ORB, but that's a whole other argument.

Nick Bauman
Wednesday, December 12, 2001

On XML:

XML is a markup language.  This should be obvious, since that's what the "M" and "L" stand for.  A markup language is intended to make it easy to move the content from place to place without losing any of the content or the markup.  In short, it's an interchange format.

So at what point did the concepts of markup and storage for fast retrieval start to overlap?  Modest suggestion: Never.

The whole concept behind "XML everywhere" is simply in the nature of humans, and not confined to programmers and other bitheaded geeks.  That concept is quite simple, but no less futile for its simplicity: a universal language.

This is not unique to XML, Java being a previous casualty.  It isn't even unique to computer languages, Latin and French being two previous casualties (where do you think the term "lingua franca" came from, and why is it in Latin?)  The fairly predictable result is that XML is being forced into areas it is shockingly ill-suited for.  When Big Thinkers get to too high a level of abstraction, the altitude starves their brains for oxygen.

Let XML be XML.


On malloc():

Trivial implementations of malloc always start at the beginning of the free-list.  Slightly smarter implementations start at the last free-block examined.  This makes a huge difference in speed and requires adding exactly one static variable and a tiny amount of prolog/epilog code to the list-walker.  Trust me, I've written both kinds.
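
The rover trick, sketched (structures simplified; a real allocator would also split blocks and maintain accurate size fields):

#include <stddef.h>

typedef struct block {
    size_t size;
    int    is_free;
    struct block *next;    /* assume a circular list, already initialized */
} block;

static block *rover;       /* the one static variable: where the last search ended */

block *find_block(size_t size)     /* next-fit instead of first-fit */
{
    block *b = rover;
    do {
        if (b->is_free && b->size >= size) {
            rover = b->next;       /* the next call resumes after the hit */
            return b;
        }
        b = b->next;
    } while (b != rover);
    return NULL;                   /* wrapped all the way around: no fit */
}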

There are countless ways to make even smarter malloc implementations.  The really sad thing about malloc() is that it's so darn inflexible and simple-minded.  If it doesn't work well enough, you always have to get out the big guns.  You either write your own allocator from scratch, and modify lots of code to call it instead of malloc().  Or you write an allocator that uses malloc() only for big chunks and call it, again editing source.  Or you write specialized allocators and call the specific one you think you need at the time, hoping that whoever comes along later to change the code realizes this and reevaluates the choice about what allocator to call based on changes they make to the algorithm.

Isn't it about time that we got past the "one size fits all" idea embodied in stupid old malloc() and its ilk, and started passing hints or suggestions to the allocator about whether it should be fast, or precise, or aggressive about coalescing, or whatever other hints you can think of?  One of the nice things about smarter implementations of GC is the idea of generational garbage collection.  This is basically an embodiment of the hint that some things will live longer than others, and to not spend a lot of time or effort trying to collect additional space from a population of allocations that have reached a certain age.

  -- GG

Greg Guerin
Wednesday, December 12, 2001

Here is the best way to write strlen and strcat. I wonder why nobody else has figured this out? Note the use of pointer arithmetic instead of array notation for added speed!

size_t mystrlen(const char *s) {
    if (*s == '\0')
        return 0;
    else
        return mystrlen(++s) + 1;
}

char *mystrcat(char *dest, const char *src) {
    if (mystrlen(src) == 0)
        return dest;
    else {
        *(dest + mystrlen(dest) + 1) =
            *(dest + mystrlen(dest));
        *(dest + mystrlen(dest)) = *src;
        return mystrcat(dest, ++src);
    }
}

Shannon Jones
Wednesday, December 12, 2001

Wow. This article has really given you oldies a chance to blowhard. I just finished my CS degree this year, and, yes we covered this stuff (way back in "introduction to computing 1A"), but get this - nobody cares. If I need to care, I will, but I get the distinct feeling that I won't need to give it a second thought.

Do any of you guys still light a fire by rubbing 2 sticks together? If so, what is the DNA structure of the wood you use?

Mr Anti Luddite
Wednesday, December 12, 2001

I thought it was a very interesting article, although I'm not competent to comment on the XML bit, as the only time I tried to read a book on XML I fell asleep after the first page.

With regard to the bit on strings, I tried the following timing experiment, adapted from a recent thread on comp.lang.c++ - excuse all the hard-coded magic numbers.

#include <cstring>
#include <string>
#include <iostream>
#include <ctime>

using namespace std; // So sue me (c Scott Meyers)

int main()
{
  int  i;
  char source[10000];

  char dest[30000];


  for( i = 0; i < 9999; ++i )
    source[i] = 'a';
  source[9999] = '\0';

  clock_t Start = clock();
  for( i = 0; i < 10000; i++) {
    dest[0] = '\0';
    strcpy( dest, source );
    strcat( dest, source );
    strcat( dest, source );
  }
  clock_t End = clock();

  cout << End - Start << endl;

  Start = clock();
  for( i = 0; i < 10000; i++ ) {
    dest[0] = '\0';
    strcpy( dest, source );
    strcpy( dest+9999, source );
    strcpy( dest+19998, source );
  }
  End = clock();

  cout << End - Start << endl;

  Start = clock();
  string sdest;
  string ssource( source );

  sdest.reserve( 30000);
  for( i = 0; i < 10000; i++ ) {
    sdest.erase();
    sdest += ssource;
    sdest += ssource;
    sdest += ssource;
  }

  End = clock();

  cout << End -  Start;


  char ch;
  cin >> ch;

  return 0;
}


A typical set of timings was:

1602
791
511

I can understand the first C version, using Shlemiel's algorithm, being way slower, but I was surprised that the C++ string version was consistently faster than the C version using 3 lots of strcpy - I would have assumed that the C++ string class would have to do pretty much the same thing internally. Does anyone have any ideas why this is the case?

Andrew Simmons
Wednesday, December 12, 2001

I've read some messages where people say "Who said XML was good for a fast way of storing lots of data?".

The answer, sadly, is many people.  I've met people who think that XSLT scripting is a far better way to implement a page than having a compiled databinding system.

And, at a recent DevDays in Central America, one of MSFT's large "partners" (the name partner is pretty loose, since all you have to do is pay MSFT to get the title), had one of their higher-level people there.  He made a presentation, and suggested (brought up the idea to consider) that we should:

  Use the XML DOM as the way to store our internal data needs.
  Have XML DOM hardware support.
  DBs of the future might want to use XML as the native storage format.

  This was presented to a group of 100 to 200 developers, many of whom came to learn about why XML is important and how to use it.

  It's a dangerous trend.  Also, with so many people jumping onto .NET, and .NET's way-easy support of XML, I fear many people will start using XML for storage where they shouldn't.  Again, because the XML is hidden from the end user.

  How many people will use SOAP, just because it is easy, and XML, when its performance is really poor compared to other remoting systems?  Send a large array of integers over SOAP and see.

  I'm happy to see someone addressing this harshly, because there are way too many people out there who think XML is somehow a great solution for everything, rather than just simple data exchange.

Michael Giagnocavo
Wednesday, December 12, 2001

I don't think Joel is a luddite.  And you can't even guess what the world of corporate programming is like until you've met some of its inanities. 

As I understand it, Joel describes this:  programming languages don't come with performance tables.  Speed and scalability are hidden, to be determined empirically.  Therefore it is very easy to be seduced by pretty languages and concepts, because the great flaws are hidden underneath.

"Determined empirically."  Anything determined empirically requires experience, by definition.  And experience is precisely the thing most CS students don't have.

What gives mental experience?  Understanding the fundamentals, just a little bit.  That keeps people from trying out heavy recursion in Java, because it worked SO well in Scheme...

But I don't personally mind, because I've been hired to speed things up.  I like XML, because I just call it a "tree" throughout my source code (I never, never write "XML") and trees are well understood.  So why don't we stop talking about XML?  Call a spade a spade, and XML a tree.

Richard Jenkins
Wednesday, December 12, 2001

Excellent article! Informative, vitriolic, and it includes a dig at Outlook; looks like you covered all the important stuff.

My comments on the XML as Interchange not Storage debate: it may have escaped notice, but an interchange format still requires you to "store" the data. Sure, this may mean storing to RAM or to a socket rather than to disk, but it uses the same general algorithms. And at the other end you still have to copy it into your own data structures.

Note Joel's question regarding COM and why it is slow when it crosses process boundaries. The problem here is known as marshalling. Basically, moving things from one process space to another requires the system to perform at least 1 memory copy (and typically several) as well as a whole lot of management crap. XML just makes this worse....

Andrew: the reason (I would guess) for the performance gains of C++ strings over C strings is the memory management enhancements that are included in the STL (look at stl_alloc.h if you want to go mad...). Also try removing the reserve() call and see if it goes slower...

Julian Bullock
Thursday, December 13, 2001

I do not think malloc( ) coalesces free blocks only when it has run out of space - free( ) tries to merge adjacent blocks ("buddies") to reduce fragmentation, every time. See Knuth's "Fundamental Algorithms" (TAOCP, Vol I) for more details.

This is similar to how garbage collection in Java and in Limbo differ from each other - minor performance overhead on each freed object for an overall improvement in performance and predictability. (In Limbo, simple reference counting is used to decide when to free an object and "mark-and-sweep" is used for cyclic structures - in real life, programmers do not create that many cyclic structures.)
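
The reference-counting half of that, in miniature (invented names): each release costs one decrement and a test, and the object is reclaimed the instant the last reference drops -- incremental and predictable, though cycles defeat it, which is why Limbo backs it up with mark-and-sweep:

#include <stdlib.h>

typedef struct obj {
    int refcount;
    /* ...payload... */
} obj;

obj *obj_new(void)
{
    obj *o = calloc(1, sizeof *o);
    if (o) o->refcount = 1;
    return o;
}

void obj_retain(obj *o)
{
    o->refcount++;
}

void obj_release(obj *o)
{
    if (o && --o->refcount == 0)
        free(o);   /* freed as soon as the last reference goes away */
}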

Ranjit Mathew
Thursday, December 13, 2001

I am not sure what you mean by "XML is slower than relational databases". The two are used for different things. XML has gained popularity because it makes transferring data easy.
Aaagh! You are comparing apples with knives!

Aaagh again!!! In a relational database you don't move like that - "add 100".

Every page in a relational database stores a "page header". This page header has pointers to where the rows inside the page are. This differs from implementation to implementation. So when you want to fetch the next record you don't add a hundred. You move to the next entry in the page header, get the pointer (in some implementations, the ROWID) and then fetch the row.
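
In C terms, that layout looks something like this (field names invented; every implementation differs):

#include <stdint.h>

#define PAGE_SIZE 8192

typedef struct page {
    uint16_t nrows;      /* number of rows stored on this page */
    uint16_t slot[];     /* per-row offsets from the start of the page */
} page;

/* Fetching row n is a slot lookup, not "add 100": rows can move within
   the page and be variable-length without breaking anything. */
char *row_data(page *p, int n)
{
    return (char *)p + p->slot[n];
}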

Regards

Karthik
Software Engineer
Bangalore
India

Karthik.S
Thursday, December 13, 2001

Michael Giagnocavo said:
Use the XML DOM as the way to store our internal data needs.
Have XML DOM hardware support.
DBs of the future might want to use XML as the native storage format


I don't think this is such a bad idea.  I wouldn't say use XML for internal data needs, but the XML DOM is a useful interface for hierarchical data.

Use the XML DOM the same way you would use an iterator: not as a means of storage, but as a standard interface.

Ged Byrne
Thursday, December 13, 2001

Excellent article as usual!

While consulting for SEMY Engineering Inc., I developed a distributed application with an MS Excel spreadsheet as the client and XML-RPC as middleware. The interesting aspect is that the Excel spreadsheet talks to a proprietary RDBMS server running on Solaris.

I found XML technology pretty useful for developing distributed applications. It is really easy to debug XML applications using tools like truss, since it is a text-based protocol.

XML enables desktop applications to talk to each other across the web.

Jawahar Mundlapati
Thursday, December 13, 2001

Good article, as always. I vote with both hands that XML is not half as good as many believe. Many _managers_ believe! This is not really a question for programmers. Though several smart people found flaws in your proof, Joel, too bad for them that they miss the point.
People ignore the fact that if they parse XML into a tree, store it in an object database, build indices and such, it's not XML any more! You can't touch it in your basic text editor any more. XML can only be used as an interchange format, and it's one of the slowest to handle. When management talks XML they frequently just don't have a clue about the other possibilities. They think SOAP is a great idea because you can do easy debugging by seeing the wire protocol. Ha! They say it will work with any web server. Ha-ha!
I think management just sees XML as a must-have for business because they've been brainwashed by higher management or the press. They don't know why, and when they try to explain it to programmers it gets ridiculously stupid. Who the hell started all that? And why? Why do we need yet another data format and wire protocol? Wasn't ASN.1 enough? Wasn't DCE or CORBA enough? Wasn't Unix using text config files from the start?
Personally I think the industry needs to reinvent the wheel to open 'new horizons'. It'll never stop :(

anonymous
Thursday, December 13, 2001

Reply to Andrew Simmons:

the C++ string thing is probably faster because the C++ string class most likely maintains 3 pointers:

char *first;//ptr to first character in string
char *last;//ptr to 1 past last character in string
char *end;//ptr to 1 past end of allocated memory

I shan't explain why 'first' is required! 'last' is needed (rather than just having the length implicit, a la normal C strings) so as to provide a fast 'end()' method. 'end' is obviously required for expanding the allocated memory when the string grows too big.
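
In C terms (invented names, growth path omitted), the payoff of that layout is that length and remaining capacity fall out of pointer subtraction, so append can call memcpy directly instead of re-scanning for a NUL:

#include <string.h>

typedef struct str {
    char *first;   /* start of the characters */
    char *last;    /* one past the last character */
    char *end;     /* one past the end of the allocated memory */
} str;

size_t str_len(const str *s)
{
    return (size_t)(s->last - s->first);   /* O(1), no scan */
}

void str_append(str *s, const str *t)
{
    size_t n = str_len(t);
    if ((size_t)(s->end - s->last) >= n) {   /* enough room? */
        memcpy(s->last, t->first, n);        /* bulk copy, no strlen */
        s->last += n;
    }
    /* else: reallocate and retry -- omitted here */
}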

I don't think a C++ string class can get away without a representation equivalent to the above, since I don't think appends are allowed to be proportional to the number of characters already in the string. See here:

http://gcc.gnu.org/ml/libstdc++/2001-07/msg00107.html

This talks about this wrt allocation but I don't suppose it makes a difference where the time is spent.

(I looked at the string header from VC++6 SP5. Their std::string has -- or rather, ends up with -- "char *_Ptr", and "string::size_type _Len,_Res". Pretty much what I described above I think.)

Tom Seddon
Thursday, December 13, 2001

"Or why the original HTML spec for TABLES was so badly designed that large tables on web pages can't be shown quickly to people with modems. "

Hey! Untrue!  The original HTML table spec has the lovely COLGROUP/COL elements to allow single-pass table rendering.  The biggest problem is that Netscape and IE never implemented it, so no-one knows about them. The original table spec has a lot of cool stuff in it.  Go read it before you slag it off...

URL: http://www.ietf.org/rfc/rfc1942.txt

Christian Mogensen
Thursday, December 13, 2001

Re: C vs C++ string handling code above

It intrigued me, so I decided to give it a go. Surprisingly, I got the same results as Andrew. I traced through the assembler a bit to see what was going on. Interesting stuff.

My times (debug):
941
471
210

My times (release):
1912
1022
220

Yep, release is slower than debug. Here are the answers.

* The first is slower for the obvious reason (having to search the string for the end)
* C++ is faster than C because C++ knows the length of each string (via start and end pointers that it can subtract), thus it can use memcpy() to append the string data. memcpy() is usually one of the most highly optimised calls in any C library
* Release is slower because the compiler attempts to inline the string functions, but these are actually slower than the function call. The reason is that the library functions are optimised to do long-word sweeps, while the inlined version is a smaller byte-by-byte piece of code. For the relatively big strings involved, the function version wins significantly. Try adding "#pragma function(strcpy, strcat)" after the #include lines to turn this off. This basically goes back to debug speed then

Hope that helps people.

Rob

Rob Mueller
Thursday, December 13, 2001

FWIW, every optimization for XML I've known demands that you leave XML as soon as feasible.  Parse it quick, put the references into a different data structure.  The structure depends on your access, insertion, and removal needs.

However, it is still XML.  The tree structure remains intact, and is trivial to save to disk -- one function call, no parameters.

Use a fast parser that only understands a very reasonable subset of XML.  Most people cringe after having downloaded a molasses-slow parser, but these slow parsers are meant to be correct implementations, not fast ones.  Correct fast parsers won't be around for a long while.  If you must deal with the occasional document that can't be handled by your fast parser, then, if your design is strong enough, the slow parser can handle it.
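
As a flavor of how small the fast common path can be, here is a toy scanner (emphasis on toy: no attributes, entities, CDATA, or comments, and it simply gives up on malformed input) that tokenizes start tags, end tags, and text:

#include <stdio.h>
#include <string.h>

void scan(char *p)
{
    while (*p) {
        if (*p == '<') {
            char *close = strchr(p, '>');
            if (close == NULL) return;         /* malformed: bail out */
            *close = '\0';
            printf("%s: %s\n", p[1] == '/' ? "end" : "start",
                   p + (p[1] == '/' ? 2 : 1));
            p = close + 1;
        } else {
            char *lt = strchr(p, '<');
            size_t n = lt ? (size_t)(lt - p) : strlen(p);
            printf("text: %.*s\n", (int)n, p);
            p += n;
        }
    }
}

int main(void)
{
    char doc[] = "<user><name>Joel</name></user>";
    scan(doc);   /* start: user, start: name, text: Joel, end: name, end: user */
    return 0;
}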

There are many scenarios where this advice must be modified.  But there are always options.

Richard Jenkins
Thursday, December 13, 2001

About the comment on using XML for internal memory storage -- for tree data, that's fine.  But the actual suggestion someone made was that this be used *instead of properties* and so forth -- that all the internal data of an app be XML-based.

That's what I was trying to point out -- people come up with weird ideas for XML.

Michael Giagnocavo
Thursday, December 13, 2001

I really liked this article, because I'm the type who constantly strives to keep the micro- and macro-level views of what's going on synchronized.  And I've always been a fan of building on a repertoire of basic skills and discipline.

I think Joel's point in raising XML at the end was that folks need to take the same careful view of modern frameworks (XML parsers and GUIs) that they once had to take of the little things (strings and malloc).

I've seen a trend where too many developers don't have the vision to really appreciate the strengths and weaknesses of frameworks.  As if the hackers at Microsoft or Sun or Oracle are somehow not quite human, and you just have to accept their offerings as-is and never question them.

Everything should be questioned.  That's why I like open source -- in theory, the questions can be answered.  But the point is to have the skills to not blindly accept others' offerings.

Meanwhile, the same discipline that Joel applies to malloc should be adapted to other efficiencies.  I've had a hard time mentoring other developers to understand things like minimizing RMI/IIOP round trips (in a J2EE application)... but it's pretty much the same thing in a different light.

Howard Lewis Ship
Thursday, December 13, 2001

Wow, such an uproar over what I would consider an article based on a specific opinion.  So instead of presenting yet another view which adds to the controversy, I will just agree with every opinion.  I do believe that we are all smart people and would not subscribe to a particular view without due reason.  Every one of us makes a choice for a reason, and in general those choices are valid for that specific context.  It's just that when the context changes, the opinion or view becomes a little obsolete.  OK, enough of the abstraction; let's bring it into context.  The level of computer science which we want to concentrate on depends on what we want to achieve.  Chances are, schools want to immerse their students in the maximum amount of knowledge given the small amount of time.  The trend in CS is Java and the OO languages.  Unfortunately, if you go into this in any depth, you sacrifice the basics.  I agree with Joel in that giving them Java as a first course would not give them enough of the basics.  But if you start off with assembly then you won't be able to cover much by the time the course is over.

I noticed that Joel contradicts himself by subscribing to Visual Basic as a programming language.  But he has a point in the article: "right tool for the job".

So what is the answer?  Well, it is to become so well versed in Computer Science that you know everything under the sun.  Then you can choose whatever tool or path is appropriate.  The lack of understanding of the basics derives from the fact that not everyone can afford the extended education (it would take a lifetime).  Even I find that I have a narrow view of what is considered important.  Remember, schools only have you for 4 or so years... and given the evolution of Computer Science, their priorities about what is important may have changed too.

Hoang Do, Gawayne Software
Thursday, December 13, 2001

Thanks to all who replied. I think Rob's suggestion about use of memcpy must be the correct one - when I change the second C version to use memcpy instead of strcpy it suddenly becomes faster than the C++ version. The debug/release thing is a bit of a worry, although the memcpy version is a lot faster in the release build.

Andrew Simmons
Thursday, December 13, 2001

A long time ago (but not as long as you probably think) I learned to do IBM mainframe programming. At the school I attended, the first actual programming class you were required to take was... assembler. You learned assembler before you even learned COBOL or any other high-level language. This served three purposes. First, it weeded out everyone who didn't have the mental dexterity to break down a task into the smallest possible pieces. Second, it made you really appreciate how much work a high-level language saved you. And lastly, it gave you insight into certain COBOL language quirks, which turned out to reflect the underlying hardware architecture. (Of course, by the time I learned all this stuff, the underlying hardware architecture had changed, probably more than once, but the assembler language and COBOL both faithfully followed the contours of the original System 360.)

All three of these are very good reasons to teach students an assembler language first, IMHO. (And no, C doesn't count; there's too much opacity in the library.) You might argue that assembly will teach budding programmers bad habits, as there is no such thing as "structured programming," let alone "object-oriented programming" in assembler, and the assembler equivalents of GOTOs are common. But on the other hand, anyone whose brain is flexible enough to grok the alien world of assembler as their first foray into programming will have no difficulty assimilating the more modern ways of doing things and unlearning any bad habits. It's not as if OOP has no benefits.

Fortunately, I managed to escape programming COBOL after college (I was already writing Apple II software for a living, in BASIC and assembly, when I started at this school). Today, I do hardly any programming at all. But I've found that the more languages you know, the more flexibly you can think about solving problems with technology. It's a mistake to learn just one or two. You should learn about as many as you practically have time for, and you should read about many, many more, and they should run the spectrum from the very low-level to the very high-level.

Jerry Kindall
Thursday, December 13, 2001

I'd generally agree with the main point that if you have no awareness of how something is implemented, the capacity for surprise and tumult is higher.

There are other things going on now than just incremental algorithms to move data around.  A well-planned system will take advantage of worker threads to go off and grunt on small tasks that, if done serially (as in the old days), really do bottleneck.  Then the trick is not so much about optimising loops as it is about optimising signals and dead time.

One of the modern horrors is the String class: everyone in any major project wants to go and generate their own or create their own spiffy subclass.  It seems to be the systems programming equivalent of 'My First Editor'.

I would dearly like string handling, the underlying worker threads, and the garbage collection to be in the OS; then I can just do what I need to do, not chisel lumps off bits of rock to make wheels all the time.  Those that mention MFC should be offered the revolver and a darkened room.

Simon Lucy
Friday, December 14, 2001

Hi,

All this topic makes me think about programming languages.
I think a language is made to give us the ability to make a computer act.
Now we have C, C++, OCaml.

When we create a module or an object, we often forget to take care of all the dependencies.  OOP programmers are used to execution dependencies and attribute dependencies.  But memory and execution time also matter, and they are tied to the hardware.  We are used to saying "we have enough memory" and "we don't need to optimize these methods because we don't use them often."

But code/hardware dependencies can influence the classes that need the fastest execution.

Is there a solution?  If you have one that works, or even one that is only theoretical, mail me at:  nipeng@caramail.com

Maybe Aspect Oriented Programming can help.  We can imagine a malloc that adapts during execution for speed, for security, or for memory usage.

Does anyone have information about the blurred aspects of classes?
(The space where class A has an effect on another class B that class A does not control, especially effects on memory usage or execution time.)

paul Balez
Friday, December 14, 2001

You're talking about systems which adapt during runtime... one must take care that these systems aren't so complex that they leech away too many resources!  Or worse, that they're too smart for your own good.

I sometimes give hints to the garbage collector to run at certain times, which it usually does.  Usually GCs have their own rules for determining when to run.  However, these systems can be too smart for their own good, if your code doesn't work the "right way."

At some point, I just want a guarantee that the system will have some minimum performance that goes up with hardware upgrades.  I don't want nondeterministic run-time optimization, because it's incredibly difficult to optimize when you don't know what the system does.  If your development system happens to cache something, but the customer's system doesn't, then something that appears fine to you will run dog-slow on their system.

Runtime optimization is hidden complexity.  You can just throw away your knowledge of how things should work, because your mental models no longer have relevance.

It's like that Seinfeld episode where this woman keeps on getting Jerry unfunny comedy props he didn't ask for, just because she thought it would be more hilarious than giving him what he specifically requested.

forgotten gentleman
Friday, December 14, 2001

I enjoy your articles, Joel, but I noticed a small flaw in this article.  ANSI strcat does return a pointer to the \0 character at the end of the new, concatenated string.  So, the need to have a mystrcat is removed by the ANSI standard strcat.  You could just as easily keep appending strings by

char really_long_string[1000];
char *p = really_long_string;
p = strcat(really_long_string, small_string);
p = strcat(p, small_string);
/* etc. */

And, this encourages reusability *evil grin*

Additionally, you could probably get slick with recursion and stack space and save doing realloc or malloc until the end of all of the concatenation.  This is how getline, in C++ iostreams, performs the work of pulling in an indeterminate length line with only one call to new.
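
Something along these lines, perhaps -- just a sketch of the idea, not how getline actually does it, and concat_rec/concat_all are made-up names.  Lengths are summed on the way down the recursion, malloc happens exactly once at the bottom, and each piece is copied into place as the stack unwinds:

#include <stdlib.h>
#include <string.h>

/* Sum the piece lengths going down the recursion, allocate once at
   the bottom, then copy each piece into place as the stack unwinds. */
static char *concat_rec(const char **pieces, size_t total)
{
    if (*pieces == NULL) {
        char *buf = malloc(total + 1);   /* the single allocation */
        if (buf) buf[total] = '\0';
        return buf;
    }
    size_t len = strlen(*pieces);
    char *buf = concat_rec(pieces + 1, total + len);
    if (buf) memcpy(buf + total, *pieces, len);  /* fill on the way back up */
    return buf;
}

char *concat_all(const char **pieces)   /* pieces must be NULL-terminated */
{
    return concat_rec(pieces, 0);
}

Whether the stack-frame bookkeeping actually beats a simple two-pass loop is another question, of course.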

Bill Perkins
Friday, December 14, 2001

Bill wrote: "ANSI strcat does return a pointer to the \0 character at the end of the new, concatenated string."

Ummm ... no it doesn't.  It returns the same value that was passed as the first parameter.

Alyosha`
Friday, December 14, 2001

And that is why we have code reviews and testers :-)

Now, don't I feel silly...

Bill Perkins
Friday, December 14, 2001

Having coded in C/C++ for the last 7 years, I understand your mystrcat() code quite well.  I think I know my operator precedence and associativity.  But I also know that many people don't, even so-called experts.  So I've decided to write all of my code more clearly.  My version makes the order of operations explicit.

char* mystrcat( char* dest, const char* src )
{
    while ( *dest ) dest++;

    while ( *src )
    {
        *dest = *src;
        dest++;
        src++;
    }
    return --dest;
}

It might be slower but most compilers today should have no problem optimizing this.

Now for something even more surprising.  I have found in my experience that using array indices produces faster programs than pointer math.  I wrote a test program a couple of years ago that tested just that (I don't have it anymore, so please don't ask to see it).  Now, this is very likely compiler dependent, but it is still interesting.  So the question is, "Is it better to write C/C++ code using array indices or not?"

char* mystrcat( char* dest, const char* src )
{
    size_t d = 0;
    while (dest[d]) d++;

    size_t s = 0;
    while ( src[s] )
    {
        dest[d] = src[s];
        d++;
        s++;
    }
    return &dest[d-1];
}

I think that you should use whichever one produces the simplest code to read, particularly with today's computers and compilers.  The speed difference is not worth the more complex, unmaintainable code.  It's not like the optimization Joel achieves in his mystrcat() function.  So unless you absolutely need that tiny optimization, please write simple code.

Steven Keens
Monday, December 17, 2001

Steven,

Unfortunately your mystrcat() fails to copy the terminating null character.

The idiom
while (*dest++ = *src++);
should be well known to any C programmer.
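
For completeness, a strcat rebuilt around that idiom might look like this -- just a sketch (mystrcat2 is a name for the sketch, not standard), following the ANSI convention of returning dest:

/* Find the end of dest, then copy src across -- the assignment in the
   loop condition copies the '\0' too, which is what ends the loop. */
char *mystrcat2(char *dest, const char *src)
{
    char *p = dest;
    while (*p) p++;
    while ((*p++ = *src++))
        ;
    return dest;
}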

Ray Gardner
Monday, December 17, 2001

Oops! You're right, of course. I forgot the line before the return statement (add it to both versions).  Good catch.  I should try these things before submitting them.  That's just like checking in code that hasn't been tested.  There's nothing worse than that.

char* StrCat( char* dest, const char* src )
{
  size_t d = 0;
  while (dest[d]) d++;

  size_t s = 0;
  while ( src[s] )
  {
    dest[d] = src[s];
    d++;
    s++;
  }
  dest[d] = '\0';
  return &dest[d];
}


char* StrCat( char* dest, const char* src )
{
    while ( *dest ) dest++;

    while ( *src )
    {
        *dest = *src;
        dest++;
        src++;
    }
    *dest = '\0';
    return dest;
}

After thinking about it a little, I wonder if the optimization comes from the fact that there is a small bit of loop unrolling?

Steven Keens
Monday, December 17, 2001

Perhaps the improvement in speed for array indexes comes from the CPU having special addressing modes for "base + offset" style memory accesses, and that is for some reason faster depending on your compiler/cpu/phase of the moon/color of your shirt?

I would be interested to see how the speed changes if the strings to be copied were, say, 128k long. The "base + offset" addressing modes often are limited in the size of the offset they can handle. The only real way to tell what's going on would be to look at the generated assembly language.

Personally I think that beating this specific example over the head is kind of silly. Joel's point, I think, is to pay attention to what your algorithm is doing. Me, if I had actual evidence that string concatenation was the culprit (ALWAYS PROFILE BEFORE OPTIMIZING!) I'd first think about using ropes instead of polishing the use of flat character buffers. See the SGI STL for an implementation, and the journal "Software - Practice and Experience", Vol 25(12), Dec 1995 for the academic paper describing this data structure.
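
For the curious, the core trick of a rope fits in a few lines of C -- a toy sketch only, nothing like the real SGI implementation (no balancing, no freeing, no error handling):

#include <stdlib.h>
#include <string.h>

/* A rope is a binary tree: leaves hold flat character runs, interior
   nodes just glue two subtrees together. */
struct rope {
    struct rope *left, *right;   /* set on concatenation nodes */
    const char  *leaf;           /* set on leaf nodes */
    size_t       len;            /* total characters in this subtree */
};

struct rope *rope_leaf(const char *s)
{
    struct rope *r = malloc(sizeof *r);
    r->left = r->right = NULL;
    r->leaf = s;
    r->len  = strlen(s);
    return r;
}

/* Concatenation allocates one node and copies zero characters --
   O(1) no matter how long the strings are. */
struct rope *rope_concat(struct rope *a, struct rope *b)
{
    struct rope *r = malloc(sizeof *r);
    r->left = a;  r->right = b;  r->leaf = NULL;
    r->len  = a->len + b->len;
    return r;
}

The flat string only gets materialized if and when somebody actually needs it, which is exactly what saves you from Shlemiel.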

Chris Tavares
Monday, December 17, 2001

I would like to know if you have any empirical data to support your claim that CS should be taught bottom up. I happen to agree with you, for the same reason I think one should learn algebra before calculus, but I have recently returned to academia and joined a faculty that is doggedly Object-First, and I need ammunition to help make changes. Got any? Thanks.

Chuck Allison
Wednesday, December 19, 2001

Further reply to Andrew Simmons: inline functions.

Chuck Allison
Wednesday, December 19, 2001

I had a superb professor of Computer Science years ago, Harlan Mills of IBM, who grew appalled at the poor foundations his college juniors and seniors came to his classes with.  He reacted by going back to the cradle -- freshman classes -- and creating a version of Pascal ("CF Pascal") that contained only two data types: characters and files.  Talk about starting simple!  He later built on the approach by suggesting that arrays were far too powerful a data structure for most purposes, because their use is so error-prone, and people mostly used them to model simpler things.  He maintained that sets, stacks, and queues were far superior (in an article that appeared in IEEE Trans. Software in the mid-70s).  His idea paralleled, for data structures, Edsger Dijkstra's famous (and now ancient) "Go To Statement Considered Harmful" letter concerning control structures that appeared in a 1968 CACM.  I'm too lazy to look up better references right now, but the fellow looking for support for his intuitions about starting simple might want to look into Mills's work.  (I share your intuition emphatically, by the way!)

Richard Foard
Tuesday, January 08, 2002

Well, I know this is obviously mere self-promotion, but:

http://www.annexia.org/freeware/c2lib/

solves the issues that Joel talks about.

Run the following program:

http://www.annexia.org/freeware/c2lib/realloc.c.txt

The output is:

Members of the Beatles: John, Paul, George, Ringo

Richard Jones
Wednesday, March 06, 2002

This is precisely why we wrote Escapade (yet another server-side scripting language, http://www.escapade.org ) in C, and why we made the syntax very limited.  Because Escapade is by necessity an interpreted language, we found by modeling before we started writing the interpreter that we were spending a large amount of our time in lexical analysis and parsing, so we simplified the syntax of Escapade quite a bit and gained a 90% increase in speed.

That's why Escapade is faster than Perl and PHP and most other interpreted server-side scripting languages, and always will be -- until people start putting a little more work on the programmer and stop thinking that, because we have big disks and fast CPUs and lots of memory, it's OK to write fat, slow code.

Ed Carp
Monday, March 25, 2002

I always thought that 'bit' was the past tense of 'byte', as in: "Today I byte into the cake, but yesterday I bit into it." And that this was why you always had to use bytes rather than bits...

Stephen Muires
Tuesday, April 23, 2002

http://jaap.flipcode.com/documents/compsci.pdf

Different story, same message. Yours is probably a better read though :)

Cheers,

Jape

Jaap Suter
Sunday, May 12, 2002

I disagree with the article completely.

The article and the preceding comments -- with all the disagreement, refinements, details and situational dependence -- suggest emphatically that this isn't the level of detail at which ordinary humans should be required to work. (And, while there are programmers who are gods, almost all are just human!)

We definitely ought to be going in the opposite direction (as users of development systems): more and more abstraction, just precisely and clearly stating requirements and having the "system" generate applications (and a system which, by the way, should be doing its own profiling and selecting of specific algorithms/formats/frameworks/protocols depending on actual run-time circumstances, in the light of defined requirements and available resources).

All the "advances" these days -- in languages and frameworks -- are oriented towards allowing mortals to do the work of gods but, of course, that ain't never gonna happen.  We just end up with the worst of all possibilities -- Shlemiel algorithms in Shlemiel environments -- hoping for Moore's Law to save us.

If we're going "back to basics", let's head on down into improvements in the theoretical foundations for application development, and not back to the metal.

Stephen Marney
Saturday, September 21, 2002

Late response I know, but I just read it.

Fun to read, and people have already made comments about the things that are wrong that I would have mentioned.

(Good idea, BTW, making commenters scroll through all the previous comments before getting to the comment link. Might dissuade some, but that's probably a plus in this instance.)

I just thought I'd add a data point. 

Though I agree that XML is horrendously inefficient for data storage, these days it doesn't really matter.

For a recent project prototype, I wrote some code that did what was being discussed -- a search on an XML file -- from scratch in Java. It took a total of about two hours to write and test, including researching the DOM API and downloading existing parsers.

Half an hour was spent writing a small program to generate a large enough XML file for the test.

An extra ten minutes was spent tuning the Java virtual machine's memory usage to ensure it could load all the data without doing too much paging. Everything was written in Java. No optimization was done on the code (though I do tend to write pretty optimal code the first time).

The time required to load and parse the data and then find
an item in a 100,000 record file was never more than 15 seconds on a standard Windows 2000 desktop PC executed from the command line.

The difference between parsing and searching 10,000 records and 100,000 records was minimal once memory paging was eliminated.

This was using DOM parsing and walking the DOM tree.
If I wanted to optimize this I would use SAX parsing and load the data into a hashtable or some other data structure better suited to searching. Also note this included loading the data set every time, which chewed up the majority of the time; the searches themselves were under five seconds.

I know a relational database would probably do this sort of search in 15 milliseconds or less, and it probably would have been quicker in development time to figure out how to load the data into an XML-compliant database and write SQL for the search.

But the point is that even 15 seconds is fast enough for most users anyway, so even though this is a hugely inefficient way of storing and searching the data, it is very simple code, fast enough for most activities, and you don't need a database engine.

Of course, the other benefit of XML is that I could have done this search in an even simpler way than loading it into a database....

I could have loaded it into a text editor and hit F3.
And I could have edited the data too!
Who needs fancy tools ? <grin>

While I agree with the top-level thrust of the article -- that knowledge of the underlying stuff helps with efficient coding -- in my experience inefficiencies are even more often caused by people not knowing enough about basic optimization theory.

Even the simple concept of moving the calculation of a static test value outside a loop (which admittedly some clever compilers will do for you) seems to be beyond many of today's programmers.
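
To make that concrete, here's the classic case in C (process() is just a hypothetical stand-in for whatever the loop body does):

#include <string.h>

void process(char c);  /* hypothetical stand-in for the loop body */

/* Shlemiel version: strlen(s) is re-evaluated on every pass,
   turning a linear scan into quadratic work. */
void slow(const char *s)
{
    for (size_t i = 0; i < strlen(s); i++)
        process(s[i]);
}

/* Hoisted version: the test value never changes, so compute it once. */
void fast(const char *s)
{
    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++)
        process(s[i]);
}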

As someone who has built CPUs from discrete logic circuitry and now programs enterprise systems in Java, I can attest that a deep knowledge of the underlying system is extremely useful when things go wrong.

Frank G. Pitt
Thursday, November 14, 2002

I'd like to take issue with the misrepresentation of GC as something that takes a really long time once in a while:

Sure, the first, old, naive GC algorithms did that, but incremental GC, a technique where GC is performed a bit at a time, once in a while, to avoid long pauses, has been around for years, is the subject of active research, and has been successfully used in real-time systems. Yup, there are incremental GC algorithms that can give you a constant upper bound on GC time per iteration. For that benefit, you might pay some small cost in total percent of time spent in the GC, but for interactive applications, that's usually worth it.

Of course, some malloc implementations can coalesce incrementally too, so it's not that GC can always do better, but the implication that it has to do worse is incorrect and detrimental to the overall quality of software produced, because comments like that encourage a widespread (and incorrect!) stereotype that GC is slow, so people tend not to use it when appropriate, and then suffer the bugs that manual memory management invites.

D.N.R.
Friday, November 15, 2002

I'd like to comment on the assertion made in the article:

"Generations of graduates are descending on us and creating Shlemiel The Painter algorithms right and left and they don't even realize it..."

I've no idea how American tertiary education institutions like to teach CS, but I found my recent educational experience at an Australian university to be very un-Shlemiel-like.

In my course of study, the Shlemiels who managed to get by on passes with uninspired code and good "exam regurgitation" techniques invariably hit a brick wall when it came to the unit "Design and Analysis of Algorithms".

This unit taught, in a language-neutral way (i.e. mostly pseudocode), methods for analysing and improving the performance of algorithms. Data structures were a major part of this approach. The Shlemiels either failed (the unit had a 50% failure rate) or learnt enough to ascend from their Shlemiel status.

Many other units also dealt with similar topics, with a similar philosophy and similar results (i.e. weeding out or converting the Shlemiels).

The philosophy of my university was to use Java whenever possible (the reasons why are not relevant to the topic at hand). Our exposure to C/C++ was minimal, and tended to be bundled with Unix and operating-system type units.

Most lecturers realised that the concepts were important, not a student's ability to regurgitate C++ code in an exam. Languages come and go. Understanding of computer science principles can be applied to any language.

I agreed with the thrust of the article -- "that some of the biggest mistakes people make even at the highest architectural levels come from having a weak or broken understanding of a few simple things at the very lowest levels."

However I fail to see how C/C++ is the only way one can understand these concepts. I find it surprising that one can be "physically disgusted that so many computer science programs think that Java is a good introductory language, because ... you don't get confused with all that boring string/malloc stuff..."

I believe that there are conceptual, language-neutral ways to learn "that strings are, at a very deep level, difficult," and other similar topics. Although C/C++ still remains a very important language, it needn't be enshrined as an educational monopoly.

After all, one wouldn't teach the concepts of mathematics by focusing on a particular type of calculator.

anon
Wednesday, January 01, 2003

Very well said anon.

Ben
Tuesday, February 18, 2003

This is a long conversation thread.  \0  Someone throw in the proverbial 0

Anyways, what's the point in Joel-bashing?  What axe does he have to grind by being an RDBMS advocate and opting not to buy into XML hook, line and sinker?

He has simply stated his preferences and opinion.  Give him that, instead of picking apart the myriad logical fallacies and inconsistencies of his argument.

Steve Tout
Wednesday, February 26, 2003

Joel writes:
-------------------
Lazy programmers would do this, and have slow programs:

char* str = "*Hello!";
str[0] = strlen(str) - 1;
[...] I used to call these fucked strings [...]
--------------------
Fucked, not because that technique is slow, but because attempting to modify a string constant can result in a segmentation violation.
What did the "lazy programmers" *really* write, Joel?

Silmaril
Monday, March 03, 2003

A lot of the prior respondents seem to really be focusing on Joel advocating RDBMS over XML, etc., and are (IMHO) missing the entire point of the article.

I believe Joel's point was that even people using a high-level language are well-served by having a low-level understanding of how computers work.

Matthew J Sullivan
Saturday, March 29, 2003

>In a relational database, every row in a table (e.g. the books table) is exactly the same length in bytes,
>and every field is always at a fixed offset from the beginning of the row.

Not true. Ever heard of VARCHAR? Rows can differ in size because columns can differ in size. This is good because there isn't massive wasted space in columns that *might* need to be 100 characters long but *usually* are more like 12, and that's extra helpful when the smaller data blocks allow more rows to stay in RAM. The only bad part is that when you update the row with a longer value, that row has to get stored somewhere else. But that's minor compared to fitting more stuff in RAM - CPU cycles are cheap relative to disk accesses.
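
To picture the difference -- a sketch of the idea only, not any particular engine's actual on-disk format:

/* Fixed-width row: every field sits at the same offset in every row,
   so row n is reachable with plain pointer arithmetic:
   table + n * sizeof(struct book_fixed). */
struct book_fixed {
    int  id;
    char title[100];    /* always 100 bytes, padded, even for "Oz" */
    char author[50];
};

/* VARCHAR-style row: fields are packed end to end, so each row carries
   a small directory of offsets and rows differ in length. */
struct book_var {
    unsigned short off_title;    /* where the title starts in data[] */
    unsigned short off_author;   /* where the author starts */
    unsigned short row_len;      /* total bytes in this row */
    char data[];                 /* packed field bytes follow */
};

Field access costs a couple of extra loads, but far more of these packed rows fit in each cached block.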

Jamie Flournoy
Monday, September 08, 2003

This whole article seems to be another take on "Architecture Astronautics".  You simply run out of oxygen sooner if there are too many Schlemiels in the ground crew...

Joe Foster
Monday, November 17, 2003

Great article. I'm glad to see that it's still getting traffic after two years.

Going in the other direction, there's the need to constantly re-analyse the fundamentals.

Look at x86. Where I work, we're still using development tools from 1988 (we're looking to fix that some day with a real-time OS) and DOS 6.22, and the product is even older. When my boss first created the company's product (on an 8 MHz CPU), the time to execute a CPU instruction was comparable to the time to write data to an add-on board. Just this past week (now we're up to Pentium MMX chips running at 133 and 233 MHz) he looked through the inner loop of an adaptive product of ours and found a way to double the bandwidth available to the product. The change: since reading from and writing to add-on boards takes a lot longer than CPU instructions do, he eliminated the unnecessary accesses. I think it was about 5 lines out of several pages of code.

Perhaps another example would be C++. I do most of my software development in C. When I picked up a C++ text (The C++ Programming Language, Third Edition, by Bjarne Stroustrup), I found myself learning a different language from what I remember from college, and a different language from C (yes, I know all the old C stuff is still there; however, it's now more like a compatibility layer). If you look at the previous comments, people mention changes in how strings are handled. Other changes, like templates and the standard template library, have caused fundamental shifts that are more or less incompatible with the C style of programming (not the language, the techniques) and are even somewhat alien to earlier iterations of C++. I'm hoping to take advantage of this soon, at least in the non-real-time parts of our project.

In each case the environment changed. Hardware is faster, and tools offer us more interesting options. Just as Joel's article speaks of the need to pay attention to the fundamentals (and many commenters supplemented that with the profile-and-tune technique), so is there a need to constantly re-examine your environment.

On another tack: a previous poster mentioned 15 seconds as an acceptable wait time. Multiply that by 5, 10, or even 100 different 100,000-record files that have to be read and searched through by some program at start-up. Now compare that to 15 milliseconds multiplied by those same numbers. Not only is that a realistic possibility, but suddenly the 15-second wait time looks much less acceptable. I'd say that's a prime example of how a fully knowledgeable programmer can set up a Shlemiel for the newer, less experienced programmers who have the joy of using his suggested approach.

Dennis Carmen
Sunday, December 21, 2003

Well! Boy am I glad I'm a hardware guy! Let me tell you why I read this article. I play games. One of the games has a hardcore mod community, and I wanted to learn how to mod by learning to code in C++. I've dabbled in code for a while, things like COBOL, BASIC, Perl, PHP, Java, but never really touched on C or its siblings. After reading all this mud slinging and chest beating, I came to the conclusion that most everyone missed the point(s) of the article.
1. YOU DO need an understanding of the underlying structure if you want efficient code.

2. Faster hardware and more memory/storage are not a good excuse for writing overweight code.

On the second point (and this is directed at all those truly lazy coders): this game I want to mod is done in C++. It is slow, inefficient and full of redundant code. Fortunately, some of the modders are beginning to see how to "trim the fat" and are beginning to reorganize not just the code but the structure. My point? This program was written using today's programming languages for today's systems but doesn't run faster than stuff written 10 years ago. It's bigger, yes, but why? It has no more "features" other than better graphics compared to older games (and most of that is thanks to DirectX anyway).

After looking at the structure, I was left wondering, "Gee, do programmers today get paid by the line?"
Gamers get a laugh out of running games written for old Pentiums or 486s, because they run so fast you can't really play them. That says a lot about the way they were written (along with the fact that, more often than not, patches were not required; they just worked). The old stuff listed the required RAM and minimum CPU. The new stuff lists a "minimum" hardware requirement that would give performance in the "barely usable" category, and a "recommended" hardware spec that more likely represents the machine it was written on, not for.
I don't know much about programming, but I do know that "programmers" in the game world had better get their Sh_t together soon. As the graphics get better, and hence bigger, the code is going to have to get cleaner and faster. Programmers need to learn to rely on better programming skills rather than faster hardware and more memory if they want to create truly awesome software (and I don't mean just prettier to look at), with AI that is truly intelligent, effects that look so real you forget where you are, and code that DOESN'T CRASH on the VERY FIRST RELEASE!
I tell you, if hardware was designed the same way some of the last generations of software were, we would have to replace our computers once a week, and they would cost about the same as a toaster ('cause that's all anyone would pay for them).
Pay attention to detail!  That's what his message was!

A. P.O'd hardware guy.
Saturday, December 27, 2003

As a programmer versed in most 3GL languages, I think that abstraction is indeed dangerously mistaken for a virtue in itself, instead of being simply a means of reusing code -- one that can be dangerous if wrongly applied.

String handling is actually an excellent example of this. Most of the introductions to C++ I've read could not resist the temptation to demonstrate how neat operator overloading is by creating an abstract string class, with a '+' operator that would concatenate strings.

But the issue of concatenating strings of variable length and dealing with the subtle implications for memory management must be appreciated and not brushed under the carpet.

In C, I've often 'solved' the string problem using the following structure:

struct str {
    int l;
    char *p;
};

All string operations manipulate the data members explicitly. There is no point in hiding them, as they are the central point and the simplest imaginable interface to the abstract string. If a particular operation occurs often, I simply create a straight function to deal with it. Good things are to be taken in moderation, and with OO concepts that's no different.
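
For instance, a straight function on that struct might look like this (a hypothetical str_eq, purely illustrative); because the length travels with the pointer, it never has to scan for a terminator:

#include <string.h>

/* Equality test for the struct str above: compare the stored lengths
   first, then the bytes -- no strlen, no '\0' dependence. */
int str_eq(const struct str *a, const struct str *b)
{
    return a->l == b->l && memcmp(a->p, b->p, (size_t)a->l) == 0;
}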

The concept can be easily extended. In one application, I juggle lots of dynamically allocated strings around, and there I include a flag

struct astr {
    int l;
    char *p;
    int free_p;
};

in the structure to tell whether or not the memory pointed to by p is to be freed if the structure itself is freed, in other words whether str is the owner of p. This gives me things like easy creation of str objects and an easy way to do explicit copy-on-write.

Elsewhere, I do lots of concatenation on these things and there it pays to keep track of the amount of space available, to avoid needless reallocation. The structure in that case looks like this:

struct astr {
    int l;
    char *p;
    int free_p;
    int allocated;
};

And so on.
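
To make that concrete, an append helper for this last variant might look something like the following sketch (a hypothetical astr_cat; it assumes the astr owns its buffer and omits error handling):

#include <stdlib.h>
#include <string.h>

/* Append src onto dst, growing geometrically only when the tracked
   capacity runs out; the stored lengths mean no strlen scans at all.
   Assumes dst owns its buffer (free_p set); realloc errors not handled. */
void astr_cat(struct astr *dst, const struct astr *src)
{
    if (dst->l + src->l > dst->allocated) {
        dst->allocated = (dst->l + src->l) * 2;
        dst->p = realloc(dst->p, dst->allocated);
    }
    memcpy(dst->p + dst->l, src->p, src->l);
    dst->l += src->l;
}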

It simply makes no sense to use strings that support everything, hide the complex parts, and hide all the ways the possible features may interact. Strings are far less complex once you decide what subset you really need.

There is simply no one-size-fits-all for complex things. In some cases, you need speed, in some cases, versatility, in others, quick prototyping.

How should an abstract object such as a string or a memory allocator know how it will be used? Well, it does not -- unless the programmer has the means at his disposal to tell the object about that, either by using different subclasses or by supplying different parameters.

But then, the object already becomes a lot less abstract, and must expose some of its inner workings in order to allow the programmer to make a good decision.

What does this all mean? Premature optimization is the root of all evil, as Knuth has said. I am convinced that this is not only true when optimizing for time or space, but also for optimizing prematurely for code reuse in an effort to optimize for that most valuable resource -- programmer time.

Focus on what you want to achieve, and generalise only after you've spotted lots of redundancy, there is a common pattern that can be efficiently modeled, and it actually pays off to refactor the code.

Cheers,


Emile.

Emile van Bergen
Saturday, January 31, 2004

I totally agree that using Java as an introductory programming language is pure idiocy. Not just because it takes you farther away from the metal, but because Java adds layers of cruft that can confuse someone who is just starting to learn programming.

Students would be much better off if their first programming language exposure was in ANSI C, or even Pascal. Even though Pascal also shields you from the bytes to a degree, it at least keeps you concentrated on the basics of what makes most typical programs tick: data structures and algorithms.

OOP is a paradigm that is better learned *after* you already have a good handle on those two concepts.

This is not intended as a slam on Java. Only on the use of Java as an introductory language.

Dougal Campbell
Monday, March 08, 2004

And I would say that you are completely wrong.

I am teaching a bunch of non-programmers how to program, and OOP is an extension of the way we normally think of things. It allows them to think about what an object should do, and how to segment it into proper pieces, which they can in turn write.

Is OOP the right way to teach programmers or future software developers? Maybe not, but it works a hell of a lot better for Philosophy or Science majors.

Frank Rodriguez
Monday, June 14, 2004

with this bit of code you show:

char* bigString;
int i = 0;
i = strlen("John, ")
     + strlen("Paul, ")
     + strlen("George, ")
     + strlen("Joel ");
bigString = (char*) malloc (i + 1);
/* remember space for null terminator! */
...

you should use the sizeof operator instead of strlen -- sizeof is evaluated at compile time (by the compiler, not the preprocessor), so at runtime the strings won't be scanned to find their length as you say is necessary. the - 1 on each line below drops the null terminator that sizeof counts. so instead of the strlen line, use this sizeof line:

i = sizeof("John, ") - 1
    + sizeof("Paul, ") - 1
    + sizeof("George, ") - 1
    + sizeof("Joel ") - 1;
    
then that'll get done at compile time, so it won't slow down the running at all. that line will get simplified down to:

i = 25;

in the code that's run i think.

ben
Sunday, June 20, 2004
