Fog Creek Software
Discussion Board

Caching as a cause of bugs

Today I found (and solved) an incredibly hard to track down bug.

The problem was
- the app generates a whole series of complex data structures (it could equally well have been getting the data from a file or database, although it wasn't in this case) - the key point is that generating them is time-consuming
- when the data changed, it tried to regenerate only the bits of the data structures that were affected (sort of like a cache of last time's results)
- unfortunately, a bug in the "cache" algorithm meant it didn't always regenerate things correctly, or regenerate the right bits (in this particular example, it was as if the cache got changed after being populated, without getting invalidated)
- the bug was incredibly hard to discover and even harder to track down, as it was not easily reproducible - it depended on what had happened several steps before

(The way I tracked it down: I discovered an edit operation that nuked the cache and eliminated the bug's effects, then on gut instinct tried flushing the cache after every edit to see if that eliminated the bug.)
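
To make the failure mode concrete, here's a minimal sketch (in Java, with hypothetical names - this isn't the actual app code) of the kind of incremental "regenerate only what changed" cache where this class of bug lives:

    import java.util.*;

    class IncrementalCache {
        private final Map<String, String> results = new HashMap<>();
        private final Map<String, Set<String>> dependents = new HashMap<>();

        String get(String key) {
            // Expensive result, computed once and then reused.
            return results.computeIfAbsent(key, this::slowCompute);
        }

        // Called when an input changes; recomputes only known dependents.
        void inputChanged(String input) {
            for (String key : dependents.getOrDefault(input, Set.of())) {
                results.put(key, slowCompute(key));
            }
            // The bug class: if 'dependents' was built before the latest
            // edit, a result that now depends on 'input' is never refreshed
            // here. Nothing fails immediately; the stale value only surfaces
            // when that key is next read, possibly many operations later.
        }

        private String slowCompute(String key) {
            return "derived(" + key + ")"; // stands in for the time-consuming step
        }
    }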

Anyway, the point I'm coming to is that caching-type algorithms seem like a horde of bugs waiting to happen:

- You have at least 2 copies of data, and they can get out of sync
- You have the possibility of bugs related to the cache management itself
- You have the possibility of invalidating the cache(s) wrongly, incompletely, not often enough, etc.

...And worse, because caching is usually heavily dependent on state from things that happened before - often a long time before - you can easily miss such bugs during testing.

(FWIW, I have many programs which contain various cache-like things, many of them nothing to do with database/file caching.)

No question here, just a thought.

S. Tanna
Wednesday, July 02, 2003

Just thought I'd agree :)

I dislike state flags for much the same reason. The bottom line is that wherever possible, programmers should deal with what _is_ rather than an abstract of that.

FullNameRequired
Wednesday, July 02, 2003

You could look at it another way: it means you probably should've had better layering and better unit tests.

Brad Wilson (dotnetguy.techieswithcats.com)
Wednesday, July 02, 2003

It already has good layering - if anything, too many layers (although I can't see an obvious simplification, as it is complex, with 3 levels x 2 of data, caching at each level, and the code is about as neat as it could ever be to implement all this).

As for unit tests, you could argue that any undetected or late-detected bug is due to inadequate unit tests. Everybody's unit tests are inadequate - unless you really do detect every bug with them, which would imply you are writing totally bug-free software.

(Additionally, some bugs will never be detected by unit tests, as they are the result of integration; this particular one would probably fall into that category.)

The main point I am making is that I believe cache-type algorithms are more likely to have undetected bugs - sort of an accident waiting to happen. This might suggest more careful testing and review of these areas, and/or avoiding implementing them where possible, and/or other conclusions about these more-than-averagely vulnerable points.

S. Tanna
Wednesday, July 02, 2003

S. Tanna, I agree too. How's that for controversy and flame wars. :-)

The only partial solution I can think of is to make caches of critical data structures somehow self-validating. I.e., do something akin to a CRC of the data contained in the cache every time the cache contents change, and verify the CRC for correctness on the fly. A mismatch between the expected CRC and the calculated CRC indicates bad cache data.

If there is no way to define some kind of low-overhead self-test of the cache data, I posit that the cache is too unreliable to use for any important application. And I think that unless there is some kind of implicit self-test mechanism, this variety of undetectable error will always be possible.
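
As a rough sketch of the idea (Java, hypothetical names; CRC32 is from java.util.zip): store a checksum alongside each entry when it changes, and verify it on every read:

    import java.util.zip.CRC32;

    class CheckedEntry {
        private final byte[] data;
        private final long crc;

        CheckedEntry(byte[] data) {
            this.data = data.clone();
            this.crc = crcOf(this.data);
        }

        byte[] get() {
            if (crcOf(data) != crc) {
                throw new IllegalStateException("cache entry failed self-check");
            }
            // Caveat: this catches corruption of the stored bytes after the
            // fact; it cannot catch an entry that was populated with the
            // wrong value through the correct interfaces in the first place.
            return data.clone();
        }

        private static long crcOf(byte[] bytes) {
            CRC32 c = new CRC32();
            c.update(bytes);
            return c.getValue();
        }
    }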

Also, the efficiency gained by the cache should be evaluated. Does it buy needed performance in exchange for the greatly increased vulnerability to bugs?

Bored Bystander
Wednesday, July 02, 2003

In this particular case: a CRC doesn't really apply, as it's more of a calculated-data cache (the calculations take a long time). There are self-tests, but they're more along the lines of "this data is not stupid" type tests - and the "corruption" of the cache was subtle and gradual, never triggering the self-tests.

In the general case: a CRC or something similar might help for some classes of cache bugs. It's a good precaution, but not a panacea.

Particular case again: the cache is fundamental to the app's performance, which is a major differentiator for the product. So I have to get it right, not just eliminate it. Believe me, you wouldn't want to use this app without the caching. Of course, that leaves a potential vulnerability sitting in the app - yes, it's a risk - but it's still a rational business decision.

S. Tanna
Wednesday, July 02, 2003

I just realized the first paragraph of my last post was not very clear.

What I meant by "corruption" was that it wasn't actually corrupted, but subtly affected through the correct interfaces used in slightly the wrong way (so a CRC would have always said it was okay, as it would have been recalculated - unless a second copy of the CRC was stored elsewhere, which would introduce a new potential problem in itself).

I realize that this explanation probably still doesn't make much sense, but if you saw the code, you'd understand what I mean :-)

S. Tanna
Wednesday, July 02, 2003

Wow, Tanna, I wish I could give you a job.

You sound so _intelligent_, _capable_ and just all-around competent.

If you don't mind me being nosy, how much do you earn?

FullNameRequired
Wednesday, July 02, 2003

Just out of curiosity, was the caching code hand-written or generated?

Without knowing the specific situation, it's hard to draw conclusions, but often caching code can be generated, and it's a hell of a lot safer to do so. Yes, you can still get the same class of bugs, but there's a lot less risk of a typo destroying the whole thing, simply because there are fewer lines of code in the generator than in the output.

Hand-writing repetitive but critical code is a disaster waiting to happen.

Again, I don't know the specific situation, so it may not be repetitive code or whatever, so don't take this as a criticism. :)

And the horse you rode in on
Wednesday, July 02, 2003

It's not repetitive, and it's hand-written; I'd say it's cache-like rather than a traditional cache... and what FullName doesn't know (if he isn't being sarcastic!) is that it was my fault the bug existed in the first place.

FullNameRequired - it depends - not that much - and much less than I used to. I do shareware (mostly) and some (not too much) consultancy. I only do consultancy (and shareware) on projects that I'm interested in, so I'm happy.

S. Tanna
Wednesday, July 02, 2003

Hi Tanna,

"nd what FullName doesn't know (if he isn't being sarcastic!) "

Definitely not being sarcastic :) <g> Going on your posts directly relating to programming, I'd say you have as good an understanding of fundamental programming skills as anyone here.

"is it was my fault that the bug existed in the first place.  "

:) I decided long ago to judge programmers based on the bugs they manage to find and fix, not the bugs they create.
(That's the only way I've managed to maintain such a big ego throughout my career....)

Whereabouts are you based Tanna?

FullNameRequired
Wednesday, July 02, 2003

A man with two watches never knows what time it is.

punter
Thursday, July 03, 2003

One more vote for "needs better unit tests". I've done similar things (cached the result of a lengthy computation for reuse) in a couple of programs, and I always had a way to disable the cache, run things "the slow way", and compare the results to the cache-enabled version.
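
Something along these lines - a minimal sketch with hypothetical names, not my actual code:

    import java.util.HashMap;
    import java.util.Map;

    class LengthService {
        private final boolean useCache;
        private final Map<String, Integer> cache = new HashMap<>();

        LengthService(boolean useCache) { this.useCache = useCache; }

        int compute(String s) {
            if (!useCache) {
                return slowCompute(s);  // "the slow way"
            }
            return cache.computeIfAbsent(s, LengthService::slowCompute);
        }

        private static int slowCompute(String s) {
            return s.length();          // stands in for the lengthy computation
        }
    }

    // In a test, compare the two paths on the same inputs:
    //   assert new LengthService(true).compute(x)
    //       == new LengthService(false).compute(x);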

Like FullNameRequired says, deal with what IS, rather than an abstract of that.

-Mark

Mark Bessey
Thursday, July 03, 2003

Two tips:

1. Whenever you build a cache, give it a flag that makes it not cache at all.

2. Whenever you build a cache, give it a flag that makes it validate everything it does.

If either of those is difficult in a particular case, then what you have isn't really a cache :-).

You might be able to get by with #2 and no separate #1.
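
A minimal sketch of what I mean, in Java (hypothetical names):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.Function;

    class ValidatingCache<K, V> {
        private final Function<K, V> compute;
        private final Map<K, V> cache = new HashMap<>();
        private final boolean validate;   // tip #2: check everything it does

        ValidatingCache(Function<K, V> compute, boolean validate) {
            this.compute = compute;
            this.validate = validate;
        }

        V get(K key) {
            V cached = cache.computeIfAbsent(key, compute);
            if (validate) {
                // Recompute and compare - which also gives you tip #1's
                // behavior for free, since every value is recomputed anyway.
                V fresh = compute.apply(key);
                if (!fresh.equals(cached)) {
                    throw new IllegalStateException("stale cache entry for " + key);
                }
            }
            return cached;
        }
    }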

Gareth McCaughan
Thursday, July 03, 2003

Reminds me of the good ol' days, when running C on real hardware (as opposed to Vaxen and such) was new - before they (we) learned that the volatile keyword was needed. All I can say is, I'm really glad I knew assembly when "while (!(*hardware_register->done_field))" never finished, even though the debugger showed the bit had been cleared.

Snotnose
Thursday, July 03, 2003

I'll second #1 above.  We got that behavior somewhat by accident in our cache (it's time-based, so adding things with zero time-to-live means you'll never get a hit) and found it extremely useful in isolating bugs.

I just hope I'll remember how cool it was when I write something else cache-like that doesn't give it to me for free.

Mikayla
Thursday, July 03, 2003

This is probably obvious to all of you, but it sure bit me the first time I wrote any kind of cache: Don't cache mutable objects!

<grizzled veteran mode>
I wrote this caching layer in Java that returned plain ol' LinkedLists. It never dawned on me that someone might modify said lists once I returned them from my caching layer. So there I went, caching objects and returning mutable Lists. All my unit tests passed. All the unit tests for the app passed. On my machine... On the official build machine, however, certain tests involving the cache always failed.

It took me days to put it all together. The order in which unit tests are executed in JUnit is either very complex or not specified; I get different answers when I ask different people. On my machine, the test that modified one of the Lists returned from the cache was called near the end. On the build machine, that test was called early, and it mucked up the cache for everyone else. I changed my cache to always return Collections.unmodifiableList() and everything worked again.
</grizzled veteran mode>
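
For anyone who wants it spelled out, a minimal sketch of the fix (hypothetical names, not the original code):

    import java.util.*;

    class ListCache {
        private final Map<String, List<String>> cache = new HashMap<>();

        List<String> get(String key) {
            List<String> value = cache.computeIfAbsent(key, ListCache::load);
            return Collections.unmodifiableList(value); // read-only view for callers
        }

        private static List<String> load(String key) {
            return new LinkedList<>(List.of(key));      // stands in for the real lookup
        }
    }

    // Now a caller doing get("x").add("y") throws UnsupportedOperationException
    // instead of silently corrupting the cache for every later test.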

IntelliJFan
Thursday, July 03, 2003

My approach to things like this is to implement a simple, "naive" version that works correctly as the first shot. If it turns out that optimization such as caching is necessary (per the "Optimize Later" rule, often it isn't), I have the first working algorithm to run against the more complex algorithm in order to verify results.

I believe this approach is described in "Writing Solid Code", where the author mentions that Microsoft tests the optimized Excel table calculations against a simple, slower implementation in order to verify that the optimized algorithm produces correct results (this may no longer be true - the book is somewhat old).

SomeBody
Thursday, July 03, 2003
