Fog Creek Software
Discussion Board

coding for worst cases?

I've been fleshing out some algorthms for use in a project, and something interesting came up: How much code do I write to handle worst-case cases? I have a situation which should happen very infrequently. The problem is that should this condition ever arise, it's possible for performance to decrease exponentially.

What do you guys usually do when faced with this? How much time do you spend writing infrastructure to handle a problem that 95% of your users will never need solved?

Mike Swieton
Tuesday, November 6, 2001

I'm not a developer, well no more than any other system admin anyway, but I would say that the question is how bad is your "worst case" here?

Does the program slow down under this condition? Then it should be fixed but its maybe not your first, or even your fifth priority imho.

Does it cause a crash? If its unreasonable to expect your program to work at all under certain conditions, it should at least handle this problem gracefully. At the very least, it shouldn't hang my workstation/server or hose my data.

Robert Moir
Tuesday, November 6, 2001

Well here's my situation: I'm storing filenames in a database, and right now it's the question of integer identifiers. In theory, it's possible to have 2^32 unique entries. But if I just start the first id at 0 and increment, removed entries could, in theory, leave a lot of space unused and still report as full. If I randomly generate the id, if the database is more than 1/2 full, then the performance there would begin to drop off quickly.

These situations are unlikely, but what if it arose? It's quite possible. I don't think I'll really deal with it, but it was an interesting question: when you have a case whre the performance would drop exponentially (say a sort dropping to O(n^2)) but the case is rare, how do you deal with it? What if dealing with it now would take a lot of time to get the perfect system, but doing it later if needed would be worse?

Mike Swieton
Tuesday, November 6, 2001

I'm not sure I understand the problem.  Normally the incrementing index idea would be the right one, I'm not sure what you mean by "removed entries could leave a lot of space unused"?  Seems to me space used by a deleted record would be reclaimed with either index generating scheme.  What DB are you using?  Are you using variable sized records and so you're worried about memory fragmentation, is that it?  If you randomly generate keys you're going to need a hash or a search to ferret out conflicts and *that* sounds expensive.  An incrementing index is O(1) on key generation and 99.9% of the time you don't have to worry about conflicts.  If you're creating a 1000 new records every second, it will take you over three years before you will start reusing IDs.  Not enough for you?  Go to 64bit IDs if your DB supports it.  =-)

Tuesday, November 6, 2001

I understand what he means... :) What he's saying is that when records are deleted, those record IDs are not reused, and when the ID autoincrementer reaches 4 billion and something, it'll give up, even if there are tons of unused IDs scattered around the range.

If you expect hundreds of thousands of records, choose the algorithm that executes in constant time. Even though at 1000 records a second the system would have to take a quite a few long vacations to avoid running out of IDs in 3 years (always beware of statistics), 4 billion is still a lot of files, and you'd probably run into other problems, like file system or storage limitations, for example, long before running out of IDs. And in your case, you'll have lots of advance time to prepare yourself once you start getting closer to the limit.

Your general question is one that I'd probably want to ask a manager, since it's not a technical decision: "Can we accept that some of our customers will find our product slow and unprofessional, or should we spend an extra xx days making it work for everyone (and forget about these other neat features we had planned to put in)?"

Other than that, design for your current requirements and expectations.

Johannes Bjerregaard
Tuesday, November 6, 2001

Hmm, on each removal, couldn't you store the identifier in a list of unused identifiers?  That way, when you allocate a new identifier, you can check the list first and grab the first one you find.

If insertion speed is occassionally a critical issue you can sometimes omit the list-checking step.  If removal speed is an issue, you need a new DB. :-)

forgotten gentleman
Wednesday, November 7, 2001

In dealing with issues like this, it's useful to do a simple cost-benefit analysis. Say you have a proposed design where the product simply stops working after 2^32 files have been inserted. Ask yourself:

1. How many customers will ever encounter the problem? (Be conservative and double or triple your guess.)

2. What is the cost to each customer who hits it? (Hint: it's probably more than the cost of your product.)

Multiplying these gives you the product cost. Then, for each proposed solution, ask yourself:

3. What is my cost to implement / support it?

4. If the solution imposes costs on other customers, how many customers and what is the cost to each?

5. If the solution creates new problems, include the costs of those problems (a la 1 and 2).

You now have the cost of the problem, and the cost of each potential solution. (Don't forget the include the "do nothing" solution.)

Choose whichever solution has the smallest total costs.

In real life, one seldom has enough information to get really good estimates for all of the above quantities. However, even using really bad data, one solution frequently comes out winning by several orders of magnitude, which lends credence to the supposition that it would win if we knew the correct inputs.

The whole process is usually doable in less than 1 hour.

In the example given, I suspect that using 64-bit identifiers would win handily, if your database can support them.

Jim Lyon
Wednesday, November 7, 2001

We ran into this exact issue, and decided to use 64-bit numbers.  It was somewhat more work, because one of the platforms we are using doesn't have native 64-bit integer support (MS Visual Basic).  We have to use the VB variant type (subtype Decimal) to support our record IDs.  We just ran into a bug today where a developer used a VB long (32 bits) which overflowed when a large DB ID got loaded in.  On the other hand, we were all very well aware of the syndrome that let to the Y2K bug, and determined not to repeat it.

BTW, I will never again use Visual Basic 6 to write a new software deliverable.  We wrote a very sophisticated framework using COM / ActiveX to construct clients without writing code and the whole frigging COM thing is a nightmare beyond belief.  We are using Java for all future GUI development platform efforts (actually all future development period).

Matthew Cromer
Wednesday, November 7, 2001

Man, how embarrasing to get your math wrong!  You're right, at 1000 new IDs a second will overflow a 32 bit int in just under 50 days.  I wasn't careful on my stoichiometry and sorta neglected to divide by 24 hours in one day.  Thanks for the correction.  =-)

My point was, do the math (and then check your work!) and see whether or not running out of IDs will ever be a problem for you.

Wednesday, November 7, 2001

If your DB supports 128-bit identifiers, you can use the UUID algorithm (supported natively on Win32, and spec'ed as part of DCE for you to implement on other platforms).

You're guaranteed no collisions, and the sun should go nova before you run out of ids.

Chris Tavares
Monday, November 19, 2001

*  Recent Topics

*  Fog Creek Home