Fog Creek Software
Discussion Board




Does anyone know of a huge software crisis?

I'm just looking for a good recent example of a software crisis that killed people or led to a tragedy. If anyone replies, it's appreciated.

MegaPop
Wednesday, March 05, 2003

Just had a manager go into the "we are all going to die if this isn't fixed" mode?

Marc
Wednesday, March 05, 2003

Whatever happened to all the disasters that  were supposed to be caused by "Y2K"?  There were all these stories about how there were millions of computers running old programs (particularly in government agencies) and there just wasn't enough time to fix them all.  And then Jan 1 2000 came and ........ nothing.

Now, we all know that the "Y2K problem" was grossly exaggerated by consultants hoping to charge companies a thousand dollars a day to fix the problem, but still,  I have to admit that I'm quite surprised at how few reports there were (essentially none) of major disasters that were the result of  Y2K related problems.

Kruger Industrial Smoothing
Wednesday, March 05, 2003

A particular NASA satellite met an untimely demise on the surface of Mars because of a programming glitch....

I'm sure it was quick and painless, though.

Go Linux Go!
Wednesday, March 05, 2003

I'm sure you can find examples from the medical field, though it is likely to be poor software causing or amplifying human error rather than being directly responsible.

There are plenty of examples of companies going bankrupt or losing vast amounts of money (Greyhound buses is a famous example).

On a much less dramatic plane, our college has just given all the staff their new computerized schedules, to such disastrous effect that all but three staff are now scouring the job pages, including staff with many years of increments who were intending to stay on until retirement and are now thinking of doing a bunk halfway through the semester.

Stephen Jones
Wednesday, March 05, 2003

Go to Amazon, look up this book:  "Fatal Defect: Chasing Killer Computer Bugs" by Ivars Peterson (ISBN 0679740279).  Also look through the "similar items" listed there.  That will give you a lot of tales of the sort you're looking for.  (After collecting a few book titles, check your local library to see if you can find them there and save a few bucks.)

Kyralessa
Wednesday, March 05, 2003

Isn't there a Usenet group about this sort of thing?  comp.risks or something close to that.  That could be a useful starting point.

Bruce Perry
Wednesday, March 05, 2003

I've read (though sadly I can't remember where) that a hospital had a bug in a prescription database system: after a software upgrade, one particular type of report printed the comma as a character that was unprintable on the printer they used, which meant that the dosage "10,00" became quite different indeed. At least one patient slipped into a coma because of it, but I don't think anyone lost their life.

As for the Y2K problem mentioned in a reply further up, one of the reasons you've heard so little about all the potential problems that could've happened is that quite a number of companies worked to fix these problems. That was the whole point of the scare to begin with: removing the problems. I read (again, WHERE?) about a power plant that had an outage for a few minutes while they rebooted all servers. Someone had installed all the patches but had answered "No" to the question "Reboot now" and forgotten to reboot the servers afterwards. That article didn't say anything about related damages though.

Lasse Vågsæther Karlsen
Wednesday, March 05, 2003
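
For what it's worth, the dropped-comma story above is easy to picture in code. This is only a rough sketch of the failure mode as described, not the hospital's actual system: the character set and both function names are made up for illustration.

```python
# Hypothetical sketch: a printer driver that silently skips characters it cannot print.
# PRINTABLE is an assumed character set; note that the comma is missing from it.
PRINTABLE = set("0123456789. mg")


def render_for_printer(text: str) -> str:
    """Drop every character the (hypothetical) printer cannot render."""
    return "".join(ch for ch in text if ch in PRINTABLE)


def render_checked(text: str) -> str:
    """Safer variant: refuse to print instead of silently altering the text."""
    bad = sorted({ch for ch in text if ch not in PRINTABLE})
    if bad:
        raise ValueError(f"unprintable characters in report: {bad}")
    return text


dose = "10,00 mg"  # decimal comma: ten milligrams
print(render_for_printer(dose))  # the comma vanishes, so the number reads 100 times larger
```

The point of the second function is that failing loudly is usually the better choice when the output is a dosage.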

The Risks Digest ("Forum On Risks To The Public In Computers And Related Systems", http://catless.ncl.ac.uk/Risks/) should have some examples.

Roel Schroeven
Wednesday, March 05, 2003

The most famous would be the Therac-25 (a medical linear accelerator). Between June 1985 and January 1987, six known accidents involved massive overdoses by the Therac-25 -- with resultant deaths and serious injuries. [1]
[1] http://courses.cs.vt.edu/~cs3604/lib/Therac_25/Therac_1.html

John CJ
Wednesday, March 05, 2003

Does it have to be real? Law & Order had an episode about people killed by insulin overdoses at a diabetes clinic caused by a virus in the measurement equipment.

Shrinky
Wednesday, March 05, 2003

Last week Cornell University mailed out acceptance letters to 550 students who weren't really accepted.

http://www.nytimes.com/2003/02/28/education/28CORN.html?ex=1047099600&en=dcb0a9808b350b4a&ei=5062&partner=GOOGLE

Joel Spolsky
Wednesday, March 05, 2003

And in other older-than-old news today:

"The cause of the accident that destroyed the first prototype of the Swedish JAS-39 Gripen Multirole combat aircraft has been traced to a software problem, program officials said last week".


http://catless.ncl.ac.uk/Risks/8.32.html#subj2

Patrik
Wednesday, March 05, 2003

Hey Kruger!
"Whatever happened to all the disasters that  were supposed to be caused by "Y2K"?  There were all these stories about how there were millions of computers running old programs (particularly in government agencies) and there just wasn't enough time to fix them all.  And then Jan 1 2000 came and ........ nothing.

Now, we all know that the "Y2K problem" was grossly exaggerated by consultants hoping to charge companies a thousand dollars a day to fix the problem, but still,  I have to admit that I'm quite surprised at how few reports there were (essentially none) of major disasters that were the result of  Y2K related problems."

I hate people who say this. It comes up in the news every so often.

Have you ever stopped to think that maybe the reason that Y2k was a non-event was precisely *Because* all the software folks put in overtime to get everything fixed?

Philo

Philip Janus
Wednesday, March 05, 2003

A medical simulation and visualization company wanted to hire me a while ago and I gave them The Joel Test out of curiosity. They scored 3 out of 12. Their software is used to simulate and plan surgeries such as brain tumor removal and separation of conjoined twins, among other serious things. I still shudder when thinking about it. This is a real story, not a joke.

rexguo
Wednesday, March 05, 2003

You'll have to google it, but IIRC the London Ambulance Service had to abandon its new over-budget, overdue dispatch system because of a UI flaw: if there were more incidents than visible rows, the operator didn't know, because there were no scrollbars or other indicators.  IIRC 11 people or something died that day because ambulances never reached them.  Or something like that.  Not making it up, honest!


Wednesday, March 05, 2003

Philo,

That's certainly one theory.  It doesn't hold up very well, however, once you realize that countries that made virtually no effort to combat the Y2K problem didn't implode.

Admittedly, many of those countries don't rely on computer systems to the extent that the more industrialized nations do, but I've yet to see any proof presented in the past couple of years that disaster was averted by the legions of COBOL programmers unleashed on the Y2K bug.

I'd certainly be interested in any sources that indicate otherwise, though.

And any of you Y2K programmers who want to 'fess up should feel free to do so...we promise not to ask for the money back.  ; )

Dunno Wair
Wednesday, March 05, 2003

Y2K was always a mystery to me.  I believe it was one of our fears, started by rumor.  Don't know if it's out yet, but 'Bowling for Columbine' looks like a good movie.

BC
Wednesday, March 05, 2003

It didn't kill people but it was a big economic tragedy: the Ariane 5 (a rocket from the European Space Agency) blew up over French Guiana in 1996, barely a minute after launch, due to a software error.

According to the urban legend the crash (pun intended) was caused by trying to stuff a 64-bit number into a 16-bit space.

Daniel Tío
Wednesday, March 05, 2003

Forgive me if I don't look this up, but I don't think it's an urban legend. I studied the Ariane 5 disaster during my software engineering course. As I recall, the software had originally been written for a less powerful rocket, with the result that it couldn't handle numbers bigger than a certain size when dealing with telemetry calculations. They took that software, used it in Ariane 5 without checking what would happen if it was fed numbers more appropriate to the new, more powerful rocket. Result: When the thrust got above a certain level, the chip entered "debug" mode and started sending diagnostic information to the engine control systems. Unfortunately, the engine control didn't know it was diagnostic data, and treated it as normal instructions.

Adrian Gilby
Wednesday, March 05, 2003
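
A small sketch of the "64-bit number into a 16-bit space" idea from the two posts above. It is not the flight code: the function names and the sample values are invented for illustration, and the only assumption carried over from the posts is that a value sized for the older rocket fit in 16 bits while the new rocket's value did not.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_strict(value: float) -> int:
    """Convert a float to a signed 16-bit integer, raising if it does not fit.

    If nothing upstream handles the exception, the whole computation stops.
    """
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a signed 16-bit integer")
    return result


def to_int16_saturating(value: float) -> int:
    """Defensive alternative: clamp to the representable range instead of failing."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


old_rocket_reading = 20000.0  # hypothetical value, within range
new_rocket_reading = 50000.0  # hypothetical value, larger than the old design allowed

print(to_int16_strict(old_rocket_reading))
print(to_int16_saturating(new_rocket_reading))
try:
    to_int16_strict(new_rocket_reading)
except OverflowError as exc:
    print("conversion failed:", exc)
```

Whether clamping is actually safe depends on what the number is used for; the sketch only shows that reused code carries its old range assumptions with it.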

Philo,

Y2K was a massive con because it really only had serious consequences for large transaction-oriented enterprises like banks, airlines and government departments. Those places were well aware of the problem, like many others they deal with, and had programs in place to verify their processing routines in ample time, often as part of major upgrades.

However, all the outsourcing firms and other con artists convinced legions of smaller companies that it was a life and death matter for them, and extracted money from them for work that wasn't necessary. If a small business sends out an invoice with the wrong date, it's not a major drama.

Must be a manager
Wednesday, March 05, 2003

I believe the Patriot missile system used in the Gulf War had a problem with its clock getting out of sync with the radar system if it stayed on too long.  The system did not attack the missile that killed several American soldiers because of the timing error.

There is also a funny story that when one of the first North American missile radar systems went online, an alert was triggered because no one had considered the radar reflection from the moon.  Nothing happened, but consider the possibilities.

John McQuilling
Wednesday, March 05, 2003
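
To put rough numbers on a clock that drifts the longer the system stays on, here is a sketch. It assumes the commonly reported explanation, which the post above does not state: time counted in tenths of a second, with 1/10 stored chopped in a 24-bit fixed-point register (23 bits after the binary point). The function names are made up.

```python
from fractions import Fraction

# Assumption: 24-bit fixed-point register with 23 bits after the binary point.
FRACTION_BITS = 23


def truncated_tenth(bits: int = FRACTION_BITS) -> Fraction:
    """Return 1/10 chopped (not rounded) to `bits` binary fraction digits."""
    scale = 2 ** bits
    return Fraction(scale // 10, scale)


def drift_seconds(hours_running: float, bits: int = FRACTION_BITS) -> float:
    """Accumulated clock error after counting 0.1 s ticks for `hours_running` hours."""
    ticks = int(hours_running * 3600 * 10)
    error_per_tick = Fraction(1, 10) - truncated_tenth(bits)
    return float(ticks * error_per_tick)


for hours in (1, 8, 100):
    print(f"{hours:>3} h uptime -> {drift_seconds(hours):.4f} s of drift")
```

Under those assumptions the error is tiny per tick but grows linearly with uptime, to roughly a third of a second after 100 hours, which is a long time for something tracking a fast-moving target.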

My favorite example is the Mars Climate Orbiter, which crashed into Mars in 1999.  The primary cause was a mixup between metric and English units.  I'd argue this was a software issue.

If I remember right, one team used Newtons and one team used feet-pounds to calculate force for course correction thrusters.  The result was a series of incremental errors leading to a small but critical trajectory difference that caused the orbiter to burn up in Mars' atmosphere.

http://www.wired.com/news/technology/0,1282,31631,00.html

The Voice of Rationality
Wednesday, March 05, 2003
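
A sketch of how a unit mixup like that slips through when numbers are passed around as bare floats. The units here (pound-force seconds versus newton-seconds of thruster impulse) are an assumption based on the usual account of the mishap, not on the post above, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

LBF_S_TO_N_S = 4.4482216152605  # one pound-force second expressed in newton-seconds


@dataclass(frozen=True)
class Impulse:
    """An impulse stored in one canonical unit, so callers cannot mix units up."""

    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(float(value))

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        return cls(float(value) * LBF_S_TO_N_S)


def untyped_handoff(raw_value: float) -> float:
    """The failure mode: the receiver assumes newton-seconds whatever was sent."""
    return raw_value


sent = 100.0  # hypothetical thruster impulse, produced in pound-force seconds
assumed = untyped_handoff(sent)
actual = Impulse.from_pound_force_seconds(sent).newton_seconds
print(f"receiver assumes {assumed:.1f} N*s, true value is {actual:.1f} N*s")
```

Each individual error is a factor of about 4.45 on a small correction, which fits the "series of incremental errors" description.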

Regarding Y2k which I did some work on:

1) Not all the problems happened at 1/1/2000. Many systems started failing as they got close.  For example, bank cards issued in 1997 with 00 expiration dates did not work everywhere; the cards were reissued with 1999 expiration dates while the problem was fixed.  I suspect that a lot of business systems fell into this category and got fixed as they failed.  Some systems got fixed after the fact also.  (I just saw a 3/4/103 date in a report this morning)

2) At many businesses the systems have to be fixed regularly.  It is not as if the systems are written once and run perfectly forever. Many of them have to be tweaked pretty regularly.
Some of this is business changes, but some is dumb coding errors that take a while to show up.  I have seen many systems that have 2 - 3 programmers who maintain them full time.  In that environment Y2k was more work, but it was not unusual for the systems to be patched regularly, so they just fixed this set of bugs.

I was surprised not that the business programming worked OK but that all the embedded systems also worked (I bet some failed, but nothing serious).  I believe there was some coordinated testing of the banking system and power grid that handled that.

John McQuilling
Wednesday, March 05, 2003
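
Both symptoms in the post above (the "00" cards rejected in 1997 and the 3/4/103 date) can be reproduced in a few lines. This is a guess at the kind of code involved, not anything from a real banking system: the pivot value and all function names are assumptions, and the "years since 1900" convention is borrowed from C's struct tm.

```python
PIVOT = 50  # assumed window: 00-49 means 20xx, 50-99 means 19xx


def card_expired_naive(expiry_yy: int, current_yy: int) -> bool:
    """Two-digit comparison: an expiry of 00 looks earlier than a current year of 97."""
    return expiry_yy < current_yy


def expand_year(yy: int, pivot: int = PIVOT) -> int:
    """Windowing fix: map a two-digit year onto a four-digit year."""
    return 2000 + yy if yy < pivot else 1900 + yy


def card_expired_windowed(expiry_yy: int, current_yy: int) -> bool:
    return expand_year(expiry_yy) < expand_year(current_yy)


def format_date_buggy(month: int, day: int, years_since_1900: int) -> str:
    """Prints the raw year offset, which gives dates like 3/4/103 in 2003."""
    return f"{month}/{day}/{years_since_1900}"


def format_date_fixed(month: int, day: int, years_since_1900: int) -> str:
    return f"{month}/{day}/{1900 + years_since_1900}"


print(card_expired_naive(0, 97), card_expired_windowed(0, 97))
print(format_date_buggy(3, 4, 103), format_date_fixed(3, 4, 103))
```

Windowing only moves the problem to the pivot year, which is one reason these fixes tended to be revisited later.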

In 1979, Air New Zealand flew a DC-10 (passenger jet) into Mt Erebus in Antarctica.  I think it's the only time they've killed people.

IIRC, Peter Mahon, who headed the Commission of Inquiry, concluded that:
  - they used a computer both for programming navigation and checking fuel economy
  - they put a fuel economy route into the aircraft, different from what the crew thought it was doing.

A variety of non-computer things were also wrong.

James
Wednesday, March 05, 2003

One Y2K bug I got dragged in on was a car rental system, my software picked up the rentals and created the accounting gunk and statistics (still does I think).

No rental with a return date over 1999 contributed to statistics or could be found in my databases.  I knew _I_ didn't have a Y2K problem.  A short time browsing the original database found that the rental software company had found a novel way of solving the Y2K problem by doubling 19 to 38, so all the rentals were due to return in 3800.

Simon Lucy
Thursday, March 06, 2003

Wasn't it in the Falklands War that a British ship was sunk because the missile defense system had been programmed to recognize missiles from NATO allies as friendly, and the Argentinians were using French missiles?

Stephen Jones
Thursday, March 06, 2003

An ESA spacecraft exploded after a software bug triggered it to lose control.

RRKSS
Thursday, March 06, 2003

A nice link to follow:
http://staff.washington.edu/~jon/pubs/safety-critical.html
"Safety-Critical Computing: Hazards, Practices, Standards, and Regulation"

It includes the infamous Therac-25 story

Marcos M. Rubinelli
Thursday, March 06, 2003

A lot of fatalities seem to be related to industrial control environments.  Industrial robots being the obvious candidate for causing fatalities, as they are both common and very capable of hurting humans.

There is a quick summary of accidents in the Safer C book in the first chapter detailing accidents from the fatal to the comical. 

These are fairly old, however; the book itself was published in '95.  The figures about the 'industrial control environments' come from a 1988 report by the RSRE.

There were, however, an estimated 200 people killed or maimed in this way by then.

Colin Newell
Thursday, March 06, 2003

Alan Cooper, in his book "The Inmates Are Running the Asylum", gives an example of a passenger airliner in Colombia that flew into a mountainside, killing everyone on board, because of bad software. In this case, the pilots had told the airplane's computer the destination, and had then punched in the code for the beacon which the airplane should automatically follow. Only they entered the code for a different beacon and the airplane took the wrong course. The accident was due to human error, but the human error was made possible by stupid software, which should have been smart enough to warn the pilots that the selected beacon didn't match the selected destination.

There was also that case in the 1980s where a US Navy ship shot down an unarmed Iranian passenger airliner because the ship's software erroneously identified the plane as attacking. It was a clear case of software failure, since a weapons system should be smart enough to distinguish between an attacking military aircraft and a passenger airliner that is in a normal ascent.

Greg Shoom
Thursday, March 06, 2003

joel, the cornell oops was a people bug and not a software bug so it doesn't count. feh :P

-nmr

nate
Friday, March 07, 2003
