Fog Creek Software
Discussion Board

Making really Zero-Defect roftware

I ran across this article on how NASA makes its onboard shuttle software. What makes it interesting is that they really have to make error-free SW: one fault and seven astronauts die.

Does anyone know of other articles on how really error-free SW is made (I didn't find Google too helpful)? I mean, how are those programs made that simply can't have errors (life-supporting medical systems, aeroplane onboard SW, nuclear reactor control SW, ...)?

Thursday, November 13, 2003

Critical system software is designed as software should be:
45% design
10% coding
45% testing

As opposed to the usual:
60% design
60% coding
60% testing

(and we cut anything over 100%)


Thursday, November 13, 2003

Making defect-free software is easy.

10 PRINT "Hello bugfree world"
20 GOTO 10

Anything beyond that will probably have some problem or another, it's just that you may not find it for a while.

Thursday, November 13, 2003

My Rules For Writing Ultra Reliable Software:

(1) Have a rock solid software process.  Embedded systems people aren't "code poets", they're engineers.

(2) Use a watchdog timer.  If the system chokes, the WDT will reboot the system.

(3) Make sure you check for and dismiss bogus input values.

(4) Exceptions, exceptions, exceptions.

(5) Document everything, and save it all (even the notebook you scribble in).
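A minimal sketch of rule (3) in practice: range-check readings and dismiss bogus values before they reach the control logic. The channel names and limits here are hypothetical, just for illustration:

```python
# Defensive input validation: reject implausible sensor readings and
# fall back to the last known-good value. Channels/limits are made up.

PLAUSIBLE_RANGES = {
    "temperature_c": (-60.0, 150.0),   # physical limits of the sensor
    "pressure_kpa": (0.0, 1000.0),
}

def validate_reading(channel, value, last_good):
    """Return a trusted value: the new reading if plausible,
    otherwise the last known-good value."""
    lo, hi = PLAUSIBLE_RANGES[channel]
    if not isinstance(value, (int, float)):
        return last_good                # wrong type: dismiss it
    if value != value:                  # NaN check (NaN != NaN)
        return last_good
    if not (lo <= value <= hi):
        return last_good                # outside physical range
    return value
```

So a garbage reading like 9999 degrees gets dropped on the floor instead of steering the vehicle.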

Jack Ganssle has a lot of good stuff about high-reliability systems:

Myron A. Semack
Thursday, November 13, 2003

Now you've done it.  You hit on a real life example of why companies don't have zero defect software.  When this story came out, it was sent all over our company (A top 5 IT company) as an example of what could be accomplished.  At the time I was working with some pretty smart people who took the article and the system we were running and did a comparison. 

Digging from memory -
1 - Governments are not efficient nor cost effective. They don't need to be.
2 - The difference between providing the wrong answer to "is the item in stock" and will the shuttle stay in flight is significant
3 - Not launching is bad.  NASA looks dumb scrubbing a launch at t-minus 12 seconds.  Companies go out of business for failing to deliver on time. 

Taking into account the processes, documentation, approvals, and procedures for loading code in NASA, we would have consumed the entire budget for the five year contract, within three months.  In order to provide our contracted level of support, and meet the NASA process, staff, support and resources would need to be "super-sized."   

We might not like it, but Microsoft proved people buy software that is "good enough."  (remember the "just reboot" days?)    Most software is as good as it can be given the other constraints a company believes they are under.  If they incorrectly define those constraints, they go out of business.    NASA, Nuclear reactors, Missiles, etc.  These have far different constraints than inventory management, call center support and claims processing. 

We may want to think it is as important that Grandma gets her Medicare check, but it's not so important that we will pay $1,000 to cut a $50 check.

Thursday, November 13, 2003

NASA and its ultra-reliable process.

How many probes/shuttles have they lost? I mean as a percentage. I think it's at least over 1%.

That is not ultra-reliable.

Thursday, November 13, 2003

Nuclear reactors in the USA are not controlled by software - they can be monitored by software, but not controlled.  See your friendly neighborhood NRC representative for further information. 

Joe AA
Thursday, November 13, 2003

Another point to remember about critical safety software is having multiple fallback levels in case of software failure. In the Shuttle there are 4 computers, one primary and three alternates, all of which are being fed the same inputs but only the primary system controls the ship. If the primary system fails one of the alternates takes over.  And there are levels below that, resulting eventually in the pilot having manual control over the ship. (In fact, on final approach to landing the pilot switches to full manual mode to land the orbiter, even though the computer is perfectly capable of executing the landing. Why? Because he/she is the *pilot* :) )

Further, such critical software runs on very controlled systems with well-defined inputs, outputs and timings.
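A toy sketch of that primary/alternate chain (not the actual shuttle logic, just the general shape of the idea): every channel computes from the same inputs, and control falls down the chain when a channel stops reporting healthy:

```python
# Toy primary/alternate failover: all channels see the same inputs;
# the highest-priority healthy channel's output is used.

class Channel:
    def __init__(self, name, compute, healthy=True):
        self.name = name
        self.compute = compute
        self.healthy = healthy

def select_output(channels, inputs):
    """Return (channel_name, output) from the first healthy channel
    in priority order; raise if every channel has failed."""
    for ch in channels:
        if ch.healthy:
            return ch.name, ch.compute(inputs)
    raise RuntimeError("all channels failed -- manual control")

chain = [
    Channel("primary", lambda x: x * 2),
    Channel("alternate-1", lambda x: x * 2),
    Channel("alternate-2", lambda x: x * 2),
]
```

Kill the primary and the same inputs flow through alternate-1 with no change visible to the caller, which is the whole point.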

Mark Newman
Thursday, November 13, 2003

"See your friendly neighborhood NRC representative for further information."

...because they are SO receptive to technical enquiries about their operations these days.


Thursday, November 13, 2003

They are, Philo, if you know where to look.

Andrew Hurst
Thursday, November 13, 2003

Ha, I see what you mean now.  I thought you meant talk about their process, not the whole terrorism thing.  Nevermind me, misunderstanding comments.  Move along...

Andrew Hurst
Thursday, November 13, 2003


Scooby, you know Thelma doesn't like it when you borrow her laptop to browse the forums.

Thursday, November 13, 2003


NASA's software is rock-solid. Challenger's problem was that a manager type approved the launch at temperatures the contractor had specifically said were out of spec for the solid rocket boosters.

The Apollo that burned up on the pad (Apollo 1) was caused by 1) a lack of understanding of the flammability of a pure-oxygen atmosphere and 2) a hatch that could not be opened in time.

Apollo 13 was, IIRC, a mechanical failure in one of the oxygen tanks, or something similar.

Columbia was a heat shielding failure.

Note that *none* of these are software or computer errors. I'd also like to point out that while NASA has a failure rate of about 1 in 45 on the shuttle, I find that to be a very high success rate, seeing as how it's a completely new frontier, and we're only starting to develop new tools with which to explore it (Saturn V, Shuttle. What else gets serious use in the U.S.?).

Mike Swieton
Thursday, November 13, 2003

NASA's software is NOT rock-solid.

One Mars probe crashed because of a mix-up between English and metric units in the software.

Another Mars probe died because of a software failure.

A deep-space probe had its control software lock up, causing the attitude-control thrusters to go to constant-on.  By the time they got the system rebooted, it was out of fuel.

Myron A. Semack
Thursday, November 13, 2003

These are not all in the category of fatalities, but check out:

Thursday, November 13, 2003

The best way to ensure zero-defect software is to outsource it to a CMM-5 certified code foundry in India or Eastern Europe. The folks there have much more experience with this sort of thing and are better programmers. It costs less money too and is delivered twice as fast, which is another benefit.

Thursday, November 13, 2003

Again OT:

NASA's success rate with the orbiters isn't that good; man-rated boosters and spacecraft are supposed to be in the 99%+ reliability range, with unmanned boosters ranging between 90 and 95%.

And there have been software failures aboard the shuttle, one or two of which were "eyebrow-raisers" but nothing that seriously affected flight safety or the mission.

Mark Newman
Thursday, November 13, 2003

Mike's got a good point about being a new frontier.  We all take air travel for granted, but how many planes crashed and people died (test pilots especially) before we got to this point?  Travel across the sea is fairly safe -- now.  Go back only a couple hundred years and see how safe it was.  Yes, new frontiers are tamed through costly trial and error.

Thursday, November 13, 2003

Hmm, go back a *couple hundred* years to see how safe air travel across the ocean was, hey? You wouldn't happen to be from K-Pax, would you?

Thursday, November 13, 2003

A key point I haven’t seen mentioned here is that both software *AND* hardware in critical engineering systems are normally done very differently than for conventional systems. 

Hardware is usually effectively several generations old and MILSPEC-rated.  You might have a ‘286-type chip that, though recently built, is made of radiation-resistant materials.  Because of the large component and trace size it is also more resistant to radiation or EMP effects.  The chip has usually been well debugged; more modern chip technologies are more likely to still have hardware-level bugs.

It is designed with extreme shielding, all sorts of filtering, overrated components, redundancy, ECC hardware, and anything else they can think of, then tested heavily.  The software is useless if the hardware fails or electrical noise flips a bit.
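That flipped-bit worry can be illustrated in software with a toy triple-redundancy vote; real systems do this in ECC circuitry and voting hardware, not in application code:

```python
# Toy triple-redundancy: keep three copies of a critical value and
# take the bitwise 2-of-3 majority, so a single flipped bit in any
# one copy is masked. Hardware ECC does this far more efficiently.

def majority_bits(a, b, c):
    """Bitwise 2-of-3 majority of three integer copies."""
    return (a & b) | (a & c) | (b & c)

stored = 0b10110100
copy_a, copy_b, copy_c = stored, stored, stored
copy_b ^= 0b00001000          # simulate a single-event upset in one copy
recovered = majority_bits(copy_a, copy_b, copy_c)
```

The corrupted copy is outvoted and the original value comes back out.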

With really critical stuff, you might have multiple processor technologies (Intel and Motorola) with separate teams working independently on software for each processor from “clean room” specs; the results must agree, with many fallbacks and fail-safes.
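At the comparison point, that "independent implementations must agree" scheme (N-version programming) reduces to something like this sketch. The two square-root routines here just stand in for code written by separate teams from a common spec:

```python
# Sketch of an N-version comparison point: two independently written
# routines compute the same quantity; the result is accepted only if
# they agree within tolerance, otherwise the system falls back to a
# safe value instead of trusting either answer.

import math

def sqrt_newton(x, iterations=30):
    """Version A: Newton's method, written from the spec."""
    guess = x if x > 1 else 1.0
    for _ in range(iterations):
        guess = 0.5 * (guess + x / guess)
    return guess

def sqrt_library(x):
    """Version B: a different 'team' (here, the library)."""
    return math.sqrt(x)

def agreed_result(x, tolerance=1e-9, fallback=None):
    a, b = sqrt_newton(x), sqrt_library(x)
    if abs(a - b) <= tolerance:
        return a
    return fallback    # versions disagree: fail safe
```

A bug that both teams independently write the same way still slips through, which is why this is a hedge, not a guarantee.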

Hardware and software changes aren’t made quickly, features aren’t thrown in just for fun.  It is a very slow and laborious process.  The specs are incredibly detailed, no room for question is left, there is massive testing, etc.

Nuclear plants generally have pretty minimal computer control, mostly because the certification process for computer hardware and software is extremely expensive and takes forever.  And that generally is the rule – the more critical, the more expensive, the fewer bells and whistles, the older the technology, the more obvious the system, the more non-computer backups.

NASA in the “faster, better, cheaper” program didn’t go by those rules for some of the recent unmanned probes.  Sometimes it worked, sometimes it didn’t.    The Mars Pathfinder used a single hardened PowerPC chip that was far more recent and powerful than typical hardware used in these cases (look at, for instance, search for “powerpc”).  Extremely tough hardware and smart but simple design is why we are still hearing from Voyager as it passes out of the solar system. 

(Short off topic rant:  While there are some things NASA can be very proud of, I also believe NASA has made far too many managerial blunders.  The shuttles are OLD technology from the ‘70s, they were NOT destroyed because “space is new” or “space is tough” – they died because of idiotic decisions.  End of rant.)

Thursday, November 13, 2003

I think that one very strong cultural barrier to delivering "zero defect" software is that developers continually tend to overreach their own competency. And the developers with the most imagination and desire for self improvement in the axis of technical knowledge will tend to be the "worst" at this attribute.

In most fields, pushing the envelope is a good thing and indicates personal character and motivation. In IT, pushing the envelope leads to creation of an unmaintainable Gawdawful mess that transmutes into a personal annuity. Hence CMM and process orientation, because otherwise developers run amok.

Examples of this are seen in the "C++ metaprogramming" threads. I have been a sometime C++ maven for 12+ years and I have never heard of this *crap* until I stumbled on these threads. Most developers I've known are hard pressed to deliver a clean application much less obfuscate data structures, code and objects with highly abstract concepts.

What tends to happen in this field is that the techniques wind up ruling the development process as necessities, IOW techniques are elevated from merely sometime-useful tools to the status of absolute dogma. C++ is a very good example of this. A simple case: if MI exists, you WILL use it because it is THERE. Absolutely no language feature must be ignored. Etc.

Bored Bystander
Thursday, November 13, 2003

"Critical system software is designed as software should be:
45% design
10% coding
45% testing"

If I quoted a 100 hour project to a client and upon reading the project document they noticed that only 10 hours were actually spent writing code, they would flip out.  I'd never get any business.

I *wish* I could be so meticulous.  But in the civilian world, customers are not willing to pay for that level of detail in dev work.

It would be nice though...nobody likes feeling as though the work they put out just barely passes muster.

Thursday, November 13, 2003

> Potemkin

Lol, trolling?

I would say if you pay any firm right and give them the right constructive leadership and resources, there's always a 1% chance something useful and powerful will result from all that hard work :-) Doesn't really matter where they are from.

Li-fan Chen
Thursday, November 13, 2003


Damn straight! I remember years ago when every recent graduate was overloading every single freakin' operator in the most obscure ways possible just because they thought it was cool. Index a hash with a string using []? No problem! Append to a list using +=? Sure, let's do that! Perform obscure data conversion operations involving network access? Overloading the cast operator is the right tool for that job! Hey Joe, what's the syntax for overloading the ternary operator again???

It's a cult which annually seeks to best the Worst Abuse of Language award with new and ever more obscure spaghetti architecture hacks.

Dennis Atkins
Thursday, November 13, 2003

Norrick, I think that is the point.  In the real world (as opposed to the "government world") nobody is willing to pay the price for "defect-free" (if there is such a thing) software.

Mike McNertney
Thursday, November 13, 2003

"nobody is willing to pay the price for "defect-free" (if there is such a thing) software"

I don't think that's necessarily true... or, even if it is true, it's irrelevant.

The hard fact is that defect-free software is _not possible_ in any software of a reasonable level of complexity.

It's just not.  Absolutely impossible. 
Aiming for defect-free software serves the same purpose as attempting to attain inner peace... you're never gonna get there, but you _can_ improve things a lot.

I always wince when I see programmers saying things like "no one is willing to pay for defect free software" because one day someone (as NASA has done) will call our bluff.
We _cannot_ produce defect-free software.

Thursday, November 13, 2003

"nobody is willing to pay the price for "defect-free" (if there is such a thing) software"

"We _cannot_ produce defect free software. "

Both those statements are untrue. I just bought "Tony Hawk's Underground" for my PlayStation 2 and it seems to be defect-free. I probably wouldn't have paid for it if I had read a review claiming that it would make my TV crash. Thus I guess I'm willing to pay 50 bucks for my defect-free game.

From what I know about the typical games development process, it is quite unlike CMM/SEI, NASA development, or even the 45/10/45 idea Philo proposed.

Maybe NASA, Peoplesoft, and the company you guys work for just need to hire better programmers?

Thursday, November 13, 2003

"and it seems to be defect free"

and you just bought it?  So how long have you tested it for?  And _already_ you are willing to declare it defect-free? By _your_ standards maybe I can produce defect-free software..

Thursday, November 13, 2003

And yet another thing about "programmer overreaching": if you could only 'make' programmers not overreach their current skill level, then much commercial and internal software could be QUITE cheap, even without heavy formal process. The effort wasted on activities such as (Dennis' excellent example of) C++ noobs abusing the operator overload facility would then be channeled into writing and debugging modest, functional code.

Alas, I have yet to meet a programmer who overreaches their native abilities who is aware that they do so. Most programming output is flimsy junk because most programmers flatter themselves that they can be inventor, lord and master of an intricate little clockwork.  Just once, I want to see a programmer make a solid case, beginning to end, of an esoteric technique like aspect oriented programming or this meta-crap, translating into a customer benefit.

Yeah, I know, I'm thinking too much like a PHB.

Bored Bystander
Thursday, November 13, 2003

Norrick, you're looking at it backwards, as most do.

It's not "if I have a 100 hour job and only spent 10 hours coding..."

It's "if you have a job that you estimate will take 10 hours of coding, you should budget 100 hours for high-quality results."

Of course, people will scream. But what's funny is that when all is said and done, if you have working code you most likely *did* spend 100 hours on it.


Thursday, November 13, 2003

"I mean how those programs are made that simply can't have errors (life supporting medical systems, aeroplane onboard SW, nuclear reactor control SW,...)."

I'm not familiar with life support software in particular (there are several different classes of medical devices).  However, the vast majority of medical systems software is not validated by the FDA.  It is usually the software development process that is FDA approved, not the software itself.  Medical systems software is far from being defect-free, but it is built with a strict process and is designed, coded, and tested with the goal of preventing the kind of errors that would result in injury to the patient.  Medical devices do fail (far more frequently than most people would imagine).  The key is that they fail without harming the patient.
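That "fail without harming the patient" goal shows up in code as driving toward a known safe state on any fault, rather than trying to limp along. A hypothetical sketch (the pump model and its limits are invented for illustration):

```python
# Hypothetical fail-safe pattern: on any fault, the device moves to a
# state that cannot harm the patient (stop delivery, raise an alarm).
# The pump and its validated range are made up for illustration.

class InfusionPump:
    def __init__(self):
        self.state = "IDLE"
        self.alarm = False

    def fail_safe(self):
        self.state = "STOPPED"   # safe state: no drug being delivered
        self.alarm = True        # and make sure a human notices

    def deliver(self, rate_ml_per_hr):
        try:
            if not (0 < rate_ml_per_hr <= 100):
                raise ValueError("rate outside validated range")
            self.state = "DELIVERING"
        except Exception:
            self.fail_safe()     # any fault: stop, don't guess
```

The device still fails; it just fails into a state where the failure is an inconvenience rather than an injury.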

Matt Latourette
Thursday, November 13, 2003

The best way to make zero-defect software is to make sure all the people involved have strong incentives to make the software defect-free.

Reward people for X and they will provide X.

If X is unquantifiable and subjective in its very nature, then we either need to mash it into something that is quantifiable, or we need to appoint an official judge who will assess the presence of X and take the responsibility if his assessment was wrong.

It's never about the process it's always about the people.

You know it makes sense. I'm Sam Kekovich (not really)

Friday, November 14, 2003

I don't believe all this talk about process in government stuff.

Five years back I was hired to design an embedded system for a military aircraft upgrade. My system was part of the control chain. I was hired by a company that worked for another company that worked for another company that had a big fat military contract. After all the middlemen took their cut, I got a modest rate. I delivered the project on time and fully functional. The amazing thing about all this is I don't have any clearance, I don't have any process certifications, and it was just me working alone in my bat cave. Now, I do do extensive testing and have something I call a process, but nobody checked that out. They just needed to find someone who could actually deliver the system! They didn't care how I did it! That system is keeping fighter jets up in the air! The thing I wonder about is how many people who *don't* know what they are doing are sitting at the end of a chain of middlemen?

Dennis Atkins
Friday, November 14, 2003

at least one!

Friday, November 14, 2003

"Scooby, you know Thelma doesn't like it when you borrow her laptop to browse the forums"

Never mind that Shaggy, just get your kit off and assume the usual position.

Friday, November 14, 2003

Every piece of software has at least one line of code that can be removed and at least one bug. 

Therefore, by extrapolation, every piece of software can be reduced to one line of code that doesn't work.

Friday, November 14, 2003

> Making really Zero-Defect roftware

How can we expect to make zero-defect software when we are having trouble making zero-defect subject descriptions? :-D

Friday, November 14, 2003

"Norrick, you're looking at it backwards, as most do.

It's not 'if I have a 100 hour job and only spent 10 hours coding...'

It's 'if you have a job that you estimate will take 10 hours of coding, you should budget 100 hours for high-quality results.'"

I'm not looking at it backwards; the hypothetical client in my post was.  Believe me, I am all for spending 90% of my time on design and testing.  My point was that if that particular breakdown of work was transparent at the outset of a project, most clients would throw a fit.

"Of course, people will scream. But what's funny is that when all is said and done, if you have working code you most likely *did* spend 100 hours on it."


Friday, November 14, 2003

I have also found that computer games are actually held to, and achieve, much better reliability than commercial software.  They also tend to be complex things that are re-used a lot, e.g. the Quake III engine.

I suppose that they are given more freedom to "do the right thing" (not just being thrown into an office), and are under firmer quality requirements (you can't sell buggy games).

Does anyone here have any experience of quality control in the games industry?

A N Other Student
Tuesday, November 18, 2003
