Fog Creek Software
Discussion Board

OS reliability studies

Hi all. I was doing some research on comparative reliability numbers (basic MTBF stuff) for different operating systems. Thus far I've totally struck out on finding any _objective_ studies or information that tried to evaluate it.

Any links or references you all might have would be greatly appreciated...

Jeff Kotula
Monday, July 19, 2004

You won't find objectivity here.  They were all chased away by childish slashduh subscribers who posted anti-ms trash until all objective people left.

Last of the Objective-ites
Monday, July 19, 2004

Objective metrics? They are unlikely to exist, as many attempts at producing them show.  The problem is that most studies are either financed by MS or come from the Linux side.  Both look at the best case of specific instances and try to draw broad conclusions.

This is true for most software.  Consider which is better, Quicken or MS Money.  Each can show you features the other doesn't have, but which one adds value is a user-specific requirement.

Good luck.

Monday, July 19, 2004

What is failure of an operating system?

You'll need to be a lot more specific.

Stephen Jones
Monday, July 19, 2004

Yes, MTBF (Mean Time Between Failures) metrics typically apply to hardware -- you know, stuff that actually wears out over time. 

So you can measure how many times a relay can be cycled before it fails (failure here defined simply as 'does not make a good connection').  Do that for a set of representative samples of that relay, and you generate a 'normal distribution' of the failures of that relay.  The 'mean' (aka average) of that distribution is your MTBF.
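As a rough sketch of the arithmetic described above (the cycle counts are made-up illustration values, not measurements of any real relay, and the names `cycles_to_failure` and `mtbf` are hypothetical):

```python
from statistics import mean, stdev

# Hypothetical cycles-to-failure counts for a sample of identical relays.
# These numbers are invented purely to illustrate the calculation.
cycles_to_failure = [98_000, 102_500, 101_200, 99_300, 100_000]


def mtbf(samples):
    """Return the arithmetic mean of the observed cycles (or hours) to failure."""
    if not samples:
        raise ValueError("need at least one failure observation")
    return mean(samples)


relay_mtbf = mtbf(cycles_to_failure)
# The sample standard deviation describes how widely the failures are spread.
relay_spread = stdev(cycles_to_failure)
print(f"MTBF: {relay_mtbf:.0f} cycles (std dev {relay_spread:.0f})")
```

The only point here is that the figure is a plain average over a sample, which presupposes that every unit in the sample eventually fails in the same well-defined way.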

Since software doesn't 'wear out', you need to specify what you mean by a failure.  'Latent' bugs are discovered over time.  Additional functionality is desired, ordered, added, and sometimes adds bugs.  And not every bug is a 'failure' anyway -- there are lots of work-arounds in software which allow you to continue using most of the functionality.

Even if you start with a "Minor, Major, Fatal" failure partitioning, how do you ascribe errors to the OS?  The OS is just fine, until somebody runs an application that uses the OS services.  So is a fault the application or the OS?

And then, most OS's run MULTIPLE applications at the same time.  My Windows is currently running 53 processes.  Which of those actually creates the fault?

And then, most OS's use dynamically loaded libraries -- some shipped with the OS, and some user-generated.  How do you separate those out?

And then, Unix uses one set, Windows a different set, Linux yet another set.  Not to mention statically linked library code.  Then there are 'other' OS's like VxWorks, eCos, BeOS.

Even Unix has multiple 'flavors' from different vendors -- Solaris, IRIX, HP-UX, Linux.  And Linux has multiple VENDORS for heaven's sake.  Are they all the same?  Are they all built the same?

Well, in my long-winded way, I've tried to show you this is not a simple problem.  I don't think there yet exist any 'canned' solutions to it.  And I've tried to point out some of the factors you need to take account of if you wish to do this sort of thing.

Good luck.

Monday, July 19, 2004

I'm pretty sure on /. they distilled it down to Linux rulz.

.net, the equivalent of MS Bob.
Monday, July 19, 2004

Of course, there are reboots from installs, etc., but this site gives some perspective on the maximum uptime of current commercial O/S's.  It's hard to tell, when an O/S has a flaw, whether it's the O/S, a poorly written driver, some software, or a hardware glitch that the O/S didn't handle well, etc.

Monday, July 19, 2004

Yeah, I know it isn't exactly clear, but I was hoping to find some controlled experiments that at least characterized OS failures (blue screens, automatic reboots, etc.) by how often they happened in different environments.

Actually, I am interested in a hardware-like MTBF measurement. Windows is used as an operating system in devices and hence is no different than an electronic device that contains firmware. "Failure" here really means "requires some attention in and of itself".

(Maybe this is a separate thread, but I find it pretty disturbing that the software industry considers itself _so_ different that it won't even develop its own measures for reliability, etc...)

Jeff Kotula
Monday, July 19, 2004

>>> (Maybe this is a separate thread, but I find it pretty disturbing that the software industry considers itself _so_ different that it won't even develop its own measures for reliability, etc...)

Why would they mess with a good thing?  For now, people mostly accept software as it is - bugs and bad license agreements and all.

Overall, there's not much financial incentive for reliability improvement, so we're stuck with unreliable software.  Bruce Schneier's idea that insurance companies will eventually get involved may be correct:

Monday, July 19, 2004

See 'Leaky Abstractions'.

If you WANT to use a hardware metric against software, it is critically important that you understand the places this 'Abstraction' leaks, or does not apply very well.

Just because you can TREAT something 'like' it is hardware (you pay money for it, you embed it in a working system, you do 'periodic maintenance' on it (whatever that means)) does NOT make it hardware.  Software (or Firmware) really does not 'wear out'.  It has no 'bathtub curve' failure mode.  It does not have a 'normal distribution' of failures in a sample of components.  These are MTBF properties that software does not have.

Software DOES have properties that can be 'maintained'.  Discovered bugs can be fixed.  New features can be added.  "How long it can stay up" -- 'Uptime' is one possible metric for its 'quality'.
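One way the 'Uptime' idea is often turned into a number is as an availability fraction. A minimal sketch, assuming you have logged total hours up and total hours down over some observation period (the function name and the sample figures are illustrative assumptions, not taken from any study):

```python
def availability(uptime_hours, downtime_hours):
    """Fraction of the observation period during which the system was up."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("observation period must be longer than zero")
    return uptime_hours / total


# Hypothetical year of operation: 8750 hours up, 10 hours down.
year_availability = availability(8750, 10)
print(f"availability: {year_availability:.4%}")
```

Note that this says nothing about why the system was down, so planned reboots for installs and genuine crashes are lumped together unless you log them separately.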

I'm not saying there are not useful OS metrics out there, or that they are impossible to come up with, or that software people don't care about them.  I was just trying to say "Most people don't think in MTBF terms when it comes to software.  Here are the terms they DO think in, and it might be useful to you to use the same terms."

Monday, July 19, 2004

I suspect what you want is a comparison of Windows vs. Linux and other OS.  Finding an objective study on that may be a bit difficult.

ZD Labs did a study on the reliability of Win98, NT 4.0, and Windows 2000 back around the time Windows 2000 shipped.  If you search Microsoft's website for "Windows 2000 reliability" you should be able to find the Word document that ZD Labs produced.  The doc goes into quite a bit of detail on how the test was set up - definition of failure modes, software used, etc. 

Monday, July 19, 2004

I saw the ZDNet study, but all it said was that Windows 2000 was a lot better than NT or 3.1. The mechanics of their study were pretty light too.

Regarding the appropriateness of MTBF: Hardware certainly is different from software, but it is also true that not all hardware is the same, and different hardware exhibits different failure modes. This doesn't invalidate the use of MTBF, because it is meant as a top-level measure of reliability when used at the granularity of systems. MTBF for software could be considered "mean time between crashes under normal usage scenarios". I think this is an entirely valid metric.
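A minimal sketch of how "mean time between crashes" could be computed from a crash log, assuming a crash is whatever you have decided to count (blue screen, automatic reboot, etc.) and ignoring the time spent recovering. The function name, the observation window, and the timestamps are all hypothetical:

```python
from datetime import datetime, timedelta


def mean_time_between_crashes(start, end, crash_times):
    """Observed operating time divided by the number of crashes in the window.

    Returns None when no crash was observed, since the window then only
    gives a lower bound rather than an estimate.
    """
    if end <= start:
        raise ValueError("end must be later than start")
    crashes = [t for t in crash_times if start <= t <= end]
    if not crashes:
        return None
    return (end - start) / len(crashes)


# Hypothetical 30-day observation window with three logged crashes.
window_start = datetime(2004, 7, 1)
window_end = datetime(2004, 7, 31)
crash_log = [
    datetime(2004, 7, 4, 9, 30),
    datetime(2004, 7, 15, 22, 5),
    datetime(2004, 7, 28, 13, 0),
]

mtbc = mean_time_between_crashes(window_start, window_end, crash_log)
print(f"mean time between crashes: {mtbc}")
```

The result is only comparable across operating systems if the workload, hardware, and definition of "crash" are held the same, which is the hard part discussed earlier in the thread.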

Jeff Kotula
Tuesday, July 20, 2004
