Fog Creek Software
Discussion Board




Single point of failure

I was reading Joel's account of installing his colocated server.  He spent extra money on dual power supplies, dual network adapters and hot swappable drives.  Then he goes and pipes this through a single $35 network switch.  If that fails, his server won't be accessible at all and the ability to reboot it remotely will be worth nothing.

Just some friendly advice from another software developer who wears a network admin hat every now and then...

Tim Bond
Wednesday, February 12, 2003

I'm a software developer, not a network guy, but I would say it's one of my "hobbies".  I've NEVER heard of a switch "dying".  The most advanced ones use minimal software to do their job, so I don't see how they would break (other than bad configuration, or Joe Admin tripping over it).

Vincent Marquez
Wednesday, February 12, 2003

We lost one switch once, because it overheated when its fan failed. Fortunately our colocation provider lent us a spare until we bought a replacement (on eBay, of course.)

Leonardo Herrera
Wednesday, February 12, 2003

Isn't this akin to saying "aspirin is cheap, it has no use!"

Prakash S
Wednesday, February 12, 2003

> I've NEVER heard of a switch "dying". 

I'm a programmer too, and by no means a seasoned network technician. But where I work we have had network outages because of switches dying on us. It happens, and those dying switches were not the $35 home-office-style ones. These were big brand name ones.

Patrik
Wednesday, February 12, 2003

The crucial redundancy Joel's bought mostly prevents long hours of restoring development work, I think. Apart from this website, I don't see how bad it would be to lose the net connection once in a while. But it's a risk.

Those 10/100 mini-switches don't have fans... that's not to say one can't fail from overheating. But it was probably spec'ed not to overheat in home office environments. Since Joel's using this teeny-weeny switch in a well designed colo, we can expect a nice cool environment that's well within the switch's operational parameters (low dust/low humidity/low temperature). But yeah, a redundant route is really really useful, eh? It depends on whether there are crucial services that Joel's company or customers need to be always on. If not, then who cares.

The redundancy that's available in that Dell does prevent database corruption and keeps crucial long-running processes from being interrupted. You can restart interrupted backups no prob. But some transactions are better left running even if they can roll back.

Li-fan Chen
Wednesday, February 12, 2003

I think the cheapest failover is to leave another similar baby switch at the colo right next to the working switch, and have the colo technician swap over to it when a watchdog service fails on all ports. A watchdog service is something that has permission to ping all of the NICs connected to Joel's Dell. If one or all of the NICs refuse to respond to pings or watchdog queries, it could be an indication that the switch is broken.
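
A minimal sketch of what such a watchdog might look like, in Python. It is only an illustration under some assumptions: the names `tcp_probe` and `diagnose` are made up for this example, and it substitutes a TCP connection attempt for a real ICMP ping, since raw pings usually need elevated privileges. The probe is passed in as a parameter, so a real ping could be swapped in.

```python
import socket
from typing import Callable, Iterable


def tcp_probe(address: str, port: int = 80, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to address:port succeeds.

    Assumption: the server listens on `port`. This stands in for an ICMP ping.
    """
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(addresses: Iterable[str],
             probe: Callable[[str], bool] = tcp_probe) -> str:
    """Probe every NIC address and summarise the result.

    "ok"       -> every NIC answered
    "partial"  -> some NICs answered (more likely one NIC or cable than the switch)
    "all-down" -> no NIC answered (the switch, or the server itself, is suspect)
    """
    results = [probe(address) for address in addresses]
    if not results:
        raise ValueError("no addresses to check")
    if all(results):
        return "ok"
    if not any(results):
        return "all-down"
    return "partial"
```

Calling something like `diagnose(["192.0.2.10", "192.0.2.11"])` from a machine outside the colo (the addresses are placeholders) would give the technician a reason to try the spare switch whenever it returns "all-down". Note that "all-down" cannot by itself tell a dead switch from a dead server.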

Li-fan Chen
Wednesday, February 12, 2003

We had two switches crap out at the same time about a month ago, but I don't blame the switches.

As it turned out, both of our A/C units in the server room "just stopped" (according to the facilities guy) and the room reached approx 110 degrees(F).  Oddly enough, neither of our temperature alert systems notified us of anything, either.

Thankfully, when the first switch overheated and shut down, my website monitor started screaming at my mobile phone.  So, I'm thankful the switch crapped out.

The moral to this story is....Use an abacus.

Jeff MacDonald
Wednesday, February 12, 2003

Switches die; two network cards and multihoming are cheap.  I've obviously spent far too long in infrastructure work, because it amazes me that this isn't obvious to people...

Rodger Donaldson
Wednesday, February 12, 2003

It is being hosted.  Usually the host is responsible for the network being functional. Your box is locked in a cage.

You can touch it, but no one else can, and if it is a Windows box you can call in and have someone push the button when it goes nuts and won't respond remotely.

But generally for colo:
1.  You are responsible for your box
2.  The host takes care of the network

Crusty Admin
Thursday, February 13, 2003

If you've never even "heard of" let alone seen a switch die then you must be very new to networking.

Robert Moir
Thursday, February 13, 2003

I think a few people are missing the point.

Firstly, I think that this is a production server only, so it is ALL about the uptime.

Secondly, it is not really about the cost of the switch or its reliability.  Those little switches are great; I have never personally known one to break, but all hardware can break, so sometime, somewhere, one will.  I have never personally known a power supply to break either, but I'm sure that they do.  Joel obviously thinks they will too, since he spent the money to get a second one.

Thirdly, and I think that this is the core point, redundancy is a very cheap way to get the chance of system failure down to very low levels.  If the chance of a failure for a component is 1/1000 for some period of time then the chance of two failing in that period is 1/1000000.  The cost of making a single device 1000 times more reliable would likely be much higher.
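
The arithmetic can be checked with a few lines of Python. This is just a sketch: the function name `all_fail_probability` is made up for the example, and it assumes the components fail independently of each other.

```python
from fractions import Fraction


def all_fail_probability(p: Fraction, n: int = 2) -> Fraction:
    """Chance that n components all fail in the same period.

    Assumes each fails with probability p, independently of the others.
    """
    return p ** n


print(all_fail_probability(Fraction(1, 1000)))  # prints 1/1000000
```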

Finally a system is only as strong as the weakest link in the chain and single points of failure are invariably the weakest links.

Tim Bond
Thursday, February 13, 2003

Actually the "Single point of failure" appears to be a reliance on Microsoft products as evidenced by:

The Joel on Software Forum
A public forum for open discussion of topics raised on Joel on Software.

Provider error '80004005'

Unspecified error

C:\WWW\DISCUSS.FOGCREEK.COM\WEBSITE\JOELONSOFTWARE\..\include.asp,

line 367

Crusty Admin
Thursday, February 13, 2003

"If the chance of a failure for a component is 1/1000 for some period of time then the chance of two failing in that period is 1/1000000. "

That is only if the chances are completely independent. For two same model units in the same cabinet this does not seem to apply.
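
A rough way to see how much this matters is a simple common-cause model, sketched in Python below. Everything here is an assumption for illustration: the function name, the model itself, and especially the numbers. With probability `c` some shared event (heat, power, a bad batch) takes out both units at once; otherwise they fail independently with probability `p` each.

```python
from fractions import Fraction


def both_fail_probability(c: Fraction, p: Fraction) -> Fraction:
    """Chance that both units fail in the same period.

    c: probability of a shared event that kills both units (hypothetical)
    p: probability that one unit fails on its own (hypothetical)
    """
    return c + (1 - c) * p ** 2


independent = both_fail_probability(Fraction(0), Fraction(1, 1000))
correlated = both_fail_probability(Fraction(1, 10000), Fraction(1, 1000))
print(float(correlated / independent))  # ratio of the two risks
```

With these made-up figures, a shared cause that strikes only one period in ten thousand makes the double failure roughly a hundred times more likely than the 1/1000000 figure suggests.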

Just me (Sir to you)
Friday, February 14, 2003
