Fog Creek Software
Discussion Board




I'd like to divert your attention ...

... for a moment and give Joel a break from the berating he is taking today. 

I'm working on a series of articles on network availability through vendor diversity.  I'm gearing it to an audience I've never written for (IT and network managers).  In the past there have been good, constructive criticisms of other people's articles here. 

I was hoping I could ask the same of you all.

The first in the series is available at

http://www.baus.net/archives/000051.html

Thanks for your help...

christopher baus (www.baus.net)
Monday, January 26, 2004

Great article.  This should get a bold.

One thing that is missing from the article is the idea of risk analysis.  It is usually more expensive to set up and maintain a multiple vendor site than a monoculture site. Does that cost balance with the expected risk and cost of an outage?  The same risk analysis that justifies a redundant system may not justify a multiple-vendor redundant system.
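
To make that concrete, here is a back-of-the-envelope version of that risk analysis in Python. Every dollar figure and probability below is an invented placeholder, not real data; the point is only the shape of the arithmetic:

```python
# Toy risk analysis: single-vendor vs. multi-vendor redundancy.
# All numbers are made-up assumptions for illustration only.

def expected_annual_loss(p_outage, cost_per_outage):
    """Expected yearly loss = probability of an outage * cost of one outage."""
    return p_outage * cost_per_outage

cost_per_outage = 200_000   # assumed revenue lost per major outage

mono_loss    = expected_annual_loss(0.05, cost_per_outage)  # one vendor
diverse_loss = expected_annual_loss(0.01, cost_per_outage)  # two vendors

extra_ops_cost = 6_000      # assumed extra training/maintenance for vendor #2
net_benefit = (mono_loss - diverse_loss) - extra_ops_cost

# Diversity only pays off if net_benefit > 0 under YOUR numbers.
```

With slightly different assumptions the same arithmetic favors the monoculture, which is exactly the risk-analysis step the article skips.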

To go back to the NASCAR analogy, it's like being stocked up with both Japanese and American parts.  If it turns out that the American parts have a systemic failure, well, no problem ... we could switch over to the Japanese parts, right?  Even though they use metric units and we only have USA tools ...

Now, I understand that in your line of work (network administration), routers tend to use open protocols and are more or less interchangeable, with a lower training cost for each new vendor's product.  Contrast this to, say, designing a system that can use Oracle or MSSQL or MySQL (for cheapie people) on the back end -- this will take a bit of effort.

Also, keep in mind that in some cases, using multiple vendors doubles your security target profile, not halves it -- there are more possible unknown bugs that could be exploited.  Additional complexity may also make it harder to validate that a certain network is secure.

Alyosha`
Monday, January 26, 2004

It's an interesting problem and well described in your article. However, I'm not sure I agree with your conclusion. I'm reminded of the Mark Twain quote: "Put all your eggs in one basket. And watch that basket!"

Also, are you sure that the Shuttle computers are differently designed? I assumed they were identical, running identical programs, and that the redundancy was to prevent hardware failure, not logic problems.

pdq
Monday, January 26, 2004

"Put all your eggs in one basket" would be the Warren Buffett method of system administration. 

I am recommending the John Bogle method of system administration. 

You don't know where the security flaws will be (or, in John Bogle's world, what the winning stocks will be). 

In my opinion, "watching the basket" takes more resources, and at the very least involves being intimate with the source code.

christopher baus (www.baus.net)
Monday, January 26, 2004

Another example, which may be easier to understand, would be using IIS in conjunction with Apache.

If a flaw is found in IIS, at least your Apache servers can remain live while the IIS systems are patched.  Now you are not putting all your eggs in the Microsoft basket, but as one of their smaller customers, all the kicking and screaming probably isn't going to get Microsoft to move very much.

christopher baus (www.baus.net)
Monday, January 26, 2004

I guess it really depends on how critical uptime really is. If it's really, really critical, like people are going to die if your server goes down, then having different types of software would be prudent. However, there is an increased cost and some additional risk with the added complexity.

I'm a software guy so all my instincts say to keep it as simple as possible. The more complexity the more chance for error. I recognize that this doesn't match up well when security is involved.

What would be nice would be some case studies or statistics that show that having duplicate software costs X times more, but is Y time more reliable.
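
Lacking real statistics, the reliability side of that trade-off can at least be sketched, assuming each vendor's flaws strike independently (a big assumption, given how much code is shared across vendors). All figures here are invented for illustration:

```python
# Probability that a redundant pair is taken out by a software flaw.
# p_flaw is a hypothetical per-vendor, per-year figure.
p_flaw = 0.02

# Monoculture pair: one logic flaw hits both units at the same time,
# so redundancy buys nothing against that failure mode.
p_both_down_mono = p_flaw

# Two-vendor pair: a correlated outage needs independent flaws in both.
p_both_down_diverse = p_flaw * p_flaw

improvement = p_both_down_mono / p_both_down_diverse  # roughly 50x here
```

The cost side (the "X times more" half of the question) is much harder to model and is exactly what third-party case studies would have to supply.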

pdq
Monday, January 26, 2004

Unscheduled downtime for web servers and/or routers can be VERY expensive for a certain class of users.  For internet backbone providers, e-commerce vendors, and financial trading network providers, it could be the difference between being in the red or in the black.  I suspect the Cisco flaw from last summer cost providers such as Sprint millions of dollars. 

I feel it is time that network managers insist that HA options work seamlessly between vendors.  In my next article I argue against closed, proprietary HA protocols such as HSRP.

christopher baus (www.baus.net)
Monday, January 26, 2004

Stylistic gripe (because I don't have time to actually read the content): lose the colons after the headings. The fact that they are big and bold and separated by blank lines above and below already makes it quite obvious that they are headings. The colons are redundant, and after about the third one, very annoying.

Martha
Monday, January 26, 2004

Agree with Alyosha.  While this sounds good in theory (given infinite time and infinite money), there hasn't been any analysis in deciding if it's actually *worth* doing. 

Crimson
Monday, January 26, 2004

Agreed.  This article presents the theory, which I haven't seen before.  The biggest problem is that Cisco has a patent on HSRP and their routers do not play nice with others in terms of HA. 

I am hoping that articles such as these start to apply pressure to vendors so they are forced to play nice. 

What needs to happen is for third-party labs to verify that the cost is worth pursuing.  This is the basis for my second article.

I also feel this could be a boon to smaller players such as Juniper.  This could be fodder for their marketing folks, who can't get a foothold in Cisco-only shops.

Outside the routing domain, I think this could make a lot of sense for HTTP servers, where load balancers can be used for failover.

I'll remove the colons.

Thanks.

christopher baus (www.baus.net)
Monday, January 26, 2004

This is probably the best for your company (fine with me), but it sounds slightly unproven that you need to mix software from multiple vendors.
I don't know this area, but I would not be surprised if you need exactly (normal number of units) x (number of vendors) units to provide a working solution. Or you will have to test load balancing when devices from multiple vendors work together. If that is required, I guess the only reasonable solution for vendors would be to share the same code :)


E.g., imagine a service that needs a cluster of 2 web servers. So, as you recommend, use IIS and Apache behind load balancers from several companies: I would need (2-server cluster) * (vendors of HTTP servers (IIS, Apache)) * (2 different load balancers). So you need to support 8 servers 24x7 instead of 2, just in case something goes wrong... It definitely will be worth doing in some cases...

WildTiger
Monday, January 26, 2004

Yes, that's what you need.  This wouldn't be a problem if vendors interoperated on HA as well.  They don't.  I want to make the case that they should, and this is the reason.

In my opinion, running a load balancer without a hot standby is a really bad idea.  Your network has one nasty single point of failure.  Sure, any web server could fail and all is well, but if the load balancer fails?  Well, good night. 

Plus, network gear is getting cheap.  8 boxes will soon be nothing.  Google runs some 10,000. 

Any significant egress is going to have two routers, likely running HSRP or similar.  Right now they both come from Cisco.  Cat 6500s are common for medium-sized loads. 

There is no reason one of these couldn't be an HP ProCurve.  HP's command language is even very similar to Cisco's.  I believe the biggest problem is that the Cisco sales guy will talk you out of it, because they already have an HA setup.

That is, of course, until both of those routers go down because of a software or security flaw.  SLAs are missed, and money is lost.  Again, this matters in certain markets.
 

christopher baus (www.baus.net)
Monday, January 26, 2004

Actually, looking at your calculation, I am only recommending 4 boxes...  Two load balancers and two servers. 

christopher baus (www.baus.net)
Monday, January 26, 2004

Actually, I am surprised at the concern about running different OSes.  Consider what a typical server cluster looks like: Cisco switch, F5 load balancer, Linux, Sun, and Windows servers.  There are already a bunch of different vendors in there. 

They currently just do not interoperate in HA.

christopher baus (www.baus.net)
Monday, January 26, 2004

There is also the problem of code reuse.  Just because you get a different vendor doesn't necessarily mean you are getting a different set of flaws. 

I can't remember the recent (past couple of years) instance in which 90% of people discovered they were vulnerable to a particular problem because their code was based on a flawed example.  I think it was probably either bind or ssh, but it's irrelevant.  It can and does happen.  Not unlike the current situation with cars, or home appliances, a lot comes down to badging.

Another problem occurs when a particular class of bug is 'discovered'.  There was much fun when the format string class of bugs became popular.

It's obvious that you realise this; it's just that you don't explicitly mention it.

Colin Newell
Tuesday, January 27, 2004

Yeah, I thought of that. One problem is that a lot of code is inherited from the BSD code base. 

christopher baus (www.baus.net)
Tuesday, January 27, 2004

In the end it is always a cost versus benefit consideration.
But there is more:

First: You must stress that we need to be talking about independent >redundant< systems.
Every one of the subunits in the redundancy set should be able to take over the >whole< job from its siblings. If not, then you have just multiplied your attack surface without gaining any benefit at all. Worse, you have increased the probability of exploits by adding those of system B on top of those of system A, and when either one is down the whole operation still grinds to a halt. Bad!

Second: Be very sure that the redundant diverse systems are truly diverse, and not just skin-deep diverse! This is not as easy as it sounds. The systems from different manufacturers/publishers might have more in common than you realize. In the extreme, you just get an identical system in different packaging. And here lies a snake in the grass with regard to standards and standardization: while standardization enables more exchangeability, it at the same time >reduces< true diversity! Often, beyond the pure standard protocols and APIs (which can have security weaknesses in themselves), there are reference implementations, public code, etc.
In the extreme, some development cultures (e.g. Open Source Software) very actively promote the sharing of implementations. This is why, when vulnerabilities are exploited in certain subsystems, you see a long list of different OSes requiring patches, because >they are all running identical code<. In the past I have referred to OSS as the only software ecosystem that promotes monoculture by design, rather than just as a byproduct of market success.

Third: The more diverse your redundant systems, the higher the cost of that diversity. Anything but the most basic systems are not "black box, plug, play and forget". Systems require operations and maintenance. The more diverse the systems, the more diverse the skill set needed to provide the tender loving care. E.g. in development: every tool that increases cross-platform code reuse decreases diversity. This is where the costs can rapidly start to rise out of control.

While diversity can be a nice abstract pattern, the realities are such that beyond the most basic of operations, the costs of true diversity will rapidly become prohibitive.  While the costs of "skin-deep" diversity can be lower, it achieves that by compromising the very point that was the reason for its deployment.
So, for systems that carry negligible acquisition, development, operations, and maintenance costs, truly diverse redundancy is an interesting option that can increase availability. If any of the costs in the equation are non-negligible, then a careful cost/benefit analysis has to be carried out. I suspect that unless you are in a very special business with extreme requirements, in most cases the result will favor a "pick one system and run it as best as you possibly can" approach.

Just me (Sir to you)
Tuesday, January 27, 2004

Great article, but I do think you need to put in some notes about checking that the code base is different in both cases. It's the same problem you can get with redundant ISP links: you may think you have two different links to two different ISPs, but at some point the network may all go over the same local Bell company's wires. It may once upon a time have been separate, but some smart engineer has now merged the two cables you used to have into one on the new super-duper concentrator.
There are probably several TCP/IP stacks around, but most of them are based on BSD. (I'm not sure the Windows one IS. There is an apparent myth that this is true, but I think it's just that the header files have BSD copyrights for some of the constant names. Anyone confirm or deny this?)
Also, one final problem area that you don't touch upon is when two redundant systems have a flaw in the underlying protocol which lets them talk to each other, with a failure mode where they both DoS each other.
Hopefully this last point is covered by proper engineering on the part of the protocol designers, but I have seen it happen.

Peter Ibbotson
Tuesday, January 27, 2004

NASA used to do this all the time.  Two (or more) separate teams would independently develop software for guidance control, etc.

If one system failed on a particular input due to some obscure bug, the likelihood of the failover hitting the same bug would be very small.

Why did they stop?  It was really, really expensive.

MR
Tuesday, January 27, 2004

MR, that's my point.  You can get it for free, just by buying gear from multiple vendors -- assuming, yes, that vendors don't share code like the BSD or Linux source.

I think developers should be in favor of this, as it effectively doubles the number of engineers involved at a network point, either by having two vendors involved, or one vendor that claims to use two disparate teams. 

Just me (sir to you),

I understand your point in regard to running Apache and IIS at the same time.  You could argue that there is a larger attack surface, although I feel the term "attack surface" is very confusing, and is often used as a way for network security consultants to communicate with the PHBs that hire them. 

"Mr. CIO, your attack surface is huge!  For a mere $500,000 I could reduce that for you, saving you millions.  Look at the ROI!"

If you are running a protocol such as HSRP or VRRP, you have one live device and one hot standby that is idle.  Your "attack surface" is identical to running one device without a hot standby.
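
The idea can be sketched in a few lines. This is only the concept of active/standby failover, not real HSRP or VRRP, and the vendor names are placeholders:

```python
# Active/standby pair: only one device holds the virtual address at a
# time, so the exposed surface at any instant is a single box, while a
# vendor-specific flaw in the active unit lets the other-vendor
# standby take over.

class Router:
    def __init__(self, vendor):
        self.vendor = vendor
        self.healthy = True

def active_router(primary, standby):
    """Return whichever device should currently answer traffic."""
    if primary.healthy:
        return primary
    if standby.healthy:
        return standby
    return None  # both vendors hit at once: total outage

primary = Router("Cisco")
standby = Router("Juniper")

assert active_router(primary, standby).vendor == "Cisco"    # normal operation
primary.healthy = False                                     # flaw hits vendor #1
assert active_router(primary, standby).vendor == "Juniper"  # diverse failover
```

A correlated flaw only produces the `None` (total outage) case when both vendors are hit at once, which is exactly the scenario vendor diversity is meant to make unlikely.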

I'm working on updating the paper based on all the good input I've gotten here.  I plan on publishing the finished product to my company's web site.

christopher baus (www.baus.net)
Tuesday, January 27, 2004

I'm a coder but not an IT person; I found the article very informative and understandable.  Thanks.

One note -- I assume you are going to have this proofread before live publication?  I saw several typos/word errors like these, in just a quick scan-read of the article:

"No single vendor provided high availability can reliably handle critical and inevitable logical flaws "

"The crux of the problem lies in the difference types of failures"

"For instance, if a car is not well of balanced"

Biotech coder
Tuesday, January 27, 2004

Thanks for the suggestions, everyone. 

I've been sending out feelers to get a wider distribution for the article.  Any suggestions?

christopher baus (www.baus.net)
Wednesday, January 28, 2004

Christopher,

you are very right, and this was exactly the first point I was trying to explicitly state: you need full redundancy. In the Apache/IIS example: if you have a hot standby, preferably with automatic failover, then that is >a good thing<. You have reduced the probability of a failure cascade and you have not increased your attack surface (I am sorry if this qualifies as PHB lingo; it captures the general concept nicely). If, however, you need the two systems to handle the load, and you are just load balancing between them (both live online), then that is >a bad thing<, since you have now increased your attack surface, and when the weakest link goes down you are still off the air.
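
The distinction can be put in numbers, with failure probabilities invented purely for the sake of the example:

```python
# Two servers running different software; assumed independent odds
# that each is knocked out in a given year.
p_a, p_b = 0.03, 0.02

# Hot standby (either box alone can carry the whole load):
# the service is down only when BOTH fail.
p_down_standby = p_a * p_b                         # 0.0006

# Load-shared with both required for capacity:
# the service is down when EITHER fails.
p_down_both_required = 1 - (1 - p_a) * (1 - p_b)   # ~0.0494

# Under these assumptions, the full-redundancy setup is roughly
# 80x less likely to be off the air.
```

The same pair of boxes is either >a good thing< or >a bad thing< depending solely on whether one of them can carry the whole load.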

Just me (Sir to you)
Wednesday, January 28, 2004

Ok I see your point.  Sorry about the PHB rant. 

What if the hot standby is running different software?  Maybe I need to make that point.  The hot standby doesn't necessarily have to be from the same vendor as the live box. 

The biggest problem right now is the use of HSRP.  Cisco has a patent on it.  HP and others seem to be ignoring this by implementing their own protocol, XRRP, which is basically VRRP. 

I think customers should put pressure on Cisco to make their router failover work with other vendors such as Juniper or HP. 

christopher baus (www.baus.net)
Wednesday, January 28, 2004

The point I'm trying to make in my next article is that the Cisco patent is hurting Cisco's own customers.  They are strong-arming them into buying hot standbys from Cisco, when they might be better served with a hot standby from Juniper.

christopher baus (www.baus.net)
Wednesday, January 28, 2004
