Fog Creek Software
Discussion Board




Colocating: Redundancy

Get a second server.  Even if it is significantly weaker than the big burly box.  If the Big server goes down, have a load balancer set up that can redirect traffic to the smaller one in a pinch.  Hardware fails.  Plan redundancy into the system.
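A minimal sketch of the failover idea above, assuming a priority-ordered list of backends and some health check supplied by the caller. The host names and the `fake_health` helper are hypothetical and only there for illustration; a real load balancer would do this internally.

```python
from typing import Callable, Optional, Sequence, Set


def pick_backend(backends: Sequence[str],
                 is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy backend in priority order, or None if all are down."""
    for backend in backends:
        if is_healthy(backend):
            return backend
    return None


# Hypothetical host names: the big box first, the weaker standby second.
BACKENDS = ["big-server.example.com", "small-server.example.com"]


def fake_health(down: Set[str]) -> Callable[[str], bool]:
    """Stand-in for a real probe (HTTP request, TCP connect, ...)."""
    return lambda host: host not in down


# Traffic stays on the big box until it fails, then moves to the standby.
print(pick_backend(BACKENDS, fake_health(set())))
print(pick_backend(BACKENDS, fake_health({"big-server.example.com"})))
```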

Put your primary email server at the colo and use it as a smart host for your office.  The minor delay in getting messages is much better than them getting dropped when your T1/DSL/OC3 goes down.  Since Joel has seen that happen in the past, it is likely to happen again no matter where they get connectivity from.  This should be a different box from the web server.  Port filter between the email server and the web server.  Keep your network from being crunchy on the outside with a chewy center.

Make it really easy to run security updates.

Turn off all unnecessary services.  FTP is quite hackable.  Run all connections through an SSH tunnel.  Sorry, I don't know how to do this on MS, can someone with Win experience describe?

Plan code updates as part of your maintenance.  Make them easily runnable in batch mode so they can be run at 2AM (low load time) w/o hand holding.
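One way to read "batch mode" is a runner that never prompts and stops at the first failed step, so a scheduler can start it at 2AM unattended. This is only a sketch: the step shown is a harmless placeholder, and the actual update commands and the scheduler entry are assumptions left to the reader.

```python
import subprocess
import sys
from typing import List, Sequence


def run_update(steps: Sequence[Sequence[str]]) -> List[int]:
    """Run each step in order with no interactive input.

    Stops at the first non-zero exit code and returns the exit codes
    of the steps that actually ran.
    """
    codes: List[int] = []
    for step in steps:
        result = subprocess.run(
            list(step),
            stdin=subprocess.DEVNULL,  # never wait on a prompt
            capture_output=True,
            text=True,
        )
        codes.append(result.returncode)
        if result.returncode != 0:
            break  # leave later steps untouched after a failure
    return codes


# Placeholder step; a real list would hold the site's own update commands.
codes = run_update([[sys.executable, "-c", "print('step one')"]])
print("update ok" if all(code == 0 for code in codes) else "update failed")
```

The update succeeded only if every step ran and every code is zero, so a wrapper can compare the length of the result with the number of steps.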

Great article, Joel.

In the Bay Area, check out InReach.  They were pretty good to us. They are in Oakland.

Adam Young
Wednesday, February 05, 2003

Redundancy and reliability are great for mission critical systems, but doesn't load-balancing across 2-3 servers usually cost a freakin' fortune?

Hard to justify for something that isn't high-value mission critical.

Brian Hall
Wednesday, February 05, 2003

These are all good ideas! If you're looking for a sysadmin, hire Adam.

We'll have a standby backup server (using the old hardware) at the office -- I want geographical diversity because downtown New York City *has* been known to be unavailable for weeks at a time. But that increases the time to go live again in the event of a failure.

Speaking of which... what do people use for their DNS TTLs? I've been using 1 or 3 days, but that means if an IP address changes it takes a while to get through to the world, which can be bad if an IP address must change due to some catastrophe.

Joel Spolsky
Wednesday, February 05, 2003

I hadn't thought of this before, but one thing you could do is use DNS round robin to deal with the TTL.

Say you have three locations:  SF, Denver, NY.  You put each as a DNS entry and each gets 1/3 the traffic.  Big problem in NY, and only 1/3 of the people have an invalid DNS entry.
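The 1/3 figure can be checked with a toy model. The sketch below assumes strict rotation through the three entries and ignores resolver caching, which in practice makes the split less even.

```python
from itertools import cycle, islice
from typing import Sequence, Set


def failed_fraction(sites: Sequence[str], down: Set[str], lookups: int) -> float:
    """Fraction of lookups answered with a site that is currently down,
    assuming answers rotate evenly through `sites`."""
    answers = islice(cycle(sites), lookups)
    return sum(1 for site in answers if site in down) / lookups


SITES = ["SF", "Denver", "NY"]

# With NY out, roughly one lookup in three gets a dead address.
print(failed_fraction(SITES, {"NY"}, 3000))
```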

We just accepted the delay since for the most part, it was out of our hands.  When you make a DNS change, you can force a push of the info at that time, but it still takes time to propagate to everyone's DNS servers.  You could probably trial and error this with an alternative domain name (You did register joelonsoftware.net, .biz, and .org right?)

BTW, I am actually a coder, and only ended up with SysAdmin duties because I didn't have a Sys admin.  But thanks for the Vote of Confidence.  Necessity is a mother.

Adam Young
Wednesday, February 05, 2003

>>
Redundancy and reliability are great for mission critical systems, but doesn't load-balancing across 2-3 servers usually cost a freakin' fortune?
<<

Whether or not it costs a fortune depends on the route you take. We use Turbo Cluster from Turbo Linux, but are about to go with the Red Hat LVS approach as it seems Turbo Linux has dropped off the face of the planet.

Anyways, we have a heterogeneous cluster in the backend working through redundant cluster managers and multiple firewalls. The cost was 1995.00 dollars for 10 boxen not including the cluster managers. That's not a bad cost at all considering that at the time, I wasn't comfortable with the LVS project.

Cheers,
BDKR

BDKR
Wednesday, February 05, 2003

<quote>
Get a second server.  Even if it is significantly weaker than the big burly box.  If the Big server goes down, have a load balancer set up that can redirect traffic to the smaller one in a pinch.
</quote>

It's easy to stick a load balancer in-front of a bunch of servers. It's much more difficult to make sure all servers have the exact same data at any time.

Jeff
Wednesday, February 05, 2003

Hardly difficult. If you've got the time and budget to play with load balancers you've got the time and budget to buy in a product or write a script that updates all the servers in a web farm with content added to the one server you designate as the "check-in" server where developers add all new content.
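A rough sketch of such a script, assuming rsync over SSH is available on the check-in server and that every web server uses the same destination path. The host names and paths are hypothetical.

```python
import subprocess
from typing import Callable, List, Sequence


def rsync_command(source_dir: str, host: str, dest_dir: str) -> List[str]:
    """Build the rsync call that mirrors source_dir onto one web server."""
    # The trailing slash makes rsync copy the directory's contents.
    source = source_dir.rstrip("/") + "/"
    # -a keeps permissions and times, -z compresses,
    # --delete removes files that no longer exist on the check-in server.
    return ["rsync", "-az", "--delete", source, f"{host}:{dest_dir}"]


def push_content(source_dir: str, hosts: Sequence[str], dest_dir: str,
                 run: Callable = subprocess.run) -> List[str]:
    """Push content to every host; return the hosts whose sync failed."""
    failed = []
    for host in hosts:
        result = run(rsync_command(source_dir, host, dest_dir))
        if result.returncode != 0:
            failed.append(host)
    return failed


# Hypothetical farm: prints the command that would run for one server.
print(rsync_command("/srv/checkin/site", "web1.example.com", "/var/www/site"))
```

Passing `run` as a parameter keeps the function easy to dry-run without touching real servers.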

Robert Moir
Thursday, February 06, 2003

I'd like to stress the difference between load balancing and failover.

Load balancers are still expensive (they were VERY EXPENSIVE 3 or 4 years ago).

I hope it's because the load-balancing strategies (giving weight to one machine depending on its hardware, actual load, etc ...) are complex to implement.
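To give a feel for what such a strategy involves, here is a deliberately simple sketch: each server gets a weight for its hardware, and the next request goes to the server with the fewest active connections per unit of weight. The server names and numbers are made up, and commercial products use more elaborate rules.

```python
from typing import Dict, Tuple


def least_loaded(servers: Dict[str, Tuple[int, int]]) -> str:
    """Pick the server with the lowest active_connections / weight ratio.

    `servers` maps a name to (weight, active_connections); a higher
    weight means stronger hardware.
    """
    return min(servers, key=lambda name: servers[name][1] / servers[name][0])


# The big box is rated four times the small one.
print(least_loaded({"big": (4, 8), "small": (1, 3)}))   # big: 8/4 < 3/1
print(least_loaded({"big": (4, 16), "small": (1, 3)}))  # small: 3/1 < 16/4
```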

I believe we must be very careful when speaking about high availability, because when you start hunting SPOFs (single points of failure), the cost grows incredibly fast.

When you realize 15 minutes off is acceptable for your system, which is true for nearly EVERYONE including amazon.com (if I want to buy something and the site is down, I'll just try later), you begin to get the point.

On the other hand, it's impossible to get totally fail proof. I assume some South Korean sysadmins know exactly what I mean.

Having a site offline is bad, no doubt. But there are easy and cheap things you can do to handle it.

- RAID disks (preferably hardware).

- double power supply.

- A good relationship with the NOC crew of your colocator.

- Good security practices.

Concerning this last point, the use of SSH or Terminal Server through a secured channel (IPSEC or RDP encryption) can be enough.

Ralph Chaléon
Thursday, February 06, 2003

>It's easy to stick a load balancer in-front of a bunch of servers. It's much more difficult to make sure all servers have the exact same data at any time.

Not really.  Use some basic servers and mount  a couple of disk packs onto them (one for all the live data, second one is for redundancy), then force all changes to be made through a function which will write them to both disk packs.  Or even just do a periodic rsync, since the second disk pack is backup data, not live data.
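A minimal sketch of the "write through one function" idea, assuming the two disk packs are mounted as ordinary directories. The temporary directories below stand in for the real mount points.

```python
import tempfile
from pathlib import Path


def write_to_both(relative_path: str, data: bytes,
                  live_root: Path, backup_root: Path) -> None:
    """Write the same bytes under both roots, creating directories as needed."""
    for root in (live_root, backup_root):
        target = root / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)


# Temporary directories stand in for the two mounted disk packs.
with tempfile.TemporaryDirectory() as live, tempfile.TemporaryDirectory() as backup:
    write_to_both("site/index.html", b"<h1>hello</h1>", Path(live), Path(backup))
    same = (Path(live) / "site/index.html").read_bytes() == \
           (Path(backup) / "site/index.html").read_bytes()
    print("copies match:", same)
```

This does not handle a failure between the two writes; the periodic rsync mentioned above would cover that case.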

JP
Thursday, February 06, 2003

Coming from the land of hardware load balancing and web server redundancy to the land of 'how can we get balancing and redundancy on the cheap' I was happy to find a reliable and workable solution (on linux).

Become the authoritative dns for your domains (you resolve all requests) and run djbdns ( http://cr.yp.to/djbdns/ ) instead of bind and use a small TTL. Use at least two web servers (with djbdns on each, primary and secondary) and your web server of choice on each. Set up djbdns to round-robin your dns requests to your various web server instances (w/inexpensive dual nic card servers you could run 4 instances total of your web server of choice, each individually IP'd).  Add your 'web server health monitor of choice' and instruct djbdns to stop sending requests to afflicted servers ( http://cr.yp.to/djbdns/balance.html ). Add this to as much redundant hardware as you can afford (switches, routers, firewalls, etc.).

You now have effective load-balancing which is very similar in concept to a Cisco hardware solution (LocalDirector w/separate 'nanny') but much easier to manage. An aside, anyone running Bind should have their head examined. djbdns is much more robust ( http://cr.yp.to/djbdns/blurb.html ), secure, lightweight, guaranteed ( http://cr.yp.to/djbdns/guarantee.html ), quite easy to administer, virtually bug-free, and financially free.
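The health-monitor step can be sketched as regenerating the address records so that only healthy servers are listed. This assumes the tinydns-data `+fqdn:ip:ttl` line format for plain address records; the domain, the addresses, and the health check are placeholders, and writing the data file and rebuilding it are left out.

```python
from typing import Callable, List, Sequence


def tinydns_a_records(fqdn: str, servers: Sequence[str],
                      is_healthy: Callable[[str], bool],
                      ttl: int = 60) -> List[str]:
    """Return one '+fqdn:ip:ttl' line per healthy server."""
    return [f"+{fqdn}:{ip}:{ttl}" for ip in servers if is_healthy(ip)]


# Documentation-range addresses; 192.0.2.11 is pretended to be down.
SERVERS = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]
for line in tinydns_a_records("www.example.com", SERVERS,
                              lambda ip: ip != "192.0.2.11"):
    print(line)
```

The short TTL matters here: a dropped server keeps receiving traffic from caches until the old records expire.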

~dave

daglenn
Thursday, February 06, 2003

Argh, djbdns homepage is http://cr.yp.to/djbdns.html .

~dave

daglenn
Thursday, February 06, 2003

Load balancers aren't just expensive; many of them are extremely immature; for example, you'll read the glossies that claim they can route traffic based on load on the system, but when you unwrap them, you find they go horribly wrong and you end up round robin.  Then you discover that they have all sorts of conniptions with cookies and SSL.  Oh, the pain of load balancers.

Which isn't to say they aren't getting better, but you really need to try before you buy; my allergy is based around a number of clients who've had more outages caused by load balancers than resolved by them.  Moreover, load balancers don't solve the problem of requiring geographical diversity - put a load balancer in New York rotating between Canada, New York, and San Fran, and all the traffic goes to the load balancer.  Which is now your single point of failure.

So you'll still want DNS with a short TTL when your load balancer fails; you'll still want DNS with a short TTL when a whole datacenter fails.  And unless you're prepared to spend a fortune on equipment, testing, and configuration, DNS is a win for most sites.

Finally, load balancing across DR and Production sites is really attractive, but it does carry a risk - it's very easy to end up relying on your net capacity, and when one centre goes offline, you suddenly discover your performance from the other system(s) is unacceptably slow.
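The capacity risk is easy to put numbers on. The sketch below assumes identical sites with load split evenly among them; the 60% figure is only an example.

```python
def utilization_after_failure(per_site_utilization: float, sites: int,
                              failed: int = 1) -> float:
    """Utilization of each surviving site once the failed sites' load is
    spread evenly over the rest (1.0 means fully saturated)."""
    survivors = sites - failed
    if survivors <= 0:
        raise ValueError("no surviving sites to carry the load")
    return per_site_utilization * sites / survivors


# Three sites at 60% each: losing one leaves the other two at 90%.
print(f"{utilization_after_failure(0.6, 3):.0%}")  # 90%
# Two sites at 60% each: losing one asks the survivor for 120%.
print(f"{utilization_after_failure(0.6, 2):.0%}")  # 120%
```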

Oh, and make sure when designing the system that you can come out of DR.  I'm aware of one company whose core systems replicate from production to DR, but in practise, their DR is a waste of money.  Why?  They have no idea how to get transactions in the DR environment back to production.  So they... never invoke DR for the core system.  Not a problem with a common or garden content site, but a bit more important for financial systems...

Rodger Donaldson
Thursday, February 06, 2003

There are two types of load balancing.  Hardware load balancing is an expensive option and I wouldn't recommend it unless it is critical your site has 0% downtime.

An inexpensive alternative is software load balancing.  For those of you using Microsoft Windows 2000 servers, the Advanced Server platform offers a built-in Network Load Balancing (NLB) service.  We've been using it for 3 years; actually we started with the free NT add-in WLBS and then switched to NLB with 2000 Advanced Server.

It has worked flawlessly and our reports do show a 50/50 split between connections.  When a user requests a page from your site NLB will actually serve text and images from both servers based on network and CPU utilization.  This can become an issue with secured sockets but can easily be handled by forcing an IP address to one of the two servers when using HTTPS.  It's as simple as placing a checkmark in the options.
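The idea of pinning a client IP to one server can be illustrated with a small hash-based sketch. This is not a description of how the Windows service works internally; the server names are hypothetical and the hashing scheme is just one simple choice.

```python
import hashlib
from typing import Sequence


def server_for_client(client_ip: str, servers: Sequence[str]) -> str:
    """Map a client IP to one server so that repeat requests from the
    same address always land on the same box."""
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


SERVERS = ["web1", "web2"]

# The same address always gets the same answer.
print(server_for_client("203.0.113.7", SERVERS))
print(server_for_client("203.0.113.7", SERVERS))
```

Keeping an HTTPS client on one server avoids renegotiating the session on a box that has never seen it.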

The benefits are obvious for such a small price point: 0% downtime for server upgrades; traffic split between 2 or more CPUs, which translates into purchasing much cheaper servers for your web site; and, should your site grow to a point where your servers can't handle the load, you simply add another.

Harvey
Monday, February 10, 2003

My feeling is this configuration is asking for trouble.  You've got your web server going through a $35 linksys switch which happens to be the same one I use on my desk.  These things aren't anywhere near being reliable 24x7 equipment.  Just think, one bad solder in the power supply and poof, you just lost everything.  At bare minimum you want a switch with dual power supplies.  HP makes a decent switch ( http://www.hp.com/rnd/products/switches/index.htm ) for this application.  Not the best, but not bad.

Sure this is fine for non critical cheap stuff, but since you are trying to bill yourself as a software guru, you might concern yourself with availability.  I am actually dumbfounded that you were running off your office T1 before.  Is this something you want to actually admit?

I'm trying to get this right but it sounds like you are running unfirewalled using only the Win2k filtering capabilities.  I wouldn't exactly recommend that configuration.

Having one web server is not sufficient.  Try remotely patching IIS or MS SQL on a live server.  Fun stuff.   

I think you'll be lucky to make it a month without revisiting the site.

web service developer/admin
Saturday, February 22, 2003
