Fog Creek Software
Discussion Board




SQL Server hot backups in a different City

For you reliable SQL Server mavens out there... Suppose I wanted to build a Win2K-based web service using SQL Server to store the data. But I'm a reliability nut. So obviously I'll use industrial strength servers with RAID, two power supplies and network cards, etc, and they'll live in secure colocation facilities.

To further minimize failure points, I'll have a hot backup. But the twist is that I figured as long as I'm paying for a hot backup, it would be more reliable if it was somewhere else, say, on the other coast.

So here's the plan I'm working on. Server A in New York, with IIS and SQL Server. Server B in Vancouver, with IIS and SQL Server. Server A is somehow "writing through" any database changes to server B. I know I can do this with transaction log shipping; is this a good way to do it? Is there a better way?

Then if Server A blows up, I simply ask my ISP to route the packets intended for Server A to Server B. (I assume they can do this if it's their backbone).

Ideas? Suggestions? Comments?

Joel Spolsky
Friday, November 7, 2003

Sorry, I'm not a SQL Server expert, but we have ours clustered, so that if one goes down, the other one picks up the load. They're right next to each other in a computer room. I don't know how far apart you can cluster SQL Servers.

My apologies if I'm just showing my ignorance of the whole matter; I'm better (maybe not much) with Oracle questions.

ko
Friday, November 7, 2003

I haven't tried it yet, but this product claims to do what you want:

http://www.xosoft.com/products/f_WANSyncSQL.htm

Damian
Friday, November 7, 2003

I wonder how much it costs; there are no prices on the web site.

Matthew Lock
Friday, November 7, 2003

Ahh, I knew there was another one I had seen...

http://www.incepto.com/

Damian
Friday, November 7, 2003

You can have a SQL cluster through different networks using VPNs on Windows 2000 / Windows Server 2003. I would highly recommend using WS2K3 instead of Win2K. It's much more reliable and runs SQL Server and IIS faster.

Robert McLaws
Friday, November 7, 2003

If you're writing the front end, why not just maintain two database connections?

Make sure that every time you need to store something, you do the operation
on both connections. (What if you fail on one, but not on the other? Dunno. There might be an obvious answer to this in the nature of the application.)

And you also make sure that you only read on the local database connection, rather than the other one.

And rather than getting one ISP to reroute traffic, you could just have a very
short refresh interval on your DNS zone - and get something to automatically update your website's A record when your primary server goes down.

(Or, if I'm allowed to suggest an alternative configuration, try Erlang's mnesia database, which handles all this sort of thing quite transparently in the API. Of course, it's an entirely different story, because it's a non-SQL database.)

Greg Baker
Friday, November 7, 2003

Evidently, log shipping requires SQL 2000 Enterprise Edition.  I hope you're making a ton of money selling your wares - or maybe you still get that juicy discount at the MS Employees Store? (http://www.microsoft.com/sql/howtobuy/default.asp) ;-)

Seriously though, if you can't afford SQL 2K EE, it might be wise to examine vanilla SQL 2000 replication.  If that doesn't work for you, I might suggest modifications at the application layer: create a facade that impersonates your database component(s) and in turn serializes the calls over some sort of message-oriented middleware like MSMQ.  You then have a listener on each server that receives and processes the call (as a queued message) in a single COM+ transaction.

Then again, maybe log shipping is the way to go. ;-)

Ankush Narula
Friday, November 7, 2003

"I assume they can do this if it's their backbone"

I think you should find out, first. And if the same company is running both, isn't that in itself a danger?


Friday, November 7, 2003

>If you're writing the front end, why not just maintain two database connections?
>
>Make sure that every time you need to store something, you do the operation on both connections. (What if you fail on one, but not on the other? Dunno. There might be an obvious answer to this in the nature of the application.)

The drive, the network and the database will all introduce significant lag.

The drive? Because you are running RAID (especially RAID 5), a write that finishes in no time on one server can take a lot longer on the other.

The network? Japan has much higher latency to a New York web/app server than to a Canadian one, and New York is only a bit lower than Vancouver. Add the two lags together and there's a good chance it averages out to higher update and insert times overall in ADO/ADO.NET.

There's also no guarantee that the two databases are free of locks at the same time.

Sure, do a test; I am just stating the obvious. And it might not be fun writing a thin wrapper between your application logic and the actual ADO calls to fan DML out to all available servers. Probably better to let SQL Server do the near-live transaction synchronization for you.

Li-fan Chen
Friday, November 7, 2003

The auto-failover routing thing is best done in the application or in DNS, depending on your timeframe for failover.

Doing routing changes to accommodate failovers tends to be "interesting" on cross-coast setups.  When you're failing over to a machine one port over, it's a bit simpler.

Just have your software deal with slaves and auto-promotion to master in the event the current master falls off the planet.  Then just order your slaves.

David A. Ulevitch
Friday, November 7, 2003

My hosting service can do what you request (www.redundant.com).  But I've chosen to push the redundancy up a level in the abstraction stack and handle it myself.

I hand off a proxy to our clients.  It asynchronously checks the status of our main site and our backup site.  They are maintained by separate companies, on separate networks, and in different states.

If traffic can't be routed to one server farm for ANY reason, it automatically fails over to the second.  Not only do I have a hot standby server, I've got a hot standby cluster in a totally different location.  For cheap I might add.

This is good enough for major financial service firms and again it is dirt cheap compared to paying the host to do it with something like network peering.

christopher baus (www.baus.net)
Friday, November 7, 2003

I might add that most of our downtime comes from our service providers, not from our servers failing.  That is one reason I don't trust this "peering, automatically reroute traffic" stuff.

The one time our provider needed their routing failover to work (when that huge Cisco bug came out earlier this year), guess what: it didn't work.

Oh, keep in mind you'd best have a VPN between these two sites if you plan on shipping data over the internet.  Leaving those replication ports facing the internet could make some fun for the script kiddies.  (Slammer, anyone?)

christopher baus (www.baus.net)
Friday, November 7, 2003

In the last 2 years, 3 of our RAID controllers have failed.

n/a
Friday, November 7, 2003

Actually, we had a RAID controller go in one of our Dell rackmount database servers.  Lost all the data.  That was a lot of fun.

christopher baus (www.baus.net)
Friday, November 7, 2003

Doing the failover is tricky, and whole industries are based around it.
Products like Distributed Director from Cisco or Akamai's DNS service.

Doing geographic failover / load balancing (once you have two servers, why not have west coast users hit the west coast server farm?) is pretty tough, and for TCP-based services it is best done with DNS records with a zero TTL.

Colin Whittaker
Friday, November 7, 2003

Look into an EMC CX200 or an IBM FAStT600; I believe both of them will sync with another unit across a network.  You won't have to worry about SQL replication or log shipping, since this syncing happens at the disk level.  Plus, these are reliable RAID solutions and have insane I/O levels.

"Throw some hardware at the problem, because at the end of the day the high-end consultant costs as much as the hardware and you get to keep the hardware."  Just heard that somewhere and it makes some sense.

Mike
Friday, November 7, 2003

I was recently on a project where we needed similar capability. We ended up using a product called DoubleTake. I was not the guy who set it up, but my understanding of the product is that it sits in the file I/O layer and just ships the I/O on the database and log files over the network, where it is echoed to your hot standby server's data files. It seemed to work pretty well, and it rarely fell too far behind in the updates. I thought it was a lot slicker than log shipping. It works on regular files, too.

http://www.lyonware.co.uk/DoubleTake/DoubleTake-Software.htm

Rob VH
Friday, November 7, 2003

Sybase Replication Server works with SQL Server and should do what you need.
http://www.sybase.com/products/highavailability

MR
Friday, November 7, 2003

One key factor is how "hot" you want your backup to be.  Replication is pretty easy to set up.  Someone mentioned posting the same information to both servers simultaneously.  I did something like this years ago, but it was done using stored procedures.  After the proc ran successfully, it just called its counterpart on the other server with the same parameters.  Only the primary server would do this, so we didn't get into an endless loop of updates.
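
A rough sketch of that pattern, in case it helps (server, database and object names here are invented; it assumes a linked server to the secondary with RPC enabled):

  CREATE PROCEDURE dbo.AddOrder
      @CustomerId int,
      @Amount     money
  AS
  BEGIN
      -- Do the local write first
      INSERT INTO dbo.Orders (CustomerId, Amount)
      VALUES (@CustomerId, @Amount)

      -- Only the primary forwards the call, so the two copies never loop
      IF @@SERVERNAME = 'PRIMARY_NY'
          EXEC SECONDARY_VAN.MyDb.dbo.AddOrder @CustomerId, @Amount
  END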

I also worked on an election system which required absolutely no loss of data.  I was only dealing with a few tables, but for it, I created a trigger on our main database which would take inserted values and insert them into the appropriate table on our backup server.  I only had to deal with inserts, though, since a vote could not be deleted or changed.
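
The trigger was something along these lines (simplified, and the table and linked-server names here are made up):

  CREATE TRIGGER trg_Votes_CopyToBackup ON dbo.Votes
  FOR INSERT
  AS
      -- Copy every newly inserted row to the same table on the backup server
      INSERT INTO BACKUPSRV.ElectionDb.dbo.Votes (VoteId, CandidateId, CastAt)
      SELECT VoteId, CandidateId, CastAt
      FROM inserted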

Replication does not have anything to do with the actual failover though.  You could just provide two connection strings and use one as primary and if it goes down, just switch to using the backup.  Optimistically, your backup would handle the load until you can get the primary server running and sync'd up again.

Wade Winningham
Friday, November 7, 2003

We use transaction log shipping and it works just fine.
We control which server is hot through our own routing though, that way we don't have to rely on a provider.

bob
Friday, November 7, 2003

The easiest way to do it would probably be to use transactional replication, like you said.  One alternative would be to use distributed transactions with the MS DTC and distributed queries to make sure the data gets distributed between the two.  You just link the servers using sp_addlinkedserver, create some logins with sp_addlinkedsrvlogin, and you're ready to roll.  Probably not as easy as transactional replication, but the possibility is always there.
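
For the archives, the setup is roughly this (names and credentials are placeholders, and the distributed transaction needs MS DTC running on both boxes):

  -- On Server A: register Server B as a linked server
  EXEC sp_addlinkedserver
      @server     = 'SERVERB',
      @srvproduct = '',
      @provider   = 'SQLOLEDB',
      @datasrc    = 'serverb.example.com'

  EXEC sp_addlinkedsrvlogin
      @rmtsrvname  = 'SERVERB',
      @useself     = 'false',
      @locallogin  = NULL,
      @rmtuser     = 'repl_login',
      @rmtpassword = 'not_this_obviously'

  -- A write that must land on both servers or on neither
  BEGIN DISTRIBUTED TRANSACTION
      INSERT INTO dbo.Orders (CustomerId, Amount) VALUES (42, 19.95)
      INSERT INTO SERVERB.MyDb.dbo.Orders (CustomerId, Amount) VALUES (42, 19.95)
  COMMIT TRANSACTION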

Cody Powell
Friday, November 7, 2003

Joel, if you want ultimate reliability, give Akamai a look.  They have the ability to cache record sets or even entire databases in their network.  All you need is one server, in NYC or Vancouver, to act as the authoritative server for everything -- web pages and databases.  This is a bit of a paradigm shift, but it works very well.  The cost is about as much as a 1.5Mbps commitment at any hosting provider.  Just a thought.  I'd be happy to discuss this further with you if you are interested.

Dann Sheridan
Friday, November 7, 2003

I'll start by saying I'm no expert in this, but I've always felt it can be a can of worms. To paraphrase one of Joel's articles, every layer of abstraction just adds more exceptions to handle... Doesn't another article discuss developers who think distributed systems are simply a bigger version of standalone systems?
Transaction log shipping and rep server (from Sybase or others) both do what you asked for (keep two databases in sync): more usually, rep-server type products for near real-time synchronisation and transaction log shipping for bulk delivery (e.g. every half hour). Normally these only pick up committed transactions, so for automated failover consider what happens if a client commits a transaction on the primary, the primary site then fails before the replication, and the client starts using the secondary site (which never received that transaction). Is this an acceptable situation for your application? Rep-server type products can often be set to treat the replication as part of the initial transaction, but this has a cost in response time.
Next question: you are talking specifically about internet connectivity. The internet being the glorious system that it is, it is possible for two different clients to have different views as to which db is primary, depending on the connectivity between the clients and the databases.
The problem is that the solutions to all of these come down to application-level decisions; without knowing the product space there are no hard and fast rules to guarantee a highly reliable, performant system.
Depending on application requirements, quite often a low-tech solution of a read-write primary, a read-only secondary and a manual process to switch from primary to secondary (ensuring synchronisation, shutting down the primary and enabling read/write on the secondary) is adequate, particularly because you can allow the users read-only access whilst the switch-over is handled.
If you need better reliability than that, having the client communicate with both databases and explicitly commit on both is often the most performant solution; the client can then gracefully give up on one if it goes down. The problem then is handling what happens when the system gets split into two.
If you go to a 3-tier solution (with two or more middle tiers for redundancy) you can be cleverer still, as you can put more logic in the middle tier regarding what a failure is and what to do about it. The advantage of this is that it is no longer down to the client which db is primary: clients talk to any middle tier, the middle-tier components talk to each other and (probably) all databases (so no use of rep server or log shipping), and the inter-middle-tier communication controls what to do if one or more middle tiers is unavailable. Also, since the inter-middle-tier communication is within the system, depending on your infrastructure it may be more reliable than the public internet.
Too many questions, all from a paranoid point of view, but it comes down to what you want from high availability.

Neville Rowe
Friday, November 7, 2003

Joel,

I'm pretty new to the SQL world myself, but I have found great information in books like Inside SQL Server 2000 (MS Press, Kalen Delaney).

Also, there is a pretty good article "High Availability Solutions" in the October 2003 issue of SQL Server Magazine ( www.sqlmag.com ).  It gives a good summary of the three HA options for SQL Server, and commentary on what is appropriate in different situations.

The primary factor is which edition of SQL Server you are running: Standard vs. Enterprise.  Enterprise is the only edition that allows clustering.

Based on your description (servers located on opposite coasts), clustering isn't a good option anyway (due to network latency).

Log shipping or replication are your two other choices.  The difference is that with log shipping you will need to repoint your app to the other SQL Server, unless you put in some sort of front-end proxy to check which server is currently the "lead".

It sounds like log shipping is a good choice.  Basically, Server A ships a copy of the transaction log to Server B.  The caution is that log shipping does not automatically detect if Server A has failed, and bringing Server B online is a process that requires some manual intervention.
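
When that day comes, the manual part on Server B boils down to something like this (database name and file path are just placeholders):

  -- Apply whatever log backups you still have, leaving the database restoring
  RESTORE LOG MyDb
      FROM DISK = 'D:\logship\MyDb_tlog_latest.trn'
      WITH NORECOVERY

  -- Then recover it; after this the database is online and writable
  RESTORE DATABASE MyDb WITH RECOVERY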

I have heard of people writing monitoring software (pulse detection) and using Alerts to send the DBA a page/email if the server is not responding.

Good Luck!

Geoff Noden
Friday, November 7, 2003

Presumably this SQL Server is not a user-level service; it's only used internally by the web application that is seen by the users? Are you just looking to fail over the SQL Server, protecting against a single-system hardware failure but not worrying about a fire/sprinkler/power-outage-in-the-cage scenario?

After failover, I'd be worried about WAN latency/reliability and how the database drivers and application could handle it. Usually apps aren't designed defensively with respect to the database connections, because typically the database calls are reliable and of a pretty short duration. Start introducing widely variable response times and intermittent drop outs and you can have a lot of "interesting" user experiences...

Before spending a lot of money on a hot-swappable database, I'd test the application under the usage levels you expect with the database hosted in a remote location.

Whatever you do, develop as detailed a procedure as you can for what to do in the event of a failover, test the crap out of it, and test it regularly as well. No point in implementing all the hardware if you're not ready to use it at all times; that's the key piece that people forget when trying to implement highly available systems. ISPs change routes, lose firewall rules, systems get out of sync in patch levels... if you don't test, you can't have any faith that the thing is going to work when you need it.

Rob Meyer
Friday, November 7, 2003

I've managed a scheme like this, just as you described.  In our situation it was slightly different: we had merge replication between the east and west coast data centers, plus geographic load balancing.  However, if one coast went down, users would be routed to the other coast.  There was potentially a slight latency issue, with a user's data not yet being on the other coast in time, but we never ran into that problem.

In your scenario, I'd recommend transactional replication, because you may also want to apply schema and stored proc changes on the primary server and have those replicated to the other server.  I do not believe log shipping supports that (and I could be wrong; it's happened once).

Phil Haack
Friday, November 7, 2003


Why use software solutions when hardware is available to do the job?  You will only increase the failure rate.

Mike
Friday, November 7, 2003

We use log shipping on SQL 2000 Enterprise Edition and it works pretty well for warm backups.  The one thing to watch out for is that you can't use any non-logged operation or it breaks the replication.  (e.g., TRUNCATE TABLE)

Also, failover isn't automatic.  It requires some intervention but doesn't usually take too long.  Then you'll have to repoint your application to the new database server.

Redundant computers with Microsoft Cluster Server and common disks make failover faster, but that doesn't matter if your disk crashes.

Jeremy
Friday, November 7, 2003

Joel,

I would recommend using SQL Server merge replication to satisfy your requirement in this instance. We use it to sync two databases that are approximately 150 miles apart. Across the continent should work ok, even if the link goes down, replication will continue to log changes and then publish them once the link is back up.

I wouldn't recommend transactional replication, because it uses in essence a two-phase commit, and if the second server is down the first server won't be written to.

I learnt a lot about replication using links from www.mssqlserver.com; the owner's name is Mike Hotek and he seems to be a good source of replication information. HTH.

Also I am located in Vancouver, BC Canada and would love to help out on this project if I can. Drop me an email at Kevin@MooreSSI.com.

Kevin

Kevin Moore
Friday, November 7, 2003

Yes, no, all of the above, and none of the above.  I've been doing replication for over a decade.  I've been doing "log shipping" long before it was ever called that.  I also have several hundred clusters in production.  I also teach SQL Server High Availability, wrote the article referenced in SQL Server Magazine, and do consulting day in and day out on that stuff.  Enough of the background stuff......

You can do a "normal" cluster at a distance of 80 - 100 miles.  Why the limit?  The LooksAlive test (a simple ping from one node to the next) has a requirement of 250 ms on the round trip before it is considered dead.  Therefore, you are limited to 80 - 100 miles when using a dedicated, dark fiber link between the two servers.  You can extend this significantly further using Geoclustering, but unless you have a pair of EMC SANs (one on each side), have SRDF installed and running, etc., you can't do it with a simple cluster.

You can in fact do log shipping without having to have the Enterprise Edition.  The log shipping utility in the BackOffice Resource Kit 4.5 will work against SQL Server 2000.  There is also a simple log shipper in the SQL Server 2000 Resource Kit.  Or, you can simply create your own.  This isn't rocket science.  If you can take a transaction log backup, copy it to another machine, and restore it, you are doing log shipping.  Log shipping can extend the range pretty much indefinitely, but it does have latency.  ALSO, you can't simply redirect the IPs in the event the primary fails, because you have to take the manual step of recovering the database after all of the tran logs you can get have been restored.  You are vulnerable to twice the interval at which your backups run AT A MINIMUM.  The further you displace the two servers from one another, the higher the latency and greater the potential risk.

While I do a LOT of replication, I do not recommend it as the first solution.  It looks really nice and pretty, until you flip the switch and promptly watch your entire production environment melt to the ground.  That is NOT a problem with replication.  Your entire application and entire infrastructure need to be FULLY tested with replication in place, BEFORE it is ever deployed.  That means you need to unit test every type of transaction through every single component involved in your application, they then need to be load tested, if they pass those two, they then need to be fully disaster tested to ensure that you aren't going to have any surprises when you go to production.  For HA implementations, unless you have a LOT of experience doing replication, I VERY STRONGLY recommend sticking with transactional replication.  95% of the replication I do is with merge, but the learning curve for merge is 100 times as steep as it is for transactional replication.  There's a reason that in the 1 day seminar that I teach as an intro to replication, 1/2 of the day is spent on merge and that barely scratches the surface while the other methods can be adequately introduced in an hour or so.

So, for your implementation, I would recommend the following (assuming you have full access to the servers to do this and you don't have heavy budgetary constraints):
1. Cluster your primary location
2. Log ship to a secondary location

The cluster requires Enterprise Edition and you only license those instances of SQL Server which are running.  For example, if you have a 2 node, single instance cluster (1 instance of SQL Server installed), you buy licenses for 1 instance of SQL Server.  It doesn't matter how many pieces of hardware you have in the cluster.

If you are using Log Shipping out of the box, you need Enterprise Edition on the secondary.  If you are using a variant, Standard Edition will suffice.

What does this look like during a disaster?  Node 1 fails.  The SQL Server shuts down and ALL connections to the database server are terminated.  The cluster fails over, SQL Server goes through restart recovery, and it comes online.  Once back online, clients have to reconnect.  If the entire cluster were to die, then you would have to manually finish restoring all of the transaction logs to the secondary and then recover the database to make it available for connections.  (Failing it back once the cluster is operational again is a nontrivial operation which requires an outage window to accomplish.)

If you are comfortable with and have FULLY TESTED transactional replication, you can implement that from the primary site to a secondary site.  In the event of a failure at the primary, your ISP can do a redirect to the secondary. (It will have a different name and IP, so you have to account for this yourself.)  Failing it back is also a non-trivial operation.  If you deploy transactional replication, make certain the distributor is either on a 3rd machine or is on the same machine as the SECONDARY.

This was a VERY simple dissertation.  There is a LOT more to consider, but I don't feel like writing a book.  In short, there is no free lunch.  Each approach has its pros and cons.  Each approach, used by itself, can be killed VERY EASILY.  Even using multiple approaches together, if the drawbacks are not fully understood and planned around, can be killed VERY easily as well.  I'd strongly recommend reading the referenced article in SQL Server Magazine, but keep in mind that is a very brief intro to a very complex architectural problem.  Even covering HA at a base level with respect to SQL Server takes 3 FULL days in a classroom.....and that isn't your typical class where an instructor reads slides.....it is a class that can be best described as "open wide, because we're going to shove about 15 days worth of material into 3 days".

Mike Hotek - Solid Quality Learning
Saturday, November 8, 2003

Round-trip time from coast to coast is 80-90 ms using the vanilla Internet, not a dark fiber link, with a packet size of 32 bytes. Round-trip time from the West coast to Central Europe is roughly 200 ms.

So, why do you think 80 - 100 miles is the limit for a "normal" cluster?

Kostya Kulikovsky
Saturday, November 8, 2003

Presumably because you're quoting average times and he's thinking of maximum times.

Incidentally I can't get any ping outside of Saudi to come within the 250 ms he mentions.

And pinging inside the country comes well under but with lost packets.

Anybody remember the 500 mile email failure by the way?

Stephen Jones
Saturday, November 8, 2003

Look into merge replication.  It's a pain to set up correctly, but I've had good success with it, and it sounds more appropriate for your situation.  I spent months learning it for a client and am somewhat of an expert with it.  I'm the only person I know who has used it in production (that includes MS support staff).  Shoot me an email, I'd love to help!

Ryan S
Saturday, November 8, 2003

You can, but I would NOT recommend merge replication without a SIGNIFICANT amount of background.

The guesses about use are inaccurate.  I have over 150 systems in production that are running merge.  I know of at least 20 systems that MCS (Microsoft Consulting Services) has implemented.  I know of another dozen that have been done by other groups within Microsoft.  I also know of more than 800 different people around the world who are using merge and have it running in production.  So, it is VERY widely used, but it is about as close as you can get to a nearly vertical learning curve for at least the first 3 - 6 months.

It is not a pain to configure.  It is a very complex technology that doesn't allow you to "just turn it on" and see what it does.  It can and will blow up rather spectacularly if you don't do it properly.....and that is from VERY hard experience.

As for the 80 - 100 miles, it comes from Microsoft recommendations.  It can be found in the clustering documentation, in the High Availability courseware, in MOC courseware, in Books Online, and you can also get to a discussion about it on EMC's website where they discuss geo-clusters.

Mike Hotek - Solid Quality Learning
Sunday, November 9, 2003

I'm not a SQL Server expert, and this might even be off-topic for this discussion, but if you are planning on doing this with the server setup detailed in the 'New Server at Peer1' article you might want to take another look at that lil' Linksys switch you bought to connect all the server's ports to the one ethernet drop you're getting from Peer1.  Right now, if its power supply is fried or something else happens that incapacitates it, your server is dead in the water. No connectivity, no management capability, nothing.

While the Linksys might turn out to be reliable enough, I would not want to have it as a point of failure between an expensive server and redundant data center infrastructure. Instead, I would try to have them all connected to the Peer1 network directly, to different switches if possible, so even a port failure on Peer1 switches won't leave you without connectivity to your server.

My apologies if this turns out to be off-topic, redundant or something :)  Good luck with the SQL Server replication.

Jonathan
Sunday, November 9, 2003

How much performance are you willing to sacrifice for integrity and durability?  How much complexity are you willing to accept to reduce the tradeoff?

You want to do transcontinental failover because of a concern about a disaster taking out the primary location.

Your failover location cannot start until it gets the last log entry from the primary location.

You define log writes to be complete only when you have verification that they have been replicated at the secondary location.

Your primary location is now vulnerable to latency and disruptions in the connection to your secondary location.

You reduce the latency by installing multiple log relay servers 80-100 miles away and defining a write to be successful if it hits any of these relays.  You write to the relays in parallel.

But you are still at risk if the secondary location happens to be down when the sending relays attempt delivery.  So you have a layer of receiving relays throughout the country that accept deliveries from your sending relays.

Question: If you anticipate a disaster with a blast radius of 50-100 miles from your primary location, what else should you be worrying about?

Alex Jacobson
Sunday, November 9, 2003

We had a client who had replication running and got a data corruption.  The data corruption replicated, so both databases ended up storing it.

We now do log shipping.  At most they lose an hour's work.  And log shipping is 100%, while replication is sometimes only 99.999%, which isn't good enough.

Oliver Townshend
Monday, November 10, 2003

Mike Hotek,

Can you please point me to the books/articles that you mentioned? If it's on MSDN, I can get to it.

Thanks

pdq
Monday, November 10, 2003

Not a SQL server expert (when I have to bet my job on a database it's Oracle on Solaris), but I have worked for a couple of telecoms with hosting businesses (you've heard of them). You do NOT want to trust those people to do adaptive routing based on the state of your server, even if they offer that service, which they probably don't.

Jeff Carroll
Tuesday, November 11, 2003

Some good suggestions at:
http://vyaskn.tripod.com/maintaining_standby_sql_server.htm

David Cameron
Tuesday, November 11, 2003

I've done something like this, and used "plain vanilla" replication (in SQL Server version 6.5).

There's a bunch of gotchas, mainly to do with primary keys, if your webservice is "writable", i.e. when you fail over you want your clients to be able to store data on your failover database and have it automagically synched when "Main" comes back to life.

FWIW, here's what we did:
We separated the "user" tables from the "static & sluggish" tables. "User" tables remain writable once the database has failed over; "s&s" becomes read-only. So, "user" might contain orders and invoices, whereas "s&s" might contain products and categories, which are changed by some master instance.

We used a primary key allocation scheme on the two boxes which excluded clashes for all the "user" tables.
We set up some convoluted 2-way replication scheme (we had a kick-ass DBA, btw - don't try this without one!).
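
(For what it's worth, one common way to keep the keys from clashing is to stagger the identity seed and increment on each box; the table below is just an illustration, not our actual schema.)

  -- On box 1: odd surrogate keys (1, 3, 5, ...)
  CREATE TABLE dbo.Orders (
      OrderId    int IDENTITY(1, 2) PRIMARY KEY,
      CustomerId int NOT NULL
  )

  -- On box 2: even surrogate keys (2, 4, 6, ...)
  CREATE TABLE dbo.Orders (
      OrderId    int IDENTITY(2, 2) PRIMARY KEY,
      CustomerId int NOT NULL
  )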

We moved the "static & sluggish" data into a separate schema, and defined "1-way" replication with the "main" database as the master.

We created a bunch of stored procs to explicitly check that both databases were configured properly, and ran them every night.

I honestly don't know if it worked; our main database never went offline. What I do know is that it took us an amazing amount of time and effort to do. If there's a product which does this out of the box, I'd recommend buying it and being done with it. I estimate that in all, we spent as much time on the failover stuff as we did on the "normal" features, and it made routine maintenance tasks far harder than they should have been. So, make sure this pain is really merited in your case!

neville kuyt
Tuesday, November 11, 2003

Are you sure you want to limit yourself to two servers on either coast?  You should think about two clustered servers in one location and another at the colo site that gets transaction logs.

Make sure to periodically test the redundant db!  It's always bad when you go to the backup and it's garbage...can happen if some monitor process dies so nobody knows something is going wrong.

Craig Thrall
Tuesday, November 11, 2003

How fast can your ISP re-route the incoming packets from Server A to Server B?

Your optimal solution should include that lag time as a parameter.

I.e., assuming, hypothetically, that you have ideal real-time instantaneous synchronization of Server A to Server B but it takes your ISP more than 48 hours to re-route incoming traffic from A to B...

Also, you'll need to define minimum acceptable time-frames for each step in the recovery operation.

I.e., what is the acceptable time between:
* when Server A goes down and when you are notified?
* when you are notified and when packets are re-routed to Server B?

Also, how big is the database? How many transactions/sec?

It's always good to know these things before selecting a particular technology solution...

erhan hosca
Tuesday, November 11, 2003

Joel,

If this is a new app, you might want to consider writing it for MySQL 4.0 on Win2K(3). The vanilla transactional replication works well and it's easy to configure.

http://www.mysql.com/doc/en/Replication.html

Just an idea!

Colin Faulkingham
Tuesday, November 11, 2003

I've implemented log shipping.  The previous poster who said that it requires Enterprise Edition is somewhat misled.  The tools to do it graphically require Enterprise Edition.  Let me rephrase that: the tools to do it in the least reliable, least transparent, hardest-to-debug way possible require Enterprise Edition.

The tools to do it for SQL Server require a little know-how of how SQL Server works and how to write T-SQL.  I've implemented log shipping on every version of SQL Server on the Windows platform (including 4.21, 6.0, 6.5, 7.0 and 2000).  It's not hard.  Here's how you do it in 2000:

1.  SQL Server MUST be running under a domain account, or at least an account that can transparently authenticate between both SQL Servers.

2.  Either copy your daily full backup to the backup server, or back it up directly to your backup server.

3.  Load the daily full backup onto your backup server without recovery.

4.  Every <generic time interval> you dump your transaction log.  Load the log dump onto your backup server with no recovery.


Very simple and straightforward. 
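
In T-SQL the whole thing boils down to roughly this (database name and paths are placeholders; the backups run on Server A, the restores on Server B):

  -- Steps 2 and 3, once: full backup on A, loaded on B without recovery
  -- (add MOVE options to the restore if the file paths differ on B)
  BACKUP DATABASE MyDb TO DISK = '\\serverb\logship\MyDb_full.bak'
  RESTORE DATABASE MyDb FROM DISK = '\\serverb\logship\MyDb_full.bak'
      WITH NORECOVERY

  -- Step 4, every <generic time interval>: ship and apply the log
  -- (in practice you would timestamp the file names)
  BACKUP LOG MyDb TO DISK = '\\serverb\logship\MyDb_tlog.trn'
  RESTORE LOG MyDb FROM DISK = '\\serverb\logship\MyDb_tlog.trn'
      WITH NORECOVERY

  -- On failover day: recover the standby and point the app at it
  RESTORE DATABASE MyDb WITH RECOVERY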

From an engineering perspective, this approach has the advantage of being very transparent (it is very obvious when it fails, it's easy to figure out why it fails, and the problem space fits in your head).  It's also ludicrously reliable.  Once it's set up and running it runs like a Swiss watch.

The big question you need to answer is: what are you trying to fix?  If you're trying to eliminate data loss, then you've already got offsite backups, so why bother?  If you're trying to eliminate downtime, this might be a good solution.  It won't be a good solution if the data loss was your DBA truncating a table, because chances are the truncation will propagate over the network along with your data.

Maybe a better solution is a modified log shipping: move the full backup over to the backup server and load it daily, but just copy the daily logs over and don't load them; keep them in a ready state.  That way, if you've got a data loss problem in the middle of the day, you don't automatically load the data loss onto your backup server.  Transaction log backups generally take a trivial amount of time to apply (obviously this depends on your transaction volume and database size), but it's a better solution than having to start over from scratch for a larger database.

Drop me a line if any of that wasn't coherent.

chris.

Chris Miller
Thursday, November 13, 2003

Hi Joel

Log shipping is a pretty expensive feature which was designed to achieve just the thing you need (SQL Server 2000 Enterprise Edition, processor license... hmm...). It's also unknown (to me) how it would work in an unreliable WAN environment, and you might need to set up a VPN between the locations. You may test it out using the free SQL evaluation version.

But if you are up for testing, you might also want to try other replication plans; three come to mind:
* use regular SQL replication. Not difficult to set up, works well in an unreliable WAN environment, VPN still recommended (but not required), SQL Standard Edition will suffice.
* use DTS plus some schedule. Needs some manual work (i.e. development), and the results might not be what you expected.
* write your own replication over HTTP(S) (and maybe SOAP). That means you will automatically send (from server A) HTTP(S) requests containing all updated data to server B, and near B there will be some "web service" reading these updates and writing them to the database. A lot of development, but the results might be exactly what you wanted, and it will suit unreliable internet-based links. VPN not necessary, as you can use HTTPS connections.

The main problem with all of these plans (including log shipping) is the volume of updates sent from A to B. The other thing is latency, i.e. how long it takes to update server B when some data in A is removed, changed or inserted. You just need to think about it.

Good luck :>

B.

Bronek Kozicki
Friday, November 14, 2003

Neville Kuyt and Chris Miller offer the most relevant advice, I think. This comment is way late; I'm interested in what you ended up with...

My two bits:
Redundant site, cold or warm standby: use log shipping (OOTB or roll your own, it's easy). It's absolutely imperative you have a sensible disk layout: separate everything, perhaps even direct the backups to a separate (local) server. Differential backups are also a possibility here.

If you want your redundant side hot and handling data modification, it is most sensible to factor the databases by site. Analyze your data model and usage - make sure:
- no transaction is dependent across the site divide. Don't reference rows factored to the other site(s)
- temporal dependency of transactions can't cross over sites when there is a possibility of a loss-of-data incident due to an outage. This is way harder. A little two-phase commit (DTC) every once in a while can simplify things, like a syncpoint. Update something and make sure you use a marked transaction (thinking about coordinated recovery here; see the sketch below)
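
A quick illustration of the marked-transaction idea (the names are invented):

  -- On the primary: put a named mark in the log at a known-good point
  BEGIN TRANSACTION SyncPoint WITH MARK 'coordinated sync point'
      UPDATE dbo.SyncStatus SET LastMarkedAt = GETDATE()
  COMMIT TRANSACTION SyncPoint

  -- During coordinated recovery on the standby: restore up to that mark
  RESTORE LOG MyDb
      FROM DISK = 'D:\logship\MyDb_tlog.trn'
      WITH STOPATMARK = 'SyncPoint', RECOVERY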

More excess verbiage based on actual practical experience:
There is no panacea, but if you steer clear of problem assumptions (e.g. insisting there is really only one database), it is doable.  Distributed transactions for everything do away with most of the thinking, but are so painfully slow - use them if you don't care about transaction rate. Merge replication is not a panacea, even if you factored the database. And you still have the demon of distinguishing network partitioning from server failure. Use of replication may have all sorts of difficult consequences for release plans.

Also, reverting may be harder than you think. I had a loss of data incident occur last month during a planned failover/revert. App using MSMQ, log shipped, both databases in read-only mode around the critical point - but the app discarded some pending transactions during the switch and had already acknowledged receipt to MSMQ - doh!

John M. Owen
Saturday, November 15, 2003

If you go to www.sqlservercentral.com, in their script library they have a script with this title:
"Script to create a standby database server"
I have not tried this personally, but others have said it works well.
I believe it uses FTP between the two sites to copy the logs.
It does log shipping without using that feature of MSSQL.

HTH,
Ross Boe

Ross Boe
Tuesday, November 18, 2003

The SQL thing isn't the major problem with your configuration. (Yes, I'm one of the Linux guys who plays with his Windows.)
The problem is IIS.
I don't know the current state of its security risks, but the program has been hit very hard by bugs (even the FBI seems to have noticed), and there have been a lot of security holes. (I know some big companies which fire anyone who installs IIS, even if it's just for a test on the local firewalled network. And they use a lot of Windows stuff.)

I would suggest an(y) other webserver, for example:
http://httpd.apache.org/docs/windows.html

Programmers are good at programming,
hackers are good at hacking,
but who is good at security?

Michael Fuchs
Saturday, November 22, 2003

Hi,

Could you let us know how your solution turns out?

(Performance, manageability, your procedures under failure)

-Tim

Tim
Monday, November 24, 2003

You may not like this answer, but the simple solution is to use Oracle. Oracle with Data Guard allows you to fail over quite easily. Oracle has worked out all the kinks with this and it is a part of many deployments. I have helped design these implementations, supported them, and seen what doesn't work.

* How much loss are you willing to accept on failover?
As you can see from many of the other posts, even with the best network, the speed of light is your biggest enemy when going long geographic distances and keeping things totally lossless. The more $$$ you throw into network and storage, the closer you can get.

In an Oracle RAC environment with NAS or SAN storage set to synchronous writes (no write completes until it has completed on both storage devices) you can achieve lossless operation. How much $$$ you throw at the net and the storage is almost directly proportional to the performance and reliability you'll get out of the system. Common sense: you get what you pay for.

If you're willing to lose 5 minutes of data, then you can configure Oracle's Data Guard to perform at that level, as you prefer.

The really cool thing is that you can do RMAN (specialized hot) backups of the database from your standby system, thereby never hitting your production system with the load of a hot backup.

If you're still listening, you can get a good view of the network problem at:
http://otn.oracle.com/deploy/availability/htdocs/dataguardnetwork.htm

A very good technical overview of the data guard solution can be found at:
http://otn.oracle.com/deploy/availability/htdocs/DataGuardOverview.html

With the latest database, you can do the setup and configuration all from the GUI or the command line.

One thing that you can be assured of with Oracle. This solution works, it is fast, and there are lots of low and high dollar customers who are using this solution in production today.

Mike S.
Thursday, December 25, 2003

There's nothing more annoying than when a guy asks how to solve his problem with Application A and people respond with "Try Application B instead."
Like on this page, several people suggest Oracle, MySQL, Apache and such...

Dr. Lee
Sunday, May 23, 2004
