Fog Creek Software
Discussion Board




Recommendations sought for Fog Creek developer PCs

Arg!

For the second time in the last year, a hard drive failure has led to a painful chain of events resulting in several wasted days. Of course we have backups and no data was lost. But we've lost another few days that we'll never see again: downloading backups, reinstalling them, rebuilding systems, installing OSs.

So I've decided that our new policy is that ALL non-laptops at Fog Creek will have RAID. There are some pretty cheap IDE RAID cards now so this shouldn't be a problem.

Basically I see three options for developer machines, and I'd like to hear some opinions or other suggestions. (Quick reminder: Fog Creek's software runs on Windows so we need Windows 2000 or XP machines. Don't suggest Linux.)

OPTION 1: Terminal Services Client. We get a killer server with lots of CPUs and huge RAID. Developers run junky old computers with Windows Terminal Services and keep all their important stuff on the server, accessed through Terminal Services.
--> Problem with this option: I don't think Terminal Services will support triple-monitor clients. Or does it?

OPTION 2: Network storage. Everybody has killer machines with no hard drive, and they boot off the network. All storage comes from a big, fast RAID 0+1 network file server (with striping for speed and mirroring for safety).
--> Questions: is this going to be slower than local hard drives? What if we used 1000 Mbps Ethernet, would it be as fast as a local drive?

OPTION 3: Private IDE RAID. Everybody has killer machines with an IDE RAID card. It's not hot swappable but at least it's mirrored.
--> Has anybody done this? What's your experience?

Last Question: I recently noticed that as soon as you start talking about IDE RAID and multimonitor systems, the Configurators over at dell.com can't really handle it. Does anybody know a good, honest PC shop that will custom-build PCs for us -- the kind of place full of geeks who read hardware web sites all the time and get into deep debates over which kind of system bus is fastest?

Joel Spolsky
Thursday, January 23, 2003

Off topic Joel, so delete it if you want to, but how did it take you "several days" to reinstall the OS, and why were you reinstalling the OS in the first place? If you look at the thread on formatting a hard disk, you will see nearly everyone recommending using Ghost images.

You partition your HD so all user data is kept on the D: drive, and then make a Ghost image of the C: drive, with the D: partition blank. Then if your hard disk fails you simply put in the spare one, restore the Ghost image (should take about 30 minutes or less depending on how many gigs you have) and then reinstall your last backup of the data.

The only danger with RAID is the failure of the controller card, though I suspect with mirrored RAID it shouldn't make much difference, since either drive still holds a complete copy.

If you want fast, reliable hard drives, consider SCSI. SCSI drives are more expensive than ATA drives of equivalent performance, but the best drives on the market are simply not made in ATA versions.

As for the question of local storage vs. network storage, I suggest you get Robert Moir's opinion. It is, however, one of those "religious" questions, and you will have to make up your own mind.

Please forgive me if I'm being impertinent or just stating the obvious.

Stephen Jones
Thursday, January 23, 2003

I'm not sure if this is useful or not, but here's a quick rundown of what my school does:

The labs for many classes run off of networked storage. If you don't have many people on, it works fine (it's on a hub, so it dies under half load). It is noticeably slower than the labs which run things locally (there's a Linux lab which has its own drives, and Mac labs too. This isn't quite apples-to-apples, but it's a significant difference).

I don't know what kind of hardware would be needed to make it run quickly (I'm afraid I don't know if my school's on 10bT or 100bT in those labs).

My personal gut instinct is per-computer RAID. However, if the controller goes, it can hurt your data. I suggest you read the Ask Slashdot thread (take it with the usual Slashdot-sized grain of salt) at: http://ask.slashdot.org/article.pl?sid=02/12/13/0313254&mode=thread

The issue there is that a HighPoint IDE RAID failed and left the disks unrecognized as part of the array. I don't know if he fixed it or not, but either way, it's a problem you probably want to avoid :)

What if you pull down a Ghost image of each partition (the OS partition and the data/source code etc. partition) nightly onto a big server with SCSI RAID? That should perform better than full-out network storage, but won't give you as reliable backups (being daily or periodic). The advantage is you're not SOL if your big backup box goes down.

Hope this is helpful.

Mike Swieton
Thursday, January 23, 2003

I would go the network route and give each workstation its own hard drive. Then each developer would simply access the codebase on the server, and you only have to back up that directory or drive, etc. RAID would help to speed things up on the server as well. A fast network is important; you don't want to get bored waiting for file/data transfers. For testing purposes I also have a couple of 10/100 cards available.

As everyone told me in the thread I made earlier, use Norton Ghost.  I have yet to buy it, but it sounds like a time saver and may rid me of my formatting habit...

C:\>format c: /u

As far as custom shops... I don't trust them.  Will they be in business a year from now?  What kind of warranty do they offer? etc. 

If you have the time I would suggest building one on your own.  I love to build my own and have built 10 workstations in the past 3 years (None of which has had a problem).  I use the following sites to order parts:

http://www.newegg.com

http://www.tccomputers.com

May be worth a peek if you have time.

Dave B.
Thursday, January 23, 2003

I also recommend private RAID.

PCs for Everyone (www.pcsforeveryone.com) can configure a PC for you with 3ware hardware IDE RAID.

While it's not in PCs for Everyone's online configurator, 3ware also has hot-swappable IDE drive cages for use with their cards. I'm pretty sure you can call them up and see if they can procure and add the drive cages. They're pretty flexible, at least for their in-store customers.

James Park
Thursday, January 23, 2003

RAID is not the solution; it just moves the problem from the disk to the controller. I know places that have had less uptime since they moved to RAID, YMMV.
Get a backup system that can do fast full restores (this can be Ghost), a good LAN (fully switched 100 Mbps should be OK) and a few cheap PCs with fast NICs and large IDE drives for overnight backup. If at all possible put the backup PCs in a different building.

Just me (Sir to you)
Thursday, January 23, 2003

Backups aren't good enough. I want RAID because I'm starting to learn that even when you have perfect backups, the time it takes to get up and running again is always too long. It shouldn't be, but it is. For example, all the people who suggest Ghost -- this just doesn't cut it; there are too many steps involved in resurrecting a PC. Similarly the concept of having an OS partition and a data partition: to resurrect such a system, you need to reinstall the OS (60 minutes for Windows XP including service packs) and only then can you start getting the user's data -- but what about all the little utility programs and customizations that the user did? The registry? Etc.

As for RAID controllers failing... OK, doesn't this just mean that you replace the controller and you're all set? Is this really more common than hard drives failing?

Joel Spolsky
Thursday, January 23, 2003

It has to be asked - have you actually used Ghost (or any disk imaging software), Joel?

John Topley
Thursday, January 23, 2003

Restoring from backups doesn't have to take a huge amount of time, if you're willing to do full-workstation backups and you take the time to design your backup strategy.

I use Retrospect to do incremental backups of my machines.  If I need to restore, all I need to do is install the base OS (25 minutes for Mac OS X Jaguar), install the Retrospect client (1 minute), re-activate it on the backup server, and have the server start a full restore.  Then I just keep feeding the server tapes until it tells me it's done.

It'll still take a few hours, sure, but it shouldn't take more than a single day's worth of time.  And the machine will be configured *exactly* as it was the night before the crash.

To make the backup and restore process go faster, be sure you're on a gigabit network (they're cheap now, under $100/card and about $100/port on the switch) and have a nice, fast tape drive or FireWire RAID or whatever on your backup server.  Also, do full backups on a regular basis, in addition to nightly incremental backups, so you don't have to put in 15 tapes to restore a single system.
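
As a rough sanity check (the 60 GB image size and the ~50% effective wire utilization here are my assumptions, not measurements), the network speed dominates a full restore:

# Back-of-the-envelope restore times for a full-workstation image.
IMAGE_GB = 60      # assumed image size
EFFICIENCY = 0.5   # assumed real-world fraction of wire speed

for name, wire_mbps in [("100 Mbps Ethernet", 100), ("Gigabit Ethernet", 1000)]:
    mb_per_s = wire_mbps / 8 * EFFICIENCY
    minutes = IMAGE_GB * 1024 / mb_per_s / 60
    print(f"{name}: ~{minutes:.0f} minutes")

# 100 Mbps Ethernet: ~164 minutes
# Gigabit Ethernet: ~16 minutes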

Oh yeah: Whatever you do, look into FireWire storage solutions.  FireWire is fast and robust; it's essentially serial SCSI.  You should be able to find a high-bandwidth FireWire RAID cabinet that Just Works fairly inexpensively.  And with today's 200GB ATA disks, you can get almost a terabyte of storage out of such a beast.

Chris Hanson
Thursday, January 23, 2003

Hmm, doesn't sound like people are listening to the question I asked.

I don't want a backup strategy; I have an excellent backup strategy. It's just that Murphy has taught me that failed hard drives take an average of 2 days to recover from. I know they shouldn't, but they do. I want to switch to a system where failed hard drives take zero time to recover from.

Joel Spolsky
Thursday, January 23, 2003

If you want to do it machine by machine, you will want redundant drives on redundant controllers. One controller controlling two mirrored drives is useless if the controller gets nutty. And please drop the IDE drives and go SCSI; IDE MTBF is way too low.

Even if you build a giant terminal server you will have to do something robust like this on it. Terminal Server can be very nice; the disadvantage is that if some program on the terminal server causes problems, it causes them for everyone, not just the one person who needs it.

Crusty Admin
Thursday, January 23, 2003

I'm assuming that those two days that Joel quotes include the ordering, delivery and fitting of a new drive...?

John Topley
Thursday, January 23, 2003

"Hmm, doesn't sound like people are listening to the question I asked"

I think they are trying to suggest (rightly or wrongly) that you have asked the wrong question.

Bob Greene
Thursday, January 23, 2003

Joel, to answer your replacing controllers question,

I've used 3ware cards before and taken 2 drives created on one card (2 port version) and moved them to a new computer and new card (4 port version) when I upgraded machines. This went off with no hitches.

I've had drives fail before, replaced them, and the 3ware card rebuilt the array with no hitches.

Again, the reliability of my upgrade and recovery is just my personal experience. You might want to check newsgroups for other people's experiences.

As for controllers failing, what is the probability of a solid-state device like a disk controller failing? And please, people, no stories of how my friend's friend's friend's friend had 10 controllers in a row fail within 1 second of booting up.

As for IDE vs. SCSI on the desktop, what does SCSI get you other than higher RPM and higher cost? Higher RPM is nice, but do you really want 2-4 15K RPM Cheetahs tearing holes in your eardrums all day?

Joe Smalle
Thursday, January 23, 2003

Joel: If you are really nervous about getting up quickly, then the way to go is Option 2. The issue: you have now introduced a possible single point of failure for your whole team, which may not be fully manageable unless you use mirrored servers. So you don't get one server but two, each configured for max redundancy and on separate gigabit switches; then you can ensure uptime (barring a building- or system-wide disaster), and that is expensive.

Note (or Option 2.1 + Backup): Ghost Enterprise states it can do incremental backups to a central server (you need a baseline Ghost on the Enterprise server and a good network connection). I have not tried it myself, so I will not say how that works for quick recovery.

Mr. Smalle:
Past experience has taught me that SCSI outlasts IDE-type drives in workstation and server environments. I have personally lost or had to replace two to three times as many IDE drives as SCSI drives (but I must admit the sample failure size for both is quite small).

cheers
mad

A Software Build Guy
Thursday, January 23, 2003

Gee, folks, having problems here?

Gosh!!

>>>I'm assuming that those two days that Joel quotes include the ordering, delivery and fitting of a new drive...?

Explain to me how a data backup gets you up and running? Explain to me how having a backup of all your work gets you up and running? In the last week I have installed about 3 or 4 new packages. Each one of those packages has updates, and probably more than an hour of "settings" that I will lose if I re-install the package. As mentioned, I can use Ghost to re-install Windows with ALL of my cool settings. However, that Ghost image is only as good as the last image. I have a "clean" base Ghost image. However, that does NOT address Joel's question.

As mentioned, in a day I have all kinds of OCXs and other registered ActiveX controls that I use during software development. In other words, I have to install/register all kinds of stuff during the day.

Only two days? Joel is being conservative here. While Ghost rocks for a full Windows re-install (less than 5 minutes), this does NOT address the problems of each user's setup. Unless Ghosting is done every day, or a RAID mirror is used as Joel is suggesting, then you still have all that setup time.

In fact, even if Ghosting is done every day, you still have one day of exposure anyway (and making a Ghost image every day is not always practical anyway). Ghost is ideal for the "base" install. Ghost is ideal for the company that needs Word, Excel, and the email stuff pre-setup. The rest of the user's data, such as My Documents and email folders, will reside on the server. This setup is ideal for the average Joe computer user. The average Joe company does not need much software, and what it has DOES NOT change. Developers are not average computer users... sorry folks!!! (Remember, his name is not Joe, but Joel!!)

Unless you Ghost every day, or mirror as Joel suggests, then how can ANY of you people think that a data backup helps here? How can ANY of you think that only 1 or 2 days is NOT the lost time? The problem here is not data; the problem is that in one day, all kinds of things get CHANGED and INSTALLED on a PC. How do you save THAT TIME?!!

Can anyone explain how one gets up and running with a good data backup that is not a mirror? Please do, because that dead obvious request by Joel has not really been answered.

It is dead obvious that Joel is looking for a backup solution that saves not just the daily work data, but the huge investment of people's time in configuring and setting up a PC to work during the day.

Joel's suggestion of Terminal Server was very good indeed. However, end users and developers require TOTALLY DIFFERENT kinds of PCs (again, that is probably why he asked about TS). What is usually adequate for an end user is not for a developer. Hence, I don't know of any software house using Terminal Server. It is usually deployed to lower costs in a company. TS also means that users are generally restricted in what they can do (I DO NOT mean security here!!). Installing software, shoving a CD with your favorite ActiveX library into the drive, etc., is not very workable. In other words, in an environment where things change and you install software all the time, TS will not cut it. Users NEVER have to install software in a large company when using TS. TS is restrictive in use, and in fact this is one reason to use it!!

I love TS, but for a development house... hmm, it probably will not work. If all the tools remain constant on each PC, then again TS might work.

The other big bonus of Terminal Server is that you can then get to your desktop anywhere in the country, anytime. A high-speed net connection means that you can work at home, etc. TS also means that much data and work gets centralized (by force!). It can also increase company security for intellectual property (for example, losing a notebook to theft will not result in the loss of much work, since everything important is on the server).

You also have to ask what, and how much, you want on your notebooks.

However, as mentioned, end users are at the opposite end of the Grand Canyon compared to developers in their PC requirements. Thus, general installing and testing of stuff is NOT an area in which TS is going to work. (Perhaps you could do that on the local PCs, or on a few special PCs in the office for that purpose.)

I will ask around, but it seems to me that replicating each hard disk to an image on the server would be the ideal solution (and it might not even require RAID).


Albert D. Kallal
Edmonton, Alberta Canada
Kallal@msn.com

Albert D. Kallal
Thursday, January 23, 2003

>>> Similarly the concept of having an OS partition and a data partition: to resurrect such a system, you need to reinstall the OS (60 minutes for Windows XP including service packs) and only then can you start getting the user's data -- but what about all the little utility programs and customizations that the user did? The registry? Etc. <<<

Joel, the point is that you make the Ghost clone AFTER you have installed all you mention. It takes me 3 hours to set up my C partition with the utilities, shortcuts, and MS Office but only 12 minutes to do a complete restore.

Gigabit Ethernet over Cat 5e (better than Cat 6) will cost you about $60 a NIC, but the switches are expensive. I'd go for gigabit copper to the server and 100baseTX to the desktops.

The reason for getting SCSI drives is partly speed, but more so reliability, since top class drives are normally SCSI. If you're doing it for the desktop maybe not worth it, but probably worth it for the server. But you know your exact requirements.

How often is each worker checking work in, though? If they're all working separately most of the time, then local storage would seem to be the best.

What I do not understand is why on earth you would need each workstation to have a complete mirrored image. Get the workstations all the same, and simply keep a spare hard drive with the cloned copy on it. Use another DISK, not a partition, for the data. Then if the system disk goes, just put in the spare one. If the data disk goes you'll have to use the backup, but you say the data is not the problem.

The real problem here Joel, is that some of us don't seem able to understand why you need two days to get things back running.

Incidentally, are you talking about your developers' machines or the order and inventory database?

Stephen Jones
Thursday, January 23, 2003

Option 1, CITRIX, not Terminal Services.

As long as developers can install software too.

Alberto
Thursday, January 23, 2003

I've just read Albert's reply. Possibly I am not taking into account the speed at which you make changes. If it is infeasible to keep the Ghost backups up to date, then a RAID mirror would seem OK. For belt and braces, also do a Ghost backup to a file server over the network every night for every workstation as well (though you might find the extra odds of starting an electrical fire outweigh the additional safety :-) )

However, as you were elsewhere referring to the user not having a hard drive at all, I presumed that the non-data setup was standard for everybody.

Stephen Jones
Thursday, January 23, 2003

>>Raid is not the solution, it just moves the
>>problem from the disk to the controller.

Maybe I misunderstand RAID but if your RAID controller fails can't you just:

1. take the busted RAID card out of the machine.
2. Plug either of the mirrored HDDs into the normal IDE controller on the motherboard
3. Turn the PC back on.

Takes 5 mins. No lost data. Replace RAID card at your leisure.

Or is it more complex than that?

Andrew Reid
Thursday, January 23, 2003

I can get my system reinstalled in 1/2 a day!

As far as your original question, Joel, I highly recommend the RAID solution. 

OPTION 1.  Although I love terminal services, I would hate to develop on it every single day.  There's too much I want to transfer between my computer & the server that the client won't easily allow.  There's also a mental dissonance that I can't quite shake. 

OPTION 2.  I have no experience with this.  All I can say is that I don't like the sounds of it.  ;)

OPTION 3.  Yeah!!!  Of course all of our servers have this, and it has been a life saver.  "Oh, looky there, a hard drive crashed.  Time to call Dell" vs. "Why the hell is server X down?!"

As for the last question, maybe it's time to hire that person to be on staff?  These guys usually make good networking/tech support guys too.

not my regular made up name
Thursday, January 23, 2003

If the controller fails, it could have corrupted your drives long before it actually died. That is why I said earlier you need a minimum of two controllers and two hard drives. The cheapest configuration would be to mirror.

SCSI may be noisier, but you don't need 15K screamers for workstations. IDE drives are cheaper by the dozen and worth exactly what you pay for them.

Computer costs have plummeted, true, but if you want reliability you still have to pay. The $299 Walmart special is fine for emailing friends and surfing, but if your livelihood depends on the computer, don't look for the cheapest way out. "Hey, I found a brain surgeon who will remove my tumor for half what the other guy wanted to charge." "What a bargain." You get what you pay for, even in computers. I am referring to hardware ONLY.

Crusty Admin
Thursday, January 23, 2003

Joel,

I've gone through the exact same thing myself. And unfortunately, I didn't think to try posting for opinions; I simply tried each of your ideas (in the same order, in fact).

Terminal Server: As has been said already, it is good for deployment but poor for development. No matter what you have for a server, it will be slower. You can test this by using Windows 2000 Server's TS administration mode (every Win2K Server comes with an admin TS license). You'll notice right away that even with just one user on, it is sluggish.

Network Storage: This wasn't so bad. Our switch wasn't so hot, so I'm going to assume that a better switch would give better performance than I got in testing. But even with a poor switch, it did work. My problem was that booting up took way too long. I also didn't like that we were stuck with all of our eggs (our data) in one basket (the server's array). I ended up concluding that 2 days wasted on one developer was better than 2 days wasted on every developer (the server goes down, you all go down).

Local RAID: This is what we ended up with. It worked great. We didn't use stripes, however, just mirrors. This was because of "controller fear". My experience has been that when a controller goes nutty, it will often corrupt your striped data on both mirrors. But when it is just mirrored, only one of the mirrors ends up corrupt (anyone else notice this?). And by using plain mirrored drives, if the controller failed we could simply plug a drive directly into the standard IDE controller until the card could be replaced.

Marc
Thursday, January 23, 2003

RAID? Backups? What is all this?

And why is Joel not applying his own costing logic found elsewhere in the site?

Ghost would be a good thing, particularly if you have a spare drive lying about, although this doesn't address the core issue of component failure in a business-critical machine. A Ghosted drive does not help if the motherboard or PSU fails.

Solution: Build a developer-spec PC; it doesn't have to be super fast, just adequate for development work. (Ghost the HDD at this point, or not; your choice.)

Test that the PC works OK, then put it in a store cupboard. Upgrade it periodically.

When a developer PC fails, immediately swap it out for the spare, reload projects from source code control and carry on. Meanwhile, the PC technician can fix the original machine "offline". Yes, you lose what was being worked on that morning, but that is also true in the event of HDD failure.

The main advantage is that developer time is not wasted. Think:
Developer downtime (lost productivity etc.) vs. £200 for a spare PC.

What's your hourly consulting rate, Joel? $250? I'm willing to bet that a half day of it is worth more than the cost of some spare hardware (he says, neatly ignoring the difference between cost and price).

Justin
Thursday, January 23, 2003

Justin, I think Joel's downtime is because a development PC can change many times during the course of a day. The one in the closet is outdated immediately.

Crusty Admin
Thursday, January 23, 2003

Joel,

I think the problem with your backup strategy is that you aren't using backup software that knows how to completely backup Windows and do a "bare metal" (Veritas term) restore.  Typically the software will use some kind of bootable floppy or CD to get your machine up and talking to the backup server, and then it will restore everything including your OS, registry, data, etc.  Most of the Veritas products are able to do this and I believe some versions of Ghost and several other Ghost-like products can do this.  You shouldn't have to spend time installing the OS at all.

I know when I was doing some research on the subject a year or so ago, Veritas had a product specifically targeted at workstations. I think it actually stores the backups in a compressed format on the server's hard drive and only moves files to tape if they aren't used frequently. This allows users to request a restore of a specific file without usually having to worry about loading tapes. I believe it is Veritas NetBackup Professional.

As far as hard drive reliability goes, there are definitely some SCSI drives with much higher MTBF. IDE is all about price; SCSI is generally about performance and reliability. The MTBF for most IDE drives is in the 500K to 600K hour range. The Seagate Cheetah drives are at 1200K hours.
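
As a rough sketch of what those MTBF numbers mean (this assumes a constant failure rate, which is what quoting a single MTBF figure implies; real drives fail more often when very young or very old, and the 550K-hour figure is just a midpoint of the IDE range above):

# Converting vendor MTBF figures to approximate annual failure odds.
HOURS_PER_YEAR = 24 * 365

for name, mtbf_hours in [("typical IDE drive", 550000),
                         ("Seagate Cheetah (SCSI)", 1200000)]:
    annual_rate = HOURS_PER_YEAR / mtbf_hours
    print(f"{name}: ~{annual_rate:.1%} chance of failure per year")

# typical IDE drive: ~1.6% chance of failure per year
# Seagate Cheetah (SCSI): ~0.7% chance of failure per year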

If you can stand the extra noise, I also recommend using bay coolers for your hard drives.  They are basically 5 1/4 inch frames with fans in the front to pull air in over the drives.  Most OEMs design for cheapest cost and lowest noise levels, which usually means proper cooling suffers.  Heat is the enemy of hardware.

Anonymous
Thursday, January 23, 2003

I like the private RAID idea.

I've had bad luck with those el cheapo IDE RAID controllers. They're often plagued by poorly-written drivers. I've lost many hours troubleshooting a HighPoint IDE RAID controller.

The best solution is to invest in a 3ware IDE-RAID system, but those puppies can get really pricey.

Take a look at this:
http://www.accusys.com.tw/75.htm

I use it in my systems. It provides hot-swap RAID 1 with no drivers. It's a simple device, but it works well.

Myron Semack
Thursday, January 23, 2003

a) It struck me as odd that a single hard drive failure led to re-installation of several machines.

b) You can slipstream the service packs into the Windows XP installation, which speeds up the process considerably.

c) Taking "several" days to re-install a couple of machines is clearly unacceptable in terms of a disaster recovery plan.  What would you do if you had a fire or something?  It sounds to me like you may have a bigger problem - even if you add redundancy with RAID - what will you do in the case of a loss of the physical machine?

Anyway, to answer your questions:

Option 1: Terminal Services. 
-Adds three different "single points of failure" (network, server, junky computers).  If the network and server have problems (inevitable, even with RAID), your entire staff is non-productive.
-I'm pretty sure that triple-monitor clients are not supported.

Option 2: Network Storage.
Q. Is this going to be slower than local drives? A. Yes. Think about it: a single PC with a standard ATA/133 hard drive (133 MBps) can by itself use more bandwidth than a 1000 Mbps pipe provides. So if you have multiple PCs accessing the server, the bandwidth each can use will be less than if it were just using a hard drive on the local machine. You can add more than one network card to the server, which would help, but there's a limit to how many you can add. Furthermore, there's an overall bandwidth limit on switches; most don't have a throughput of max speed times the number of ports. Finally, although both 133 MBps and 1000 Mbps are theoretical numbers, there's higher overhead in the network world than in the hard drive world, so you are more likely to get the 133 MBps than the 1000 Mbps. So a hard drive will beat the network any day.
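
To put rough numbers on that (theoretical peaks only; the point is that a single workstation's disk interface already matches the entire gigabit pipe, before any protocol overhead or sharing):

# Theoretical peak bandwidth: local ATA/133 vs. gigabit Ethernet.
ata133_mb_per_s = 133          # ATA/133 interface: 133 megabytes/second
gigabit_mb_per_s = 1000 / 8    # 1000 megabits/second = 125 megabytes/second

print(f"ATA/133 peak: {ata133_mb_per_s} MB/s, per machine")
print(f"Gigabit peak: {gigabit_mb_per_s:.0f} MB/s, shared by every diskless client")

for n in (1, 5, 10):
    print(f"{n} clients: at best ~{gigabit_mb_per_s / n:.1f} MB/s each")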

I should add that I have some idea of what it's like working in such an environment. We have a fairly server-centric approach with all data stored on the network, although not as extreme as the one you propose. We've had so many minor glitches relating to this approach (the switch acting as a hub, the network going down, a developer doing something bad that slows the server to a nearly unusable state, etc.) that we've lately decided we would be more productive with a different approach. The biggest issue is that if there are any problems at all with the server or the network, we're out of business until the problem is fixed.

Anyway, if you are interested in learning more about this option, I'd recommend taking a look at iSCSI. However, FWIW, I think this approach is somewhat at odds with your "keep developers happy by getting them the best equipment money can buy" philosophy.

Option 3: Private IDE RAID.
This is the option I'd choose if I had to pick one of your three options. But I would note that while it adds a level of redundancy (a good thing (tm)), it also adds a level of complexity (a bad thing). My personal experience (admittedly limited and somewhat second-hand) with RAID is that it generally causes more downtime than it prevents. This is particularly true if you skimp on the hardware and/or have the RAID machine set up or maintained by anyone who does not have a thorough (read: expert) handle on what they are doing. If you are not a RAID expert, I'd consider looking at the alternatives to RAID.

(you did ask for "opinions or other suggestions"):

a) Purchase identical systems. (Easier to isolate hardware problems, easier to implement an effective disaster recovery plan)

b) Replace said systems at least every two years. (reduces likelihood of failure)

c) Place all hard drives in removable drive bays. Keep at least one spare hard drive for every ten machines (the spare should be identical to the drive currently in those machines). (To put in a new drive: turn off the machine, pull out the bay, replace with the spare. Fast and easy.)

d) Keep the operating system and programs on a physically separate drive from any data. (If one drive in the machine fails, you are either reinstalling/restoring programs etc., or restoring data from backup, but not both!)

e) Use roaming profiles. (If a client fails, you keep your settings. If the server fails, well, you've backed up your settings with your "excellent backup strategy", right?)

f) Keep a master disk with at least the basic OS and frequently used programs. Apply any service packs etc. to the master disk. You can then Ghost from the master disk to your replacement drive, or even better, get one of those hardware disk duplicators that copies from one master disk to a bunch of disks (hardware-based Ghosting).
OR
You could also substitute the hard drive from one of your current (working) machines instead of a master, although you'd be stuck if all of your machines were toasted somehow.

Incidentally, you should be using Ghost or something like it if downtime is a problem for you, whether you have RAID or not. RAID is not a cure-all and, like everything else, can fail. And if the RAID controller fails, there's a very good probability that it will take your RAID sets with it, in which case you are back where you started.

g) Really concerned about downtime? Keep a spare machine.

The bottom line is that anything can and will fail, so you need to figure out a way to a) isolate failure to the smallest number of people possible, b) minimize the impact of failure in terms of time and money and c) minimize time of failure recovery.

RAID solutions will help towards those goals - but they should only be a small part of your overall disaster recovery plans.

Phibian
Thursday, January 23, 2003

Phibian,

Are there any websites where you picked up all this info? I am sure it comes with experience, but.....

thanks,

Prakash S
Thursday, January 23, 2003

Myron,

The 2 port 3ware controller is actually affordable. I bought a few a while back for ~$120.

1 2 port 3ware card + 2 60GB 7200 RPM IDE drives results in a nice 60GB hardware RAID 1 subsystem that you can get for about $300.

That's only a little bit more than the cost of 1 40GB SCSI drive.

Joe Smalle
Thursday, January 23, 2003

Having worked at 3Com on the firmware for network booting (PXE, etc.), I recommend your proposed network boot solution. Good PCs, a great network architecture, and a kick-ass server with mirroring for peace of mind. We ran a good batch of ~10-15 computers from a simple mediocre server with high-speed Ethernet, and although Windows 2000 took a little longer to boot, it wasn't a big deal. Once it booted with a virtual network C: drive, the thing ran without any notice of where the files resided. 3Com NICs come PXE-ready and Windows 2000 has our code base in it. Give your PCs 256-512 megs of RAM and you're fine.

Done deal.

sedwo
Thursday, January 23, 2003

This may sound kinda crazy, but I've been doing my development within VMware. I make an occasional backup by powering off the VM and copying it onto a network drive overnight. You do lose a little bit of speed while running in a VM, but with a reasonably fast box, it's pretty good.

If your machine does die miserably, it's pretty simple to get a replacement box, copy the disk files over, and be productive again.

codemonkey
Thursday, January 23, 2003

Stephen, I'd say that the D: (data) drive should be a separate physical drive entirely, which means it can be swapped out to another PC if ever the OS/software drive goes tits up. That means the developer is instantly productive again, albeit missing their favourite settings and utilities and such. When they get their own PC back they'll get those back too, so they will just have to be patient.

Other than that, private RAID. If the RAID controller does barf, would you rather lose one developer's data or everyone's? And it sure as hell is quicker accessing local drives than network ones; your developers would hate you if you forced that on them.


Friday, January 24, 2003

Could someone tell me what the problem is with overnight ghosting (every night, every dev machine) on a fully switched network? This was my plan for our shop, but there might be some snags I overlooked.
Here is what I would like:

dev machines (hardware all image-equivalent)
                    |
      switched 100 Mbps
                    |
GBit line to a different building (for offsite)
                    |
      switched 100 Mbps
                    |
cheap machines dedicated to storing the images

one standby dev hardware per 25 devs.

Does this sound ok? dumb?

I can second the statement that RAID sometimes does more damage than good, although I only have experience with server SCSI RAID, not desktop IDE RAID.

Just me (Sir to you)
Friday, January 24, 2003

Something that I've not seen mentioned is VMware.

It's not a perfect solution but it can save the day if needs be (it worked for me more than once...)

A VMware image can be set up as a complete developer configuration (Windows, VS, SQL Server, Office, whatever) and stored on a network or removable drive (like FireWire or USB 2.0).

As the data lies on the central server (along with CVS), you can just take a blank machine, install VMware on it (takes 2 minutes on a Win XP preinstalled PC), copy the image, boot it up, check out the codebase and there you go. Of course, it is not as efficient as a standard config, but it will let you continue development right away.

Philippe Back
Friday, January 24, 2003

<disclaimer>
I'm making an assumption here that you bought off-the-shelf PCs. So if you do have those Seagate Cheetah SCSIs installed, flame me ;)
</disclaimer>

I would recommend option 4:

OPTION 4: Get reliable hard drives.

Why not option 1..3:
Like others say: don't create a single point of failure for all your developers. If the server/network goes down, you will all be drinking very expensive coffee. And RAID has problems too. Wacky controllers can corrupt both drives. If one RAID drive fails (twice as likely as in a single-drive PC) you still lose time buying/finding the spare drive and installing it. If you go RAID, make sure it's mirrored and get an A+ quality controller. Plus, all these solutions take a considerable amount of time and money to set up.

Why option 4:
It's the least complex, cheapest and most reliable. A certain someone mentioned that a programmer should get the best tools he can buy. When reading this, most people think of software, not hardware. Companies buy off-the-shelf PCs without considering what components go into them. Many A-brand PCs shipped with the notoriously unreliable IBM 75GXP drives. The saying goes that you don't get fired for buying IBM. In this case I think they should. Buy reliable drives and you will suffer a factor of 10 fewer hard drive failures.

Drive reliability data is readily available here (registration required):
http://storagereview.com/map/lm.cgi/survey_login

It shows that SCSI drives are indeed much more reliable. There's a reason why SCSI drives come with 5-year warranties and most IDE drives with 1 year. Personally, I still prefer a good IDE drive, as SCSI drives generally make more noise. So how much noise do drives make? Check here (idle noise only, though):
http://storagereview.com/comparison.html

I like my systems reliable, quiet and fast. All in that particular order.

Jan Derk
Friday, January 24, 2003

Hmmm, there's more round the houses, cross the road, get a bus, miss the train, walk home and end up next door to where you started from on this thread than I can remember for some while.

I lean towards keeping all sources on the server, even in multiple-developer scenarios, but that doesn't solve the hot-fix problem of the 'I just lost my OS and my registry and my favourite tools' mess.

Personally I live with it, if only because the one good thing from having a hard disk crash on a workstation is that it gets rid of all the crap on the drive as well.  Life is too short to get rid of all the squirrely utilities and COM objects you evaluated.

About the only solution that gives you immediate startup is disk duplexing: two controllers that write the same data to two drives simultaneously. If you can't find a duplex driver (and on my cursory look I didn't find one), then you can use RAID 1.

RAID 1 is disk mirroring, and it is the highest fault tolerance you can get with a single controller. Forget about RAID 4 and 5; they are really only suitable for servers.

http://www.jwilcox.com/raid_defined.htm  for a reasonable definition of RAID.

I'd still keep all the sources and data on the server though.

Simon Lucy
Friday, January 24, 2003

We do fast machines with local HDs. All user data is stored on the server with synchronization to the local machine, for offline use.  HDs are imaged with Ghost.

A hard drive dies, install the new one, pull the image from the network and log the user in.  All done.

BC

BC
Friday, January 24, 2003

Why not just have two PCs on every developer's desk? When you install something on your main machine, you then install it on the other one while you are doing some work. PC1 goes down, you just move to PC2 and download your latest code from some network source. The other nice thing about this: when you have one of those days where you install something on PC1 and it breaks some other program, you can go right to PC2 if you need to get back to work quickly, and just have PC1 image itself off PC2 that night or even during lunch. Is it perfect? I don't know, but it sounds simple enough.

Jeff
Friday, January 24, 2003

I don't understand what is wrong with wanting more reliability from one's machines. Backups do not work. Often I have to take down a piece of information when someone calls and I have no pen, so I open up Metapad and save it to the desktop. If my HD then dies a few minutes later, before a backup is made, I have to get back in touch with the person and look unprofessional.

Option 3 is interesting. In Hennessy & Patterson, it seems that the most important thing with RAID 0+1 is "mean time to repair." Since the two mirrored drives are in the same environment, when one pops it's often only a matter of time before the other does too. How does the controller signal that one has errors? A software popup is probably more unreliable than any drive, so are you meant to keep an eye on the blinkenlights?
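
For what it's worth, the standard Hennessy & Patterson approximation for a two-disk mirror is MTTDL = MTBF^2 / (2 * MTTR), and it assumes independent failures, which is exactly the assumption being questioned above. A quick sketch (the MTBF figure is an assumed typical quoted value, not a measurement):

# Mean time to data loss for a two-disk mirror (RAID 1), per the
# standard approximation MTTDL = MTBF^2 / (2 * MTTR).  Assumes
# independent failures; correlated failures (same batch, same heat,
# same power supply) make the real number far worse.
MTBF = 550000.0  # hours; a typical quoted IDE figure (assumed)

for mttr in (24, 24 * 7):  # repair window of a day vs. a week
    mttdl_years = MTBF ** 2 / (2 * mttr) / (24 * 365)
    print(f"MTTR of {mttr} hours: mean time to data loss ~{mttdl_years:,.0f} years")

# The theoretical numbers are absurdly large either way; in practice the
# risk is dominated by correlated failures and by nobody noticing the
# first dead drive, which is why mean time to repair matters so much.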

Options 1 & 2 look dangerous. Maybe your entire staff has learned how to deal with an app server. But what about your new guy, whom you've started out coding "Hello world" and who is staying late after everyone has left, working on internationalization to impress you? Should he troubleshoot the shared server? What if it was a faulty switch? When I used HP-UX, I wasn't exactly glad to be dependent on the sysadmin. Errors in those systems tend to have global consequences and are painful to debug.

Obviously these are theoretical questions, but they're basic.  Where does one go to answer them? 

BTW, aren't there SCSI problems with WinXP Pro?
http://forums.storagereview.net/viewtopic.php?t=1758

Tj
Friday, January 24, 2003

The slow SCSI performance problem is a known issue in Windows XP and was fixed with a new NTFS driver long ago, in both a hotfix and Service Pack 1 (Sep 2002). For example, see:
http://support.microsoft.com/default.aspx?scid=kb;en-us;308219
Hard Disk Performance Is Slower Than You Expect (with SCSI and NTFS).

Philip Dickerson
Friday, January 24, 2003

Joel, regarding your options -
1) TS Client - your developers are now working over a 100 Mbit pipe instead of the hardware architecture that we've all been driving for the past fifteen years. In addition, you can't multimonitor TS the way you can in the OS (when I use multiple monitors, it's like Tom Cruise in "Minority Report" - grabbing and dragging windows around to where I need them).

2) NetPCs. There's a reason this "solution in search of a problem" was stillborn. You're letting architecture get in the way of what your developers want to do. Let me also point out that if *I* applied for a job and was told everyone worked on NetPCs, then I would keep looking. (This applies to point (1) above, as well.)

3) RAID. #1 question - why IDE? Why not go SCSI? Again, there's a reason that any failure-tolerant system has a SCSI architecture. Price out boxes with three 18 GB SCSI drives. (Note - you can also get drives that are 2x as fast in the deal.)
Also, make sure you're getting the right kind of RAID. Either mirroring or RAID 5 will buy you breathing space while you replace the dead drive. Make sure your RAID solution can rebuild on the fly. When you buy the systems, buy at least five spare hard drives and put them on the shelf. If you're very, very careful you might be able to get systems with hot-swappable SCA drives (which always seem to be on clearance).

If you do go with IDE RAID, make sure you can get either mirroring or RAID 5 (*not* striping).

Finally - I've found that emails to Alienware and Polywell have gone unanswered. I just got a nice fast reply from a company called xdream-machines.com, which has really nice configurators for their (intelligently configured) systems, and seems to be reasonably priced. However I haven't purchased from them, so can't comment there.

Best of luck!

Philo

Philip Janus
Friday, January 24, 2003

If you have a large number of developers I would suggest setting up a warm backup system.

I must say this is an expensive solution, both in setup and maintenance, but at least it prevents you from having 50 developers idle for 1-2 days at a cost of at least $50/man-hour... ummm, that's $20,000/day.
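
Checking that arithmetic (the 8-hour day is my assumption):

# 50 idle developers at $50/man-hour, 8 hours/day
developers = 50
rate_per_hour = 50   # dollars
hours_per_day = 8
print(developers * rate_per_hour * hours_per_day)  # 20000 -> $20,000/day, as stated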

For small development shops, probably RAID on the server, backups, offsite backup archives and UPSs will prevent most of the trouble, but downtime is still highly probable.

Finally, for developers I would pick fast machines with tons of memory and hard drive space, such that compilation takes a reasonably small amount of time and uses only local storage (instead of bandwidth).

Dino
Friday, January 24, 2003

One small point on disk drive reliability: keep them spinning.  Don't power down the computer when you go home, or on vacation for that matter.  Disable the power saving feature that shuts down drives.  Stopping and starting, heating up and cooling down, are not good for them.

I believe at least part of the reputation that SCSI has for greater reliability comes from being used in servers that run continuously.

RH
Friday, January 24, 2003

http://www.haifa.il.ibm.com/projects/storage/iboot/index.html explains a proprietary hack by IBM in Israel that tries to fool the BIOS into booting from a network drive using the iSCSI protocol. I don't know much about iSCSI and there's very little information about this project, so maybe it is not very useful info :-)

Li-fan Chen
Saturday, January 25, 2003

1. Developing through TS (or Citrix) doesn't cut it.
2. Your network goes down and everybody's hosed.
3. SCSI RAID 5 is the only way to go to achieve what you want. Bite the bullet and buy your developers server-class machines. They'll love it. You'll have peace of mind. Calculate the cost and see if it would be worth it to avoid this past week. If it is, do it. This is a proven solution that works.

John Cavnar-Johnson
Sunday, January 26, 2003

Dual PCs for your developers.
SCSI RAID.
Hot swappable hard drives.
Best Options.

WNC
Sunday, January 26, 2003

Since everyone else has, I'll chip in with my opinion:

RAID 1 in the desktop. IDE for price, with a reputable, reliable controller (e.g. 3ware, Adaptec...), and buy 3 identical drives per machine. Your drive will not blow up today; it'll blow up in 18 months, when the specific model is discontinued and the new "60 gig" HD has a different configuration and may itself be discontinued or only available from the manufacturer's budget line. While I'm at it, whatever happened to 10 gig drives?

And now for the ever popular "How come it's taken you 2 days to get everything back?" Well, I can see that very well, been there, put in the long hours and fought my way through the small little things that "should've been," the vendors that are "back-ordered" and unresponsive, etc.

Now, if I spec a server, it has redundant PSUs, hardware SCSI RAID with a hot spare, a spare HD and PSU available on site, and ECC RAM (usually standard with whatever can meet all the previous requirements). The main reason being that when the shit hits the fan, it's really just a matter of your colo facility calling you about "that bleeping server of yours that has been beeping continuously for 2 hours," and it's resolved in 5 minutes after swapping the PSU and/or HDD that blew up.

Alex
Sunday, January 26, 2003

" Maybe I misunderstand RAID but if your RAID controller fails can't you just:

1. take the busted RAID card out of the machine.
2. Plug either of the mirrored HDDs into the normal IDE controller on the motherboard
3. Turn the PC back on.

Takes 5 mins. No lost data. Replace RAID card at your leisure.

Or is it more complex than that?"

In theory, that is how it is supposed to work, but in practice, I have never seen it so.

Eddy Young
Sunday, January 26, 2003

I use the following process:

A development computer has 2 hard drives: 1 for the OS, 1 for data.

I use Second Copy to constantly (every 15 minutes or so) keep the data drive backed up to my server.

Only when the OS drive goes out am I in for a bit of work, and the backup is very current if the data drive goes.
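
(For the curious: Second Copy is a commercial sync tool; the same periodic-mirror idea, sketched in Python with made-up paths and interval, looks roughly like this.)

import os
import shutil
import time

SOURCE = r"D:\data"               # local data drive (hypothetical path)
DEST = r"\\server\backup\gregor"  # server share (hypothetical path)

def mirror(src, dst):
    # Walk the data drive and copy anything new or changed since the
    # last pass; deletions are deliberately ignored in this sketch.
    for root, dirs, files in os.walk(src):
        target_dir = os.path.join(dst, os.path.relpath(root, src))
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            s = os.path.join(root, name)
            d = os.path.join(target_dir, name)
            if not os.path.exists(d) or os.path.getmtime(s) > os.path.getmtime(d):
                shutil.copy2(s, d)

while True:
    mirror(SOURCE, DEST)
    time.sleep(15 * 60)  # every 15 minutes or so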

Gregor Brandt
Sunday, January 26, 2003

I've also had pretty bad experiences with cheapo IDE RAID controllers. Buggy drivers, slow boot times, extreme wackiness. Also, it's a good idea not to forget to put the two drives for striping on different channels. And, um, do something about temperature control inside the case.

deadprogrammer
Monday, January 27, 2003
