Fog Creek Software
Discussion Board




Flipping bits (weird hardware problem?)

I'll start this off by saying I work in a small company with old cheap hardware and a short deadline.  In better news, we have a build server and a daily build running.  (In fact, we've improved +6 points on the Joel test since I started working here, makes me proud.)

Currently I'm having an obscure problem with the build server that looks to be hardware related.  Bits flip.  For no apparent reason.  I can assure you that last night a VC++ include file said "bind" (because the nightly build ran through it without error) where today it says "b)nd" and won't compile.  Lowercase i (binary 01101001, 0x69) --> close paren (binary 00101001, 0x29): exactly one bit flipped.  The modification date on the file hasn't changed.  This is the second time this has happened (the first time, I re-Ghosted the machine with its image and planned to do something about it if it happened again).
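For anyone who wants to check the arithmetic, here is a minimal Python sketch that XORs the two characters and lists the bit positions that differ. The function name `flipped_bits` is made up for illustration; it is not part of any tool mentioned in this thread.

```python
def flipped_bits(a: str, b: str) -> list[int]:
    """Return the bit positions (0 = least significant) where two characters differ."""
    diff = ord(a) ^ ord(b)  # XOR leaves a 1 wherever the two bytes disagree
    return [pos for pos in range(8) if diff & (1 << pos)]


print(format(ord("i"), "08b"))  # 01101001
print(format(ord(")"), "08b"))  # 00101001
print(flipped_bits("i", ")"))   # [6]
```

Only bit 6 (value 0x40) differs, which fits a single-bit hardware error rather than a software overwrite.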

Our hardware guy has checked the RAM and run scandisk -- nothing appears to be wrong.  He's stumped.  I'm afraid the same thing will happen / is happening in places where it can't be corrected for, and we'll end up with flaky deliverables despite the process.
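One cheap way to find out whether the same thing is happening elsewhere might be to record a hash of every file after a known-good build and compare it later. The sketch below is only that, a sketch: the function names are hypothetical, SHA-256 is an arbitrary choice, and a hash computed through faulty RAM could itself be wrong, so a mismatch is a signal to investigate rather than proof of where the fault lies.

```python
import hashlib
from pathlib import Path


def build_manifest(root: str) -> dict[str, str]:
    """Map each file under root (by relative path) to the SHA-256 of its contents."""
    manifest = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(root))] = digest
    return manifest


def silently_changed(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """List files present in both manifests whose contents differ."""
    return sorted(name for name in before
                  if name in after and before[name] != after[name])
```

Run `build_manifest` on the include directory right after a good nightly build, keep the result on another machine, and compare against a fresh manifest when a build breaks. Because the comparison looks only at contents, it catches changes that leave the modification date untouched.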

I grant that buying an all-new machine and reinstalling everything should fix the problem, if it's hardware, but that takes A) time and B) money.  Because we're operating penny wise, pound foolish, the $500 + 2 days for a replacement machine probably won't be forthcoming until we waste a greater-value amount of our time proving there is in fact a problem. ;)

That sad state of affairs aside, it is an interesting head-scratcher.  Any hardware sleuths who could suggest theories, or even better, diagnostic tools with which to test said theories, will get...well, my thanks and admiration, the knowledge that they are furthering techie entertainment at the cost of actually getting things done, and that's pretty much it, darn the management's cheap little hearts.

Mikayla
Monday, February 03, 2003

Hi,

I had a similar problem. It turned out that somebody had mixed some 100 MHz RAM with 133 MHz RAM. That led to random bit flips, which were really hard to track down (fixing it is simple, though: just replace the RAM). It took me almost a week to find that out. (Man, I should have had daily backups.)

A simple ScanDisk or memory check might not be enough to show that the memory or the hard disk is okay. You need some heavy-duty tools like DocMem. For the hard disk, there are diagnostic tools for each brand. For mine, I downloaded a bootable disk from Maxtor's site which runs tests as thorough as it gets. (BTW, that tool uses DR-DOS.) Also check the CPU temp. The CD-ROM that comes with the motherboard might have some other diagnostic tools too.

If you have two similar machines, you can replace the HW piece by piece and do some control experiments.

Good luck.

S.C.
Monday, February 03, 2003

DocMem is, I believe, free, and may be able to find the problem.

I would have thought you would be using ECC memory on a server, though.

Scott Mueller always reckoned that the change to non-parity memory would cause problems.

Check for magnetic interference: an A/C unit or a motor near the server, for example.

Stephen Jones
Monday, February 03, 2003

Probably not related, but we had a similar experience a while back where the build machine just suddenly fell over with compile syntax errors everywhere.

It turned out that the server that ran the SourceSafe database had been upgraded over the weekend and the new network card kept having collisions and sending across incorrect packet data. Which is strange, as I thought this is the sort of thing TCP sorts out when IP packets come in with incorrect characters or in the wrong order.
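On the TCP point: TCP does carry a checksum, but it is only a 16-bit ones' complement sum, so some kinds of damage pass through it unnoticed. Whether that is what happened in this case is not known. The sketch below is a plain Python version of that style of checksum (the function name is made up here), showing that a single flipped bit is caught while two 16-bit words arriving in swapped order are not.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum of the kind used by TCP and IP."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length data with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


# A single flipped bit changes the checksum...
print(internet_checksum(b"bind") == internet_checksum(b"b)nd"))          # False
# ...but reordered 16-bit words do not, because addition is commutative.
print(internet_checksum(b"bindsend") == internet_checksum(b"sendbind"))  # True
```

So a card that corrupts data in the right (wrong) way can get bad bytes past the transport layer, which is one more reason to verify files end to end.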

Switched the network card in the server over and everything ran fine...

Better than being unemployed...
Monday, February 03, 2003

I've had this problem when my machine was overheating (AMD K6, crappy case).  I still use it (it's my network firewall) but I have to run it with the case off or it locks up or corrupts the filesystem.

This was very hard to diagnose.

Wayne Venables
Monday, February 03, 2003

Make sure your fans are all running.
I have had a problem like this, and the cause was heat!

moses whitecotton
Monday, February 03, 2003

Thanks all.  Nothing has turned up a definite cause yet, but I will keep trying your suggestions (and try some of the low-tech "can't hurt" solutions such as minimizing heat and magnetism in the area).

Mikayla
Monday, February 03, 2003

The most likely cause is mixed RAM speeds somewhere in the system. We had a very odd NT server system where at some point the customer (or their supplier) had mixed and matched RAM.
The other thing to watch out for is a duff RAID/SCSI card; again, if that has on-board RAM, the same problems can occur.
Network cards can also cause this to occur (particularly if the PCI bus timing is out).
It may also be data pattern sensitive.
Sadly, diagnosing this may be almost impossible.
Your first step should be to try and get this to repeat. If a system fails 4 out of 5 times then you should be able to narrow this down by swapping parts; if it fails 1 in 10 times then you could be in for a very long wait.
If you have an MSDN subscription then there are some Hardware Compatibility Lab test CDs which might be useful to help isolate the cause.
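To put rough numbers on the 4-in-5 versus 1-in-10 point above: if each test run fails independently with probability p, the chance of seeing at least one failure in n runs is 1 - (1 - p)^n. The sketch below solves that for n. The independence assumption and the 95% confidence figure are illustrative choices, not anything measured on this machine, and the function name is made up.

```python
import math


def runs_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Runs required to see at least one failure with the given confidence,
    assuming each run fails independently with probability failure_rate."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


print(runs_needed(0.8))  # 2
print(runs_needed(0.1))  # 29
```

In other words, after swapping a part you need about 2 clean runs to believe the fix at the 4-in-5 failure rate, but about 29 clean runs at the 1-in-10 rate, and that is per part swapped.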

Peter Ibbotson
Monday, February 03, 2003

It's your RAM.

There is no RAM test program available that can test all combinations of RAM state changes, so this stuff can slip through all the tests on your computer and back at the factory.

It's more of a problem nowadays with huge RAMs and no-name manufacturers.

Buy name brand RAM.

X. J. Scott
Monday, February 03, 2003

More than likely it is RAM, but I have seen rare cases where noise coming out of the power supply caused an issue like this, and one time a faulty motherboard did the same thing (extremely rare). The PS and MB issues happened with cheap old hardware.

cheers
MAD

A Software Build Guy
Monday, February 03, 2003

I agree RAM sounds like the most likely culprit, but if it is indeed the RAM, how would it have managed to change a file on his hard disk without updating the timestamp?  I would believe the RAM flipping a bit and causing the compile to fail, but then how did it write that error back to the file without the OS updating the timestamp on the file?

Mike McNertney
Monday, February 03, 2003

Well, I can't prove there's anything wrong with the RAM, but on the general wisdom I've changed it out anyway and checked the new RAM as best I can.  My hope at this point is that it won't happen again, or at least not until we get out of this part of the development cycle, when there's some chance of doing a better job fixing it.

Still doesn't explain how it happens, but at this point I'm willing to accept there are some things I'll never know. ;)

Mikayla
Tuesday, February 04, 2003

Apart from the physical RAM itself there's the system bus, the CPU RAM cache and the I/O path to the drive, and they can all be attacked by a bunch of factors.

If there's no temperature sensor on the motherboard then you could try sticking a temperature probe into the case close to the processor, but not attached to it.  If it's more than about 70 °C then you may have a temperature problem (this will vary depending on the processor).

The power supply might be glitching, or the on-board voltage regulators might be failing, so that capacitors behave like leaky buckets with the bottoms kicked out.  Or the power supply could even be introducing noise which isn't filtered out and which knocks out some component.

Or perhaps the motherboard itself has developed a fault.

If all the errors are only seen on saved files then it may be the path to the drive that's at fault.  The drive itself is unlikely to be faulty in this way, but the data cable might be.  IDE cables have to be quite short, as you get rapid attenuation of the signal on them.

So, unless I decided that life was too short and time too expensive (which is most likely what I would decide), I'd start with the easiest components and swap out the drive cable, check connections, etc, etc.  Then check the temperature inside the case, step down the processor to a slower speed if possible and then swap out the RAM.

If it still failed I'd take everything and put it in a new case and power supply.

But in reality I'd throw the machine away, apart from the drive, and build/buy a new one.

Simon Lucy
Tuesday, February 04, 2003
