Re: how can a bit be off in memory?



The Natural Philosopher wrote:
Charles T. Smith wrote:

Actually, I guess I'll buy the arguments of the ECC (EMA?) posters that
it shows why ECC is a good thing.

I don't. I am fairly certain you had a corrupt transfer off the disk.

I happen to like ECC memory, but that is probably not relevant to this
discussion because...

In my experience, that is almost certainly another device on the IO bus
that woke up when it shouldn't as some PARTICULAR address passed it by..

I once wrote an operating system and in the disk driver, I had it write a
checksum of each block into the block. This was in addition to the checksum
written and checked by the hardware. This machine was heavily used and in 4
years, I got only one disk failure from the disk system (two hard drives).
Of course those drives were small in terms of capacity: 40 Megabytes each.
But they were large physically, about the size of a washing machine.

Now the interesting thing (to me, anyway) was that the disk checksum that I
calculated agreed with what I got back, indicating that the transfer was OK,
when it was not. The data I read back was all zeros, and the checksum was
zero. In fact the transfer had failed completely and I got all zeros for
everything, checksum included. (I then initialized the checksum to something
other than zero, and that fixed that problem.)
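To make the all-zeros trap concrete, here is a minimal sketch in Python. The checksum algorithm the original driver used is not stated, so the 16-bit additive sum, the 512-byte block size, the seed value 0xA5A5, and the names block_checksum, make_block and verify are all assumptions made for illustration only.

```python
def block_checksum(payload: bytes, seed: int = 0) -> int:
    """16-bit additive checksum that starts from `seed` (assumed algorithm)."""
    total = seed & 0xFFFF
    for byte in payload:
        total = (total + byte) & 0xFFFF
    return total


def make_block(payload: bytes, seed: int) -> bytes:
    """Append the checksum (2 bytes, big-endian) to the payload."""
    return payload + block_checksum(payload, seed).to_bytes(2, "big")


def verify(block: bytes, seed: int) -> bool:
    """Recompute the checksum over the payload and compare with the stored one."""
    payload, stored = block[:-2], int.from_bytes(block[-2:], "big")
    return block_checksum(payload, seed) == stored


# A transfer that failed completely: 512 data bytes plus 2 checksum bytes,
# every one of them zero.
failed_read = bytes(514)

zero_seed_ok = verify(failed_read, seed=0)          # failure goes unnoticed
nonzero_seed_ok = verify(failed_read, seed=0xA5A5)  # failure is caught

# A genuinely good block still verifies with the nonzero seed.
good_block_ok = verify(make_block(b"data" * 128, 0xA5A5), seed=0xA5A5)
```

With a zero seed, the sum over an all-zero payload is zero, which matches the all-zero stored checksum, so the dead transfer looks valid. Any nonzero starting value breaks that coincidence.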

By rebooting you changed the memory address it was being loaded INTO.

That problem will remain..sometime someplace you will load another file
off disk into the same space and it will be corrupted.

I think at this point a complete breakdown of what cards and
motherboards you have, is in order..

Even that might not be enough. I had a PDP-11/45 with LOTS of peripheral
hardware. We had a contract with DEC for maintenance, and one day, while
doing PM, a techie dropped a screwdriver and blew out a few IO cards on the
Unibus. They replaced them, of course. But then the machine got flaky. Once
every few days it would crash. We had lots of test programs (on punched
paper tape in those days) and I ran them and discovered nothing. So I called
in the techies and they ran the test programs and said it must be user
software at fault. It felt like memory to me, but they insisted I write a 10
instruction program to show it. Eventually I lost it and ran each test
program for several hours instead of a minute or so. The double precision
floating add/subtract program would fail after a few hours. So they replaced
the FPU and that did not help at all. They replaced the memory, and that did
not help either. They finally sent down some techies from Maynard (or
wherever their engineering department was) and found an interface module
with a flaky address decoder card. That screwdriver made the address
decoder unreliable and when an address meant for another device went down
the bus, this card mistakenly recognized it as its own. We lost something
like 2 weeks with that one.

I am scratching my brains as to how to replicate this reliably too..you
want to copy huge files around probably on a bare bones system, and see
if you can reliably get a one byte corruption. Lacking hardware
emulation (which I used to find the issue we had) your only recourse is
probably to remove cards till the problem vanishes, and ditch whatever
card it was that caused the problem.
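One way to run that copy-and-check test is to compare each copy against its source and log the offsets of any differing bytes, so a repeatable one-byte corruption stands out. This is only a rough sketch in Python: the function names diff_offsets and demo, the 1 MiB chunk size, and the limit of 16 reported offsets are assumptions, and the demo fakes a single flipped bit rather than reproducing a real fault.

```python
import os
import tempfile


def diff_offsets(path_a, path_b, chunk_size=1 << 20, limit=16):
    """Return up to `limit` byte offsets at which the two files differ."""
    offsets = []
    base = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while len(offsets) < limit:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a and not b:
                break  # both files ended
            if a != b:
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        offsets.append(base + i)
                        if len(offsets) >= limit:
                            break
            if len(a) != len(b):
                # One file is shorter: record where it ends and stop.
                if len(offsets) < limit:
                    offsets.append(base + min(len(a), len(b)))
                break
            base += len(a)
    return offsets


def demo():
    """Write a file, make a 'copy' with one flipped bit, and locate it."""
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "src.bin")
        dst = os.path.join(d, "dst.bin")
        data = bytearray(os.urandom(4096))
        with open(src, "wb") as f:
            f.write(data)
        data[1234] ^= 0x01  # simulate a single corrupted byte
        with open(dst, "wb") as f:
            f.write(data)
        return diff_offsets(src, dst)
```

If the reported offsets keep landing at the same position, or at the same memory address once the load address is known, that points at one particular address being disturbed rather than at random noise.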

If it was a single byte that was corrupted it is almost certainly some 8
bit device that is the issue. That may help to pin it down. I didn't
know there were such anymore..



--
.~. Jean-David Beyer Registered Linux User 85642.
/V\ PGP-Key: 9A2FC99A Registered Machine 241939.
/( )\ Shrewsbury, New Jersey http://counter.li.org
^^-^^ 11:15:01 up 9 days, 18:51, 3 users, load average: 4.11, 4.18, 4.17
.


