Re: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!



Hi everybody.

After my last mails to this issue (btw: anything new in the meantime? I
received no replys..) I wrote again to nvidia and AMD...
This time with some more success.

Below is the answer from Mr. Friedman to my mail. He says that he wasn't
able to reproduce the problem and asks for a testing system.
Unfortunately I cannot ship my system as this is my only home PC and I
need it for daily work. But perhaps someone else here might has a system
(with the error) that he can send to Nvidia...

I cc'ed Mr. Friedman so he'll read your replies.

To Mr. Friedman: What system did you exactly use for your testing?
(Hardware configuration, BIOS settings and so on). As we've seen before
it might be possible that some BIOSes correct the problem.

Best wishes,
Chris.




Lonni J Friedman wrote:
Christoph,
Thanks for your email. I'm aware of the LKML threads, and have spent
considerable time attempting to reproduce this problem on one of our
reference motherboards without success. If you could ship a system
which reliably reproduces the problem, I'd be happy to investigate further.

Thanks,
Lonni J Friedman
NVIDIA Corporation

Christoph Anton Mitterer wrote:

Hi.

First of all: This is only a copy from a thread to nvnews.net
(http://www.nvnews.net/vbulletin/showthread.php?t=82909). You probably
should read the description there.

Please note that his is also a very important issue. It is most likely
not only Linux related but a general nforce chipset design flaw, so
perhaps you should forwad this mail to your engineers too. (Please CC me
in all mails).

Also note: I'm not one of the normal "end users" with simple problems or
damaged hardware. I study computer science and work in one of Europes
largest supercomputing centres (Leibniz supercomputing centre).
Believe me: I know what I'm talking about.... and I'm investigating in
this issue (with many others) for some weeks now.

Please answer either to the specific lkml thread, to the nvnews.net post
or directly to me (via email).
And I'd be grateful if you could give me email-addresses from your
developers or enginers, or even better, forward this email to them and
CC me. Of course I'll keep their emails-addresses absolutely confident
if you wish.

Best wishes,
Christoph Anton Mitterer.
Munich University of Applied Sciences / Department of Mathematics and
Computer Science
Leibniz Supercomputing Centre / Department for High Performance
Computing and Compute Servers




Here is the copy:
Hi.

I've already tried to "resolve" this via the nvidia knowledgebase but
either they don't want to know about that issue or there is noone who is
competent enought to give information/solutions about it.
They finally pointed me to this fourm and told me that Linux
<http://www.nvnews.net/vbulletin/showthread.php?t=82909#> support would
be handled here (they did not realise that this is probably a hardware
<http://www.nvnews.net/vbulletin/showthread.php?t=82909#> flaw and not
OS related).

I must admit that I'm a little bit bored with Nvidia's policy in such
matters and thus I only describe the problem in brief.
If here is any competent chipset engineer who reads this, than he might
read the main discussion-thread (and some spin-off threads) of the issue
which takes place at the linux-kernel mailing list (again this is
probably not Linux related).
You can find the archive here:
http://marc.theaimsgroup.com/?t=116502121800001&r=1&w=2
<http://marc.theaimsgroup.com/?t=116502121800001&r=1&w=2>


Now a short description:
-I (and many others) found a data corruption issue that happens on AMD
Opteron / Nvidia chipset systems
<http://www.nvnews.net/vbulletin/showthread.php?t=82909#>.

-What happens: If one reads/writes large amounts of data there are errors.
We test this the following way: Create some test data (huge amounts
of),.. make md5sums of it (or with other hash algorithms), then verify
them over and over.
The test shoes differences (refer the lkml thread for more information
about this). Always at differnt files (!!!!). It may happen at read AND
write access <http://www.nvnews.net/vbulletin/showthread.php?t=82909#>.
Note that even for affected users the error occurs rarely (but this is
of course still far to often): My personal tests shows about the following:
Test data: 30GB (of random data), I verify sha512sum 50 times (that is
what I call one complete test). So I verify 30*50GB. In one complete
test there are about 1-3 files with differences. With about 100
corrupted bytes (at leas very low data sizes, far below an MB)

-It probably happens with all the nforce chipsets (see the lkml thread
where everybody tells his hardware)

-The reasons are not single hardware defects (dozens of hight quality
memory <http://www.nvnews.net/vbulletin/showthread.php?t=82909#>, CPU,
PCI bus, HDD bad block scans, PCI parity, ECC, etc. tests showed this,
and even with different hardware compontents the issue remained)

-It is probably not an Operating System related bug, although Windows
won't suffer from it. The reason therefore is, that windows is (too
stupid) ... I mean unable to use the hardware iommu at all.

-It happens with both, PATA and SATA disk. To be exact: It is may that
this has nothing special to do with harddisks at all.
It is probably PCI-DMA related (see lkml for more infos and reasons for
this thesis).

-Only users with much main memory (don't know the exact value by hard
and I'm to lazy to look it up)... say 4GB will suffer from this problem.
Why? Only users who need the memory hole mapping and the iommu will
suffer from the problem (this is why we think it is chipset related).

-We found two "workarounds" but these have both big problems:
Workaround 1: Disable Memory Hole Mapping in the system BIOS at all.
The issue no longer occurs, BUT you loose a big part of your main memory
(depending on the size of the memhole, which itself depends on the PCI
devices). In my case I loose 1,5GB from my 4GB. Most users will probably
loose 1GB.
=> inacceptable

Workaround 2: As told Windows won't suffer from the problem because it
always uses an software iommu. (btw: the same applies for Intel CPUs
with EMT64/Intel 64,.. these CPUs don't even have a hardware iommu).
Linux is able to use the hardware iommu (which of course accelerates the
whole system).
If you tell the kernel (Linux) to use a software iommu (with the kernel
parameter iommu=soft),.. the issue won't appear.
=> this is better than workaround 1 but still not really acceptable.
Why? There are some following problems:

The hardware iommu and systems with such big main memory is largely used
in computing centres. Those groups won't abdicate the hwiommu in
general, simply because some Opteron (and perhaps Athlon) / Nvidia
combinations make problems.
(I can tell this because I work at the Leibniz Supercomputing Centre,..
one of the largest in Europe)

But as we don't know the exact reason for the issue, we cannot
selectively switch the iommu=soft for affected
mainboards/chipsets/cpu-steppings/and alike.

We'd have to use a kernel wide iommu=soft as a catchall solution.
But it is highly unlikely that this is accepted by the Linux community
(not to talk about end users like the supercomputing centres) and I
don't want to talk about other OS'es.


So we (and of course all, and especially professional, customers) need
Nvidias help.

Perhaps this might be solvable via BIOS fixes, but of course not by the
stupid-solution "disable hwiommu via the BIOS".
Perhaps the reason is a Linux kernel bug (although this is highly unlikely).
Last but not least,.. perhaps this is AMD Opteron/Athlon (Note: These
CPUs have the memory controllers directly integrated) issue and/or
Nvidia nforce chipset issue.

Regards,
Chris.
*
btw: For answers from Nvidia engineers/developers or end-users who
suffer from that issue too,... please post it to the lkml thread (see
above for the link) and if not possible here.
You may even contact me via email (calestyo@xxxxxxxxxxxx) or personal
messages.*

PS: Please post any other resources/links to threads about this or
similar problems.


-----------------------------------------------------------------------------------
This email message is for the sole use of the intended recipient(s) and may contain
confidential information. Any unauthorized review, use, disclosure or distribution
is prohibited. If you are not the intended recipient, please contact the sender by
reply email and destroy all copies of the original message.
-----------------------------------------------------------------------------------



begin:vcard
fn:Mitterer, Christoph Anton
n:Mitterer;Christoph Anton
email;internet:calestyo@xxxxxxxxxxxx
x-mozilla-html:TRUE
version:2.1
end:vcard



Relevant Pages

  • Re: DVI output, ATI or nVidia
    ... it is a bad thing for a hardware maker not to give away its trade ... NVidia has stated plenty times that they don't ... Tell that to laptop owners who unknowingly bought it thinking it was ... Because your definition of support is short-sighted and lasts only so ...
    (Fedora)
  • Re: DVI output, ATI or nVidia
    ... it is a bad thing for a hardware maker not to give away its trade ... NVidia has stated plenty times that they don't ... Tell that to laptop owners who unknowingly bought it thinking it was ... Because your definition of support is short-sighted and lasts only so ...
    (Fedora)
  • Re: DVI output, ATI or nVidia
    ... >> of hardware the user bought and wants to use. ... NVidia has stated plenty times that they don't ... > IMO, qemu, bochs, xen, and whatever are just not as easy to use. ... Because your definition of support is short-sighted and lasts only so ...
    (Fedora)
  • Prognosen: Spieleentwicklung und Hardwareauslastung
    ... Im allgemeinen habe ich immer gerne Hardware gekauft, welche die Spiele zur gleichen Zeit noch nicht vollständig ausnutzen konnten, so daß ich also oft schon für die kommenden Spielegenerationen gewapnet war. ... Wieviel stärker ist eine Grafikkarte mit nVidia GF 7800 GTX gegenüber einer mit nVidia GF 6800 GT/Ultra in der Praxis und wie wird das ...
    (de.rec.spiele.computer.technik)