Re: NForce2 pseudoscience stability testing (2.6.0-test11)

From: Craig Bradney (cbradney_at_zip.com.au)
Date: 12/03/03

  • Next message: Xose Vazquez Perez: "Re: libata in 2.4.24?"
    To: b@netzentry.com
    Date:	Wed, 03 Dec 2003 01:28:58 +0100
    
    

    Jump down for my replies..

    On Wed, 2003-12-03 at 00:36, b@netzentry.com wrote:
    > Regardless of BIOS version, doesn˙t the kernel get to
    > fix values if things are "wrong?" I've seen this message
    > on several boards where various tables and ACPI and APIC
    > things are "fixed up." The only reason I want to find out so bad
    > is that board survives unusual amounts of abuse on Windows 2000. (I
    > havent been able to hang Windows as Prakash had.)
    >
    > Right now I can't even afford to test 2.6.0.test11 in terms
    > of time but very similar problems exist in 2.4, suggesting
    > something fundamental?
    >
    > About the IDE, it seems to be the easiest way to promote the
    > problem but time seems to be the biggest factor. Some have
    > suggested wrt this NFORCE2 problem that idle time makes it
    > worse, but I've seen the hang under both conditions.

    Not idle time.. my PC idles (with KDE/xchat/evolution running , if that
    is idling :)) while i am out or sleeping.. thats at least 4 hours non
    stop per day...

    > I have a very minimal set of things running on this board, I
    > shut off the USB1.x and 2.0 controller, jumpered the Si SATA
    > off and turned of Audio and the NVIDIA Lan. (Prakash suggested
    > the Si SATA was the evil but I have it jumpered off.)

    Audio on and used, USB 2.0, 1.x on but nothing plugged, NVIDIA lan off,
    3com lan on and connected at 100Mbit.

    > For me the formula to reproduce this problem is this simple:
    > - Do anything (including passing the noapic nolapix noapixio etc)
    > in a UP-kernel with APIC compiled in, get hang.
    > - To avoid the problem, recompile the UP-kernel with APIC turned
    > off.
    >
    > What probably isnt the problem:
    > - Motherboard flavor (This problem has cropped up on all nforce2 boards)

     a7n8x deluxe v2 with standard clocked athlon xp 2600+

    > - BIOS (This problem can change nature with a BIOS rev, but I havent
    > seen it go away and I tried several).

     bios 1007.

    > - SATA - problem appears even if SATA is shut off via jumper, as does Craig.

    SATA is shut off on jumper. no SATA in kernel

    > - Memory - I swapped memory in a test to see. I also tried the dual
    > channel and the single channel mode. No change.

    dual channel 2x256mb

    > - SMP / UP / passed flags Julien said: "I run strictly non-SMP kernels
    > and they always crash if APIC (or local APIC?) is enabled." - I see the
    > same things.

    smp off, preempt off. lapic on, apic on, acpi on

    > (Julien also suggest a quick way to see if the system is stable: type
    > hdparm -t /dev/hd<someharddrive> several times).

    as my uptime below suggests.. that didnt hurt a thing. I havent rebooted
    since that test.

    > - ASUS. I have a rev2.0 PCB, and several rev 1.x PCB, and they all
    > suffer. Others also report boards from different manufacturers hanging
    > with the NFORCE2. (Prakash: Abit NF7-S V2.0), Lenar (Epox 8RDA+ , 8RDA3+ )
    > - Kernel 2.4 or 2.6 specifically. This problem occurs both in later 2.6
    > builds and 2.4.23 (and below)
    > - IDE (NFORCE/AMD IDE driver in kernel) It seems the problem is easily
    > exacerbated by IDE, but I think that this is not the root cause. In
    > NO-APIC mode the IDE behaves reliably.
    >
    >
    > Craig said he ran it for 18 hours with abuse, including Juliens hdparm
    > test.

    Uptime is 4 days 1:55 and counting...

    Things I can remember running in that time:
    apache, mysql, proftpd, kde, evolution, xchat2, mozilla, konsole,
    scribus, any updates including gcc 3.2.3 from 3.2.2 that came through
    gentoo's portage system, gcc recompilation of scribus quite often (cvs
    updates and own coding), gdb for about 3 hours finding one scribus
    segfault testing over and over again with many crashes (fixed now btw
    for those scribus users out there :))

    > -> Craig, which kernel are you using? Distro (RedHat Taroon's kernel has
    > LAPIC turned off)? PCB Rev of motherboard? Bios revision? Whats the
    > lspci, cat /proc/interrupts look like? dmesg?

    I'm using Gentoo's gentoo-dev-sources 2.6.0_beta11 which is 2.6 test11
    plus the patches to be found at:
    http://dev.gentoo.org/~brad_mssw/kernel_patches/2.6.0/genpatches-0.6/

    They have released a 2.6_beta11-r1 that I ahvent upgraded to which uses
    http://dev.gentoo.org/~brad_mssw/kernel_patches/2.6.0/genpatches-0.7/

    and i notice there is a 0.8 directory.

    So.. sorry if my comments have misled.. its not exactly standard 2.6 now
    that I do check it. There is one nvidia nforce network patch.

    As above, PCB is 2.0, BIOS is standard 1007 from ASUS.

    lspci and proc/interrupts follow. dmesg has been wiped out by some
    packet command errors with drive seek errors on hdc (dvdrw) from when i
    was playing music with that drive.

    -----
    lspci:
    00:00.0 Host bridge: nVidia Corporation nForce2 AGP (different version?)
    (rev c1)
    00:00.1 RAM memory: nVidia Corporation nForce2 Memory Controller 1 (rev
    c1)
    00:00.2 RAM memory: nVidia Corporation nForce2 Memory Controller 4 (rev
    c1)
    00:00.3 RAM memory: nVidia Corporation nForce2 Memory Controller 3 (rev
    c1)
    00:00.4 RAM memory: nVidia Corporation nForce2 Memory Controller 2 (rev
    c1)
    00:00.5 RAM memory: nVidia Corporation nForce2 Memory Controller 5 (rev
    c1)
    00:01.0 ISA bridge: nVidia Corporation nForce2 ISA Bridge (rev a4)
    00:01.1 SMBus: nVidia Corporation nForce2 SMBus (MCP) (rev a2)
    00:02.0 USB Controller: nVidia Corporation nForce2 USB Controller (rev
    a4)
    00:02.1 USB Controller: nVidia Corporation nForce2 USB Controller (rev
    a4)
    00:02.2 USB Controller: nVidia Corporation nForce2 USB Controller (rev
    a4)
    00:05.0 Multimedia audio controller: nVidia Corporation nForce
    MultiMedia audio [Via VT82C686B] (rev a2)
    00:06.0 Multimedia audio controller: nVidia Corporation nForce2 AC97
    Audio Controler (MCP) (rev a1)
    00:08.0 PCI bridge: nVidia Corporation nForce2 External PCI Bridge (rev
    a3)
    00:09.0 IDE interface: nVidia Corporation nForce2 IDE (rev a2)
    00:0c.0 PCI bridge: nVidia Corporation nForce2 PCI Bridge (rev a3)
    00:0d.0 FireWire (IEEE 1394): nVidia Corporation nForce2 FireWire (IEEE
    1394) Controller (rev a3)
    00:1e.0 PCI bridge: nVidia Corporation nForce2 AGP (rev c1)
    02:01.0 Ethernet controller: 3Com Corporation 3C920B-EMB Integrated Fast
    Ethernet Controller (rev 40)
    03:00.0 VGA compatible controller: ATI Technologies Inc Radeon R250 If
    [Radeon 9000] (rev 01)
    03:00.1 Display controller: ATI Technologies Inc Radeon R250 [Radeon
    9000] (Secondary) (rev 01)
    -----
    cat /proc/interrupts
               CPU0
      0: 350520478 XT-PIC timer
      1: 346366 IO-APIC-edge i8042
      2: 0 XT-PIC cascade
      8: 2 IO-APIC-edge rtc
      9: 0 IO-APIC-level acpi
     12: 3568875 IO-APIC-edge i8042
     14: 2286009 IO-APIC-edge ide0
     15: 153023 IO-APIC-edge ide1
     19: 26315864 IO-APIC-level radeon@PCI:3:0:0
     21: 855007 IO-APIC-level ehci_hcd, NVidia nForce2, eth0
     22: 3 IO-APIC-level ohci1394
    NMI: 0
    LOC: 350520402
    ERR: 0
    MIS: 0
    -----
    > -> Josh, Craig & anyone else who gets this working on 2.6.test11 or some
    > other fork of 2.6 can they please try 2.4.23-release as well and see if
    > the machine hangs as well?

    I was running gentoo's 2.4.23 pre 8 before this.. its still on the
    machine. I get no hangs with it.

    This machine has never run Win* (its only 3 weeks old). Wouldnt go near
    an Uber BIOS on a still warranted motherboard.

    Hope some of it helps.

    Craig

    > The Linux APIC code generically works on most other hardware. Something
    > specific to the NFORCE2 chips and its interaction with Linux's APIC code
    > causes the hard hangs. The Windows 2000's APIC code was made before the
    > NFORCE2 existed, and it seems to run fine there.
    >
    > - About that Uber BIOS bios for the Asus Deluxe board, Anyone running
    > this: a) do you really want to run a hacked bios when other OS run fine
    > on the unhacked BIOS b) do you believe that any of the un-hidden
    > settings the uber bios or settings you may have changed helps solve this
    > problem?
    >
    > These bugs on the LK Bugzilla seem related:
    > http://bugme.osdl.org/show_bug.cgi?id=1203
    >
    > Loosely related:
    > http://bugme.osdl.org/show_bug.cgi?id=1530
    > http://bugme.osdl.org/show_bug.cgi?id=1440
    > http://bugme.osdl.org/show_bug.cgi?id=1269
    >
    >
    > Does anyone know which developers would be interested in looking at
    > this? I think it would be better if a specific
    > patch fixed this problem than try a kernel from bitkeeper
    > on a daily basis and wait for the problem to go away without
    > ever knowing what caused it.
    >
    >
    >
    >
    >
    > > To me the strangest thing is that when I first got this
    > > board a month or so ago it would hang with APIC or LAPIC
    > > enabled. Now it works fine without disabling APIC. All I
    > > did was update the BIOS and use it for a while with APIC
    > > disabled. 2.6.0-test9-mm through 2.6.0-test11 all work just
    > > fine. Still at the same time some people are reporting that
    > > it works, some are reporting that it doesn't. I probably
    > > wouldn't think to much of this except I was one of the ones
    > > that said APIC causes crashes with IDE load, but now it
    > > doesn't?
    > >
    >
    >
    >
    > > On approximately Tue, Dec 02, 2003 at 10:13:46AM +0000,
    > > ross.alexander wrote: Alistair,
    > >
    > > I upgraded the BIOS about a week ago to 1007. I personally
    > > found it to be less stable than 1006. I don't believe it is
    > > a problem with my hardware combination since it has been
    > > stable for long periods of time. I was running the SMP
    > > kernel simply because I (wrongly) presumed a) you needed it
    > > to get the IO-APIC working, and b) it didn't do any harm.
    > >
    > > It is clear that the UP kernel is considerable more stable
    > > than the SMP kernel. This is a very useful fact since it
    > > suggests that it is not a problem with the IDE device
    > > driver per se. The whole purpose of my testing is to try to
    > > determine which options increased the stability and hence
    > > highlight where the problem could be.
    > >
    > > One of the reasons I don't like ACPI is the huge amount of
    > > additional complexity it adds and the amount of stuff it
    > > could screw up. Now I have not heard that any of the VIA
    > > KTxxx based motherboards have any problems. If this is true
    > > then the problem does not lie with the LAPIC, since that is
    > > in the processor, not the MB. The fact that it seems to
    > > only occur with the NForce2 chipset means it could well be
    > > some interrupt coming into the LAPIC from Interrupt Bus.
    > > However I certainly don't claim to be an expert on this so
    > > I could well be talking complete crap.
    > >
    > > Conclusion: More testing required.
    > >
    > > Cheers,
    > >
    > > Ross
    > >
    > > Alistair John Strachan <s0348365 28/11/2003
    > > 04:46 p.m.
    > >
    > > To: ross.alexander
    > > <brendan cc: linux-kernel
    > > Subject: Re: NForce2 pseudoscience stability testing
    > > (2.6.0-test11)
    > >
    > > On Friday 28 November 2003 15:13,
    > > ross.alexander
    > >
    > > The conclusion to this is the problem is in Local APIC with
    > > SMP. I'm not saying this is actually true only that is what
    > > the data suggests. If anybody wants me to try some other
    > > stuff feel free to suggest ideas.
    > >
    > > Cheers,
    > >
    > > Ross
    > >
    > > It's evidently a configuration problem, albeit BIOS,
    > > mainboard revision, memory quality, etc. because I and many
    > > others like me are able to run Linux 2.4/2.6 with all the
    > > options you tested and still achieve absolute stability, on
    > > the nForce 2 platform.
    > >
    > > My system is an EPOX 8RDA+, with an Athlon 2500+ (Barton)
    > > overclocked to 2.2Ghz, and 2x256MB TwinMOS PC3200 dimms.
    > > FSB is at 400Mhz, and the ram timings are 4,2,2,2. One
    > > might expect such a configuration to be unstable,
    > >
    > > but it is not.
    > >
    > > I'm currently running 2.6.0-test10-mm1 with full ACPI (+
    > > routing), APIC and local APIC, no preempt, UP, and
    > > everything has been rock-solid, despite the machine being
    > > under constant 100% CPU load and fairly active IO load.
    > >
    > > Also, many others have found that just disabling local apic
    > > (and the MPS setting in the BIOS) as well as ACPI solves
    > > their problem, so I'm skeptical that SMP really causes
    > > *nForce 2 specific* instability.
    > >
    > > -- Cheers, Alistair.
    > >
    > > personal: alistair()devzero!co!uk university:
    > > s0348365()sms!ed!ac!uk student: CS/AI Undergraduate
    > > contact: 7/10 Darroch Court, University of Edinburgh.
    > >
    > > - To unsubscribe from this list: send the line "unsubscribe
    > > linux-kernel" in the body of a message to
    > > majordomo@vger.kernel.org More majordomo info at
    > > http://vger.kernel.org/majordomo-info.html Please read the
    > > FAQ at http://www.tux.org/lkml/
    >

    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/


  • Next message: Xose Vazquez Perez: "Re: libata in 2.4.24?"

    Relevant Pages

    • Kernal oops with drm radeon
      ... Trying to play a recording with mythtv creates a kernel oops ... 9000-9fff: PCI Bus #01 ... 0000:00:02.0 USB Controller: nVidia Corporation nForce2 USB Controller ...
      (Linux-Kernel)
    • RE: escalation stage 2 [was:RE: Big and ugly bug in 5.1-release]
      ... It is reognized as ar0 after setting it up in the controller's BIOS ... > If the controller says the drive is bad when you are in the BIOS, ... >> That's what I do and after kernel probing the machine reboots with the ...
      (freebsd-current)
    • Re: Linux wont detect HDs with my PCI RAID controller enabled.
      ... I recently downloaded and compiled a new kernel so ... > controller, I look and notice that the kernel finds the controller and ... > standard IDE drives now that I have added support for this new PCI ATA ... You could also look at the bios and find out how your bios is ...
      (comp.os.linux.hardware)
    • Re: freebsd without keyboard
      ... But looked at my kernel: ... sure to tell the bios not to report keyboard errors. ...
      (freebsd-questions)
    • Re: [Linux-fbdev-devel] Generic VESA framebuffer driver and Video card BOOT?
      ... >> kernel image or that the image is corrupt or that hardware diagnostics failed. ... Well being able to use this BIOS emulator logic to bring up the primary ... the same as Open Firmware really). ... Actually there is nothing wrong with the x86 BIOS from the perspective of ...
      (Linux-Kernel)