Re: sata_sil24 broken since 2.6.23-rc4-mm1



On 9/30/07, Tejun Heo <htejun@xxxxxxxxx> wrote:
Hello, Torsten.

Torsten Kaiser wrote:
On 9/28/07, Torsten Kaiser <just.for.lkml@xxxxxxxxxxxxxx> wrote:
So in case of -rc3-mm1 I'm pretty sure that it works.

That's still the case.

Ah... that's weird. It would be much better if -rc3-mm1 is broken too. :-P

It would be much better if it would break all the time. Then bisecting
would not be guesswork. :-P

Not completely sure is if 2.6.23-rc7-sglist kernel works. I booted
that 9 times, but from a quick look in /var/log/messages, I might not
have hit the "correct" situation to trigger the error.
That kernel is vanilla 2.6.23-rc7 plus the patch from
http://www.kernel.org/pub/linux/kernel/people/tomo/misc/v2.6.23-rc7-sglist-arch.diff.bz2
( http://marc.info/?l=linux-ide&m=119055574826083&w=2 )

That is no longer the case. Yesterday this kernel did also show the failure.
The error looked a little bit different, but happend at the same
location during the bootup.
Sadly dmesg overflowed and I was not able to capture the first error.

[ 53.462632] Freeing unused kernel memory: 332k freed
ite cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 77.170903] ata1.00: exception Emask 0x40 SAct 0x1 SErr 0x0 action 0x2
[ 77.170905] ata1.00: irq_stat 0x00020002, SGT no on qword boundary
[ 77.170908] ata1.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
cdb 0x0 data 4096 out
[ 77.170909] res 50/00:00:af:ea:42/00:00:25:00:00/e0 Emask
0x40 (internal error)

Hmm... SGT not on qword boundary? Please add the following to the end
of sil24_port_start() and report successful and failed kernel boot log.

printk("XXX sil24 cb=%p cb_dma=%llx\n",
cb, (unsigned long long)cb_dma);

Added to my current "2.6.23-rc4-mm0.5"
I will report it's output.

Also, does 'dmesg -s 10000000' give full dmesg?

That boot ended in a minimal initrd environment that normally only
starts the RAID5 and then opens contained encrypted real root.
I was just able to push the output from dmesg through the serial link,
but had no man pages to tell me about -s ...
And that kind of error was until now a one-of-a-kind one. All other
errors where not "internal error" but "timeout".
But one time I had another SGT related error:
Sep 11 19:19:24 treogen [ 33.340000] ata1.00: exception Emask 0x20
SAct 0x1 SErr 0x0 action 0x2
Sep 11 19:19:24 treogen [ 33.340000] ata1.00: irq_stat 0x00020002,
PCI master abort while fetching SGT
Sep 11 19:19:24 treogen [ 33.340000] ata1.00: cmd
61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
Sep 11 19:19:24 treogen [ 33.340000] res
50/00:00:af:ea:42/00:00:25:00:00/e0 Emask 0x20 (host bus error)

What I find kind of interessing is, that while I got three different
error codes the cmd part of the output was always the same.

Is it possible that the bug is much older, and only some change in
rc3-mm->rc4-mm makes it visible that the initialization of the SiI3132
is incomplete?

I can't tell but there is a pretty large userbase of sil24/32 and you
seem to be the only one to report this problem yet. I think it might be
coming somewhere else than libata or sata_sil24 itself. Hmmm... It
would be really great if you can break -rc3-mm1 too. Correct behavior
for something like that for just one -mm version sounds very weird.

It's not just 2.6.23-rc4-mm1. All -mm's after rc4 are broken for me.
Confirmed breakage on -rc4-mm1, -rc6-mm1 and -rc8-mm1. I'm just
narrowing on rc4-mm1 because that was the first version to break.

I'm currently trying to bisect 2.6.23-rc4-mm1. Here is the current status:

[the 2.6.23-rc4-mm1 series-file has 2013 lines]
Up to (incl.) x86_64-convert-to-clockevents.patch (line 747): 2 good boots
Up to (incl.) x86_64-cleanup-struct-irqaction-initializers-patch
(line779): 2 good boots
Up to (incl.) slub-optimize-cacheline-use-for-zeroing.patch (line
1045): 1 failed
Up to (incl.) fix-discrepancy-between-vdso-based... (line1461): 1 good, 1 failed

Next try: up to patch fs-remove-some-aop_truncated_page.patch

That means from the patches added to the rc4 variant of the mm-kernel
the following are remaining:
git-xfs-build-fix.patch
enforce-noreplace-smp-in-alternative_instructions.patch
paravirt-fix-preemptible-lazy-mode-bug.patch
i386-apic-fix-4-bit-apicid-assumption-of-mach-default.patch
fix-the-max-path-calculation-in-radix-treec-update.patch
mm-no-need-to-cast-vmalloc-return-value-in-zone_wait_table_init.patch
introduce-write_begin-write_end-aops-fix2.patch
implement-simple-fs-aops-fix.patch
ext2-convert-to-new-aops-fix2.patch
ext3-convert-to-new-aops-fix-fix.patch
ext4-convert-to-new-aops-fix-fix.patch
gfs2-convert-to-new-aops-fix.patch
reiserfs-convert-to-new-aops-fix2.patch
hostfs-convert-to-new-aops-fix-fix.patch
ufs-convert-to-new-aops-fix2.patch
sysv-convert-to-new-aops-fix2.patch
minix-convert-to-new-aops-fix2.patch
affs-convert-to-new-aops-fix-fix.patch
memoryless-nodes-add-n_cpu-node-state-move-setup-of-n_cpu-node-state-mask.patch
memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix.patch
memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix-2.patch
update-n_high_memory-node-state-for-memory-hotadd.patch
slub-avoid-page-struct-cacheline-bouncing-due-to-remote-frees-to-cpu-slab.patch
slub-do-not-use-page-mapping.patch
slub-move-page-offset-to-kmem_cache_cpu-offset.patch
slub-avoid-touching-page-struct-when-freeing-to-per-cpu-slab.patch
slub-place-kmem_cache_cpu-structures-in-a-numa-aware-way.patch
slub-optimize-cacheline-use-for-zeroing.patch

But due to the unreliable nature of the bug, I can't be to sure about that.

Next version is compiled, now again switching the PC off for an hour...

Torsten
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



Relevant Pages

  • Re: 2.6.30-rc8 Oops whilst booting
    ... However the vfat filesystem didn't have /sbin/init. ... boot from, to /dev/sdb6, so that I minimised the risk to sdc1 in the ... reboot festival that bisecting would involve. ... changed the name of the root partition that is passed to the kernel by ...
    (Linux-Kernel)
  • Re: 2.6.30-rc8 Oops whilst booting
    ... However the vfat filesystem didn't have /sbin/init. ... boot from, to /dev/sdb6, so that I minimised the risk to sdc1 in the ... changed the name of the root partition that is passed to the kernel by ... this bisecting, I expect the root filesystem to be on /dev/hdb6. ...
    (Linux-Kernel)
  • Re: [APIC] Kernel panic, rsync corruption, intel q8200, 2.6.28-rc8
    ... Messages on boot using Intrepid, Jaunty Alpha 3, or Fedora 10: ... I assume this is an ACPI problem. ... fresh hard drives, run tests, obtain debug information, and report back. ... If any previous kernel version worked OK, please be sure to note that. ...
    (Linux-Kernel)
  • Re: Lock order reversals on RELENG_8
    ... Are these lock order reversals reported any reason for preoccupation and/or should I report elsewhere? ... The kernel I just built does not boot at all, so I don´t know whether these are a closed issue already. ...
    (freebsd-questions)
  • Re: 2.6.20.6 vanilla doest boot
    ... Reporting bios bug ... The 2.6.15 dmesg do not give any clues to why 2.6.20.6 would not boot. ... Regardless of extra kernel ... Please mention in the bug report what the latest working kernel was. ...
    (Linux-Kernel)