Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far
- From: Maxim <maximlevitsky@xxxxxxxxx>
- Date: Sun, 18 Mar 2007 01:00:01 +0200
On Saturday 17 March 2007 01:19:44 Len Brown wrote:
On Friday 16 March 2007 06:30, Maxim Levitsky wrote:
Good day,
I want to report regressions I have with 2.6.21-rc3 kernel.
I use CONFIG_NO_HZ.
Do any of these issues go away with CONFIG_NO_HZ=n (or boot with nohz=n)
or are they all independent of it?
thanks,
-Len
1) Both suspend to disk and suspend to RAM are completely broken:
On vanilla 2.6.20 suspend to disk works perfectly and suspend to ram works _almost_ perfectly (I will tell about that later).
On 2.6.21-rc1 and later system hangs even before suspend begins (suspend to disk hangs before image write , and after suspend to ram ,
some devices are powered down (disk,power leds) , and some and not(fans, power) , and system hangs).
I did a git-bisect and I found which commit caused that:
e3c7db621bed4afb8e231cb005057f2feb5db557 - [PATCH] [PATCH] PM: Change code ordering in main.c (breaks S3)
ed746e3b18f4df18afa3763155972c5835f284c5 - [PATCH] [PATCH] swsusp: Change code ordering in disk.c (breaks swsusp, I don't use it, but I tested it)
259130526c267550bc365d3015917d90667732f1 - [PATCH] [PATCH] swsusp: Change code ordering in user.c (breaks uswsusp, that I use)
I reverted those commits and now system suspends correctly to disk, but suspend to ram showed some more regressions.
2) ) After suspend to ram I get this
Mar 14 00:22:23 MAIN kernel: [ 2.072875] caller is check_tsc_sync_source+0x1d/0x100
Mar 14 00:22:23 MAIN kernel: [ 2.072878] [show_trace_log_lvl+26/48] show_trace_log_lvl+0x1a/0x30
Mar 14 00:22:23 MAIN kernel: [ 2.072881] [show_trace+18/32] show_trace+0x12/0x20
Mar 14 00:22:23 MAIN kernel: [ 2.072884] [dump_stack+22/32] dump_stack+0x16/0x20
Mar 14 00:22:23 MAIN kernel: [ 2.072887] [debug_smp_processor_id+173/176] debug_smp_processor_id+0xad/0xb0
Mar 14 00:22:23 MAIN kernel: [ 2.072891] [check_tsc_sync_source+29/256] check_tsc_sync_source+0x1d/0x100
Mar 14 00:22:23 MAIN kernel: [ 2.072894] [__cpu_up+80/384] __cpu_up+0x50/0x180
Mar 14 00:22:23 MAIN kernel: [ 2.072897] [_cpu_up+98/208] _cpu_up+0x62/0xd0
Mar 14 00:22:23 MAIN kernel: [ 2.072901] [cpu_up+46/80] cpu_up+0x2e/0x50
Mar 14 00:22:23 MAIN kernel: [ 2.072903] [enable_nonboot_cpus+110/160] enable_nonboot_cpus+0x6e/0xa0
Mar 14 00:22:23 MAIN kernel: [ 2.072906] [enter_state+326/496] enter_state+0x146/0x1f0
Mar 14 00:22:23 MAIN kernel: [ 2.072909] [state_store+174/192] state_store+0xae/0xc0
Mar 14 00:22:23 MAIN kernel: [ 2.072912] [subsys_attr_store+43/64] subsys_attr_store+0x2b/0x40
Mar 14 00:22:23 MAIN kernel: [ 2.072917] [sysfs_write_file+186/272] sysfs_write_file+0xba/0x110
Mar 14 00:22:23 MAIN kernel: [ 2.072920] [vfs_write+150/352] vfs_write+0x96/0x160
Mar 14 00:22:23 MAIN kernel: [ 2.072923] [sys_write+61/112] sys_write+0x3d/0x70
Mar 14 00:22:23 MAIN kernel: [ 2.072926] [sysenter_past_esp+93/153] sysenter_past_esp+0x5d/0x99
Mar 14 00:22:23 MAIN kernel: [ 2.072929] =======================
Mar 14 00:22:23 MAIN kernel: [ 2.072931] checking TSC synchronization [CPU#0 -> CPU#1]:
Mar 14 00:22:23 MAIN kernel: [ 2.092922] Measured 72051818872 cycles TSC warp between CPUs, turning off
It looks clear that preempt is enabled all the way in second cpu initialization, ( I think that at least in check_tsc_sync_source, it should be disabled,
shouldn't it ? )
Then I did add preempt_disable() / preempt_enable() to this function , and I still got this:
Mar 14 00:22:23 MAIN kernel: [ 2.072931] checking TSC synchronization [CPU#0 -> CPU#1]:
Mar 14 00:22:23 MAIN kernel: [ 2.092922] Measured 72051818872 cycles TSC warp between CPUs, turning off
It happens after second CPU is brought back on-line.
Now I understand that this is TSC sync problem and I tried to do some tests:
I tried to disable/enable second CPU by hand, eg I did number of times,
echo "0" > /sys/devices/system/cpu/cpu1/online
echo "1" > /sys/devices/system/cpu/cpu1/online
and TSC sync was ok.
Then I disabled 2nd CPU, have suspended system to RAM , resumed it , and then enabled 2nd CPU and got same error message.
Then I disabled cpufreq , and did above tests, and got same results.
I think that maybe this error is false, that there is some difference in TSC clock, but this difference is constant, and can be fixed
3) Sometimes I get this (once in three boots or so)
[ 36.217405] ENABLING IO-APIC IRQs
[ 36.217587] ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
[ 36.433917] APIC timer disabled due to verification failure.
And NO_HZ is disabled due to that (I get 1000/s timer's interrupts)
I haven't investigated that yet.
It looks like another new test that my hardware fails to perform...
And now I want to tell you about that _almost_ working suspend to ram I got in 2.6.20:
To put it simply sometimes system wakes from resume, and sometimes not (about 1 in 5 times)
When it does it works perfectly.
This is quite common problem but ironically my case is very different and harder to solve.
The fact is that I found thanks to RTC tricks , similar to PM_TRACE that system hangs in exactly three places (in one of them of course)
I put between instructions, code like that to save a position in RTC alarm which is not cleared on reboot
Note that this code uses ax, but I checked every time I put it that ax can be used (eg, it is loaded in next instruction)
#define TRACE(val) \
movb $0x01, %al ; \
outb %al, $0x70 ; \
movb $ ## val, %al ; \
outb %al, $0x71
It hangs very early in asm code, and those are places:
1) /home/maxim/software/kernel/linux-2.6.20-mod/arch/i386/kernel/acpi/wakeup.S:wakeup_start:
ljmpl $__KERNEL_CS,$wakeup_pmode_return
As I see that is first time wakeup low page is addressing kernel memory by jumping to it.
2) /home/maxim/software/kernel/linux-2.6.20-mod/arch/i386/kernel/acpi/wakeup.S:do_suspend_lowlevel
call restore_registers
It hangs exacly on that instruction, I can only see that this is first time protected stack is accessed
3) /home/maxim/software/kernel/linux-2.6.20-mod/arch/i386/power/cpu.c :__restore_processor_state(struct saved_context *ctxt)
mtrr_ap_init();
It hangs somewhere inside this function, and because is is long I haven't found where exactly, it is easier to disable MTRR all together.
Note that all three places have different external behavier:
in first case, system powers on -> off -> on -> hangs
I did put test in RTC to see whenever BIOS call kernel twice but I know from this test for sure that it calls it only once,
It actualy makes sense, because exception occures before IDT is loaded, so system has no choice but to power down
in second case I see blinking leds -> almost sure a oops
in third case system just hangs
That's all, I will continue to dig those problems out
Thanks for attention,
Maxim Levitsky
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Issues that are caused by those
e3c7db621bed4afb8e231cb005057f2feb5db557 - [PATCH] [PATCH] PM: Change code ordering in main.c (breaks S3)
ed746e3b18f4df18afa3763155972c5835f284c5 - [PATCH] [PATCH] swsusp: Change code ordering in disk.c (breaks swsusp, I don't use it, but I tested it)
259130526c267550bc365d3015917d90667732f1 - [PATCH] [PATCH] swsusp: Change code ordering in user.c (breaks uswsusp, that I use)
doesn't go away, even worse, I tested latest Linus's git tree, and same hang both is suspend to disk and suspend to ram (before suspending)
Mar 14 00:22:23 MAIN kernel: [ 2.072875] caller is check_tsc_sync_source+0x1d/0x100
Mar 14 00:22:23 MAIN kernel: [ 2.072878] [show_trace_log_lvl+26/48] show_trace_log_lvl+0x1a/0x30
Mar 14 00:22:23 MAIN kernel: [ 2.072881] [show_trace+18/32] show_trace+0x12/0x20
Mar 14 00:22:23 MAIN kernel: [ 2.072884] [dump_stack+22/32] dump_stack+0x16/0x20
Mar 14 00:22:23 MAIN kernel: [ 2.072887] [debug_smp_processor_id+173/176] debug_smp_processor_id+0xad/0x
^ This thing is fixed (sorry for bothering you)
Mar 14 00:22:23 MAIN kernel: [ 2.072931] checking TSC synchronization [CPU#0 -> CPU#1]:
Mar 14 00:22:23 MAIN kernel: [ 2.092922] Measured 72051818872 cycles TSC warp between CPUs, turning off
^ This one I don't think is related to NO_HZ, maybe it is hardware problem, but it exist without NO_HZ
[ 36.217405] ENABLING IO-APIC IRQs
[ 36.217587] ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
[ 36.433917] APIC timer disabled due to verification failure.
This one is now discussed, I will look at it and it is not related to NO_HZ
And hangs with S3 as I said occur even in 2.6.20
And I forgot to tell about another problem with (now I know ,hi-resolution timers)
That before suspend to ram APIC timer is used and HPET is not used :
root@MAIN:/home/maxim# cat /sys/devices/system/clockevents/clockevents0/registered
lapic F:0007 M:3(periodic) C: 1
hpet F:0003 M:1(shutdown) C: 0
lapic F:0007 M:3(periodic) C: 0
root@MAIN:/home/maxim#
But after suspend to ram HPET is 'woken'
root@MAIN:/home/maxim# cat /sys/devices/system/clockevents/clockevents0/registered
lapic F:0007 M:3(one shoot) C: 1
hpet F:0003 M:3(one shoot) C: 0
lapic F:0007 M:3(one shoot) C: 0
Note that I added those (one shoot), (periodic) descriptions, would be nice to have them in kernel, can I send a patch ? ;-)
and I see average of 18 IRQs/sec on IRQ 0
Best regards,
Maxim Levitsky
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
- Follow-Ups:
- References:
- [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far
- From: Maxim Levitsky
- Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far
- From: Len Brown
- [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far
- Prev by Date: Re: Keyboard stops working after *lock [Was: 2.6.21-rc2-mm1]
- Next by Date: Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far
- Previous by thread: Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far
- Next by thread: Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far
- Index(es):
Relevant Pages
|
|