Erratic Crashes at Boot (Was Power Supply Cause of Crashes? )
From: W. Watson (wolf_tracks_at_invalid.inv)
Date: 03/20/05
- Previous message: Rattleon: "Re: Best supported motherboard maker and linux?"
- In reply to: Floyd L. Davidson: "Re: Power Supply Cause of Crashes? (Review)"
- Next in thread: Floyd L. Davidson: "Re: Erratic Crashes at Boot (Was Power Supply Cause of Crashes? )"
- Reply: Floyd L. Davidson: "Re: Erratic Crashes at Boot (Was Power Supply Cause of Crashes? )"
Date: Sun, 20 Mar 2005 05:01:20 GMT
Comments mixed in below.
Floyd L. Davidson wrote:
> "W. Watson" <wolf_tracks@invalid.inv> wrote:
>
>>OK, let me review the situation.
>>
>>1. For a year this computer has had unexplained crashes about
>>twice a month. Each time it is difficult to get the power back
>>on. I recently found a sure way to do it. Turn off all power,
>>and make sure the red LED on the MB is out, then power on, and
>>boot. Up until then I thought it was a bad outlet connection. My
>>explorations on this suggest otherwise. That is, it has
>>something to do with the computer and not the wall outlet.
>
>
> That is almost certainly a heat problem. Either the power
> supply is full of dust bunnies, a fan is not working, or it
> simply never had enough cooling... and/or the ambient
> temperature is too high.
I dusted it out pretty well with a can of air. It really has been pretty clean
in the year it's been in action. I did get a look at the temps somewhere along
the line and they looked quite reasonable. The temp in the room where the
computer is has been in the mid 50s.
>
> Having to actually unplug, and wait for a bit, is highly
> indicative that it is the power supply which is overheating, but
> that is not necessarily true.
I can test that idea by waiting much longer. If it is convenient to do so, I'll
give it a try.
Note that I've done a fair amount of cleaning (air can), reconnecting, pushing
on components to seat them, and looking for obvious defects (damaged cable).
>
> Once you get the OS working again, look into this further.
> Under load, find out what the CPU temperature is, for example.
> Anything over perhaps 60C is bad.
Got nowhere near that when I booted from Knoppix.
>
>
>>2. About a week ago it died as I was rebooting. That is the
>>power suddenly went off. Attempts to bring it up met with a lot
>>of fscking during boot, and getting to the login stage, when it
>>would suddenly fail again. The fsck were advised by the book
>>with a Y to the prompt, since it reported the system had been
>>shutdown uncleanly. There had been no recent changes to
>>hardware. I had just finished doing a realtime kernel build, and
>>was just going to power down for the morning. Nothing unusual
>>had taken place.
>
>
> Your crash apparently damaged some files, and fsck will have
> saved them, but not with their names. Which means you no doubt
> have some corrupted or nonexistent system configuration files.
So far the limited text files I've moved to the other HD look good. There's a
fair amount of data on hda but if I lose it, so be it.
>
>
>>3. I decided to see if the HDs were intact. I booted from a
>>Knoppix CD, and examined the HDs. They passed the fsck for
>>hda, and hdb. I transferred files from hda to hdb to be on the
>>safe side, since if I had to re-install RHL 9 it would be on hda.
>
>
> By that time, the fsck being run at boot time had already fixed
> the filesystem, but it won't fix the files themselves... so it
> crashes.
>
>
>>4. While in Knoppix, I looked at the log messages: boot.log,
>>dmesg, and another one (the name escapes me at the moment, ah
>>...)--messages. I could only find one message that looked
>>suspicious. I reported on this in the thread, "Is My Hard Drive
>>Dead". Here's part of my post and the response to it.
>>"W. Watson" <wolf_tracks@invalid.inv> writes:
>>
>>...Looking in dmesg on the hda drive,I
>
>
> except that dmesg is a circular buffer, and of course does not
> survive reboots. Hence all of this tells what happened as it
> booted Knoppix. Basically hardware related messages aren't
> going to mean much here... :-(
Interesting. It looked like it just added to the end of the file. There's a
boot.log, boot1.log, ..., boot4.log. I'm not sure what's going on there. I just
looked at boot.log.
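To illustrate the circular-buffer point, here is a minimal Python sketch of a fixed-size log in which new messages push out the oldest ones instead of being appended forever. The class name RingLog and the capacity of 3 are made up for illustration, and the real kernel buffer is sized in bytes rather than in lines, so treat this as an analogy only.

```python
from collections import deque


class RingLog:
    """Toy fixed-capacity message log (hypothetical name, not a kernel API)."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest entry once it is full.
        self._buf = deque(maxlen=capacity)

    def write(self, line):
        self._buf.append(line)

    def dump(self):
        # Return the surviving messages, oldest first.
        return list(self._buf)


log = RingLog(capacity=3)
for i in range(5):
    log.write(f"msg {i}")
print(log.dump())
```

Only the most recent messages survive, and nothing in the buffer outlives the process that holds it. That would be consistent with a dmesg listing describing the Knoppix boot rather than the earlier crash.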
>
>
>>>>see, from Knoppix, "Error"-- about 8 lines below
>>>>=========dmesg=======================
>
> ...
>
>>>>hda: task_no_data_intr: status=0x51 { *DriveReady SeekComplete Error* }
>>>>hda: task_no_data_intr: error=0x04 { *DriveStatusError* }
>>
>>Getting those messages at that point (but not later on), is completely
>>harmless. It's the driver checking whether the drive supports some
>>feature, or something like that.
>
>
> Or it may be checking timing. Regardless, it is indeed
> harmless.
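For reference, the status and error values in those two log lines can be decoded bit by bit. This is a minimal Python sketch, assuming the usual ATA status register layout and the labels the IDE driver prints. The table and function names are my own, and only the one error bit seen in this log is listed.

```python
# Status register bits, highest first (assumed standard ATA layout).
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]

# Error register: only the bit that appears in the quoted log line.
ERROR_BITS = [
    (0x04, "DriveStatusError"),  # the drive aborted the command
]


def decode(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for mask, name in table if value & mask]


print(decode(0x51, STATUS_BITS))
print(decode(0x04, ERROR_BITS))
```

Read this way, 0x51 is just "ready, seek complete, last command failed", and 0x04 says the drive rejected that command. That fits the explanation that the driver was probing for a feature the drive does not support.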
>
>
>>5. I called ABIT about my VA-10 motherboard with an AMD XP 2100
>>MHz CPU. They said it sounded like a PSU problem and urged me to
>>try a new PSU. As I reported a few messages back, the new PSU
>>didn't solve the problem.
>
>
> It might very well solve the intermittent reboots. But it can't
> restore your system configuration or repair the files.
If I'm lucky, it may not matter. If.
>
>
>>6. Here's the state of the problem.
>>
>>Case a: First let me describe the normal boot up when the system
>>is working fine. Everything goes along smoothly, and I get to
>>the login. There's a momentary presentation of a text login
>>quickly followed by a screen that comes up that looks like the
>>screen when I shutdown. A few seconds later, the correct Redhat
>>login screen appears with a prompt (logoff, logon, Shutdown,
>>etc. icons across the bottom). I login--successfully.
>
>
> You are booting into X, using xdm or some other similar display
> manager.
>
> The first "text login" screen is a virtual console, and
> immediately following that the X server is started and a
> graphical login is provided.
>
>
>>Case b: Let me describe what's happening now. Each time I try to
>>boot up, I get a message of an unclean shutdown, and choose to
>>have fsck repair things. It takes about 7 minutes to work
>>through the start up to the text login.
>
>
> You have a problem with the filesystem that cannot be fixed with
> the way fsck is automatically run at boot. You need to boot
> into single user mode and run fsck manually to fix the problems
> (which by now might be considerable!).
How do I get into single user mode? Maybe that's what I'm doing when I
successfully login, as described below, and do "shutdown -h now".
>
>
>>I briefly get a text
>>login as above. Next comes a messed up screen that looks very
>>noisy. It lasts a few seconds. The text login reappears for
>>about 3-4 seconds, then is replaced again by the messed up
>>screen. This goes on forever.
>
>
> (BTW, I see no indication that this is a hardware problem.)
>
> I'm not positive about exactly what is going on at this point.
> It is some sort of loop (and there are several possible ways to
> have a loop at that point). It appears, since whatever you
> enter is still there each time, that it is not actually
> restarting X or the display manager. Hence it is probably a
> redisplay loop, inside the display manager, and something it is
> doing is not working... and my first bet is that an image file
> that is being displayed on the screen, as part of your login
> splash, is corrupted. The xdm program, as an example, puts a
> little xpm image on the screen just to the left of the login
> prompt. (I use xdm, but it is very modified and I don't recall
> what the original xpm file said, but I think it was something
> about the X server.)
>
> When X displays a fubar xpm file, it can do all sorts of things!
>
>
>>However, if I patiently enter root
>>for the user id, that's all I've got time for, and wait for the
>>login to reappear again, I'll get the request for a
>>password. 2-3 more cycles of this and I can get my 8 character
>>password in and press Enter. Now I'm logged in. Now I go through
>>4-5 cycles of login/funky-screen and manage to get in: shutdown
>>-h now. If I succeed, it shuts down normally. When I reboot, all
>>goes well again, but the same screwy login/funky-screen. Unless
>>I want to go through the tiring text login, I am forced to
>>reboot.
>
>
> You could do "init s" instead of shutdown...
I'm in a bit over my head here. So I just enter "init s", and that will bring it
down in some reasonable fashion?
>
>
>>Case c: Basically, this is case b again, but it seems like
>>this is the difference. I try to type root in for the first
>>appearance of the text login. I'm successful, but the power dies
>>almost immediately when the funky screen is coming up. I think
>>if I wait for a later text login, I get case b.
>
>
> Whoa, this is really different! You actually have time to type
> in a userid and password at the text login??? And then the
> power goes out? Hmmmm... I've never seen it last long enough
> to do more than read it (from a burned memory image, as it's
> gone too fast to react).
Not quite. I never get a shot at the password. Once I get root typed and hit
enter, the game is over. It's going to power down. I'm pretty sure that's what
happens. It's a distinctly different situation than case b where I wait for a
few cycles of login/crud, then start to login.
>
>
>>If anyone wants to see all the logs I got while in Knoppix, I'll post them.
>>
>>I see someone posted a suggestion about logging and checking out
>>Xwin (manually issuing startx, etc). That might be smart. I'll
>>see if I can do it.
>>
>>After this message, I will try to start a new thread with a
>>slightly different subject. The topic on power supplies is pretty
>>much exhausted. It's now shifting to X-win.
>
>
> Do *not* start another thread. If you want to change the
> subject line, go ahead, but do *not* drop the References:
> headers that keep all of this threaded.
Sounds right. I'm changing the title of this to "Erratic Crashes at Boot". It'd
be good to boost this up to a higher level, so that the indenting doesn't keep
going further to the right.
>
> Whatever, boot to single user, either at boot time or using
> "init s".
>
> Regardless, you *must* stop the random poking at the box and
> start logically following a pattern of steps that isolates
> the problem...
>
I'm a bit over my head on that account. My knowledge of Linux is pretty limited.
I've gotten quite a ways off the expected trail. The trail is this. I'm an end
user of an application that runs realtime Linux, rtlinux. It's supposed to
pretty much run 24/7. I know enough about Linux to install it and to build the
required rt kernel from a set of instructions provided. I can build the
application from another set of instructions, and operate the application. I
have some modest idea of how X-Window works, and generally how Linux works.
Debugging this problem is beyond my skills unless I get some help. I work with
an organization that can barely support this effort, and they are a long way
from me. I would have had second thoughts of taking on this application if I
knew how much time it took to fix 'little' problems. I'm retired and do this
strictly as a volunteer.
There's one thing I really know how to do. I can wipe out that primary disk and
reinstall RHL 9 and begin again as though none of this happened. Perhaps the
only reason I haven't done that yet is that I find the erratic powering down of
the machine puzzling. If I can figure out why that happens, I'll be better off.
I think the crux of the matter is to discover why the erratic shutdown. I have
another computer that runs a similar ABIT board, and I think it just plain died
about six months into its use. I'm pretty confident that I'm not mistaken and
that was instead this board.
Maybe a smart thing to do regarding heat is just to bring up the box with
Knoppix and wait a few hours, bring it down, and look at BIOS to see what temps
it's running at? I did leave the machine on in this mode yesterday for 4 hours
without anything falling apart.
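A rough sketch of how such a soak test could be logged from the running system rather than read from the BIOS afterwards, when the box has already started cooling. Everything here is an assumption: the function names are hypothetical, the sysfs-style file read by read_millidegrees may not exist on this kernel at all, and the 60 C limit is simply the figure suggested earlier in the thread.

```python
import time

LIMIT_C = 60.0  # threshold mentioned earlier in the thread, not a hard spec


def read_millidegrees(path):
    """Read a file holding a temperature in millidegrees C (assumed format)."""
    with open(path) as f:
        return int(f.read().strip()) / 1000.0


def log_temps(read_temp, samples, interval_s):
    """Call read_temp() `samples` times, pausing `interval_s` between reads."""
    readings = []
    for _ in range(samples):
        readings.append((time.time(), read_temp()))
        time.sleep(interval_s)
    return readings


def over_limit(readings, limit_c=LIMIT_C):
    """Return only the (timestamp, temperature) pairs above the limit."""
    return [(t, c) for t, c in readings if c > limit_c]
```

Passing the reader in as a function keeps the loop independent of where the number comes from, so the same loop works whether the value is read from a sensor file or typed in by hand. An empty result from over_limit after a few hours under load would make heat a less likely suspect.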
Well, I'm going to put the old power supply back in and wait for further
suggestions. If I can't get by this present dilemma, I'll call ABIT Monday and
see what they have to say. If they say it's not their board, I'll re-install and
hope for the best.
--
Wayne T. Watson (Watson Adventures, Prop., Nevada City, CA)
(121.015 Deg. W, 39.262 Deg. N) GMT-8 hr std. time)
Obz Site: 39° 15' 7" N, 121° 2' 32" W, 2700 feet
"I know that defies the law of gravity, but, you see, I never
studied the law of gravity." -- Bugs Bunny
Web Page: <home.earthlink.net/~mtnviews>