Erratic Crashes at Boot (Was Power Supply Cause of Crashes? )

From: W. Watson (wolf_tracks_at_invalid.inv)
Date: 03/20/05

    Date: Sun, 20 Mar 2005 05:01:20 GMT
    
    

    Comments mixed in below.
    Floyd L. Davidson wrote:

    > "W. Watson" <wolf_tracks@invalid.inv> wrote:
    >
    >>OK, let me review the situation.
    >>
    >>1. For a year this computer has had unexplained crashes about
    >>twice a month. Each time it is difficult to get the power back
    >>on. I recently found a sure way to do it. Turn off all power,
    >>and make sure the red LED on the MB is out, then power on, and
    >>boot. Up until then I thought it was a bad outlet connection. My
    >>explorations on this suggest otherwise. That is, it has
    >>something to do with the computer and not the wall outlet.
    >
    >
    > That is almost certainly a heat problem. Either the power
    > supply is full of dust bunnies, a fan is not working, or it
    > simply never had enough cooling... and/or the ambient
    > temperature is too high.
    I dusted it out pretty well with a can of air. It really has been pretty clean
    in the year it's been in action. I did get a look at the temps somewhere along
    the line and they looked quite reasonable. The temp in the room where the
    computer is has been in the mid 50s.
    >
    > Having to actually unplug, and wait for a bit, is highly
    > indicative that it is the power supply which is overheating, but
    > that is not necessarily true.
    I can test that idea by waiting much longer. If it is convenient to do so, I'll
    give it a try.

    Note that I've done a fair amount of cleaning (air can), reconnecting, pushing
    on components to seat them, and looking for obvious defects (damaged cable).
    >
    > Once you get the OS working again, look into this further.
    > Under load, find out what the CPU temperature is, for example.
    > Anything over perhaps 60C is bad.
    Got nowhere near that when I booted from Knoppix.
    >
    >
    >>2. About a week ago it died as I was rebooting. That is the
    >>power suddenly went off. Attempts to bring it up met with a lot
    >>of fscking during boot, and getting to the login stage, when it
    >>would suddenly fail again. The fsck were advised by the book
    >>with a Y to the prompt, since it reported the system had been
    >>shutdown uncleanly. There had been no recent changes to
    >>hardware. I had just finished doing a realtime kernel build, and
    >>was just going to power down for the morning. Nothing unusual
    >>had taken place.
    >
    >
    > Your crash apparently damaged some files, and fsck will have
    > saved them, but not with their names. Which means you no doubt
    > have some corrupted or nonexistent system configuration files.
    So far the limited text files I've moved to the other HD look good. There's a
    fair amount of data on hda but if I lose it, so be it.
    >
    >
    >>3. I decided to see if the HDs were intact. I booted from a
    >>Knoppix on a CD, and examined the HDs. They passed the fsck for
    >>hda, and hdb. I transferred files from hda to hdb to be on the
    >>safe side, since if I had to re-install RHL 9 it would be on hda.
    >
    >
    > By that time, the fsck being run at boot time had already fixed
    > the filesystem, but it won't fix the files themselves... so it
    > crashes.
    >
    >
    >>4. While in Knoppix, I looked at the log messages: boot.log,
    >>dmesg, and another one (the name escapes me at the moment, ah
    >>...)--messages. I could only find one message that looked
    >>suspicious. I reported on this in the thread, "Is My Hard Drive
    >>Dead". Here's part of my post and the response to it.
    >>"W. Watson" <wolf_tracks@invalid.inv> writes:
    >>
    >>...Looking in dmesg on the hda drive, I
    >
    >
    > except that dmesg is a circular buffer, and of course does not
    > survive reboots. Hence all of this tells what happened as it
    > booted Knoppix. Basically hardware related messages aren't
    > going to mean much here... :-(
    Interesting. It looked like it just added to the end of the file. There's a
    boot.log, boot1.log, ..., boot4.log. I'm not sure what's going on there. I just
    looked at boot.log.
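
    A minimal sketch, in Python, of what "circular buffer" means here. The
    capacity of 4 and the message strings are made up for illustration; the
    real kernel buffer is sized in bytes, not lines. Once the buffer is full,
    each new message pushes out the oldest one, which is why old entries
    cannot be found there later.

```python
from collections import deque


def make_ring(capacity):
    """Return a fixed-size buffer that drops its oldest entry when full."""
    return deque(maxlen=capacity)


ring = make_ring(4)  # capacity 4 is an arbitrary, illustrative size
for n in range(1, 7):
    ring.append(f"message {n}")

# Messages 1 and 2 have been pushed out; only the newest four remain.
print(list(ring))
```

    A file such as boot.log is different: it sits on the disk and is still
    there after a reboot.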
    >
    >
    >>>>see, from Knoppix, "Error"-- about 8 lines below
    >>>>=========dmesg=======================
    >
    > ...
    >
    >>>>hda: task_no_data_intr: status=0x51 { *DriveReady SeekComplete Error* }
    >>>>hda: task_no_data_intr: error=0x04 { *DriveStatusError* }
    >>
    >>Getting those messages at that point (but not later on), is completely
    >>harmless. It's the driver checking whether the drive supports some
    >>feature, or something like that.
    >
    >
    > Or it may be checking timing. Regardless, it is indeed
    > harmless.
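
    The names in braces are simply the bits that are set in the status and
    error bytes. A minimal sketch of that decoding in Python follows. The
    bit-to-name tables are an assumption based on the usual ATA register
    layout and the labels the Linux IDE driver prints; only the three status
    names and the one error name quoted above are confirmed by this log.

```python
# Assumed mapping of ATA status-register bits to the driver's labels.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]

# Only the error bit seen in the log is listed; the others are left out on purpose.
ERROR_BITS = [(0x04, "DriveStatusError")]


def decode(value, table):
    """Return the names of all bits in `table` that are set in `value`."""
    return [name for mask, name in table if value & mask]


print(decode(0x51, STATUS_BITS))
print(decode(0x04, ERROR_BITS))
```

    Since 0x51 is 0x40 + 0x10 + 0x01, the first call gives the same three
    names that appear in the dmesg line.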
    >
    >
    >>5. I called ABIT about my VA-10 motherboard with an AMD XP 2100
    >>MHz CPU. They said it sounded like a PSU problem and urged me to
    >>try a new PSU. As I reported a few messages back, the new PSU
    >>didn't solve the problem.
    >
    >
    > It might very well solve the intermittent reboots. But it can't
    > restore your system configuration or repair the files.
    If I'm lucky, it may not matter. If.
    >
    >
    >>6. Here's the state of the problem.
    >>
    >>Case a: First let me describe the normal boot up when the system
    >>is working fine. Everything goes along smoothly, and I get to
    >>the login. There's a momentary presentation of a text login
    >>quickly followed by a screen that comes up that looks like the
    >>screen when I shutdown. A few seconds later, the correct Redhat
    >>login screen appears with a prompt (logoff, logon, Shutdown,
    >>etc. icons across the bottom). I login--successfully.
    >
    >
    > You are booting into X, using xdm or some other similar display
    > manager.
    >
    > The first "text login" screen is a virtual console, and
    > immediately following that the X server is started and a
    > graphical login is provided.
    >
    >
    >>Case b: Let me describe what's happening now. Each time I try to
    >>boot up, I get a message of an unclean shutdown, and choose to
    >>have fsck repair things. It takes about 7 minutes to work
    >>through the start up to the text login.
    >
    >
    > You have a problem with the filesystem that cannot be fixed with
    > the way fsck is automatically run at boot. You need to boot
    > into single user mode and run fsck manually to fix the problems
    > (which by now might be considerable!).
    How do I get into single user mode? Maybe that's what I'm doing when I
    successfully login, as described below, and do "shutdown -h now".
    >
    >
    >>I briefly get a text
    >>login as above. Next comes a messed up screen that looks very
    >>noisy. It lasts a few seconds. The text login reappears for
    >>about 3-4 seconds, then is replaced again by the messed up
    >>screen. This goes on forever.
    >
    >
    > (BTW, I see no indication that this is a hardware problem.)
    >
    > I'm not positive about exactly what is going on at this point.
    > It is some sort of loop (and there are several possible ways to
    > have a loop at that point). It appears, since whatever you
    > enter is still there each time, that it is not actually
    > restarting X or the display manager. Hence it is probably a
    > redisplay loop, inside the display manager, and something it is
    > doing is not working... and my first bet is that an image file
    > that is being displayed on the screen, as part of your login
    > splash, is corrupted. The xdm program, as an example, puts a
    > little xpm image on the screen just to the left of the login
    > prompt. (I use xdm, but it is very modified and I don't recall
    > what the original xpm file said, but I think it was something
    > about the X server.)
    >
    > When X displays a fubar xpm file, it can do all sorts of things!
    >
    >
    >>However, if I patiently enter root
    >>for the user id, that's all I've got time for, and wait for the
    >>login to reappear again, I'll get the request for a
    >>password. 2-3 more cycles of this and I can get my 8 character
    >>password in and press Enter. Now I'm logged in. Now I go through
    >>4-5 cycles of login/funky-screen and manage to get in: shutdown
    >>-h now. If I succeed, it shuts down normally. When I reboot, all
    >>goes well again, but the same screwy login/funky-screen. Unless
    >>I want to go through the tiring text login, I am forced to
    >>reboot.
    >
    >
    > You could do "init s" instead of shutdown...
    I'm in a bit over my head here. So I just enter "init s", and that will bring it
    down in some reasonable fashion?
    >
    >
    >>Case c: Basically, this is case b again, but it seems like
    >>this is the difference. I try to type root in for the first
    >>appearance of the text login. I'm successful, but the power dies
    >>almost immediately when the funky screen is coming up. I think
    >>if I wait for a later text login, I get case b.
    >
    >
    > Whoa, this is really different! You actually have time to type
    > in a userid and password at the text login??? And then the
    > power goes out? Hmmmm... I've never seen it last long enough
    > to do more than read it (from a burned memory image, as it's
    > gone too fast to react).
    Not quite. I never get a shot at the password. Once I get root typed and hit
    enter, the game is over. It's going to power down. I'm pretty sure that's what
    happens. It's a distinctly different situation than case b where I wait for a
    few cycles of login/crud, then start to login.
    >
    >
    >>If anyone wants to see all the logs I got while in Knoppix, I'll post them.
    >>
    >>I see someone posted a suggestion about logging and checking out
    >>Xwin (manually issuing startx, etc). That might be smart. I'll
    >>see if I can do it.
    >>
    >>After this message, I will try to start a new thread with a
    >>slightly different subject. The topic on power supplies is pretty
    >>much exhausted. It's now shifting to X-win.
    >
    >
    > Do *not* start another thread. If you want to change the
    > subject line, go ahead, but do *not* drop the References:
    > headers that keep all of this threaded.
    Sounds right. I'm changing the title of this to "Erratic Crashes at Boot". It'd
    be good to boost this up to a higher level, so that the indenting doesn't keep
    going further to the right.
    >
    > Whatever, boot to single user, either at boot time or using
    > "init s".
    >
    > Regardless, you *must* stop the random poking at the box and
    > start logically following a pattern of steps that isolates
    > the problem...
    >
    I'm a bit over my head on that account. My knowledge of Linux is pretty limited.
    I've gotten quite a ways off the expected trail. The trail is this. I'm an end
    user of an application that runs realtime Linux, rtlinux. It's supposed to
    pretty much run 24/7. I know enough about Linux to install it and to build the
    required rt kernel from a set of instructions provided. I can build the
    application from another set of instructions, and operate the application. I
    have some modest idea of how X-Window works, and generally how Linux works.
    Debugging this problem is beyond my skills unless I get some help. I work with
    an organization that can barely support this effort, and they are a long way
    from me. I would have had second thoughts of taking on this application if I
    knew how much time it took to fix 'little' problems. I'm retired and do this
    strictly as a volunteer.

    There's one thing I really know how to do. I can wipe out that primary disk and
    reinstall RHL 9 and begin again as though none of this happened. Perhaps the
    only reason I haven't done that yet is that I find the erratic powering down of
    the machine puzzling. If I can figure out why that happens, I'll be better off.

    I think the crux of the matter is to discover why the erratic shutdown. I have
    another computer that runs a similar ABIT board, and I think it just plain died
    about six months into its use. I'm pretty confident that I'm not mistaken and
    that was instead this board.

    Maybe a smart thing to do regarding heat is just to bring up the box with
    Knoppix and wait a few hours, bring it down, and look at BIOS to see what temps
    it's running at? I did leave the machine on in this mode yesterday for 4 hours
    without anything falling apart.

    Well, I'm going to put the old power supply back in and wait for further
    suggestions. If I can't get by this present dilemma, I'll call ABIT Monday and
    see what they have to say. If they say it's not their board, I'll re-install and
    hope for the best.

    -- 
                  Wayne T. Watson (Watson Adventures, Prop., Nevada City, CA)
                      (121.015 Deg. W, 39.262 Deg. N) GMT-8 hr std. time)
                       Obz Site:  39° 15' 7" N, 121° 2' 32" W, 2700 feet
                  "I know that defies the law of gravity, but, you see, I never
                   studied the law of gravity." -- Bugs Bunny
                             Web Page: <home.earthlink.net/~mtnviews>
    
