Re: Does Red Hat Linux log all hardware events/issues/errors in /var/log/mcelog?



Hi

Try ethtool, or look at /proc/net/dev, to check the network interface
statistics. For the HBAs, you could use the HBA vendor's management
software installed on the OS.

For example, if the HBAs are from Emulex, you could use the HBAnyware CLI
commands to run a health check.
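As a rough sketch of the /proc-style check, the shell function below prints any interface whose kernel rx_errors or tx_errors counter is non-zero. It assumes a Linux kernel that exposes the standard per-interface counters under /sys/class/net/<iface>/statistics/; the function name nic_errors is hypothetical. `ethtool -S <iface>` shows the more detailed, driver-specific counters.

```shell
#!/bin/sh
# Hypothetical helper: list interfaces with non-zero kernel error counters.
# Reads /sys/class/net/<iface>/statistics/{rx_errors,tx_errors}; an optional
# first argument replaces /sys/class/net (useful for testing).
nic_errors() {
    root=${1:-/sys/class/net}
    for dir in "$root"/*; do
        # Skip entries that are not interfaces (e.g. bonding_masters).
        [ -d "$dir/statistics" ] || continue
        iface=$(basename "$dir")
        for counter in rx_errors tx_errors; do
            file="$dir/statistics/$counter"
            [ -r "$file" ] || continue
            value=$(cat "$file")
            # Counters are cumulative since boot, so compare runs over time
            # rather than treating any non-zero value as a current fault.
            if [ "$value" -gt 0 ]; then
                echo "$iface $counter=$value"
            fi
        done
    done
}

# Usage: nic_errors
```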
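Independently of the vendor software, a minimal sketch, assuming the HBAs use a Linux Fibre Channel driver that registers with the kernel's FC transport class (so each port appears as /sys/class/fc_host/hostN), is to report every port whose port_state is anything other than Online. The function name fc_port_problems is hypothetical, and a port that is intentionally left unused will also be reported.

```shell
#!/bin/sh
# Hypothetical helper: report Fibre Channel HBA ports that are not Online.
# Uses the kernel FC transport class (/sys/class/fc_host/hostN/port_state),
# which does not depend on the HBA vendor's tools. An optional first
# argument replaces /sys/class/fc_host (useful for testing).
fc_port_problems() {
    root=${1:-/sys/class/fc_host}
    for dir in "$root"/host*; do
        # If the glob matched nothing, "$dir" is the literal pattern; skip it.
        [ -r "$dir/port_state" ] || continue
        state=$(cat "$dir/port_state")
        if [ "$state" != "Online" ]; then
            echo "$(basename "$dir") port_state=$state"
        fi
    done
}

# Usage: fc_port_problems
```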

- Ajay
On Mar 13, 2012 9:39 PM, "unix syzadmin" <unixsyzadmin@xxxxxxxxx> wrote:

Thanks.
I have downloaded and installed OpenManage from Dell.
The following commands report whether the system components are healthy:
omreport chassis - health of all main components of the system chassis
omreport chassis processors - CPU health
omreport chassis memory - memory health
omreport chassis pwrsupplies - power supply health
omreport storage controller - RAID controller health
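The per-component checks above could be wrapped in one cron-friendly function. This sketch assumes each report prints one `Status : <value>` line per item (the pdisk and temperature output quoted later in this thread has that shape) and treats any value other than Ok as a problem; the function name and the OMREPORT override variable are hypothetical, and the exact fields may differ between OMSA versions.

```shell
#!/bin/sh
# Hypothetical wrapper around the omreport queries listed above.
# Assumes each report prints "Status : <value>" lines; any value other than
# "Ok" is printed and makes the function return 1.
OMREPORT=${OMREPORT:-omreport}   # override to point at another command

omsa_status_check() {
    rc=0
    for args in "chassis processors" "chassis memory" \
                "chassis pwrsupplies" "storage controller"; do
        # $args is intentionally unquoted so it splits into two arguments.
        bad=$($OMREPORT $args | awk -F' *: *' '
            { sub(/^[ \t]+/, "") }
            $1 == "Status" && $2 != "Ok"')
        if [ -n "$bad" ]; then
            echo "omreport $args:"
            echo "$bad"
            rc=1
        fi
    done
    return $rc
}

# Usage: omsa_status_check || echo "hardware needs attention"
```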

However, this leaves out the integrated NIC ports and the HBAs.
What Linux or Dell OpenManage commands can be used to confirm that those
are healthy as well?

Thanks,


On Mon, Mar 12, 2012 at 9:00 PM, Paul Tader <ptader@xxxxxxxxxxxxxx> wrote:

On 3/12/12 5:28 PM, unix syzadmin wrote:

Hi,

We run Red Hat Linux on Intel hardware (mostly Dell, lately Dell R710s).
We want to catch any hardware issue as soon as it occurs so that we can act
on it as quickly as possible.

My understanding is that all hardware events/issues/errors are logged in
/var/log/mcelog (the Machine Check Events log). Is this correct? I can't
stress this enough: does it log all hardware issues (CPU, memory, disk,
Ethernet, Fibre Channel/HBA, etc.)?

Thanks,


I've used mcelog to catch some CPU events, but I think you might want to
check out Dell's OpenManage Server Administrator (OMSA). It reports and
monitors a lot more information.


http://linux.dell.com/wiki/index.php/Repository/OMSA
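On the mcelog question: mcelog records machine check exceptions raised by the CPU (processor, cache and memory-controller errors), so it will not see disk, Ethernet or Fibre Channel HBA faults; those are reported by their drivers in the kernel log or by vendor tools. If you still want mcelog in a cron job, a minimal sketch is to print only what has been appended to /var/log/mcelog since the previous run. The function name and the state-file path are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print whatever was appended to a log since the last call.
# $1 = log file (e.g. /var/log/mcelog)
# $2 = state file holding the size seen on the previous call
new_log_lines() {
    log=$1
    state=$2
    [ -r "$log" ] || return 0
    size=$(wc -c < "$log" | tr -d ' ')
    last=$(cat "$state" 2>/dev/null || echo 0)
    # A smaller file means it was rotated or truncated: start from the top.
    if [ "$size" -lt "$last" ]; then
        last=0
    fi
    if [ "$size" -gt "$last" ]; then
        tail -c +"$((last + 1))" "$log"
    fi
    echo "$size" > "$state"
}

# Example cron usage (paths are assumptions):
# out=$(new_log_lines /var/log/mcelog /var/tmp/mcelog.offset)
# [ -n "$out" ] && echo "$out" | mail -s "MCE on $(hostname)" root
```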


To install:

# wget -q -O - http://linux.dell.com/repo/hardware/latest/bootstrap.cgi | bash
# yum install srvadmin-base
# yum install srvadmin-storageservices

(log out and log back in for the environment variables to take effect)

# /opt/dell/srvadmin/sbin/srvadmin-services.sh start
...

# omreport chassis
Health

Main System Chassis

SEVERITY : COMPONENT
Ok : Fans
Ok : Intrusion
Ok : Memory
Ok : Power Supplies
Ok : Processors
Ok : Temperatures
Ok : Voltages
Ok : Hardware Log
Ok : Batteries
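To turn the chassis summary into something cron can act on, here is a sketch that reads `omreport chassis` output on stdin, prints every component whose severity is not Ok, and exits non-zero if it finds one. It assumes the `SEVERITY : COMPONENT` layout shown above; the function name chassis_problems is hypothetical.

```shell
#!/bin/sh
# Hypothetical filter for "omreport chassis" output (SEVERITY : COMPONENT rows).
# Prints "<component> (<severity>)" for every row not marked Ok and exits 1
# if any such row was found, 0 otherwise.
chassis_problems() {
    awk -F' *: *' '
        { sub(/^[ \t]+/, "") }               # drop leading indentation
        NF < 2 || $1 == "SEVERITY" { next }  # skip titles, blanks, header
        $1 != "Ok" { print $2 " (" $1 ")"; bad = 1 }
        END { exit bad }
    '
}

# Usage: omreport chassis | chassis_problems || echo "chassis needs attention"
```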

# omreport chassis temps
Temperature Probes Information

------------------------------------
Main System Chassis Temperatures: Ok
------------------------------------

Index : 0
Status : Ok
Probe Name : System Board Ambient Temp
Reading : 20.0 C
Minimum Warning Threshold : 8.0 C
Maximum Warning Threshold : 42.0 C
Minimum Failure Threshold : 3.0 C
Maximum Failure Threshold : 47.0 C
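The thresholds in this output make it easy to compute how much headroom each probe has. A sketch, assuming the field names shown above (Probe Name, Reading, Maximum Warning Threshold) and that each probe's Reading line comes before its warning threshold, as it does here; the function name temp_headroom is hypothetical.

```shell
#!/bin/sh
# Hypothetical filter for "omreport chassis temps" output: for each probe,
# print the reading, the maximum warning threshold and the difference.
temp_headroom() {
    awk -F' *: *' '
        { sub(/^[ \t]+/, "") }
        $1 == "Probe Name" { name = $2 }
        $1 == "Reading"    { split($2, r, " "); reading = r[1] }
        $1 == "Maximum Warning Threshold" {
            split($2, w, " ")
            printf "%s: reading %.1f C, warning at %.1f C (headroom %.1f C)\n",
                   name, reading, w[1], w[1] - reading
        }
    '
}

# Usage: omreport chassis temps | temp_headroom
```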

# omreport storage pdisk controller=0

List of Physical Disks on Controller SAS 6/iR Integrated (Embedded)

Controller SAS 6/iR Integrated (Embedded)
ID : 0:0:0
Status : Ok
Name : Physical Disk 0:0:0
State : Online
Failure Predicted : No
Certified : Not Applicable
Encryption Capable : No
Secured : Not Applicable
Progress : Not Applicable
Bus Protocol : SAS
Media : HDD
Capacity : 67.75 GB (72746008576 bytes)
Used RAID Disk Space : 67.75 GB (72746008576 bytes)
Available RAID Disk Space : 0.00 GB (0 bytes)
Hot Spare : No
Vendor ID : DELL
Product ID : ST973402SS
Revision : S229

<snip>

You get the idea.
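The same idea applies to the physical-disk listing: a sketch that prints each disk whose Status is not Ok or whose Failure Predicted field is Yes, and exits non-zero if it finds one. It assumes the field names shown above; the function name pdisk_problems is hypothetical. State is left out because healthy disks can legitimately be in states other than Online (unconfigured or spare disks, for instance).

```shell
#!/bin/sh
# Hypothetical filter for "omreport storage pdisk controller=N" output.
# Prints "<ID>: ..." for disks with a non-Ok Status or a predicted failure,
# and exits 1 if any were found.
pdisk_problems() {
    awk -F' *: *' '
        { sub(/^[ \t]+/, "") }
        # IDs such as 0:0:0 contain colons, so keep everything after "ID : ".
        $1 == "ID" { id = $0; sub(/^ID *: */, "", id) }
        $1 == "Status" && $2 != "Ok" { print id ": Status " $2; bad = 1 }
        $1 == "Failure Predicted" && $2 == "Yes" { print id ": failure predicted"; bad = 1 }
        END { exit bad }
    '
}

# Usage: omreport storage pdisk controller=0 | pdisk_problems
```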

--
redhat-list mailing list
unsubscribe mailto:redhat-list-request@redhat.com?subject=unsubscribe
https://www.redhat.com/mailman/listinfo/redhat-list
