Uhhuh. NMI received for unknown reason, Dazed and Confused
If you search Google, you'll find tons of results for the string "Uhhuh. NMI received for unknown reason" (perhaps even with the note "b9 on CPU 0"), or for "Dazed and confused, but trying to continue".
Maybe you'll be lucky enough to find this blog post...
This post applies specifically to HP ProLiant servers running CentOS 6 with a P410 RAID controller. We were running kernels 2.6.32-220.el6.i686 and 2.6.32-220.7.1.el6.i686, but it's possible that other versions in this series are affected as well. (In fact, it's also possible that this applies to the P212, P410i, P411, and P812 controllers too.)
Over the past few months, we've been rolling out CentOS 6 on all of our firewalls, Samba servers, and general-purpose systems. In most instances, it runs well. Unfortunately, on some servers we get NMI errors - and they scroll across the console like crazy, making it extremely difficult to do anything without closing your eyes.
Like everyone else, our initial thought was memory, since in our experience an NMI error usually leads back to a faulty RAM module (or two). We believed this to be the answer, until we started seeing the problem on more than one server.
The errors we were getting were a bit misleading, and we needed to look at them in more detail. We start with the IPMI SEL (Intelligent Platform Management Interface System Event Log). To view it, we use OpenIPMI:
Install OpenIPMI:
yum -y install OpenIPMI OpenIPMI-tools
chkconfig ipmi on
service ipmi start
Look at the log:
ipmitool sel elist | tail
If you look at the log, you may see a lot of errors like this:
d38 | 04/01/2012 | 11:59:03 | Critical Interrupt #0x03 | PCI PERR | Asserted
d50 | 04/01/2012 | 11:59:03 | Critical Interrupt #0x03 | PCI PERR | Asserted
"PCI PERR" alerts are actually related to the PCI bus, not memory. Therefore, our attention should be drawn there.
However, just because it says it's the PCI bus doesn't mean you need to replace all of your PCI/PCIe/PCI-X cards. Instead, look at the dmesg output of a freshly booted server. If you're lucky, you'll see something like this:
usb 2-1.1: new full speed USB device using ehci_hcd and address 3
Uhhuh. NMI received for unknown reason a1 on CPU 0.
You have some hardware problem, likely on the PCI bus.
Dazed and confused, but trying to continue
usb 2-1.1: New USB device found, idVendor=0000, idProduct=0000
If you're like me, you would probably assume that the error relates to the USB subsystem, so you would modify /etc/grub.conf, add "nousb" to your kernel boot parameters, and reboot.
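For reference, the kernel line in /etc/grub.conf would end up looking something like this (the root device and other options shown here are placeholders; the only change is the trailing "nousb"):
kernel /vmlinuz-2.6.32-220.7.1.el6.i686 ro root=/dev/sda1 quiet nousb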
Of course, after you do that and boot the server, you learn that the error still occurs.
Had you kept looking, you might have also seen this sequence:
ata4: SATA max UDMA/133 cmd 0xc880 ctl 0xc800 bmdma 0xc488 irq 19
Uhhuh. NMI received for unknown reason b9 on CPU 0.
You have some hardware problem, likely on the PCI bus.
Dazed and confused, but trying to continue
So here, it looks like maybe the problem is actually related to the SATA controller (or disks, who knows)...
In our case, it was the controller, so we started digging. First, we looked at our actual RAID controller firmware:
[root@centos6 ~]# dmesg | grep -E 'RAID.*HP'
scsi 0:0:0:0: RAID HP P410 3.52 PQ: 0 ANSI: 5
[root@centos6 ~]#
From the above, we're running firmware version 3.52 - not the latest version.
At the time of this writing, version 5.14 is the latest (many, many releases later than 3.52), and since we like to run the latest firmware whenever possible, upgrading seems to be the logical path.
We then turn our attention to HP's download page for firmware version 5.14 [2].
Once we've downloaded CP016377.md5 and CP016377.scexe, we first check the md5sum of CP016377.scexe to confirm the download was successful.
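The check is just a visual comparison of the computed hash against the one HP publishes (output omitted here; the two values should match):
[root@centos6 ~]# md5sum CP016377.scexe
[root@centos6 ~]# cat CP016377.md5
With the checksum confirmed, we make the installer executable and run it: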
[root@centos6 ~]# chmod 0755 CP016377.scexe
[root@centos6 ~]# ./CP016377.scexe
HP Enclosure ROM Flash.
Flash Engine Version: 2.06.10
Copyright (c) 2006-2009 Hewlett-Packard Development Company L.P.
Device [Smart Array P410]
Flash this device? [NO, yes, quit] yes
Preparing to flash devices on the array controller...
Requesting flash - this could take up to 15 minutes...
Flash complete.
The array flash operation succeeded.
Device [Smart Array P410]
Flash this device? [NO, yes, quit] yes
Preparing to flash devices on the array controller...
Requesting flash - this could take up to 15 minutes...
Flash complete.
The array flash operation succeeded.
[root@centos6 ~]#
At the completion of the flash process above, we completely shut down our server and physically unplug the power cords for a few seconds.
Once this is done, we plug the power back in and boot the server again.
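A quick sanity check after the reboot is to re-run the same dmesg grep from earlier; assuming the flash took, the controller should now report firmware revision 5.14 instead of 3.52:
[root@centos6 ~]# dmesg | grep -E 'RAID.*HP'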
Once the system is back up, you'll also notice that the errors are no longer scrolling across the screen. We confirm this via IPMI again:
[root@centos6 ~]# ipmitool sel elist
18 | 04/01/2012 | 13:45:01 | Critical Interrupt #0x03 | PCI PERR | Asserted
30 | 04/01/2012 | 13:45:02 | Critical Interrupt #0x03 | PCI PERR | Asserted
48 | 04/01/2012 | 13:45:02 | Critical Interrupt #0x03 | PCI PERR | Asserted
60 | 04/01/2012 | 13:45:02 | Critical Interrupt #0x03 | PCI PERR | Asserted
78 | 04/01/2012 | 13:45:02 | Critical Interrupt #0x03 | PCI PERR | Asserted
90 | 04/01/2012 | 13:46:38 | Button Power Button | State Asserted
a8 | 04/01/2012 | 13:45:23 | Button Power Button | State Asserted
[root@centos6 ~]#
WAIT! The errors are still there? .... Not so fast ...
Look at the timeline of events in the event log. Notice that the power button alerts come after the PCI PERR alerts. That's because the errors still present in the event log occurred prior to the physical power-off. (Interestingly enough, the power-off alert occurs before the power-on alert, but isn't inserted into the log until after the power-on notice appears. More than likely it was queued up somewhere - but, I digress.)
Since we want to be certain that these errors are no longer occurring, we clear the log:
[root@centos6 ~]# ipmitool sel clear
Clearing SEL. Please allow a few seconds to erase.
[root@centos6 ~]# ipmitool sel elist
SEL has no entries
[root@centos6 ~]#
At this point, go ahead and reboot again. Once you do, the log will be nearly empty, save for a few power-on/power-off notices.
For me, this ended the problem. The server boots normally, the RAID arrays work fine, and everything is magic.
HOWEVER, for others, upgrading to firmware version 5.14 may cause some undue stress when attempting to boot into HP ORCA (Option ROM Configuration for Arrays). Most notably, you can no longer create/manage your RAID arrays via this utility.
This is a known HP bug in firmware versions 5.12 and 5.14, and (as of this writing) the only workaround is to use the HP ACU (Array Configuration Utility) CD. See the HP advisory on this matter [1].
References
[1] HP Advisory on 5.14/5.12 controllers - http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&taskId=110&prodSeriesId=3885791&prodTypeId=329290&prodSeriesId=3885791&objectID=c03161926
[2] HP P410 Array Firmware version 5.14 - http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=us&prodTypeId=329290&prodSeriesId=3883890&prodNameId=3883931&swEnvOID=4024&swLang=8&taskId=135&swItem=MTX-2a4fd2395826468dad49bb19e3&mode=3
[3] HP ACU (Array Configuration Utility) - http://h18004.www1.hp.com/products/servers/proliantstorage/software-management/acumatrix/sw-drivers.html
[4] OpenIPMI - http://openipmi.sourceforge.net/