Sunday, April 1, 2012

Uhhuh. NMI received for unknown reason, Dazed and Confused

If you search Google, you will find tons of results for the string "Uhhuh. NMI received for unknown reason" (maybe even with the note b9 on CPU 0), or for "Dazed and confused, but trying to continue".

Maybe, you'll be lucky enough to find this blog post....

This post applies specifically to HP ProLiant servers running CentOS 6 with a P410 RAID controller. We were running 2.6.32-220.el6.i686 and 2.6.32-220.7.1.el6.i686, but it's possible that other versions in this series are affected as well. (In fact, it's also possible that this applies to the P212, P410i, P411, and P812 controllers too.)

Over the past few months, we've been rolling out CentOS 6 on all of our firewalls, samba servers and general purpose systems. In most instances, this runs well. Unfortunately however, sometimes we seem to get NMI errors on some servers - and these errors scroll across the console like crazy, making it extremely difficult to do anything without closing your eyes.

Like everyone else, our initial thought was memory, since in our experience an NMI error usually leads back to a faulty RAM module (or two). We believed this to be the answer until we started seeing the problem on more than one server.

The errors we were getting were a bit misleading, so we needed to look at them in more detail. We start with the IPMI SEL (Intelligent Platform Management Interface System Event Log). To view this, we use OpenIPMI:

Install OpenIPMI:
yum -y install OpenIPMI OpenIPMI-tools
chkconfig ipmi on
service ipmi start



Look at the log:
ipmitool sel elist | tail



If you look at the log, you may see a lot of errors like this:
d38 | 04/01/2012 | 11:59:03 | Critical Interrupt #0x03 | PCI PERR | Asserted
d50 | 04/01/2012 | 11:59:03 | Critical Interrupt #0x03 | PCI PERR | Asserted
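
To get a feel for how many of these have piled up, the entries can simply be counted. This is a minimal sketch, assuming the SEL lines look like the sample above; the function name count_pci_perr is made up for illustration.

```shell
# Count "PCI PERR" entries in SEL output read from stdin.
count_pci_perr() {
    grep -c 'PCI PERR'
}

# Usage on a live server (needs the ipmi service running):
#   ipmitool sel elist | count_pci_perr
```

A count that keeps growing between runs suggests the errors are still being generated, rather than being leftovers from an earlier boot.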



"PCI PERR" alerts are actually related to the PCI bus, not memory. Therefore, our attention should be drawn there.

However, just because it says it's the PCI bus doesn't mean you need to replace all of your PCI/PCIe/PCI-X cards. Instead, look at the dmesg output of a freshly booted server. If you're lucky, you'll see something like this:

usb 2-1.1: new full speed USB device using ehci_hcd and address 3
Uhhuh. NMI received for unknown reason a1 on CPU 0.
You have some hardware problem, likely on the PCI bus.
Dazed and confused, but trying to continue
usb 2-1.1: New USB device found, idVendor=0000, idProduct=0000



If you're like me, you would probably assume that the error relates to the USB subsystem, so you would modify /etc/grub.conf, add "nousb" to your kernel boot parameters, and then reboot.

Of course, after you do that and boot the server, you learn that the error still occurs.

Had you kept looking, you may have also seen this sequence:

ata4: SATA max UDMA/133 cmd 0xc880 ctl 0xc800 bmdma 0xc488 irq 19
Uhhuh. NMI received for unknown reason b9 on CPU 0.
You have some hardware problem, likely on the PCI bus.
Dazed and confused, but trying to continue



So here, it looks like maybe the problem is actually related to the SATA controller (or disks, who knows)...
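
Pulling each NMI message out of the kernel log together with the line logged just before it saves some scrolling. This is a rough sketch, assuming GNU grep (for -B1); the function name nmi_context is hypothetical. As the USB example showed, the preceding line is only a hint, not proof of the culprit.

```shell
# Print each NMI message along with the line logged immediately before it.
# Reads kernel log text on stdin; -B1 is a GNU grep option.
nmi_context() {
    grep -B1 'NMI received for unknown reason'
}

# Usage:
#   dmesg | nmi_context
```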

In our case it was the controller, so we started looking there. First, we looked at our actual RAID controller firmware:
[root@centos6 ~]# dmesg | grep -E 'RAID.*HP'
scsi 0:0:0:0: RAID HP P410 3.52 PQ: 0 ANSI: 5
[root@centos6 ~]#



From the above, we're running firmware version 3.52, which is not the latest version.

At the time of this writing, version 5.14 is the latest (many, many versions later), and since we like to run the latest firmware whenever needed, it seems to be the logical upgrade path.

We then turn our attention to this advisory from HP, which lets us download firmware version 5.14.

Once we download CP016377.md5 and CP016377.scexe (and subsequently look at the md5sum of CP016377.scexe to confirm the download was successful), we run the program:

[root@centos6 ~]# chmod 0755 CP016377.scexe
[root@centos6 ~]# ./CP016377.scexe

HP Enclosure ROM Flash.
Flash Engine Version: 2.06.10
Copyright (c) 2006-2009 Hewlett-Packard Development Company L.P.

Device [Smart Array P410]
Flash this device? [NO, yes, quit] yes
Preparing to flash devices on the array controller...
Requesting flash - this could take up to 15 minutes...
Flash complete.
The array flash operation succeeded.
Device [Smart Array P410]
Flash this device? [NO, yes, quit] yes
Preparing to flash devices on the array controller...
Requesting flash - this could take up to 15 minutes...
Flash complete.
The array flash operation succeeded.
[root@centos6 ~]#
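
The checksum comparison done before running the flasher can be wrapped in a small helper. This is only a sketch: the function name verify_md5 is made up, and it assumes you read the expected digest out of the .md5 file yourself, since the exact layout of that file is not shown here.

```shell
# Compare a file's MD5 digest with an expected hex string.
# Returns 0 on a match, non-zero otherwise.
verify_md5() {
    file=$1
    expected=$2
    actual=$(md5sum "$file" | awk '{print $1}')
    [ "$actual" = "$expected" ]
}

# Usage (the digest placeholder is whatever CP016377.md5 lists):
#   verify_md5 CP016377.scexe "<digest from CP016377.md5>" && echo "download OK"
```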



At the completion of the flash process above, we completely shut down our server and physically unplug the power cords for a few seconds.

Once this is done, we reboot the server again.

Once you boot the system, you'll then notice that the errors are no longer scrolling across the screen. We confirm this via ipmi again:

[root@centos6 ~]# ipmitool sel elist
18 | 04/01/2012 | 13:45:01 | Critical Interrupt #0x03 | PCI PERR | Asserted
30 | 04/01/2012 | 13:45:02 | Critical Interrupt #0x03 | PCI PERR | Asserted
48 | 04/01/2012 | 13:45:02 | Critical Interrupt #0x03 | PCI PERR | Asserted
60 | 04/01/2012 | 13:45:02 | Critical Interrupt #0x03 | PCI PERR | Asserted
78 | 04/01/2012 | 13:45:02 | Critical Interrupt #0x03 | PCI PERR | Asserted
90 | 04/01/2012 | 13:46:38 | Button Power Button | State Asserted
a8 | 04/01/2012 | 13:45:23 | Button Power Button | State Asserted
[root@centos6 ~]#



WAIT! The errors are still there? .... Not so fast ...

Look at the timeline of events in the event log. Notice that the power button alerts come after the PCI PERR alerts: the errors present in the event log occurred prior to the physical system power off. (Interestingly enough, the power off alert occurs before the power on alert, but is not inserted into the log until after the power on notice happens. More than likely it was queued up somewhere, but I digress.)
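
Since the SEL is not guaranteed to be in chronological order, sorting it by timestamp makes the timeline easier to read. This is a sketch under assumptions: every line carries MM/DD/YYYY and HH:MM:SS in the second and third " | " separated fields, as in the output above, and the function name sel_by_time is hypothetical.

```shell
# Sort SEL lines from stdin by their date and time fields.
# A sortable key (YYYYMMDD followed by HH:MM:SS) is prepended to each line,
# the lines are sorted on that key, and the key is stripped again.
sel_by_time() {
    awk -F' [|] ' '{
        split($2, d, "/")   # d[1]=MM, d[2]=DD, d[3]=YYYY
        printf "%s%s%s%s\t%s\n", d[3], d[1], d[2], $3, $0
    }' | sort -s -k1,1 | cut -f2-
}

# Usage:
#   ipmitool sel elist | sel_by_time
```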

Since we want to be certain that under no circumstances these errors are occurring any longer, we clear the log:

[root@centos6 ~]# ipmitool sel clear
Clearing SEL. Please allow a few seconds to erase.
[root@centos6 ~]# ipmitool sel elist
SEL has no entries
[root@centos6 ~]#



At this point - go ahead and reboot again. Once you do, your log will be nearly empty, sans a few power-on/power-off notices.

For me, this ends the problem. Server boots normally - RAID arrays work fine, everything is magic.

HOWEVER, for others - upgrading to firmware version 5.14 may cause some undue stress when attempting to boot into HP ORCA (Option ROM Configuration for Arrays). Most notably, you can no longer create/manage your RAID arrays via this utility.

This problem is a known HP bug in firmware versions 5.12 and 5.14, and (as of this writing) the only workaround is to use the HP ACU (Array Configuration Utility) CD to manage your arrays. Here is the HP Advisory on this matter.


References
[1] HP Advisory on 5.14/5.12 controllers - http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&taskId=110&prodSeriesId=3885791&prodTypeId=329290&prodSeriesId=3885791&objectID=c03161926

[2] HP P410 Array Firmware version 5.14 - http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=us&prodTypeId=329290&prodSeriesId=3883890&prodNameId=3883931&swEnvOID=4024&swLang=8&taskId=135&swItem=MTX-2a4fd2395826468dad49bb19e3&mode=3

[3] HP Array Configuration Utility - http://h18004.www1.hp.com/products/servers/proliantstorage/software-management/acumatrix/sw-drivers.html

[4] OpenIPMI - http://openipmi.sourceforge.net/

Saturday, November 12, 2011

It has been quiet on the outskirts lately, mainly due to excessively involved projects involving nothing but open source software.

A recent project involved integrating JBoss 7.0.2, Pentaho BI 4.0.0 GA (Tomcat), Apache, PHP 5, Perl 5.8, Oracle Forms 10g and Oracle Database 11g. This project involved load balancing utilizing both Apache Proxy as well as LVS and VMware ESXi. This was coupled with CUPS, Hylafax and some fun hardware configurations to make the whole thing as fault tolerant as possible.

Lessons learned in this project? CUPS wasn't meant to be clustered and forces you into a master-slave operation. LVS to multiple proxies (using WLC) adds an exciting new twist to debugging where your sessions are going. Sticky sessions are easy on the application server but a pain when running deployments, and everything gets more complicated in heterogeneous environments.

I guess if there were a problem to speak of, it would be our IP addressing issue. We have a pair of Cisco ASA firewalls which provide IPsec tunnels to various networks. Some of these networks we control, and some we don't. As there are a few hundred networks, it's excessively complicated to just ask them to change their settings. In fact, the last time we asked for a change, several sites took 6 months to get their changes in place. This really isn't acceptable, and we certainly can't afford that type of delay for the project release.

Working within the boundaries of the tunnel, where each endpoint had access to only one IP address, we needed to solve the problem of the application server printing directly to printers on the remote network. See, the old application allowed users to type in their printer name, like "abc003", and CUPS would print to said printer. The printers were all networked, and in the event that they were USB/Parallel or whatever, they were connected to their own print server somewhere.

The problem we ran into was that CUPS is just a queue. Sure, you print to the queue, but a completely separate thread is responsible for connecting to the remote print server. Without modifying the "socket" source (or other CUPS backend applications), you're forced to use the default behavior, which is that the application connects to the remote endpoint via the "default" interface.

For reference, "default" applies to whatever interface the OS deems as appropriate for sending traffic to a remote endpoint. Routing 101 applies here....

Anyway, the result was that I created a failover cluster of print servers located behind a separate firewall. This firewall was running LVS and was responsible for forwarding traffic directly to the CUPS servers. The CUPS servers used a separate, private address range that they shared solely with the LVS firewall. When the firewall saw traffic to/from that private address range, it used iptables to SNAT the traffic to the single IP that users connected to. As for the application? That was pushed behind the system too so that it could be on its own IP.
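
The SNAT piece boils down to one rule in the nat table. The sketch below only prints the command rather than running it, and every value in it (the private range, the interface name, the public address, and the function name snat_rule) is a hypothetical placeholder, not taken from the actual setup.

```shell
# Print (do not run) an iptables SNAT rule for a private address range.
# All addresses and the interface name in the example call are hypothetical.
snat_rule() {
    private_net=$1   # range shared by the CUPS servers and the firewall
    out_if=$2        # interface facing the tunnels
    public_ip=$3     # the single IP that remote sites are allowed to see
    printf 'iptables -t nat -A POSTROUTING -s %s -o %s -j SNAT --to-source %s\n' \
        "$private_net" "$out_if" "$public_ip"
}

snat_rule 192.168.50.0/24 eth0 203.0.113.10
```

Printing the rule first makes it easy to review before pasting it into a root shell or a firewall script.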

Then comes the new application, which is a bit more complicated: 3 BI servers, 3 application servers, 2 report servers, 2 database servers and 3 proxy servers. In this case, LVS on the aforementioned firewall just balances the port 80 traffic to one of the 3 proxy servers. Then, by using sticky sessions, mod_proxy_balancer, mod_proxy_ajp and mod_headers to push the users to one of the application, report or BI servers (based on the URL), I was able to enforce that one single IP address was seen by all endpoints, both for inbound and outbound connections.

Oh, and for those who just say "well that's what NAT is for"... NAT was used only for the 3 print servers. "Direct Routing" in LVS was used for the rest, which puts a lower load on the load balancing servers and requires implementing arptables_jf on each of the servers involved...

Guess this is just a small excerpt of some of the stuff I do on a routine basis.

Saturday, August 28, 2010

Enabling Tethering on the iPhone

Today, I set out on a plan to enable the tethering on my iPhone. During this process, I wrote down the (very easy) steps I took to get my workstation back onto the internet.

Assumptions: iPhone OS Version 4; Windows Vista Desktop

A. Connect your computer to your iPhone
B. Configure your iPhone
1. Click "Settings"
2. Click "General"
3. Select "Internet Tethering"
4. Change "Off" to "On"
5. Click "Network" (top-left breadcrumb)
6. Click "General" (top-left breadcrumb)
7. Click "Settings" (top-left breadcrumb)
8. Click the "Home" button
C. Reboot your computer

At this point, the driver is set up - now we need to set up the network connections

D. Configure Windows Vista
1. Click the "Orb"
2. Click "Control Panel"
3. Click "Network and Internet"
4. Click "View network status and tasks"
5. Click "Manage network connections" (Left-menu)
6. Right click on "Local Area Connection 4" (Apple Mobile Device Ethernet)
7. Click "Rename"
8. Type "iPhone"
9. Hit Enter
10. Agree to the 'Administrative Access' question
11. Right-click on iPhone
12. Click "Disable"
13. Agree to the 'Administrative Access' question
14. Right-click on all remaining network adapters, click "Disable"
15. Agree to the 'Administrative Access' questions

E. Enable only the iPhone connection
1. Right click on "iPhone"
2. Click "Enable"
3. Agree to the 'Administrative Access' question

F. Verify
1. Look at your iPhone screen saver - should read "Internet Tethering"
2. Unlock your screen
3. Look at the "top bar" of the iPhone home screen, should read "Internet Tethering"
4. In Windows, Open up the command prompt
i. Windows-R
ii. cmd.exe
iii. Enter
5. Type "ipconfig" and hit enter
6. Look at your network adapters, should see one that reads "172.10.20.2" with a gateway of "172.10.20.1"
7. Open up Firefox
8. Visit "www.ipchicken.com"
9. Look at your IP address, should be a mobile-*.att.com (e.g. 166.137.137.100)
10. Open OpenVPN client
11. Connect to your VPN
12. Access remote networks as usual
G. Cheer

At this point, everything would be configured correctly - and all is magical. In the future, I may have spare time to show how to do this on Linux - but that depends on how busy life keeps me.

Tuesday, August 11, 2009

Nifty trick with gnu find

Common problem - you have a bunch of files on disk that you need to manage. Let's say you need to scour through /home/${user}/files and set all files to permission 0640 and all directories to 0750. Let's also assume that you have a lot of files to manage.

Performance is important, so naturally you choose to use 'findutils' to perform the action. But how do you get the most out of find? Try this nifty little tidbit:

find "${dir}" \(   \
\( -type d -exec chmod 0750 {} \; \) \
-o \
\( -type f -exec chmod 0640 {} \; \) \
\)

This one-liner has the magic "-o" operator (or) in find - but, with the ability to group your commands, you're not only limited to:
\( -iname '*.pm' -o -iname '*.pl' \)

but you can also combine test/action sequences.

Now, instead of running boring.sh, which takes an incredible amount of time, you can run nifty.sh and cut your find time down considerably, since every file and directory is handled in a single pass over the tree!
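
A possible further tweak, not part of the original script: ending -exec with "+" instead of "\;" lets find hand many paths to each chmod call rather than forking one chmod per path. This is a minimal sketch; the function name fix_perms is made up for illustration.

```shell
# Single pass over the tree, with "+" batching many paths into each chmod
# call instead of starting one chmod process per file or directory.
fix_perms() {
    find "$1" \
        \( -type d -exec chmod 0750 {} + \) \
        -o \
        \( -type f -exec chmod 0640 {} + \)
}

# Usage:
#   fix_perms "/home/${user}/files"
```

The grouping works the same way as before: directories match the left side, and anything that is not a directory falls through the -o to the right side.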


Tuesday, April 21, 2009

Looping through files in bash, quietly

All too often, I find myself looking at a scenario when I want to look at all files located in a particular directory, and for each file - do something to the file (maybe, copy it, rot13 it, several things, etc...).

There are many ways to do this, like

find "${dir}" -maxdepth 1 -type f -exec cat "{}" \;

But what if you wanted to spend some time doing other things with those files, maybe rot13 it and email it to a destination determined by the name and subject? Ok, so that's probably not really what you want to do with those files, but the point is being missed entirely.

Welcome the bash for loop - a simple loop that we're all familiar with, but forget about what it really does. For instance:

for file in *.txt; do
echo ${file}
done

Simple enough - it just echoes the files - but what is happening on that for line? Expansion. Yes, your listing of *.txt is now being expanded to say a.txt, b.txt, c.txt and, more importantly, "don't wait.txt".

Ah-ha, what happens with "don't wait.txt"? The glob itself hands that name to the loop in one piece, but the moment the list comes from something like $(ls *.txt), or ${file} is used unquoted further down, the space is interpreted as a delimiter, so now you have "don't" and "wait.txt" - not really what you wanted now is it?

Of course, if it was - then you should enjoy it now, because when you don't want it - then you're stuck in the situation where the files weren't processed correctly, and your entire business is going to shambles, all because someone decided to be funny and upload a file with a new name that doesn't match your format... But wait, there's a fix!

Aside from the typical notions of using perl, or find with a pipe to an auxiliary function, it can also be handled by using a combination of "ls", "grep" (or find) and "for". Here's a process:

tmpfile="/tmp/fun.tmp.$$"
dir="/home/sites/site123/data/"
cd "${dir}"
ls -1 | grep -E '\.txt$' > "${tmpfile}" 2>/dev/null
# the prior 2 lines can be changed to:
# find ${dir} -maxdepth 1 -type f -name '*.txt',
# but that's just not as fun...
max=$(cat "${tmpfile}" | wc -l)
for ((i=1;i<=${max};i++)); do
file=$(cat "${tmpfile}" | head -n "${i}" | tail -n 1)
# If you really wanted to use perl, might as well use
# IO::Dir/opendir/readdir but if you're focusing on
# this detail - you're missing the point!
perl -i -ne 'tr/[a-zA-Z]/[n-za-mN-ZA-M]/;print;' \
"${file}"
done
rm -f "${tmpfile}"

The magic? Looping through a counter of files, rather than the files themselves, and quoting your arguments so that word splitting doesn't kick in and ruin your day.
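
To see where the splitting actually happens, here is a small comparison sketch. Both function names are hypothetical. The first loops over the glob directly and the second loops over unquoted ls output; each prints how many times its loop body ran.

```shell
# Loop over the glob directly: each matching name stays in one piece.
count_glob() {
    (
        cd "$1" || exit 1
        n=0
        for file in ./*.txt; do
            [ -e "$file" ] || continue   # the glob matched nothing
            n=$((n + 1))
        done
        echo "$n"
    )
}

# Loop over unquoted command substitution: names are split on whitespace.
count_ls() {
    (
        cd "$1" || exit 1
        n=0
        for file in $(ls); do
            n=$((n + 1))
        done
        echo "$n"
    )
}
```

In a directory holding only a.txt and "don't wait.txt", count_glob reports 2 while count_ls reports 3, which is exactly the "don't" and "wait.txt" surprise.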

Friday, April 10, 2009

For what it's worth

Just like everyone else in the world, I have a blog. Expect commentary about once every 6 months at best.