I haven’t written a tech blog post in a while, but I’ve been working on an interesting, albeit frustrating, problem over the last few days.

At work I have 12 Dell PowerEdge 1950 servers, each with dual quad-core Xeons (ranging from 1.8GHz to 2.3GHz), 16GB of RAM, and 138GB SAS drives. They’re running VMware Server 1.0.3 on CentOS 4.4, with all of the latest OS-level updates installed.

We’re virtualizing 120+ nodes running Red Hat Enterprise Linux 4 U2 and U4, Windows 2000, and Windows Server 2003, both 32- and 64-bit. These nodes run my company’s software, Oracle, and MS SQL Server.

The bulk of those VMs live on a Dell PowerEdge 2900 server with 8 x 500GB SATA drives and a Dell PERC 5/i RAID controller in a RAID 5 config. The CPU is a quad-core 1.8GHz Xeon, and the box has 2GB of RAM. The server is running CentOS 5 and is sharing its disks over NFS v3. There’s a 2Gb bonded Ethernet connection using the onboard Broadcom NICs and a Dell PowerConnect 5324 switch.
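
To put rough numbers on that storage box, here is a small Python sketch. It assumes the bonded connection is two gigabit Ethernet links and that all 12 VMware hosts pull from the NFS server at once; the function names are made up for illustration, and the figures are theoretical ceilings, not measurements.

```python
def raid5_usable_gb(drive_count, drive_gb):
    """Usable space in a RAID 5 set: one drive's worth of capacity goes to parity."""
    if drive_count < 3:
        raise ValueError("RAID 5 needs at least 3 drives")
    return (drive_count - 1) * drive_gb


def link_ceiling_mb_per_s(gigabits):
    """Theoretical ceiling of a link in MB/s, before any protocol overhead."""
    return gigabits * 1000 / 8


usable = raid5_usable_gb(8, 500)        # 8 x 500GB drives in RAID 5
aggregate = link_ceiling_mb_per_s(2)    # assumed: 2 x 1Gb bonded links
per_host = aggregate / 12               # assumed: 12 hosts sharing the link evenly

print(usable, "GB usable")              # 3500 GB usable
print(aggregate, "MB/s aggregate")      # 250.0 MB/s aggregate
print(round(per_host, 1), "MB/s per host if all 12 are busy")
```

Depending on the bonding mode, a single host-to-server flow may only ever use one of the two links, so the per-host number is best read as an upper bound.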

We were seeing that Windows 2003 64-bit nodes, when under moderate to heavy load, would experience massive packet loss. Additionally, the VMware Server client would not redraw the server’s screen reliably. Finally, the node would bluescreen with a KERNEL_DATA_INPAGE_ERROR. This would happen when our software was copying SQL Server media to the node in preparation for provisioning a database. It only happened with 64-bit Windows; 32- and 64-bit Linux were fine, and so was 32-bit Windows.

The Windows Event Log would be littered with warnings and errors about “The device, \Device\Scsi\symmpi1, is not ready for access yet.” It didn’t take a rocket scientist to figure out that something was making these machines try to access the page file, fail, and bluescreen.

Now, I had been told by users that this was happening on nodes that were on local disk as well as our remote NFS server. I did extensive testing and was not able to reproduce the problem when the nodes were on local disk. It turns out that I was given erroneous information, and that nodes that people thought were local were in fact on NFS. Once I moved my test nodes over to NFS, I could reproduce the problem.

VMware has a KB article that addresses this issue. In fact, it seems fairly common for people who run their VMs over an iSCSI SAN. Once I applied the registry change the article recommends, my VMs stopped bluescreening, but our file copy operation would still fail.

Looking at the VMware Server host, you would see load averages of ~20-30 and iowait around 25%. Looking at the NFS box, you could see that I/O to /dev/sda2 was eating up about 100% of CPU.
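
For anyone who wants to watch the same iowait figure without top or iostat, here is a minimal Python sketch. It assumes a Linux host that exposes /proc/stat with the usual field order on the aggregate "cpu" line (user, nice, system, idle, iowait, ...); the function names are mine, purely for illustration.

```python
import time


def parse_cpu_line(line):
    """Turn the aggregate 'cpu' line from /proc/stat into a list of tick counts."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in fields[1:]]


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two 'cpu' line samples."""
    a = parse_cpu_line(before)
    b = parse_cpu_line(after)
    # Only the first 8 fields (user..steal) are summed; older kernels
    # report fewer fields, and zip simply stops at the shorter list.
    deltas = [y - x for x, y in zip(a[:8], b[:8])]
    total = sum(deltas)
    if total <= 0 or len(deltas) < 5:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait column


def sample_iowait(interval=5.0, path="/proc/stat"):
    """Read /proc/stat twice, `interval` seconds apart, and return iowait %."""
    with open(path) as f:
        first = f.readline()
    time.sleep(interval)
    with open(path) as f:
        second = f.readline()
    return iowait_percent(first, second)
```

Calling `sample_iowait()` on the VMware host while a provisioning run is in flight should give a number comparable to the %wa column in top.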

I changed our NFS mount options. No dice. I turned on jumbo frames on the bridged NIC on my test VMware server. No dice. Each step would make things a “little” better, but not solve the problem.
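
A back-of-the-envelope sketch of why jumbo frames alone might only buy a “little”: the Python below compares how much of each frame on the wire is payload at an MTU of 1500 versus 9000. It assumes NFS over TCP on IPv4 with no header options (a UDP mount would differ slightly), and the function name is illustrative.

```python
# Bytes on the wire per frame that are not counted in the MTU:
# preamble + start delimiter (8), Ethernet header (14), FCS (4), inter-frame gap (12).
ETHERNET_OVERHEAD = 8 + 14 + 4 + 12

# Bytes inside the MTU that are headers rather than data:
# IPv4 header (20) + TCP header (20), assuming no options.
IP_TCP_HEADERS = 20 + 20


def wire_efficiency(mtu):
    """Fraction of on-the-wire bytes that carry TCP payload at a given MTU."""
    if mtu <= IP_TCP_HEADERS:
        raise ValueError("MTU too small to carry any payload")
    payload = mtu - IP_TCP_HEADERS
    return payload / (mtu + ETHERNET_OVERHEAD)


for mtu in (1500, 9000):
    print(mtu, round(wire_efficiency(mtu) * 100, 1), "% payload")
```

By this estimate the raw bandwidth gain is only a few percent (roughly 95% versus 99% payload). The bigger benefit of jumbo frames is usually fewer packets for the CPU to handle, which this sketch does not model.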

Then I moved the VM images over to our NetApp, which was no small feat since most of its space is used. I finally freed up about 120GB, enough for my 5 test VMs and their snapshots, and went to testing. I fired the VMs back up and ran through another provisioning event.

Not only did my packet loss issues seem to go away, but for once I was able to run a Windows 2003 64-bit node on NFS and provision MS SQL instances without bluescreening.

Our NetApp isn’t the newest model. It’s a FAS270 with 1.2TB of space. It’s connected to another Dell switch in another rack, with a 1Gb uplink to my core switches. The NetApp does not even have jumbo frames enabled. Somehow, though, it’s kicking the crap out of my Dell NFS box, despite being seemingly “inferior.”

My questions at the moment are:

  1. Is my config on this NFS box fundamentally broken somehow?
  2. Is Linux’s NFS server really bad? Would I be better off with BSD or Solaris?
  3. Is something up with the driver for the PERC 5/i? Is write caching enabled?
  4. Is there something up with the LSI driver in Win64 that does not show up in Win32 or in Linux?
  5. If I have to rebuild this NFS box, where do I put 1TB worth of VMware images in the meantime?
