My Large VMWare Server Farm
Ben Ruset January 16th, 2008
It seems like many people come to this blog from Google searches about VMWare, CentOS, and OpenFiler. I figured it might be good to talk about my VMWare Server deployment at work, since it’s something that I am fairly proud of.
I have fifteen Dell PowerEdge 1950 servers. They’re 1U each, with dual quad-core Intel Xeon CPUs ranging from 1.8 to 2.2 GHz. They each have 16GB of RAM. Ten of them have 143GB 15K 3.5″ SAS drives, and five of them have 143GB 10K 2.5″ SAS drives. The servers with the 10K drives have a backplane that takes four drives; the servers with the 15K drives have a backplane that takes only two. Each server has two onboard Broadcom NICs, a PCI-X Broadcom NIC, and a recently added dual-port Intel e1000 NIC. I’ll get to that last card in a bit.
Each VMWare server runs CentOS 4.4 64-bit ServerCD edition. For those of you who don’t know, CentOS is a 100% Red Hat Enterprise Linux binary-compatible distribution. It’s built from Red Hat sources and, due to the nature of the GPL, can be released by the CentOS group for those of us who want Red Hat Linux but don’t want or need to pay for Red Hat support. I would argue, given my experiences with Red Hat support, that the support offerings of CentOS are superior.
I am a firm believer in keeping things as simple as possible. I have seen many other Linux sysadmins go crazy with the software they deploy and the hacks they roll into production, only to be bogged down in a morass of “one offs” or to leave behind a legacy of poorly documented systems that really need their original owner to run right. I don’t like that, which is why I tend to stay on the straight and narrow. I keep my partitioning simple. I (generally) keep the packages I install restricted to the ones available through official CentOS channels. Some may consider this heresy, but if there is an RPM available for something, I’d rather install that than build from source. All of this leads to systems that “just work” and that can hum along and do their jobs with a minimum amount of fuss. Could I squeeze some extra performance out if I did a custom compiled kernel? Sure. Do I want to be troubleshooting VMWare at 3 AM because something in that kernel broke virtual networking? No way.
On all but a few of our VMWare servers, we run VMWare Server 1.0.3. New servers that have just made it into production are getting 1.0.4, with a general upgrade planned in the somewhat near future. Not because we’re seeing problems, but if we have to take boxes down to add new hardware (the Intel e1000 NICs that I am getting to in a second) we might as well upgrade VMWare while we’re at it.
We chose VMWare Server for the price. You absolutely cannot beat it for the price, which is free. We spoke with VMWare about getting VMWare ESX in, and even in its most basic form, it would have been prohibitively expensive. Here at GA we’re concerned about getting the most value for our money. By going with VMWare Server we lose the ability to have multiple snapshots per VM, which would be nice but is not a deal breaker. We also lose the central management, but you can make up for that by buying VMWare VirtualCenter 1.4, which we did. I’m not too happy with it, but it could be because it just doesn’t scale well to the level that we’re using it, or it could be set up better. Probably both.
Each VMWare server has three NICs: two onboard and one PCI-X. eth0 and eth1 are both bridged interfaces. eth0 handles all of the main traffic to each node, and also serves as the management interface to the VMWare server itself. eth1 handles Oracle private interconnect traffic for RAC, and cluster heartbeats for Windows SQL Server clusters. eth2, the PCI-X NIC, handles all of the storage traffic. Each VMWare server has a dedicated uplink on its own VLAN to a Dell PowerEdge 2900 that is acting as a big NFS server.
We ran into a problem with the PowerEdge 1950s’ on-board NICs. If you put them under any sort of load (which we were, with multiple VMs trying to copy media and provision databases on ASM) the bus that the NICs were sitting on would reset. That would drop all of the VMs off the network for a time, and the switches that the NICs were plugged into would show that the link had gone down and then back up. This is a bad thing. We’re also not the first people to see it. After a fight with Dell (who were not really inclined to help us because of CentOS or VMWare Server) I got them to send us an Intel e1000 card. Installing this in the spare PCI-X slot made our network problems go away. So, we’re in the midst of bringing down all of our VMWare servers, disabling the on-board NICs, and installing these Intel cards.
Another problem we’re running into is that Dell PowerEdge 2900. We have ~70 VMs on it, and when they get under heavy load some of the VMs experience SCSI resets, which sometimes results in failed database creates and support tickets in our queue. According to some of the folks on the Linux-PowerEdge mailing list, the hardware RAID controller that is in the box, the PERC 5/i, generally sucks under Linux, offering performance slower than software RAID. There are rumors of an updated driver from Dell that will make it run faster; we’ll have to see how that pans out. In the meantime, we’re going to be ordering fifteen 750GB SATA drives, one for each server. That will increase our total available VM storage to 11TB or so, which is better than the 2TB we get from the 2900. It also means that we lose out on nifty features like “if the VMWare server goes down, we can bring these VMs back up on another machine.”
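The storage numbers above are easy to sanity-check. Here is a quick sketch in Python, assuming one 750GB drive per server, decimal units (1TB = 1000GB), and no allowance for filesystem overhead. The function name is just for illustration.

```python
# Back-of-the-envelope local storage capacity for the farm.
# Assumptions (not spelled out in the post): one drive per server,
# decimal units, and no filesystem or RAID overhead.
SERVERS = 15
DRIVE_GB = 750
NFS_TB = 2  # roughly what the PowerEdge 2900 provides today


def local_capacity_tb(servers, drive_gb):
    """Total raw capacity in TB when each server gets one local drive."""
    return servers * drive_gb / 1000


total_tb = local_capacity_tb(SERVERS, DRIVE_GB)
print(f"Local storage: {total_tb:.2f} TB")
print(f"Ratio to the NFS box: {total_tb / NFS_TB:.1f}x")
```

That comes out to 11.25TB of raw space, which lines up with the "11TB or so" figure.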
You may be curious how many VMs we can stuff on one of those 1950s. Well, with a mix of local and NFS storage, we’ve gotten up to 15 VMs running at once. These aren’t weenie VMs either: they’re either RHEL nodes with 512MB or 1GB (usually 1GB) of RAM and 15GB of disk, or Windows nodes with 512MB to 1GB of RAM, 15GB of disk, and clusters running. They’re running either Oracle or MS SQL, and while they’re not handling millions of transactions, they’re being used by my development and QA staff.
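To see why 15 VMs is about the ceiling on a 16GB host, here is a rough memory budget in Python. It only adds up the RAM assigned to guests; it ignores VMWare's per-VM overhead and whatever the host OS needs, and the mixed example is a made-up split, not our actual layout.

```python
# Rough RAM budget for one host. Assumes guest memory is fully reserved
# and ignores virtualization overhead and the host OS footprint.
HOST_RAM_GB = 16.0


def ram_headroom_gb(vm_ram_gb, host_ram_gb=HOST_RAM_GB):
    """RAM left on the host after summing what is assigned to the guests."""
    return host_ram_gb - sum(vm_ram_gb)


all_large = [1.0] * 15              # fifteen 1GB guests
mixed = [1.0] * 10 + [0.5] * 5      # hypothetical mix of 1GB and 512MB guests

print(f"Fifteen 1GB guests leave {ram_headroom_gb(all_large):.1f} GB for the host")
print(f"The mixed layout leaves {ram_headroom_gb(mixed):.1f} GB for the host")
```

With fifteen 1GB guests there is only about 1GB left over for the host, so memory rather than CPU is the likely limit on density.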
As you might expect, power and cooling requirements for this bunch of servers are high. They’re all in one APC NetShelter VX rack, fed by three 15A 110V AC lines. Some other infrastructure servers are also on those circuits, but we’re using up roughly 30A in that one rack alone. Cooling is hard: we’ve blown past what the 5 ton AC unit in the room can handle, and the two portable A/C units don’t do much to help. We’re in the process of moving gear to a colo.
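For a feel of what that current draw means in cooling terms, here is a rough conversion in Python. It assumes every watt drawn ends up as heat and a power factor of 1, and it uses the standard conversions of about 3.412 BTU/hr per watt and 12,000 BTU/hr per ton of cooling. It covers this one rack only, not the rest of the room.

```python
# Rough heat load for one rack, from its current draw.
# Assumptions (mine, not measured): all power becomes heat, power factor of 1.
BTU_PER_HR_PER_WATT = 3.412   # 1 W is about 3.412 BTU/hr
BTU_PER_HR_PER_TON = 12_000   # 1 ton of cooling is 12,000 BTU/hr


def rack_heat_tons(amps, volts):
    """Convert a rack's electrical draw into tons of cooling required."""
    watts = amps * volts
    return watts * BTU_PER_HR_PER_WATT / BTU_PER_HR_PER_TON


tons = rack_heat_tons(amps=30, volts=110)
print(f"Rack heat load: about {tons:.2f} tons of cooling")
```

By this estimate the rack is roughly one ton of the load by itself; everything else in the room comes on top of that.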
All said, this environment has helped GA really expand. If we had to make an investment in physical servers we would have spent in excess of $500k to purchase all of that gear. With less than $70k invested, we’re able to accomplish nearly the same thing — and more, once we work the bugs out. I’ve been a huge fan of virtualization since VMWare first came on the market, and in my case it’s really been worth it to deploy.
Comments (5)
Nice to see someone else using a large VMware Server setup. Interesting you’re having such a hard time with the PERC controllers, though. We’ve had nothing but great luck with them, running approximately the same amount on our 1950s. Did you happen to load up the OMSA stuff? Might help troubleshoot anything that could be amiss on your host.
I’ve generally found the software RAID setup to be pretty troublesome too, so you might want to check into it before taking the plunge.
Also one last thing, VMware Server runs fantastic over NFS, something you may want to consider if you’re continually running short on local disk. Very handy in doing the VM/server swap trick you mentioned. Works great in our environment.
[...] happened to run across this fella today, who also runs a very large VMware Server farm in a production environment. He makes a [...]
Ben-
I’ve been using a similar setup here for our office, CentOS 5 running VMware Server 1.0.3. Until recently, I’ve been using only local storage for the VMs, but would like to switch to a central NFS solution. I keep having trouble with permissions when I get the NFS drive mounted in CentOS, if it mounts at all. Mind if I ask how you mount those NFS drives and run the VMs off of them? It’s driving me crazy!
—Mik
I am trying to install a very small VMware setup at my office with a PowerEdge 1950 and am having some performance issues. Are you seeing this at your location? Does the fact that you have a farm (which to me implies “cluster”) overcome this effect?
I would like to swap stories with you via e-mail if you’re willing.
My Dell NFS box, which proved to be too slow, was exporting NFS thusly:
/vol/vol1/vmware (no_root_squash,rw)
My VMWare servers were mounting it with the following options:
rw,soft,timeo=120,addr=192.168.50.10
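For anyone wanting to turn those options into an /etc/fstab entry, here is a rough sketch in Python. The mount point /mnt/vmware is made up for the example, and it assumes the addr= value is the NFS server's address (mount normally fills that in itself, so it is left out of the option list). Note that timeo is in tenths of a second, so timeo=120 means 12 seconds.

```python
# Build an /etc/fstab line for an NFS mount from its pieces.
# The mount point below is hypothetical; the server address is assumed
# to be the addr= value from the mount options quoted above.
def fstab_line(server, export, mount_point, options):
    """Return a single fstab entry: device, mount point, type, options, dump, pass."""
    return f"{server}:{export} {mount_point} nfs {','.join(options)} 0 0"


line = fstab_line(
    server="192.168.50.10",          # assumed from addr= in the options
    export="/vol/vol1/vmware",       # path from the export line
    mount_point="/mnt/vmware",       # hypothetical local mount point
    options=["rw", "soft", "timeo=120"],
)
print(line)
```

The printed line can be pasted into /etc/fstab once the mount point exists on the VMWare host.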