My Large VMWare Server Farm
It seems like many people come to this blog from Google searches about VMWare, CentOS, and OpenFiler. I figured it might be good to talk about my VMWare Server deployment at work, since it's something that I am fairly proud of.
I have fifteen Dell PowerEdge 1950 servers. They're 1U each, with dual quad-core Intel Xeon CPUs ranging from 1.8 to 2.2 GHz. They each have 16GB of RAM. Ten of them have 143GB 15K 3.5" SAS drives, and five of them have 143GB 10K 2.5" SAS drives. The servers that have the 10K drives have a backplane that will allow you to plug in 4 drives. The servers with the 15K drives have backplanes that will allow you to only have 2 drives. Each server has two onboard Broadcom NICs, a PCI-X Broadcom NIC, and a recently added dual port Intel e1000 NIC. I'll get into that in a second.
Each VMWare server runs CentOS 4.4 64-bit ServerCD edition. For those of you who don't know, CentOS is a 100% Red Hat Enterprise Linux binary compatible distribution. It's built from Red Hat sources and, due to the nature of the GPL, is able to be released by the CentOS group for those of us who want Red Hat Linux but don't want or need to pay for Red Hat support. I would argue, given my experiences with Red Hat support, that the support offerings of CentOS are superior.
I am a firm believer in keeping things as simple as possible. I have seen many other Linux sysadmins want to go crazy with the software they deploy and the hacks they roll into production, only to be bogged down in a morass of "one offs" or to leave behind a legacy of poorly documented systems that really need their original owner to run right. I don't like that, which is why I tend to stay on the straight and narrow. I keep my partitioning simple. I (generally) keep the packages I install restricted to the ones available through official CentOS channels. Some may consider this heresy, but if there is an RPM available for something, I'd rather install that than build from source. All of this leads to systems that "just work" and that can hum along and do their jobs with a minimum amount of fuss. Could I squeeze some extra performance out if I did a custom compiled kernel? Sure. Do I want to be troubleshooting VMWare at 3 AM because something in that kernel broke virtual networking? No way.
On all but a few of our VMWare servers, we run VMWare Server 1.0.3. New servers that have just made it into production are getting 1.0.4, with a general upgrade planned in the somewhat near future. Not because we're seeing problems, but if we have to take boxes down to add new hardware (the Intel e1000 NICs that I am getting to in a second) we might as well upgrade VMWare while we're at it.
We chose VMWare Server for the price. You absolutely cannot beat it for the price, which is free. We spoke with VMWare about getting VMWare ESX in, and even in its most basic of forms, it would have been prohibitively expensive. Here at GA we're concerned about getting the most value for our money. By going with VMWare Server we lose the ability to have multiple snapshots per VM, which would be nice, but is not a deal breaker. We also lose the central management, but you can make up for that by buying VMWare VirtualCenter 1.4, which we did. I'm not too happy with it, but it could be because it just doesn't scale well to the level that we're using it, or it could be set up better. Probably both.
Each VMWare server has three NICs: two onboard and one PCI-X. eth0 and eth1 are both bridged interfaces - eth0 handles all of the main traffic to each node, and also serves as the management interface to the VMWare server itself. eth1 handles Oracle private interconnect traffic for RAC, and cluster heartbeats for Windows SQL Server clusters. eth2, the PCI-X NIC, handles all of the storage traffic. Each VMWare server has a dedicated uplink on its own VLAN to a Dell PowerEdge 2900 that is acting as a big NFS server.
We ran into a problem with the PowerEdge 1950's on-board NICs. If you put them under any sort of load (which we were, with multiple VMs trying to copy media and provision databases on ASM) the bus that the NICs were sitting on would reset. That would drop all of the VMs off the network for a time, and the switches that the NICs were plugged into would show that the link had gone down and then back up. This is a bad thing. We're also not the first people to see it. After a fight with Dell (who were not really inclined to help us because of CentOS or VMWare Server) I got them to send us an Intel e1000 card. Installing this in the spare PCI-X slot made our network problems go away. So, we're in the midst of bringing down all of our VMWare servers, disabling the on-board NICs, and installing these Intel cards.
Another problem we're running into is that Dell PowerEdge 2900. We have ~70 VMs on it, and when they get under heavy load some of the VMs experience SCSI resets, which sometimes results in database creates failing, and support tickets in our queue. According to some of the folks on the Linux-Poweredge mailing list, the hardware RAID controller that is in the box - the PERC5/i - generally sucks under Linux, offering performance slower than software RAID. There are rumors of an updated driver from Dell that will make it run faster -- we'll have to see how that pans out. In the meantime, we're going to be ordering fifteen 750GB SATA drives, one for each server. That will increase our total available VM storage to 11TB or so, which is better than the 2TB we get from the 2900. That also means that we lose out on nifty features like "if the VMWare server goes down, we can bring these VMs back up on another machine."
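For anyone who wants to check the storage math, here is a rough sketch in Python. It assumes one 750GB drive per server (that is how fifteen drives get you to roughly 11TB) and decimal units, the way drive vendors count. The function name is made up for illustration, and this is raw capacity before any filesystem overhead.

```python
# Back-of-the-envelope raw capacity, in decimal units (1 TB = 1000 GB).
# raw_capacity_tb is a hypothetical helper, not part of any tool we run.
def raw_capacity_tb(drive_count: int, drive_size_gb: int) -> float:
    return drive_count * drive_size_gb / 1000


SERVERS = 15            # from the post
DRIVE_SIZE_GB = 750     # from the post
DRIVES_PER_SERVER = 1   # assumption: one new SATA drive in each server

local_tb = raw_capacity_tb(SERVERS * DRIVES_PER_SERVER, DRIVE_SIZE_GB)
nfs_tb = 2.0            # what we get from the 2900 today, per the post

print(f"local disks: {local_tb} TB raw, versus {nfs_tb} TB on the NFS box")
```

The trade-off is the one described above: the local disks are spread across fifteen boxes, so a VM lives and dies with the server it sits on.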
You may be curious how many VMs we can stuff on one of those 1950s. Well, with a mix of local and NFS storage, we've gotten up to 15 VMs running at once. These aren't weenie VMs either - they're either RHEL nodes with 512MB or 1GB (usually 1GB) of RAM and 15GB of disk, or Windows nodes with 512MB-1GB of RAM, 15GB of disk, and clusters running. They're either running Oracle or MS SQL, and while they're not handling millions of transactions, they're being used by my development and QA staff.
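If you want a quick sanity check on that number, a simple RAM budget gets you into the same ballpark. This is only a sketch: the 1GB set aside for the host OS is my assumption for illustration, the function name is hypothetical, and it ignores memory overcommit, disk, and network limits entirely.

```python
# Rough RAM budget: how many guests fit if each gets its full allocation.
# max_vms is a hypothetical helper; host_reserve_mb is an assumed figure.
def max_vms(host_ram_mb: int, vm_ram_mb: int, host_reserve_mb: int) -> int:
    usable_mb = host_ram_mb - host_reserve_mb
    if usable_mb <= 0:
        return 0
    return usable_mb // vm_ram_mb


HOST_RAM_MB = 16 * 1024   # 16GB per PowerEdge 1950, from the post
HOST_RESERVE_MB = 1024    # assumption: leave 1GB for CentOS and VMWare itself

print("1GB guests:  ", max_vms(HOST_RAM_MB, 1024, HOST_RESERVE_MB))
print("512MB guests:", max_vms(HOST_RAM_MB, 512, HOST_RESERVE_MB))
```

With mostly 1GB guests, the budget lands right around the 15 VMs we've actually run, though that may be partly coincidence since storage and network load matter at least as much.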
As you might expect, power and cooling requirements for this bunch of servers are high. They're all in one APC NetShelter VX rack, fed by three 15A 110V AC lines. Some other infrastructure servers are also on those circuits, but we're using up roughly 30A in that one rack alone. Cooling is hard -- we've blown past what the 5 ton AC unit in the room can handle, and the two portable A/C units don't do much to help. We're in the process of moving gear to a colo.
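To put the rack's draw in cooling terms, here is a rough conversion in Python. It treats amps times volts as watts (assuming a power factor of 1, so it's an upper estimate) and uses the standard conversions of about 3.412 BTU/hr per watt and 12,000 BTU/hr per ton of cooling. The function names are just for illustration.

```python
# Convert the rack's electrical draw into a heat load the AC has to remove.
BTU_PER_HR_PER_WATT = 3.412   # standard conversion factor
BTU_PER_HR_PER_TON = 12_000   # one ton of refrigeration


def rack_watts(amps: float, volts: float) -> float:
    # Assumes power factor of 1, so this is an upper estimate.
    return amps * volts


def cooling_tons(watts: float) -> float:
    return watts * BTU_PER_HR_PER_WATT / BTU_PER_HR_PER_TON


RACK_AMPS = 30       # roughly what the rack draws, from the post
LINE_VOLTS = 110     # from the post
CIRCUIT_AMPS = 3 * 15  # three 15A lines feeding the rack

watts = rack_watts(RACK_AMPS, LINE_VOLTS)
print(f"rack draw: {watts:.0f} W of {CIRCUIT_AMPS * LINE_VOLTS} W available")
print(f"heat load: {watts * BTU_PER_HR_PER_WATT:.0f} BTU/hr, "
      f"about {cooling_tons(watts):.2f} tons of cooling")
```

That works out to about 3.3 kW, or a bit under one ton of cooling for this rack by itself. The 5 ton unit is also dealing with everything else in the room, which this sketch doesn't try to account for.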
All said, this environment has helped GA really expand. If we had to make an investment in physical servers we would have spent in excess of $500k to purchase all of that gear. With less than $70k invested, we're able to accomplish nearly the same thing -- and more, once we work the bugs out. I've been a huge fan of virtualization since VMWare first came on the market, and in my case it's really been worth it to deploy.
Great Plains
I just finished reading "Great Plains" by Ian Frazier. After pretty much reading, and reviewing, Jersey history for the last few years, I wanted to broaden my horizon. The mini review on the front of the book compares the author to John McPhee, who wrote a really good book on the Pine Barrens, so I figured it was an omen that I would like this book. It did not disappoint.
Now, of course, I want to get in my Jeep and drive across the plains. I want to visit the site of Sitting Bull's cabin. I want to go back to Keota and Buckingham, in the Pawnee National Grassland in Colorado, and take in the surroundings again. There is something incredibly powerful when you look across the plains and see nothing - just miles and miles of grass blowing out to the horizon. When you're the only person around for miles. As much as I love the Pine Barrens, you can never get that far away from everything. Even in the middle of a cedar swamp, surrounded by hummocks and briers so fierce that you need a machete to cut through them, there's always something to remind you of humanity's presence. A mylar balloon or a long abandoned tree stand loudly exclaims to the "explorer" that there's no uncharted territory to be found.
Frazier explains that the land and resources in the plains were so abundant - and cheap - that it didn't make sense to tear down the old and replace with the new. Unlike Manhattan, where buildings go up, live their lives, become obsolete, and get demolished, the plains hold structures, towns, that outlived their usefulness years ago and now just bake, unused, under the sun. In Jersey, the ghost towns we have only still exist because they're on protected land. If the Pines weren't protected, Harrisville or Martha would probably house a Wal-Mart and a Super Stop & Shop. In Jersey we force our conservation - in the plains it happens because it's just not worth it to build.
Some people may only see the plains as a lonely stretch of land between New York and San Francisco. From Google Earth, the Oklahoma Panhandle is wormwood, defined by thousands of green circles made by center pivot irrigation. It's a landscape so unlike what I'm used to in Jersey. It's an amazing place. This book makes me want to see more of it.
I want a Tablet PC
HP just announced the new TX2000 series tablet at CES the other day. (review) This thing looks sweet.
It has an AMD Turion X2 CPU, Wacom active digitizer plus touchscreen, DVD burner with LightScribe, LED lit screen, comes with a remote control (probably not as cool as the one that comes with Macs now), and is cheap! Basic configs (Vista Home, 1GB RAM, etc.) are booking at $1300, whereas the new Toshiba Portege M700 starts at $1500, and configs for the Dell Latitude XT are starting somewhere around $2500.
The TX2000 is essentially a refresh of the TX1000 - the main difference being the digitizer. The digitizer in the TX1000 (which is being fire saled by HP right now) was a passive digitizer, similar to what you would find on a PDA. There's no pressure sensitive writing, and if you touch the screen while you're writing (which is easy to do in tablet mode), you screw up your input. One of the big problems I have with tablets is that they're all so ugly looking. Not the HP. I saw the TX1000 in person at Best Buy last year, and was wholly impressed. The only thing that made me not buy it was the passive digitizer. Now that the TX2000 is out and has all of the looks (if not more) of the TX1000, plus the Wacom digitizer, pretty much all of my wishes have been granted.
Before the TX2000 came out, I was really excited about the Latitude XT. The specs on it, though, are underwhelming. The Core 2 Duo ULV processor is slow, and while it's in a really kick-ass form factor and is one of the few tablets that you can get with on-site support, it's just really, really expensive. The Toshiba M700 looked great until I saw that it was thicker than my E1505! No way would I want to carry that beast around in tablet mode! The TX2000 seems to be a good compromise between the two.
I really want to get it, but I'm holding off because I just hate to spend the money. I wouldn't be able to sell my Inspiron E1505 to recoup some of the cost because Dell is/was selling them for dirt cheap recently, and after a year or so of use it's looking a little worse for wear. So my wife will probably inherit it, and her Mac Mini may turn into some sort of other node on my home network, or get sold.
Dell Account Rep Nightmare – Update 1
I spoke with Chris, who is the Regional Sales Manager for the New York area. Essentially, I'm pretty much stuck where I am now. The idea is to give the new account team 90 days, and then from there we'll see what happens. Not to sound too cynical, but I am sure that what will happen is if this new team isn't great, in 90 days I'll get moved to another team instead of back where I wanted to be with my former rep.
I tried to impress on Dell that I wasn't going to walk away from that phone call happy unless my account moved back to where it was.
A long-time customer of Dell, who is blatantly telling Dell that they are unhappy and making a reasonable request that will solve the problem, was basically told to suck it up and deal with it.
My company would never think of treating a customer like that. But even then, if one of our customers was unhappy, we'd pretty much do anything to resolve the problem.
Dell has come under a lot of fire recently. First, they took a lot of flak (rightly so) for their consumer level customer support being sub-par. They worked to improve communication between end users and Dell by monitoring blogs, starting their own blog, and proactively solving problems. (They replaced my Inspiron 700m with an XPS M1210 after the Inspiron broke three times and came back from their repair depot more broken than it had arrived there.) Their direct sale business model has apparently peaked and isn't carrying the company any more, so they're branching out to retail shops. Mike Dell is again CEO. Financial problems.
I am a huge Dell fan. I've always had good luck with the Optiplex, Latitude, and PowerEdge line of gear. The Dimensions we have are stable enough. The Inspiron laptops we have, including the E1505 that I am typing this on now, hold up somewhat well but end up looking a bit worse for wear. Our entire switching infrastructure runs on Dell switches. We have a huge VMWare farm running on really high end PowerEdge 1950s. We're now reselling Dell gear.
Now, I'll be the first to admit that we're not buying at the volume of Dell's largest accounts. But we're certainly not stingy with the revenue we're giving them now. To top it off, I'm pretty much responsible for our switch to Dell, since my co-workers all love IBM and HP gear. It amazes me that I'm getting the runaround from Dell over this.