Ben Ruset Sysadmin, etc.

15 Mar 2008

Ides of March (2008) Update

My last post on this blag was on February 21st, about my new VPS hosting. While I am glad to report that I am still happy with them, I am dismayed at my lack of inspiration for updating here. Granted, only a few people read this - and perhaps none anymore - so I could be excused, but I have always found writing for the sake of writing to be therapeutic, and I am in need of therapy.

So, rather than make one big update, here's a bunch of little ones. If you squint your eyes, it sort of looks like one big post.

1) New Jeep - on Presidents Day I went out and leased (ick) a 2008 Jeep Wrangler Rubicon Unlimited. This is the first vehicle I have ever owned that has four nouns to describe it. In keeping with my theme of buying obnoxiously yellow Wranglers, this one is "Detonator Yellow." Why did I buy it? Because my old Grand Cherokee needed a lot of work, and given that my ability to collect a paycheck hinges on my 5-hour round-trip commute, I need a reliable car.

The new Jeep.

 

The Grand Cherokee I traded in. I also called it "vehicle of impending doom." At least it looked good.

 

My much older 2004 Wrangler Rubicon, in "Solar Yellow."

 

It's a very nice Jeep. I don't think it's as good off-road as the '04 Rubicon I had, but it's far more comfortable on-road. I recently bought the MyGig navigation radio off eBay, which adds navigation as well as a 20GB hard drive for storing MP3s. This has cut down on my use of my iPhone as my primary music device (and my poor 30GB iPod video sits uncharged and unloved now). So, now I'm car poor again. :/

2) VMWare Servers: I'm pretty amazed with the number of hits and comments to my VMWare Server blag. Thankfully it seems nobody is coming here for information on HPUX, which is relieving. Or, it could be that I'm the only person left in the world that has to use it. We're adding another 10 VMWare Servers to the mix. I've made up a nice kickstart to install CentOS 4.6, VMWare Server, VMWare MUI, and install the Dell OpenManage yum repository on the box. I'll probably post that soon.

3) Datacenter Migration: About two years ago we moved a bunch of gear from a colo to our new corporate headquarters. We built a (roughly) 25x25' room in which to store all of our server gear. We left a 1/2 rack at the colo to hold our DNS server and some other odds and ends, because we had a good deal on it. Fast forward a year -- the new server room is full, is drawing close to 200A of power, the A/C is tragically overutilized, one of our portable A/C units committed suicide, and the 1/2 rack at the colo has bloomed to a full rack of gear that is largely powered off. Since the old colo is out of space (and apparently getting out of the colo business anyway) we're moving to a new colo in Manhattan. This is a process that has been in the works for almost four months now, held up by ceaseless negotiations, a director-level position being created above me, the guy who filled it getting fired and then replaced, problems with leasing companies, arguments over how to transport gear to the new place, etc. Is it any wonder that I am jealous of the Perl programmers we have now?

4) Philosophy: My new boss and I have some differing views of how the world and business works. It's a friendly disagreement. He lent me some Ayn Rand to read this weekend. I've heard plenty about her, but never got around to reading any yet. I'm sure I'll disagree. There's talk about starting an office book club. I think my first contribution to that will be some Hermann Hesse. That'll show 'em. (Note: after I finish Ayn Rand I'm going to start rereading some Hesse, perhaps in chronological order.)

5) RIP Julius Caesar: He got stabbed today, 2051 years ago. (That's 44 BC to 2008, and there was no year zero.)

2 Feb 2008

Oh, HPUX!

Fellow co-workers know about my general disdain for HPUX. It's the OS that seems friendly until you start working on it, and then every step you need to do turns into a 20-step process. We recently got in an HP PA-RISC box, and I was tasked with getting it set up to meet Oracle pre-reqs. Note that I am primarily a Linux sysadmin and not an antiques broker, which is why I am not so good with HPUX.

To make Oracle on HPUX work, there's a rather lengthy document that tells you what is required. There are some disk size requirements, the Java SDK needs to be installed, there's a bunch of patches that need to be installed, and a bunch of kernel parameters that need to be set.

First, Java. Do I go to Sun? Do I go to HP? Well, it looks like I go to HP, since Oracle wanted version 1.4.2.00, and HP has 1.4.2.17. Download and install. I'm annoyed because when I run swinstall to get it installed, it wants to kick me into this curses-based "SAM" admin tool to install the software. Why can't I just keep it in the regular console?

The server came with a 36GB SCSI disk, with about 15GB left free for use. Fortunately this version of HPUX 11.11 auto-partitions with LVM (presumably - it came to us pre-built) so it should be easy to resize things. The first thing I had to do was resize /tmp. It came to us with 100MB of space, and I needed it to be 400MB. I decided to give it a little extra and make it 500MB. Couldn't do it online, so I had to drop down to single user mode (init 1) and resize the slice (hah, slice, I'm so old school UNIX I can say that now!). The nice thing was that I was expecting to be kicked off my telnet (yeah, I know) session when I did that, but it kept it up. Hey, maybe HPUX isn't so bad after all!

Alrighty, /tmp is resized. Let's see what else we need. Oracle says we need these patches:

GOLDQPK11i, December 2004 or later
If GOLDQPK is not installed then these two patchkits need to be installed:
GOLDAPPS11i, December 2004 or later
GOLDBASE11i, December 2004 or later

11.11 required patches:
o PHNE_31097 s/b PHNE_32477 s/b PHNE_33498 s/b PHNE_35418 s/b PHNE_36168 s/b PHNE_37110
o PHSS_31221 s/b PHSS_33263 s/b PHSS_33944 s/b PHSS_33945
For JDK, refer to HP Java for latest patches:
o PHSS_30970 s/b PHSS_33033 s/b PHSS_35379 s/b PHSS_35385
For PL/SQL/ProC,OCI,OCCI,XDK:
o PHSS_32508 s/b PHSS_34411 s/b PHSS_35099 s/b PHSS_36087
o PHSS_32509 s/b PHSS_34412 s/b PHSS_35098 s/b PHSS_36086
o PHSS_32510 s/b PHSS_34413 s/b PHSS_35100 s/b PHSS_36088

Ok, so GOLDQPK11i is just a rollup of a bunch of patches and whatnot. Let's download that. Whee, 470MB. It took longer to actually install than to download. The server just churned away on it while it was creating a depot for installation. Finally rolled out of the office at 8:30 PM that night. I had started working on this thing much, much earlier, but I was distracted by various other things.

The next morning, I'm working from home, and up at 7. Let's look at these other patches. PHNE_31097 and then s/b and other patches? Succeeded by? Who knows? Logged onto HP's site and went to their patch download area and sure enough, those patches have been succeeded by newer ones. Why didn't Oracle just require the latest? Why should we have to guess what s/b is supposed to mean? Ask Larry, I have no idea. Download the patches and install. Ugh, not all of them go. What am I missing?
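If "s/b" really does mean "superseded by" (an assumption, though it matches what HP's patch site showed), then only the last patch in each chain matters. A minimal Python sketch that pulls the newest ID out of each chain, using the chains as Oracle listed them; the function name is made up for illustration:

```python
# Patch chains copied from Oracle's list. "s/b" is assumed to mean
# "superseded by", so the last entry in each chain is the one to download.
CHAINS = [
    "PHNE_31097 s/b PHNE_32477 s/b PHNE_33498 s/b PHNE_35418 s/b PHNE_36168 s/b PHNE_37110",
    "PHSS_31221 s/b PHSS_33263 s/b PHSS_33944 s/b PHSS_33945",
    "PHSS_30970 s/b PHSS_33033 s/b PHSS_35379 s/b PHSS_35385",
    "PHSS_32508 s/b PHSS_34411 s/b PHSS_35099 s/b PHSS_36087",
    "PHSS_32509 s/b PHSS_34412 s/b PHSS_35098 s/b PHSS_36086",
    "PHSS_32510 s/b PHSS_34413 s/b PHSS_35100 s/b PHSS_36088",
]


def latest_patch(chain: str) -> str:
    """Return the final (newest) patch ID in an 's/b' chain."""
    return [part.strip() for part in chain.split("s/b")][-1]


# One patch ID per chain, in the same order as Oracle's list.
LATEST = [latest_patch(chain) for chain in CHAINS]

if __name__ == "__main__":
    for patch in LATEST:
        print(patch)
```

This only tells you which IDs to go hunting for; whether HP has superseded them yet again since Oracle wrote the list is something only the patch site can answer.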

PHSS_36087 (s700_800 11.11 HP aC++ Compiler (A.03.77))
PHSS_36086 (s700_800 11.11 ANSI C compiler B.11.11.16 cumulative patch)
PHSS_36088 (s700_800 11.11 +O4/PBO Compiler B.11.11.16 cumulative patch)

Ugh, so I guess I need to put a compiler on this thing. Off I go to the HP DSPP site to download the aC++ and ANSI C compiler. 150MB later, I've got this thing installed.* Attempted to load the patches that I was missing, and they failed. Why? The new compiler is already past the revisions of those patches. Okay, I guess those pre-reqs are met.

* Let me backtrack a bit here. I actually couldn't get the thing installed, because /opt was not big enough. I attempted to resize it, but even in single user mode I couldn't dismount it. So I brought it back up in runlevel 3 and uninstalled all of the Mozilla crap, Apache, Tomcat, and Chinese language support. Now I had enough space. Go me.

Now for the kernel parameters. There's a giant list of them, and Oracle, rather than giving you some of the numbers to put in, expects you to do some math:

KSI_ALLOC_MAX (NPROC*8)
EXECUTABLE_STACK=0
MAX_THREAD_PROC 1024
MAXDSIZ 1073741824 bytes
MAXDSIZ_64BIT 2147483648 bytes
MAXSSIZ 134217728 bytes
MAXSSIZ_64BIT 1073741824
MAXSWAPCHUNKS 16384*
MAXUPRC ((NPROC*9)/10)
MSGMAP (MSGTQL+2)
MSGMNI NPROC
MSGSEG 32767
MSGTQL 4096
NCSIZE (NINODE+1024)*
NFILE (15*NPROC+2048)
NFLOCKS 4096
NINODE (8*NPROC+2048)
NKTHREAD (((NPROC*7)/4)+16)
NPROC 4096
SEMMAP (SEMMNI+2)*
SEMMNI 4096
SEMMNS (SEMMNI*2)
SEMMNU (NPROC - 4)
SEMVMX 32767
SHMMAX AvailMem
SHMMNI 512
SHMSEG 120
VPS_CEILING 64
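The math in that list is simple once the base values are pinned down. Here is a minimal Python sketch of the arithmetic, using the NPROC, SEMMNI, and MSGTQL values from Oracle's list. It assumes the divisions are integer divisions, and it leaves out SHMMAX because "AvailMem" depends on the machine:

```python
# Base values taken from Oracle's list.
NPROC = 4096
SEMMNI = 4096
MSGTQL = 4096

# NINODE feeds NCSIZE, so it has to be worked out first.
NINODE = 8 * NPROC + 2048

# Assumption: the formulas use integer arithmetic, hence // for division.
DERIVED = {
    "ksi_alloc_max": NPROC * 8,
    "maxuprc": (NPROC * 9) // 10,
    "msgmap": MSGTQL + 2,
    "msgmni": NPROC,
    "ncsize": NINODE + 1024,
    "nfile": 15 * NPROC + 2048,
    "ninode": NINODE,
    "nkthread": (NPROC * 7) // 4 + 16,
    "semmap": SEMMNI + 2,
    "semmns": SEMMNI * 2,
    "semmnu": NPROC - 4,
}

if __name__ == "__main__":
    for name, value in sorted(DERIVED.items()):
        print(f"{name} {value}")
```

With NPROC at 4096 this gives, for example, 3686 for MAXUPRC, 63488 for NFILE, and 7184 for NKTHREAD.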

Alright, I got myself a little shell script to set all of these values. Reboot and I should be all set, right?

A few hours later I got the machine kicked back to me saying that the kernel parameters are not set. I look and it shows:

# kmtune -q maxdsiz
Parameter             Current Dyn Planned                    Module     Version
===============================================================================
maxdsiz             268435456  -  1073741824

I suggest another reboot. No dice. Well, I look at the man page for kmtune and it says that the kernel needs to be rebuilt (!!!) and the box rebooted before the new parameters take effect. Ugh. So I found someone's post on the HP forums on how to rebuild the kernel, followed it, and rebooted. Still no dice. At 7:00 PM on a Friday I give up. Hopefully on Monday I will be able to figure it out.
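The tell in that kmtune output is that the Current and Planned columns disagree: the running kernel still has 256MB for maxdsiz, while the 1GB value is only queued up. A small Python sketch that spots parameters in that state, assuming the column layout is exactly what the pasted output shows (name, current, dyn flag, planned); the function name and sample text are illustrative only:

```python
# Sample text mirrors the kmtune query output from the post.
SAMPLE = """\
Parameter             Current Dyn Planned                    Module     Version
===============================================================================
maxdsiz             268435456  -  1073741824
"""


def pending_changes(output: str) -> dict:
    """Map parameter name -> (current, planned) where the two values differ."""
    pending = {}
    for line in output.splitlines():
        fields = line.split()
        # Data rows are assumed to look like: name, current, dyn flag, planned.
        # The header and the ruler line fail this check and are skipped.
        if len(fields) < 4 or not fields[1].isdigit() or not fields[3].isdigit():
            continue
        current, planned = int(fields[1]), int(fields[3])
        if current != planned:
            pending[fields[0]] = (current, planned)
    return pending


if __name__ == "__main__":
    for name, (current, planned) in pending_changes(SAMPLE).items():
        print(f"{name}: running {current // 2**20} MB, planned {planned // 2**20} MB")
```

Rows whose values are formulas rather than plain numbers would be skipped by this check, so treat it as a quick sanity test rather than a full parser.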

HPUX, I want to get along with you. Why do you make things so difficult for me?

As an aside, I saw a job posting the other day advertising for an HPUX/Veritas sysadmin paying $150k. Now I know why.

Filed under: HPUX, Tech, Work

16 Jan 2008

My Large VMWare Server Farm

It seems like many people come to this blog from Google searches about VMWare, CentOS, and OpenFiler. I figured it might be good to talk about my VMWare Server deployment at work, since it's something that I am fairly proud of.

I have fifteen Dell PowerEdge 1950 servers. They're 1U each, with dual quad-core Intel Xeon CPUs ranging from 1.8 to 2.2GHz. They each have 16GB of RAM. Ten of them have 143GB 15K 3.5" SAS drives, and five of them have 143GB 10K 2.5" SAS drives. The servers that have the 10K drives have a backplane that will allow you to plug in 4 drives. The servers with the 15K drives have backplanes that will allow you to only have 2 drives. Each server has two onboard Broadcom NICs, a PCI-X Broadcom NIC, and a recently added dual port Intel e1000 NIC. I'll get into that in a second.

Each VMWare server runs CentOS 4.4 64-bit ServerCD edition. For those of you who don't know, CentOS is a 100% Red Hat Enterprise Linux binary compatible distribution. It's built from Red Hat sources and, due to the nature of the GPL, is able to be released by the CentOS group for those of us who want Red Hat Linux but don't want or need to pay for Red Hat support. I would argue, given my experiences with Red Hat support, that the support offerings of CentOS are superior.

I am a firm believer in keeping things as simple as possible. I have seen many other Linux sysadmins want to go crazy with the software they deploy and the hacks they roll into production, only to be bogged down in a morass of "one offs" or to leave behind a legacy of poorly documented systems that really need their original owner to run right. I don't like that, which is why I tend to stay on the straight and narrow. I keep my partitioning simple. I (generally) keep the packages I install restricted to the ones available through official CentOS channels. Some may consider this heresy, but if there is an RPM available for something, I'd rather install that than build from source. All of this leads to systems that "just work" and that can hum along and do their jobs with a minimum amount of fuss. Could I squeeze some extra performance out if I did a custom compiled kernel? Sure. Do I want to be troubleshooting VMWare at 3 AM because something in that kernel broke virtual networking? No way.

On all but a few of our VMWare servers, we run VMWare Server 1.0.3. New servers that have just made it into production are getting 1.0.4, with a general upgrade planned in the somewhat near future. Not because we're seeing problems, but if we have to take boxes down to add new hardware (the Intel e1000 NICs that I am getting to in a second) we might as well upgrade VMWare while we're at it.

We chose VMWare Server for the price. You absolutely cannot beat it for the price, which is free. We spoke with VMWare about getting VMWare ESX in, and even in its most basic of forms, it would have been prohibitively expensive. Here at GA we're concerned about getting the most value for our money. By going with VMWare Server we lose the ability to have multiple snapshots per VM, which would be nice but is not a deal breaker. We also lose the central management, but you can make up for that by buying VMWare VirtualCenter 1.4, which we did. I'm not too happy with it, but it could be because it just doesn't scale well to the level that we're using it, or it could be set up better. Probably both.

Each VMWare server has three NICs: two onboard and one PCI-X. eth0 and eth1 are both bridged interfaces - eth0 handles all of the main traffic to each node, and also serves as the management interface to the VMWare server itself. eth1 handles Oracle private traffic for RAC, and cluster heartbeats for Windows SQL Server clusters. eth2, the PCI-X NIC, handles all of the storage traffic. Each VMWare server has a dedicated uplink on its own VLAN to a Dell PowerEdge 2900 that is acting as a big NFS server.

We ran into a problem with the PowerEdge 1950's on-board NICs. If you put them under any sort of load (which we were, with multiple VMs trying to copy media and provision databases on ASM) the bus that the NICs were sitting on would reset. That would drop all of the VMs off the network for a time, and the switches that the NICs were plugged into would show that the link had gone down and then back up. This is a bad thing. We're also not the first people to see it. After a fight with Dell (who were not really inclined to help us because of CentOS or VMWare Server) I got them to send us an Intel e1000 card. Installing this in the spare PCI-X slot made our network problems go away. So, we're in the midst of bringing down all of our VMWare servers, disabling the on-board NICs, and installing these Intel cards.

Another problem we're running into is that Dell PowerEdge 2900. We have ~70 VMs on it, and when they get under heavy load some of the VMs experience SCSI resets, which sometimes results in database creates failing, and support tickets in our queue. According to some of the folks on the Linux-Poweredge mailing list, the hardware RAID controller that is in the box - the PERC5/i - generally sucks under Linux, offering performance slower than software RAID. There are rumors of an updated driver from Dell that will make it run faster -- we'll have to see how that pans out. In the meantime, we're going to be ordering fifteen 750GB SATA drives, one for each server. That will increase our total available VM storage to 11TB or so, which is better than the 2TB we get from the 2900. That also means that we lose out on nifty features like "if the VMWare server goes down, we can bring these VMs back up on another machine."
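The "11TB or so" figure checks out if the fifteen drives are spread one per server. A quick Python sketch of that arithmetic, using decimal terabytes and ignoring filesystem overhead (both assumptions):

```python
SERVERS = 15
DRIVE_GB = 750   # assumed: one 750GB SATA drive in each server
NFS_TB = 2       # what the PowerEdge 2900 provides today

# Raw local capacity in decimal terabytes, before any filesystem overhead.
local_tb = SERVERS * DRIVE_GB / 1000

# How many times more space that is than the shared NFS box.
ratio = local_tb / NFS_TB

if __name__ == "__main__":
    print(f"local storage: {local_tb} TB, about {ratio:.1f}x the NFS server")
```

That is 11.25TB raw, spread across fifteen separate disks rather than one shared pool, which is exactly why failing a VM over to another host stops being easy.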

You may be curious how many VMs we can stuff on one of those 1950s. Well, with a mix of local and NFS storage, we've gotten up to 15 VMs running at once. These aren't weenie VMs either - they're either RHEL nodes with 512MB or 1GB (usually 1GB) of RAM and 15GB of disk, or Windows nodes with 512MB-1GB of RAM, 15GB of disk, and clusters running. They're either running Oracle or MS SQL, and while they're not handling millions of transactions, they're being used by my development and QA staff.

As you might expect, power and cooling requirements for this bunch of servers are high. They're all in one APC NetShelter VX rack, fed by three 15A 110V AC lines. Some other infrastructure servers are also on those circuits, but we're using up roughly 30A in that one rack alone. Cooling is hard -- we've blown past what the 5 ton AC unit in the room can handle, and the two portable A/C units don't do much to help. We're in the process of moving gear to a colo.
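To put those numbers in perspective, here is a rough Python sketch of the rack's electrical load and the heat it dumps into the room. It assumes a power factor of 1 (so watts = amps x volts) and uses the standard conversions of about 3.412 BTU/h per watt and 12,000 BTU/h per ton of cooling:

```python
CIRCUITS = 3
BREAKER_AMPS = 15
VOLTS = 110
DRAW_AMPS = 30                 # rough draw for the one rack

BTU_PER_HOUR_PER_WATT = 3.412  # 1 W is about 3.412 BTU/h
BTU_PER_HOUR_PER_TON = 12_000  # 1 ton of cooling = 12,000 BTU/h

capacity_amps = CIRCUITS * BREAKER_AMPS       # total breaker rating
utilization = DRAW_AMPS / capacity_amps       # share of that rating in use
watts = DRAW_AMPS * VOLTS                     # assumes power factor of 1
btu_per_hour = watts * BTU_PER_HOUR_PER_WATT  # heat the A/C has to remove
tons = btu_per_hour / BTU_PER_HOUR_PER_TON

if __name__ == "__main__":
    print(f"{watts} W, {btu_per_hour:.0f} BTU/h, {tons:.2f} tons of cooling")
    print(f"{utilization:.0%} of the {capacity_amps}A the three circuits are rated for")
```

By this estimate the one rack accounts for a bit under one ton of the room's 5-ton unit. The load from the rest of the room isn't broken out here, so this says nothing about the total.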

All said, this environment has helped GA really expand. If we had to make an investment in physical servers we would have spent in excess of $500k to purchase all of that gear. With less than $70k invested, we're able to accomplish nearly the same thing -- and more, once we work the bugs out. I've been a huge fan of virtualization since VMWare first came on the market, and in my case it's really been worth it to deploy.

10 Jan 2008

Dell Account Rep Nightmare

My company, over the course of the last year or so, has been standardizing on Dell servers for our infrastructure projects. We've always used Dell desktops and notebooks, but over the last year and a half have bought fifteen dual quad core PowerEdge 1950 servers, a giant PowerEdge 2900, and a host of other various 1U boxes.

Three weeks ago, I was on my third Dell account team since I started here two and a half years ago. The rep I had was excellent. She went out of her way to solve problems for me. She got quotes out to me lightning quick. She was always pleasant to deal with, and sincerely interested in helping out whenever she could. I had actually even emailed Mike Dell and told him how great this rep was, and how happy I was dealing with her.

In my professional career, I've had direct dealings with Dell since about 1999. I standardized VPIsystems on Dell and Unisys gear, and we spent a ton of money on Dell desktops, workstations, and peripherals. At Rubin & Raine, I inherited an Everex server with dumb terminals, and by the time I left everyone had Dell desktops and laptops (except for Pat and his battle worn - and great - Toshiba), and about five Dell servers. Through various consulting gigs I did I always deployed Dell gear, and now at my current job, which I don't mention the name of but you can figure it out if you Google, we've become pretty much all Dell for our infrastructure. (We still develop on IBM Blades, Sun, and HP boxes.)

About two weeks ago I got an email from my new Dell rep saying that my account had been transitioned:

Your account currently resides in the Dell Business Development Group (BDG) where you have access to a variety of services and benefits to meet the specific IT demands of your company.

The primary reason for the e-mail today is to let you know there has been an update on your account team.  Effective immediately, I will be the primary Dell contact for your company.  If there is an outstanding purchase or issue I need to follow up on, let me know who you were working with and I will follow up with them to ensure there is no break in service.  In the weeks to come, I will be calling to introduce myself and briefly review a couple of items.

Your company will continue to be eligible for acquisition pricing and various other services including Premier Page, Premier Access, elevated level of tech support and customer care, etc.

As your account manager I lead your account team which consists of a Server & Storage Consultant, 3rd Party Licensing Specialist, 3rd Party Peripheral Specialist and Dell Financial Services Officer.  Please use me as your point of contact but feel free to contact your specialized team member as well (see my signature for your account team’s contact information).

Right off the bat, the first thing that I noticed was the "in the weeks to come." (The bolding is mine.) Alright, you're taking over my account. I've spent in excess of $100k on Dell gear in the last three months or so. Why are you going to wait weeks to get in touch with me? It's a fairly nit-picky thing, but it left a sour taste in my mouth.

I then went to my old rep and tried to see what I could do. Unfortunately there is nothing she could do about losing my account. I already had the email addresses of some of the higher ups with Dell's Oklahoma call center, so I went about contacting them.  It took a few days but someone did get back to me.

I should stop the story right here and take you back to 2000, when I had another excellent Dell rep by the name of Steve Milam. Eight years later and I still remember his name. He was, by far, one of the best salespeople (even outside of Dell) I had ever dealt with. He ended up getting transitioned to a higher level within Dell, and I got a new rep. The new rep was okay, but not nearly as good as Steve. I suffered through that for a few months before I left VPI. While I was at Rubin & Raine, the same thing happened. I had a decent (but not as good as Steve) rep, and was transitioned three times in rapid-fire sequence, each rep being progressively worse than the previous one. I finally had to email Mike Dell again and ask them to knock it off -- which they did for a month or two.

I spoke with Holly, who is my old rep's supervisor's supervisor. She assured me that my new account team - who has fewer accounts to deal with than my current rep - would be just as good, if not better. I pleaded with her to stay with my current rep, which she told me was not possible at her level, but she said she'd try to escalate the issue. She said that there were various political things that went on behind the scenes that made my seemingly easy to fulfill request impossible. I told her that Dell's political machinations are irrelevant to me, and the only thing that was important to me was staying with a rep that my company had a great working relationship with.

I also escalated to Holly's boss Chris, who has yet to return my phone calls.

Over the last few days I requested some quotes from my new account team. It took several hours to get the quotes back, and I only got them after I prodded my account rep to get them to me before a meeting. I also spoke with the storage specialist on my team, who didn't know what an iSCSI HBA was.

I got a call this morning from my rep's supervisor who attempted to defuse the situation. I told him the only thing that would make me happy would be if my account didn't get transitioned, and he quickly shot back "well it's already been transitioned." I told him that if I had to deal with a new account team, then it's worth my while to go deal with a new team at IBM or HP, since I am sure that they don't play musical reps like they do at Dell.

It's amazing to me that Dell would jeopardize an account - one that is growing and spending a lot of money - because of their own backroom dealings. It's also amazing given that Dell is trying to "rebrand" itself and show the world that it's more customer focused. This process - which I have lived through about five times - is incredibly harmful to their dealings with customers. Their response to my unhappiness has been one of inflexibility. I have no choice but to bend to the will of Dell, according to what I am told.

My IBM rep will be ecstatic to hear about this.

Filed under: Dell, Work