A cautionary tale of VM sprawl in data centers


VM sprawl is a very real problem in data centers all over the world. We build and provision VMs so quickly these days that they can get away from us before we know it. Soon our vCenter Server is full of VMs, and we don’t know who owns them, what purpose they serve, whether they’re still in use or whether they were ever used at all.

While the following dramatization is played for laughs, it’s important to catch the signs of VM sprawl before it brings your infrastructure to its knees.

Picture a systems engineer lying on a therapist’s nice leather couch in a dimly lit room as the therapist asks the following questions:

Therapist: Now, tell me when this all started.

Systems Engineer (SE): I don’t really even know where to begin. I guess it all started back with ESX 3.5 for us. It was our first introduction to server virtualization. I remember it like it was yesterday. We had just received a new shipment from Dell: a brand-new server with loads of RAM and a top-of-the-line processor. It was like Christmas morning for our team of engineers. We loaded the hypervisor on the server and connected it to our network, and we were off and running.

It was just one physical ESX host at first, and we created a Windows Server 2003 virtual machine that was going to be our domain controller. We then created another virtual machine that would end up being our DNS server and one more for our DHCP server. Our boss loved it and wanted us to build more physical hosts, so he bought us four more servers.

It was incredible. Since we now had more than one server, we decided it would be best to use vCenter Server so we could manage them all from a single pane of glass. This is when it really started to get out of hand.

Therapist: Calm down, take a deep breath. Let’s talk about what happened after the vCenter installation.

SE: Well, once we installed vCenter and added all of our ESX hosts to it, we realized we could build much more than just a couple of servers. We could cluster our hosts and create clones, templates, snapshots and all sorts of other fun stuff. This is when the sprawl really started to tax our infrastructure. Within several years and several new vSphere releases, we had over 1,500 virtual machines in our data center, and we had to keep buying more storage. Our power, space and cooling costs went through the roof, and my boss winced at each request from our team for more resources.

Therapist: Who would you say is responsible for letting this VM sprawl grow out of control? Do you feel responsible? Are the users responsible?

SE: I would love to just blame the users who kept requesting a new virtual machine each time they had a new project or wanted to test a new software package, but I can’t place all the blame on them. Late at night when the office was closed and I was all alone, I would build virtual machines in our test/dev environment to run my own tests.

I felt kind of like a mad scientist. It was so easy to provision a new VM.

Eventually, our need for more resources ended up costing too much, and we had to let two of our administrators go to make up for the funding shortage. That led me to drive our IT team harder to meet deadlines and give the end users whatever they wanted, which meant building huge VMs with multiple 150+ GB disks and 20 GB of RAM for the developers. We didn’t turn down a single new VM request, and stale VMs sat on the hosts sucking up resources. Just the other day I came across around 80 .vmdk files that had somehow been orphaned and were now zombie .vmdks, just roaming the SAN and eating up space.

Therapist: How was your relationship with your father growing up?

SE: Huh?

Therapist: Never mind. What is your environment like now?

SE: I couldn’t tell you. My boss fired me and my team and brought in some crack IT consultants to figure out what was wrong with the environment. Apparently they fixed everything and were able to delete over 400 stale, unused VMs, which freed up a ton of space on the SAN and relieved the stress on the network. They also implemented something called a Resource Governance Board, consisting of my boss, a SAN engineer, a network engineer and the CTO. So now when someone puts in a request for a VM, it has to go in front of the Resource Governance Board. Each member looks at the request and either approves or denies it based on the environment’s available resources and the need justification.

It’s all a fancy schmancy process now. They even place expiration dates on the bigger VMs that they build.

Therapist: How can you place an expiration date on a VM?

SE: I don’t know. They have some sort of workflow that alerts the team when a VM hasn’t been logged into or used. After 90 days of inactivity, they alert the VM owner that it will be deleted unless the owner sends a new justification for keeping it. I guess they only have two SEs now, down from my team of six. I’m just [engineer begins to weep] sad it had to end this way.

Therapist: What are you doing now? Do you still work in IT?

SE: I work the night shift at the data center on the NOC team.

Therapist: That’s not so bad now, is it?

SE: I eat Cheetos and drink Mountain Dew all night long just to stay awake, but I get by. I guess I learned a lot about VM sprawl and how to avoid it. I just wish I had noticed the signs of VM sprawl before I was buried in it.

Therapist: Now, back to your dad.

Do you have some sort of resource governance in place to review requests for new VMs? If you don’t, you should consider it. You’ll find resource governance a very valuable ally in keeping your resource consumption under control.

Don’t wait until you have to pay tens of thousands of dollars to an outside consulting firm to come in and take a deep look at your infrastructure to figure out why it’s bogged down. Be proactive and do it yourself now! VMware has given us a gift in the ability to provision VMs with speed and ease; this is something we as SEs and admins should not take lightly.
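The expiration workflow the SE describes, flagging a VM after 90 days of inactivity and warning its owner, is simple enough to sketch yourself. The Python below is only a minimal illustration, not a vSphere integration: `VmRecord`, `find_stale_vms` and `build_notice` are hypothetical names, and it assumes you already have an inventory export with a last-used date for each VM.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Threshold taken from the story: 90 days without activity.
INACTIVITY_LIMIT = timedelta(days=90)


@dataclass
class VmRecord:
    """Hypothetical inventory row; not a vSphere API object."""
    name: str
    owner: str
    last_used: date


def find_stale_vms(inventory, today, limit=INACTIVITY_LIMIT):
    """Return the VMs whose last recorded use is at least `limit` ago."""
    return [vm for vm in inventory if today - vm.last_used >= limit]


def build_notice(vm, today):
    """Build the warning text sent to the VM owner before deletion."""
    idle_days = (today - vm.last_used).days
    return (
        f"{vm.owner}: VM '{vm.name}' has been idle for {idle_days} days "
        "and will be deleted unless you send a new justification."
    )
```

Run something like this on a schedule against your inventory and send each notice to the owner. How "last used" is measured (last login, last power-on, CPU activity) is a choice for your environment; the sketch assumes that decision has already been made.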

Make sure you get a solid need justification from the requestor so that, in the near future, you aren’t lying on a therapist’s couch wondering how it got so out of hand.
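If you want a starting point, the two things the Resource Governance Board in the story weighs, available resources and the need justification, can be pre-screened before a request ever reaches a human. The following Python is a rough sketch under assumptions of my own: `VmRequest`, `Capacity`, `prescreen` and the five-word minimum are all hypothetical, and the real approval decision still belongs to the people on your board.

```python
from dataclasses import dataclass


@dataclass
class VmRequest:
    """Hypothetical request form submitted by a user."""
    requester: str
    ram_gb: int
    disk_gb: int
    justification: str


@dataclass
class Capacity:
    """Hypothetical snapshot of what the environment has left."""
    free_ram_gb: int
    free_disk_gb: int


def prescreen(request, capacity, min_justification_words=5):
    """Return reasons to send the request back; an empty list means it can go to the board."""
    problems = []
    # Assumed rule: a justification shorter than a few words is not a justification.
    if len(request.justification.split()) < min_justification_words:
        problems.append("justification is missing or too thin")
    if request.ram_gb > capacity.free_ram_gb:
        problems.append("not enough free RAM")
    if request.disk_gb > capacity.free_disk_gb:
        problems.append("not enough free disk")
    return problems
```

A request like the developers’ 20 GB of RAM and multiple 150+ GB disks would at least have to pass this check, with a written reason attached, before anyone provisioned it.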
