By Ruby Nahal, Systems Engineer
A rose by any other name
I, like many techs, think VMware’s “snapshot” is a pretty slick feature. I just think it should have been named something else.
See, when VMware named its “snapshot”, it did not stick to what most of us in the industry regard as a snapshot. That makes basic assumptions about how it works a recipe for disaster, but only if you don’t finish this article.
Snapshot or time bomb?
Within the IT industry, the term ‘snapshot’ is already in play and used by professionals and storage vendors alike in a very general way. By this widely understood definition, a snapshot is a captured point in time within your running production copy that can be returned to later if need be.
However, when you take a VMware “snapshot”, the production copy of the virtual machine disk (VMDK) is frozen (let’s call it virtualmachine01.vmdk) and a new VMDK is created (something like virtualmachine01-000001.vmdk). This new disk is the “snapshot” disk, and all changes after this point are written to it.
What this means is that the virtual machine is no longer running on the production virtual disk, because that disk has been frozen in time. Instead, it is running on the snapshot disk, which records every subsequent change and so grows larger and larger.
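To make the frozen production disk and the growing snapshot disk concrete, here is a toy model in Python. It is only a sketch of the idea described above, not VMware's actual on-disk format; the class name SnapshottedDisk and its methods are made up for illustration.

```python
class SnapshottedDisk:
    """Toy model of a frozen base disk plus one snapshot (delta) disk."""

    def __init__(self, base_blocks):
        # The base disk is frozen when the snapshot is taken.
        self.base = dict(base_blocks)
        # The snapshot disk starts empty and receives every later write.
        self.delta = {}

    def write(self, block, data):
        # All changes after the snapshot land on the snapshot disk.
        self.delta[block] = data

    def read(self, block):
        # Changed blocks come from the snapshot disk; untouched blocks
        # fall through to the frozen base disk.
        if block in self.delta:
            return self.delta[block]
        return self.base.get(block)

    def revert(self):
        # Rolling back throws away everything written since the snapshot.
        self.delta.clear()

    def consolidate(self):
        # Deleting the snapshot merges its changes into the base disk.
        self.base.update(self.delta)
        self.delta.clear()


disk = SnapshottedDisk({0: "boot", 1: "data-v1"})
disk.write(1, "data-v2")   # the base disk still holds "data-v1"
disk.write(2, "new-data")  # the snapshot disk keeps growing
```

In this model the base disk never changes while the snapshot exists, so every write makes the snapshot disk bigger, and the longer it lives, the more there is to merge at consolidation time.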
Bigger isn’t better
At this point, one of two things can happen:
One, the snapshot disk will grow so large that it becomes corrupted and the virtual machine can no longer run on it. You will then have to recover from a backup, assuming you have guest-level backups, or make some not-so-fun changes to revert to the original production disk, which is now outdated and therefore means data loss.
Two, the “snapshot” will not get corrupted, but it will eat all of the space on the datastore. This can cause all virtual machines on the datastore to become unbootable until more space is made available or the snapshot is consolidated. But consolidation takes time. I have known virtual machine snapshot consolidation to take anywhere from a little over an hour to several hours. This means an outage, and a loss of time and money.
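For the second scenario, a rough back-of-the-envelope estimate shows how quickly a growing snapshot can fill a datastore. The function name and the figures below are hypothetical, and the sketch assumes a constant growth rate, which real workloads rarely have.

```python
def hours_until_full(free_gb, growth_gb_per_hour):
    """Estimate hours until a datastore runs out of space.

    Assumes the snapshot disk grows at a constant rate (a simplification).
    """
    if growth_gb_per_hour <= 0:
        # A snapshot that is not growing never fills the datastore.
        return float("inf")
    return free_gb / growth_gb_per_hour


# Hypothetical figures: 200 GB free, snapshot growing 2.5 GB per hour.
print(hours_until_full(200, 2.5))  # 80.0
```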
Moral of the story
What can we do if we want to enjoy the convenience of VMware’s “snapshot” and avoid disaster? First, only use VMware “snapshots” for what they are meant for: a short-term rollback plan.
My rule is to not let a “snapshot” stay alive for more than 48 hours. (I have seen a VM blow up because of a “snapshot” that was 7 days old; it was a database server with constant data changes.) To help you remember, set a reminder to go back and delete it. Make it part of your workflow. There are PowerCLI scripts that can be scheduled to check for snapshots on a regular basis and alert as necessary.
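The 48-hour rule is easy to express as a check. The sketch below is Python rather than PowerCLI and works on a plain list of records; SnapshotRecord, find_stale_snapshots, and the sample data are hypothetical, and it assumes you have already exported each snapshot's creation time from your environment.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# The 48-hour limit from the rule above.
MAX_AGE = timedelta(hours=48)


@dataclass
class SnapshotRecord:
    vm_name: str
    snapshot_name: str
    created: datetime


def find_stale_snapshots(records, now, max_age=MAX_AGE):
    """Return the snapshots that have been alive longer than max_age."""
    return [r for r in records if now - r.created > max_age]


# Hypothetical inventory, checked at a fixed point in time.
now = datetime(2024, 1, 10, 12, 0)
inventory = [
    SnapshotRecord("db01", "pre-patch", datetime(2024, 1, 3, 12, 0)),
    SnapshotRecord("web01", "pre-upgrade", datetime(2024, 1, 10, 9, 0)),
]
for snap in find_stale_snapshots(inventory, now):
    print(f"ALERT: {snap.vm_name} snapshot '{snap.snapshot_name}' is over 48 hours old")
```

Run against this sample data, only the seven-day-old db01 snapshot triggers an alert. A scheduled job could feed real inventory into the same check and send the alert by email or ticket instead of printing it.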
At TekTegrity, we are currently implementing a monitoring system that will allow us to keep tabs on any of the virtual machines that are running on snapshot disks and alert us long before any disaster happens. Our clients expect and appreciate this kind of proactivity, and so will yours.