By Ruby Nahal, Senior Engineer
Roadblocks to recovery
Right after I came back to the office from maternity leave, I had the pleasure (NOT) of dealing with a terrible disaster recovery. We were trying to recover a database application from crash-consistent backups, which meant lots and lots of hours on the phone. Needless to say, by the end, my nerves were shot.
There were many roadblocks to the recovery, each with its own lesson to be learned, but the biggest takeaway was how perfectly this disaster contrasted application-consistent backups with crash-consistent backups. Here, we primarily employ crash-consistent backups, but in some use cases we deploy application-aware backups.
I have always been an advocate of application-aware backups for database servers. Yeah, snapshots are great, even crash-consistent snapshots are great, and you can recover quickly. But that still doesn’t take away the necessity of a proper backup solution for your core business services.
Crash-consistent snapshots vs application-aware backups
Crash-consistent backups capture everything on the storage at a single point in time, much as if the server had lost power at that instant. The danger is that the snapshot of the file system or volume holding the database data is taken without interfacing with the application. So if snapshots are the backup method used in your infrastructure, they have to be at least crash-consistent snapshots. Crash-consistent backups are fine and work well for the most part for non-database applications.
However, for database applications, there is data in memory and there might be pending I/O operations at the time of the snapshot. A crash-consistent snapshot fails to capture these, which can cause inconsistencies in the database when it is recovered. As such, restoring from a crash-consistent backup requires extra work before an application can be brought back online.
Application-consistent backups, however, capture all data in memory and all transactions in process. This is done with some type of client software, either third party or native. On Microsoft Windows and Hyper-V, for example, the VSS API is used to quiesce the database application, flush its memory cache and complete all its writes in order before the backup is performed.
The most common example is the native SQL-based backup that ships with Microsoft SQL Server. Application-consistent backups may also provide point-in-time recovery for the database. When the backup or snapshot is complete, the software notifies the database application to continue its operations. Restoring an application-consistent backup does not require additional work to bring the database application back.
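To make the difference concrete, here is a minimal Python sketch of the two approaches. Everything in it is hypothetical: `ToyDatabase` is a stand-in I made up for illustration, not the VSS API or any real database engine. It only models the sequence described above: pause writes, flush memory to storage, take the copy, then tell the application to carry on.

```python
import copy


class ToyDatabase:
    """Hypothetical stand-in for a database engine (not a real product API)."""

    def __init__(self):
        self.disk = {}       # rows already written to storage
        self.memory = {}     # accepted rows still buffered in memory
        self.frozen = False  # True while writes are paused for a backup

    def write(self, key, value):
        if self.frozen:
            raise RuntimeError("writes are paused while the backup runs")
        self.memory[key] = value

    def flush(self):
        # Move everything buffered in memory onto storage.
        self.disk.update(self.memory)
        self.memory.clear()


def crash_consistent_snapshot(db):
    # Copies storage only, without talking to the application,
    # so anything still in memory is missed.
    return copy.deepcopy(db.disk)


def application_consistent_backup(db):
    db.frozen = True                   # 1. quiesce: pause new writes
    try:
        db.flush()                     # 2. flush buffered data to storage
        return copy.deepcopy(db.disk)  # 3. take the backup
    finally:
        db.frozen = False              # 4. notify the application to resume


db = ToyDatabase()
db.write("order-1", "paid")
db.flush()
db.write("order-2", "pending")  # still in memory only

print(crash_consistent_snapshot(db))      # misses order-2
print(application_consistent_backup(db))  # contains both orders
```

The crash-consistent copy is not corrupt in this toy model, it is just behind. In a real database the missing in-flight writes are what force the extra recovery work after a restore.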
Other reasons why application-aware backups are vital to normal operation
The most common use cases for application-consistent backups are Microsoft Exchange and SQL Server. Ever tried recovering a 500 GB Exchange database from a non-application-consistent backup? I have. It’s not fun. It may involve hours of downtime to get the database back to a clean (non-dirty) state, even after the server itself has been recovered.
In their default configurations (non-circular logging and the full recovery model, respectively), these applications, like most database applications, do not normally write changes directly to their database files but record them in log files first. This means that a crash during normal operations puts the active log file at risk, not the primary database.
But if application-consistent backups are not taken on a regular basis and these applications are left in their default configurations, or are backed up using crash-consistent backups only, the log files will eventually consume all available space. Also, when backed up with a non-application-consistent backup, they will not commit their logs. So timely, point-in-time recovery becomes harder.
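The log-first behavior is easy to see on a small scale. The sketch below uses SQLite in write-ahead-log mode as a stand-in, which is my assumption for illustration only; Exchange and SQL Server have their own log formats and tooling. It copies just the main database file while the application is running, then takes a second copy through the engine's own backup API (`sqlite3.Connection.backup` in Python's standard library), and counts the rows in each. The table name and row values are made up.

```python
import os
import shutil
import sqlite3
import tempfile


def count_rows(path):
    conn = sqlite3.connect(path)
    try:
        return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    finally:
        conn.close()


def compare_copies():
    """Return (rows in file-only copy, rows in engine-made backup)."""
    workdir = tempfile.mkdtemp()
    live_path = os.path.join(workdir, "live.db")
    live = sqlite3.connect(live_path)

    # Create the table first so it lands in the main database file.
    live.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
    live.commit()

    # Switch to write-ahead logging: new commits go to the log file first.
    mode = live.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode != "wal":
        raise RuntimeError("this SQLite build or file system lacks WAL support")
    live.executemany(
        "INSERT INTO orders (item) VALUES (?)",
        [("disk",), ("tape",), ("cloud",)],
    )
    live.commit()

    # Copy only the main file, ignoring the log and the running application.
    file_copy = os.path.join(workdir, "file_copy.db")
    shutil.copyfile(live_path, file_copy)

    # Ask the engine itself for a backup.
    aware_copy = os.path.join(workdir, "aware_copy.db")
    target = sqlite3.connect(aware_copy)
    live.backup(target)
    target.close()

    result = (count_rows(file_copy), count_rows(aware_copy))
    live.close()
    shutil.rmtree(workdir, ignore_errors=True)
    return result


print(compare_copies())
```

On a default SQLite build this should return `(0, 3)`: the committed rows are still sitting in the log, so the file-only copy has the table but none of the data. A true crash-consistent snapshot would capture the log file as well, so this exaggerates the gap on purpose, but it shows why the log matters as much as the database file.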
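A bit of back-of-the-envelope arithmetic shows how quickly this can bite. The function below is a rough sketch with made-up numbers: the disk size, the daily log volume and the backup interval are all hypothetical parameters, and it assumes that each application-consistent backup lets the application clear its logs completely.

```python
def days_until_log_disk_full(log_disk_gb, daily_log_gb,
                             truncating_backup_every_days=None,
                             horizon_days=365):
    """Return the day the log disk overflows, or None if it never does
    within the horizon. All inputs are illustrative, not measurements."""
    used_gb = 0
    for day in range(1, horizon_days + 1):
        used_gb += daily_log_gb
        if used_gb > log_disk_gb:
            return day
        # Assumption: a successful application-consistent backup
        # lets the application clear its accumulated logs.
        if (truncating_backup_every_days
                and day % truncating_backup_every_days == 0):
            used_gb = 0
    return None


# 100 GB log disk, 8 GB of new logs per day, logs never cleared:
print(days_until_log_disk_full(100, 8))
# Same server with a weekly backup that clears the logs:
print(days_until_log_disk_full(100, 8, truncating_backup_every_days=7))
```

With these numbers the disk overflows on day 13 without log-clearing backups, and never fills up within the year when one runs weekly.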
In a nutshell
If your infrastructure involves a mix of different technologies (DCs, database systems, Exchange servers, etc.), no single method can serve as the backup solution for all of them. A proper strategy is required to ensure the backup method fits each type of service, so that a timely recovery is possible.