When we started implementing Microsoft DPM we ran into a series of headaches trying to get it running reliably and consistently. We worked through a number of online resources and talked with other geeks who had successfully used DPM.

Overall it was a huge pain, and I wished one resource had broken it down holistically. This article tries to do that by giving a high-level overview of how to get DPM working consistently. If you are looking for the ones and zeros of how DPM 2007 works, you will need another resource; I recommend Microsoft TechNet, where you will find some great webcasts that explain DPM. If you are looking for specifics on some of the items referenced in this article, Bing will be your friend.

DPM Requisites

1.) You need a rock-solid network. Any recurring outages or interruptions of network service will impact DPM.
2.) Do not try DPM unless you have gigabit Ethernet to all the machines being backed up.
3.) Do not skimp on the horsepower of the DPM server. To back up a 30-server network we run a quad-core server with 8 GB of RAM, and it continually uses all of those resources.
4.) You don't need to be a master at PowerShell, but you can't be scared of scripting with it to get DPM working. The DPM Management Shell combined with prewritten scripts and scheduled tasks is a must.
5.) Do not expect your experience with other backup programs to guide you to success with DPM. You need to invest the time it takes to learn DPM; otherwise DPM will become a four-letter word.
6.) Do not expect much out of the reports; they are very difficult to read, although very easy to schedule automatically.
7.) Not all alerts are bad. You need to understand the alerts and how DPM works to judge the severity of each one. Then you need to script to keep the number of alerts down.
8.) If you don't understand Microsoft VSS (the Volume Shadow Copy Service), learn it. A quick way to check writer health is shown after this list.

Day to Day Operations

The first thing you need to realize is that a failed job does not always mean data cannot be restored. So when you log on and see all the alerts, how do you tell whether an alert is an "I had better spend the next two hours fixing this" or an "I feel OK ignoring this since the last recovery point four hours ago was successful"? That question, combined with the sheer number of alerts, is what drives up administration time. For us it was significantly more than we were used to with other backup systems.
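
One way to answer that question quickly is to check how old the newest recovery point is for each data source before digging into individual failed jobs. Here is a rough sketch for the DPM Management Shell; it assumes the Datasource objects returned by Get-Datasource expose a LatestRecoveryPoint property, and the 24-hour threshold is just an example:

# Flag data sources whose newest recovery point is older than 24 hours (threshold is an example)
$threshold = (Get-Date).AddHours(-24)
foreach ($pg in Get-ProtectionGroup -DPMServerName $env:COMPUTERNAME) {
    foreach ($ds in Get-Datasource -ProtectionGroup $pg) {
        if ($ds.LatestRecoveryPoint -lt $threshold) {
            "{0} ({1}) - newest recovery point: {2}" -f $ds.Name, $pg.FriendlyName, $ds.LatestRecoveryPoint
        }
    }
}

Anything that shows up in this list is worth the two hours; anything that does not is probably a candidate for the ignore pile.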
DPM jobs will often fail when the host or the DPM server itself is using too many resources (network, CPU, memory). Make sure you optimize the protection groups so that resource consumption is spaced out. Also, don't be afraid to run system state and express full backup jobs on low-impact servers during the day in order to spread out the resource consumption. Again, DPM is not like traditional backup where you run most of your jobs in the evening. The only jobs you want to start in the evening are the ones on production servers that would cause a user-noticeable impact. If you frame the problem of scheduling your DPM jobs with an emphasis on spacing out resource consumption evenly (on both the DPM server and the hosts), you will have far fewer errors.

When you check the alerts on the console, the goal should be to figure out what was running on the protected server or on the DPM server at that time that caused the job to fail. This is different from Backup Exec or traditional programs, where you try to figure out which check box you missed or whether the backup device has a problem. DPM jobs are simple: provided there is free disk space and your autogrow feature is on, the problem is usually not the device or the job configuration. Do not kill yourself triple-checking, deleting, and recreating backup jobs; that just wastes time. This is an important concept, so once more: to determine why a job failed, do not look at the job itself but at the context in which the job was running. There are three places to look (the DPM server, the host server, and the network).
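
The context check can be as simple as a point-in-time resource snapshot of the DPM server and the host. The sketch below uses plain WMI, nothing DPM-specific, and the server names are placeholders; for failures that happened hours ago you would want Performance Monitor counters logged over time instead:

# Quick CPU and free-memory snapshot of the DPM server and a protected host (names are placeholders)
foreach ($server in @("DPMSERVER", "FILESERVER01")) {
    $cpu = Get-WmiObject Win32_Processor -ComputerName $server | Measure-Object LoadPercentage -Average
    $os  = Get-WmiObject Win32_OperatingSystem -ComputerName $server
    "{0}: CPU {1}%, free memory {2} MB" -f $server, [int]$cpu.Average, [int]($os.FreePhysicalMemory / 1KB)
}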

Scripts

1.) You will need scripts and scheduled tasks to manage DPM. There are three important scripts; I found all of these online and did not create any of them myself. As an alternative to scheduled tasks, you can also set up MOM to run scripts automatically on specified failure events, although I have not completed this.
a. List problems – This script gives you text output of all the jobs that are in an invalid state. We set up a scheduled task to run it and redirect the output to a text file, and a program then picks up the file and emails it to us for daily checking. For those who don't remember, list_problems.ps1 > results.txt is the command to redirect the output to a text file.
b. Autogrow – This script automatically grows the storage for data sources that are approaching a defined threshold. You can have a scheduled task run it once a day.
c. Run consistency checks – This script automatically goes through each failed job and runs a consistency check; a bare-bones sketch of the idea is shown after this list.

2.) The command to run a PowerShell script from the DPM shell (for example, from a scheduled task) is shown in the example after this list.
3.) If you do not use these scripts, you will waste valuable man-hours logging on, right-clicking jobs to pick one of three actions, and watching them run. The scripts do the right-clicking and action selection for you, which reduces the number of failed jobs you find when you do log on. It is always dangerous to have these things run automatically, especially if you don't understand what they are doing; the biggest danger is unintentionally impacting server performance. Please learn what these actions are actually doing in order to minimize that danger.
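
For item 2: the DPM Management Shell is just PowerShell preloaded with the DPM snap-in, so a scheduled task can call powershell.exe the same way the shell's own shortcut does. The console-file path below is the default from our install and the script path is a placeholder; check the properties of your DPM Management Shell shortcut for the exact line to copy:

# Paths are defaults/placeholders - verify against your own DPM Management Shell shortcut
powershell.exe -PSConsoleFile "C:\Program Files\Microsoft DPM\DPM\bin\dpmshell.psc1" -command "& 'C:\Scripts\list_problems.ps1' > C:\Scripts\results.txt"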
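
And to give a feel for what the script in 1c is doing, here is a minimal sketch that kicks off a consistency check for every data source in one protection group (the group name is hypothetical). The downloaded script is smarter: it only touches data sources that are actually inconsistent, which is exactly why you want it rather than a blanket loop like this:

# Run a consistency check against every data source in one group ("File Servers" is a made-up name)
$pg = Get-ProtectionGroup -DPMServerName $env:COMPUTERNAME | Where-Object { $_.FriendlyName -eq "File Servers" }
foreach ($ds in Get-Datasource -ProtectionGroup $pg) {
    Start-DatasourceConsistencyCheck -Datasource $ds
}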

Optimization

You will want to optimize each protection group so that the jobs are evenly spaced apart.
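
From the shell, spacing jobs out per protection group looks roughly like the sketch below, which staggers the nightly consistency check window. It relies on the modify-and-commit pattern (Get-ModifiableProtectionGroup, then Set-ProtectionGroup) and on Set-ProtectionJobStartTime; the times are made-up examples, so verify the cmdlet parameters against your DPM version before scheduling anything with this:

# Stagger the consistency check start time so the groups do not all fire at once (times are examples)
$start = Get-Date "20:00"
foreach ($pg in Get-ProtectionGroup -DPMServerName $env:COMPUTERNAME) {
    $mpg = Get-ModifiableProtectionGroup $pg
    Set-ProtectionJobStartTime -ProtectionGroup $mpg -JobType ConsistencyCheck -StartTime $start -MaximumDurationInHours 4
    Set-ProtectionGroup $mpg          # commit the change
    $start = $start.AddMinutes(90)    # next group starts 90 minutes later
}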

Storage Requirements

Always make sure your storage pool has unallocated space. If it does not, add more quickly and let the autogrow script handle everything else. If you are afraid of using too much disk space because your storage has the potential to get out of control, then you have a policy problem or a problem getting management to buy you more disk. Although it is possible to let disk space act as the "warning track" for when your data has grown too much, that does not make it good practice. You will waste far too much time trying to "cut it close" with DPM; it needs unallocated space to breathe. If you absolutely can't do this, then at least make sure your high-priority jobs, the ones where failing to recover the data costs you your job, have unallocated space in a separate pool from, say, the system state of your help desk software.

Pushing Down Agents

This topic is outside the scope of this article; there are a number of online articles that explain it. Basically, you will need to do a lot of patching on the hosts to get the agent installed. That can mean both OS patching and application patching (SQL, Exchange).

Other Notes

1.) If you continually error out when running a consistency check, verify that the protected server has enough free space. With Microsoft DPM we noticed that different data source types (DC, SQL, Exchange) have different host storage requirements. For example, if a DC does not have enough local disk space, the check will fail. It is pretty common practice to give a domain controller running as a virtual machine a small C: partition, so don't be shocked if this happens to you.
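
A quick way to see whether those failures line up with a full system drive is a remote free-space query; nothing DPM-specific here, and the server name is a placeholder:

# Free space on the C: drive of a protected server ("DC01" is a placeholder)
$disk = Get-WmiObject Win32_LogicalDisk -ComputerName "DC01" -Filter "DeviceID='C:'"
"{0}: {1:N1} GB free of {2:N1} GB" -f $disk.SystemName, ($disk.FreeSpace / 1GB), ($disk.Size / 1GB)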