Plan for the Worst-Case Scenario
How to Implement a Successful Disaster Recovery Plan
The number one priority of business continuity planning (BCP) is protecting the company's most valuable assets: the health and safety of its employees. The second priority is the rapid recovery and restoration of business-critical systems. In this article, I will outline nine steps that should be taken in order to implement a successful disaster recovery plan.
Step No. 1: Getting started
Typically, the first step of BCP is organizing the BCP stakeholders and getting executive buy-in to the concept. How much work this takes depends on the level of executive support you already have for this type of program and how hard you have to sell the concept. It is important to be prepared, as you will need to justify costs. You do so by presenting a number that identifies the cost of downtime and how much company revenue is at risk if business systems become unavailable for an extended period of time.
Step No. 2: Define the right plan
Defining a disaster recovery plan begins with understanding what keeps your business running and prioritizing the recovery of different systems that are most critical. This is usually conducted in the risk analysis and business impact study and you don't need to be a rocket scientist to pull this together. It is highly likely you already know and could create this list in your sleep.
Step No. 3: Learn the common mistakes and how to avoid them
Many mistakes are made when preparing for BCP. The most common is not allowing enough time to identify, plan or prepare for the design, implementation and exercise of the system. Failing to regularly exercise or test the business continuity plan can often be the most costly mistake. Just because you have successfully implemented recovery and restoration procedures doesn't mean you are done.
Every time a system update or change control process is initiated, the BCP should be retested to confirm that it has not been affected and still functions as designed. Do not skimp on exercising your BCP just because you can't seem to find the downtime. This is where a virtualization platform (such as Microsoft Hyper-V) is extremely helpful, as you can spin up a virtual disaster recovery target and test without impacting the actual production system. The virtualization technology allows the machines to be segmented from the production network, creating a virtual disaster recovery testbed.
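The rule of retesting after every change can be tracked with very little tooling. The following is a minimal sketch, not a prescribed method: the system names, dates and the `needs_retest` helper are hypothetical, and it assumes you record the date of the last exercise and the last change for each system.

```python
from datetime import date


def needs_retest(last_exercise: date, last_change: date) -> bool:
    """Return True when a system changed on or after its last BCP exercise."""
    return last_change >= last_exercise


# Hypothetical records; in practice these would come from your
# change control log and your exercise schedule.
systems = {
    "email": {"last_exercise": date(2024, 3, 1), "last_change": date(2024, 4, 15)},
    "file server": {"last_exercise": date(2024, 3, 1), "last_change": date(2024, 1, 10)},
}

# Collect every system whose recovery procedure has not been exercised
# since its most recent change.
stale = [
    name
    for name, record in systems.items()
    if needs_retest(record["last_exercise"], record["last_change"])
]

print("Systems needing a BCP retest:", stale)
```

Any system that appears in the list has changed since its recovery procedure was last proven to work.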
Step No. 4: Develop a backup plan for your backup plan
Over the past nine years, I have been directly or indirectly involved with more than 1,600 business continuity implementations, and each one taught me something. One lesson was the need to plan a backup for the backup. During a disaster recovery implementation covering more than 70 virtual servers, the batteries in the uninterruptible power supply (UPS) that provided backup power for the data center exploded. Because the main power supply ran through the UPS, the failure cut power to the entire data center and took about 40 servers offline. Luckily, we had just finished the implementation. We hadn't yet completed the exercise training, however, so we had to do it as a live test.
Thanks to the brilliant engineers I work with, and the fact that we had implemented these solutions a few hundred times, we were able to bring up all the business-critical systems at a disaster recovery facility within fifteen minutes. HazMat was called to begin cleaning up the contents of the exploded batteries in the data center, and we were able to recover all the data center operations to the original data center about five days later. This shows that even though you have a backup plan, you don't necessarily have a backup.
Step No. 5: Understand your business
During the initial Business Impact Analysis (BIA), the BCP team will identify and prioritize the level of protection and recovery required for each of the business systems. I have seen some companies skip the BIA and just apply the same solution to all servers; this isn't necessary and can increase the cost of the actual BCP solution. Not all servers need to be highly available with a very low Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Only the servers identified as business-critical to restoring operations to a functional level need that.
These are usually your messaging and communication systems, followed by internal and external-facing Websites (or any other system that services your customers and clients or allows your company to book revenue). All other systems, such as file or print servers, might be fine if they take 12-24 hours to bring back online, as long as the other systems are intact.
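The prioritization a BIA produces can be written down as a simple list of systems with their objectives. The sketch below is illustrative only: the system names, the RTO and RPO figures and the one-hour high-availability threshold are assumptions, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTarget:
    name: str
    rto_hours: float  # Recovery Time Objective: maximum tolerable time offline
    rpo_hours: float  # Recovery Point Objective: maximum tolerable data loss, in hours


def recovery_order(targets):
    """Return targets sorted so the tightest RTO (then RPO) comes first."""
    return sorted(targets, key=lambda t: (t.rto_hours, t.rpo_hours))


def needs_high_availability(target, rto_threshold_hours=1.0):
    """Flag systems whose RTO is at or below a hypothetical threshold."""
    return target.rto_hours <= rto_threshold_hours


# Hypothetical BIA results.
targets = [
    RecoveryTarget("file server", 24, 24),
    RecoveryTarget("email", 0.25, 0.0),
    RecoveryTarget("customer website", 1, 0.25),
    RecoveryTarget("print server", 24, 24),
]

for target in recovery_order(targets):
    tier = "high availability" if needs_high_availability(target) else "standard recovery"
    print(f"{target.name}: RTO {target.rto_hours}h, RPO {target.rpo_hours}h -> {tier}")
```

Only the systems that fall under the threshold get the expensive, highly available solution; the rest can use a cheaper recovery method.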
Step No. 6: Learn your cost of downtime
Identifying and understanding your business-critical systems during the BIA will help you calculate the cost of downtime. Putting a dollar value on the number of hours or minutes a specific system is unavailable will not only help you sell the need for infrastructure improvements to executives, but will also help them understand the real cost of doing nothing. It will also help define which business continuity controls should be put in place.
Obviously, downtime on business-critical systems will cost more, so those systems will require a larger share of the plan's funds to keep them highly available. Less critical systems will not require a solution with a very low RTO and might be fine with the tape recovery procedures already in place.
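A rough cost-of-downtime figure can be estimated from a few inputs gathered during the BIA. The following sketch assumes a simple model of lost revenue plus idle labor; the `SystemProfile` fields and all dollar figures are hypothetical, and a real estimate may need other factors such as penalties or recovery costs.

```python
from dataclasses import dataclass


@dataclass
class SystemProfile:
    """Hypothetical per-system inputs gathered during the BIA."""
    name: str
    revenue_per_hour: float    # revenue booked through the system per hour
    affected_employees: int    # staff who cannot work while it is down
    loaded_hourly_wage: float  # average fully loaded cost per employee-hour


def downtime_cost(system: SystemProfile, hours_down: float) -> float:
    """Estimate lost revenue plus idle labor for an outage of hours_down."""
    lost_revenue = system.revenue_per_hour * hours_down
    idle_labor = system.affected_employees * system.loaded_hourly_wage * hours_down
    return lost_revenue + idle_labor


# Illustrative figures only; replace them with numbers from your own BIA.
web_store = SystemProfile("web store", 5000.0, 20, 40.0)
print_server = SystemProfile("print server", 0.0, 5, 40.0)

for system in (web_store, print_server):
    cost = downtime_cost(system, 4)
    print(f"{system.name}: ${cost:,.2f} for a 4-hour outage")
```

Comparing the two results side by side makes the case for spending more on the revenue-generating system and less on the print server.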
Step No. 7: Plan for the worst-case scenario
As the UPS failure described in Step No. 4 shows, another common mistake is not getting the data and business systems out of the existing data center and into an off-site location from which they can be recovered. All too often, companies implement a fault-tolerant system thinking that they are highly available, but fail to account for losing power to the entire building (or, in the worst case, losing the building to a fire or natural disaster). This is probably the exception rather than the norm, but it wouldn't be called BCP if you planned for only a few scenarios. You have to plan for the worst-case scenario; if you don't, you are doing your company a disservice and putting it at risk.
Step No. 8: Think beyond tape
Tape has been a staple of restoration procedures since the invention of computers, but just because it has been around the longest doesn't mean it is the best solution. Many companies are replacing tape backup solutions with disk-to-disk backup solutions because the data is readily available, which greatly reduces the recovery time typically associated with tapes.
Step No. 9: Make sure your backup plan works
One of the biggest challenges of maintaining a business continuity plan is performing regular exercises of the solution. The biggest excuse for not testing is the usual downtime required of the production systems in order to test the failover and recovery process. Years ago, when I would perform these exercises, we would segment the two data center networks so we could test the recovery process at the disaster recovery facility without taking the production systems offline.
With the adoption of virtualization over the last few years, this is even less of an issue. With virtual machines, specifically Microsoft Hyper-V, the ability to run this type of business continuity exercise against a test virtual machine infrastructure is built in. You can treat that infrastructure as if it were your disaster recovery center (and it most likely is).
Don't use downtime as an excuse, because it isn't one. Part of being responsible for BCP is ensuring the functionality of the business continuity plan; exercising that functionality is the only way to ensure it will work when it's needed most.
Brace Rennels is a Professional Services Project Manager at Double-Take Software, and a Certified Business Continuity Professional (CBCP). He has been involved with over 1,600 disaster recovery installations at Double-Take Software. He is responsible for managing the message of the professional services organization, the partner channel/OEM-related services activities, and the implementation of new service programs to drive Double-Take Software's sales. Previously, Brace was Manager of Technical Consulting Services at OpenPages, Inc. There he worked for one of the fastest-growing content management systems for multiple channel publishing. He trained staff on how to conduct and develop Risk and Business Impact Analysis for clients. Additionally, he was a Solution Architect for designing enterprise publishing, print and Web solutions based on customer business requirements. He created the business model, methodology and mission statement for the Technical Consulting Services.
Before OpenPages, Brace worked for General Data Services (acquired by EMC Corporation in 1999) as a Senior Systems Engineer. There he performed Business Impact Analysis to assist architects for enterprise-wide solutions involving hardware, software and business processes. He was awarded the Professional Services Contributor of the Quarter award for outstanding efforts in FY99. He can be reached at email@example.com.