Good news, everyone. Most of the pitfalls of disaster recovery can be avoided by putting the unsung heroes of disaster recovery planning to work.

For you business IT veterans, this will be a quick reminder of times when you front-loaded disaster recovery planning into your projects. For those less well-versed in disaster recovery planning, here is a framework to keep in mind as you tackle your operational and project workload.

Fault Management

Let’s start with some baseline language for this discussion. Moving away from the terms business continuity and disaster recovery, let’s take a look at an end-to-end review of fault management. Fault management can be broken down into 4 phases.

FAULT AVOIDANCE is a design issue. Any services provided must be engineered to eliminate as many typical faults as possible. For example, a typical LAN fault is electromagnetic interference, which can be caused by power cables and fluorescent lights. Fiber-optic cabling, however, is immune to this interference and can be specified in the design to avoid the fault altogether.

FAULT TOLERANCE is a component/redundancy issue. For example, RAID 5 disk arrays offer storage fault tolerance by spreading parity information across the drives, so the storage sub-system keeps running when a single disk fails. Every sub-system should be reviewed for fault tolerance options.
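To make the redundancy idea concrete, here is a minimal sketch of the XOR parity trick that RAID 5 relies on. It is a toy illustration only, not how a real controller is implemented: the block contents and the helper name `xor_blocks` are made up for this example.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks, as if striped across three drives (illustrative values).
data_blocks = [b"ACCT", b"SALE", b"MAIL"]

# The parity block is simply the XOR of all data blocks.
parity = xor_blocks(data_blocks)

# Simulate losing the second drive, then rebuild its block
# from the surviving data blocks plus the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_blocks[1])
```

Because XOR is its own inverse, any single missing block can be recovered from the rest. That is the "redundancy" you are paying for.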

EARLY FAULT DETECTION is a monitoring issue. Most system components can be monitored against a preset fault threshold. For example, HP hot-swap hard drives can monitor internal faults. When their fault threshold is exceeded, they can be swapped out before they fail.
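Threshold monitoring can be sketched in a few lines. The metric names and limits below are hypothetical placeholders, not vendor values, and a real monitoring tool would collect the readings for you.

```python
from dataclasses import dataclass

@dataclass
class Threshold:
    metric: str   # hypothetical metric name
    limit: float  # preset fault threshold (assumed value)

def find_early_warnings(readings, thresholds):
    """Return the metrics whose current reading exceeds the preset limit."""
    return [t.metric for t in thresholds
            if readings.get(t.metric, 0.0) > t.limit]

# Assumed thresholds and one sample set of readings for a single drive.
thresholds = [Threshold("reallocated_sectors", 50), Threshold("temperature_c", 55)]
readings = {"reallocated_sectors": 72, "temperature_c": 41}

warnings = find_early_warnings(readings, thresholds)
print(warnings)
```

Anything that shows up in `warnings` is a candidate for replacement before it actually fails.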

RAPID FAULT RECOVERY is a planning issue. Rapid Fault Recovery can only be performed when a documented procedure captures what steps need to be taken and what order they need to be taken in, to ensure service can be resumed as quickly as possible.

Already we see that well-planned disaster recovery is just the last mile of a continuity plan based on a fault management framework. Therefore, starting with the fault management phases of avoidance, tolerance, and early detection in your designs and projects earns you big wins. These 3 phases are your unsung heroes of disaster recovery planning. Incorporating these into all your deliverables will significantly reduce the investment you will have to make in recovery when that time comes, and you know it will.

The Business Impact Analysis and Risk Assessment

Are you with me so far? Good. But wait, there's more! The major pitfall of managing faults is OVER- or UNDER-managing them. For instance, investing more in prevention than the cure is worth can be detrimental to the business. Simply stated, don't spend $1,000 on a vault to protect a $10 necklace (unless it was your great-great-great-grandma's necklace).

First, you need to create a Business Impact Analysis (BIA). The BIA details and documents 3 components of business processes:

1. Which processes are central to the organization

2. The maximum downtime the business can tolerate in that process

3. An accounting of cost to the business when tolerances are exceeded

With the BIA report in hand, you are now equipped with a quantified view of impact.
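One way to picture a BIA entry is as a small record holding those three components. This is only a sketch: the class name, the field names, the dollar figures, and the assumption that cost grows linearly once the tolerance is exceeded are all mine, for illustration.

```python
from dataclasses import dataclass

@dataclass
class BiaEntry:
    process: str                          # a process central to the organization
    max_tolerable_downtime_hours: float   # downtime the business can tolerate
    cost_per_hour_over_tolerance: float   # assumed linear cost beyond tolerance

    def outage_cost(self, outage_hours):
        """Cost to the business for an outage of the given length."""
        excess = max(0.0, outage_hours - self.max_tolerable_downtime_hours)
        return excess * self.cost_per_hour_over_tolerance

# Hypothetical entry: order entry tolerates 4 hours, then costs $2,500 per hour.
order_entry = BiaEntry("order entry", 4, 2500)

print(order_entry.outage_cost(2))
print(order_entry.outage_cost(10))
```

An outage inside the tolerance costs nothing in this model, while a 10-hour outage runs 6 hours over and is charged for those 6 hours.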

We're not done yet, though. Knowing the impact of an outage is only part of the equation for knowing your organization’s exposure to faults. The last step is a good old-fashioned risk assessment to identify the factors that can undermine uptime. Fortunately, we need not do a risk assessment of everything. We only need it for the critical items identified in the BIA.

Once we have completed the BIA, which gives us the financial impact of an outage, and the risk assessment, which gives us the likelihood of an outage, we are equipped with a budget for fault management. Think of this budget as a simple chart of money over time in an outage. On this chart, the dark blue line represents the cost of an outage, which increases the longer it takes to recover. The declining gray line represents the investment in fault management. If you invest a lot, your recovery is quick. If you invest little, your recovery takes longer. The intersection of these lines is the sweet spot for investing in fault management.


[Figure: The 3 Unsung Heroes of Disaster Recovery Planning. Time over money of fault impact and exposure.]
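If you would rather see the sweet spot in numbers than in lines, here is a rough sketch. Everything in it is assumed for illustration: the curve shapes (outage cost rising linearly with recovery time, investment falling as `scale / hours`), the dollar figures, the likelihood, and the function names. Your BIA and risk assessment would supply the real inputs.

```python
def expected_outage_cost(recovery_hours, cost_per_hour=5000.0, likelihood=0.5):
    """Rising line: BIA cost per hour weighted by the assessed likelihood."""
    return likelihood * cost_per_hour * recovery_hours

def investment_needed(recovery_hours, scale=40000.0):
    """Declining line: faster recovery requires more investment (assumed shape)."""
    return scale / recovery_hours

def sweet_spot(candidate_hours):
    """Pick the recovery time where the two lines are closest to crossing."""
    return min(candidate_hours,
               key=lambda h: abs(expected_outage_cost(h) - investment_needed(h)))

# Candidate recovery times from 0.5 to 24 hours in half-hour steps.
candidates = [h / 2 for h in range(1, 49)]
best = sweet_spot(candidates)

print(best, expected_outage_cost(best), investment_needed(best))
```

With these made-up numbers the lines cross at a 4-hour recovery target, where the expected outage cost and the investment are both $10,000. Spend more than that and you are buying the $1,000 vault.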


That is all there is to it! Thanks for sticking with me so far. I'll work to wrap this up quickly so you can get on with your day. Let us review. The 3 heroes of fault management are avoidance, tolerance, and early detection of faults. Since there is no such thing as a free lunch, we need a budget to fund our heroes. To accomplish this, we have the tag-team of BIA and risk assessment working together to reveal the investment sweet spot.

One Last Reminder

One final reminder: Investment in fault management is not a one-time occurrence. Ongoing maintenance and updates to plans and components are necessary to keep a robust operational environment up to date. When changes to staffing, hardware, software, etc. are made, the impact on our unsung heroes of fault management needs to be assessed. Once we assess this impact, the necessary changes need to be incorporated into the business.

Thank you again for your time. Let me leave you with this. I am a senior IT consultant. I work for Afidence.  And I truly hope this was time well spent for you as I am always happy to help.

William Cilley

Senior IT Consultant | AfidenceIT