A plan for things to come
By Laura Bosworth-Bucher (October 2002)
A thorough business continuance strategy is the next best thing to seeing the future
Put away that crystal ball and forget the psychic hotline. Because you can't predict the future, comprehensive planning for the unknown is the only way to ensure the continuity of critical IT operations during and after an unforeseen disaster. In this article, we examine the best approach to creating a business continuance strategy, helping you safeguard against tomorrow's unknowns today.
At what cost?
Businesses try to avoid downtime because it is expensive, but to create a sensible plan for recovery, companies need to know exactly how expensive downtime can be. By calculating the cost of downtime for every application in your IT environment—including the opportunity cost of inaccessible information and the cost of preventive measures—your business can approach business continuance with clear priorities.
IT staff can implement application availability at various levels—from 99 percent uptime to continuous operations in excess of 99.999 percent. But the cost of establishing a "five nines" availability program can be prohibitive because it includes the expenses of infrastructure equipment and day-to-day management. That's why it is important to consider any role-based variables that could reduce prevention costs.
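To make those uptime tiers concrete, the percentages translate into allowed downtime per year. The uptime targets come from the article; the minutes-per-year figures below are simple derived arithmetic, not quoted from it.

```python
# Allowed annual downtime for common availability targets.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def allowed_downtime_minutes(availability_pct: float) -> float:
    """Return the maximum annual downtime (in minutes) for an uptime target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% uptime -> {allowed_downtime_minutes(target):,.1f} min/year")
```

At 99 percent uptime an application can be down more than 87 hours a year; at "five nines" the budget shrinks to roughly five minutes—which is why the cost gap between the two tiers is so large.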
Departments within an organization have different availability needs for a given application. For example, a company that communicates time-critical executive decisions through e-mail places extreme importance on messaging uptime at the executive level. But if this organization is a manufacturing company, the employees on the shop floor may check e-mail only once or twice each day. As a result, IT staff can assign tiered uptime levels to different teams based on their roles, which eliminates the need for full-time, 99.999 percent messaging uptime across the enterprise.
To determine the cost of inaccessible information, combine the revenue generated by an application with estimated indirect costs. In the e-mail example, a company would suffer the indirect cost of lost employee time. But the cost of an unavailable manufacturing application would be the total revenue lost by not shipping product, plus the indirect costs of lost employee and manufacturing facility time—potentially more expensive than a hiccup in the e-mail infrastructure.
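The calculation above can be sketched in a few lines. All dollar figures and headcounts below are hypothetical placeholders for illustration, not numbers from the article.

```python
# Hedged sketch: per-hour downtime cost for an application, combining
# direct revenue loss with indirect costs (lost employee time, idle
# facilities). Every figure used here is hypothetical.

def downtime_cost_per_hour(revenue_per_hour: float,
                           employees_idled: int,
                           loaded_rate_per_hour: float,
                           facility_cost_per_hour: float = 0.0) -> float:
    """Direct revenue loss plus indirect costs for one hour of outage."""
    indirect = employees_idled * loaded_rate_per_hour + facility_cost_per_hour
    return revenue_per_hour + indirect

# E-mail outage: no direct revenue loss, only lost employee time.
email = downtime_cost_per_hour(0, employees_idled=200, loaded_rate_per_hour=40)

# Manufacturing application outage: lost shipments plus idle staff and plant.
mfg = downtime_cost_per_hour(25_000, employees_idled=80,
                             loaded_rate_per_hour=30,
                             facility_cost_per_hour=2_000)

print(f"e-mail: ${email:,.0f}/hr, manufacturing: ${mfg:,.0f}/hr")
```

Even with these invented inputs, the manufacturing outage dwarfs the e-mail outage, which is the prioritization signal the business impact analysis is meant to surface.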
The Cost of Downtime

| Year | Prediction |
|------|------------|
| 2004 | Through 2004, 20 percent of enterprise mission-critical applications will experience severe performance problems that could have been avoided by modeling network/application interactions. |
| 2005 | By 2005, U.S. enterprises engaged in e-business will have lost more than $50 billion in potential revenue as a result of network failures. Through 2005, most enterprises will expend at least 25 percent more effort and time than necessary in troubleshooting application and network problems due to failure to use effective monitoring and testing tools. |
| 2006 | By 2006, large-scale backbone network service failures will have increased three-fold over today's levels. |
| 2007 | Through 2007, 70 percent of network upgrades completed will be ones that could have been delayed one to two years by the use of optimization technologies. Through 2007, enterprises that adhere to rigorous change management principles will incur 75 percent fewer network availability problems than enterprises that take a less disciplined approach to change control. |
Set your priorities
After determining the total cost of downtime, the IT department and various lines of business must set the overall service level agreements (SLAs) for the availability of each application. The SLA establishes the acceptable amount of downtime for any given environment. Other SLAs may go beyond recovery time to list applications by user roles, including performance commitments during peak and non-peak usage times.
What is your major malfunction?
| Problem | Cause | Prevention |
|---------|-------|------------|
| Capacity overload | Too many concurrent users accessing an application | Remove this threat by pretesting applications under anticipated usage scenarios. A safe margin for capacity planning is 150 percent of anticipated load. |
| Platform and operating system integration issues | Firmware/driver version incompatibility, among other causes | Thorough pretesting of precise configurations protects against these forms of disruption. |
| Performance bottlenecks | Application issues | Stress testing will help you assess the overall quality of an application. |
| Unacceptable levels of human error | Lack of skill and discipline in the IT staff | Enforce well-documented policies and procedures. Develop IT policies and procedures throughout the planning and testing phases of implementation. |
Identify the culprits
Downtime comes in two varieties: planned and unplanned. Planned downtime refers to controlled systems maintenance, such as a new infrastructure implementation. Planned downtime typically is not factored into overall availability numbers because IT staff can schedule and design it so that it does not affect users.
Most unplanned downtime, however, relates to software and human error. Other causes of downtime are associated with the infrastructure, including hardware failure, network outages, power disruption, and true disasters such as floods, fire, and acts of terrorism. The best way to guard against these risks is through system redundancy.
A metric system
An IT organization should clearly understand the anticipated and actual mean time to failure (MTTF) for all categories of downtime. Business continuance plans should drive MTTF numbers as high as possible within business-cost objectives. In general, a redundant architecture with no single point of failure contributes to a high MTTF.
At the application server level, for example, clustered and load-balanced architectures increase availability by providing complete systems that take over application processing in the event of a failure. By switching services from one server to another, an administrator can perform upgrades and routine maintenance without an interruption in service. Other redundancy factors usually are outside the direct control of the IT staff, such as building power supplies and external Internet service providers (ISPs).
Power redundancy can include the use of uninterruptible power supplies, backup generators, or two separate power sources at a data center. As for ISPs, companies should utilize multiple vendors and thoroughly examine each provider's business continuance plans. Once again, cost will be a determining factor of how much insurance a company requires in those two areas.
Once a failure occurs, another critical measurement of an IT department is mean time to recover (MTTR). For business-critical applications, MTTR numbers—typically measured in seconds, minutes, or hours—should be low. A business should measure recovery time from the point of problem detection, and include the time required to isolate the problem, reload data, and restart the application.
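MTTF and MTTR combine into a single availability figure through the standard reliability-engineering relationship availability = MTTF / (MTTF + MTTR). The formula is textbook material rather than a quote from this article, but it shows why driving MTTF up and MTTR down are two sides of the same goal.

```python
# Availability as a function of mean time to failure (MTTF) and
# mean time to recover (MTTR), per the standard reliability formula.

def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# A system that fails once every 1,000 hours and takes 1 hour to recover:
a = availability(1000, 1)
print(f"availability = {a:.6f} ({a * 100:.3f}% uptime)")
```

Note that the same uptime percentage can come from rare, long outages or frequent, short ones; the SLA discussion earlier determines which profile the business can tolerate.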
Overall, an IT architect must consider how much data the business can afford to lose, and this estimate will drive the cost trade-offs for implementing a data replication and recovery plan. Businesses have many choices, ranging from simple transportation of backed-up media to an off-site location to a fully automated, off-site backup operation. The standardization of storage area networks has made backup and recovery plans—once very costly—manageable solutions.
To be continued
IT departments must consider several other factors when designing highly available application environments. These functional areas require in-depth implementation planning in conjunction with an overall business continuance plan, and include:
- Security design in areas such as virus protection and network security using firewalls, routers, and switches
- Load testing and architectural assurance of system and network bandwidth at all levels to ensure compliance with SLA objectives
- Proactive system monitoring, including problem detection, notification, and resolution
Because there are no sure-fire ways to predict disasters, the only way to guard against them is through preparation. By fostering a thorough understanding of its IT needs, your business can build insurance deep into the IT infrastructure and help ensure that no matter what tomorrow brings—come hail or high water—operations will keep on going.
About the author: Laura Bosworth-Bucher is director of Enterprise Systems Group Solutions Engineering at Dell.
Implementing a successful, cost-effective business continuance plan requires that you know your IT environment from both a technological and business point of view. How can you minimize the negative effects of a service outage?
- Perform a business impact analysis to classify applications by criticality and assign risks, vulnerabilities, and costs of downtime to various functional areas.
- Define your approach, including processes for planned outages, redundancy strategies for unplanned outages, and a disaster recovery implementation that will meet the recovery times and required SLAs of end users.
- Implement a change management policy with comprehensive documentation to control operator errors and maintain the integrity of the IT environment.
- Test continuously. IT environments are constantly evolving. Testing identifies potential issues with changes in technology and processes.
- Review your plan regularly to keep it up to date with changing business objectives.
- Use uptime metrics to ensure that the availability of business-critical applications has increased in accordance with the agreed-upon SLAs.