How to Protect Your Computing and Telecom Resources
By Ray Horak
Disasters are a fact of life. They are unpredictable and predictable. There's nothing we can do about them and everything we can do about them. How well we minimize their impact on our business and our lives depends on logic, planning, effort and, ultimately, money.
Disaster planning stretches a broad continuum. At one end is "do nothing." At the other end is complete duplication of everything we do and every system we use, from moment to moment, in two or three separate places.
We know something will happen to our business at some time. It could be in the form of an earthquake, a flood, a fire, an electricity outage, or a hacker attack. Let's take a look at Disaster Recovery Planning, also known as Business Continuity Planning. Planning is the operative word, for it makes the difference between catastrophe and continuity. The planning process comprises risk assessment, criticality assessment, loss assessment, and cost assessment.
Risk Assessment is the process of assessing the risk of failure of a network and its subnetworks, down to the level of the individual network elements that make them up. It includes all forces that might cause such a failure, whether forces of nature, man, or machine. Risks can be categorized from low to high. The risk of a system failure often can be quantified in terms of MTBF (Mean Time Between Failures). The failures themselves can be categorized as either hard or soft. A hard failure is a total, or catastrophic, system failure. A soft failure is a performance failure, or degradation in performance, that falls short of total failure.
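MTBF is commonly combined with a repair-time figure to estimate availability. The sketch below assumes MTTR (Mean Time To Repair), a measure this article does not otherwise discuss, and uses made-up figures. The function names are illustrative only.

```python
HOURS_PER_YEAR = 8760  # 365 days; ignores leap years


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Assumed figures: one failure every 10,000 hours, 4 hours to repair.
    a = availability(10_000, 4)
    print(f"availability: {a:.5f}")
    print(f"downtime: {downtime_hours_per_year(a):.2f} hours/year")
```

A figure like this covers hard failures only. Soft failures, where the system keeps running in a degraded state, would need a separate performance measure.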
Criticality Assessment is the assessment of the importance of individual computing and communications resources, including computer systems and their resident databases, telecommunications systems, and voice and data networks and subnetworks. This assessment also includes groups of users, the applications on which they rely, and the business functions in which they are engaged. Levels of criticality might include critical or essential, important, and non-critical or non-essential. Or, the levels of criticality might be defined as very high, high, moderate, and low. Further assessment of criticality might establish an acceptable recovery window, i.e., the length of time that the business can survive or thrive in the event of the total failure of a specific resource or function. Uppermost in this overall assessment must be the very nature of the core business. Call centers, for example, cannot tolerate even short-lived failures of telecommunications systems or networks. Similarly, financial institutions cannot tolerate even short-lived failures of computer systems or networks. Airlines and courier services cannot tolerate either.
Loss Assessment is the process of assessing the cost to the business of the total failure of a specific resource or function. Losses clearly are sensitive to the criticality of the resource or function, and generally are sensitive to the length of the failure, or recovery window. For example, some critical functions might withstand a failure of 15 minutes, and some non-critical functions might withstand a failure of up to 30 days, which might be equivalent to a billing cycle.
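The pairing of criticality levels with recovery windows can be captured in a small lookup table. This is a minimal sketch: the 15-minute and 30-day values echo the examples above, while the "important" value and all names are assumptions made for illustration.

```python
from datetime import timedelta

# Acceptable recovery window per criticality level.
RECOVERY_WINDOWS = {
    "critical": timedelta(minutes=15),    # example from the text
    "important": timedelta(hours=8),      # assumed for illustration
    "non-critical": timedelta(days=30),   # example from the text
}


def exceeds_recovery_window(level: str, outage: timedelta) -> bool:
    """Return True if an outage outlasts the level's acceptable window."""
    return outage > RECOVERY_WINDOWS[level]


if __name__ == "__main__":
    print(exceeds_recovery_window("critical", timedelta(minutes=20)))
    print(exceeds_recovery_window("non-critical", timedelta(days=7)))
```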
Cost assessment, of course, is the final step in the development of a disaster recovery plan. The costs of implementing alternative business continuity plans must be considered in the context of the risks of failure, the criticality of various resources, and the potential losses arising from such failures over time. Striking this balance is the essence of optimization, and business continuity plans must be optimized.
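One simple way to picture that balance is to add each plan's yearly cost to the loss it is expected to leave uncovered, then pick the lowest total. The sketch below is an assumed model, not a method prescribed here. The plan names, costs, failure rates, and recovery times are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Plan:
    name: str
    annual_cost: float        # yearly cost of implementing the plan
    failures_per_year: float  # expected failures per year (risk)
    recovery_hours: float     # expected outage per failure under this plan


def total_annual_cost(plan: Plan, loss_per_hour: float) -> float:
    """Plan cost plus the expected yearly loss from outages."""
    expected_loss = plan.failures_per_year * plan.recovery_hours * loss_per_hour
    return plan.annual_cost + expected_loss


def cheapest_plan(plans: Sequence[Plan], loss_per_hour: float) -> Plan:
    """Return the plan with the lowest total expected yearly cost."""
    return min(plans, key=lambda p: total_annual_cost(p, loss_per_hour))


# Hypothetical plans and figures, for illustration only.
PLANS = [
    Plan("do nothing", 0, 2, 48),
    Plan("warm standby", 50_000, 2, 4),
    Plan("hot standby", 200_000, 2, 0.1),
]

if __name__ == "__main__":
    for loss_per_hour in (100, 1_000, 100_000):
        print(loss_per_hour, cheapest_plan(PLANS, loss_per_hour).name)
```

With these assumed figures, a business that loses 100 per hour of outage is best served by doing nothing, one that loses 1,000 per hour by the warm standby, and one that loses 100,000 per hour by the hot standby. The loss per hour, which reflects criticality, drives the choice.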
Once the organization has assessed risk, criticality and loss, it's in a position to consider measures designed to prevent failures, and measures designed to recover from failures in the event that the preventative measures fail. All of this is in the context of cost, of course.
Barriers are designed to prevent disasters. Unfortunately, barriers are few in number, being limited to things like mechanical and electronic locks and deadbolts, electrical surge protectors, and software firewalls.
Backups are designed to assist in the recovery from failures. Redundancy translates into resiliency, which allows a business to snap back from either a hard or a soft failure. A hard failure forces a business to exercise a backup. A soft failure, on the other hand, affords the organization the choice of limping along for a period of time while the problem is diagnosed and corrected, or exercising the backup at any time.
Here are some of the things we can do to protect our businesses from disasters and to recover from them quickly. It's an abbreviated checklist. Entire books have been written on the subject.
Electrical continuity is critical, as most, if not all, network elements are electrically powered. The criticality of reliable electrical power became quite clear in the U.S. as a result of the power shortages during the summer of 2001. Surge protection is absolutely essential to protect network elements from voltage spikes caused by unclean power and lightning. Grounding is very basic, but it is worth noting that a great many system failures, both hard and soft, are due to improper electrical grounding. Power supplies are redundant in fault tolerant computers and in carrier class switches and routers. UPS systems are always a good idea, and may comprise both battery backup and backup power generators. At the very least, a UPS system provides enough time to shut systems down gracefully. At the very most, backup generators may provide power indefinitely. Grid diversity, perhaps the ultimate in electrical continuity, involves access to multiple electrical grids, often through multiple utility companies.
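Whether a battery-backed UPS allows a graceful shutdown comes down to simple arithmetic. The following is a rough sketch under stated assumptions: runtime is estimated as usable battery energy divided by load, and the efficiency and safety margin values are assumed defaults. Real runtime also depends on battery age, temperature, and discharge rate, so the vendor's runtime charts should take precedence.

```python
def ups_runtime_minutes(battery_wh: float, load_watts: float,
                        inverter_efficiency: float = 0.9) -> float:
    """Rough runtime estimate: usable battery energy divided by load."""
    if load_watts <= 0:
        raise ValueError("load must be positive")
    if not 0 < inverter_efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return battery_wh * inverter_efficiency / load_watts * 60


def can_shut_down_gracefully(runtime_minutes: float,
                             shutdown_minutes: float,
                             margin: float = 2.0) -> bool:
    """True if runtime covers the shutdown time multiplied by a margin."""
    return runtime_minutes >= shutdown_minutes * margin


if __name__ == "__main__":
    # Assumed figures: 1,000 Wh of battery feeding a 500 W load.
    runtime = ups_runtime_minutes(1000, 500)
    print(f"estimated runtime: {runtime:.0f} minutes")
    print("graceful shutdown possible:",
          can_shut_down_gracefully(runtime, shutdown_minutes=10))
```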
System continuity clearly is critical. Fault tolerant computers and carrier class switches and routers routinely are highly redundant at the component level. Clearly, there are wide ranges of redundancy at the system level, and wide ranges in associated costs.
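The payoff from component redundancy can be estimated with the standard series and parallel availability formulas. This sketch assumes that components fail independently of one another, which is rarely exactly true in practice, and the availability figures are invented.

```python
import math
from typing import Iterable


def series_availability(parts: Iterable[float]) -> float:
    """All parts must work: multiply the individual availabilities."""
    return math.prod(parts)


def parallel_availability(parts: Iterable[float]) -> float:
    """Any one part suffices: one minus the chance that all fail at once."""
    return 1.0 - math.prod(1.0 - a for a in parts)


if __name__ == "__main__":
    single_supply = 0.99  # assumed availability of one power supply
    print(f"one supply:  {single_supply:.4f}")
    print(f"two supplies: {parallel_availability([single_supply] * 2):.4f}")
```

The same formulas show the cost side of the trade-off: each added redundant component buys a smaller improvement than the one before it.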
Data continuity is extremely important. Application programs and files should be backed up routinely to ensure the continuity of the business itself. Contemporary backup media options run the range from floppy disks to magnetic tapes and CD-ROMs. SANs (Storage Area Networks) routinely make use of highly redundant storage systems including RAID (Redundant Array of Inexpensive Disks). The ultimate in data backup involves storing backed-up programs and files at a separate site.
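A backup is only useful if the copy matches the original. The sketch below copies a file to a backup directory, which stands in for a separate site, and verifies the copy with a checksum. The paths and function names are hypothetical, and a real off-site backup would add scheduling, retention, and transport to the remote location.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir and verify the copy by checksum."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copies contents and file metadata
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"backup of {source} failed verification")
    return target
```

A call such as `back_up(Path("ledger.dat"), Path("/mnt/offsite"))` would return the path of the verified copy. Both paths here are placeholders.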
Access continuity ensures that network access is available on a highly reliable basis. Access continuity includes loop diversity and media diversity. Loop diversity involves multiple levels. Entrance diversity involves multiple points of loop entrance to a building. Pair diversity involves access via diverse pairs in a multi-pair cable system. Cable diversity entails access via multiple cables, which may be UTP (Unshielded Twisted Pair) or fiber in nature. (Note: SONET standards for fiber optic systems specify as many as four fibers.) Path diversity requires that the local loops follow multiple, diverse physical paths between the network edge and the customer premises. Aside from the inherent redundancy of SONET systems, loop diversity is unusual in all but the most critical application scenarios. Media diversity involves the use of several media. In the event that the primary access medium fails, the backup medium can be initialized, perhaps with little, if any, disruption in service. Wireless media (e.g., microwave and infrared) routinely are used as backups for wired media (e.g., UTP and optical fiber). Some service providers offer microwave systems as a backup to infrared, and others offer infrared as a backup to LMDS (Local Multipoint Distribution Services), which is RF-based.
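The failover logic behind media diversity is a priority list: try the primary medium, then each backup in turn. This is a minimal sketch in which the health checks are stand-in functions. In practice they would be replaced by whatever link monitoring the equipment provides.

```python
from typing import Callable, Optional, Sequence, Tuple

# A link is a name plus a function that reports whether it is usable.
Link = Tuple[str, Callable[[], bool]]


def select_link(links: Sequence[Link]) -> Optional[str]:
    """Return the first usable link in priority order, or None."""
    for name, is_up in links:
        if is_up():
            return name
    return None


if __name__ == "__main__":
    # Hypothetical states: the fiber is cut, the microwave backup is up.
    links = [
        ("optical fiber", lambda: False),
        ("microwave", lambda: True),
    ]
    print(select_link(links))
```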
Transport continuity ensures that continuity of connectivity is maintained within the core of the service provider's network. This level of continuity typically is supported through the use of optical fiber transport systems based on SONET standards. SONET, as I mentioned above, specifies redundant fibers. In the core of the carrier networks, this generally is in the form of a 4FBLSR (Four Fiber Bidirectional Line Switched Ring). Carriers generally provide connectivity assurances in the form of guaranteed service restoral windows, as stated in SLAs (Service Level Agreements). Frame Relay connectivity can be protected through the use of backup PVCs (Permanent Virtual Circuits), which generally are available at discounted cost. SVCs (Switched Virtual Circuits), while inherently redundant, generally are not available from service providers. IP (Internet Protocol) networks are inherently redundant. Further, IP's inherently connectionless nature uses that redundancy to the fullest.
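Why a ring survives a single cut can be shown with a toy model: traffic between two nodes can travel either way around the ring, so one failed span still leaves a path. This sketch illustrates the topology only and does not model actual SONET protection switching. The node names are made up.

```python
from typing import FrozenSet, List, Optional, Sequence, Set


def ring_path(nodes: Sequence[str], src: str, dst: str,
              failed_spans: Optional[Set[FrozenSet[str]]] = None
              ) -> Optional[List[str]]:
    """Find a path around the ring that avoids failed spans, else None."""
    failed = failed_spans or set()
    if dst not in nodes:
        raise ValueError(f"unknown node: {dst}")
    start = list(nodes).index(src)  # raises ValueError if src is unknown
    n = len(nodes)
    for step in (1, -1):  # try one direction, then the other
        path, i, blocked = [src], start, False
        while nodes[i] != dst:
            j = (i + step) % n
            if frozenset((nodes[i], nodes[j])) in failed:
                blocked = True
                break
            path.append(nodes[j])
            i = j
        if not blocked:
            return path
    return None


if __name__ == "__main__":
    ring = ["A", "B", "C", "D"]
    cut = {frozenset(("A", "B"))}
    print(ring_path(ring, "A", "C"))        # no failures
    print(ring_path(ring, "A", "C", cut))   # span A-B is cut
```

A second cut on the remaining path isolates the two nodes, which is why rings protect against single failures rather than all failures.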
Carrier diversity involves the use of multiple carriers for a given service. Access to the circuit-switched PSTN (Public Switched Telephone Network), for example, might make use of a LEC (Local Exchange Carrier) for local calling purposes, and an IXC (IntereXchange Carrier) for long distance. In the event of a failure in the LEC network, the IXC typically can be used for local calling, and vice versa. Carrier diversity is relatively easily accomplished in the contemporary competitive environment, but is all too rarely employed.
Network diversity involves the use of multiple network services, ideally through diverse carriers. For example, ISDN routinely is used as a backup for Frame Relay. Dial-up modem access through the PSTN routinely is used as a backup for Frame Relay and other data network services.
Site diversity takes several forms. Distributed vs. Centralized Operations: Distributed operations at multiple sites may involve additional costs, but certainly are less susceptible to catastrophic failure than are centralized operations at a single site. Mirrored Operations: Critical data centers and call centers, for example, often are mirrored. That is to say that an exact copy of the center is maintained at a backup site for standby purposes. Such an exact copy includes systems, applications, files, and networks. A hot standby is always powered up, and ready instantaneously, meaning that data is processed and files are maintained at the hot standby site concurrently (i.e., in parallel) with the operational sites. A warm standby is ready to fire up in a short period of time, perhaps once files are updated. A cold standby may require that some applications be updated, for example, which takes a bit longer. A number of companies are in the business of providing mirrored data centers, which sometimes are shared by multiple companies with the same general system and network configuration. Also, some companies provide backup call center capabilities on an outsourced basis.
Think of business continuity planning as a form of business insurance. Most businesses wouldn't even consider operating without medical insurance, liability insurance, fire and flood insurance, or vehicle insurance. Neither should they even consider doing without loss-of-business insurance, most policies for which require the existence of a well-developed disaster recovery plan. It's not uncommon for a company to spend as much as 10% of its annual IT budget on the development and implementation of such a plan. It's just one more cost of doing business, and it's money well spent.