24.9. Everything Starts with the Business

The design of any disaster recovery system should be driven by the ability to make available to the business the critical systems and information required to conduct normal production activities, without making those systems and that information available to the wrong people. We are not protecting this data because it is a school project or an interesting hobby. We are protecting the data because if the data is lost, the ability to conduct business operations is at risk. Thus, it all starts with the business.

24.9.1. Define the Core Competency of the Organization

When looking at data protection, the first question to ask is, "What are the core products and/or services that this organization offers?" followed by "What is the information required to provide that product or service, and what applications are required to effectively use the information?" The answer to the second question is what defines your organization's intellectual property (IP). Without these information systems, the organization would not be able to function.

Many types of information qualify as IP. For example, it could be your version of KFC's 11 herbs and spices, the "secret recipe" that makes your company's product or service different from everyone else's product or service. Of course, without customers that secret sauce is not worth much. All of your customers' locations, names, and contact information are also part of your IP, as are the names of potential customers. Any plans that your company has for doing something different, reaching a different market, or selling to a new group of people are also part of your IP. If it is information that you do not want in the hands of your competitor, it is part of your IP. This broad definition can include many types of information.

Intellectual property is a wonderful thing, but it is not the only important information in your company. Circling around the creation and delivery of your product or service are a number of other systems, such as procurement, payroll, accounts receivable and accounts payable, sales, and customer support. Each system is also critical to your business, and each needs a particular set of information to perform its function.

24.9.2. Prioritize the Business Functions Necessary to Continue the Core Competency

Once you identify all intellectual property and supporting applications and systems, you must prioritize the business functions necessary to continue providing your company's core products or services. This phase is not just about importance; you must also consider urgency, or criticality.

To establish what IP needs to be protected, you must understand the organization's core competencies. Next, you need to prioritize the protection and recovery of these systems should they become unavailable. If the core competency relies on the manufacturing of a product, the systems that keep that process running are vital to the continuation of the business. Other systems, such as email, may not be vital to the core competency and may not require the same level of protection. However, systems supporting customer communications may be critical, in which case the email systems behind them should be treated as critical applications as well. It is important to understand what is vital and critical to the parts of the organization that support the business's core competency, and not just protect whatever data happens to reside on your servers.

24.9.3. Correlate Each System to a Business Function, and Prioritize

Let us consider a power company as an example. If it did not deposit customer payments for a few weeks, few people would notice and fewer would care. Some overly conscientious customers might notice that their checks had not cleared yet, but it would not bother most of them. The company's creditors would notice only if it failed to make payment on a payable account. Even then, the company could probably explain to its creditors that it is in the middle of some sort of emergency, and the creditors would probably hold off the firing squad. However, what would happen if it stopped delivering electricity or gas for just a few minutes? It would be on the evening news, all its business customers would be angry at the impact to them, all the residential customers would have to reprogram their DVD players and microwaves, and the company could potentially cause a rolling blackout, similar to what happened in the U.S. Northeast in the early 2000s. (This happens in some parts of the world on a regular basis.) This means that the company's ability to deliver power is the most critical business function it has: its core competency.

Once you figure out what your IP and supporting systems are, and which ones are critical, you need to figure out where they reside and all of the resources required to use them. Is the information stored in a database? Is it stored in files on a filesystem somewhere? In most cases, the data is going to be stored on some type of computer system. Every computer and storage system must be assigned to a business function based on that business unit's level of criticality, thereby giving that system the same recovery priority as the business function to which it belongs.
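The assignment described above can be sketched in a few lines of Python. This is only an illustration: the business functions, their rankings, and the system names are hypothetical, loosely modeled on the power company example, and a real inventory would come from your own prioritization exercise.

```python
from dataclasses import dataclass

# Hypothetical business functions for a power company.
# Lower number = more critical; the rankings are illustrative assumptions.
FUNCTION_PRIORITY = {
    "power delivery": 1,
    "customer support": 2,
    "accounts payable": 3,
    "payment deposits": 4,
}


@dataclass(frozen=True)
class System:
    name: str               # hypothetical computer or storage system name
    business_function: str  # the business function this system supports


def recovery_order(systems):
    """Sort systems by the priority of the business function they serve."""
    return sorted(
        systems,
        key=lambda s: (FUNCTION_PRIORITY[s.business_function], s.name),
    )


systems = [
    System("billing-db", "payment deposits"),
    System("grid-control", "power delivery"),
    System("ap-ledger", "accounts payable"),
    System("call-center-crm", "customer support"),
]

for system in recovery_order(systems):
    print(FUNCTION_PRIORITY[system.business_function], system.name)
```

Each system carries no priority of its own; it inherits the priority of the function it belongs to, which is the point of the exercise.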

A great example of this type of prioritization can be found in a publication of the U.S. Federal Communications Commission. It shows the FCC's different types of data and their criticality, and it is published at http://www.fcc.gov/webinventory/. Interestingly enough, its most critical systems are those required by law or presidential decree. It lists mission critical as the next level of criticality, followed by frequently requested data, and other data. For reference purposes, most companies use the term "mission critical" to describe their most important systems. In this case, the FCC has acknowledged that without governmental decree, it would have no mission. Therefore, it has another level higher than mission critical. The important thing to learn here is that each industry and company is different, and you must perform this prioritization of business functions specifically for your organization.

24.9.4. Define RPO and RTO for Each Critical System

Your recovery time objective, or RTO, is how quickly you want the system to be recovered. RTOs can range from zero seconds to many days, or even weeks. Each application serves a business function, so the question is how long you can live without that function. If the answer is that you cannot live without it for one second, then you have an RTO of zero seconds. If the answer is that you can live without it for two weeks, you have an RTO of two weeks.

The recovery point objective, or RPO, defines the point in time that is reflected once you have recovered a system, also referred to as how much data you can afford to lose. Consider two examples: customer orders and system logs. If you lose one customer order, the company loses significant revenue. Therefore, many companies determine that they cannot lose any customer orders. That means they have an RPO of zero seconds for customer orders. On the other hand, system logs might be useful only when troubleshooting problems or when auditing systems. If you lose several days of them, it is a problem only if you need to troubleshoot a problem or audit the logs from that time period. However, if you acknowledge that the time period is lost anyway (due to a disaster), it is more important to just get the order system running immediately; the logging system is not as time-sensitive as the order system. Therefore, you can lose one week of system logs without really losing anything critical to the business. That means there is a one-week RPO for system logs.

Once you have established a priority for each system and determined the various outages that you are going to protect against, you must create an RTO and RPO for each system that is to be protected. Most of your customers really do not care what causes an outage or delay, so a system's RTO and RPO should be the same for every disaster type in all but the most extreme events, like a catastrophic earthquake affecting an entire region. For some systems, however, you may find that a longer RPO and RTO are acceptable or unavoidable for major disasters.
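One way to make these objectives concrete is to record them per system and test a proposed protection scheme against them. The following is a minimal sketch, not a definitive method. The RPO values for customer orders and system logs come from the examples above; the RTO values, the nightly backup interval, and the eight-hour restore time are assumptions chosen for illustration. It also assumes that worst-case data loss equals the time between backups.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    system: str
    rto: timedelta  # how long the business can live without the system
    rpo: timedelta  # how much recent data the business can afford to lose


def meets_objective(objective, backup_interval, restore_time):
    """Check a protection scheme against a system's RTO and RPO.

    Assumes worst-case data loss equals the time between backups,
    and that restore_time is a measured or estimated value.
    """
    return backup_interval <= objective.rpo and restore_time <= objective.rto


# RPOs follow the text; the RTOs are hypothetical.
orders = RecoveryObjective("customer orders", rto=timedelta(hours=1), rpo=timedelta(0))
logs = RecoveryObjective("system logs", rto=timedelta(weeks=2), rpo=timedelta(weeks=1))

nightly = timedelta(days=1)
print(meets_objective(logs, nightly, restore_time=timedelta(hours=8)))    # True
print(meets_objective(orders, nightly, restore_time=timedelta(hours=8)))  # False
```

A nightly backup comfortably covers the logs but cannot satisfy a zero-second RPO, which is why systems such as order entry usually need something beyond traditional backup.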

24.9.5. Create Consistency Groups

It is often necessary to recover several systems to the same point in time. This is primarily caused by applications that pass data to one another. Consider a manufacturing company with the business processes of sales and custom manufacturing. There are possibly several different computer systems involved in this process, including the customer, orders, procurement, and manufacturing databases. If this business has a customer expecting a product, hopefully all four systems know that. What would happen, for example, if it was manufacturing a custom product and lost the original order, or it had the order, but didn't know which customer it went to? What would happen if it took the order, but the manufacturing database became corrupted, and the company did not know it was supposed to be making a product? This would represent a serious integrity problem.

Therefore, if your company has several systems that perform related business processes, those systems need to be in the same consistency group. In addition to determining an RTO and RPO, you must identify those systems that are related to each other because they need to be recovered to the same point in time. It is also important to identify a consistency window, or a window of time during which none of the affected systems are changing.

If your consistency window is larger than (or the same as) your backup window, it's relatively easy to meet the consistency requirement for a consistency group. For example, a 5 p.m. to 8 a.m. window is a 15-hour consistency window. If all systems are down between 5:00 p.m. and 8:00 a.m. (no new data is being created), and backups start and finish sometime between 5 p.m. and 8 a.m., it does not matter if System A is backed up at 10:00 p.m. and System B is backed up at 2:00 a.m. They will still be in sync.
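The check in this example can be sketched as follows. The window and the backup start times follow the example; the calendar date and the backup durations are assumptions added so that the window can cross midnight cleanly.

```python
from datetime import datetime, timedelta


def group_is_consistent(backups, window_start, window_end):
    """Return True if every backup starts and finishes inside the window."""
    return all(
        window_start <= start and end <= window_end
        for start, end in backups.values()
    )


# The 5 p.m. to 8 a.m. window from the text; the calendar date is arbitrary.
window_start = datetime(2024, 1, 1, 17, 0)
window_end = datetime(2024, 1, 2, 8, 0)

# Start times follow the text; the durations are assumed for illustration.
backups = {
    "System A": (datetime(2024, 1, 1, 22, 0), datetime(2024, 1, 1, 23, 30)),
    "System B": (datetime(2024, 1, 2, 2, 0), datetime(2024, 1, 2, 3, 0)),
}

print(window_end - window_start)  # 15:00:00
print(group_is_consistent(backups, window_start, window_end))  # True
```

Because no data changes anywhere in the group during the window, it does not matter where inside the window each backup falls.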

However, if your consistency window is too short to back up all systems in a consistency group, you need to do one of two things. One option is to create a custom backup window for those systems and ensure that they back up within that window. This option is certainly preferable because it does not significantly complicate your backup system. If your consistency window is too short for this approach, the second option is to augment your backup system with snapshots or business continuance volumes (BCVs). They allow you to quickly create a "virtual backup" on disk of several systems within a few seconds and then later convert the virtual backup to a physical backup (that is, back up the snapshot to tape or virtual tape).
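The decision described in the last two paragraphs can be written as a small function. This is a sketch under stated assumptions: the function name, its parameters, and the returned labels are hypothetical, and `group_backup_time` stands for the time needed to back up every system in the consistency group.

```python
from datetime import timedelta


def choose_strategy(consistency_window, backup_window, group_backup_time):
    """Pick an approach for backing up a consistency group.

    consistency_window: time during which no system in the group changes
    backup_window:      the normal window in which backups run
    group_backup_time:  time needed to back up all systems in the group
    """
    if consistency_window >= backup_window:
        # Any time inside the regular window yields a consistent set.
        return "regular backup window"
    if group_backup_time <= consistency_window:
        # Schedule the group's backups inside the quiet period.
        return "custom backup window"
    # The quiet period is too short; capture it on disk first.
    return "snapshot, then back up the snapshot"


print(choose_strategy(timedelta(hours=15), timedelta(hours=15), timedelta(hours=6)))
print(choose_strategy(timedelta(hours=4), timedelta(hours=10), timedelta(hours=3)))
print(choose_strategy(timedelta(minutes=5), timedelta(hours=10), timedelta(hours=3)))
```

The order of the checks mirrors the preference stated above: the simplest option that meets the consistency requirement wins, and snapshots are the fallback.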

24.9.6. Determine for Each Critical System What to Protect from

Once the business functions are prioritized, and each system is assigned to a business function, it is time to identify the things that can happen that trigger a recovery scenario. "Disasters" come in many forms. First, create a list of the different levels and types of disasters that are likely for your area and type of business.

The Disaster Recovery Institute states that each company should define its own levels of disasters. I've listed the way I define them, which starts with a loss of a single system.


Level 1 disasters are those that take out an entire application or server:

  • Disk or disk array outage

  • Internal corporate sabotage

  • Electronic terrorism (denial-of-service attack)

  • Disgruntled employee attack

Level 2 disasters are those that take out an entire data center:

  • Building fire or flood

  • Natural disasters (hurricane, tornado, earthquake)

  • Building condemnation (chlorine gas leak)

  • Physical terrorism (bomb)

  • Really disgruntled employee

  • Loss of all network connectivity

  • Loss of all electrical power

Level 3 disasters take out an entire campus, city, or metropolitan area:

  • Large-scale natural disasters

  • Physical terrorism (bombing of power plant)

  • Act of war

24.9.7. Determine the Costs of an Outage

Once you have determined all of the different types of disasters and their associated probability, you must assign a cost to each type of disaster, for each type of system. For example, if a fire took out your test server for a week, your cost may be nothing. However, if a fire took out a server that you deemed in the previous exercise to be mission critical, a loss of only a few minutes may cost you millions, depending on the level of criticality and the business you are in.
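One common way to combine probability and cost is to weight the cost of each disaster by how likely it is in a given year. The text does not prescribe this formula; it is a conventional risk-assessment shortcut shown here as a sketch. All probabilities, costs, and system names below are hypothetical.

```python
# Hypothetical yearly probability of each disaster type.
PROBABILITY = {
    "disk array outage": 0.2,
    "building fire": 0.01,
}

# Hypothetical cost of one occurrence, per (disaster type, system).
COST = {
    ("disk array outage", "test server"): 0,
    ("disk array outage", "order system"): 50_000,
    ("building fire", "test server"): 0,
    ("building fire", "order system"): 2_000_000,
}


def expected_annual_loss(system):
    """Sum probability-weighted outage costs for one system."""
    return sum(
        PROBABILITY[disaster] * cost
        for (disaster, name), cost in COST.items()
        if name == system
    )


for name in ("test server", "order system"):
    print(name, round(expected_annual_loss(name)))
```

Keeping the cost table keyed by both disaster type and system makes the asymmetry obvious: the same fire costs nothing on a test server and a great deal on a mission-critical one.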

Such costs can come from a number of areas, starting with the loss of business. While your systems are down, you are not taking orders, making your product, or delivering your service. Another cost is a loss of reputation, which can result in the loss of future business. No company wants to be on CNN because it lost its data. Labor costs must also be added to the equation, and there are two kinds of labor costs. The first labor cost is when data was created and then lost; work has to be redone. The second type of labor cost is the opportunity cost of labor for workers who are not doing anything useful because the system is down. Depending on the type of business you are in, there may also be the loss of source materials used to create your product. For example, if you are a food maker of some sort, and the automation control systems are down, your ingredients may expire before the product is completed.
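The quantifiable parts of this list can be added up in a simple model. The sketch below is illustrative only: every field name and figure is an assumption, and loss of reputation is left out because it is hard to express as a number.

```python
from dataclasses import dataclass


@dataclass
class OutageCost:
    lost_revenue_per_hour: float  # orders not taken, product not made
    idle_workers: int             # staff who cannot work while systems are down
    hourly_wage: float            # average loaded labor cost per hour
    rework_hours: float           # labor needed to recreate lost data
    spoiled_materials: float      # source materials lost (e.g., ingredients)

    def total(self, outage_hours):
        """Total cost of an outage lasting outage_hours."""
        lost_business = self.lost_revenue_per_hour * outage_hours
        idle_labor = self.idle_workers * self.hourly_wage * outage_hours
        rework_labor = self.rework_hours * self.hourly_wage
        return lost_business + idle_labor + rework_labor + self.spoiled_materials


# Hypothetical figures for a small food manufacturer.
plant = OutageCost(
    lost_revenue_per_hour=10_000,
    idle_workers=50,
    hourly_wage=40,
    rework_hours=200,
    spoiled_materials=5_000,
)

print(plant.total(outage_hours=4))  # 61000
```

Note that this model treats lost business and idle labor as proportional to outage length, which is a simplification; the two labor costs are kept separate because one depends on how long the system is down and the other on how much data was lost.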

Another concept to think about when calculating the cost of an outage is that outage costs are not linear: they tend to grow faster the longer the outage lasts. Some costs can be avoided if the outage is minimal. A five-minute outage, for example, can be overlooked as a nice break for your employees in the middle of a busy workday. However, as the outage gets longer and longer, other contingency plans go into effect and the costs of the outage start increasing. Before you know it, companies that count on you for your product start looking elsewhere, and then things get really out of control.

24.9.8. Plan for All Types of Disasters

Many companies attempt to perform risk assessments of all possible disaster situations, determining for each the likelihood that it will happen to any particular data center. For example, coastal regions usually prepare for hurricanes or tsunamis. In the U.S., parts of Texas, Oklahoma, Kansas, and Nebraska have had so many tornadoes that they call it "Tornado Alley." Other parts of the world are more susceptible to earthquakes. Do not dismiss any particular type of disaster with a "that will never happen" type statement. Murphy's Law will find you. The disaster you do not prepare for is the one that will strike you.

Whereas natural disaster or malicious actions seem to be the most obvious cause of outage, accidental causes are probably most common. While people frequently accidentally delete files or misplace them, complete data loss has been caused by power outages when construction workers in the area cut power lines to the data center. Other common occurrences include water damage from broken water pipes and software upgrades gone awry. You should undertake some research to consider all of the possible types of disasters and make sure your disaster recovery plans take each of them into account.

24.9.9. Prepare for Cost Justification

Once you begin the process of selecting data protection systems, you need to justify the cost of each purchase. To be successful in doing so, you must have completed the steps mentioned previously in this section:

  1. Define your RTOs, RPOs, backup windows, and consistency group requirements.

  2. Determine for each critical system what to protect against.

  3. Determine the cost of each outage.

  4. Plan for all types of disasters.

Once you have accomplished this, justifying the cost of each data protection system should be a relatively easy thing to do. You simply need to state your required RTOs and RPOs, what you're protecting against, and what a system to protect against those things costs. If any part of the system is turned down, you simply need to explain how that affects your ability to meet these requirements.




Backup & Recovery
Backup & Recovery: Inexpensive Backup Solutions for Open Systems
ISBN: 0596102461
Year: 2006
Pages: 237
