Contingency Management

   


For a long time, risk assessment and risk management have been topics associated with projects that must be addressed in the project plan. A disaster recovery plan is no different in that it is effectively a project. It involves a number of stages and carries a number of risks. A risk is something that might happen, such as the database server crashing or the computer room being destroyed by fire. Each risk that is identified must be managed; that is, it must be addressed, and the best solution, usually based on money, must be found to resolve it. As with project management, four main options can be applied:

  • Do nothing: The risk is accepted, and the company hopes for the best. This is not an ideal option, but it's an option nonetheless.

  • Avoid the risk: Here, the risk is eliminated completely and no longer presents a problem. For example, because of its location, a business could be deemed to be at risk from major flooding. Housing the computer systems in the basement of the building is considered to be high risk, high impact, and high probability. To avoid this risk, the computers could be moved to the top floor of the building. This might pose other risks, but the risk of flooding no longer exists. A much simpler example from everyday life is the risk that you will get wet if you go out today and it rains. To eliminate the risk, don't go out!

  • Reduce the risk: The concept of risk reduction is to make the likelihood that the risk will materialize much smaller; that is, preventive measures are taken. The risk is not totally eliminated, as with risk avoidance, but the chances of it happening are reduced. For example, a business might have a critical database server that is deemed a high-risk, single point of failure. Deciding to purchase an identical server, locate it at another site, and install the Sun Cluster software is an instance of risk reduction. Even though the server no longer constitutes a single point of failure, it is still possible that both servers could become unavailable at the same time, but it's much less likely. Similarly, one of the servers becoming unavailable has a significantly reduced impact on the operation of the business. In fact, there would be no impact, because the remaining server would keep the company operating while the failed server was being fixed.

    In the real-life example, the risk of getting wet could be reduced by watching the weather forecast or taking an umbrella.

  • Transfer the risk: The remaining option is to pass the risk on to a third party, normally by outsourcing to a supplier or by taking out additional insurance. By using the outsourcing option, the computer systems could be managed by a third-party company in a data center owned and maintained by that company. In this instance, the supplier remains responsible for all aspects of the mission-critical systems, including planning for disaster recovery. The insurance option means that the business will be covered against loss of income as the result of a disaster, with resources (and money) being available immediately following the disaster to ensure that the business survives until normal operating status is resumed.

    The real-life example here would mean that you would get someone else to go out for you and hence take the risk of getting wet on your behalf.
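Because the choice among these four options is usually a matter of money, it can be framed as a cost comparison. The following is a rough sketch only. It assumes a simple model that is not taken from the text: each option has a yearly cost, and the risk that remains is priced as probability multiplied by impact. Every name and figure, including the flood barriers used to illustrate risk reduction, is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str                     # one of the four ways of managing the risk
    annual_cost: float            # hypothetical yearly cost of this option
    residual_probability: float   # chance per year that the risk still occurs
    residual_impact: float        # loss to the business if it does occur


def total_annual_cost(option: Option) -> float:
    """Cost of the option plus the expected loss that remains."""
    return option.annual_cost + option.residual_probability * option.residual_impact


def cheapest(options: list[Option]) -> Option:
    """Return the option with the lowest total annual cost."""
    return min(options, key=total_annual_cost)


# Made-up figures for the basement-flooding example.
flood_options = [
    Option("do nothing", 0, 0.10, 500_000),
    Option("avoid (move to top floor)", 20_000, 0.0, 500_000),
    Option("reduce (flood barriers)", 8_000, 0.02, 500_000),
    Option("transfer (insurance)", 15_000, 0.10, 50_000),
]

best = cheapest(flood_options)
print(best.name, round(total_annual_cost(best)))
```

With these invented numbers, risk reduction comes out cheapest, but a different set of estimates could just as easily favor another option. The value of the exercise lies in writing the estimates down, not in the arithmetic.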

Contingency is an essential part of any disaster recovery plan in that it identifies the measures that are to be taken to protect the company. Unlike the discussion earlier in the chapter on disasters, where the focus is clearly on identifying key areas of vulnerability, this aspect focuses on what to do about them to reduce the impact of a disaster. It is intended to ensure, where possible, that the business stands the best possible chance of not only surviving, but also being able to resume normal running at the earliest opportunity or, better still, to continue running throughout.

The remainder of this section looks at potential contingency options for a business and shows how they can be used to maximize the chances of early recovery. They are not discussed in any particular order of importance.

Backups

Chapter 1, "Job Description," mentioned that the system manager is the custodian of the data for the company, and this is an awesome responsibility. Given that responsibility, it is surprising that the backups of a system are often viewed as a chore instead of an essential process.

Backups are crucial to the survival of the business, especially when considering the mission-critical, 24x7 systems in use today. Imagine a disaster striking the company, only for you to find out that the backups haven't actually worked for the last two months, and that even if they had, the tapes were being kept in the computer room, which has just been destroyed.

It is not just the data relating directly to the business activities that needs to be backed up; there might also be legal requirements for the retention of audit trails, or financial information for tax purposes. Remember that these requirements cover several years, and the data may need to be restored during a future investigation or audit.

To guarantee the integrity of the system's data in any way, the system manager must have a reliable and tested backup policy. The media used for the backups must be stored securely, and (probably the most important aspect of all) the backups must be tested at regular intervals. If they are not tested and are subsequently unreadable, then it really doesn't matter whether they are stored in a secure location because they're useless.

Sun Microsystems has a fully scalable product called Solstice Backup that is designed to deliver the best possible protection for a company's data. For example, it will accommodate a small business with a single system (the Server edition), a medium-sized business with a number of clients distributed across the network (the Network edition), or a large enterprise environment with many systems at remote locations (the Power edition). Choosing the correct edition of the software depends on how much data needs to be backed up and how efficiently it is to be done. Solstice Backup supports concurrent devices so that multiple tape drives or jukeboxes, for example, can be written to or read from at the same time, greatly enhancing performance and reducing the time needed for a system backup.

Solstice Backup doesn't just provide a good backup utility; it also includes a Storage Node Module. This module allows the central control server to make use of further tape devices located on remote machines. There are two main advantages to this: The capability is significantly enhanced (as is the performance), and, more important for disaster recovery strategies, it provides an automatic failover facility if the backup devices become unavailable. For systems using relational database management systems, there are add-on database modules for Oracle, Sybase, Informix, and Microsoft SQL Server.

If all this weren't enough, other add-on modules enable Solstice Backup to work in a truly heterogeneous environment, delivering support for clients from various architectures, including desktop PCs (Microsoft Windows 95/98/NT), NetWare systems, a number of other UNIX architectures, and Macintosh.

Figure 7.1 shows a possible backup configuration that would deliver a high level of resilience.

Figure 7.1. The Power edition of Solstice Backup provides not only resilience, but also flexibility in supporting a wide range of architectures, including online protection for databases.


Carrying out the actual backup is only part of the solution. A number of further issues need to be taken into account to guarantee, to any degree, the integrity and recoverability of the data stored electronically. These issues are discussed next.

Devising a Sensible Backup Policy

How often is a backup done? What data or file systems are backed up when one is done? Ideally, a full backup of everything would be done every day, but this is not always a practical option; it could prove a waste of resources and could undermine the reasons for the backup in the first place:

  • The result could be that the system spends the majority of its time executing backups instead of running core business applications.

  • Static data (that is, data and file systems that haven't changed since the last backup) is being backed up, wasting time and media.

  • The real business-critical data might fail to be backed up in favor of noncritical, static data, which could prove to be extremely damaging.

A good, practical strategy is to first back up critical file systems containing operating system data at specified periods, preferably with the system in single-user mode to ensure that there is no activity other than the backup process itself. These backups would normally be carried out using the standard Solaris backup and restore commands (ufsdump and ufsrestore). It is also good practice to keep the operating environment on different disk partitions from the actual data. This makes the restoration of the system software much easier and more manageable because its content rarely changes (unlike the live data).

When it is not practical to carry out a full backup on a daily basis, the best strategy is to do a full backup once a week, supplemented with incremental backups on each of the other days of the week. Remember that if high availability is required, then clustering and RAID configurations should already be implemented, greatly reducing the likelihood that the backup tapes will be needed. In an emergency, however, such as when a whole building is destroyed, the capability to restore the entire system from the backup tapes still exists, and the actual loss will be very small.
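To make the weekly full plus daily incremental scheme concrete, the following sketch models it with ufsdump dump levels, where level 0 is a full dump and a dump at a higher level copies only the files changed since the most recent dump at a lower level. The schedule (level 0 on Sunday, levels 1 through 6 on the following days), the tape device /dev/rmt/0, and the file system /export/home are assumptions chosen for illustration, not taken from the text. The script only builds command strings and works out which dumps a restore would need; it does not run anything.

```python
# Assumed weekly schedule: (day, ufsdump level); level 0 is the full dump.
SCHEDULE = [
    ("Sun", 0), ("Mon", 1), ("Tue", 2), ("Wed", 3),
    ("Thu", 4), ("Fri", 5), ("Sat", 6),
]


def dump_command(level, device="/dev/rmt/0", filesystem="/export/home"):
    """Build a ufsdump command line. The 'u' option records the dump in
    /etc/dumpdates and 'f' names the dump device. The device and file
    system defaults are placeholders."""
    return f"ufsdump {level}uf {device} {filesystem}"


def restore_chain(schedule, last_good_index):
    """Return the dumps needed for a restore, oldest first.

    Walk back from the most recent dump: each dump depends on the most
    recent earlier dump taken at a lower level, until level 0 is reached.
    """
    chain = [schedule[last_good_index]]
    current_level = schedule[last_good_index][1]
    for day, level in reversed(schedule[:last_good_index]):
        if level < current_level:
            chain.append((day, level))
            current_level = level
        if current_level == 0:
            break
    return list(reversed(chain))


for day, level in SCHEDULE:
    print(day, dump_command(level))

# Suppose the disaster strikes after Thursday's backup has completed.
print(restore_chain(SCHEDULE, 4))
```

With the increasing levels assumed here, a restore needs the full dump plus every incremental taken since. Using the same level (for example, 9) on every weekday would instead make each incremental cumulative, so that only the full dump and the latest incremental are needed, at the cost of larger daily backups.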

Media Rotation

After devising a backup regime that is both practical and sensible, it would be a shame to neutralize the effect by continually writing to the same tapes. The result will be unreadable, and therefore worthless, backups. Each backup tape has a finite life expectancy. Determine what this is, and replace the rotating tapes before the specified number of uses is reached. This simple task can significantly reduce the risk of I/O errors on tapes.
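Keeping track of tape wear can be as simple as counting the writes to each tape. The sketch below assumes a hypothetical retirement threshold of 80 writes, chosen to sit below an equally hypothetical rated life of 100 writes; the real figure should come from the media manufacturer. The tape labels are also invented.

```python
RETIRE_AT_WRITES = 80   # placeholder threshold, set below the rated life


class TapePool:
    """Counts how many times each tape in the rotation has been written."""

    def __init__(self, labels):
        self.uses = dict.fromkeys(labels, 0)

    def record_write(self, label):
        """Count one more backup written to this tape."""
        self.uses[label] += 1

    def due_for_replacement(self):
        """Labels of tapes that have reached the retirement threshold."""
        return sorted(t for t, n in self.uses.items() if n >= RETIRE_AT_WRITES)


pool = TapePool(["MON-A", "MON-B"])
for _ in range(80):
    pool.record_write("MON-A")
pool.record_write("MON-B")
print(pool.due_for_replacement())
```

Here only MON-A is reported, because it alone has reached the threshold.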

A further consideration when discussing the rotation of backups is the percentage of backups that are kept permanently. Some companies have a rotation policy of, say, four weeks: Every four weeks, the backup tape is overwritten as part of the backup cycle. The result, for example, is that a file that was accidentally deleted six months ago cannot be restored. Other companies have a policy in which, for example, every fourth full backup is retained permanently (and the entire backup set for that day is replaced with new tapes). The result here is that a file from six months ago can be restored, if required.

The reason this is mentioned is that frequently, when a member of staff either deletes a file accidentally or that file becomes corrupt, there might be a considerable time lapse before it is noticed. A bimonthly report, for example, would require information from the previous two months to actually compile the report, and the data required might have already been recycled.
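The difference between the two rotation policies can be checked with a small simulation. This is a sketch under stated assumptions: one full backup per week, weeks numbered from 1, and a file counted as captured by a backup if it was created on or before that week and deleted after it. The function names are hypothetical.

```python
def available_fulls(current_week, rotation_weeks=4, keep_every=None):
    """Week numbers of the weekly full backups that still exist.

    rotation_weeks: tapes are overwritten after this many weeks.
    keep_every: if set, every Nth full backup is retained permanently.
    """
    kept = set()
    for week in range(1, current_week + 1):
        in_rotation = week > current_week - rotation_weeks
        permanent = keep_every is not None and week % keep_every == 0
        if in_rotation or permanent:
            kept.add(week)
    return sorted(kept)


def can_restore(created_week, deleted_week, fulls):
    """True if some surviving full backup was taken while the file existed."""
    return any(created_week <= week < deleted_week for week in fulls)


# A file created in week 10 and deleted in week 26, noticed in week 52.
rotation_only = available_fulls(52)
with_archive = available_fulls(52, keep_every=4)
print(can_restore(10, 26, rotation_only))
print(can_restore(10, 26, with_archive))
```

Under the four-week rotation the file is gone, whereas retaining every fourth full backup makes it recoverable. Note that the second policy still leaves gaps: a file that existed only between two retained backups (for example, created in week 13 and deleted in week 16) would not be recoverable either.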

Restoring Solstice Backup Software

The operating system backup would include the Solstice Backup software, which is installed as a package, probably in the /opt file system. In the event of a serious failure requiring a full restore, Solstice Backup itself would have to be restored (or reinstalled) before it could recover the rest of the system's data.


If older data of this kind (files that might be needed long after the rotation cycle has overwritten them) is likely to be required, then perhaps a further add-on module to Solstice Backup is the answer: Hierarchical Storage Management (HSM). Data is automatically migrated to the backup device based on specified policies and is also available for recall when needed.

Storage of Backup Media

The beginning of this section mentioned that there is little point in achieving a good backup strategy if the backup media is subsequently destroyed in a fire. This scenario is likely if the tapes are stored at the same location as the computer systems from which the data is taken. A good disaster recovery plan should state that off-site, secure storage is vital for any business-critical data. Any backup tapes that are stored on-site should be kept in a certified fire-proof vault.

One further point about off-site storage is worth mentioning here. If a serious disaster strikes the business, the data stored in an off-site facility must be accessible. Suppose that there is a serious disaster on Friday evening. If the facility's access hours have not been investigated beforehand, it is possible that you would not be able to retrieve the valuable backup tapes until Monday morning.

Scheduled Testing of Backups

Unfortunately, the most common error with regard to backups is that the company is lured into a false sense of security, purely because a backup has been taken of the systems. The real proof of a good backup is the capability to subsequently read the tape and restore the contents of the backup. This aspect is frequently overlooked until it is too late, when it is really needed. A periodic test to ensure that the tapes are readable should be a part of every backup strategy. It is not necessary to restore the entire contents of a tape; selecting a small sample of files will demonstrate its readability.
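A periodic readability test of this kind could be scripted along the following lines. This is a minimal sketch, assuming that a checksum manifest was recorded when the backup was written and that some hook exists to restore a single file from the backup and return its contents; that hook (read_restored) is hypothetical and stands in for whatever restore tool is actually in use. An in-memory dictionary plays the part of the tape.

```python
import hashlib
import random


def checksum(data: bytes) -> str:
    """SHA-256 digest of a file's contents, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_sample(manifest, read_restored, sample_size=5, seed=None):
    """Restore a random sample of files and compare their checksums.

    manifest: path -> checksum recorded when the backup was written.
    read_restored: callable that restores one path from the backup and
        returns its bytes (hypothetical hook; OSError means unreadable).
    Returns the list of sampled paths that failed verification.
    """
    rng = random.Random(seed)
    paths = sorted(manifest)
    sample = rng.sample(paths, min(sample_size, len(paths)))
    failures = []
    for path in sample:
        try:
            ok = checksum(read_restored(path)) == manifest[path]
        except OSError:
            ok = False
        if not ok:
            failures.append(path)
    return failures


# Stand-in data: what was backed up, and a "tape" with one damaged file.
original = {"/etc/passwd": b"root:x:0:0", "/data/orders.db": b"orders"}
manifest = {path: checksum(data) for path, data in original.items()}
tape = dict(original)
tape["/data/orders.db"] = b"corrupted"

print(verify_sample(manifest, tape.__getitem__, sample_size=2))
```

Because both files are sampled here, the damaged one is reported. With a genuinely small sample, a clean result shows that the tape can be read, but it does not prove that every file on it is intact.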

Media Storage Warning

When considering fire-proof safes or vaults, always examine the temperature rating of the safe to ensure that the media is fully protected in case of a fire. It is extremely important to know how cool the safe will keep the media and to ensure that this is within the tolerances of the media itself.


Alternate Site

If a serious disaster, such as a fire, occurs, it may not be possible to use the site, in which case an alternate site is needed. For larger companies, relocating the entire operation at a moment's notice is no easy feat; in some cases, this could prove to be impossible. Disaster recovery terminology defines three types of alternate sites (hot, warm, and cold), depending on the state of readiness:

  • Hot site: This is essentially an alternate site that is ready for business. It includes sufficient hardware, networking, and so on so that it is capable of providing immediate backup support to the business. In a distributed clustered environment, where the operation is mirrored at each node location, the transition to another site is much less painful.

  • Warm site: The warm site is not as prepared as the hot site, but it is partially equipped with hardware and networking. Greater effort is required to achieve operational status, but generally the company could be up and running within 48 hours of a major disaster.

  • Cold site: This could be empty commercial space belonging to the company, or even a mobile trailer, for example, providing electricity, a controlled environment, and communications access. The space is used to install, configure, and operate replacement systems.

As an alternative to providing its own alternate site, a company can make use of a third-party company that specializes in disaster recovery. A prime example for Sun computer networks is Sungard Recovery Services, Inc. (http://www.e-recovery.com), the world leader in enterprise recovery solutions. With a network of so-called megacenters across the United States, Sungard provides continuous support on a 24x7 basis for companies during a disaster situation. Sungard was notably active during the World Trade Center bombing, Hurricane Floyd, and the San Francisco earthquake, to name a few. Its facility in Philadelphia is the largest of its kind in the world, comprising more than 350,000 square feet of usable operations space. Figure 7.2 shows a picture from the Sungard Web site of the second-floor space.

Figure 7.2. The huge expanse of usable space can accommodate the largest requirements for business continuity.


Not only does Sungard provide the space for a company's recovery, but it can also provide the systems, including the complete range of Sun servers, right up to the E10000, Sun's flagship server. Office space for the continuity of business is included in addition to the space for the systems themselves. Sungard also offers a fleet of mobile Metrocenters in most metropolitan areas that can be delivered to the company's site within 48 hours and that accommodate an office environment of approximately 50 people. Figure 7.3 shows what a Metrocenter looks like.

Figure 7.3. The Metrocenter can be deployed virtually anywhere, providing the most flexible solution for a company requiring urgent recovery facilities.


Sungard is ideal for large companies running business-critical applications, requiring continuous 24x7 access. The company has high-availability facilities to ensure, as far as possible, that the business survives the aftermath of a major disaster.

Emergency Replacement and Spares

Of course, an alternate site is necessary only in the event of a major disaster that destroys the current site or makes it unserviceable. An option that provides contingency when components of a server or network fail is to establish an agreement with the provider of the support contract, if it isn't with Sun itself, so that a fast-track replacement of critical components can be delivered. For the simpler field-replaceable units, such as disk modules, controller cards, memory, and so on, it might be desirable to hold a stock of emergency spares.

Replacing Server Components

Only qualified, trained staff should attempt to replace components of a server. Sun Microsystems offers a number of hardware maintenance training courses, although it may still be necessary for a field engineer to attend for component replacement. Consult your hardware support vendor for advice.



   


Solaris System Management (New Riders Professional Library)
ISBN: 073571018X
EAN: 2147483647
Year: 2001
Pages: 101
Authors: John Philcox
