For a long time, risk assessment and risk management have been topics associated with projects that must be addressed in the project plan. A disaster recovery plan is no different in that it is effectively a project. It involves a number of stages and carries a number of risks. A risk is something that might happen, such as the database server crashing or the computer room being destroyed by fire. Each risk that is identified must be managed; that is, it must be addressed, and the best solution, usually judged on cost, must be found to resolve it. As with project management, four main options can be applied:
Contingency is an essential part of any disaster recovery plan in that it identifies the measures that are to be taken to protect the company. Unlike the discussion earlier in the chapter on disasters, where the focus is clearly on identifying key areas of vulnerability, this aspect focuses on what to do about them to reduce the impact of a disaster. It is intended to ensure, where possible, that the business stands the best possible chance of not only surviving, but also being able to resume normal running at the earliest opportunity or, better still, to continue running throughout.

The remainder of this section looks at potential contingency options for a business and shows how they can be used to maximize the chances of early recovery. They are not discussed in any particular order of importance.

Backups

Chapter 1, "Job Description," mentioned that the system manager is the custodian of the data for the company, and this is an awesome responsibility. Given that responsibility, it is amazing that the backups of a system are often viewed as a chore instead of an essential process. Backups are crucial to the survival of the business, especially when considering the mission-critical, 24x7 systems in use today. Imagine a disaster striking the company, only to find out that the backups haven't actually worked for the last two months, and that even if they had, the tapes were being kept in the computer room, which has just been destroyed.

It is not just the data relating directly to the business activities that needs to be backed up; there might also be legal requirements for the retention of audit trails, or financial information for tax purposes. Remember that these requirements cover several years, and the data may need to be restored during a future investigation or audit. To guarantee the integrity of the system's data in any way, the system manager must have a reliable and tested backup policy.
The media used for the backups must be stored securely, and (probably the most important aspect of all) the backups must be tested at regular intervals. If they are not tested and are subsequently unreadable, then it really doesn't matter whether they are stored in a secure location because they're useless.

Sun Microsystems has a fully scalable product called Solstice Backup that is designed to deliver the best possible protection for a company's data. For example, it will accommodate a small business with a single system (the Server edition), a medium-sized business with a number of clients distributed across the network (the Network edition), or a large enterprise environment with many systems at remote locations (the Power edition). Choosing the correct edition of the software depends on how much data needs to be backed up and how efficiently this must be done. Solstice Backup supports concurrent devices so that multiple tape drives or jukeboxes, for example, can be written to or read from at the same time, greatly enhancing performance and reducing the time needed for a system backup.

Solstice Backup doesn't just provide a good backup utility; it also includes a Storage Node Module. This module allows the central control server to make use of further tape devices located on remote machines. There are two main advantages to this: The capability is significantly enhanced (as is the performance), and, more important for disaster recovery strategies, it provides an automatic failover facility if the backup devices become unavailable. For systems using relational database management systems, there are add-on database modules for Oracle, Sybase, Informix, and Microsoft SQL Server.
As if all this weren't enough, other add-on modules enable Solstice Backup to work in a truly heterogeneous environment, delivering support for clients from various architectures, including desktop PCs (Microsoft Windows 95/98/NT), NetWare systems, a number of other UNIX architectures, and Macintosh. Figure 7.1 shows a possible backup configuration that would deliver a high level of resilience.

Figure 7.1. The Power edition of Solstice Backup provides not only resilience, but also flexibility in supporting a wide range of architectures, including online protection for databases.
Carrying out the actual backup is only part of the solution. A number of further issues need to be taken into account to guarantee, to any degree, the integrity and recoverability of the data stored electronically. These issues are discussed next.

Devising a Sensible Backup Policy

How often is a backup done? What data or file systems are backed up when one is done? Ideally, a full backup of everything would be done every day, but this is not always a practical option; it could prove a waste of resources and could undermine the reasons for the backup in the first place:
A good, practical strategy is to first back up critical file systems containing operating system data at specified periods, preferably with the system in single-user mode to ensure that there is no activity other than the backup process itself. These backups would normally be carried out using the standard Solaris backup and restore commands (ufsdump and ufsrestore). It is also good practice to keep the operating environment on different disk partitions from the actual data. This makes the restoration of the system software much easier and more manageable because its content rarely changes (unlike the live data).

When it is not practical to carry out a full backup on a daily basis, the best strategy is to do a full backup once a week, supplemented with incremental backups on each of the other days of the week. Remember that if high availability is required, then clustering and RAID configurations should already be implemented, greatly reducing the likelihood that the backup tapes will be needed. In an emergency, however, such as when a whole building is destroyed, the capability to restore the entire system from the backup tapes exists, and the actual loss will be very small.

Media Rotation

After devising a backup regime that is both practical and sensible, it would be a shame to neutralize the effect by continually writing to the same tapes. The result will be unreadable, and therefore worthless, backups. Each backup tape has a finite life expectancy. Determine what this is, and replace the rotating tapes before the specified number of uses is reached. This simple task can significantly reduce the risk of I/O errors on tapes.

A further consideration when discussing the rotation of backups is the percentage of backups that are kept permanently. Some companies have a rotation policy of, say, four weeks: every four weeks, the backup tape is overwritten as part of the backup cycle. The result, for example, is that a file that was accidentally deleted six months ago cannot be restored.
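The weekly full backup with daily incrementals, and the effect of a fixed rotation cycle, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the full backup is assumed to run on Sunday, each incremental is assumed to capture changes since the previous day's backup, and the function names (backup_type, restore_chain, still_on_tape) are hypothetical rather than part of Solaris or Solstice Backup.

```python
from datetime import date, timedelta

# Assumption: the weekly full backup runs on Sunday (Monday is 0, Sunday is 6).
FULL_BACKUP_WEEKDAY = 6


def backup_type(day: date) -> str:
    """Return "full" on the weekly full-backup day, otherwise "incremental"."""
    return "full" if day.weekday() == FULL_BACKUP_WEEKDAY else "incremental"


def restore_chain(target: date) -> list[tuple[date, str]]:
    """List the backups needed to restore the system as of `target`.

    The chain starts at the most recent full backup on or before `target`
    and includes every incremental taken after it, up to `target`.
    """
    start = target
    while backup_type(start) != "full":
        start -= timedelta(days=1)
    days = (target - start).days
    return [
        (start + timedelta(days=n), backup_type(start + timedelta(days=n)))
        for n in range(days + 1)
    ]


def still_on_tape(backed_up_on: date, today: date, cycle_weeks: int = 4) -> bool:
    """True if a backup written on `backed_up_on` has not yet been overwritten.

    Assumes every tape is reused exactly `cycle_weeks` weeks after it was
    written and that no backup is retained permanently.
    """
    return timedelta(0) <= today - backed_up_on < timedelta(weeks=cycle_weeks)


if __name__ == "__main__":
    wednesday = date(2024, 1, 10)
    # A restore to Wednesday needs Sunday's full backup plus three incrementals.
    for day, kind in restore_chain(wednesday):
        print(day.isoformat(), kind)
    # A backup taken six months earlier is long gone under a four-week cycle.
    print(still_on_tape(date(2023, 7, 10), wednesday))
```

The longer the chain returned by restore_chain, the more tapes must be readable for a full recovery, which is one reason the full backup is repeated weekly rather than monthly.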
Other companies have a policy in which, for example, every fourth full backup is retained permanently (and the entire backup set for that day is replaced with new tapes). The result here is that a file from six months ago can be restored, if required. This matters because, when a member of staff deletes a file accidentally or a file becomes corrupt, a considerable time can pass before anyone notices. A bimonthly report, for example, needs information from the previous two months to be compiled, and the data required might have already been recycled. If this kind of data is likely to be required, then perhaps a further add-on module to Solstice Backup is the answer: Hierarchical Storage Management (HSM). Data is automatically migrated to the backup device based on specified policies and is also available for recall when needed.

Restoring Solaris Backup Software
The operating system backup would include the Solstice Backup software, which is installed as a package, probably in the /opt file system. In the event of a serious failure requiring a full restore, Solstice Backup itself would have to be restored (or reinstalled) before it could recover the rest of the system's data.

Storage of Backup Media

The beginning of this section mentioned that there is little point in achieving a good backup strategy if the backup media is subsequently destroyed in a fire. This scenario is likely if the tapes are stored at the same location as the computer systems from which the data is taken. A good disaster recovery plan should state that off-site, secure storage is vital for any business-critical data. Any backup tapes that are stored on-site should be kept in a certified fire-proof vault.

One further point about off-site storage is worth mentioning here. If a serious disaster strikes the business, the data stored in an off-site facility must be accessible.
Suppose that there is a serious disaster on Friday evening: if access arrangements have not been investigated in advance, you might not be able to retrieve the valuable backup tapes until Monday morning.

Scheduled Testing of Backups

Unfortunately, the most common error with regard to backups is that the company is lured into a false sense of security purely because a backup of the systems has been taken. The real proof of a good backup is the capability to subsequently read the tape and restore its contents. This aspect is frequently overlooked until the backup is really needed, when it is too late. A periodic test to ensure that the tapes are readable should be a part of every backup strategy. It is not necessary to restore the entire contents of a tape; selecting a small sample of files will demonstrate its readability.

Media Storage Warning
When considering fire-proof safes or vaults, always examine the temperature rating of the safe to ensure that the media is fully protected in case of a fire. It is extremely important to know how cool the safe will keep the media and to ensure that this is within the tolerances of the media itself.

Alternate Site

If a serious disaster, such as a fire, occurs, it may not be possible to use the site, in which case an alternate site is needed. For larger companies, relocating the entire operation at a moment's notice is no easy feat; in some cases, this could prove to be impossible. Disaster recovery terminology defines three types of alternate sites, depending on the state of readiness: hot, warm, and cold sites:
As an alternative to providing its own alternate site, a company can make use of a third-party company that specializes in disaster recovery. A prime example for Sun computer networks is Sungard Recovery Services, Inc. (http://www.e-recovery.com), the world leader in enterprise recovery solutions. With a network of so-called megacenters across the United States, Sungard provides continuous support on a 24x7 basis for companies during a disaster situation. Sungard was notably active during the World Trade Center bombing, Hurricane Floyd, and the San Francisco earthquake, to name a few. Its facility in Philadelphia is the largest of its kind in the world, comprising more than 350,000 square feet of usable operations space. Figure 7.2 shows a picture from the Sungard Web site of the second-floor space.

Figure 7.2. The huge expanse of usable space can accommodate the largest requirements for business continuity.
Not only does Sungard provide the space for a company's recovery, but it can also provide the systems, including the complete range of Sun servers, right up to the E10000, Sun's flagship server. Office space for the continuity of business is included in addition to the space for the systems themselves. Sungard also offers a fleet of mobile Metrocenters in most metropolitan areas that can be delivered to the company's site within 48 hours and that accommodate an office environment of approximately 50 people. Figure 7.3 shows what a Metrocenter looks like.

Figure 7.3. The Metrocenter can be deployed virtually anywhere, providing the most flexible solution for a company requiring urgent recovery facilities.
Sungard is ideal for large companies running business-critical applications that require continuous 24x7 access. The company has high-availability facilities to ensure, as far as possible, that the business survives the aftermath of a major disaster.

Emergency Replacement and Spares

Of course, an alternate site is necessary only in the event of a major disaster that destroys the current site or makes it unserviceable. An option that provides contingency when components of a server or network fail is to establish an agreement with the provider of the support contract, if it isn't Sun itself, so that a fast-track replacement of critical components can be delivered. For the simpler field-replaceable units, such as disk modules, controller cards, memory, and so on, it might be desirable to hold a stock of emergency spares.

Replacing Server Components
Only qualified, trained staff should attempt to replace components of a server. Sun Microsystems offers a number of hardware maintenance training courses, although it may still be necessary for a field engineer to attend for component replacement. Consult your hardware support vendor for advice.