3.6 Data Protection

The rise of powerful workstations interconnected by high-bandwidth networks, together with enterprise applications that must be extended to users regardless of location, has generated vast amounts of data that must be protected. Often, the amount of data to be manipulated exceeds the capacity of a single workstation (or the available network bandwidth). This has given rise to the client-server model of computing, in which the client initiates requests and the server assumes much of the data processing load. Notably, mainframes and midrange computers have also taken on the role of servers.

Although protecting the vast amount of information stored on computer systems and networks has always been a concern of IT managers, it is even more pressing in the client-server environment, where terabytes of data may be stored at one or more servers that comprise a data warehouse. Losing access to this data, much of it mission-critical, could put a company at a severe competitive disadvantage. Even the inability to access requested data in a timely fashion can adversely affect productivity and the response to dynamic market conditions.

The server is the most vulnerable resource on a LAN. When a server crashes, it can not only cut off access to crucial data but also corrupt or destroy that data. For this reason, fault tolerance is becoming an increasingly important issue in the LAN world. As LANs take on crucial tasks that were once assigned to mainframes, recovery mechanisms that keep the server and its data alive through disk crashes and power outages are becoming a requirement. While there are some software-only and hardware-only solutions to choose from, the most effective way to achieve a high degree of fault tolerance is with a combination of both.

3.6.1 Hardware Solutions

Hardware in a fault-tolerant server must be duplicated so that there is an alternate hardware component that can carry on after a failure. Such redundancy extends to the server’s CPUs, ports, network interfaces, memory, disks, tapes, and I/O channels. This duplication is often implemented via a “hot-standby” solution, in which a complete duplicate system is used. The secondary system does nothing but monitor the tasks of the primary system, in most cases duplicating its processing. That way, when a component in the primary system fails, the secondary system is prepared to take over where the primary system left off.

There are obvious disadvantages to this method: twice the amount of hardware must be purchased, with half of it remaining idle at any given time. The only way to cost-justify such a purchase may be to think of it as an insurance policy.

Another way to achieve fault tolerance is to have all hardware components function all the time, but with a load-balancing mechanism that reallocates the work to surviving components when a failure occurs. This arrangement requires a sophisticated operating system that continually monitors the system for errors and dynamically reconfigures the system upon sensing performance problems.

3.6.2 Software Solutions

For complete fault tolerance, software must react to hardware component failures. This allows the system to take advantage of duplicated hardware in order to keep the server running. It also ensures that the data is always available and uncorrupted.

The software monitors the system for failures and, when one occurs, switches control from the failed primary component to its backup. The extent to which control is switched depends on whether the system is a hot-standby or a load-balancing system. With a hot-standby system, complete control is given to the standby system when a fatal error is detected, regardless of which component failed.

With a load-balancing system, control is given to the backup only for the element that failed. Software designed for load-balancing fault-tolerant systems must be highly sophisticated, with the ability to orchestrate complex switching of control in the event of failure, and to do so with complete reliability. If a CPU fails, a secondary CPU gains control, but the primary I/O controller, disk controller, and disk drive remain in service, and the software must reallocate processes (such as the file server software) to the active processors. When an I/O channel fails, the software must reroute disk access and communication I/O to an alternate channel. If the bus connecting the processors fails, it must reroute inter-CPU communications to another bus. If a disk crashes, it must be able to switch to a backup disk drive.
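
The difference between the two switchover strategies can be suggested with a short sketch. This is a minimal illustration under assumed class and component names, not an implementation from any vendor's fault-tolerant product.

```python
# Hypothetical sketch contrasting the two switchover strategies described
# above. The component roles and class names are illustrative only.

class HotStandbySystem:
    """On any fatal error, the complete standby system takes over."""

    def __init__(self, primary, standby):
        self.primary = primary      # e.g., a whole duplicate server
        self.standby = standby
        self.active = primary

    def on_failure(self, failed_component):
        # Regardless of which component failed, control passes wholesale.
        self.active = self.standby


class LoadBalancingSystem:
    """Only the failed element is replaced; surviving elements keep working."""

    def __init__(self, pairs):
        # pairs maps a role (e.g., "cpu", "io_channel", "disk") to an
        # (active, backup) tuple of components.
        self.pairs = dict(pairs)

    def on_failure(self, role):
        active, backup = self.pairs[role]
        self.pairs[role] = (backup, active)   # reroute work to the survivor


# Example: a CPU failure switches only the CPU role in the load-balancing
# case, while the hot-standby case hands over the entire system.
lb = LoadBalancingSystem({"cpu": ("cpu-0", "cpu-1"), "disk": ("disk-0", "disk-1")})
lb.on_failure("cpu")            # disk-0 remains in service
hs = HotStandbySystem("server-A", "server-B")
hs.on_failure("cpu")            # server-B now carries the whole load
```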

Software has another function: to assure that files remain available and uncorrupted during a failure and that the server can pick up where it left off upon system recovery. Several mechanisms are available to provide optimum data availability and integrity, including the following:

  • Disk mirroring provides constant availability despite media failures. Disk mirroring allows all file updates to be written to two disks at the same time. If one disk (or the I/O channel to it) fails, the information can be read from and written to the other disk (a simple sketch of this behavior follows the list).

  • To maintain constant file availability, mirroring mechanisms allow tape backups of disk data to be made while updates continue. After a failure, users should still have disk access while data is rebuilt.

  • Disk recovery after a failure requires the ability to bring the new secondary disk to the current state of the primary disk, so that the data is once again protected. To maintain availability, disk recovery should be performable on-line, without taking the system down and depriving users of access.
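
As referenced in the first bullet above, the mirrored-write behavior can be sketched in a few lines of Python. The Disk and MirroredVolume classes are illustrative assumptions, not part of any actual server operating system.

```python
# Illustrative sketch of disk mirroring: every update goes to both disks,
# and reads fall back to the surviving disk if one fails.

class Disk:
    def __init__(self):
        self.blocks = {}
        self.failed = False

    def write(self, block_no, data):
        if self.failed:
            raise IOError("disk failure")
        self.blocks[block_no] = data

    def read(self, block_no):
        if self.failed:
            raise IOError("disk failure")
        return self.blocks[block_no]


class MirroredVolume:
    def __init__(self, primary, secondary):
        self.disks = [primary, secondary]

    def write(self, block_no, data):
        # The update is written to both disks; the volume stays usable
        # as long as at least one write succeeds.
        ok = 0
        for disk in self.disks:
            try:
                disk.write(block_no, data)
                ok += 1
            except IOError:
                pass
        if ok == 0:
            raise IOError("both mirrors failed")

    def read(self, block_no):
        # Read from whichever mirror is still healthy.
        for disk in self.disks:
            try:
                return disk.read(block_no)
            except IOError:
                continue
        raise IOError("both mirrors failed")
```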

Depending on the level of fault tolerance desired and the price the organization is willing to pay, the server may be configured in several ways: unmirrored, mirrored, or duplexed.

Unmirrored Servers

An unmirrored server configuration entails the use of one disk drive, channel, controller, power supply, and interface cabling. This is the basic configuration of most servers. The advantage is chiefly one of cost: the user pays only for one storage system. The disadvantage of this configuration is that a failure in either the drive or any associated component could cause temporary or permanent loss of the stored data.

Mirrored Servers

The mirrored server configuration entails the use of two hard disks of similar size. There is also a single channel over which the two disks can be mirrored together. In this configuration, all data written to one disk is then automatically copied onto the other disk. If one of the disks fails, the other takes over, thus protecting the data and assuring all users of access to the data. The server’s operating system issues an alarm notifying the network manager that one of the mirrored disks is in need of replacement.

The disadvantage of this configuration is that both disks use the same channel and controller. If a failure occurs on the channel or controller, both disks become inoperative. Because the same disk channel and controller are shared, the writes to the disks must be performed sequentially—that is, after the write is made to one disk, a write is made to the other disk. This can degrade overall server performance under heavy loads.

Disk Duplexing

In disk duplexing, multiple disk drives are installed with separate channels for each set of drives. If a malfunction occurs anywhere along a channel, normal operation continues on the remaining channels and drives. Because each disk uses a separate channel, write operations are performed simultaneously, offering a performance advantage over servers using disk mirroring.

Disk duplexing also offers a performance advantage in read operations. Read requests are issued to both drives; the drive that is closest to the requested information responds, and the request issued to the other drive is canceled. In addition, the duplexed disks can share multiple read requests, allowing concurrent access.
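
To make the contrast with mirroring concrete, the sketch below issues writes to two drives over separate channels in parallel and lets reads be answered by whichever drive responds first. It is a minimal sketch under assumed class names and a simple threading model, not how any particular disk controller works.

```python
# Illustrative sketch of disk duplexing: with a separate channel per drive,
# writes are issued concurrently and a read is satisfied by whichever drive
# answers first.

from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

class Drive:
    def __init__(self):
        self.blocks = {}

    def write(self, block_no, data):
        self.blocks[block_no] = data

    def read(self, block_no):
        return self.blocks[block_no]


class DuplexedVolume:
    def __init__(self, drive_a, drive_b):
        self.drives = [drive_a, drive_b]              # each on its own channel
        self.pool = ThreadPoolExecutor(max_workers=2)

    def write(self, block_no, data):
        # Both channels carry the write at the same time, unlike mirroring
        # over a single shared channel, where the writes must be sequential.
        futures = [self.pool.submit(d.write, block_no, data) for d in self.drives]
        for f in futures:
            f.result()                                # surface any drive error

    def read(self, block_no):
        # Both drives are asked; the first to answer wins and the request
        # to the other drive is discarded.
        futures = [self.pool.submit(d.read, block_no) for d in self.drives]
        done, pending = wait(futures, return_when=FIRST_COMPLETED)
        for f in pending:
            f.cancel()
        return next(iter(done)).result()


volume = DuplexedVolume(Drive(), Drive())
volume.write(1, b"payroll records")
print(volume.read(1))
```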

The disadvantage of disk duplexing is the extra cost for multiple disk drives (also required for disk mirroring), as well as for the additional disk channels and controller hardware. However, the added cost of these components must be weighed against the replacement cost of lost information plus the costs that accrue from the interruption of critical operations and lost business opportunities. Faced with these consequences, an organization may find that the investment of a few hundred or even a few thousand dollars to safeguard valuable data is negligible.

3.6.3 Redundant Arrays of Inexpensive Disks

One method of data protection that is growing in popularity is the redundant array of inexpensive disks (RAID). Instead of risking all of its data on one high-capacity disk, the organization distributes the data across multiple smaller disks, offering protection from a crash that could wipe out all data on a single, shared disk. Other benefits of RAID include the following:

  • Increased storage capacity per logical disk volume;

  • High data transfer or I/O rates that improve information throughput;

  • Lower cost per megabyte of storage;

  • Improved use of data center floor space.

RAID products can be grouped into the following categories:

  • RAID Level 0. These products are technically not RAID products at all, since they do not offer parity or error correction data to provide redundancy in the event of system failure. Although data striping is performed, it is accomplished without fault tolerance. Data is simply striped (written) block-by-block across all the drives in the array. Since all disks seek in parallel, seek performance is greatly improved, but there is no way to reconstruct data if one of the drives fails.

  • RAID Level 1. These products duplicate data that is stored on separate disk drives. Also called mirroring, this approach ensures that critical files will be available in case of individual disk drive failures. Each disk in the array has a corresponding mirror disk and the pairs run in parallel. Blocks of data are sent to both disks at the same time. While highly reliable, Level 1 is costly because every drive requires its own mirror drive, which doubles the hardware cost of the system.

  • RAID Level 2. These products distribute the code used for error detection and correction across additional disk drives. The controller includes an error-correction algorithm, which enables the array to reconstruct lost data if a single disk fails. As a result, no expensive mirroring is required. But the code requires that multiple disks be set aside to do the error-correction function. Data is sent to the array one disk at a time.

  • RAID Level 3. These products store user data in parallel across multiple disks. The entire array functions as one large logical drive. Its parallel operation is ideally suited to supporting imaging applications that require high data transfer rates when reading and writing large files. RAID Level 3 is configured with one parity (error-correction) drive. The controller determines which disk has failed by using additional check information recorded at the end of each sector. However, because the drives do not operate independently, every time an image file must be retrieved all of the drives in the array are used to fulfill that request. Other users are put into a queue.

  • RAID Level 4. These products store and retrieve data using independent writes and reads to several drives. Error correction data is stored on a dedicated parity drive. In RAID Level 4, data striping is accomplished in sectors, not bytes (or blocks). Sector-striping offers parallel operation in that reads can be performed simultaneously on independent drives, which allows multiple users to retrieve image files at the same time. While multiple reads are possible, multiple writes are not because the parity drive must be read and written to for each write operation.

  • RAID Level 5. These products interleave user data and parity data, which are then distributed across several disks. Because data and parity codes are striped across all the drives, there is no need for a dedicated parity drive. This configuration is suited for applications that require a high number of I/O operations per second, such as transaction processing tasks that involve writing and reading large numbers of small data blocks at random disk locations. Multiple writes to each disk group are possible because write operations do not have to access a single common parity drive.

  • RAID Level 6. These products improve reliability by implementing drive mirroring at the block level so that data is mirrored on two drives instead of just one. This means that up to two drives in the array can fail without loss of data. If a drive in a RAID 5 array fails, for instance, data must be rebuilt from the parity information spanned across the drives. With RAID 6, however, the data is simply read from the mirrored copy of the blocks found on the various striped drives; no rebuilding is required. Although this results in a slight performance advantage, it requires at least 50% more disk capacity to implement.

  • RAID Level 10. Some vendors offer hybrid products that combine the performance advantages of RAID 0 with the data availability and consistent high performance of RAID 1 (also referred to as striping over a set of mirrors). This method offers high performance and high availability for mission-critical data.

There are other hybrid RAID solutions. RAID Level 30, for example, is achieved by striping across a number of RAID Level 3 subarrays. RAID 30 generally provides better performance than RAID 3 due to the addition of RAID 0 striping, but is not as efficient as RAID Level 0. RAID Level 50 is achieved by striping across a number of RAID Level 5 subarrays. RAID 50 generally provides performance better than RAID 5 due to the addition of RAID 0 striping. Although not as efficient as RAID Level 0, it provides better fault tolerance than the single RAID Level 5.
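
The parity protection used by RAID Levels 3 through 5 rests on exclusive-OR arithmetic: the parity block is the XOR of the data blocks in a stripe, so any single lost block can be rebuilt from the survivors. The following minimal sketch illustrates the arithmetic only; it is not any vendor's RAID implementation.

```python
# Illustrative sketch of the XOR parity used by RAID Levels 3-5: the parity
# block is the XOR of the data blocks in a stripe, so any one lost block can
# be rebuilt from the remaining blocks.

def xor_blocks(blocks):
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

# A stripe of three data blocks plus one parity block (a RAID 3/4-style layout).
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_blocks)

# Suppose the drive holding the second block fails.
surviving = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(surviving)

assert rebuilt == data_blocks[1]   # the lost block is recovered from parity
```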

Businesses today have multiple data storage requirements. Depending on the application, performance may be valued more than availability; at other times, the reverse may be true. It is now common for the same application to handle different data structures (text, image, audio) in different parts, each with its own performance and availability needs. Until recently, the choice among specific RAID solutions involved trade-offs between cost, performance, and availability; once installed, a configuration could not be changed to accommodate the different storage needs of applications that might arise in the future. Vendors have responded with storage solutions that support a mix of RAID levels (hybrids) simultaneously.

Individual disk drives or groups of drives can now be configured via a PC-based resource manager for high performance, high availability, or as an optimized combination of both. This solves a classic data storage dilemma: meeting the exacting requirements of multiple current applications, while staying flexible enough to adapt to changing needs.

3.6.4 Server Blades

Some applications require the deployment of hundreds of servers in a coordinated manner, which can be quite expensive using conventional chassis-based server configurations. The now-familiar process of rack-mounting servers has been the most common approach to assembling large numbers of servers. Newer “blade” technology extends the concept of rack-mounting to allow hundreds of ultra-thin servers to be vertically mounted into a single rack. Each blade is a complete computing system, with processor, memory, network connections, and associated electronics on a single motherboard. Each blade can be configured to support a different operating system or application accessible from the LAN or WAN.

The blades slide into slots on a specially designed rack. While the server blade contains the essential processing and sometimes storage components, the rack unit provides the network and external storage connections, significantly reducing cabling and space requirements. Over 300 blades can fit into a standard-sized rack. And if a blade fails, it can be swapped out simply by pulling it and inserting a replacement.

Server blades have a lot of appeal for IT organizations looking for space-efficient solutions with higher levels of serviceability, scalability, and manageability. Installing, servicing, and removing blades is much easier than working with chassis-mounted servers. Shared power supply, cabling, fans, and storage reduce the number of redundant and “failable” components in the environment. Instead of equipping each server with redundant power supplies, for example, a single set of power supplies is shared by all the rack-mounted blades. Management tools allow IT managers to readily monitor, configure, and troubleshoot systems.

The ability to add and remove components quickly allows IT managers not only to deal with outages more efficiently than in a traditional network setup, but also to adjust to fluctuations in traffic. For example, the management interface allows server blades to be set up to handle transaction software that is heavily used during the business day and then perform other tasks during other periods of the day. In addition, all blades in the system can be managed as a whole, or they can be assigned to different users, customers, or partners and managed separately.

3.6.5 Automated Operations

With the right management tools, network backup can be automated under centralized control. Such tools are a virtual necessity for mixed-vendor environments. They can go a long way toward lowering operating and resource costs by reducing time spent on backup and recovery. Some tools even implement unattended network backup, eliminating operator intervention and further reducing costs.

These tools enhance media management by providing overwrite protection, log file analysis, media labeling, and the ability to recycle backup media. In addition, the scheduling capabilities of some tools relieve the operator of the time-consuming tasks of tracking, logging, and rescheduling network and system backups. Another tool is journaling, which is the capability of a system to keep a history of database transactions. In the event of a disk failure or unscheduled interrupt, the system administrator can replay the transaction journal in an attempt to isolate the cause of the problem. Another useful tool is data compression, which reduces media costs by increasing media capacity. This feature also increases backup performance across a network by reducing the traffic load.
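
Journaling, as described above, can be pictured as an append-only log of transactions that is replayed against the last good copy of the data after a failure. The key-value model and file format below are simplifying assumptions for illustration, not the design of any particular backup product.

```python
# Illustrative sketch of transaction journaling: each update is appended to a
# journal before being applied, so after a disk failure or unscheduled
# interrupt the journal can be replayed against the last good backup.

import json

class JournaledStore:
    def __init__(self, journal_path):
        self.data = {}
        self.journal_path = journal_path

    def put(self, key, value):
        # Write-ahead: record the transaction in the journal first.
        with open(self.journal_path, "a") as journal:
            journal.write(json.dumps({"key": key, "value": value}) + "\n")
        self.data[key] = value

    def replay(self, backup):
        # After a failure, start from the backup and reapply the journal,
        # which also lets an administrator inspect the transaction history.
        self.data = dict(backup)
        with open(self.journal_path) as journal:
            for line in journal:
                entry = json.loads(line)
                self.data[entry["key"]] = entry["value"]


store = JournaledStore("transactions.log")
store.put("order-1001", "shipped")
# ... disk failure ...
recovered = JournaledStore("transactions.log")
recovered.replay(backup={})        # start from the last good backup and reapply
```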

When these tools are integrated with high-level management platforms, such as Hewlett-Packard’s OpenView, problems or errors that occur during automated network backup are reported to the OpenView management console. The console operator is notified of the problem or error via a color change in the respective backup application symbol on the OpenView map. By clicking on the symbol, the operator can directly access the network backup application to determine the cause of the problem or correct the error, enabling the backup operation to resume.

The trend in backup systems is toward increasing their levels of intelligence. Backup systems must not only ensure that files are backed up, but that they are easily located and restored. Systems intelligence has already progressed to the point where the user need not know the tape, the location on the tape, or even the name of a lost file in order to restore it.

Increasing levels of automation are facilitating the backup of very large networks. From expert software that determines what files to back up, to automated tape changers that select, load, and unload tapes without operator intervention, backup is becoming as transparent and easily manageable as file sharing. Backup software is also supporting more types of applications, including on-line processing. As LAN backup systems move forward in intelligence, automation, application diversity, and media options, LANs can fulfill their potential as the primary means of corporate data management.

3.6.6 Off-Site Data Storage

Mission-critical data should be backed up daily or weekly and stored off site. There are numerous services that provide off-site storage, often in combination with hierarchical storage management techniques. In the IBM environment, for example, this might entail storing frequently used data on a direct access storage device (DASD) for immediate usage, whereas data only used occasionally might go to optical drives, and data that has not been used in several months would be archived to a tape library.
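
The hierarchical storage management approach described here amounts to classifying data by how recently it was used. The tier names below mirror the IBM-environment example above, but the thresholds are assumptions chosen purely for illustration.

```python
# Illustrative sketch of hierarchical storage management: data is placed on a
# tier according to how recently it was used. Thresholds are assumptions,
# not from any particular product.

from datetime import datetime, timedelta

def choose_tier(last_access, now=None):
    now = now or datetime.now()
    age = now - last_access
    if age <= timedelta(days=7):
        return "DASD"            # frequently used: immediate access
    if age <= timedelta(days=90):
        return "optical"         # occasionally used
    return "tape library"        # not used in several months: archive

print(choose_tier(datetime.now() - timedelta(days=200)))   # -> tape library
```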

Carriers, computer vendors, and third-party service firms offer vault storage for secure, off-site data storage of critical applications. Small companies need not employ such elaborate methods. They can back up their own data and have it delivered by overnight courier for storage at a secure location, or bring it to a bank safety deposit box. The typical bank vault can survive even a direct hit by a tornado.

In addition to backing up critical data, it is advisable to register all applications software with the manufacturer and keep the original program disks in a safe place at a different location. This minimizes the possibility of both copies being destroyed in the same catastrophe. Manuals and supplementary documentation should also be protected, as should the software licenses.

For companies that can afford it, a mirror site is a good alternative to a backup tape system. It contains copies of applications and data, perhaps located at the other end of a leased line that runs miles away from the main site, so a natural or manmade disaster is unlikely to strike both sites simultaneously. If one site goes down, the other site takes over. Mirror sites are becoming more popular with high-volume e-commerce sites that cannot risk even a few minutes of downtime. The mirror site can be as large as a data center, or merely a rack-mounted server at a carrier’s collocation facility.

Both carrier-neutral and carrier-specific collocation facilities are commonplace. A collocation facility is simply a secure environment that brings together the equipment and lines of multiple service providers and customers. Space is leased in the form of cages, cabinets, and racks into which customers place their own equipment. Customers pay a fixed monthly charge for the space. For small companies, this arrangement is often more economical than having to set up and maintain their own secure environment.

Some service providers offer site recovery options. This type of service is meant to deal with the loss of a primary data center that runs mission-critical applications. If an organization’s data center suffers from a catastrophic fire or natural disaster, for example, traffic will be quickly rerouted to another comparably equipped site. When disaster strikes, the customer calls the carrier and requests activation of links to the alternate site, a process that may take about two hours to complete and which may entail the uploading of new routing tables to each router to reflect the changes. This service is far more economical than having to set up and maintain a live data center and supporting infrastructure.


