Storage systems must be designed with backup and recovery in mind. All too often, backup and recovery are afterthoughts, bolted onto primary storage systems after the fact. This has led to a number of problems.
Good system design can alleviate these problems. In the same way that products are designed for manufacture, storage and server systems must be designed for backup and restore from the start.
Recovery Time Objective and Recovery Point Objective

When designing backup systems, two important metrics must be considered: Recovery Time Objective (RTO) and Recovery Point Objective (RPO). RTO is the amount of time within which a system must be back up and running after a disaster. In enterprise data centers, the RTO is a matter of a few hours to mere minutes. RPO is the point in time to which data must be restored. An RPO may require that data be restored up to one hour before the failure, for example. In other situations, it may be acceptable to restore data to the close of business on the previous day. Different systems and types of data often have dissimilar RTOs and RPOs. The recovery metrics set for a departmental file server are often different from those set for an enterprisewide database.

Internal DAS Backup

The simplest backup method is to mount a backup drive internally within an individual server or other computer. The drive is embedded in the computer, and software writes a copy of the primary storage drive's data to its media on a regular basis. The drive can use almost any media, including tape, CD-ROM/RW, another hard drive, or a flash memory device. Tape is the usual media for a server, whereas desktop PCs often use other types of media, such as CD-RW.

The advantage of internal DAS backup is that it is easy to build, configure, and manage. It also doesn't create network congestion problems, because no data is sent over the LAN. Internal backups are self-contained and have less chance of being interrupted by forces outside the server.

On the other hand, internal DAS backup places a heavy strain on the computer's memory, I/O bus, system bus, and CPU. It is not unusual for the load on the server to get so high that it can't function at all during the backup. Fine-tuning the backup software and setting a proper backup window are critical and worth the time.

Managing lots of separate media is also a chore and can lead to costly mistakes, a task that becomes especially difficult when servers are spread out over a wide area. There is a tendency to leave media in the drives and just let the backup happen automatically. This causes each previous day's data to be overwritten. Not changing media ensures that data can be restored only from the most recent backup. If it takes several days to discover that a critical file was accidentally deleted or corrupted, there will no longer be a credible backup of the file. Unless there is another version on a different computer, the data is lost forever. This defeats the purpose of doing backups.

A similar problem arises when the amount of data to be backed up exceeds the capacity of a single target medium. The backup will fail if there is no one around to change the media. It is important to have spare media available and equally important to have someone to load them into the drive. When it is impractical to run around and change media, look to network backups or DAS systems that can change media automatically.
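A minimal sketch of the internal DAS model, assuming the backup drive is a local disk mounted at /backup (the paths and schedule are illustrative, not a recommendation). Writing a date-stamped archive each night avoids the media-overwrite trap described above, because the previous day's copy is never reused.

    import datetime
    import pathlib
    import tarfile

    def nightly_internal_backup(source_dirs, backup_mount="/backup"):
        """Write a date-stamped archive of source_dirs to a locally mounted
        backup drive, so each night's run does not overwrite the previous one."""
        stamp = datetime.date.today().isoformat()
        target = pathlib.Path(backup_mount) / f"backup-{stamp}.tar.gz"
        with tarfile.open(target, "w:gz") as archive:
            for directory in source_dirs:
                archive.add(directory)
        return target

    # Example (assumed paths): back up home directories and configuration files.
    # nightly_internal_backup(["/home", "/etc"])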
External DAS Backup

When the amount of data becomes large or the backup resource load becomes too disruptive, it is time to look to external DAS solutions. With this type of architecture, the backup unit is mounted in its own chassis, external to the computer being backed up, and attached by a cable. The internal DAS model is extended past the computer chassis and into a separate unit. The most common technology used for this is parallel SCSI: a SCSI Host Bus Adapter is mounted in the server, one end of a cable is attached to it, and the far end is attached to the backup unit. The backup software sees the device as though it were mounted in the computer, and the system works the same way as internal DAS backup (Figure 3-1).

Figure 3-1. External Direct Attach Storage tape unit

There are several advantages to this type of architecture. To begin with, the backup unit has its own processor and memory. This immediately takes a portion of the load off the computer's resources, because some I/O is offloaded to the backup unit. By having a larger chassis, external units can accommodate multiple sets of media, loaded automatically by robotics or mechanical loading trays. This eliminates the problem of having to change media, and of making mistakes while changing media. External backup devices also allow multiple drives and SCSI connections to be used for backup, enhancing performance and availability. Media management software is often included with these units, which helps manage unattended backups. Features such as bar code readers keep track of which media belong to which backup set, and sensors can tell when media are wearing out.
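From the server's point of view, an external SCSI tape unit is just another device node presented by the operating system. The sketch below streams an archive to such a device; the /dev/nst0 path (the Linux no-rewind tape convention) and the source directories are assumptions for illustration and will differ by platform and driver.

    import tarfile

    def stream_to_tape(source_dirs, tape_device="/dev/nst0"):
        """Stream a tar archive directly to an attached SCSI tape device.
        The external unit appears to the software as a local device node."""
        with open(tape_device, "wb") as drive:
            # "w|" selects tarfile's non-seekable streaming mode, suited to tape.
            with tarfile.open(fileobj=drive, mode="w|") as archive:
                for directory in source_dirs:
                    archive.add(directory)

    # stream_to_tape(["/var/lib/data"])  # requires a tape drive and privileges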
LAN-Based Backup

At some point, managing individual backup units for each server, whether internal or external, becomes difficult and costly. It becomes more effective to centralize backup in one or more large backup units that can be used for many servers. When bandwidth to the backup unit is not an issue, but management and cost are, LAN-based backup should be considered.

LAN-based backup uses a server with a backup unit attached to it, usually by some form of parallel SCSI, to perform and manage backups. The backup server is in turn connected to the network via a standard Ethernet connection with TCP/IP support. This server, unlike an external DAS solution, is dedicated to backup and is used to back up the primary storage of other servers on the LAN (Figure 3-2).

Figure 3-2. LAN-enabled backup

A common variant of this model has a separate server control the backup process. The server to which the backup unit is attached only manages the I/O to the backup drives. This has the advantage of allowing one server to manage several backup units, centralizing management of the backup process.

The third variant uses a backup unit that is network enabled. Sometimes called NAS-enabled backup, it has an operating system and NIC embedded in it. The system administrator can then attach the backup unit directly to the network without the need for a server to host it. For small installations, this greatly simplifies installation and management, but at the expense of flexibility (Figure 3-2).

Note: NAS backup and NAS-enabled backup sound very similar. NAS backup is the practice of backing up NAS devices. NAS-enabled backup is LAN-enabled backup that uses an embedded operating system to eliminate the server to which the backup unit normally would have been attached. It is analogous to NAS disk arrays for primary storage.

In all cases, the backup software on the server reads and writes the blocks of data from the primary storage and sends the data across the LAN to the backup server. The backup server in turn manages the I/O to the backup devices. Often, several backups can be run simultaneously.

The advantages of LAN-based backup are scalability and management. Rather than dealing with many different backup drives in many hosts, backup is consolidated into a single unit capable of serving a collection of hosts. Adding a new server does not necessarily mean adding a new tape drive. Instead, excess capacity on existing units is utilized, saving money and time. Backup windows can also be scheduled centrally, with all the backups taken into account. Tape management is much easier when all the tapes are stored in a central location, rather than spread out over a building or data center. LAN-based backup also allows backup devices to be located in a different part of a building or data center from the servers they are backing up, providing flexibility in physical plant design.

The problem that most system administrators run into with LAN-based backup is the strain it places on network resources. Backups are bandwidth intensive and can quickly become a source of network congestion. This leaves the system administrator with the unattractive choice of either impacting overall network performance (and having angry end-users) or worrying that backups won't complete because of network-related problems. Worse, both can occur at the same time. At some point, the only way to alleviate the network problems is to create a separate network and keep the server-to-backup-unit ratio very low.
Either method takes away much of the advantage of deploying the LAN-enabled backup solution. Separate networks mean many additional and redundant network components, reducing cost savings and increasing management resource needs. Backing up only a few servers over the network reduces the scalability of the architecture until it eventually approaches that of the external DAS solution.

Some backup systems need to read and write data through the file system if they are to deliver it over the network. The backup server must mount the remote file system, open and read the files, and then write them as block data to tape or other media. As is often the case with productivity applications, the files are small but numerous. Backing up a file server in this way means opening and closing many small files, which is very inefficient and slow. Software that sidesteps the file system by placing agents on the servers can prevent this situation, though it may limit platform choices.

Another important problem is the issue of the backup window. Unless the backup design calls for a separate network, there will be some impact on the network when backups are being performed. The backup window has to be strictly adhered to, or end-users will feel its effects, even those who do not use the system being backed up. With many companies operating around the clock, the backup window may be very small or nonexistent when network effects are taken into account.
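A rough feasibility check shows how quickly LAN bandwidth constrains the backup window. The sketch below estimates how long a backup will take at a given effective network throughput and compares it with the available window; the data size, throughput, and window figures are assumptions for illustration, not measurements.

    def backup_window_check(data_gb, throughput_mb_per_s, window_hours):
        """Estimate backup duration over the LAN and compare it with the window.
        Effective throughput is usually well below the link's rated speed."""
        duration_hours = (data_gb * 1024 / throughput_mb_per_s) / 3600
        return duration_hours, duration_hours <= window_hours

    # Example (assumed figures): 2 TB of data over a Gigabit link delivering
    # roughly 60 MB/s of usable backup throughput, with a 6-hour nightly window.
    hours, fits = backup_window_check(data_gb=2048, throughput_mb_per_s=60, window_hours=6)
    print(f"Estimated backup time: {hours:.1f} hours; fits window: {fits}")

With these assumed numbers the backup needs nearly ten hours, well beyond the six-hour window, which is exactly the situation that pushes organizations toward separate backup networks or SANs.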
SAN Backup

The single most prevalent reason for the first deployments of Fibre Channel Storage Area Networks has been to alleviate the backup woes that large enterprises experience. By moving to a separate, high-capacity network tailor-made for storage applications, many issues with backup windows, network congestion, and management can be addressed. As is always the case, things have not worked out quite that way. There are many special challenges to performing backups across a Fibre Channel SAN, including the performance of systems on the SAN while backups are running. Still, the basic value proposition remains: FC SANs are an important step toward more reliable and less intrusive backups.

With SAN backup, the backup unit is connected via a Storage Area Network, usually Fibre Channel, to the servers and storage that will be backed up. Much of the benefit derived from SAN backup comes from the fact that it is performed on a separate high-speed network. It is arguable that switching backup to a separate Ethernet network, without changing anything else, provides much the same advantage as a SAN. In many cases, that is true: the problems with network congestion are fixed, and there is enough bandwidth to back up more data. SANs, however, have the advantage of being able to perform block I/O over a network. Unlike other network backup schemes, in SANs, blocks of data can be read from a disk and delivered directly to the backup unit, which writes the blocks as they are. There is no need for intermediate protocols or encapsulation of data. This makes even IP-based SANs, such as iSCSI, more efficient for backup. Fibre Channel SANs provide the additional benefit of a very efficient network stack, which again boosts performance.

The SAN also provides connectivity with performance. Many storage devices can be backed up at the same time without impact on the primary LAN or servers. The high bandwidth of Fibre Channel and Gigabit Ethernet relative to tape drives, the most common backup media, allows data from several storage units to stream to multiple drives in the same library at the same time.

It is important to remember that all networks enable the distribution of functions from a single entity to several entities. LANs allow us to break processing and memory resources into many computers, some of which have special purposes. This way, expensive resources can be shared for cost control, scalability, and increased performance. SANs do the same by distributing storage resources. This is a big part of the attraction of SAN-based backups. By distributing the resources needed to back up data across a network, greater performance, scalability, and cost savings are realized over the long term.

There are two common architectures for a SAN-based backup system. The first, and most common, is LAN-free backup; the other is called server-less (or server-free) backup. Both have certain advantages and disadvantages, though they share the overall advantages of SAN backup.

LAN-Free Backup

As the name suggests, LAN-free backup occurs off the LAN and on the SAN. The storage, servers, and backup units are connected to a network across which block I/O can occur. This is the basic SAN backup system. All backup I/O goes across the storage network, and none travels through the LAN. Management and control information may still rely on the presence of an IP network, making LAN-free backup not completely free of the LAN (Figure 3-3).

Figure 3-3. LAN-free backup
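To make the distinction between file I/O and block I/O concrete, here is a minimal sketch that copies raw blocks from a source device to a backup device in large sequential chunks, with no file system in the path. The device paths and chunk size are assumptions for illustration; on a real SAN the transfers would be SCSI or FCP operations handled by the HBAs and drivers rather than ordinary file reads.

    def copy_blocks(source_device="/dev/sdb", target_device="/dev/sdc",
                    chunk_bytes=1024 * 1024):
        """Copy a device block-for-block in large sequential reads and writes.
        No files are opened on the source; data moves as raw blocks."""
        copied = 0
        with open(source_device, "rb") as src, open(target_device, "wb") as dst:
            while True:
                chunk = src.read(chunk_bytes)
                if not chunk:
                    break
                dst.write(chunk)
                copied += len(chunk)
        return copied

    # copy_blocks()  # requires appropriate privileges and correct device names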
If the primary storage is also on the SAN, the backup server interacts with the application servers only to get system information, which it uses to maintain catalogs and ensure proper locks on objects. The actual data is copied by the backup server directly from the storage devices to the backup drives; the data path does not go through the application server. The immediate effect is to reduce the backup load on the application server, which no longer needs to read and write data during backups. This allows backups to be performed while the server continues to operate with reasonable efficiency.

LAN-free backup is a good method of relieving stress on servers and the primary LAN. As with all I/O-intensive applications, the performance of the SAN is affected, but this does not bother the end-user much. LAN congestion and slow servers are far more obvious to end-users, because they impair their ability to get their daily tasks done. Server I/O can run below peak performance without the majority of end-users feeling inconvenienced. Let network response time become too slow, however, and the help desk will be flooded with angry calls, and customers will abandon the e-commerce application the company spent millions to build and roll out. LAN-free backup provides welcome relief to networks and servers overstressed at backup time.

There is another form of LAN-free backup, in which the backup software resides on the individual servers rather than on a dedicated backup server. This is easier to deploy and less expensive than a dedicated backup server. The downside is that the server is still in the data path, and its resources are still taxed during backups. It is also not a particularly scalable architecture: as the system grows, each server will need additional backup software, which will have to be managed separately. If the drag on servers during backups is not all that onerous, and there aren't enough servers to warrant consolidation of the backup function, the argument for performing backup across a SAN is weak.

Server-less Backup

The ultimate backup architecture is server-less backup. There is no server in the data path at all. The data is moved from the primary storage to backup by an appliance or through software embedded in one of the storage devices. Software on a server tells the storage devices what to transfer to the backup unit, monitors the process, and tracks what was moved. Otherwise, it stays out of the way of the data. The data is moved from primary storage to the backup unit directly (Figure 3-4).

Figure 3-4. Server-less backup

This is superior to LAN-free backup because data moves only once and in one direction. Data does not have to be copied to a backup server before being written to the backup unit. An important performance bottleneck is eliminated, and more backup servers are not required to scale the system. Data travels through the SAN only once, reducing the amount of network bandwidth consumed.

From a system perspective, two things are needed for this design to work. First, there needs to be an appliance capable of handling the data transfer, called a data mover. The data mover has to be embedded either in one of the storage devices or in an appliance that sits in front of one of them. Second, there needs to be a protocol that tells the data mover which blocks of data to move from primary to backup storage while monitoring the results. This is provided by a set of extensions to the SCSI protocol called Extended Copy.
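The sketch below models the idea only; it is not the actual SCSI Extended Copy command format. The backup software builds a list of block-range descriptors and hands them to a data mover, which performs the copies itself. All names, structures, and device paths here are illustrative assumptions.

    from dataclasses import dataclass

    @dataclass
    class CopySegment:
        """One illustrative descriptor: which blocks to move, and where to."""
        source_device: str
        target_device: str
        start_block: int
        block_count: int

    class DataMover:
        """Stand-in for an appliance (or storage-embedded function) that moves
        the data itself, so the backup server never touches the blocks."""
        BLOCK_SIZE = 512

        def execute(self, segments):
            results = []
            for seg in segments:
                with open(seg.source_device, "rb") as src, \
                        open(seg.target_device, "r+b") as dst:
                    src.seek(seg.start_block * self.BLOCK_SIZE)
                    data = src.read(seg.block_count * self.BLOCK_SIZE)
                    dst.write(data)
                results.append((seg, len(data)))
            return results  # the backup software sees only status, not data

    # The backup software issues descriptors and monitors the outcome:
    # mover = DataMover()
    # status = mover.execute([CopySegment("/dev/sdb", "/dev/st0", 0, 2048)])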
By issuing commands to the data mover, the backup software causes the backup to commence and is able to monitor the results without being involved in moving the data itself.

Terminology Alert: Never have so many names been given to a protocol. Extended Copy is also commonly known as Third Party Copy, X-Copy, and even E-Copy. Many vendors refer to this SCSI extension as Third Party Copy because it enables a third party (the data mover) to perform the backup. Technically speaking, Third Party Copy should refer to the architecture, not the protocol.

Server-less backup has never been fully realized. Products claiming this capability have been finicky, exhibiting serious integration and compatibility issues. The performance gains did not outweigh the costs for many IT managers. Network and server bottlenecks also turned out to be less serious an issue than the throughput of the backup media. Disk-based backup is having a greater impact on backup systems than server-less backup.

Backing Up NAS

There are three models for backing up NAS devices. The first mimics the DAS backup mechanisms of servers. A backup device is embedded within or attached to a NAS disk array, which has specialized software to perform the backup. This provides a very fast and convenient method of backup and restore. The software can perform dedicated block I/O to the backup unit from the NAS disks, and it already understands the NAS file system, so it does not need to use a network protocol to transfer the data on the disks. This is important, because many high-performance NAS arrays have file systems that are proprietary or that are optimized versions of common file systems such as NTFS. Backup software companies typically design custom versions of their software for these environments, or the NAS vendor produces its own. Although this is a solution optimized for performance, the backup unit embedded in or attached to the NAS array is dedicated to it. It cannot be shared, even if it is underutilized, and using a shared backup unit for failover is not feasible. A robust system therefore requires duplicate backup units, which are rarely fully utilized.

The second common architecture for NAS backup is via LAN-enabled backup. As far as the LAN-enabled backup server is concerned, the NAS array is a file server or similar device using the CIFS or NFS protocol. The backup software queries the file system and reads each file to be copied. These files are then backed up through the backup server as usual. This approach has all the advantages and disadvantages of network backup. It is flexible, robust, and makes good use of resources. It is also slow if the NAS array holds many small files, which is often the case. Opening and closing each file produces overhead that can bog down data transfer and negatively affect network and NAS system performance.

NAS arrays can also be backed up over a SAN. Most backup software vendors have SAN agents specific to common NAS devices. SAN backup of NAS devices eliminates the network overhead associated with copying files over a LAN. What detracts from this design is cost: a SAN needs to be in place or built, and additional agent licenses cost more money as more NAS devices are added to the system.

NAS and SANs in Backup

Many NAS arrays use SANs for the back-end storage system. In most cases, this SAN is self-contained within the unit or rack in which the NAS system is mounted. There is a definite scalability advantage to this architecture. A backup system may be part of this back-end SAN.
If this is the case, it is no different from a DAS backup unit contained within the NAS system; to the outside world, the backup devices are embedded. The advantage is that as the NAS device scales, there is an opportunity to scale the backup system as well.

NAS-SAN hybrid systems are another story. With access to the data on the disk at both the file and block level, different configurations are possible. A NAS device that can be attached to a SAN may utilize the resources of the SAN for backup. This marries the ease of use and file I/O performance of NAS with excellent backup options.

NAS Backup Using NDMP

To back up files, a tape backup program has to access the data on the NAS array. The open protocols available (such as NFS and CIFS) allow backup software to see only the files, not the underlying blocks. The backup then has to be performed using file I/O: each file has to be opened, its contents read and streamed to a backup device over the network or via an internal SCSI bus, and then closed. This is not a big problem if only very large files or a small number of files need to be backed up. If many small files must be backed up, files are opened and closed constantly, and the associated overhead makes the system very slow.

The network model is often preferred for backing up NAS devices, yet going through the file system creates problems. Many NAS vendors have implemented their own protocols for streaming data to backup software. The downside of this approach is that the protocols are proprietary, requiring backup software vendors to support many different protocols. In response, vendors involved in NAS backup developed a standard protocol for managing NAS backup while using standard backup software running on a network. Called the Network Data Management Protocol (NDMP), it defines a bidirectional communication scheme, based on XDR (External Data Representation) and a client-server architecture, optimized for performing backup and restore. NDMP provides a standard way of backing up NAS arrays over a network while removing much of the complexity. NDMP requires that the backup software support the protocol and that the NAS array have an NDMP agent running on it. The agent is the server, and the backup software is considered to be the host.
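NDMP messages are encoded with XDR and exchanged over a TCP connection between the backup software and the agent on the NAS array. As a rough illustration of what that looks like at the byte level, the sketch below packs a simplified, NDMP-style request header using Python's legacy xdrlib module (removed from the newest Python releases); the field layout is paraphrased from memory of the protocol and should be checked against the NDMP specification rather than taken as authoritative.

    import xdrlib  # standard-library XDR encoder (removed in Python 3.13+)

    # Simplified, NDMP-style header: all fields are 32-bit unsigned integers.
    # Treat the field order as illustrative, not as the normative wire format.
    def pack_ndmp_style_header(sequence, time_stamp, message_type, message_code,
                               reply_sequence=0, error=0):
        packer = xdrlib.Packer()
        for field in (sequence, time_stamp, message_type,
                      message_code, reply_sequence, error):
            packer.pack_uint(field)
        return packer.get_buffer()

    # The backup software would send such headers over TCP to the NDMP agent
    # on the NAS array, which conventionally listens on port 10000.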
Backup and Restore Software

Software is the primary ingredient in backup and restore systems. Almost any type of backup unit can be used successfully, but without backup software it is as inert as a doorstop, except less useful. All backup software must perform three important functions.

First, it must copy the data to the backup media. Without the copy, there are no backups and no protection for the data.

Second, it must catalog the data objects so that they can be found later. To restore data, the software has to find it and copy it back to the primary storage media. The catalog provides the mechanism for identifying what is on the backup media, what the characteristics of the original primary storage were (including name, location, and assorted state conditions), and where on the backup media it is located. Catalogs give the backup software the ability to restore data objects to the state they were in at backup time.

Finally, it must be able to restore data to the exact state it was in when it was backed up. That may mean that an entire disk is re-created on a new disk or that a single file is restored to what it was last Tuesday.

Other common features that must exist in any enterprise backup and restore software are