Section 2.5. Deciding How to Back Up | Backup & Recovery: Inexpensive Backup Solutions for Open Systems

2.5. Deciding How to Back Up

Once you've decided when you're going to back up, you have to decide how you are going to back up the data. But first, look at what types of problems you are protecting yourself from.

2.5.1. Be Ready for Anything: 10 Types of Disasters

As stated earlier, how you want to do your restores determines how you want to do your backups. One of the questions that you must ask yourself is, "What are you going to protect yourself from?" Are the users in your environment all "power users" who use their computers intelligently and never make dumb mistakes? Would your company lose a lot of essential data if the files on your users' PCs were accidentally deleted? If a hurricane took out your whole company, would it be able to continue doing business? Make sure that you are aware of all the potential causes for data loss, and then make sure your backup methods are prepared for all of them. The most exhaustive list of potential causes of data loss that I have seen is in another O'Reilly book called Practical Unix and Internet Security by Simson Garfinkel and Gene Spafford. Their list, with my comments attached, follows:

User error: This has been, by far, the cause of the biggest percentage of restores in every environment that I have seen. "Hey, I was sklocking my flambality file, and I accidentally pressed the jankle button. Can you restore it, please?" This one is pretty easy, right? What about the common question: "Can you restore it as of about an hour ago?" You can do this with continuous data protection systems and snapshots, but not if you're running backups once a night.
System-staff error: This is less common than user error (unless your users have root or administrator privileges), but when it happens, oh boy, does it happen! What happens when you newfs your database's raw device or delete a user's document folder? These restores need to go really fast, because they're your fault. As far as protecting yourself from this type of error, the same is true here as for user errors: either typical nightly backups or snapshots can protect you from this.
Hardware failure: Most books talk about protecting yourself from hardware failure, but they usually don't mention that hardware failure can come in two forms: disk drive failure and systemwide failure. It is important to mention this because it takes two entirely different methods to protect yourself from these failures. Many people do not take this into consideration when planning their data protection plan. For example, I have often heard the phrase, "I thought that disk was mirrored!" when a drive or filesystem is corrupted by a system panic. Mirroring does not protect you from a systemwide failure. As a friend used to say, if the loose electrons floating around your system decide to corrupt a drive or filesystem when your system goes down, "mirroring only makes the corruption more efficient." Neither do snapshots protect you from hardware failure, unless you have the snapshot on a backup volume.
Disk drive failure: Protecting your systems from disk drive failure is relatively simple now. Your only decision is how safe you want to be. Mirroring, often referred to as RAID 1, offers the best protection, but it doubles the cost of your initial drive and controller hardware investment. That is why most people choose one of the other levels of Redundant Arrays of Independent Disks (RAID), the most popular being RAID 5, with RAID 6 gaining ground. RAID 5 volumes protect against the loss of a single drive by calculating and storing parity information on each drive. RAID 6 adds more protection by storing parity twice, thus allowing for the failure of more than one drive.
Systemwide failure: Most of the protection against systemwide failure comes from good system administration procedures. Document your systems properly. Use your system logs and any other monitoring methods you have at your disposal to watch your systems closely. Respond to messages about bad disks, controllers, CPUs, and memory. Warnings about hardware failures are your chance to correct problems before they cause major disasters. Another method of protecting yourself is to use a journaling filesystem. Journaling treats the filesystem much like a database, keeping track of committed and partially committed writes to the filesystem. When a system is coming up, a journaling filesystem can roll back partially committed writes, thus "uncorrupting" the filesystem.

The Windows change journal does not make NTFS a journaling filesystem in this sense. It contains only a list of files that have been changed; it does not actually contain the changes. Therefore, it cannot roll back any changes.

Software failure: Protecting yourself from software failure can be difficult. Operating system bugs, database bugs, and system management software bugs can all cause data loss. Once again, the degree to which you protect yourself from these types of failures depends on which type of backups you use. Frequent snapshots or continuous data protection systems are the only way to truly protect against losing data, possibly a lot of data, from software failure.
Electronic break-ins, vandalism, and theft: There have been numerous incidents of this in the past few years, and many have made national news. If you do lose data due to any one of these, it's very different from other types of data loss. While you may recover the data, you can never be sure of what happened to the data while it wasn't in your possession. Therefore, you need to do everything you can to ensure that this never happens. If you want to protect yourself from losing data in this manner, I highly recommend reading the book from which I borrowed this list, Practical Unix and Internet Security, by Simson Garfinkel and Gene Spafford (O'Reilly).
Natural disasters: Are you prepared for a hurricane, tornado, earthquake, or flood? If not, you're not alone. Imagine that your entire state was wiped out. If you are using off-site storage, is that facility close to you? Is it prepared to handle whatever type of natural disasters occur in your area? For example, if your office is in a flood zone, does your data storage company store your backups on the first floor? If they're in the flood zone as well, your data can be lost in one good rain. If you really want to insure yourself against a major natural disaster, you should explore real-time, off-site storage at a remote location, discussed later in this chapter in the section "Off-Site Storage."
Other disasters: I remember how we used to test our disaster recovery plan at one company where I worked: we would pretend that some sort of truck blew up on the street that ran by our data center. The plan was to recover to an alternate building. This would mean that we would have to have off-site storage of media and an alternate site that was prepared to accommodate all our systems. A good way to do this is to separate your production and development systems and place them in different buildings. The development systems can then take the production systems' place if the production systems are damaged, or if power to the production building is interrupted.
Archival information: It is a terrible thing to realize that a rarely used but very important file is missing. It is even more terrible indeed to find out that it has been gone longer than your retention cycle. For example, you keep your backups for only three months, after which you reuse the oldest volume, overwriting any backups that are on that volume. If that is the case, any files that have been missing for more than three months are impossible to recover. No matter how insistent the user is about how important the files are, no matter how many calls he makes to your supervisors, you will never be able to restore the files. That is why you should keep some of your backups a little bit longer. A normal practice is to set aside one full backup each month for a few years. If you're going to keep these backups for a long time, make sure you read the following sidebar "Are You Keeping Your Archives Too Long?" and the "Backup and Archive" section in Chapter 24.
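
The parity protection described under "Disk drive failure" can be illustrated with a few lines of Python. This is a minimal sketch of the XOR arithmetic for a single stripe, not a model of how any particular RAID controller lays out data; the block contents and the helper name xor_blocks are made up for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks in one stripe (hypothetical contents) and their parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the drive that held the second block, then rebuild that
# block from the surviving data blocks plus the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])
```

Because XOR cancels itself out, combining the surviving blocks with the parity block yields the missing block. Losing two blocks of the same stripe leaves too little information to rebuild either one, which is the gap that a second, independent parity calculation is meant to close.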

"How Were the Backups Last Night?"

I suppose I've heard thousands of administrator-error horror stories, like people typing rm -r /*. I remember a guy who wanted to delete a junk file in /bin called ?*&(&^JI($SF ))FS%$#T, or something like that. He typed rm /bin/?* (which deleted all the files starting with any character; that's right, all of them). But there's one story that I witnessed firsthand that still makes me laugh.

A consultant was given the task of cleaning up our home directories. Apparently, my company was very good about deleting logins for people who had left the company, but we weren't very good about deleting their home directories. The consultant wrote a program that basically did the following:

cd into /home1

find, looking for directories that did not match an entry in the password file and were not owned by root or administrator

rm -r that directory

Each user's home directory was located under a directory that was the first letter of her login. For example, the home directory for cpreston was in /home1/c/cpreston. The scenario went something like this. The idea was that /home1/c would be owned by root and thus would not be deleted. Unfortunately, over the years, an administrator or two would cd into /home1/c/cpreston and try to correct an ownership problem. To do that, the administrator would type chown cpreston .*. Well, if you've ever done that as root, you know that .* includes .., which in this case would be /home1/c. Thus, /home1/c ends up being owned by me!

The consultant did not foresee this and so would interpret /home1/c as a user's home directory and look for the user called "c" in the password file. Of course, there was no such user, so the program said rm -r /home1/c. I'm not sure when my friend realized what was happening, but I do remember being on my way out the door and getting a weird phone call. "How were the backups of /home1 last night?" my friend asked, very sheepishly and very mysteriously. "Fine, as always," was my response, "Why?" There's something beautiful about the power that the backup guy wields at that magic moment when someone really needs some files restored. Up to that point, you're the guy who comes in early and stays late, watching the backup drives spin. In one moment, you're transformed into the most important person he knows! Cool.
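
For contrast, here is one way a cleanup like this could be made less dangerous. It is only a sketch under assumptions of my own: the function name orphaned_home_dirs is hypothetical, the set of valid logins is passed in rather than read from the password file, and nothing is deleted. The key difference from the consultant's program is that it looks only at directories exactly two levels down (/home1/letter/login), so a letter directory such as /home1/c is never a candidate no matter who owns it.

```python
import os

def orphaned_home_dirs(home_root, known_logins):
    """Return directories two levels below home_root whose name is not a known login."""
    orphans = []
    for letter in sorted(os.listdir(home_root)):
        letter_dir = os.path.join(home_root, letter)
        # Skip anything that is not a real directory (files, symlinks).
        if not os.path.isdir(letter_dir) or os.path.islink(letter_dir):
            continue
        for login in sorted(os.listdir(letter_dir)):
            candidate = os.path.join(letter_dir, login)
            if not os.path.isdir(candidate) or os.path.islink(candidate):
                continue
            if login not in known_logins:
                orphans.append(candidate)
    return orphans
```

A caller would print the returned list for a human to review, for example with a loop that prints "would remove" followed by each path, and only then decide what to delete.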

2.5.2. Automate Your Backup

If you work in a shop with a modest budget, you probably looked at this heading and said, "Sure, if I could afford it." Although automation that involves expensive jukeboxes and autochangers is nice, that is not the type of automation I am talking about. There are two types of automation. One type allows your backups to complete an entire cycle without requiring any manual intervention from you, such as ejecting and loading new volumes. This type of automation can make things much easier but can also make them much more expensive. If you can't afford it, a less expensive alternative is to have your backup system notify you when you need to do something manually. At the very least, it should notify you if you need (or forgot) to change a volume. If things aren't going right, you need to know. Too many times people look at their backup logs only when they need to do a restore. That's when they find out that their backups have failed for days or weeks. A slightly intelligent backup system could email you or page you if things don't go the way you expect them to go.

The second type of automation is actually much more important. This type of automation refers to how your backups "think." Your backup process should know what to back up without you telling it. If a DBA installs a new database, your backups should know about it. If a system administrator installs a new drive or filesystem, your backups should automatically include it. This is the type of automation that is essential to safe backups. A good backup system should not depend on a human brain to remember to do something.
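
As a rough illustration of backups that "know" what to back up, the following Python sketch builds the include list from the system's mount table instead of from a hand-maintained file. It assumes a Linux-style /proc/mounts format (device, mount point, filesystem type, and so on); the function name and the list of filesystem types to skip are assumptions chosen for the example, not a recommendation for any particular site.

```python
# Filesystem types this sketch chooses not to back up. The list is
# illustrative and incomplete; whether to skip NFS is a site policy decision.
SKIP_TYPES = {"proc", "sysfs", "tmpfs", "devtmpfs", "devpts", "cgroup2", "nfs", "nfs4"}

def filesystems_to_back_up(mounts_text, skip_types=SKIP_TYPES):
    """Return sorted mount points from /proc/mounts-style text, minus skipped types."""
    mount_points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or malformed line
        _device, mount_point, fs_type = fields[:3]
        if fs_type not in skip_types:
            mount_points.append(mount_point)
    return sorted(mount_points)

# Sample mount table (made-up devices) standing in for the real file.
sample = """\
/dev/sda1 / ext4 rw,relatime 0 0
proc /proc proc rw,nosuid 0 0
tmpfs /run tmpfs rw,nosuid 0 0
/dev/sdb1 /data xfs rw,relatime 0 0
"""
print(filesystems_to_back_up(sample))
```

On a Linux host, the text could come from reading /proc/mounts at the start of each backup run, so a newly added filesystem is picked up without anyone editing a list.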

Are You Keeping Your Archives Too Long?

Some governments have laws and regulations that govern how long certain types of data are allowed to be kept in a company's files. We're not talking about regulations that say you must keep data for a certain number of years. We're talking about a regulation that says you must delete data after a certain number of years. For example, you may be told that your personnel department can keep disciplinary paperwork for only two years. If an employee believes that her chances for a promotion are reduced because of a disciplinary action that is more than two years old, she can sue for damages. Many lawsuits have been filed based on these laws.

What happens when the disciplinary action "paperwork" is actually a file on someone's computer? The laws extend to the computers too, and the files must be deleted. But what if that file is on an archive volume that is being kept forever? Many companies have backup policies that dictate that one volume per system per year is kept "forever." In recent years, some companies have lost lawsuits because of policies like this.

The only way around this is to exclude from regular backups any directories that contain this type of information and archive them using a different schedule that conforms to the document retention laws of your state. I admit this is a pain. You will never read that I think that doing something special for anything is a good thing, but in these litigious times, this issue should not be overlooked.

2.5.3. Plan for Expansion

Another common problem happens as a backup system grows over time. What works for one or two boxes doesn't necessarily work for 200. As the volume of data grows, the need for a standardized backup system becomes greater and greater. This is a problem because most administrators, as they are writing their shell script to back up five or six boxes, do not think ahead to the time when there may be many more. I can remember my early days as the backup guy. I had 10 or 11 systems, and the "monster" was an Ultrix box. It was "huge," we said in those days. (It was almost 8 gigabytes!) The smallest tape drive we had was a 10 GB (with compression) Exabyte. We used the big 10 GB tape drive for the 8 GB system. We had what I considered to be a pretty good in-house backup script that worked without modification for two years.

Then came the HPs. The smallest system was 20 GB, and the biggest was much bigger than that. But these big systems came with a little 2 GB (4 with compression) DDS drive. Our backup script author never dreamed of a system that was bigger than a tape. One day I woke up, and our system was broken. I then spent months and months hacking up that shell script to support splitting the drive or filesystem into two tapes. Eventually, I gave up and bought a commercial product. My point is that if I had thought of that ahead of time, I might have been able to overcome the limitation without losing so much sleep.

When you are designing your backup system (or your data center, for that matter), plan on your systems getting bigger and more numerous. Plan for what you will do when that happens; trust me, it will happen. It will be much better for your mental health (not to mention your job security) if you can foresee the inevitable and plan for it when you design the system the first time. Your backup system is something that should be done right the first time. And if you spend a little time dreaming about how to break it before you design it, you can save yourself a lot of money in antacids and sleeping pills.

2.5.4. Don't Forget Unix mtime, atime, and ctime

Unix, Linux, and Mac OS systems record three different times for each file. The first is mtime, or modification time. The mtime value is changed whenever the contents of the file have changed, such as when you add lines to a logfile. The second is atime, or access time. The atime value is changed whenever the file is accessed, such as when a script is run or a document is read. The last is ctime, or change time. The ctime value is updated whenever the attributes of the file, such as its permissions or ownership, are changed.

Administrators use ctime to look for hackers because they may change permissions of a file to try to exploit your system. Administrators also monitor atime to look for large files that have not been accessed for a long time. (Such files can be archived and deleted.)
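
The atime check mentioned above might look something like the following Python sketch. The function name, the size threshold, and the idle period are all assumptions for illustration, and the sketch presumes that the filesystem is actually recording access times (mount options such as noatime or relatime change how often atime is updated).

```python
import os
import stat
import time

def stale_large_files(root, min_bytes, max_idle_days, now=None):
    """Yield (path, size, idle_days) for big regular files not accessed recently."""
    now = time.time() if now is None else now
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # lstat reads metadata only
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(info.st_mode):
                continue  # ignore symlinks, sockets, and other special files
            idle_days = (now - info.st_atime) / 86400
            if info.st_size >= min_bytes and idle_days >= max_idle_days:
                yield path, info.st_size, idle_days
```

Calling it as stale_large_files("/home", 100 * 1024 * 1024, 365) would list files of at least 100 MB that have not been read in a year. Note that a report like this is only as trustworthy as the atime values themselves, which is where backups come in.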

2.5.4.1. Backups change atime

You may be wondering what this has to do with backups. You need to understand that any backup utility that backs up using the filesystem modifies atime as it reads the file to back it up. Almost all commercial utilities, as well as tar, cpio, and dd,^[§] have this feature. dump reads the filesystem via the raw device, so it does not change atime.

^[§] dd has this feature when you're using it to copy an individual file in a filesystem, of course. When using dd to copy a raw device, you will not change the access times of files in the filesystem.

2.5.4.2. The atime can be reset, with a penalty

A backup program can look at a file's atime before it backs it up. After it backs up the file, the atime obviously has changed. It can then use the utime system call to reset atime to its original value. However, changing atime is considered an attribute change, which means that it changes ctime. This means that when you use a utility such as cpio or gtar that can reset atime, you change ctime on every file that it backs up. If you have a system that is watching for ctime changes, it will think that it's found a hacker for sure!
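
The trade-off can be observed with a small Python experiment. This is a sketch that assumes a Unix-like system; it uses Python's os.utime, which relies on the platform's utime family of calls, and a temporary scratch file stands in for a file being backed up.

```python
import os
import tempfile

# Create a scratch file to stand in for a file being backed up.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"payload")
    path = handle.name

before = os.stat(path)

# "Back up" the file by reading it; depending on mount options,
# this may update atime.
with open(path, "rb") as source:
    source.read()

# Put atime (and mtime) back to the values recorded before the read.
os.utime(path, ns=(before.st_atime_ns, before.st_mtime_ns))
after = os.stat(path)

print("atime restored:", after.st_atime_ns == before.st_atime_ns)
print("ctime changed: ", after.st_ctime_ns != before.st_ctime_ns)

os.remove(path)
```

The first line should report that atime is back where it started. Whether the second line shows a change depends on the timestamp resolution of the filesystem and how quickly the script runs, but on a typical Unix system the reset itself counts as an attribute change and moves ctime forward.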

Make sure that you understand how your utility handles this issue.

2.5.5. Don't Forget ACLs

Windows files stored on an NTFS filesystem and some files stored on modern Linux filesystems use access control lists (ACLs) to grant or restrict permissions to users. ACLs say who can read, write, execute, modify, or have full control over a file. Figure 2-1 shows an example of such ACLs.

Figure 2-1. Access control list example

You need to investigate how your backup product is handling ACLs. The proper answer is that they are backed up and restored. This is a feature common with commercial products, but unfortunately, not all open-source products do this. Make sure you look into this when evaluating open-source tools.

2.5.6. Don't Forget Mac OS Resource Forks

Mac OS files stored in MFS, HFS, or HFS Plus filesystems have two forks: the data fork and the resource fork. The data fork contains the actual data for the file, such as its text. The resource fork contains related structured data, such as offsets, menus, dialog boxes, and icons for the file. The two forks are tightly bound into a single file. While they are typically used by executables, every file can have a resource fork, and other applications can use it as well. For example, a word processing program may store a file's text in the data fork and the file's images in the file's resource fork.

These resource forks, like Windows ACLs, need to be backed up, and not all backup products back them up properly. Make sure you investigate what your backup system does with the data fork and resource fork.

2.5.7. Keep It Simple, SA

K.I.S.S. Have you seen this acronym before? It applies double or triple to backups. The more complicated your backup scheme is, the more likely it is to fail. If you do not understand it, you cannot implement it. Remember this every time you consider adding a new bell or whistle to your backup system. Every change puts your data at risk. Also, every change might make your backup system that much more complex, and more difficult to explain to the new backup person. One of the heads of support for a commercial backup product said that he sees the same thing over and over again. One person gets to know the software really well and writes various scripts to automate this and that. Backups become a well-oiled machine, until they are turned over to the trainee. The trainee doesn't understand all the bells and whistles, and things start breaking. All of a sudden, your data is in danger. Keep that in mind the next time you think about adding some cool new feature to your backup script.

This next comment also relates to the previous section about "thinking big." One of the common judgment errors is to not automate in the beginning. It's so much easier to just put a hardcoded include list in a file somewhere or put it in the cron or scheduled task entry itself. However, that creates many different backup methods. If each box has its own special customized backup system, it is very hard to monitor your backups and explain them to the new person.

Remember, special is bad. Just keep saying it over and over again until you believe it.

It's not such a big deal when you have two or three systems, but it is when you grow to 200 systems. If you have to remember every system's idiosyncrasies every time you look at your logs, things inevitably get out of control. Exceptions for each system also can mean that things get overlooked. Do you remember that nine months ago you excluded /home* on apollo? I hope so, if apollo just became your primary NFS server, and it now has seven home directories.

If you cannot explain your backups to a stranger in less than a few hours, things are probably too complex. You should look at implementing things like centralized logging, standardized backup scripts, and some level of automation.

Read Those Manuals

The IP address of the backup server for a large software company was constantly changing to different, seemingly random IP addresses. The only identifiable pattern was that each new IP address the backup server would be assigned would be an IP address of one of the backup clients. Support cases were opened with vendors and all engineers were working 24/7 to resolve it, yet nobody could figure it out.

It turned out that a backup operator assigned to resolving backup issues was troubleshooting using the standard troubleshooting procedures for the group. But the new backup operator mixed up a few commands, so when trying to do basic name resolutions for the backup hosts (nslookup hostname), the command issued became ifconfig -a hostname instead. This changed the IP address of the backup server to whatever host was having backup issues, at random times of the day, and only on the days that operator was working.

Jorgen Lie