Section 2.3. Deciding What to Back Up | Backup & Recovery: Inexpensive Backup Solutions for Open Systems

2.3. Deciding What to Back Up

Experience shows that one of the most common causes of data loss is that the lost data was never configured to be backed up. The decision of what to back up is an important one.

2.3.1. Plan for the Worst

When trying to decide what files to include in your backups, take the most pessimistic technical person in your company out to lunch. In fact, get a few of them together. Ask them to come up with scenarios that you should protect against. Use these scenarios in deciding what should be included, and they will help you plan the "how" section as well. Ask your guests: "What are the absolute worst scenarios that could cause data loss?" Here are some possible answers:

An entire system catches fire and melts to the ground, leaving an unrecognizable mass of molten metal and blackened, smoking plastic.
Since this machine was so important, you had it replicated to another node right next to it. Of course, that machine catches fire right along with this one.
You have a centralized server that controls all backups and keeps a record of backup volume locations and what files are on what volumes, and so on. The server that blew up sits right next to this "backup server," and the intense heat takes this system with it.
The disastrous chain reaction continues, taking out your DHCP and Active Directory servers, the NIS master server, the NFS and CIFS home directory servers, and the database server where you house the inventory of all your backup volumes with their respective locations. This computer also holds the telephone database listing all service agreements, vendor telephone numbers, and escalation procedures.
You haven't memorized the number to your new off-site storage vendor yet, so it's taped to the wall next to your backup server. You realize, of course, that the flames just burned that paper beyond recognition.
The flames set off the sprinkler system, and water pours all over your backup volumes. Man, are you having a bad day....

What do you do if one of these scenarios actually happens? Do you even know where to start? Do you know:

What volume contains last night's backup?
Where you stored it?
How to get in touch with the off-site storage vendor to retrieve the copies of your backup volumes? And once you find that out, whether your server and network equipment will be available to recover?
Who to call to get replacement equipment at 2:00 a.m. on a Saturday?
What the network looked like before all the wires melted?

First, you need to recover your backup server, because it has all the information you need. OK, so now you found the backup company's card in your wallet, and you've pulled back every volume they had. Since your media database is lost, how will you know which one has last night's backup on it? Time is wasting....

All right, you've combed through all the volumes, and you've found the one you need to restore the backup server (easier said than done!). Through your skill, cunning, and plenty of help from tech support, you restore the thing. It's up and running. Now, how many disks were on the systems that blew up? What models were they? How were they partitioned? Weren't some of them striped together into bigger volumes, and weren't some of them mirroring one another? Where's that information stored? Do you even know how big the drives or filesystems were? Man, this is getting complicated....

Validated, My Eye

A biotech firm with a number of servers that were considered validated systems for FDA 21 CFR purposes lost a critical database that was running on one such server. When they went to their backup server to restore it, they discovered to their horror that the database server had not been backed up for approximately three months. Somehow, it had been removed from the backup schedule, so no "errors" were showing up, and they were now without anything remotely approaching a current backup. The problem escalated up to the CEO of the company.

Jim Damoulakis

Didn't you just install that big jumbo kernel patch last week on three of these systems? (You know, the one that stopped all those network broadcast storms that kept bringing your network down in the middle of the day.) You did make a backup of the kernel after you did that, didn't you? Of course, the patch also updated files all over the OS drive. You made a full backup, didn't you? How will you restore the operating system drive, anyway? Are you really going to go through the process of reinstalling the operating system just so you can run the restore command and overwrite it again?

Filesystems aren't picky about size, as long as you make them big enough to hold the data that you restore to them, so it's not too hard to get those filesystems up and running. But what about the database? It was using raw partitions. You know it's going to be much pickier. It's going to want /dev/rdsk/c7t3d0s7, /dev/rdsk/c8t3d0s7, and /dev/rdsk/c8t4d0s7 right where they were and partitioned just as they were before the disaster. They also need to be owned by the database user. Do you know which drives were owned by that user before the crash? Which disks were those again?

It could happen.

Part IV covers these Catch-22 situations.

2.3.2. Take an Inventory

Make sure you can access essential information in the event of a disaster:

Backups for your backups: Many companies have begun to centralize control of their backups, which I think is a good thing. However, once you centralize storage of all your backup information, you have a single point of failure for your entire backup plan. You can't restore the backup server because you don't have the database of your backups. You don't have the database of your backups because you need to restore your backup server. Restoring this server would be the first step in any multisystem outage. For things like media inventory, don't underestimate the value of an inventory printed on paper and stored off-site. That paper may just get you out of this Catch-22. Given the single-point-of-failure factor, the recovery of your backup server should be the easiest and best-documented recovery that you have. You may even want to investigate creating a special tar, ntbackup, or rsync backup of that data to make it even easier to recover during a disaster.
What peripheral devices did you have?: Assuming you back up your disk drive configuration on a regular basis, you might have a list of all the disk drives, but do you know what models they are? If you have all Brand-X 500 GB drives, you have no problem, but many servers have a mixture of drives that were installed over time. You may have a collection of 40 GB, 100 GB, and 500 GB drives, all on the same system. Make sure that you are recording this in some way. Unix and Mac OS systems record this information in the messages file, and Windows stores it in the registry, so hopefully you're backing those up.
How were they partitioned?: This one can really get you, especially if you have to restore the operating system drive or a database drive. Both drives are typically partitioned with custom partitions that must be repartitioned exactly the same as before for a proper restore to occur. Typically, this partition information is not saved anywhere on the system, so you must do something special to record it. On a Solaris system, for example, you can run a prtvtoc on each drive and save that to a file. Search on the Internet for scripts for capturing this information; a number of such free utilities exist.
How were your volume managers configured?: A number of operating system-specific volume managers are available, including Veritas Volume Manager, Windows Dynamic Disks, Solstice (Online) Disk Suite, and HP's Logical Volume Manager. How is yours configured? What devices are mirrored to what? How are your multidisk devices set up? Unbelievably, this information is not always captured by normal backup utilities. In fact, I used Logical Volume Manager for months before hearing about the vgcfgbackup command (it backs up the LVM's configuration information). Sometimes if you have this properly documented, you may not need to restore at all. For example, if the operating system disk crashes, simply put the disks back the way they were and then rebuild the stripe in the same order, and the data should be intact. I've done this several times.
How are your databases set up?: I have seen many database outages. When I ask a database administrator (DBA) how her database is set up, the answer is almost always, "I'm not sure...." Find out this information, and record it up front.
Did you document how you set up DHCP, Active Directory, NFS, and CIFS?: Document, document, document! There are a hundred reasons to properly document things like this, and recovery from a disaster is one of them. Good documentation is definitely part of the backup plan. It should be regularly updated and available. No one should be standing around saying "I haven't set up NIS/AD/NFS from scratch in years. How do you do that again? Has anyone seen my copy of O'Reilly's book?" Actually, the best way to do this is to automate the creation of new servers. If your operating system supports it, take the time to write scripts that automatically install various services, and configure them for your environment. Put these together in a toolkit that is run every time you create a new server. Better yet, see if your OS vendor has any products that automate new server installations, such as Sun's Jumpstart, HP's Ignite-UX, Linux Kickstart, and Mac OS cloning features.
Do you have a plan for this?: The reason for describing the earlier horrible scenarios is so you can start planning for them now. Don't wait until there's 20 feet of snow in your front yard before you start shopping for a snow shovel! It's going to snow; it's only a question of when. Take those pessimists out to lunch, let them dream of the worst things that could happen, and then plan for them. Have a fully documented, step-by-step plan for the end of the computer world as you know it. Even if the plan needs a little modification when you actually have to use it, you will be glad you have a starting point. It's a whole lot better than standing around saying, "What do we do now? Has anyone seen my résumé?" (You did keep a hardcopy of it, right?)
Know what's on your boxes!: The best insurance against almost any kind of loss is for the backup/recovery person to be familiar with the systems he is protecting. If a particular server goes down, you should know immediately that it contains an Oracle or SQL Server database and should be running for those volumes. That way, the moment the server is ready for a restore, so are you. Become very involved in the installation of any new system or database. You should know what database platforms you are using and how they are set up. You should know about any new drives, filesystems, databases, or systems. You need to be very familiar with every box, what it does, and what's on it. This information is vital so that you can include any special backups for that type of system.
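Several of the inventory items above (drive models, partition tables, volume manager configuration) can be gathered by a small scheduled job that saves command output into a directory that is itself backed up and, ideally, printed. The following is a minimal sketch in Python under stated assumptions: the function name, the label scheme, and the output directory are hypothetical, and the commands you pass in must be the ones that apply to your platform.

```python
import subprocess
from pathlib import Path


def capture_inventory(commands, outdir):
    """Run each command and save its output to <outdir>/<label>.txt.

    `commands` maps a label to an argument list. Returns a dict of
    label -> exit status so the caller can flag captures that failed.
    A status of -1 means the command could not be started at all.
    """
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    status = {}
    for label, argv in commands.items():
        try:
            result = subprocess.run(argv, capture_output=True, text=True)
            text, code = result.stdout + result.stderr, result.returncode
        except OSError as exc:  # for example, the command is not installed
            text, code = f"could not run {argv!r}: {exc}\n", -1
        (out / f"{label}.txt").write_text(text)
        status[label] = code
    return status


# Example for a Solaris host (device name and directory are illustrative):
# capture_inventory(
#     {"vtoc_c0t0d0": ["prtvtoc", "/dev/rdsk/c0t0d0s2"]},
#     "/var/adm/inventory",
# )
```

Returning the exit status per label, rather than raising on the first failure, lets one broken command be reported without losing the rest of the inventory.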

It Pays to Watch Your Logs

It was my very first gig out of college, so I was primarily supposed to be doing desktop support while learning at the feet of a high-priced Unix consultant, who we'll call Fred.

We were supporting a ForEx trading app called Opus that ran on SunOS. When it stored trades, half the information was in the path. For example, if someone made a USD-GBP trade on June 15 with someone from Bank of New York, the path and file would look like this:

/opt/app/opus/transactions/portfolio/third-party/...etc...etc.../USD/CAI/GBP/BONY/ask/19970615120453.2372149821335

This insipid design was surely not Fred's fault, but he did set up the backups for it. I discovered that he had set up a tar job using the v (verbose) option, which produced logs so big he wasn't looking at them. Once I removed the v option and started watching the logs, I found out that backups had been failing. The version of tar that shipped with SunOS at the time choked on file paths longer than 100 characters or so. The trading stubs were all about nine characters too long. Fred was basically tarring up a huge directory tree with no files at the bottom. Had he ever looked at the logs, he would have known that.

I was designated the primary Unix admin the next day. The company didn't renew Fred's contract.

Jim "Sparky" Donnellan
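One way to catch a failure like the one in this story without wading through verbose logs is to compare what is on disk with what actually landed in the archive. The following is a minimal sketch using Python's standard tarfile module (not the SunOS tar from the story). The function name is hypothetical, and the sketch assumes the archive stores paths relative to the directory being backed up and that the archive itself lives outside that directory.

```python
import os
import tarfile


def missing_from_archive(source_dir, archive_path):
    """Return the regular files under source_dir that are absent from the archive.

    Paths are compared relative to source_dir, so the archive is assumed
    to have been created from inside that directory.
    """
    on_disk = set()
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            full = os.path.join(root, name)
            # Only count regular files; symbolic links are ignored here.
            if os.path.isfile(full) and not os.path.islink(full):
                on_disk.add(os.path.relpath(full, source_dir))

    with tarfile.open(archive_path) as tar:
        archived = {
            os.path.normpath(member.name)
            for member in tar.getmembers()
            if member.isfile()
        }

    return sorted(on_disk - archived)
```

An empty result means every regular file on disk has a matching entry in the archive. A long list of missing files, or an archive that contains directories but no files, is the kind of result that should raise an alarm.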

2.3.3. Are You Backing Up What You Think You're Backing Up?

I remember an administrator at one of my previous employers who used to say, "Are we getting this on tape?" He always said it with his trademark smirk, and it was his way of saying "Hi" to the backup guy. His question makes a point. There are some global ways that you can approach backups that may drastically improve their effectiveness. Before we examine whether to back up part or all of the system, let us examine the common practice of using include lists and why they are dangerous. Also, let's consider some of the ways that you can avoid using include lists.

What are include and exclude lists? Generically speaking, there are two ways to back up a system:

You can tell your backup system to back up everything, except what is in an exclude list, for example:

For Unix, Linux, and Mac OS servers:

Include: * Exclude: /tmp, /junk1, /junk2

For Windows servers:

Include: * Exclude: *.tmp, *Temporary Internet Files*, ~*.*, *.mp3

You can tell your backup system to back up what is in an include list, for example:

For Unix, Linux, and Mac OS servers:

Include: /data1, /data2, /data3

For Windows servers:

Include: D:\, E:\

Looking at these examples, ask yourself what happens when you create /data4 or the F:\ drive? Someone has to remember to add it to the include list, or it will not be backed up. This is a recipe for disaster. Unless you're the only one who adds drives or filesystems and you have perfect memory, there will always be a forgotten drive or filesystem. As long as there are other administrators and there is gray matter in your head, something will be left out.
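The difference between the two approaches can be shown in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the lists reuse the example paths from above, with /data4 standing in for the filesystem that someone added without telling the backup administrator.

```python
def select_by_include(filesystems, include):
    """Back up only the filesystems that are named in the include list."""
    return [fs for fs in filesystems if fs in include]


def select_by_exclude(filesystems, exclude):
    """Back up every filesystem except those named in the exclude list."""
    return [fs for fs in filesystems if fs not in exclude]


include = ["/data1", "/data2", "/data3"]
exclude = ["/tmp", "/junk1", "/junk2"]

# The system after someone adds /data4 without updating any list.
mounted = ["/", "/tmp", "/data1", "/data2", "/data3", "/data4"]

print(select_by_include(mounted, include))  # /data4 is silently skipped
print(select_by_exclude(mounted, exclude))  # /data4 is picked up automatically
```

With the include list, forgetting to update the list loses data. With the exclude list, forgetting to update the list costs only some extra media.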

I Hate It When That Happens

I was working at a major publishing company when an image server died. When those involved went to the backup administrator and asked for a restore of all the images from the server, he had no record of the server. It appears that after placing the server into production a year earlier, no one had formally requested that the server be added to the backup system. They lost thousands of images.

Chris Pritchard

However, unless your backup utility supports automated drive or filesystem discovery, it takes a little effort to say, "Back up everything." How do you make the list of what systems, drives, filesystems, and databases to back up? What you need to do is look at files such as /etc/vfstab or the Windows registry and parse out a list of drives or filesystems to back up. You can then use exclude lists to exclude any drives or filesystems you don't want backed up.
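As a rough illustration of parsing such a file, the sketch below reads text in the Solaris /etc/vfstab layout (seven whitespace-separated fields, with the mount point third and the filesystem type fourth) and returns the mount points to back up after applying an exclude list. The function name, the default of selecting only ufs filesystems, and the sample entries are assumptions made for the example.

```python
def filesystems_from_vfstab(text, fs_types=("ufs",), exclude=()):
    """Return mount points of local filesystems listed in vfstab-style text.

    Each non-comment line is expected to have seven whitespace-separated
    fields; the third is the mount point and the fourth the filesystem type.
    """
    mount_points = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # malformed line; skip rather than guess
        mount_point, fs_type = fields[2], fields[3]
        if fs_type in fs_types and mount_point not in exclude:
            mount_points.append(mount_point)
    return mount_points


# Made-up sample contents, for illustration only.
SAMPLE = """\
#device            device              mount   FS     fsck  mount    mount
#to mount          to fsck             point   type   pass  at boot  options
/dev/dsk/c0t0d0s1  -                   -       swap   -     no       -
/dev/dsk/c0t0d0s0  /dev/rdsk/c0t0d0s0  /       ufs    1     no       -
/dev/dsk/c0t0d0s7  /dev/rdsk/c0t0d0s7  /data1  ufs    2     yes      -
/dev/dsk/c0t1d0s7  /dev/rdsk/c0t1d0s7  /junk1  ufs    2     yes      -
swap               -                   /tmp    tmpfs  -     yes      -
"""

print(filesystems_from_vfstab(SAMPLE, exclude=("/junk1",)))
```

Because the list is rebuilt from the file on every run, a filesystem added later is included without anyone having to remember it.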

Oracle has a similar file in Unix, called oratab, which can be used to list all Oracle instances on your server.[*] Windows stores this information in the registry, of course. You can use oratab to list all instances that need backing up. Unfortunately, Informix and Sybase databases have no such file unless you manually make one. I do recommend making such a file for many reasons. It is much easier to standardize system startup and backups when you have such a file. If you design your startup scripts so that a database does not get started unless it is in this file, you can be reasonably sure that any databases that anyone cares about will be in this file. This means, of course, that any important databases are backed up without any manual intervention from you. It also means that you can use the same Informix and Sybase startup scripts on every system, instead of having to hardcode each database's name into the startup scripts.

[*] You can install an Oracle instance without putting it in this file. However, that instance will not get started when the system reboots. This usually means that the DBA will take the time to put it in this file. More on that in Chapter 15.
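A list of instances can be pulled out of oratab with very little code. The sketch below assumes the usual oratab entry layout of SID, Oracle home, and a Y or N startup flag separated by colons, with comment lines beginning with #. The function name and the sample entries are hypothetical, and the location of the file varies by platform.

```python
def instances_from_oratab(text):
    """Return (sid, oracle_home, autostart) tuples from oratab-style text."""
    instances = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 3:
            continue  # malformed entry; skip rather than guess
        sid, home, flag = fields[0], fields[1], fields[2]
        instances.append((sid, home, flag.strip().upper() == "Y"))
    return instances


# Made-up sample contents, for illustration only.
SAMPLE = """\
# sid:oracle_home:start_at_boot
PROD:/u01/app/oracle/product/10.2.0:Y
TEST:/u01/app/oracle/product/10.2.0:N
"""

for sid, home, autostart in instances_from_oratab(SAMPLE):
    print(sid, home, autostart)
```

The same parser would work for a hand-made file of Informix or Sybase instances, provided that file follows the same colon-separated layout.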

How do you know what systems to back up? Although I never got around to it, one of the scripts I always wanted to write was one that monitored the various host databases, looking for new systems. I wanted to get a complete list of all hosts from Domain Name System (DNS) and compare it against a master list. Once I found a new IP address, I would try to determine if the new IP address was alive. If it was alive, that would mean that there was a new host that possibly needed backing up. This would be an invaluable script; it would ensure there aren't any new systems on the network that the backups don't know about. Once you found a new IP address, you could use nmap to find out what type of system it is. With OS detection enabled, nmap sends a series of specially crafted probe packets to the IP address and compares the host's responses against a database of known fingerprints to guess which operating system it is running.

Some commercial data protection management software packages now support this functionality.
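The core of the host-discovery script described above is a set comparison followed by a liveness check. The following is a minimal sketch, not the script the author had in mind: the function name is hypothetical, the addresses come from a documentation-only range, and the liveness check is passed in as a function so the comparison logic does not depend on any particular ping or nmap invocation.

```python
def find_new_hosts(dns_hosts, master_list, is_alive):
    """Return hosts that DNS lists, the backup master list lacks, and that respond.

    `is_alive` is a caller-supplied function (for example, one that pings
    the address). It is injected so this logic can be exercised without
    touching the network.
    """
    unknown = sorted(set(dns_hosts) - set(master_list))
    return [host for host in unknown if is_alive(host)]


# Illustrative data only; the liveness check here is a stand-in.
dns_hosts = ["192.0.2.10", "192.0.2.11", "192.0.2.12", "192.0.2.13"]
master_list = ["192.0.2.10", "192.0.2.11"]
responding = {"192.0.2.12"}

print(find_new_hosts(dns_hosts, master_list, lambda host: host in responding))
```

Anything the function returns is a live host that the backup system has never heard of and that deserves a closer look.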

2.3.4. Back Up All or Part of the System?

Assuming you've covered things that are not covered by normal system backups, you are now in a position to decide whether you are going to back up your entire systems or just selected drives or filesystems from each system. These are definitely two different schools of thought. As far as I'm concerned, there are too many gotchas in the selected-filesystem option. Backing up everything is easier and safer than backing up from a list. You will find that most books stop right there and say "It's best to back up everything, but most people do something else." You will not see those words here. I think that not backing up everything is very dangerous. Consider the following comparison between the two methods.

2.3.4.1. Backing up only selected drives or filesystems

Here are the arguments for and against selective backups.

Save media space and network traffic.

The first argument that is typically stated as a plus to the selected-filesystem method is that you back up less data. People of this school recommend having two groups of backups: operating system data and regular data. The idea is that the operating system backups would be performed less often. Some would even recommend that they be performed only when you have a significant change, such as Windows security patches, an operating system upgrade, a patch installation, or a kernel rebuild. You would then back up your "regular" data daily.

The first problem with this argument is that it is outdated; just look at the size of the typical modern system. The operating system/data ratio is now significantly heavier on the data side. You won't be saving much space or network traffic by not backing up the OS even on your full backups. When you consider incremental backups, the ratio gets even smaller. Operating system partitions have almost nothing of size that would be included in an incremental backup, unless it's something important that should be backed up! This includes Unix, Linux, and Mac OS files such as /etc/passwd, /etc/hosts, syslog, /var/adm/messages, and any other files that would be helpful if you lost the operating system. It also includes the Windows registry. Filesystem swap is arguably the only completely worthless information that could be included on the OS disk, and it can be excluded with proper use of an exclude list.

Harder to administer.

Proponents of piecemeal backup would say that you can include important files such as the preceding ones in a special backup. The problem with that is it is so much more difficult than backing up everything. Assuming you exclude configuration files from most backups, you have to remember to do manual backups every time you change a configuration file or database. That means you have to do something special when you make a change. Special is bad. If you just back up everything, you can administer systems as you need to, without having to remember to back up before you change something.

Easier to split up between volumes.

One of the very few things that could be considered a plus is that if you split up your drives or filesystems into multiple backups, it is easier to split them between multiple volumes. If a backup of your system does not fit on one volume, it is easier to automate it by splitting it into two different include lists. However, in order to take advantage of this, you have to use include lists rather than exclude lists, and then you are subject to the limitations discussed earlier. You should investigate whether your backup utility has a better way to solve this problem.

Easier to write a script to do it than to parse out the fstab, oratab, or Windows registry.

This one is hard to argue against. However, if you do take the time to do it right the first time, you never need to mess with include lists again. This reminds me of another favorite phrase of mine: "Never time to do it right, always time to do it over." Take the time to do it right the first time.

The worst that happens? You overlook something!

In this scenario, the biggest benefits are that you save some time spent scripting up front, as well as a few bytes of network traffic. The worst possible side effect is that you overlook the drive or filesystem with your boss's budget that just got deleted.

2.3.4.2. Backing up the entire system

The pros for backing up the entire system are briefer yet far more compelling:

Complete automation.

Once you go through the trouble of creating a script or program that works, you just need to monitor its logs. You can rest easy at night knowing that all your data is being backed up.

The worst that happens? You lose a friend in the network department.

You may increase your network traffic by a few percentage points, and the people looking after the wires might not like that. (That is, of course, until you restore the server where they keep their DNS source database.)

Backing up selected drives or filesystems is one of the most common mistakes that I find when evaluating a backup configuration. It is a very easy trap to fall into because of the time it saves you up front. Until you've been bitten though, you may not know how much danger you are in. If your backup setup uses include lists, I hope that this discussion convinces you to rethink that decision.