Section 13.3 Switch to Auxiliary Control (Hot Backups) | Real World Linux Security Prentice Hall Ptr Open Source Technology Series

13.3 Switch to Auxiliary Control (Hot Backups)

In many cases, it is well worth the money to have backup systems. As I write this, my beloved eBay (where I bought my Rolls-Royce at auction in 1999 from www.pawnbroker.com) has been down most of Friday and again this weekend. On Friday, eBay's stock fell by 9 percent, causing a loss to stockholders of about $180 million! This was the biggest decline on NASDAQ that day. A backup system (which they did not have) would have been much less expensive. eBay later reported that the system went down because of a software upgrade by Sun Microsystems that did not work. The lesson and solution are the same, however: Have a backup system! Also, test upgrades carefully before implementing them.

13.3.1 Which Systems Should Have Backup Systems?

Simply put, a system should have a backup system if the cost of the backup is less than the consequences of not having one. In some cases, multiple backup systems for redundancy would be appropriate. To compute the cost of downtime, start with the value of sales profits going through your site per day and double this to assume being down for two days. Then estimate what percentage will be lost and multiply. The percentage lost typically will be 10 to 50 percent. Add to this the costs associated with the estimated 20 to 60 percent of users who attempt to use your site during those two days, find your competitor's URL instead, and permanently switch to your competitor. Government departments risk getting their funding cut because the value of their public good is considered to be diminished. Do not forget the cost of the bad publicity.

Add in the cost of your people who use the system not being able to work, allowing for the company's cost for a person typically being double her gross salary. A decent small backup server can be built for $600. At roughly 2,000 working hours per year, this equates to only about half an hour of lifetime downtime for a 20-person engineering department with average salaries of $60,000. Furthermore, there are hard-to-determine costs of downtime, such as missed deadlines and employee frustration resulting in lower morale or higher turnover.
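The estimate described above can be sketched with integer shell arithmetic. This is only a rough illustration: the function names are made up for this example, the $10,000 daily profit and 30 percent loss are hypothetical figures, and the 2,000 working hours per year is an assumption rather than a figure from the text.

```shell
#!/bin/sh
# Rough downtime-cost estimate using integer shell arithmetic.
# All figures passed in are the caller's own estimates.

# sales_loss DAILY_PROFIT DAYS_DOWN PERCENT_LOST
sales_loss() {
    echo $(( $1 * $2 * $3 / 100 ))
}

# staff_cost_per_hour HEADCOUNT AVG_SALARY [WORK_HOURS_PER_YEAR]
# Doubles gross salary to approximate the company's full cost per person.
staff_cost_per_hour() {
    hours=${3:-2000}   # assumption: about 2,000 working hours per year
    echo $(( $1 * $2 * 2 / hours ))
}

# Hypothetical site: $10,000 daily profit, two days down, 30 percent lost.
sales_loss 10000 2 30          # prints 6000
# 20 engineers averaging $60,000 in gross salary.
staff_cost_per_hour 20 60000   # prints 1200
```

Comparing these numbers with the price of a backup server gives a first approximation only; lost customers, bad publicity, and morale are not captured.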

If you work for a large company or agency, expect your downtime to be national news. As I write this, the current news includes crackers taking down the Web sites for the White House (which uses SGI UNIX), the Federal Bureau of Investigation, and the Department of Energy. As previously mentioned, the eBay site has been down due to software or hardware problems. These, too, are reasons for having backup systems.

If your company's stock is sensitive to bad news, try to estimate the effect and multiply the per-share effect by the number of shares outstanding. For agencies, similar estimates of funding should be done.

Having a backup system does not mean doubling your costs. Because the backup system is intended to be used only for a few hours or a few days, slower performance usually is acceptable, so it does not need to be as "big" or as expensive as the main system. Additionally, the backup system might not need to support less critical or less time-critical applications, reducing the "size" needed.

13.3.2 The Two Types of Backup Systems

Many people are familiar with a backup system that is used to take over if the primary system has a hardware failure. I will call this hardware backup or HardBack. Typically, a hardware backup system will be "online" with identical hardware and software and an up-to-date (or almost up-to-date) copy of databases, ready to spring into action when required. A backup system intended to take over when the primary system's security gets breached is different. I call this security backup system a SecBack.

Clearly, a HardBack, being a duplicate of the primary system, can be broken into as easily as the primary system.

This author enjoyed the fruits of this while being a Computer Science student and gray hat at U.C. Berkeley. The computer center got a new computer system. Rather than bothering with a full UNIX installation from tape, they copied the disks from a running system that we had left Trojan horses in.

We did not know that that was how they set the new system up until we noticed that the security holes we added to an existing system worked on the brand-new system! (The existing system was the Cory Hall PDP-11/70 where the early Berkeley UNIX work was done by Ken Thompson, Bill Joy, Chuck Haley, Jeff Schriebman, and others. It was the first PDP-11/70 to run UNIX because it was the one that Ken ported UNIX to from the PDP-11/45 during my freshman year.)

13.3.3 Security Backup System Design

So how can you prevent crackers from breaking into your SecBack, realizing that if you are switching to it, your primary system (which has a similar configuration) has been broken into? There are no guarantees here, just probabilities.

Keep the backup system "off the network" until it is needed. Thus, however long it takes the crackers to break into a system, you will have that much time after you Switch to Auxiliary Control to find the problem and clean up your primary system.
"Off the network" might mean no network services except SSH to only a few accounts with different passwords from other systems and also requiring authentication key verification. These passwords need to be hard to guess. A firewall or TCP Wrappers should be used to limit which systems may connect to it.
Preferably, "off the network" means physically disconnected from the network. For very high security, the SecBack should be in a separate room or building with different door keys and possibly even different personnel.
Use different passwords on the backup system, in case the break-in was by cracking a password. At least they will need to start over and this probably will take hours or days.
Limit services to the bare minimum. It is reasonably likely that they used a less critical service to break in.
Limit hours of operation. Many systems are used only during business hours. Crackers know that most systems are running at 3 A.M., but not monitored then. While your SecBack is supplying primary services, shut it down (or disconnect it from the network) during off hours.
Monitor the backup system more carefully after switching to it. You probably will be doing this anyway. With careful monitoring, most attempts to crack a system can be discovered before they are successful.
Plan in advance the ways to alter the amount of monitoring. The Deception Tool Kit (DTK) or the Cracker Trap especially can be helpful here, because they will detect probing of unused ports by crackers. Add crontab entries to e-mail log files to yourself every few hours to spot attempts to break in.
Even better, adjust /etc/syslog.conf to forward log entries to another system with even tighter security and fewer network services. Any old 486 gathering dust could be drafted for this purpose, because performance will not be an issue.
Run Tripwire frequently to detect whether the system has been altered.
Use different software. Use slightly different versions of Linux and "add-on" software. Have the backup system be one or two revisions behind (but with security patches applied), in case the crackers discover a hole in the new version before you can patch it.
If your primary system runs Slackware, consider running Red Hat on your backup system (or vice versa) to prevent a distribution-specific vulnerability from also taking down your backup system. This strategy may be applied to Web Servers, Database software, etc.
Try to repair and secure your primary system as soon as possible and switch back to it.
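The suggestion in the list above to e-mail log files to yourself from cron might be sketched as follows. The function name, the search pattern, the install path, and the mail address are all illustrative assumptions, and the sketch assumes a mail(1) command is available.

```shell
#!/bin/sh
# suspicious_lines LOGFILE [MAX_LINES]
# Prints recent log lines that hint at break-in attempts.
# The pattern is an illustrative assumption; adjust it to your own logs.
suspicious_lines() {
    tail -n "${2:-500}" "$1" | grep -i -E 'fail|refused|denied|invalid' || true
}

# Hypothetical crontab entry (every three hours), assuming this file is
# installed as /usr/local/lib/secback-log.sh:
#   0 */3 * * * . /usr/local/lib/secback-log.sh; suspicious_lines /var/log/messages | mail -s "SecBack log excerpt" admin@example.com
```

For the log-forwarding variant, a classic /etc/syslog.conf needs only a line such as `*.* @loghost`, where loghost is a placeholder for the separate logging system.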

13.3.4 Keeping the Security Backup System Ready

You could keep a HardBack ready by updating its database (or equivalent) from the primary system. However, for a Security Backup System (SecBack) this could back up compromised or corrupted data. For a financial system, this could allow the crackers to steal vast sums of money.

There is no simple universal solution to this problem. You might start with daily backups of the database to the SecBack. If performance is not too much of a problem, backups during the business day or shortly after its close are preferable to late at night. This is because you will have people in the office and they are more likely to discover cracking attempts quickly. Also, crackers tend to work at night.

For Web servers that just provide fixed pages to browsers and allow users to generate e-mail, the data on disk does not change much and so there is not a problem of "keeping the SecBack's data up-to-date." Had the White House or FBI followed this strategy on their Web sites, they would not have had the embarrassingly lengthy downtimes following their sites being cracked. The use of a source code control system, such as CVS, Perforce, or RCS, is suggested to detect both unintentional and malicious changes. Its use also allows the quick recreation of the tree. Some sites also use a source code control system to manage the system's configuration files, such as /etc/passwd, /etc/hosts, and /etc/sendmail.cf. Some might prefer to use tar to store snapshots of these files.
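The tar alternative mentioned above might look like the following sketch. The function name and the destination directory are hypothetical, and the archive naming scheme is just one possible choice.

```shell
#!/bin/sh
# snapshot_configs DEST_DIR FILE...
# Stores a dated tar snapshot of the named files and prints the archive path.
snapshot_configs() {
    dest=$1
    shift
    archive="$dest/config-$(date +%Y%m%d-%H%M%S).tar.gz"
    tar -czf "$archive" "$@" && echo "$archive"
}

# Example (hypothetical destination directory):
#   archive=$(snapshot_configs /var/backups/secback /etc/passwd /etc/hosts /etc/sendmail.cf)
# GNU tar can later compare the snapshot with the files on disk:
#   tar -dzf "$archive" -C /
```

Snapshots kept on the machine they describe can be altered by an intruder, so copying them to the SecBack or to removable media is the safer choice.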

If some of the Web page forms invoke CGI programs that affect the disk (by taking sales orders, etc.), you could isolate the CGIs on other computers. This is simple to do. In the ACTION attribute of the form's FORM tag, simply specify the CGI's URL as being on the other computer. This would allow the SecBack to be deployed immediately if the Web server is cracked.

In extreme cases, you could simply disable the normal Order Entry processing. You might have the SecBack instead generate e-mail to your order processing folks from an HTML form where the customers could supply their name, address, and items to buy (instead of the normal fancy processing). Use https if you can. If even this is not possible, provide alternate Web pages with a message saying that the Order Entry system is down temporarily and giving a toll-free phone number where customers can place a telephone order.

Because some "Web types" have a strong preference for operating over the Web, have this page also provide a form where the user can enter an e-mail address where she can be notified when the system is back to normal operation. Consider offering a discount or small gift certificate to users who are inconvenienced by this problem. This author has seen the excitement of an Amazon customer receiving $5 for being inconvenienced by downtime. Amazon has paid out, perhaps, $15 to this customer, but she does thousands of dollars of business with them annually.

You might have unchanging data in partitions of a disk physically wired to be Read/Only. This will block vulnerabilities that would allow the data to be altered but are unable to cause the programs to look elsewhere for it (such as to /tmp). This strategy can be used on the normal server as well as the SecBack. You will need to mount these partitions Read/Only to keep the kernel from attempting to write inode data with updated file access times and generating write errors. This is covered in detail in Part I as a normal strategy. Even if the physical Read/Only option does not make sense for normal operations, it may for the SecBack.
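A small check that a partition really is mounted Read/Only can be sketched as below. The function name, device, and mount point are hypothetical; the sketch assumes the usual /proc/mounts layout of device, mount point, type, and options.

```shell
#!/bin/sh
# is_read_only MOUNT_POINT [MOUNTS_FILE]
# Succeeds if MOUNT_POINT is listed with the "ro" option.
# MOUNTS_FILE defaults to /proc/mounts.
is_read_only() {
    awk -v mp="$1" '
        $2 == mp { found = 1; ro = ($4 ~ /(^|,)ro(,|$)/) }
        END { exit !(found && ro) }
    ' "${2:-/proc/mounts}"
}

# Mounting and checking a partition (device and mount point are hypothetical):
#   mount -o ro /dev/hdb1 /data
#   is_read_only /data && echo "/data is read-only"
```

Running such a check from a boot script or cron job would catch a partition that was accidentally remounted read-write.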

If your disk hardware does not offer a jumper scheme to enable Read/Only mode, downloading the IDE or SCSI specification and building a custom cable or modifying the existing cable to disable the WRITE signal wire will work. You will want to provide the "do not write" voltage level to this wire, possibly through an appropriate resistor.

Some useful URLs for High Availability Linux are listed here. (You type the dash in High-Availability.)

http://linux-ha.org/

http://linux-ha.org/failover/

http://fake.sourceforge.net/

http://ibiblio.org/pub/Linux/ALPHA/linux-ha/High-Availability-HOWTO.html^[1]

^[1] This document might have moved from this directory by the time you read this.

http://directory.google.com/Top/Computers/Software/Operating_Systems/Linux/Hardware_Support/High_Availability/

13.3.5 Checking the Cache

There are two data items about your server that are cached by other systems (other than the state of active connections). These are the server's numeric IP address and Ethernet MAC address. Because the IP address is cached by name servers around the world it is best to have your SecBack use the same address as your primary server. Either leave it at this IP address when standing by but with the cable disconnected or change it to this address when deployed to take over. If you will be changing its IP address when deployed, carefully test all the services that you will be offering because many server programs are not designed to work after the system's IP address is changed "underneath them." Restarting these servers or rebooting might be needed.

The Ethernet MAC address (Ethernet address) is cached only by systems on the LAN segment. If the primary system will be disconnected such that its Ethernet card (NIC) cannot transmit data after the failover, the easiest solution would be to have the SecBack system simply change the MAC address that its card uses. Almost all cards support this capability. The following examples will use the eth0 interface. The following command will set the MAC address:

 ifconfig eth0 hw ether 00:81:43:07:07:07
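The command above can be wrapped in a small helper that validates the address first. This is a sketch under assumptions: the function names are made up, and taking the interface down before the change is a precaution for drivers that refuse to change the address while the interface is up.

```shell
#!/bin/sh
# valid_mac ADDRESS: succeeds if ADDRESS is six colon-separated hex octets.
valid_mac() {
    printf '%s\n' "$1" | grep -E -q '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'
}

# set_mac INTERFACE ADDRESS (hypothetical helper; needs root)
# Takes the interface down first, since some drivers refuse to change
# the address while the interface is up.
set_mac() {
    valid_mac "$2" || { echo "bad MAC address: $2" >&2; return 1; }
    ifconfig "$1" down &&
    ifconfig "$1" hw ether "$2" &&
    ifconfig "$1" up
}

# Example: set_mac eth0 00:81:43:07:07:07
```

Bringing the interface down may also drop routes that go through it, so check the routing table (including the default route) after the change.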

If you do not want to use the same MAC address as the primary server, you will need to inform the other systems on the Ethernet segment that matter. You trigger this by issuing the following commands on the SecBack for each client that needs to know your new MAC address. The commands cause the SecBack to issue an ARP request to each client, and it is this request that causes the client to cache your new MAC address, as required by RFC 826.

 arp -i eth0 -d cli.pentacorp.com
 ping -c 1 cli.pentacorp.com &

The -c 1 does only a single ping, which forces the ARP protocol to be run. The "&" runs ping in the background, which prevents a hang if the client system cannot be reached. Windows 3.1 systems and those with some DOS IP stacks will not update due to bugs and might need rebooting. If those remote systems have the server's ARP entry set to permanent in their cache, it will need to be deleted explicitly using arp's -d flag with your host name or IP address specified. (You might want to have the /etc/ethers file on each of them updated or they will lose connectivity to the server upon reboot.) Linux systems normally cache ARP addresses for 60 seconds. Reading or writing the following file will get or set this value:

 /proc/sys/net/ipv4/neigh/eth0/gc_stale_time
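When several clients need to learn the new MAC address, the arp and ping pair shown earlier can be run in a loop. The following is a sketch only: the function name and the DRY_RUN switch are inventions for this example, and the host names are placeholders.

```shell
#!/bin/sh
# refresh_clients INTERFACE HOST...
# For each client, drop its ARP entry and ping it once in the background.
# Set DRY_RUN=1 to print the commands instead of running them.
refresh_clients() {
    iface=$1
    shift
    for host in "$@"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "arp -i $iface -d $host"
            echo "ping -c 1 $host &"
        else
            arp -i "$iface" -d "$host"
            ping -c 1 "$host" >/dev/null 2>&1 &
        fi
    done
}

# Example (placeholder host names):
#   refresh_clients eth0 cli1.example.com cli2.example.com
# Reading and setting the stale time in seconds (30 is an arbitrary value):
#   cat /proc/sys/net/ipv4/neigh/eth0/gc_stale_time
#   echo 30 > /proc/sys/net/ipv4/neigh/eth0/gc_stale_time
```

Running it once with DRY_RUN=1 shows exactly what would be executed before anything touches the ARP cache.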

Whatever auxiliary control method is used should be tested periodically and the hardware certainly should be checked at least weekly to detect whether it has failed.

Also see MAC in the index.

13.3.6 Brother, Can You Spare a Disk?

If your budget or time does not allow separate SecBack systems, another possibility would be a backup disk for the primary system that remains physically disconnected until deployed but configured to have the same unit number as the primary disk. To deploy, bring the system down as quickly as possible. Because it is presumed corrupt anyway, for many sites issuing a sync command, waiting a second and powering off will do. Then unplug the primary disk, plug in the SecBack disk, and power up. Although slightly slower to recover this way, it still could be accomplished in two minutes plus "sprint time." The backup disk will need to be synchronized to the primary disk periodically, depending on the type of data that the server handles. This is far faster than recovering from backup tapes or CD-RWs.

If you are condemned to having only backup tapes or CD-RWs and no auxiliary control, be sure to have ready access to the most recent backup versions. Bank safety deposit boxes are not accessible 24 hours a day except with dynamite, and that might damage the tapes.
