13.3 Switch to Auxiliary Control (Hot Backups)

In many cases, it is well worth the money to have backup systems. As I write this, my beloved eBay (where I bought my Rolls-Royce at auction in 1999 from www.pawnbroker.com) has been down most of Friday and again this weekend. On Friday, eBay's stock fell by 9 percent, causing a loss to stockholders of about $180 million! This was the biggest decline on NASDAQ that day. A backup system (which they did not have) would have been much less expensive. eBay later reported that the system went down because of a software upgrade by Sun Microsystems that did not work. The lesson and solution are the same, however: Have a backup system! Also, test upgrades carefully before implementation.

13.3.1 Which Systems Should Have Backup Systems?

Simply, those systems for which the cost of a backup system is less than the consequences of not having a backup system should have backup systems. In some cases, multiple backup systems for redundancy would be appropriate.

To compute the cost of downtime, start with the value of sales profits going through your site per day and double this to assume being down for two days. Then estimate what percentage will be lost and multiply. The percentage lost typically will be 10 to 50 percent. Add to this the costs associated with the estimated 20 to 60 percent of users who attempt to use your site during those two days, find your competitor's URL, and permanently switch to your competitor. Government departments risk getting their funding cut because the value of their public good is considered to be diminished. Do not forget the cost of the bad publicity.

Add in the cost of your people who use the system not being able to work, allowing for the company's cost for a person typically being double her gross salary. A decent small backup server can be built for $600. This equates to five hours of lifetime downtime for a 20-person engineering department with average salaries of $60,000.
Furthermore, there are hard-to-determine costs of downtime such as lost deadlines, employee frustration resulting in lower morale, or higher turnover, etc.
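The lost-sales part of the estimate described above can be sketched as a few lines of shell arithmetic. This is only an illustration: the function name estimate_lost_sales and the $10,000-per-day profit figure are assumptions made up for the example, and the two-day default simply mirrors the rule of thumb in the text.

```shell
#!/bin/sh
# Rough lost-sales estimate: daily profit x days down x percentage lost.
# estimate_lost_sales is a hypothetical helper, not a standard tool.
# Usage: estimate_lost_sales DAILY_PROFIT PERCENT_LOST [DAYS_DOWN]
estimate_lost_sales() {
    daily_profit=$1
    percent_lost=$2
    days_down=${3:-2}    # the text assumes a two-day outage
    echo $(( daily_profit * days_down * percent_lost / 100 ))
}

# Hypothetical site earning $10,000 profit per day, using the
# 10 and 50 percent bounds suggested in the text.
echo "low: \$$(estimate_lost_sales 10000 10)"     # prints: low: $2000
echo "high: \$$(estimate_lost_sales 10000 50)"    # prints: high: $10000
```

Even the low figure for this made-up site is several times the price of the small backup server mentioned earlier, and it still leaves out lost customers, staff time, and bad publicity.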
If your company's stock is sensitive to bad news, try to estimate the effect and multiply the per-share effect by the number of shares outstanding. For agencies, similar estimates of funding should be done.

Having a backup system does not mean doubling your costs. Because the backup system is intended to be used only for a few hours or a few days, it does not need to be as "big" or as expensive as the main system; slower performance usually is acceptable. Additionally, the backup system might not need to support less critical or less time-critical applications, reducing the "size" needed.

13.3.2 The Two Types of Backup Systems

Many people are familiar with a backup system that is used to take over if the primary system has a hardware failure. I will call this hardware backup or HardBack. Typically, a hardware backup system will be "online" with identical hardware and software and an up-to-date (or almost up-to-date) copy of databases, ready to spring into action when required.

A backup system intended to take over when the primary system's security gets breached is different. I call this security backup system a SecBack. Clearly, a HardBack, being a duplicate of the primary system, can be broken into as easily as the primary system.
13.3.3 Security Backup System Design

So how can you prevent crackers from breaking into your SecBack, realizing that if you are switching to it, your primary system (which has a similar configuration) has been broken into? There are no guarantees here, just probabilities.
13.3.4 Keeping the Security Backup System Ready

You could keep a HardBack ready by updating its database (or equivalent) from the primary system. However, for a Security Backup System (SecBack) this could back up compromised or corrupted data. For a financial system, this could allow the crackers to steal vast sums of money. There is no simple universal solution to this problem.

You might start with daily backups of the database to the SecBack. If performance is not too much of a problem, backups during the business day or shortly after its close are preferable to late at night. This is because you will have people in the office and they are more likely to discover cracking attempts quickly. Also, crackers tend to work at night.

For Web servers that just provide fixed pages to browsers and allow users to generate e-mail, the data on disk does not change much and so there is not a problem of "keeping the SecBack's data up-to-date." Had the White House or FBI followed this strategy on their Web sites, they would not have had the embarrassing lengthy downtimes following their sites being cracked.

The use of a source code control system, such as CVS, Perforce, or RCS, is suggested to detect both unintentional and malicious changes. Its use also allows the quick recreation of the tree. Some sites also use a source code control system to manage the system's configuration files, such as /etc/passwd, /etc/hosts, and /etc/sendmail.cf. Some might prefer to use tar to store snapshots of these files.

If some of the Web page forms invoke CGI programs that affect the disk (by taking sales orders, etc.), you could isolate the CGIs on other computers. This is simple to do: In the form's FORM tag, set the ACTION attribute to the CGI's URL on the other computer. This would allow the SecBack to be deployed immediately if the Web server is cracked. In extreme cases, you could simply disable the normal Order Entry processing.
You might have the SecBack instead generate e-mail to your order processing folks from an HTTP form where the customers could supply their name, address, and items to buy (instead of the normal fancy processing). Use https if you can. If even this is not possible, provide alternate Web pages to put up a message saying that the Order Entry system is down temporarily and providing a toll-free phone number where customers can place a telephone order. Because some "Web types" have a strong preference for operating over the Web, have this page also provide a form where the user can enter an e-mail address where she can be notified when the system is back to normal operation.

Consider offering a discount or small gift certificate to users who are inconvenienced by this problem. This author has seen the excitement of an Amazon customer receiving $5 for being inconvenienced by downtime. Amazon has paid out, perhaps, $15 to this customer, but she does thousands of dollars of business with them annually.

You might have unchanging data in partitions of a disk physically wired to be Read/Only. This will block vulnerabilities that would allow the data to be altered but are unable to cause the programs to look elsewhere for it (such as to /tmp). This strategy can be used on the normal server as well as the SecBack. You will need to mount these partitions Read/Only to stop the kernel from attempting to write inode data with updated file access times and generating write errors. This is covered in detail in Part I as a normal strategy. Even if the physical Read/Only option does not make sense for normal operations, it may for the SecBack.
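As a small illustration of the Read/Only mounting step, the sketch below checks whether a partition really is mounted read-only before the SecBack goes live. The helper name is_readonly and the mount point /srv/www are assumptions invented for this example; the check reads the kernel's mount table in /proc/mounts, where the fourth field holds the mount options.

```shell
#!/bin/sh
# is_readonly is a hypothetical helper, not a standard command.
# Usage: is_readonly MOUNTPOINT [MOUNTS_FILE]
# Succeeds (exit status 0) if MOUNTPOINT is listed with the "ro" option.
# MOUNTS_FILE defaults to /proc/mounts; the last matching entry wins.
is_readonly() {
    awk -v mp="$1" '
        $2 == mp {
            found = 0
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro") found = 1
        }
        END { exit !found }
    ' "${2:-/proc/mounts}"
}

# Typical use (as root), with /srv/www standing in for your data partition:
#   mount -o remount,ro /srv/www
#   is_readonly /srv/www || echo "WARNING: /srv/www is still writable" >&2
```

A check like this is cheap enough to run from a deployment script or from cron, so that a partition accidentally remounted writable during maintenance does not go unnoticed.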

Some useful URLs for High Availability Linux are listed here. (You type the dash in High-Availability.)

http://linux-ha.org/
http://linux-ha.org/failover/
http://fake.sourceforge.net/
http://ibiblio.org/pub/Linux/ALPHA/linux-ha/High-Availability-HOWTO.html
http://directory.google.com/Top/Computers/Software/Operating_Systems/Linux/Hardware_Support/High_Availability/

13.3.5 Checking the Cache

There are two data items about your server that are cached by other systems (other than the state of active connections). These are the server's numeric IP address and Ethernet MAC address. Because the IP address is cached by name servers around the world, it is best to have your SecBack use the same address as your primary server. Either leave it at this IP address when standing by, but with the cable disconnected, or change it to this address when deployed to take over. If you will be changing its IP address when deployed, carefully test all the services that you will be offering because many server programs are not designed to work after the system's IP address is changed "underneath them." Restarting these servers or rebooting might be needed.

The Ethernet MAC address (Ethernet address) is cached only by systems on the LAN segment. If the primary system will be disconnected such that its Ethernet card (NIC) cannot transmit data after the failover, the easiest solution would be to have the SecBack system simply change the MAC address that its card uses. Almost all cards support this capability. The following examples will use the eth0 interface. The following command will set the MAC address:

    ifconfig eth0 hw ether 00:81:43:07:07:07

If you do not want to use the same MAC address as the primary server, you will need to inform the other systems on the Ethernet segment that matter. You trigger this by issuing the following commands on the SecBack for each client that needs to know your new MAC address. These commands cause the SecBack to issue an ARP request to each client, which in turn causes the client to cache your new MAC address, as required by RFC 826.

    arp -i eth0 -d cli.pentacorp.com
    ping -c 1 cli.pentacorp.com &

The -c 1 does only a single ping, which forces the ARP protocol to be run.
The "&" will prevent a hang if the client system cannot be reached. Windows 3.1 systems and those with some DOS IP stacks will not update due to bugs and might need rebooting. If those remote systems have the server's ARP entry set to permanent in their cache, it will need to be deleted explicitly using arp's -d flag with your host name or IP address specified. (You might want to have the /etc/ethers file on each of them updated or they will lose connectivity to the server upon reboot.)

Linux systems normally cache ARP addresses for 60 seconds. Reading or writing the following file will get or set this value:

    /proc/sys/net/ipv4/neigh/eth0/gc_stale_time

Whatever auxiliary control method is used should be tested periodically, and the hardware certainly should be checked at least weekly to detect whether it has failed. Also see MAC in the index.

13.3.6 Brother, Can You Spare a Disk?

If your budget or time does not allow separate SecBack systems, another possibility would be a backup disk for the primary system that remains physically disconnected until deployed but is configured to have the same unit number as the primary disk. To deploy, bring the system down as quickly as possible. Because it is presumed corrupt anyway, for many sites issuing a sync command, waiting a second, and powering off will do. Then unplug the primary disk, plug in the SecBack disk, and power up. Although slightly slower to recover this way, it still could be accomplished in two minutes plus "sprint time."

The backup disk will need to be synchronized to the primary disk periodically, depending on the type of data that the server handles. Even so, swapping disks is far faster than recovering from backup tapes or CD-RWs.
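
The periodic synchronization mentioned above might look something like the following sketch. It is only one possible approach and rests on several assumptions not stated in the text: that rsync is installed, that the spare disk has been temporarily attached and mounted at a hypothetical /mnt/secback, and that you believe the primary to be clean at the time (otherwise you copy the compromise along with the data). The helper sync_secback_cmd is invented for this example and only prints the command so that it can be reviewed before being run as root.

```shell
#!/bin/sh
# sync_secback_cmd is a hypothetical helper, not a standard command.
# Usage: sync_secback_cmd MOUNTPOINT
# Prints (does not run) an rsync command that mirrors the root
# filesystem onto the spare disk mounted at MOUNTPOINT.
#   -a                  preserve permissions, ownership, and times
#   -H                  preserve hard links
#   --delete            remove files on the spare that are gone from the primary
#   --one-file-system   do not descend into /proc, other mounts, or the
#                       spare disk itself
sync_secback_cmd() {
    target=${1:?usage: sync_secback_cmd MOUNTPOINT}
    echo "rsync -aH --delete --one-file-system / ${target%/}/"
}

# Example: review the command, then run it by hand as root.
sync_secback_cmd /mnt/secback
```

The spare disk also needs a working boot loader of its own before it can stand in for the primary, and it should be unmounted and physically disconnected again once the copy completes.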