13.2. Constant Availability

The most probable causes of data loss are hacker attacks and equipment failures. In the first case, data can be restored by simply replacing the destroyed files with the backup copies; in the second, the equipment may have to be replaced and the system may need to be installed from the ground up. To keep the restoration process short, it is best to keep spares of the parts most likely to fail: hard drive, memory, motherboard, and processor.

If your network cannot tolerate even a minute of server downtime, you can either build a cluster of servers or maintain backup servers.

Building a server cluster is the more reliable choice. In this case, if one of the cluster servers fails, its workload is picked up by another server in the cluster. This makes almost uninterrupted system operation achievable. But building a server cluster from scratch is a rather complex and expensive task; therefore, companies look for other, less expensive ways to make their data secure.

Most industrial software already offers cluster operation tools, which make clustering easier and less expensive to set up. One of the network servers is assigned the role of master, with one or more other servers being slaves. The master server regularly announces its operating status to the network; it also sends information about database changes to the slave servers so that all servers hold an identical copy of the database. If the master server fails, the slave servers take over the operation.
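The failure detection described above can be sketched in a few lines. This is only an illustration of the idea, not the mechanism of any particular clustering product: the class name HeartbeatMonitor, its methods, and the timeout value are all hypothetical. The sketch assumes that the master's periodic status message (often called a heartbeat) reaches the slave by some means not shown here, and that the slave promotes itself once the messages stop arriving for longer than the timeout.

```python
import time


class HeartbeatMonitor:
    """Runs on a slave server and tracks status messages from the master.

    Hypothetical sketch: how the message is actually received (socket,
    multicast, serial link) is outside the scope of this example.
    """

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock                  # injectable so it can be tested
        self.last_heartbeat = clock()       # assume the master is alive at start
        self.role = "slave"

    def record_heartbeat(self):
        """Call whenever a status message from the master arrives."""
        self.last_heartbeat = self.clock()

    def check(self):
        """Promote this server to master if the master has been silent too long."""
        silent_for = self.clock() - self.last_heartbeat
        if self.role == "slave" and silent_for > self.timeout_seconds:
            self.role = "master"
        return self.role
```

A slave would call record_heartbeat() each time a status message arrives and call check() on a timer. Real cluster software must also guard against two servers believing they are the master at the same time, which this sketch does not address.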

In addition to enhanced reliability, clusters may enhance productivity if all servers work in parallel and the slave servers handle part of the workload. This makes for more efficient equipment and network bandwidth use.
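One common way to share the workload is to send every change to the master while spreading read requests across all servers in turn. The following is a minimal sketch of that idea under stated assumptions: the class RequestRouter and the server names are hypothetical, and the sketch assumes the slaves hold an up-to-date copy of the database so that any of them can answer a read.

```python
import itertools


class RequestRouter:
    """Sends writes to the master and rotates reads over all servers."""

    def __init__(self, master, slaves):
        self.master = master
        # Reads cycle through the master and every slave in a fixed order.
        self._readers = itertools.cycle([master] + list(slaves))

    def route(self, is_write):
        """Return the name of the server that should handle the request."""
        if is_write:
            return self.master      # only the master accepts changes
        return next(self._readers)  # round-robin for read requests
```

For example, with a master named "db1" and slaves "db2" and "db3", successive read requests would go to db1, db2, db3, and then db1 again, while every write would go to db1.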

A less expensive way is to use reserve servers equipped with a Redundant Array of Independent Disks (RAID). In this case, the hard drives of a server are organized into a mirrored RAID array, that is, RAID 1 or RAID 1+0. Here, data are protected by the RAID system, which writes every piece of data to two hard drives in parallel. If one of these drives fails, the second drive still holds a complete copy and continues to serve the data.
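The mirroring principle can be illustrated with a toy model. This is not how a RAID controller or driver is implemented; it is a simplified sketch in which each "drive" is an in-memory dictionary, and the class name MirroredVolume and its methods are invented for the illustration. It also does not model rebuilding the mirror after a failed drive is replaced.

```python
class MirroredVolume:
    """Toy model of RAID 1: every block is written to all healthy drives."""

    def __init__(self, drive_count=2):
        # Each drive maps a block number to the data stored in that block.
        self.drives = [dict() for _ in range(drive_count)]
        self.failed = set()

    def fail_drive(self, index):
        """Simulate a hardware failure of one drive."""
        self.failed.add(index)

    def write(self, block_number, data):
        healthy = [i for i in range(len(self.drives)) if i not in self.failed]
        if not healthy:
            raise IOError("all drives in the mirror have failed")
        for i in healthy:
            self.drives[i][block_number] = data   # same data on every drive

    def read(self, block_number):
        # Any healthy drive can answer, because each holds a full copy.
        for i, drive in enumerate(self.drives):
            if i not in self.failed:
                return drive[block_number]
        raise IOError("all drives in the mirror have failed")
```

After a write, calling fail_drive(0) and then reading the same block still returns the data, because the second drive holds an identical copy. Only when every drive in the mirror has failed does the read raise an error.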

But what if the motherboard or processor fails? Replacing these takes time, which in this scenario was declared unacceptable. To minimize the downtime in such a situation, a backup server with the same hardware configuration as the main one is maintained. When some hardware of the main server fails, simply connect the RAID array to the backup server and switch the network cable over to continue operating. Because the hardware of the backup server is the same as that of the main server, the RAID array will work on the backup server without forcing the administrator to edit the configuration files.

If there are several identically configured servers in your network, one backup server can be used for any of them. Assuring data safety in this way is much less expensive than building a server cluster.

I saw an interesting solution in this respect at one company. All client computers were equipped with a small hard drive holding only the operating system and the necessary utilities and application software. In addition, each of these computers was equipped with a large hard drive installed in a mobile rack, allowing the drive to be easily replaced. Every evening, the administrator removed the large hard drives and backed them up at his computer. In case of a hardware or software failure, the large hard drive would be connected to another computer prepared especially for this purpose.