23.1 An Introduction to High Availability


Let's get started with high-availability systems. The first question is what high availability means.

23.1.1 What Is High Availability?

In business-critical applications, the systems a company relies on need to be up and running 24 hours a day, 7 days a week. Every minute you lose costs money. Imagine a call center where dozens of people are working around the clock. Every minute people cannot work with the system costs money and annoys your customers, which costs money as well.

High-availability systems reduce downtime. Availability is usually expressed as the percentage of time a system is up. A few years ago downtime was not as critical as it is today because a lot of recovery work was done at night while most people were sleeping. This gave a company's IT people enough time to solve problems before the next working day. With the arrival of modern Web technologies, things have changed. There is no "silent night" any more because Web applications must be available at any time. Companies are global players, and European customers wouldn't be too happy if the site of an American vendor was down in the morning because it is night in the U.S. at that time. In other words, reliable systems are essential in modern IT business.

We still haven't answered the key question of this section: What is a high-availability system? There are several levels of availability, as shown in Table 23.1.

Table 23.1. Availability and Downtime
Type of System                    Uptime     Downtime per Year
General-purpose system            99%        87 hours 36 minutes
                                  99.5%      43 hours 48 minutes
Most high-availability systems    99.9%      8 hours 46 minutes
"Best" high-availability today    99.95%     4 hours 23 minutes
                                  99.99%     53 minutes
Continuous availability system    99.999%    5 minutes
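
The downtime column follows directly from the uptime percentage. The following Python sketch shows the arithmetic, assuming a 365-day year and rounding to whole minutes. The function names are made up for this illustration.

```python
from fractions import Fraction

# Assumption: a year has 365 days, so 525,600 minutes.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(uptime_percent: str) -> int:
    """Return the yearly downtime in whole minutes for an uptime such as "99.9".

    The percentage is passed as a string so that Fraction can represent it
    exactly and no floating-point error creeps into the result.
    """
    downtime_share = (100 - Fraction(uptime_percent)) / 100
    return round(downtime_share * MINUTES_PER_YEAR)


def format_downtime(minutes: int) -> str:
    """Format a number of minutes as hours and minutes."""
    hours, rest = divmod(minutes, 60)
    if hours:
        return f"{hours} hours {rest} minutes"
    return f"{rest} minutes"


for level in ("99", "99.5", "99.9", "99.95", "99.99", "99.999"):
    print(f"{level}%: {format_downtime(downtime_minutes_per_year(level))}")
```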

Ninety-nine percent sounds like a lot, but it isn't. It is comparable to losing about two minutes to a reboot after every three hours of work, and this is most likely not what you are looking for. Over a year, 99% availability means about 87 hours of downtime, or about 11 working days of 8 hours each, which is roughly two working weeks. In other words, if the downtime falls within working hours, raising the availability of your system can win back up to two weeks of work for every person who uses it. With 25 people, that adds up to about one person-year. These numbers show how much time you can save by using reliable software and hardware.
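
The estimate above can be written down as a small calculation. This is only a sketch: it assumes, as the text does, that all downtime falls within working hours and that a working day has 8 hours. The function names are hypothetical.

```python
def lost_working_days(downtime_hours: float, hours_per_day: float = 8.0) -> float:
    """Convert yearly downtime into working days lost per person."""
    return downtime_hours / hours_per_day


def lost_person_days(downtime_hours: float, staff: int,
                     hours_per_day: float = 8.0) -> float:
    """Working days lost per year across everybody who depends on the system."""
    return lost_working_days(downtime_hours, hours_per_day) * staff


# 99% availability means 87.6 hours of downtime per year.
print(lost_working_days(87.6))      # days lost per person
print(lost_person_days(87.6, 25))   # days lost by a team of 25
```

For a team of 25, the result is in the range of one person's working year, which is where the "save up to one person" figure comes from.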

Today only a few systems can achieve extremely high availability (> 99.99%). These are usually expensive but reliable and powerful machines from vendors such as IBM or SGI, used by huge companies to store extremely business-critical data. In many cases these systems run proprietary software on proprietary hardware, which makes them less likely to be hit by a "standard" virus. In addition, high-availability systems are protected by various security facilities. This matters because a successful attack causes downtime just as a hardware failure does.

23.1.2 Hardware Issues

To achieve a higher level of availability, it is important to use reliable and redundant hardware. Redundant means that every critical component is present at least twice. If one component fails, a second one takes over the work of the failed component. This way downtime can be minimized.
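
The effect of redundancy can be estimated with a simple formula. The sketch below assumes that the redundant components fail independently of each other and that one working component is enough, which is an idealization. The availability value used in the example is hypothetical.

```python
def redundant_availability(single: float, copies: int) -> float:
    """Availability of a group of identical, independent components.

    The group is down only if every copy is down at the same time,
    so the result is 1 - (1 - single) ** copies.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("at least one component is required")
    return 1.0 - (1.0 - single) ** copies


# Hypothetical example: a component that is up 99% of the time.
for copies in (1, 2, 3):
    print(copies, redundant_availability(0.99, copies))
```

Under these assumptions, doubling a 99% component yields roughly 99.99%. In practice the gain is smaller because failures are rarely fully independent and the switchover itself takes time.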

In the case of storage, building redundant systems often means using a Redundant Array of Independent Disks (RAID). A RAID system combines several hard disks into one redundant array. As long as no more disks fail than the chosen RAID level tolerates, the system keeps working as if no error had occurred. Hard disks are especially prone to failure because they consist of mechanical as well as electronic components. RAID systems reduce the risk that a single broken disk takes down the storage system.
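
How many failed disks an array survives depends on the RAID level. The following sketch lists the textbook values for four common levels. The table and function names are invented for this example, and real controllers may impose additional limits.

```python
# For each level: minimum number of disks, number of disk failures tolerated,
# and usable capacity measured in disks (n is the number of disks in the array).
RAID_LEVELS = {
    "RAID 0": (2, lambda n: 0, lambda n: n),          # striping, no redundancy
    "RAID 1": (2, lambda n: n - 1, lambda n: 1),      # every disk is a mirror
    "RAID 5": (3, lambda n: 1, lambda n: n - 1),      # single parity
    "RAID 6": (4, lambda n: 2, lambda n: n - 2),      # double parity
}


def describe_array(level: str, disks: int) -> tuple:
    """Return (failures tolerated, usable capacity in disks) for an array."""
    min_disks, tolerated, usable = RAID_LEVELS[level]
    if disks < min_disks:
        raise ValueError(f"{level} needs at least {min_disks} disks")
    return tolerated(disks), usable(disks)


def survives(level: str, disks: int, failed: int) -> bool:
    """True if the array still works after the given number of disk failures."""
    return failed <= describe_array(level, disks)[0]


print(describe_array("RAID 5", 4))
print(survives("RAID 5", 4, 2))
```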

One point that is often neglected is a redundant power supply. This means that if the main power supply goes down, your systems stay up and running. From time to time (let's say once a year if you do not live in California), electricity is turned off for a few minutes for maintenance purposes. During this time you need an independent power source, such as an uninterruptible power supply (UPS).

Another important issue is to have a backup server. If one machine fails, you can switch to a spare machine that does the job until the primary machine has been repaired.
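
Switching to a spare machine can be sketched as "use the first server that passes a health check". The example below is a minimal illustration, not a complete failover solution. The host names are placeholders, and a successful TCP connection is assumed to be a good enough sign of health, which a production setup would want to refine.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

Server = Tuple[str, int]  # (host name, TCP port)


def tcp_port_open(server: Server, timeout: float = 2.0) -> bool:
    """Simple health check: can a TCP connection be opened to the server?"""
    try:
        with socket.create_connection(server, timeout=timeout):
            return True
    except OSError:
        return False


def pick_server(servers: Sequence[Server],
                is_healthy: Callable[[Server], bool] = tcp_port_open
                ) -> Optional[Server]:
    """Return the first healthy server, in order of preference, or None."""
    for server in servers:
        if is_healthy(server):
            return server
    return None


# Hypothetical hosts: the primary machine first, the spare second.
SERVERS = [("primary.example", 5432), ("spare.example", 5432)]
```

Calling pick_server(SERVERS) before opening a connection returns the primary machine as long as it responds and the spare machine otherwise. Keeping the data on the spare machine up to date is a separate problem that this sketch does not address.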

23.1.3 Network Issues

In addition to reliable hardware, a reliable network is essential. There is no point in connecting an expensive IBM zSeries machine to a dial-up connection to run a Web server. If you invest a lot of money in good, reliable hardware, you also have to keep an eye on the quality of your connections to the Internet. For real high-availability solutions, we recommend working with two independent lines. If one fails, the second line can still be used.

It is also important to have a firewall that protects your production system. Keep in mind that the firewall should be redundant as well. Otherwise a broken firewall can make your system unreachable, and the firewall itself becomes a single point of failure.
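
Why a single firewall or a single line limits the whole setup can be shown with a rough calculation. Components that all have to work form a series, and their availabilities multiply. Redundant components form a parallel group. The sketch assumes independent failures, and all numbers are hypothetical.

```python
import math


def series(*parts: float) -> float:
    """Availability of a chain in which every part has to work."""
    return math.prod(parts)


def parallel(*parts: float) -> float:
    """Availability of a group in which one working part is enough."""
    return 1.0 - math.prod(1.0 - p for p in parts)


# Hypothetical availabilities for one Internet line, one firewall, one server.
LINE, FIREWALL, SERVER = 0.99, 0.99, 0.9999

single_path = series(LINE, FIREWALL, SERVER)
redundant_path = series(parallel(LINE, LINE),
                        parallel(FIREWALL, FIREWALL),
                        SERVER)

print(f"single line and firewall:    {single_path:.4%}")
print(f"redundant line and firewall: {redundant_path:.4%}")
```

A chain is never more available than its weakest part, so an expensive server behind a single line and a single firewall cannot reach the availability the server alone would offer.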



PHP and PostgreSQL. Advanced Web Programming, 2002
Year: 2004
Pages: 201
