Hardware failure, data corruption, and physical site destruction all pose threats to a Web site that must be available nearly 100 percent of the time. You can enhance your site’s availability by first identifying services that must be available and then identifying the points at which those services can fail. Increasing availability also means reducing the probability of failure. Decisions about how far to go to prevent failures are based on a combination of your company’s tolerance for service outages, the budget, and your staff’s expertise. System availability depends on the hardware and software you choose and the effectiveness of your operating procedures. This lesson introduces you to three fundamental strategies that you can use to design a highly available site: developing operational procedures, ensuring adequate capacity, and reducing the probability of failure.
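To make "nearly 100 percent of the time" concrete, an availability target can be converted into the downtime it permits each year. The following is a minimal sketch; the function name and the sample targets are illustrative assumptions, not figures from this lesson.

```python
# Illustrative only: convert an availability target into permitted downtime.
HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years are ignored for simplicity


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the hours per year a service may be down at the given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    # Hypothetical targets chosen only to show how quickly the budget shrinks.
    for target in (99.0, 99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(target)
        print(f"{target}% available allows about {hours:.2f} hours "
              f"({hours * 60:.1f} minutes) of downtime per year")
```

Comparing the results for successive targets shows why the decision about how far to go depends on outage tolerance and budget: each additional "nine" cuts the permitted downtime by a factor of ten.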
You can design availability into a Web site by identifying services that must be available, determining where those services can fail, and then designing the services so that they continue to be available to customers even if a failure occurs. You can use three fundamental strategies to design a highly available site: developing operational procedures, ensuring adequate capacity, and reducing the probability of failure.
One of the most effective means of ensuring site availability can also be inexpensive to implement: creating well-documented, accurate, standardized operational procedures.
With the capacity of database systems often in the 100 GB range and higher, deploying proper data protection mechanisms is essential. This is becoming even more critical as databases approach the terabyte (TB) range and, in some cases, the petabyte (PB) range. In particular, RAID systems can enhance both the scalability and the performance of disk systems while simultaneously enhancing data integrity. Because the cost of disks is decreasing while the cost of downtime is increasing, redundant storage subsystems are even more attractive now than when they were introduced.
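The way parity-based RAID levels protect data can be illustrated with a few lines of code. The sketch below assumes a simplified model in which each disk holds one equal-length block and a single parity block is the byte-wise XOR of the data blocks; real RAID controllers add striping, rotation of parity, and error handling that are not shown here. The function names are hypothetical.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks: List[bytes], parity: bytes) -> bytes:
    """Recover the single lost data block from the survivors and the parity."""
    return xor_blocks(surviving_blocks + [parity])


if __name__ == "__main__":
    data = [b"AAAA", b"BBBB", b"CCCC"]   # blocks on three data disks
    parity = xor_blocks(data)            # block stored for parity
    # Simulate losing the second disk and rebuilding its contents.
    recovered = rebuild_missing([data[0], data[2]], parity)
    print(recovered == data[1])
```

Because XOR can recover only one unknown block, this model tolerates the loss of a single disk; losing a second disk before the first is rebuilt leaves the data unrecoverable.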
Consistent, detailed monitoring procedures are critical to deploying systems for high availability. First, you should restrict logical and physical access to servers. Second, you should monitor the system event log regularly so that failures and potential failures don’t go undetected. Implementing an infrastructure that continuously monitors all of your systems and the entire network provides the best means of preventing and detecting system failures. Devices on the Hardware Compatibility List (HCL) are required to use the event log to record problems. Many systems designed for maximum reliability are able to continue operation with a single failure, such as a failed disk in a RAID-5 volume. However, a second failure that occurs before the first is repaired can cause an outage and even loss of data, so the first failure must be detected and corrected promptly. You should set up automated procedures for alarm notification, such as pager notification of SNMP alarms.
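The idea of scanning logged events and raising an alarm automatically can be sketched as follows. This is a simplified illustration, not an event log or SNMP API: the `LogEvent` record, the severity names, and the `send` callback (which stands in for a pager or other notification channel) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical severity scale; higher numbers are more serious.
SEVERITY_RANK = {"information": 0, "warning": 1, "error": 2}


@dataclass(frozen=True)
class LogEvent:
    source: str    # component that reported the event
    severity: str  # one of the keys in SEVERITY_RANK
    message: str


def find_alerts(events: Iterable[LogEvent],
                minimum_severity: str = "warning") -> List[LogEvent]:
    """Return the events at or above the given severity."""
    threshold = SEVERITY_RANK[minimum_severity]
    # Unknown severities are treated as errors so that they are never ignored.
    return [e for e in events
            if SEVERITY_RANK.get(e.severity, SEVERITY_RANK["error"]) >= threshold]


def notify_all(alerts: Iterable[LogEvent], send: Callable[[str], None]) -> int:
    """Send one notification per alert and return how many were sent."""
    count = 0
    for alert in alerts:
        send(f"[{alert.severity.upper()}] {alert.source}: {alert.message}")
        count += 1
    return count


if __name__ == "__main__":
    events = [
        LogEvent("web01", "information", "service started"),
        LogEvent("raid0", "warning", "volume degraded: one disk failed"),
        LogEvent("db01", "error", "transaction log full"),
    ]
    notify_all(find_alerts(events), print)
```

In this sketch the degraded-volume warning is reported even though the system is still running, which is the point of monitoring: the first failure is surfaced while there is still time to repair it.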
Another way to avoid problems is to keep up with and understand the risks and benefits of system upgrades and service packs. Most large organizations establish their own testing organizations to qualify service packs and define baselines.
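Operational procedures should include the following types of management: change management, service-level management, problem management, capacity management, security management, and availability management.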
Microsoft has created a knowledge base called the Enterprise Services frameworks (Microsoft Readiness Framework, Microsoft Solutions Framework, and Microsoft Operations Framework) to describe industry experience and best practices for such procedures. You can find more information online at http://www.microsoft.com/trainingandservices/default.asp?PageID=enterprise&PageCall=frameworks.
When you have a stable set of operational procedures, you can begin to explore ways to improve hardware and software availability. System availability doesn’t depend only on how redundant your hardware and software systems are.
Site services can become unavailable if site traffic exceeds capacity and can become less reliable after operating for prolonged periods at peak load. You should scale your server farm to accommodate increased site traffic and to maintain site performance in a cost-effective manner. Capacity requirements are discussed in more detail in Chapter 7, "Capacity Planning."
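A rough sizing calculation shows how these two concerns, exceeding capacity and running for long periods at peak load, translate into a server count. The sketch below is an assumption-laden illustration: the function name, the 70 percent utilization target, the single spare server, and the sample traffic figures are all hypothetical and would need to be replaced with measured values.

```python
import math


def servers_needed(peak_requests_per_sec: float,
                   per_server_requests_per_sec: float,
                   target_utilization: float = 0.7,
                   spare_servers: int = 1) -> int:
    """Estimate the farm size needed to carry peak traffic with headroom.

    target_utilization keeps each server below its maximum so that the farm
    does not run at peak load for prolonged periods; spare_servers allows for
    the failure or maintenance of that many servers.
    """
    if peak_requests_per_sec < 0 or per_server_requests_per_sec <= 0:
        raise ValueError("traffic must be non-negative and per-server capacity positive")
    if not 0.0 < target_utilization <= 1.0:
        raise ValueError("target_utilization must be in the range (0, 1]")
    if spare_servers < 0:
        raise ValueError("spare_servers must not be negative")
    usable_per_server = per_server_requests_per_sec * target_utilization
    return math.ceil(peak_requests_per_sec / usable_per_server) + spare_servers


if __name__ == "__main__":
    # Hypothetical figures: 1,200 requests/sec at peak, 200 requests/sec per server.
    print(servers_needed(1200, 200))
```

Lowering the utilization target or adding spares increases cost, so the values chosen reflect the same trade-off between outage tolerance and budget that governs the rest of the design.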
To design a highly available site, you should know what techniques you can use to help reduce failures. This section describes these techniques.
Use the following techniques to reduce possible application failures:
Use the following techniques to reduce possible climate control failures:
Use the following techniques to reduce possible data failures:
At the time of writing, Microsoft SQL Server 2000 is the only version of SQL Server that supports log shipping.
Use the following techniques to reduce possible electrical power failures:
Use the following techniques to reduce possible network failures:
Use the following techniques to reduce possible security failures:
Use the following techniques to reduce possible server failures:
Use the following techniques to reduce possible hardware failures:
You can use three fundamental strategies to design a highly available site: developing operational procedures that are well documented and appropriate for your goals and your staff’s capabilities, ensuring that your site has enough capacity to handle processing loads, and reducing the probability of failure. Operational procedures should include change management, service-level management, problem management, capacity management, security management, and availability management. To design a highly available site, you should use techniques that can help reduce failures related to applications, climate control, data, electrical power, network, security, servers, and hardware.