Lesson 3: Ensuring System Availability | MCSE Training Kit (Exam 70-226): Designing Highly Available Web Solutions with Microsoft Windows 2000 Server Technologies (MCSE Training Kits)

Hardware failure, data corruption, and physical site destruction all pose threats to a Web site that must be available nearly 100 percent of the time. You can enhance your site’s availability by first identifying services that must be available and then identifying the points at which those services can fail. Increasing availability also means reducing the probability of failure. Decisions about how far to go to prevent failures are based on a combination of your company’s tolerance for service outages, the budget, and your staff’s expertise. System availability depends on the hardware and software you choose and the effectiveness of your operating procedures. This lesson introduces you to three fundamental strategies that you can use to design a highly available site: developing operational procedures, ensuring adequate capacity, and reducing the probability of failure.

After this lesson, you will be able to

Identify three fundamental strategies that you can use to design a highly available Web site

Estimated lesson time: 20 minutes

Designing a Highly Available Web Site

You can design availability into a Web site by identifying services that must be available, determining where those services can fail, and then designing the services so that they continue to be available to customers even if a failure occurs. You can use three fundamental strategies to design a highly available site:

Develop operational procedures that are well documented and appropriate for your goals and your staff’s capabilities.
Ensure that your site has enough capacity to handle processing loads.
Reduce the probability of failure.

Developing Operational Procedures

One of the most effective means of ensuring site availability can also be inexpensive to implement: creating well-documented, accurate, standardized operational procedures.

With the capacity of database systems often in the 100 GB range and higher, deploying proper data protection mechanisms is essential. This is becoming even more critical as databases approach the terabyte (TB) range, and some the petabyte (PB) range. In particular, RAID systems can enhance both the scalability and the performance of disk systems while simultaneously enhancing data integrity. Because of the decreasing cost of disks compared to the increasing cost of downtime, redundant storage subsystems are even more attractive now than when they were introduced.

Consistent, detailed monitoring procedures are critical to deploying systems for high availability. First, you should restrict logical and physical access to servers. Second, you should monitor the system event log regularly so that failures and potential failures of systems don’t go undetected. Implementing an infrastructure that continuously monitors all of your systems and the entire network provides the best means of preventing and detecting system failures. Devices on the Hardware Compatibility List (HCL) are required to use the event log to record problems. Many systems designed for maximum reliability are able to continue operation with a single failure, such as a failed disk in a RAID-5 volume. A subsequent failure, however, will cause an outage and possibly loss of data, so the first failure must be detected and repaired promptly. You should set up automated procedures for alarm notification, such as pager notification of SNMP alarms.
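The monitoring idea in the preceding paragraph can be illustrated with a minimal Python sketch. It assumes event records have already been collected from the system event log; the Event type, the severity names, and the notify function are hypothetical placeholders, not part of any Windows or SNMP API.

```python
from dataclasses import dataclass


@dataclass
class Event:
    source: str    # component that logged the event (hypothetical names)
    severity: str  # assumed values: "info", "warning", or "error"
    message: str


def events_needing_alarm(events, alarm_severities=("warning", "error")):
    """Return the events that an operator should be notified about."""
    return [e for e in events if e.severity in alarm_severities]


def notify(event):
    # Placeholder for a real pager or SNMP integration (assumption).
    return f"ALARM [{event.severity}] {event.source}: {event.message}"


# Sample records; a degraded RAID-5 volume keeps running but must be reported.
log = [
    Event("disk-array", "info", "consistency check completed"),
    Event("disk-array", "warning", "RAID-5 volume degraded: one member disk failed"),
    Event("web01", "error", "service stopped unexpectedly"),
]

alarms = [notify(e) for e in events_needing_alarm(log)]
for line in alarms:
    print(line)
```

The point of the sketch is that a degraded but still running component is treated as an alarm, because the system can no longer tolerate a second failure.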

Another way to avoid problems is to keep up with and understand the risks and benefits of system upgrades and service packs. Most large organizations establish their own testing organizations to qualify service packs and define baselines.

Operational procedures should include the following types of management:

Change management
Service-level management
Problem management
Capacity management
Security management
Availability management

Microsoft has created a knowledge base called the Enterprise Services frameworks (Microsoft Readiness Framework, Microsoft Solutions Framework, and Microsoft Operations Framework) to describe industry experience and best practices for such procedures. You can find more information online at http://www.microsoft.com/trainingandservices/default.asp?PageID=enterprise&PageCall=frameworks.

System availability doesn’t depend only on how redundant your hardware and software systems are; it also depends on sound operational procedures. Once you have a stable set of operational procedures, you can begin to explore ways to improve hardware and software availability.

Ensuring Site Capacity

Site services can become unavailable if site traffic exceeds capacity and can become less reliable after operating for prolonged periods at peak load. You should scale your server farm to accommodate increased site traffic and to maintain site performance in a cost-effective manner. Capacity requirements are discussed in more detail in Chapter 7, "Capacity Planning."
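As a rough illustration of scaling a server farm so that it doesn’t run at peak load for prolonged periods, the following Python sketch estimates a server count. The function name, the 70 percent target utilization, and the single spare server are assumptions chosen for the example, not figures from this chapter.

```python
import math


def servers_needed(peak_requests_per_sec, per_server_capacity,
                   target_utilization=0.7, spare_servers=1):
    """Estimate how many servers keep each one below the target utilization.

    All parameters are illustrative assumptions; measure real values
    through capacity planning.
    """
    if per_server_capacity <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("capacity must be positive and utilization in (0, 1]")
    usable_per_server = per_server_capacity * target_utilization
    base = math.ceil(peak_requests_per_sec / usable_per_server)
    # Spare servers let the farm absorb a failure without exceeding the target.
    return base + spare_servers


print(servers_needed(1200, 200))
```

With a peak of 1,200 requests per second and servers rated at 200 requests per second each, the sketch sizes the farm at nine servers plus one spare.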

Reducing the Probability of Failure

To design a highly available site, you should know what techniques you can use to help reduce failures. This section describes these techniques.
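To make "reducing the probability of failure" concrete, the following Python sketch applies the standard formula for redundant components. It assumes that component failures are independent, which real systems only approximate; the function names and sample figures are illustrative.

```python
HOURS_PER_YEAR = 8760  # 365 days


def redundant_availability(single_availability, copies):
    """Availability of a service that stays up if any one copy is up.

    Assumes failures are independent (a simplifying assumption).
    """
    return 1 - (1 - single_availability) ** copies


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year for a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


one = 0.99
two = redundant_availability(one, 2)
print(round(two, 6), round(downtime_hours_per_year(one), 2),
      round(downtime_hours_per_year(two), 3))
```

Under these assumptions, a second redundant server raises availability from 99 percent to 99.99 percent and cuts expected downtime from about 87.6 hours to under 1 hour per year.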

Application Failures

Use the following techniques to reduce possible application failures:

Create a robust architecture based on redundant, load-balanced servers. (Note, however, that load-balanced clusters are different from Windows application clusters. Commerce Server 2000 components, such as List Manager and Direct Mailer, are not cluster aware.)
Review code to avoid potential buffer overflows, infinite loops, code crashes, and openings for security attacks.

Climate Control Failures

Use the following techniques to reduce possible climate control failures:

Maintain the temperature of your hardware within the manufacturer’s specifications. Excessive heat can cause CPUs and other components to fail, and excessive cold can cause failure of moving parts, such as fans and disk drives.
Maintain humidity control. Excessive humidity can cause electrical short circuits that result from water condensing on circuit boards. Excessive dryness can cause static electricity discharges that damage components when you handle them.

Data Failures

Use the following techniques to reduce possible data failures:

Conduct regular backups. In addition to regular backups, archive backups offsite. For example, to save space you can archive every fourth regular backup offsite. If your data becomes corrupted, you can restore the data from backups to the last point before the corruption occurred. If you also back up transaction logs, you can then apply the transaction logs to the restored database to bring it up to date.
Replay transaction logs against a known valid database to maintain data. This technique, which is also known as log shipping to a warm backup server, is useful for maintaining a disaster-recovery site (a "hot site").
Microsoft SQL Server 2000 is the only version of SQL Server that supports log shipping.
Deploy failover clusters for your back-end database servers. Two examples of failover clustering technologies are the Cluster service and Veritas Cluster Server, which also can be configured to provide load-balancing capabilities depending on the hardware platform you’re using to host the cluster. Commerce Server uses data stores such as SQL Server and the Active Directory service. SQL Server provides access to data and services such as catalog search. SQL Server can use the Cluster service to provide redundancy. Active Directory provides access to profile data and can provide authentication services. Active Directory uses data replication to provide redundancy. In general, clustering is more effective for dynamic (read/write) data and data replication is more effective for static (read-only) data.
Minimize the probability and impact of a SQL Server failure by clustering SQL Server servers or by replicating data among SQL Server servers. If you’re using Microsoft SQL Server 7, the full-text search feature is available only in a nonclustered configuration, so you must use a replication strategy for the product catalog. However, note that SQL Server 6.5 and 7 don’t support high-availability configurations. Microsoft Data Access Components (MDAC) 2.6 is not supported for SQL Server 6.5 or SQL Server 7, when either release is in a failover cluster configuration. SQL Server 2000 is fully supported for high-availability configurations.
Back up Active Directory stores (if you use Active Directory). You can back up the stores while Active Directory is online.
Use at least two Active Directory domain controllers in each physical site, with a replication schedule appropriate to your requirements. Restoring a domain controller can be time-consuming and requires the domain controller to be offline. If you have peer domain controllers, you can minimize downtime if you must restore your site from backups.
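The restore sequence described in the backup technique above (restore the last good backup, then apply transaction logs) can be sketched as follows. This is a simplified Python model, not SQL Server syntax: timestamps are plain hour numbers, and the function and variable names are hypothetical.

```python
def plan_restore(full_backups, log_backups, corruption_time):
    """Pick the backup and transaction logs to restore before a corruption.

    full_backups and log_backups are lists of timestamps (hours here, for
    simplicity). Only backups taken before the corruption are trusted.
    """
    candidates = [t for t in full_backups if t < corruption_time]
    if not candidates:
        raise ValueError("no full backup predates the corruption")
    base = max(candidates)  # the most recent clean full backup
    # Logs are applied in order, starting after the chosen full backup.
    logs = sorted(t for t in log_backups if base < t < corruption_time)
    return base, logs


full = [0, 24, 48]
logs = [6, 12, 18, 30, 36, 42, 54]
print(plan_restore(full, logs, corruption_time=40))
```

In this example, corruption at hour 40 means restoring the full backup from hour 24 and then applying the logs from hours 30 and 36.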

Electrical Power Failures

Use the following techniques to reduce possible electrical power failures:

Use an uninterruptible power supply (UPS) for all power connections. Because a UPS is typically battery-powered, it’s useful only for outages that last for short periods. Be sure to use a UPS whose power rating is sufficient for the equipment it supports.
Use power generators as secondary backups to the UPS. You can use generators for an indefinite period of time because they’re fuel-powered (diesel or gasoline), and you can refuel them if necessary.
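The following Python sketch shows one simplified way to check whether a UPS suits a given load and roughly how long it can carry that load before a generator must take over. The 90 percent inverter efficiency and 20 percent headroom are assumptions for illustration; real battery runtime also depends on battery age, temperature, and discharge rate, so rely on the vendor’s runtime data.

```python
def ups_runtime_minutes(battery_wh, load_watts, inverter_efficiency=0.9):
    """Rough runtime estimate: usable stored energy divided by the load."""
    if load_watts <= 0:
        raise ValueError("load must be positive")
    return battery_wh * inverter_efficiency / load_watts * 60


def ups_is_adequate(ups_rating_watts, load_watts, headroom=0.2):
    """True if the UPS rating covers the load plus a safety margin (assumed)."""
    return ups_rating_watts >= load_watts * (1 + headroom)


print(ups_runtime_minutes(1000, 600))   # battery of 1,000 Wh, 600 W load
print(ups_is_adequate(1000, 800))
```

The runtime figure indicates how quickly the generator must start and stabilize; it is not a substitute for testing the failover under load.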

Network Failures

Use the following techniques to reduce possible network failures:

Use multiple network interface cards (NICs), multiple routers, switches, local area networks (LANs), or firewalls.
Contract with multiple ISPs or set up identical equipment in geographically dispersed locations.

Security Failures

Use the following techniques to reduce possible security failures:

Contract an independent security audit firm to evaluate your environment.
Deploy intrusion-detection tools.
Deploy multiple firewalls.
For the latest strategies and techniques for handling security issues, see http://www.microsoft.com/windows2000/technologies/security/default.asp.

Server Failures

Use the following techniques to reduce possible server failures:

Deploy redundant, load-balanced servers. Single-IP solutions increase site capacity by distributing HTTP requests proportionally according to each server’s capacity for handling the required load.
Verify that users are referred only to operating servers when using a single IP address solution.
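The two techniques above, distributing requests in proportion to capacity and referring users only to operating servers, can be sketched together in Python. This is an illustrative model, not how any particular load-balancing product works; the server names, capacities, and the distribute function are hypothetical.

```python
def distribute(requests, servers):
    """Split a request volume across operating servers in proportion to capacity.

    servers maps a server name to a (capacity, is_up) tuple. Servers that are
    down receive no traffic.
    """
    healthy = {name: cap for name, (cap, up) in servers.items() if up and cap > 0}
    if not healthy:
        raise RuntimeError("no operating servers available")
    total = sum(healthy.values())
    return {name: requests * cap / total for name, cap in healthy.items()}


farm = {
    "web01": (100, True),
    "web02": (300, True),
    "web03": (200, False),  # failed server: must not be offered to users
}
print(distribute(1000, farm))
```

Because web03 is down, its share is redistributed: web01 handles a quarter of the requests and web02 the rest, matching their relative capacities.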

Hardware Failures

Use the following techniques to reduce possible hardware failures:

Use hardware-implemented RAID-1, RAID-5, or RAID-10, dual disk controllers, and mirrored cache on the RAID controllers with integrated battery backup to minimize disk failures. Most hardware vendors now integrate an onboard RAID controller chip into the system board to provide some RAID support. Check with the hardware vendor to determine if your servers have this built-in feature and exactly what RAID capabilities are supported. Several excellent third-party solutions are available to reduce downtime related to disk failure.
If you’re implementing a Fibre Channel SAN, use redundant Fibre Channel host bus adapters, Fibre Channel hubs or fabric switches, and redundant disk array controllers for the Fibre Channel storage array. In the event of a failure of an adapter, hub, fiber cable, controller, switch, or some other component on the primary loop, the system will automatically switch over to the redundant or backup loop, providing an alternate path to the external storage array or SAN. When dealing with a Fibre Channel SAN, it is crucial to minimize or eliminate all single points of failure in the SAN. This will ensure that the SAN will have the best possible performance, availability, and overall integrity.
Deploy other redundant hardware components.
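To compare the RAID levels named above, the following Python sketch computes usable capacity and the number of disk failures each level is guaranteed to survive. It assumes equal-sized disks and a two-disk mirror for RAID-1; the raid_summary function is a hypothetical helper, and actual capabilities depend on the controller.

```python
def raid_summary(level, disks, disk_gb):
    """Return (usable_gb, guaranteed_disk_failures_tolerated) for a RAID set."""
    if level == "RAID-1":
        if disks != 2:
            raise ValueError("this sketch models RAID-1 as a two-disk mirror")
        return disk_gb, 1
    if level == "RAID-5":
        if disks < 3:
            raise ValueError("RAID-5 needs at least three disks")
        # One disk's worth of space holds parity.
        return (disks - 1) * disk_gb, 1
    if level == "RAID-10":
        if disks < 4 or disks % 2:
            raise ValueError("RAID-10 needs an even number of disks, at least four")
        # Half the disks mirror the other half. More than one failure is
        # survivable only if the failures fall in different mirror pairs.
        return disks // 2 * disk_gb, 1
    raise ValueError(f"unsupported level: {level}")


for level, n in (("RAID-1", 2), ("RAID-5", 4), ("RAID-10", 4)):
    print(level, raid_summary(level, n, 100))
```

The comparison shows the trade-off: RAID-5 yields more usable space from the same disks, while the mirrored levels spend half the raw capacity on redundancy.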

Lesson Summary

You can use three fundamental strategies to design a highly available site: developing operational procedures that are well documented and appropriate for your goals and your staff’s capabilities, ensuring that your site has enough capacity to handle processing loads, and reducing the probability of failure. Operational procedures should include change management, service-level management, problem management, capacity management, security management, and availability management. To design a highly available site, you should use techniques that can help reduce failures related to applications, climate control, data, electrical power, network, security, servers, and hardware.