The High-Availability Plan: Seven Must-Haves for Building High-Availability Solutions


You have seen all the monitoring reports, and you have responded to the ColdFusion alarms. You now have the information you need to start building a plan. Start by looking at the failure points.

Once you have a good idea of how much traffic your servers can take, it's time to start building a plan to solidify the availability of your site and achieve that 99.99 percent uptime goal. The following action items are the most important considerations to ensure that your site will be up, available, and free of single points of failure that can dead-end site traffic:

  • Implement a load-balanced Web server cluster to make server downtime invisible.

  • Choose a network host that offers circuit redundancy.

  • Install a correctly configured firewall to protect against unwanted visitors.

  • Use RAID Level 5 on database servers.

  • Implement a backup and recovery strategy and process.

  • Calculate a level of risk that is both business-smart and cost-effective.

  • Choose fault tolerance systems to reduce failure points.

The following seven sections describe each of these items in detail.

Implement a Load-Balanced Web-Server Cluster

The easiest and most effective way to make server downtime invisible and increase the availability of any site is to provide load balancing and failover for a Web server cluster. Load balancing devices distribute traffic load evenly among all systems in your cluster, ensuring that no single server becomes unavailable due to intense load. Failover specifically applies when a server in your cluster becomes unresponsive due to a disaster such as software or hardware failure. Having a failover system allows your cluster to switch to backup hardware, seamlessly shifting traffic, for example, from the main database server to a backup database server.

Load balancing and failover accomplish two goals:

  • Maximize server efficiency by balancing Web traffic between servers

  • Redirect traffic from nonresponsive Web servers, allowing server failures to go unnoticed by the end user (this is the failover)

Load balancing technology comes in three flavors:

  • Software-based

  • Hardware-based

  • Combination software and hardware

Software-Based Load Balancing

Macromedia's ColdFusion MX Enterprise server includes ClusterCATs (described in Chapter 3). Software-based load balancers communicate on the network level and maintain a heartbeat with other servers in the cluster to identify server health. If a server in the cluster fails to respond to the heartbeat, the server fails over; that is, traffic is redirected away from the affected server. You can set up server probes, similar to the system probes in ColdFusion Administrator, in ClusterCATs to match content and determine whether a server is responding properly.

A number of free and open-source software load balancing solutions are available, especially for Linux (http://lcic.org/load_balancing.html). However, software-based load balancing is usually suitable only for smaller systems, because at some point the software used to load-balance a cluster may begin to affect the cluster's performance. This occurs because each machine must spend some of its available resources running the clustering software, as well as sending and receiving information over the wire to determine which machines are running and busy, so that the software can decide where to route traffic. Hardware-based solutions are usually faster and much more reliable, and they offer a number of features not included in software solutions.
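The heartbeat idea is simple to sketch. The following Python fragment (an illustration only; the host/port pairs are hypothetical, and real clustering software exchanges far richer status information) treats a successful TCP connection as a healthy heartbeat and filters the cluster down to the servers still answering:

```python
import socket

def is_alive(host, port, timeout=2.0):
    """Attempt a TCP connection; treat success as a healthy heartbeat."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def heartbeat(servers):
    """Return the subset of (host, port) servers still answering,
    i.e. those eligible to receive traffic."""
    return [s for s in servers if is_alive(*s)]
```

A content-matching probe, like those ClusterCATs supports, would go one step further and fetch a page to verify that the server is returning the expected response, not just accepting connections.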

NOTE

Server heartbeat is defined as continual communication of a server's status to all other servers within the cluster and/or the load balancing software or device.


Hardware-Based Load Balancing

Cisco's LocalDirector and F5's BigIP series use a server-based architecture to load-balance in front of the Web server cluster. Each server-based load balancer works differently. Hardware-based load balancers are more efficient (and more costly) than software-based ones because they actively monitor each connection to each server in the cluster (rather than relying on the servers to manage their own connections and balance the load). The hardware load balancer contains the virtual address of the site (usually the www.domain.com name) and redirects traffic to each of the servers in the cluster according to a predefined algorithm (such as round robin or least connections). When the load balancer determines that a server is nonresponsive or is displaying bad content, the load balancer removes that server from the cluster.

Hardware load balancers are a better choice for high-traffic sites because they offload the cluster-management overhead onto a dedicated machine. In addition, they are more flexible when it comes to things like managing persistent (sticky) sessions and filtering traffic. It is generally best practice with any load balancing system (hardware or software) to make sure there is some redundancy. By configuring two hardware load balancers in tandem, you can set one to fail over in case the other goes down, thus eliminating the single point of failure inherent in placing a single server in front of your Web cluster. Figure 1.1 demonstrates how a hardware load balancer handles site traffic.
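The two algorithms mentioned above, round robin and least connections, are easy to express in code. This Python sketch is illustrative only (the server names and connection counts are made up); a real load balancer implements the same logic in firmware against live connection tables:

```python
from itertools import cycle

servers = ["web1", "web2", "web3"]                      # hypothetical cluster members
active_connections = {"web1": 12, "web2": 3, "web3": 7} # hypothetical live counts

# Round robin: hand each new request to the next server in a fixed rotation.
rotation = cycle(servers)

def round_robin():
    return next(rotation)

# Least connections: send the request to the server currently
# handling the fewest active connections.
def least_connections():
    return min(active_connections, key=active_connections.get)
```

Round robin spreads requests evenly regardless of load; least connections adapts when some requests are much heavier than others, which is why busy sites often prefer it.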

Figure 1.1. A typical hardware load balancing configuration.


Combination Software and Hardware Load Balancing

Using Macromedia ClusterCATs in tandem with a hardware load balancer, you can combine the monitoring and reporting capabilities of ClusterCATs with the cluster-management features of a hardware load balancer. ClusterCATs can also supply redundancy if the hardware load balancer fails.

Choose a Network Provider with Circuit Redundancy

When most users type a Web address into their browser, they do not realize that data can go through 10 to 15 stops en route to the destination Web server. These stops (called hops) can be local routers, switches, or large peering points where multiple network circuits meet. The Internet really is similar to a superhighway, and like any congested highway, it's prone to traffic jams (called latency). As far as your users are concerned, your site is down if there are any problems along the route to your site, even if your ColdFusion servers are still alive and ready to deliver content. Imagine that you are driving along the freeway on a Monday morning and it becomes congested. Knowing an alternate route will allow you to move around the congestion and resume your prior course. Hosting your Web applications on a redundant network allows them to skirt traffic problems in a similar fashion.

Always choose a hosting provider that can implement redundant network circuits (preferably two major Tier 1 upstream providers such as WorldCom, Sprint, or AT&T). Many hosting providers have multiple circuits from multiple providers configured with Border Gateway Protocol (BGP). A BGP configuration enables edge routers linked to the Internet to maintain connectivity in the event one of the upstream providers fails. Without some form of network redundancy, you're at the mercy of a single network provider when it comes to fixing the problem.
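Real circuit failover happens in the edge routers via BGP, but the principle it implements, "try the preferred path, fall back to the alternate", can be sketched at the application level. In this Python fragment the upstream hostnames are hypothetical:

```python
import socket

# Hypothetical upstream endpoints, in order of preference. BGP performs
# the equivalent decision at the routing layer; this only illustrates
# the failover principle.
UPSTREAMS = [("primary.example.net", 80), ("backup.example.net", 80)]

def connect_with_failover(upstreams, timeout=3.0):
    """Try each upstream in order; return the first live connection."""
    for host, port in upstreams:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError:
            continue  # this circuit is down; try the next one
    raise ConnectionError("all upstream circuits are unreachable")
```

The point of the sketch is the ordering: as long as any one path is alive, the caller never sees the failure of the others.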

For sites with truly massive traffic and to guarantee best performance, many organizations (such as eBay) opt for geographic redundancy. This involves creating clusters of duplicate systems that service users within designated regions, to guarantee availability as well as the fastest possible network performance. These configurations are complex and expensive to set up and run, but companies such as Cisco are now making products that midsized businesses can afford for establishing geographically distributed systems. When you need the best performance and availability, you may want to consider geographic redundancy and load balancing, sometimes also called global load balancing (for an excellent discussion of global load balancing, see http://www.foundrynet.com/services/documentation/sichassis/gslb.html).

NOTE

If you are hosting your Web application in-house, make sure you have a backup circuit to a network provider, in case the primary circuit becomes overutilized or unavailable. Also, make sure you've got a tested action plan in place to reroute traffic if necessary.


Install a Firewall

Every day, Internet hackers attack both popular and unpopular Web sites. In fact, most hackers don't target a particular site intentionally, but rather look for any vulnerable site they can use as a launching point for malicious activity. Web servers deliver information on specific ports (for example, HTTP traffic is delivered on port 80 and SSL on 443), and generally listen for connections on those ports (although you can run Web traffic on a different port if you wish). Hackers examine sites on the Internet using any number of freely available port-scanning utilities. These utilities do exactly what their name suggests: They scan points on the Internet for open ports that hackers can exploit. The best practice is to implement a front-end firewall solution, and then, if possible, place another firewall between the front-end Web servers and the database servers.
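A port scan of the kind described is trivial to write, which is exactly why you should assume your servers are being scanned. This Python sketch (for auditing your own machines only; the port list is just a sample) reports which ports accept TCP connections:

```python
import socket

def scan_ports(host, ports, timeout=0.5):
    """Return the subset of ports on host that accept a TCP connection."""
    open_ports = []
    for port in ports:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception
        if sock.connect_ex((host, port)) == 0:
            open_ports.append(port)
        sock.close()
    return open_ports
```

Running a scan like this against your own cluster, from outside the firewall, is a quick way to verify that only the ports you intend to expose (typically 80 and 443) are actually reachable.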

Firewalls accomplish two tasks:

  • Mitigate downtime risk by examining all incoming packets, allowing only necessary traffic to reach front-end Web servers.

  • Protect database and integration servers against unauthorized Internet access by allowing only communication directly from front-end Web servers.

NOTE

BroadbandReports.com (www.dslreports.com/scan) has a free port-scanning utility that runs from the Web, letting you know which ports are open on your server. Although the site is geared toward DSL and cable users, anyone can use the port scan.


You can build an efficient and inexpensive firewall solution using Linux's ipchains package. Red Hat 7.3, for example, uses GNOME Lokkit for constructing basic ipchains networking rules. To configure specific firewall rules, however, use iptables in Red Hat (see www.redhat.com). For better security, the most commonly implemented front-end firewall solutions include Cisco's PIX Firewall (www.cisco.com), Netscreen's Firewall (www.netscreen.com), and Checkpoint's Firewall-1 (www.checkpoint.com). You must ensure that your firewall is secure as well. This means you should not run any other services on the firewall except those that are absolutely necessary.

Most vendors, including Cisco, sell load balancing switches with built-in firewalls. The best thing to do is create a list of desired capabilities and establish a budget; then contact several vendors for quotes on affordable solutions that will meet your needs and restrictions. Be aware, too, that many modern firewall tools offer features beyond port blocking. Many provide intrusion detection, intrusion alerts, denial-of-service attack blocking, and much more.

NOTE

If you really cannot implement a front-end firewall solution, when installing Windows 2000 Server you should be cautious about the configuration options you choose. By default, Windows 2000 can install lots of goodies, such as an FTP server, a terminal server, and so on, but each of these services opens an additional port on your Web server. Do not install services you won't use, and survey those you do use to make sure they're necessary.


Use RAID Level 5 on Database Servers

Although you can build a database cluster in addition to your Web server cluster, database clusters are more complex to manage and might be impractical, depending on the size of your Web application. Always ensure that you set up single database servers in a RAID Level 5 configuration. RAID (Redundant Array of Inexpensive Disks) Level 5 stripes data and parity information across a number of disks rather than one, so the array can rebuild the contents of any single disk that fails.
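The redundancy in RAID Level 5 comes from XOR parity, and the reconstruction math is easy to demonstrate. This toy Python sketch (the byte strings stand in for disk blocks; a real controller works on whole stripes in firmware) shows how a failed disk's contents fall out of the surviving data plus parity:

```python
from functools import reduce

def xor_blocks(*blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Two data blocks plus a parity block, as on a three-disk RAID 5 stripe.
data1 = b"\x0f\xf0\xaa"
data2 = b"\x33\x55\xcc"
parity = xor_blocks(data1, data2)

# If the disk holding data2 fails, its contents are recoverable by
# XOR-ing the surviving data block with the parity block.
recovered = xor_blocks(data1, parity)
assert recovered == data2
```

Because parity is distributed across all the disks, any single disk can be lost and rebuilt this way; losing two disks at once, however, is unrecoverable, which is one reason the replication advice below still matters.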

TIP

Always give your transaction logs the best-performing volumes in the disk array. In any busy online transaction processing (OLTP) system, the transaction logs endure the most input/output (IO).


Disks in most RAID arrays are hot-swappable: if one disk in an array fails, you can substitute another in its place without affecting the server's availability. Additionally, it is a good idea to replicate your database at regular intervals to another database server.

Calculate Acceptable Risk

There is always a trade-off between cost and fault tolerance. Some organizations utilize two or three Web servers configured in a cluster with a single, "strong" nonclustered database server. The database server has redundant CPUs, power supplies, disk drives, disk and RAID controllers, and network connections. This offers a good degree of availability without the additional cost of a second database server and clustering technology. Implementing a network-based tape backup strategy is another effective, cost-saving alternative and should be part of any disaster recovery plan.

Although these are reasonable risk-management approaches for some, they will be insufficient for those who need 99.999 or even 100 percent uptime. For organizations needing absolute availability, the costs and complexity of creating and managing such systems rapidly increase. If you can afford to lose a few hours' or days' worth of data, a simple Web cluster without a database cluster is more than reasonable.

Only your budget limits the amount of redundancy you can incorporate into your system architecture. In other words, analyze your needs and plan accordingly. Any hardware can fail for virtually any reason. It is always best when arranging high availability to imagine the worst disaster and then plan based on that.
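The uptime percentages discussed in this chapter translate into concrete downtime budgets, and a quick calculation makes the cost of each extra "nine" vivid:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability_percent):
    """Minutes of downtime per year permitted at a given availability level."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for level in (99.0, 99.9, 99.99, 99.999):
    print(f"{level}% uptime allows {downtime_minutes_per_year(level):.1f} min/yr of downtime")
```

At 99.99 percent you are allowed roughly 53 minutes of downtime per year; at 99.999 percent, barely 5. Each additional nine shrinks the budget tenfold while the cost of achieving it grows far faster, which is exactly the trade-off your risk calculation must weigh.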

Redundant Server Components vs. Commodity Computing

To achieve better than 99.9 percent uptime for a Web application, implement a fault-tolerant configuration with redundancy at every level. Most server manufacturers offer dual or triple power supplies, cooling fans, and so on in their server configurations. Choose redundant power supplies to keep servers operating in case of power supply failures. In addition, ensure that you have an uninterruptible power supply (UPS) that will power the server for a limited time in case of total power failure. Most major co-location facilities will also have their own backup generators in case of major power outages, another important consideration. In many server lines, the very low-end servers do not offer the capability to add any of these options.

Another popular approach (at Google, for instance) is to have many very cheap redundant servers instead of lots of redundant components. Often this arrangement is far less expensive and easier to manage, especially with recent super-low-cost blade computers, than maintaining high-end, massively redundant servers. This approach is gaining in popularity and is a major part of the emerging "grid" computing paradigm being pushed by IBM, Oracle, HP, Dell, Microsoft, and other major vendors.
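The economics of commodity redundancy follow from basic probability. Assuming (a simplifying, hypothetical assumption) that servers fail independently, the chance that at least one of several cheap machines is up can exceed the availability of one expensive machine:

```python
def availability_of_cluster(per_server, n):
    """Probability that at least one of n independently failing servers is up."""
    return 1 - (1 - per_server) ** n

# One 99.9%-available server vs. three 95%-available commodity boxes:
print(availability_of_cluster(0.999, 1))  # ~0.999
print(availability_of_cluster(0.95, 3))   # ~0.999875
```

Three mediocre servers beat one excellent one in this model, which is the arithmetic behind the commodity-computing strategy; in practice, correlated failures (shared power, shared network, shared software bugs) erode the independence assumption, so the real gain is smaller.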

Figure 1.2 shows a standard, highly available application design, including clustered Web servers, clustered database servers, network-attached storage (NAS), redundant switches and routers, and redundant firewalls.

Figure 1.2. Basic high-availability site design.


Disaster Planning

Disaster planning and recovery processes are critical when designing and developing a high-availability system, but for some reason these needs are rarely adequately addressed. Unless your data, code, application, and hardware are unimportant to you, the first and last thing to consider is what to do when everything goes wrong. Making your system redundant and keeping offsite backups to prevent loss of data is not enough. Recovering from a disaster may involve rebuilding servers, applying specific patches, making tuning and configuration changes, preventing sensitive data from being exposed, and validating and "scrubbing" data.

Recovering from a disaster, especially one of large magnitude, can be a daunting affair if you have not clearly and systematically addressed the recovery process. Here are some excellent resources for coming to grips with disaster recovery and planning:

  • Disaster Recovery Journal (http://www.drj.com/)

  • Disaster Resources (http://www.disaster-resource.com/)

  • Simply googling the Web will reveal a wealth of tutorials, papers, and actual plans from various organizations that you can reuse to suit your specific needs.



Advanced Macromedia ColdFusion MX 7 Application Development
ISBN: 0321292693
Year: 2006
Pages: 240
Authors: Ben Forta, et al
