Why Five-9s Might Be a Bad Idea | Microsoft Windows Server 2003 Insider Solutions

A lot of companies seem obsessed with the concept of 99.999% uptime, or Five-9s. That figure allows the system to be down roughly five minutes per year. In other words, you have five minutes per year to install patches, run system maintenance, install upgrades, replace failed hardware, and reboot the system. That's a heck of a trick. The only way you can manage this kind of uptime is to eliminate a few tasks. Many administrators do not bother with any patch installations, citing a company expectation that servers are to run 24 hours a day, 7 days a week. Over time, however, a server that has not been patched, updated, or maintained risks having a security flaw exploited by a virus or worm, and neglected hardware or operating system maintenance can eventually cause the system to fail outright.
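To put numbers on the "five minutes per year" figure, here is a minimal sketch of the arithmetic. The function name and the use of an average 365.25-day year are assumptions made for illustration; nothing here comes from a Windows tool.

```python
# Average year length in minutes, including leap years (an assumption for this sketch).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year allowed at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Print the yearly downtime budget for a few common availability targets.
for label, pct in [("Two-9s", 99.0), ("Three-9s", 99.9),
                   ("Four-9s", 99.99), ("Five-9s", 99.999)]:
    print(f"{label:9s} {pct:7.3f}%  {downtime_budget_minutes(pct):10.2f} min/year")
```

Five-9s works out to a little over five minutes per year, which has to cover every reboot, patch, and hardware swap combined.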

Although a good portion of downtime can be eliminated by technologies such as clustering or load balancing, ask yourself if it's worth the expense just to avoid a little planned downtime. Maintenance of a clustered server adds several levels of complexity to the equation and you want to avoid unnecessary complexity.

BEST PRACTICE: What Does Clustering Actually Accomplish?

Some companies try to eliminate maintenance windows through technologies like clustering. A cluster consists of two or more servers that provide a specific service and share a common source of data. If you perform maintenance, such as an application patch, on the passive node first, there is no interruption in service. When you then fail the service over to the upgraded node, the unpatched node becomes the passive node and can be patched in turn. This is fine for patch management, but it doesn't address data integrity. If you never take the cluster down to perform an integrity check of the shared data, you are still vulnerable to a failure due to corruption of the data. One could argue that by mirroring the data and then "breaking" the mirror, an integrity check could be performed offline. Although this is true, if inconsistencies are found in the data, the cluster would still have to be taken down to fix them.
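The patch order described above can be sketched as a small simulation. This is only a model of the sequence (patch passive, fail over, patch the new passive node); the Node class and patch_cluster function are hypothetical and do not call any real cluster administration API.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    patched: bool = False


def patch_cluster(active: Node, passive: Node):
    """Patch both nodes of an active/passive pair without stopping the service.

    Returns the new (active, passive) pair and a log of the steps taken.
    """
    log = []

    # Step 1: patch the node that is not serving users.
    passive.patched = True
    log.append(f"patched passive node {passive.name}")

    # Step 2: fail the service over, so the two nodes swap roles.
    active, passive = passive, active
    log.append(f"failed service over to {active.name}")

    # Step 3: the formerly active node is now passive and can be patched.
    passive.patched = True
    log.append(f"patched passive node {passive.name}")

    return active, passive, log


active, passive, steps = patch_cluster(Node("NODE1"), Node("NODE2"))
for step in steps:
    print(step)
```

Note that nothing in this sequence touches the shared data, which is exactly the gap the sidebar points out.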

The Importance of Maintenance Windows

The secret to system uptime is to correctly define it as planned uptime. This means that the system can be down for predetermined amounts of time on a specific maintenance schedule. This also means that employees can plan for these events. A four-hour block once a month is pretty standard for a maintenance window. A published schedule gives you plenty of time to test patches and plan capacity changes. It also lets you warn those developers, long before the actual event, that their plan to replicate the code safe across the VPN falls right in the middle of your firewall reboots.
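A short sketch shows why the definition matters. It compares raw calendar availability with uptime measured only against scheduled hours, using the four-hour monthly window mentioned above. The function names and the 365-day year are assumptions made for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def raw_availability(window_hours: float, windows_per_year: int) -> float:
    """Percent of the calendar year the system is up when only planned windows take it down."""
    planned_down = window_hours * windows_per_year
    return 100 * (1 - planned_down / HOURS_PER_YEAR)


def planned_uptime(unplanned_hours: float, window_hours: float,
                   windows_per_year: int) -> float:
    """Percent uptime measured only against the hours the system was scheduled to be up."""
    scheduled = HOURS_PER_YEAR - window_hours * windows_per_year
    return 100 * (1 - unplanned_hours / scheduled)


# A four-hour window once a month, with no unplanned outages.
print(f"raw availability: {raw_availability(4, 12):.3f}%")
print(f"planned uptime:   {planned_uptime(0, 4, 12):.3f}%")
```

Measured against the calendar, twelve four-hour windows put you at about 99.45%, nowhere near Five-9s. Measured against planned uptime, the same year with no unplanned outages is a perfect score.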

It is critical to have managers agree to the concept of a maintenance window. If you cannot make them understand the value of planned downtime, they will often first have to feel the sting of unplanned downtime. Make sure managers understand that things like security patches, capacity upgrades, and directory integrity are key factors in the stability of the network and therefore in the productivity of employees.

Maintenance in a High Availability Environment

If you are operating under Service Level Agreements that do not allow for extended maintenance windows, you must be clever in the ways that you perform maintenance. Technologies like clustering and load balancing can avoid some of the downtimes associated with maintenance but at the cost of complexity and resources.

If you support a Web farm, you likely have a large number of servers that are providing the same service. By removing a Web server from a load-balancing group, you can bring the server offline without affecting the end users. This assumes that you have an N+1 environment, where you can take away a server without reducing the load-balanced group below the capacity it needs. This will allow you to apply patches and perform other intrusive system maintenance tasks without service interruption. You would then return the system to the farm and add it back to the load-balanced group, and repeat the task for the remaining systems in the group one at a time.
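The one-at-a-time procedure and its N+1 precondition can be sketched as follows. This is a simulation only: the WebServer class, the rolling_maintenance function, and the server names are hypothetical, and the "patch" step stands in for whatever maintenance you actually perform.

```python
from dataclasses import dataclass


@dataclass
class WebServer:
    name: str
    in_rotation: bool = True
    patched: bool = False


def rolling_maintenance(farm, required_in_rotation):
    """Maintain servers one at a time without dropping below required capacity.

    required_in_rotation is N, the number of servers needed to carry the load.
    Returns the names of the servers in the order they were maintained.
    """
    order = []
    for server in farm:
        active = sum(s.in_rotation for s in farm)
        # N+1 check: removing one server must still leave N in the group.
        if active - 1 < required_in_rotation:
            raise RuntimeError("no spare capacity: removing a server would drop below N")
        server.in_rotation = False  # remove from the load-balanced group
        server.patched = True       # stand-in for patching and rebooting
        server.in_rotation = True   # return to the group
        order.append(server.name)
    return order


farm = [WebServer(f"WEB{i}") for i in range(1, 5)]  # four servers, three needed
print(rolling_maintenance(farm, required_in_rotation=3))
```

If the farm has exactly N servers and no spare, the sketch refuses to start, which mirrors the point that this approach only works when you have paid for the extra capacity.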

A similar situation can exist in an IT environment for file servers. If the files must be accessible at all times, you would normally not be able to take a file server down for maintenance. By using a technology like Distributed File System (DFS), you can abstract the file structure so that the end users connect to a virtual server that is made up of links to real servers and real file shares. By doing this, you could migrate data and shares from one server to another and free the original server of its file sharing responsibilities. This would allow you to perform the intrusive system maintenance tasks that would require reboots or extended downtime. When those tasks are completed, the data could be migrated back to the system and the links redirected back to the original shares on that server. This would allow you to maintain the file servers without interruption to the end users. If you keep replicas of the shared data on multiple systems, you wouldn't even have to perform the migration of data. Users connecting to a DFS share that was no longer available on the primary server would simply connect to a replica on a secondary server.
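The idea of a virtual path that falls through to a replica can be illustrated with a toy model. This is not the DFS API, and real DFS target selection involves more than list order. The namespace dictionary, server names, share names, and resolve function are all hypothetical.

```python
# Hypothetical namespace: each link maps to an ordered list of (server, share) targets.
namespace = {
    "Sales": [("FS1", "Sales"), ("FS2", "SalesReplica")],
    "Engineering": [("FS2", "Eng")],
}


def resolve(namespace, online_servers, link):
    """Return the UNC path of the first target for a link whose server is online."""
    for server, share in namespace[link]:
        if server in online_servers:
            return r"\\{}\{}".format(server, share)
    raise LookupError(f"no online target for link {link!r}")


# Normal operation: users land on the primary server.
print(resolve(namespace, {"FS1", "FS2"}, "Sales"))

# FS1 is down for maintenance: the same link resolves to the replica.
print(resolve(namespace, {"FS2"}, "Sales"))
```

A link with only one target, such as Engineering in this sketch, has nowhere to fall back to, so it would still need a data migration before its server could be taken down.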

Database maintenance is where the whole process becomes a lot trickier. Technologies exist to mount the mirror of a database and run maintenance on it while the other instance of the database stores the logs, so that they can later be committed to the offline database to make it current. This process is beyond the scope of this chapter and is covered in Chapter 22, "Creating a Fault-Tolerant Environment." Suffice it to say that the process is much easier and cheaper if you can enforce a maintenance window for any and all databases.