Availability Management | Microsoft SQL Server 2005: The Complete Reference: Full Coverage of all New and Improved Features

Understanding availability management is an essential requirement for IS/IT in almost all companies today. Availability management practice involves the following efforts:

Problem detection: IT must constantly monitor systems for advance warning of system failure. You use whatever tools you can obtain to monitor systems and focus on all the possible points of failure. For example, you will need to monitor storage, networks, memory, processors, and power.
Performance monitoring: This role supports service level by ensuring that systems are able to service the business and that they keep operating at threshold points considered safely below bottleneck and failure levels.
Scaling up or out to meet demand: This effort keeps the business running by ensuring that critical systems are available within acceptable response times, so that users are serviced quickly.
Redundancy: Redundancy is the service-level practice of providing secondary offline and online (warm) replicas or mirrors of primary systems that can take over from a primary system (fail-over) with minimal disruption (no longer than a few minutes) of service to database users.
Administration: The administration role ensures 24×7 operations and administrative housekeeping. It manages the service-level (SL) budget, hires and fires, maintains and reports on service-level achievement, and reports to management or the CEO.

Problem Detection

The problem detector (or detective) is the role that is usually carried out by the DBA-analyst who knows what to look for in a high-end, busy, or mission-critical DBMS environment. You need to know everything there is to know about the technology, the SQL Server platform, and the SQL Server availability and performance capabilities. For example, you need to know which databases are doing what, where they are placed, how they use storage services, their level of maintenance, and so on. You also need to be able to collect data from SQL Server, interpret it, and forecast needs. You need to know exactly when you need to expand storage. There is no excuse in any situation for users being unable to use the database because the hard disks are full, or because memory is at the max, and so on.
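To make the forecasting idea concrete, here is a minimal sketch in Python of estimating when a volume will fill. It assumes you already collect one disk-usage sample per day by whatever means you prefer; the function name `days_until_full` and the sample figures are hypothetical, and a straight-line fit is only one simple way to forecast growth.

```python
from typing import Optional, Sequence


def days_until_full(used_gb: Sequence[float], capacity_gb: float) -> Optional[float]:
    """Estimate the days left before a volume fills up.

    Assumes one usage sample per day, oldest first. Fits a least-squares
    line through the samples and extrapolates from the latest sample.
    Returns None when usage is flat or shrinking.
    """
    n = len(used_gb)
    if n < 2:
        raise ValueError("need at least two daily samples")

    mean_x = (n - 1) / 2
    mean_y = sum(used_gb) / n

    # Least-squares slope: average growth in GB per day.
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(used_gb))
    den = sum((x - mean_x) ** 2 for x in range(n))
    slope = num / den

    if slope <= 0:
        return None  # no growth, so no fill date to predict

    remaining = max(capacity_gb - used_gb[-1], 0.0)
    return remaining / slope


if __name__ == "__main__":
    # Hypothetical samples: the volume grows about 2 GB a day.
    print(days_until_full([100, 102, 104, 106], capacity_gb=126))
```

A number like this is only as good as the samples behind it, so treat it as a prompt to go and look at the volume rather than as a deadline.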

I know that problem detection is a lot like tornado chasing (and a data center on a bad day with poor DR models in place can make the film Twister look like a Winnie-the-Pooh movie). But you need to get into the mode of chasing tornadoes before they happen because you can spend all of your time listening to the earth and looking up at the sky, and then whack! It comes when you least expect it and where you least expect it. Suddenly your 100-ton IBM NetFinity is flying out the window and there’s no way to get it back. Might as well take a rope and a chair and lock your office door from the inside. According to research from the likes of Forrester, close to 40 percent of the IT management resources are spent on problem detection. With large SQL Server sites, some DBAs should be spending 110 percent of their time on problem detection and working with the tools and techniques described in the last chapter.

Note

There are several other SL-related areas that IT spends time on and that impact availability. These include change management and change control, software distribution, and systems management. Change management, or change control, is catered to by Active Directory and is beyond the scope of this book.

The money spent on SQL Server DR gurus is money well spent. Often, a guru will restore a database that, had it stayed offline a few hours longer, would have taken the company off the Internet. So it goes without saying that you will save a lot of money and effort if you can obtain DBAs who are also qualified to monitor for performance and problems, and who do not just excel at recovery.

Key to meeting the objective of ensuring SL and high availability is the acquisition of SL tools and technology. This is where Windows Server 2003 excels above many other platforms. While clustering and load balancing are included in Enterprise Edition and Datacenter Edition, the performance and system monitoring tools and disaster recovery tools are available to all versions of the OS no matter where you install the DBMS.

Performance Management

Performance management aims to identify poor performance in areas such as networking, access times, transfer rates, and restore or recovery performance; it will point to problems that can be fixed before they turn into disasters. You need to be extremely diligent, and the extent of the management needs to be end-to-end because often a failure is caused by failures in another part of the system that you did not notice. For example, if you get a massive flurry of transactions to a hard disk that does not let up until the hard disk crashes, is the hard disk at fault, or should you have been making plans to balance the load or expand the disk array?

Scale-Up and Scale-Out Availability

In this area, you want to make sure that users do not find that they cannot connect to the database, that a Web page refresh takes forever because the database is choking, or that the server approaches meltdown because it is running out of memory, processor bandwidth, hard-disk space, or the like.

Availability, for the most part, is a post-operative factor. In other words, availability management covers redundancy, mirrored or duplexed systems, fail-overs, and so forth. Note that fail-over is emphasized because the term itself denotes taking over from a system that has failed.

Clustering of systems or load balancing, on the other hand, is as much a disaster prevention measure as it is a performance-level maintenance practice. Using performance management, you would take systems to a performance point that is nearing threshold or maximum level; depending on your needs and guts, you may decide the failsafe level is 70 percent of system resources for each facility (I start topping up when the gas meter shows a quarter tank).
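The failsafe idea can be sketched in a few lines of Python. This is an illustration only: the 70 percent figure is the example level mentioned above, while the function name `over_failsafe` and the resource names and readings are hypothetical stand-ins for whatever your monitoring tools report.

```python
from typing import Dict, List

# Example failsafe level from the text; pick your own per facility.
FAILSAFE_PERCENT = 70.0


def over_failsafe(utilization: Dict[str, float],
                  failsafe: float = FAILSAFE_PERCENT) -> List[str]:
    """Return the resources at or above the failsafe level, sorted by name.

    `utilization` maps a resource name to its current use as a
    percentage of capacity (0 to 100).
    """
    return sorted(name for name, pct in utilization.items() if pct >= failsafe)


if __name__ == "__main__":
    # Hypothetical readings from a monitoring run.
    readings = {"cpu": 82.0, "memory": 64.5, "disk": 70.0}
    for resource in over_failsafe(readings):
        print(f"{resource} is at or above the failsafe level")
```

Anything this check flags is a candidate for scaling up or out before the resource reaches its real limit.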

At meltdown, your equipment should switch additional requests for service to other resources. A fail-over database server picks up the users and processes that were on a system that has just failed, and it is supposed to allow the workload to continue uninterrupted on the fail-over systems, although the users will have to reconnect.

But a fail-over is not only a server event. Other examples of fail-overs are mirrored disks, RAID-5 storage sets, and redundant controllers.
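For the server case, the reconnect step that users face after a fail-over can be sketched from the client side as follows. This is a minimal Python illustration under stated assumptions: `connect_with_failover` is a hypothetical helper, the `connect` callable stands in for whatever your database driver provides, and the server names are invented. It does not show how SQL Server itself performs a fail-over.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def connect_with_failover(servers: Sequence[str],
                          connect: Callable[[str], T]) -> Tuple[str, T]:
    """Try each server in order and return the first that accepts a connection.

    `connect` is any callable that takes a server name and either returns
    a connection object or raises ConnectionError.
    """
    last_error: Optional[ConnectionError] = None
    for server in servers:
        try:
            return server, connect(server)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next server
    raise ConnectionError("no server reachable") from last_error


if __name__ == "__main__":
    # Hypothetical stand-in for a driver call: the primary is down.
    def fake_connect(server: str) -> str:
        if server == "primary":
            raise ConnectionError("primary is offline")
        return f"connection to {server}"

    print(connect_with_failover(["primary", "standby"], fake_connect))
```

Listing the primary first means clients return to it once it is back in service, which is one design choice among several.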