Monitoring Basics


When you begin to look at how you will monitor your enterprise, if you spend most of your time performing database administrator (DBA) work, you will immediately think of monitoring the primary SQL Server and databases associated with the specific application. However, monitoring for availability is more complicated than that. An application or solution is made of multiple layers, including the network, possibly the firewall, application servers, associated infrastructure servers (such as Microsoft Exchange or domain controllers), and the code components themselves (for example, middle-tier Microsoft .NET Web Services). To claim that the whole system is available, you must take into account, monitor, and report on every moving part.

Most monitoring documents focus on one specific aspect, such as Microsoft SQL Server. There is simply not enough space to cover all aspects of monitoring an entire application chain in depth in only one chapter. Because this book focuses on SQL Server, the discussion is mostly limited to SQL Server-specific monitoring. For a SQL Server database that backs an application expected to be highly available, the following items are mandatory for monitoring:

  • The SQL Server service, and connectivity to that instance

  • The primary databases on the primary instance

  • Crucial working parts of the database system (network cards, disk adapters, memory, processor)

  • The operating system on the primary instance

  • The operating system on your standby server, whether hot, warm, or cold

  • Key functionality within each database

You should make every effort to incorporate similar monitoring in your test and development platforms. This will serve you well if you decide to introduce a change to your monitoring system, because you can see how it behaves in the relative safety of a test environment first. Your highly available application is relying on a highly available monitoring system; therefore you should have a standby system for your monitoring database repository. If the monitoring system goes down, how will you know if your primary system is running or not?

Monitoring generally has two forms: systems that are monitored and managed in near real time, and systems that are monitored only when there appears to be a problem. For highly available systems, you will need to ensure that you are indeed capable of monitoring your systems in near real time so that your response to issues minimizes outages. In either case, it is imperative that you monitor the proper details so that you can base your response to changes on data you have at hand. It is easy to go over the top with monitoring.

You could collect information on everything, but what would you do with it all? With just the standard counters, monitoring everything at a 15-second interval would yield roughly 250 MB of data per day per server. In SQL Server Profiler, a system running 1,000 transactions per second would generate about 86 million entries per day (1,000 per second times 86,400 seconds in a day). That is a lot of data, and unless you employ a small army of people to analyze it, odds are you will never go through it on a regular basis to identify trends. This is not to say that there is anything wrong with collecting all that information if you have the bandwidth, storage space, and ability to put it to work. However, gigabytes or terabytes of collected data will not necessarily result in more available systems, and in some cases, depending on the load that monitoring adds, you might skew the very numbers you are generating.

This chapter presents the approach of minimal monitoring, which means monitoring in a lightweight manner with a specific goal. This approach differs somewhat from other monitoring documentation, but it is what is used internally at Microsoft. The approach is deceptively simple:

  • Look for known conditions.

  • Look for common issues.

  • Look for early warning signs.

  • Collect only what you can use.

Monitoring everything over time is pointless and potentially destructive to your service level agreements (SLAs). Monitor only what you are using. Even if you choose to monitor at a granular level, there is no real need to store all of that data over a long period of time. Instead, aggregate the data, or save smaller, more specific subsets of it; trends you can use to your benefit emerge over time. You can monitor more aggressively when the occasion calls for it, but long-term monitoring must be more lightweight.
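As an illustration, the following T-SQL is a minimal sketch of such aggregation, assuming a monitoring repository with a raw CounterSample table and a CounterDaily summary table (all table and column names here are hypothetical):

    -- Roll raw 15-second samples up into one row per counter per server per day.
    INSERT INTO dbo.CounterDaily (SampleDate, ServerName, CounterName,
                                  AvgValue, MaxValue, SampleCount)
    SELECT CONVERT(CHAR(8), SampleTime, 112),   -- yyyymmdd
           ServerName,
           CounterName,
           AVG(CounterValue),
           MAX(CounterValue),
           COUNT(*)
    FROM   dbo.CounterSample
    WHERE  SampleTime < CONVERT(CHAR(8), GETDATE(), 112)   -- completed days only
    GROUP BY CONVERT(CHAR(8), SampleTime, 112), ServerName, CounterName

    -- Once aggregated, trim the raw detail to a short retention window.
    DELETE FROM dbo.CounterSample
    WHERE  SampleTime < DATEADD(dd, -14, GETDATE())

The daily rows are small enough to keep for years, which is what makes long-term trending possible without retaining every raw sample.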

Do not just look at the data generated by monitoring. Look for meaning in the data. Have a question or problem in mind, decide which factors will produce an answer, and look for the answer where you expect to find it. When you detect a problem, and you then begin monitoring for troubleshooting purposes, you will be collecting significantly more data and looking through it for patterns you expected. If that fails, you might find yourself collecting additional data in search of a pattern.

Counters, events (measurements) for system health, and day-to-day alerts and actions might provide you with data that you want to store for capacity planning over the long term. In some cases monitoring might provide you with data that you could use to answer a business or application behavior question. For example, how many failed searches were issued in 1998 and were they all similar?

With SQL Server in particular, there is no shortage of data to analyze. If you do not start with what you want to know, you can end up looking through everything available to see what piques your interest. This random approach can consume many hours of interesting activity and discussion. Ultimately, however, it fails to improve availability on a consistent basis.

Table 15-1 gives you some ideas of the types of questions you might try to answer through monitoring. Some of these relate to availability and some do not.

Table 15-1: What Do You Want to Know Now?

System Health

  • Is the service up?

  • Is response time slow?

  • Are there any errors?

  • Is it being properly monitored?

Planning

  • What is our availability per our SLA?

  • Is the system performing properly?

  • When will I need a higher capacity design or new hardware?

  • When is the system in use?

Application

  • Did all required processes complete?

  • How fast is it?

  • Is any condition detected that causes issues I have solved already?

  • How is this month different from last month?

Business

  • How many times did function X occur per hour?

  • What items are most popular?

  • What items were searched for and not found?

  • How effective is this business component?

You must also answer questions regarding what you want to know from a reporting perspective, such as those shown in Table 15-2.

Table 15-2: What Do You Want to Know Later?

In Six Months?

  • Response time variation

  • Function X per month

  • Revenue that could have been made on items searched but not found

  • Usage patterns

  • Buying patterns

In a Year?

  • Response time variation

  • Performance changes

  • Orders per quarter

  • Errors resolved

  • Change in usage

  • Resource and space utilization

Monitor in layers. Monitoring is like a layered cake: once you get through one layer, there is another (and possibly even another flavor). Instead of one gargantuan monitoring system, use smaller concurrent monitoring processes, each one lightweight and specifically targeted. You can use the tools listed in Table 15-3 to monitor the various layers of your systems.

Table 15-3: Monitoring Tools for System Layers

Tools                                       Monitoring Layer

Profiler                                    SQL Server (auditing, and so on)

Customized monitoring, System Monitor       Application, operating system

Desktop Management Interface (DMI) tool     CPU, RAM, disk, host bus adapter (HBA), SCSI, NIC, network

When available, you should always choose to use the hardware vendor's DMI tools. In most cases these tools are free. Even more important, these tools return information on predictive failure much sooner than errors show up in the Windows Event Log. Simply put, this is a required level of monitoring if you are serious about availability. Hardware failures in general are a primary cause of outages, and DMI tools help to predict these failures. In general, these tools can be set up so a central monitoring station collects information on many servers.

You can also set up a great deal of customized monitoring. There is obviously a lot of variety implicit in writing your own programs for this, but for our purposes customized monitoring is a general label for everything that does not fall into the category of System Monitor or SQL Server Profiler.

Setting Ground Rules

The question then becomes, what do you set up in Profiler and System Monitor? Before getting down to specific examples, you must first understand the fundamentals of monitoring. Listed next are a few terms of engagement you should memorize, in addition to being proactive and practicing due diligence.

Tip

Perhaps the most important ground rule is that the monitoring tools that you decide to leverage should be selected keeping in mind that this is something you are going to have to maintain going forward. Your solution should be as hands-off as possible and should require minimal effort on the part of your team as your environment changes. Manually configured custom tools might be just the thing for a small environment or for unique situations. Large environments, however, require monitoring solutions that are easily managed, fault tolerant, and capable of self-monitoring. Monitoring should not add to your daily workload, and if implemented correctly can actually reduce it, freeing up valuable time for more productive activities.

  1. You are creating history. When you monitor, you are creating a timeline of events. A counter value is an event, a query is an event, an error is an event, and a change is an event. To make the monitoring data meaningful, record the following events:

    • Change in software version

    • Difference in edition

    • Application of service packs

    • Change in hardware

    • Change to the application

    • Change in usage

    • Change in priorities

    Whenever anything on this list changes, be sure to reevaluate your monitoring methodology. In some cases, such as a change in version, edition, service pack, or hotfix level of SQL Server, SQL Server's behavior might change. If you cannot compare behavior before and after such changes, you will not know, for example, whether a phone call from an end user complaining about performance is valid. You should always be able to say empirically how things worked before so you can figure out what the delta of change is.

    The same goes for a new version of the front-end application, hardware changes of any type, a Windows update, new drivers, new firmware, and so on. Record it all. Make things easier for yourself by storing it in a table in a format that joins easily with the data you are already collecting, as sketched at the end of this item. That way, when you issue a report or create a timeline, it will be a relatively painless process.

    Most important, record a baseline. Make one at the beginning and then again after every major system change. This is critical: without a history of what has happened before, you have no way to meaningfully interpret the data you collect. Trying to track developing problems is a complete waste of time without a baseline. It is difficult to state that there is a change in behavior if you can only observe the current behavior without being able to compare it to anything. Raw data becomes useful in relation to other counters, other events, or real-world activity. Therefore, the same measurement (counter, event) can be used more than once to answer entirely different questions. The same measurement can also shift in interpretation over time, especially if the system being measured is changed in some fundamental way. For example, if you added more processors to your system, the % Processor Time counter found in System Monitor would go down. If you did not record the addition of processors and correlate it to this increase in processing capacity, it might later be difficult to explain to others why the server appears to be underutilized. The same goes for application changes.
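    The following is a minimal sketch of such a change-history table and how it joins back to collected data; all object, column, and server names are hypothetical:

      -- One row per recorded change (version, service pack, hardware, and so on).
      CREATE TABLE dbo.SystemChangeLog
      (
          ChangeID    INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
          ChangeTime  DATETIME           NOT NULL DEFAULT (GETDATE()),
          ServerName  SYSNAME            NOT NULL,
          ChangeType  VARCHAR(30)        NOT NULL,   -- 'Service pack', 'Hardware', ...
          Description VARCHAR(500)       NOT NULL
      )

      -- Record a change (and note the fresh baseline) as it happens.
      INSERT INTO dbo.SystemChangeLog (ServerName, ChangeType, Description)
      VALUES ('SQLPROD01', 'Hardware', 'Added two processors; new baseline captured')

      -- Because ChangeTime is a datetime, the log joins easily to collected counter
      -- data when you build a timeline, for example everything since the last change.
      SELECT cs.*
      FROM   dbo.CounterSample cs
      WHERE  cs.SampleTime >= (SELECT MAX(scl.ChangeTime)
                               FROM   dbo.SystemChangeLog scl
                               WHERE  scl.ServerName = 'SQLPROD01')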

  2. If a server goes down in the middle of your data center, no one cares how many PCI slots it had. Availability monitoring is about knowing whether the trees and the forest are still standing, not whether each leaf is turning brown or has fallen. You do not want to collect minute details or data at too granular a level. Doing so generates a lot of what is referred to as noise, and it can also have a serious performance impact on your system.

    This by itself is a vague recommendation. There is obviously going to be a lot of monitoring you can implement if you follow all the recommendations in this chapter. Does that mean this is all just a big example? Not at all. Everything suggested in this chapter is actually running simultaneously on some of Microsoft's most heavily used internal systems. The key is being lightweight so it all works well together. To succeed, you have to make sure that your monitoring is not negatively impacting your systems. Plan it that way, and then gradually turn on more things to monitor. Throughout, you need to monitor resource usage to measure and report on the impact of monitoring, because there is always a cost associated with monitoring.

    Tip

    If you want to verify that your system sees the same lightweight impact from the same monitoring strategy, do not add more counters simply because you think they might be useful. Test the original set listed here first, then test any changes you make. For example, you could probably make an argument for measuring disk impact based on the idea that you are writing monitoring information to a disk somewhere. However, if monitoring your system in a responsible and minimal fashion has a severe detrimental impact, then you have larger problems to deal with before implementing this kind of system.

  3. Push, pull, and call names. In some circles you can get into a discussion over whether it is better to push monitoring data to a central area or to pull it instead. Arguments can be made for keeping some information locally, replicating information, or storing it in a customized manner. Either way, make sure you can immediately detect the absence of a particular server name in the stream of information, as in the sketch that follows. A server that goes down or gets bumped off a network node most certainly stops delivering information about itself. Therefore, the absence of information associated with a particular name should trigger an alert.
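    A minimal sketch of that check, assuming each server writes a row to a Heartbeat table at a known interval and a MonitoredServer table lists what should be reporting (all names are hypothetical):

      -- Servers that should be reporting but have been silent for five minutes.
      SELECT ms.ServerName
      FROM   dbo.MonitoredServer ms
      WHERE  NOT EXISTS (SELECT 1
                         FROM   dbo.Heartbeat hb
                         WHERE  hb.ServerName = ms.ServerName
                         AND    hb.HeartbeatTime > DATEADD(mi, -5, GETDATE()))
      -- Any row returned means no news is not good news; raise an alert
      -- (for example, through a SQL Server Agent job step or xp_sendmail).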

  4. Bring it on. That is the attitude you want to see in a server. You are not off the hook for monitoring performance and scalability, but sometimes you just want to know if the system is available. However, you also want to know whether the server will be available tomorrow. Some performance or scalability issues can take your server offline very abruptly, and if this happens it can take a team precious minutes or hours to bring it back online. You do not have to be a programming genius to keep an eye on the most common trends that impact performance and scalability. A few simple, well-monitored items can give you the opportunity to address issues long before users start to grumble about response time. To be very clear, part of your baseline (which is the keystone of your entire monitoring effort) should include performance and scalability information. If the numbers point to decay in this area, your monitoring solution should trigger an alert. What you do when that alert is received depends on your skill level, application policies and procedures, and other resources.

    On the CD

    For performance tuning and scalability information, see the document HA_Perf_And_Scale.doc.

    "We are already aware of the problem." This should be one of your trademark phrases, and it should be impressed on every member of your team. Do not wait for the user, the help desk, or an application administrator to complain. Do not wait until you read your e-mail tomorrow, the error logs tomorrow morning, or the event log (which you should monitor even if some other group does that regularly). Automate all of your alerts. To go one step further, once you can successfully automate the process and detect a problem, you can then revise your phrase to "We are already aware of the problem, and based on our previous experience it will take X minutes to fix." This simply requires you to record problems and resolutions as events (or links to the problems) in your monitoring database, as sketched below. On systems you are responsible for monitoring, you should never learn of a monitored issue from anyone outside your team. Every problem should be detected in an automated fashion or by human eyes trained to notice it.
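    A minimal sketch of recording problems and resolutions so that the "it will take X minutes to fix" estimate has history behind it; all names are hypothetical:

      CREATE TABLE dbo.ProblemLog
      (
          ProblemID    INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
          ServerName   SYSNAME            NOT NULL,
          DetectedTime DATETIME           NOT NULL,
          ResolvedTime DATETIME           NULL,
          ProblemType  VARCHAR(50)        NOT NULL,   -- for example, 'Log drive full'
          Resolution   VARCHAR(500)       NULL
      )

      -- Average historical time to repair for each class of problem.
      SELECT ProblemType,
             AVG(DATEDIFF(mi, DetectedTime, ResolvedTime)) AS AvgMinutesToFix
      FROM   dbo.ProblemLog
      WHERE  ResolvedTime IS NOT NULL
      GROUP BY ProblemType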

How Available Is Available?

Instead of looking at whether or not something is available, consider the measurement of monitoring data to be an indicator of how quickly users can get to the data. If you look at it this way, you can see that this adds a useful dimension to availability monitoring. You will be able to see impending availability problems before they occur if you begin with a baseline and chart the time intervals related to response time.

On a very simple level, you could create a set of queries (that do not change any data) based on the top n queries, or heavy hitters, in the system. These are the 10 or 20 most important questions the application asks of your data. They can be deliberately chosen based on business importance, or they can be the most frequently executed queries with a duration above an arbitrary threshold, such as 5 seconds. These queries can then be used for tuning the system or measuring performance (or lack thereof). Once you have this list of queries, you can create a monitoring script or a trace that does something like the following:

  1. Record start time.

  2. Execute the query. If you like, also record the SHOWPLAN_ALL output.

  3. Record end time and number of records returned.

Store this information in a table.
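A minimal sketch of such a script for one query follows; the sample query, the QueryTiming table, and all other names are hypothetical:

    DECLARE @StartTime DATETIME, @EndTime DATETIME, @Rows INT

    SELECT @StartTime = GETDATE()                       -- 1. Record start time.

    -- 2. Execute one of the top n queries (it must not change any data).
    SELECT OrderID, OrderDate, Status
    FROM   dbo.Orders
    WHERE  CustomerID = 42

    SELECT @Rows = @@ROWCOUNT, @EndTime = GETDATE()     -- 3. Record end time and rows.

    -- Store the measurement in a table.
    INSERT INTO dbo.QueryTiming
           (QueryName, StartTime, EndTime, DurationMS, RowsReturned)
    VALUES ('Order status lookup', @StartTime, @EndTime,
            DATEDIFF(ms, @StartTime, @EndTime), @Rows)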

Ideally, you would also collect System Monitor counters at the same time. SQL Server provides the option of creating custom counters through the User Settable object. If you configure a Perfmon alert on one of the Query counter instances of the SQLServer:User Settable object (mundanely known as user counters 1 through 10) and have that alert start a counter log, you can trigger the collection simply by changing the value of the user counter. For instance, by executing the sp_user_counter1 procedure to change the value of User Counter 1 just before executing your set of queries, you trigger collection of the other System Monitor counters. Your script or trace works into this plan quite easily; just allow a delay equal to the interval at which the alert polls the counter value, to ensure everything is running at once.
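For example, a monitoring script might bracket the timed queries like this (a sketch; sp_user_counter1 ships with SQL Server, and the five-second delay is an assumed polling interval):

    EXEC sp_user_counter1 1        -- raise User Counter 1; the Perfmon alert fires

    WAITFOR DELAY '00:00:05'       -- allow for the alert's polling interval

    -- ... execute the timed set of queries here ...

    EXEC sp_user_counter1 0        -- reset the counter when the run is complete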

Information collected this way is very simple to report on. If you also store basic SLA information, you can easily create some very useful information. Consider the reports shown in Table 15-4, which could be set up as part of a Web site.

Table 15-4: Sample Reports

Region   Status   Deviation from SLA   Last 24 Hours   Last 7 Days   Last 30 Days

East     Green    -0.16                98.19%          98.45%        98.52%

West     Yellow   -3.3                 87.80%          91.10%        91.00%

In Table 15-4, the availability information is separated by region. The Status column is a quick visual indicator of whether the availability falls within acceptable ranges. The Deviation from SLA column indicates how far the measured availability deviates from the SLA. The last three columns show the percentage of monitoring queries that produced results within the SLA (a sketch of this calculation follows). For this to work, you must define exactly how available "available" is. The easiest and most productive way to do this is not to simply record whether the resource responded to an inquiry, but to define how responsive it should be. Have the users tell you what their expectations are to work effectively. If they expect something to respond in 60 seconds, record the availability as just that: responds within 60 seconds. This number is an availability target, even though it clearly incorporates the idea that the query must perform well and that the design must scale. This approach helps tell you when your availability is at risk due to response time decreases.
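A sketch of the calculation behind the Last 24 Hours column, assuming the hypothetical QueryTiming table from the monitoring script and a QuerySLA table that maps each query to a region and an allowed response time in milliseconds:

    SELECT s.Region,
           CONVERT(DECIMAL(5, 2),
               100.0 * SUM(CASE WHEN t.DurationMS <= s.SLAResponseMS THEN 1 ELSE 0 END)
                     / COUNT(*)) AS PctWithinSLA
    FROM   dbo.QueryTiming t
    JOIN   dbo.QuerySLA    s ON s.QueryName = t.QueryName
    WHERE  t.StartTime >= DATEADD(hh, -24, GETDATE())
    GROUP BY s.Region
    -- Substitute DATEADD(dd, -7, ...) or DATEADD(dd, -30, ...) for the other columns.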

If you combine this with a report that shows you how much of your downtime was incurred based on hardware failures and operational issues, you will have a report that helps you manage your system and drive necessary process change. You will also have data to support any incident of not meeting your nines goal. This is far more effective than simply reporting that your system was online for a percentage of time.

To take this example a little further, look at this not by region, but by functional component or workflow. Using the top n queries approach, you can create a list that shows which functional components are significant to your users. As part of an extended SLA, you might list My Order Status as requiring availability with a response time of 5 seconds. You would record the workflow description of this in terms of the front-end activity, for example, clicking Order Status and then My Orders. Then you would need to design a mechanism for verifying the response time, as shown in Table 15-5.

Table 15-5: Availability by Function

Function: My Order Status
SLA in Seconds: 5
Description: Click on Orders, then My Orders

Measurement (durations in seconds)   Last 24 Hours   Last 7 Days   Last 30 Days

Number of Attempts                   939             6860          38,347

Attempts Met SLA                     583             4391          24,903

Max Duration                         132.48          476.65        839.68

Avg Duration                         4.23            4.17          3.99

Avg Duration (Met SLA)               1.51            1.40          1.42

Avg Duration (Not Met SLA)           8.69            9.08          8.76

Table 15-5 shows one way you could report on this information. To produce it, you could take a lightweight Profiler trace that captures the queries sent and their duration, and map the results to specific functionality; a sketch of the calculation follows. With some simple math, you have a valuable report. Remember, the goal is to consider availability as how quickly the users were able to get to the data.
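The calculations themselves are simple aggregates. A sketch, assuming the trace has been loaded into a TraceEvent table (duration in seconds) and a FunctionMap table ties each query to a function name and its SLA (all names are hypothetical):

    SELECT fm.FunctionName,
           COUNT(*)                                                         AS NumberOfAttempts,
           SUM(CASE WHEN te.DurationSec <= fm.SLASeconds THEN 1 ELSE 0 END) AS AttemptsMetSLA,
           MAX(te.DurationSec)                                              AS MaxDuration,
           AVG(te.DurationSec)                                              AS AvgDuration
    FROM   dbo.TraceEvent  te
    JOIN   dbo.FunctionMap fm ON fm.QueryName = te.QueryName
    WHERE  te.EndTime >= DATEADD(dd, -1, GETDATE())   -- last 24 hours
    GROUP BY fm.FunctionName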



