At some point you have to put theory into practice. This section walks you through the various methods of implementing a monitoring solution. No matter what technology you employ, the process is the same. The first step is to create a baseline. Once you have a good idea of what you need to collect, you can create a baseline with that in mind. Note that what you need to baseline will undoubtedly change over time, so it is essential with this (as with any other system) to design it to be flexible enough to evolve as your needs change.
What exactly is a baseline? It is simply a record of how things look now, not unlike a photograph. When you first run the same monitoring you plan to run in the future, save the results and call them a baseline. Later, you can compare current data with this historical record to see what has changed. Eventually you might need to create a new baseline after a change to the system or its usage. That gives you two baselines for analysis and comparison purposes.
For example, if you monitor your transaction log file growth on a regular basis, you can start to see trends. If your starting size was 200 MB as a baseline and you record the file size daily, over time you can calculate the percentage of growth and correlate it to changes to the system, such as higher usage, new users, or different work patterns.
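As a sketch of how such a baseline might be recorded, the following Transact-SQL creates a history table and captures the current log size. The table name and schema are assumptions for illustration only.

```sql
-- Hypothetical history table for daily log-size collection
CREATE TABLE log_size_history (
    capture_date DATETIME NOT NULL,
    db_name SYSNAME NOT NULL,
    log_size_mb DECIMAL(10,2) NOT NULL
)

-- Record the current transaction log size; in SQL Server 2000,
-- sysfiles reports size in 8-KB pages (128 pages per MB)
INSERT INTO log_size_history
SELECT GETDATE(), DB_NAME(), SUM(size) / 128.0
FROM sysfiles
WHERE (status & 0x40) <> 0  -- log files only

-- Percentage growth relative to the 200 MB baseline
SELECT capture_date, log_size_mb,
    ((log_size_mb - 200) / 200) * 100 AS pct_growth
FROM log_size_history
ORDER BY capture_date
```

Running the INSERT from a daily SQL Server Agent job builds the trend data over time.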
Much more could be said about exactly what to save, in what format, how to report on it, and so on. Because this book has a finite amount of space, you will get a foundation you can build on. Interpretation of the data will also be specific to each environment; two identical numbers might be interpreted completely differently because of their surrounding environments.
More Info
The best resources are the Windows Resource Kits, the SQL Server Resource Kits, and the SQL Server 2000 Operations Guide, which can be found at http://www.microsoft.com/technet/treeview/default.asp?url=/technet/prodtechnol/sql/maintain/operate/opsguide/default.asp.
Hardware monitoring is required for anyone serious about maintaining availability. Obtain your hardware vendor's monitoring tool for a specific component, install it, and configure it to monitor the right servers. In most cases you can set the tool to alert you if there is a problem. Collecting this information should never be optional. You can also monitor hardware through a more customized Desktop Management Interface (DMI) tool, or gather the information using tools such as Microsoft Operations Manager (MOM) or Microsoft Windows Management Instrumentation (WMI). Here is a list of standard information that should be checked:
Hardware error messages
Hardware error timestamps
CPU cache sizes, model name, and speed
Disk drive capacities, firmware revisions, and models
Memory sizing: total, individual application, and type of memory used
Network interface descriptions
PCI board names
Serial number, model name, and server name
Note that the preceding list includes more than error messages. With the information collected from this list, you could begin to populate a configuration management database or a run book. For more information, read Chapter 2, The Basics of Achieving High Availability, as well as Chapter 2 of the SQL Server 2000 Operations Guide.
Once you have the hardware layer sufficiently monitored, you need to look to the software layers, namely SQL Server, Windows, and anything else running on a specific server. Ultimately, you want proactive monitoring that alerts you by e-mail, pager, or phone that there is a problem based on an event or rule. Even with minimal monitoring, there can be too much information for any one person to review.
You can use various tools to monitor events, alert someone or something, and in some cases provide an action based on an event. Relevant Microsoft tools are described here, but you can also use capable third-party tools that might be employed in your enterprise.
Prior to setting up any formal monitoring, you should set up alerts for all known failure types. You can set up SQL Server to forward events or perform alerts, but you might also consider setting up an enterprise management tool such as MOM.
Sqldiag.exe ships with Microsoft SQL Server 2000 and is located in the shared Binn subdirectory. You can use it to retrieve information on the current state of your SQL Server, including user information, DLL versions, configuration information, and database size information. On completion, this tool creates an output file with detailed information about your SQL Server. The file is overwritten each time you run the tool, so if you would like to archive the information, rename the output file and archive it to another location.
System Monitor, otherwise known as Perfmon, is the most common way of monitoring system events over time. It is found under the Start menu in Administrative Tools. You can log your results to disk or, as of Microsoft Windows Server 2003, you can also log the data captured to a SQL Server database file using a standard Open Database Connectivity (ODBC) connection. This can prove to be a very useful feature, because if you are looking to track trend information over time, you could then feed that information into Analysis Services or perform a custom analysis to provide your company with a wealth of information about your systems that was previously difficult to mine. Its usage is profiled later in this chapter.
Event Viewer, which can be found under the Start menu in Administrative Tools, is generally the first place you would monitor for all types of events outside of System Monitor. There are different types of logs in Event Viewer, such as application, security, and system logs.
Simply looking at the log files generated by an application or the operating system can provide you with invaluable information that might not be found elsewhere.
SQL Server Profiler can be a useful monitoring tool, especially for auditing events within SQL Server. Its use is detailed later in this chapter.
WMI is a standard programming interface that lets you work with aspects of your applications so that you can create your own applications, ASP or ASP.NET pages, and so on, for monitoring or performing management tasks. For SQL Server 2000, the WMI provider is built on SQL-DMO, so it cannot do anything that SQL-DMO cannot do. There are many Windows-level events that can be tracked and acted on through WMI. Beginning with Windows Server 2003, you can also use WMI with Windows Clustering. WMI can be accessed in many different ways (programs, command-line tools such as Wmic.exe, and so on), which is another advantage.
WMI is very useful for creating an in-house monitoring tool, but you must be careful. From a security perspective, ensure that only those who need access to the namespace have it. Because there is no granularity of security privileges in WMI, once someone has access to a namespace, he or she can do anything allowed in that namespace. WMI might not work for some environments because the WMI provider and SQL-DMO run in the same process space as the WMI management service.
More Info
For more information on WMI, look at the Windows Management Instrumentation section in MSDN ( http://msdn.microsoft.com/library/default.asp?url=/library/en-us/wmisdk/wmi/wmi_start_page.asp ), TechNet Script Center ( http://www.microsoft.com/technet/scriptcenter ), and books published on the topic, such as Microsoft Windows 2000 Scripting Guide (Microsoft Press, 2002).
Microsoft Operations Manager (MOM) can capture a wide variety of system and application events from distributed Windows-based systems. It is also capable of aggregating them into a central event repository. You can consolidate these events for an overall view of your enterprise, server, and service availability, or you can obtain specific information from the detailed event stream.
MOM includes a series of plug-ins, or management packs, geared toward a specific piece of functionality. Management packs exist for SQL Server and Windows Clustering. As with all other forms of monitoring, monitor only as much as you need. When you install the SQL Server management pack, all objects it monitors are enabled for monitoring by default. You should disable them and add them in one by one as you determine your monitoring needs. MOM brings a centralized management structure to your environment. MOM can monitor your servers in many different ways (for example, event logs, WMI, Simple Network Management Protocol [SNMP], log files, service status changes, and System Monitor thresholds). It can also detect when a scheduled event, such as a database backup, has been missed. Reporting is another time-saving feature that MOM provides.
Finally, and most important, MOM provides a hands-off monitoring solution. Once you have defined your monitoring for SQL Server (or other types of servers), those monitoring rules are automatically applied to the server based on the products that are running on it. For example, if you introduce a new SQL Server to your environment, MOM automatically recognizes that the server is running SQL Server and deploys the appropriate agent and rules. This saves you a lot of time because there is no manual setup or maintenance as servers move in and out of your environment.
MOM has a maintenance requirement for its databases because it logs all information to a SQL Server database. Also, if you want to install MOM on a clustered SQL Server 2000 instance, you need at least Microsoft Operations Manager 2000 Service Pack 1.
More Info
An in-depth discussion of MOM is largely out of scope for this book. For more information, see http://www.microsoft.com/mom , which includes links to documentation, including the Microsoft Operations Manager 2000 Service Pack 1 Deployment Guide at http://www.microsoft.com/mom/techinfo/deployment/guide.asp.
All versions of SQL Server include the ability to create alerts. You can create an alert as part of a SQL Server Agent job to notify you of its success or failure, or you can set up generic alerts based on certain events. You can find alerts under the SQL Server Agent tree on the Management tab in Enterprise Manager. When you create an alert, you will see the dialog boxes shown in Figures 15-1 and 15-2.
On the General tab, enter the name for the alert and select what type of alert you are creating. You can base the alert on a specific error number you know might occur, or, more broadly, you can use the severity of an event to trigger an alert. Most Microsoft production systems prefer the more targeted error number so that very specific actions can be associated with unique events. You can also select the database or databases affected by this alert. As a matter of manageability, you should probably create a separate alert, even of the same type, for each database you need to monitor.
On the Response tab, you specify what to do if the alert fires. If it is something you can automate and you wish to run a specific SQL Server Agent job, select the Execute Job check box, and then select the appropriate job. Whether you automate corrective action or not, you should always notify someone. Creating an alert without notifying anyone is like a tree falling in the forest with no one to hear it. To notify using an e-mail alias, you must create an operator, which is then linked to a specific e-mail name. The implementation uses the underlying SQL Mail component of SQL Server, which requires Microsoft Outlook or a compatible Messaging Application Programming Interface (MAPI) client to be installed on the server.
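The same alert and operator setup can also be scripted with the SQL Server Agent system stored procedures in msdb. The operator name, e-mail address, database name, and error number below (9002, the transaction log full error) are illustrative only:

```sql
-- Hypothetical operator linked to an e-mail name
EXEC msdb..sp_add_operator
    @name = 'DBA On Call',
    @email_address = 'dba@example.com'

-- Alert on a specific error number for a specific database
EXEC msdb..sp_add_alert
    @name = 'Log Full - MyDB',
    @message_id = 9002,
    @database_name = 'MyDB'

-- Tie the alert to the operator (1 = notify by e-mail)
EXEC msdb..sp_add_notification
    @alert_name = 'Log Full - MyDB',
    @operator_name = 'DBA On Call',
    @notification_method = 1
```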
SQL Server Notification Services is an add-on product for SQL Server 2000. Although you would need to create your own solution based on Notification Services, you can create one that exactly meets your enterprise monitoring needs.
More Info
For more information on SQL Server Notification Services, see the information posted at http://www.microsoft.com/sql/ns/default.asp.
In its most basic form, system uptime can be calculated by identifying outages and subtracting the length of those outages from a predefined window of time. This window might be every minute between 8:00 A.M. and 6:00 P.M. In some cases it could be 24 hours a day, 365 days a year.
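The arithmetic is straightforward. As a sketch, the following computes an uptime percentage for a year-round window with a hypothetical 90 minutes of total recorded outage:

```sql
-- 24 hours a day, 365 days a year = 525,600 minutes
DECLARE @window_minutes DECIMAL(12,2), @outage_minutes DECIMAL(12,2)
SET @window_minutes = 525600
SET @outage_minutes = 90  -- hypothetical total outage for the year

SELECT ((@window_minutes - @outage_minutes) / @window_minutes) * 100
    AS 'Uptime percentage'  -- roughly 99.983
```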
If you really want to monitor your exact uptime number (for example, 99.983 percent), then you need to do two things:
Decide what part of your system will be measured for availability. This would usually be from an application perspective. For a Web site, a company might define that its customer account page must be accessible, and then base its availability measurement on whether that page was available every time the monitoring tool queried it within certain windows of time.
Decide what tool you will use to do this (and where you will store the information). Third parties that specialize in availability monitoring can monitor many enterprise-level applications that have a public presence. If you are going to do it yourself, you will need to either create a tool (using WMI, for example) or use MOM.
Doing these two things is a good way to measure your uptime objectively. However, from a database perspective, it does not help you solve the problem if your numbers are not what you hoped they would be. Additionally, an overall availability number is good from a higher-level viewpoint, but it does not detail exactly where barriers to availability lie. If you design your own tool, make sure that you can also report on the various components of the system and assign separate availability numbers to each. This way you can more easily identify where to spend your time eliminating barriers to availability.
System Monitor plays an important part in availability. For best results, collect the counters from the server you are monitoring, not over the network from another server. Although System Monitor is capable of monitoring over the network, you can load your network down and cause delays, depending on the speed of your network. Also, you would need to ensure that both machines involved are capable of keeping their times synchronized; otherwise, in the event of a problem, coordinating information will be extremely difficult. Unless you have a viable reason not to monitor on the server itself, avoid monitoring remotely.
More Info
Chapter 4 discusses System Monitor counters you can use to monitor your disk subsystem.
Listed next are specific recommended availability counters. You could limit these to a particular process (for example, sqlservr) or a specific disk (the disk holding the data files). However, if you have multiple instances on the server, you need to collect them all. On the other hand, if a separate process is monitoring at that level, then by all means focus only on SQL Server.
Processor\% Processor Time This counter is considered high if all processors are consistently above about 90 percent in most environments, and is considered low if consistently below 30 percent in most environments. Your values will vary. You can monitor total processor usage by selecting _Total, or by the individual processor number. It is not necessary to run individual processor counters over the long haul. Turn on this level when you notice the total processor time peaking in the 80 percent range.
Process\Various Counters Under Process, there are many counters that show the usage of a specific process, such as SQL Server. Some examples are % Processor Time, IO Write Operations/sec, IO Read Operations/sec, Thread Count, and Working Set (good for Address Windowing Extensions [AWE] usage).
SQL Server instances will appear as sqlservr, SQL Server Agent instances as sqlagent, and SQL Manager instances as sqlmangr. If you have multiple instances, these will be enumerated. However, keep in mind that the enumerations are in the order in which your instances were started, not in the order in which you actually installed them onto your system. There is no good way to ensure that your SQL Server instances are started in a consistent manner, because if you start them all in parallel, some might start quicker than others. The only way to ensure that the instances are in the same order is to start them one by one.
Memory\Available Mbytes This tells you how much physical memory is actually free on the system.
Memory\Pages/sec This counter should be very low (under 10 or 20), and none of the paging should be associated with the sqlservr process. Of course, if you are using AWE memory, that comes from the nonpageable pool so you would expect this counter to always be zero (or close to it). You can use DBCC MEMORYSTATUS to provide a snapshot of the current memory status of SQL Server. The output shows distribution of 8 K pages among the various components of SQL Server. It contains sections for stolen memory (the lazywriter is not allowed to flush stolen buffers, which include most of the buffers that are in use), free memory (buffers not currently in use), procedures (compiled stored procedures and ad hoc plans), InRam (that is, pinned tables), dirty (pages not yet flushed to disk), kept (temporarily pinned in-use pages), I/O (waiting on pending I/O), latched (pages being scanned), and other (data pages not currently in use).
PhysicalDisk\Avg. Disk sec/Transfer This counter is more useful than the disk queue length for seeing exactly how long you are waiting on the disk to respond. Because response time is made up of service time and queuing time, as queue lengths increase the response time becomes dominated by queuing time. Longer queue lengths therefore translate directly to longer response times. If you are seeing response times greater than 25 msec for random requests or 2 msec for sequential requests, you probably have some nontrivial queuing delays (but not necessarily cause for alarm). Certainly if you exceed 50 msec average response time you should be concerned. This counter in particular is a great baseline counter to use over time.
The following counters are related to a particular SQL Server instance, so you need to gather information on all the instances currently running on that server. All of the SQL Server counter information also appears in the system table master..sysperfinfo, so you might wish to collect that information from within SQL Server instead; how to do that is described in the next section.
SQLServer: Buffer Manager\Buffer cache hit ratio The Buffer cache hit ratio tells you, in a percentage format, how much SQL Server could use the pages in memory versus having to read from the disk. This number will vary, because if you have a heavy write and not a heavy read application, you will not be keeping many things in memory.
SQLServer: SQL Statistics\Batch Requests/sec The Batch Requests/sec counter reveals the incoming workload for prepared statements, but if your application uses stored procedures it might not be a good counter to use because a procedure might have many statements in a single batch. The value of this counter should be high (reflecting high usage of the system), and is only a matter of concern when it is very low (below 100 or so), but even that might not represent a problem if you have large stored procedures.
SQLServer: Databases\Transactions/sec If you have a lot of writes to your database, this counter might be a better indicator of how much work is being done in a unit of time. This is also another great baseline counter.
If neither Batch Requests/sec nor Transactions/sec is a really good indicator of workload, you might have to rely on an application-specific metric to determine your throughput.
SQLServer: General Statistics\User Connections By itself, this counter does not indicate much, but it can be useful when evaluated over time and in comparison with the other counters.
SQLServer: SQL Statistics\SQL Compilations/sec This counter is considered high if consistently above 10 or so at steady state; ideally the value should be zero.
SQLServer: Memory Manager\Connection Memory (KB) This tells you how much memory the sum of the user connections for that instance is consuming.
SQLServer: Memory Manager\Total Server Memory (KB) This tells you how much memory the instance is consuming.
SQLServer: General Statistics\Logins/sec and Logouts/sec These tell you how many logins or logouts you are having. If there is an excessive amount of either, you could be getting hacked.
SQLServer: General Statistics\User Connections This tells you the total number of users at any given point for an instance.
SQLServer: Buffer Manager\AWE counters There are various counters for monitoring AWE usage.
Make sure your collection interval is not more frequent than once every 15 seconds (it can be a much wider interval, up to 15 minutes). When you choose to archive the information for later use, such as deriving a new baseline value, you should only keep an average over a period of time to cut down on the amount of data you are storing.
There are several ways to automate the collection of information from System Monitor (or other aspects of the operating system). One is to use MOM; another is to create your own monitoring collection agent using WMI.
More Info
There is a lot more information available about WMI in the TechNet Script Center online at http://www.microsoft.com/technet/scriptcenter; or you can simply navigate to it from the main menu tree on TechNet. Also, there are some code samples on the CD that accompanies this book.
You can use simpler, old-fashioned methods for automating the collection of this information. First, you should be aware that System Monitor (Perfmon) is an ActiveX object and can be started from HTML.
More Info
Using HTML to monitor with System Monitor is included as part of the SQL Server 2000 Resource Kit in the automatic System Monitor generation tool that is on the SQL Server 2000 Resource Kit CD under DBManagement. You could also code your own solution, but this provides a good base to start from.
Another way is to set it up through the System Monitor interface. To do this, simply navigate to counter logs in System Monitor, right-click, and select New Log Settings from the shortcut menu. Add the counters in question and set the collection interval to something reasonable. In Microsoft Windows 2000, set it up to save the output to a .csv file, and then save it.
Never use a Universal Naming Convention (UNC) path to a log file. The overhead is far too high for the volume of information that needs to be logged.
If you choose to log data to a file, you can set a maximum limit on the file size. You can also schedule it to start a new log file when one is full (which breaks the data into easy-to-import chunks). Unless you opted to store the data in a SQL Server table, you need to decide how long the log should collect data once it is started by the alert. You might schedule it to stop the log after three hours, for example, or at a specific time. You can set it to run a command every time a log file closes, so you could actually have it import the data as it goes.
In Windows Server 2003, you have the option of saving the output to a SQL Server database. This is a good solution for some scenarios. There will be some overhead, but this is lightweight monitoring, and you should observe the impact before running it all the time, as discussed earlier in the chapter. You might drop entries if you are logging to tables, so be careful when using this approach.
Then, go to System Monitor alerts, create a new one, and add a counter to it. This counter sets off the alert. To find this counter, select the SQL Server performance object named User Settable; there is only one counter for this one (Query), so just select the instance you want to use, such as User Counter 1. Set the alert to occur when the value is over the limit of, for example, 0. Then set the sample interval to every 30 to 60 seconds (or whatever seems reasonable to you); this number reflects how much of a delay there can be between turning on the counter and triggering the alert. Next, set up the action the alert will take. In this case we want it to start a performance log; choose the one you just created.
Inside of SQL Server, create a job that will run the following commands as steps:
EXECUTE sp_user_counter1 1
WAITFOR DELAY '00:05:00'
EXECUTE sp_user_counter1 0
This way you can execute the job on demand to set off the monitoring from within SQL Server, if you wish. You could also trigger the monitoring directly without creating the alert and job. You also want to collect information from a trace at the same time.
Next, you need a script to import the Perfmon data. The easiest way to do this is to import the first file manually so you can get your table created correctly. The table always defaults to varchar columns unless you format it yourself to match your counter data. This is where the process can get difficult. If your filename is always the same, you are all set. However, if it is a rollover file that appends numbers, or if it simply appends the date, the filename changes, and automating the load becomes a problem. This is easy enough to solve if you have the coding knowledge, and you might want to learn it to broaden your horizons. Numerous sample scripts available online in developer communities can assist with this task.
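As a sketch of one such import, BULK INSERT can load a counter log saved as a .csv file into a staging table. The path and table name are assumptions, and the staging table's columns must already match the counters in your log:

```sql
-- Load a System Monitor .csv log into a previously created staging table
BULK INSERT perfmon_data
FROM 'C:\PerfLogs\SQLCounters.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    FIRSTROW = 2  -- skip the header row of counter names
)
```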
Once the information is imported, you can query it and report on it like any other data in a table. One huge benefit of having this information in the database is having the full facility of Transact-SQL at your disposal to easily find out precisely what was occurring in a certain time range.
Sysperfinfo is a system table, and its contents are described in Table 15-6. Normally, you should not use system tables, but this is a table used for read-only information that can prove very useful. Because System Monitor essentially gets its values from sysperfinfo, you can create your own management tool that might be more natural for you to work with as a database developer than System Monitor.
There are a few caveats to be aware of before you use sysperfinfo directly. First, it only captures information for the first 99 databases on an instance; any databases beyond those are not in the system table. Querying it also impacts your system's processor, memory, and disk I/O to some small degree; take this into account if you are monitoring Processor Time, Pages/sec, and Disk sec/Transfer information. Also, because sysperfinfo contains dynamic data, it does not store the information historically; you need your own mechanism to save the information to another table. Finally, remember that these counters are not persistent, so they reset every time the instance restarts. If you end up with data that does not seem correct, remember that the counter values roll over when they pass the limit of the column data type.
There are a few types of counters represented in sysperfinfo. In some cases, you cannot just select a counter and use its value directly. For example, to calculate a percentage (such as the buffer cache hit ratio), you need two values: the main value itself (such as Buffer cache hit ratio) and the base value, which is the number you divide by. The base value is usually denoted by the word base appended to the associated counter name. All base counters have a cntr_type value of 1073939459, and the counter used as the numerator has a cntr_type value of 537003008. Examples are shown here:
-- Instance-wide counter
SELECT (CAST(a.cntr_value AS NUMERIC(20,4)))/(CAST(b.cntr_value AS NUMERIC(20,4)))
    AS 'Buffer cache hit ratio'
FROM master..sysperfinfo a, master..sysperfinfo b
WHERE a.counter_name = 'Buffer cache hit ratio'
AND b.counter_name = 'Buffer cache hit ratio base'

-- Example for a specific instance_name
SELECT (CAST(a.cntr_value AS NUMERIC(20,4)))/(CAST(b.cntr_value AS NUMERIC(20,4)))
    AS 'Cache Hit Ratio'
FROM master..sysperfinfo a, master..sysperfinfo b
WHERE a.counter_name = 'Cache Hit Ratio'
AND b.counter_name = 'Cache Hit Ratio Base'
AND a.instance_name = '_Total'
AND b.instance_name = '_Total'
Other counters, such as averages, need to be calculated as well. This means that you need to store the value for a counter and then perform the math to get the average based on the current value and the last value. That is how System Monitor displays it. However, by tracking it over time, you can get longer averages than you might get with a normal System Monitor trace. Usually the counters that are averaged are cumulative ones, such as Transactions/sec, which is available for different instance_name values. All cumulative counters have a cntr_type value of 272696320. By themselves, cumulative counters are only snapshots of the moments in time they are measuring. The following is an example of grabbing one point in time for the counter Transactions/sec for all databases:
SELECT instance_name, 'Transactions/sec' = cntr_value
FROM master..sysperfinfo
WHERE object_name = 'SQLServer:Databases'
AND counter_name = 'Transactions/sec'
ORDER BY instance_name
If you want to get the average, you need to store these values in a table or somewhere else where you can average two numbers. The following example takes the values from a table called average_values. The table is populated by an application that runs the following statement (either as explicit Transact-SQL or as a stored procedure):
INSERT INTO average_values
SELECT object_name, counter_name, instance_name, cntr_value, GETDATE()
FROM master..sysperfinfo
WHERE counter_name = 'Transactions/sec'
ORDER BY instance_name
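The INSERT above assumes a table shaped like the following. This is one possible schema, patterned after the sysperfinfo columns, with an identity column and a timestamp column added to support the later examples; the exact types are assumptions:

```sql
CREATE TABLE average_values (
    avg_val_id INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    object_name NCHAR(128) NOT NULL,
    counter_name NCHAR(128) NOT NULL,
    instance_name NCHAR(128) NULL,
    cntr_value INT NOT NULL,
    time_inserted DATETIME NOT NULL
)
```

Because avg_val_id is an identity column, the five-column INSERT from sysperfinfo fills the remaining columns in order.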
If you now want to average the last two values, here is one way to do it:
SELECT TOP 2 *
INTO #tmp_tran_avg
FROM average_values
WHERE instance_name = '_Total'
AND counter_name = 'Transactions/sec'
ORDER BY time_inserted DESC

SELECT SUM(cntr_value)/2 AS 'Average Transactions/sec for two consecutive readings'
FROM #tmp_tran_avg

DROP TABLE #tmp_tran_avg
If you want to average all data captured over a specific period of time, you can do that as well. The following example calculates averages for all data captured over a one-minute period. Remember that unless you are capturing the data at each time interval (say, every second), this number might not be a complete reflection of the average, but it is a good approximation.
DECLARE @max_time DATETIME
DECLARE @max_time_less_1_min DATETIME

SET @max_time = (SELECT MAX(time_inserted) FROM average_values
    WHERE instance_name = '_Total')
SET @max_time_less_1_min = DATEADD(mi, -1, @max_time)

SELECT avg_val_id, instance_name, cntr_value, time_inserted
INTO #tmp_tran_avg
FROM average_values
WHERE instance_name = '_Total'
AND counter_name = 'Transactions/sec'
GROUP BY avg_val_id, instance_name, cntr_value, time_inserted
HAVING time_inserted >= @max_time_less_1_min
ORDER BY time_inserted DESC

SELECT SUM(cntr_value)/COUNT(cntr_value)
    AS 'Average Transactions/sec over a 1 minute period'
FROM #tmp_tran_avg

DROP TABLE #tmp_tran_avg
Finally, you can average all values you have captured over the lifetime of monitoring to get a picture of a certain counter. The following are two examples:
SELECT SUM(cntr_value)/COUNT(cntr_value) AS 'Instance Average Transactions/sec'
FROM average_values
WHERE instance_name = '_Total'
AND counter_name = 'Transactions/sec'

SELECT SUM(cntr_value)/COUNT(cntr_value) AS 'MSDB Average Transactions/sec'
FROM average_values
WHERE instance_name = 'msdb'
AND counter_name = 'Transactions/sec'
More monolithic counters, such as Percent Log Used or Active Transactions, are also only snapshots of the time being observed. These have a cntr_type value of 65536. An example of a monolithic counter query is as follows:
SELECT cntr_value AS 'User Connections'
FROM master..sysperfinfo
WHERE object_name = 'SQLServer:General Statistics'
AND counter_name = 'User Connections'
|On the CD|| |
The preceding Transact-SQL statements can be found in the file Sysperfinfo_Examples.sql.
Excessive compilations can be a significant detriment to performance, typically characterized by high CPU usage and low throughput, and can be viewed directly through the SQLServer:SQL Statistics\SQL Compilations/sec counter. If your monitoring turns up this issue, see Knowledge Base articles 243588, INF: Troubleshooting Performance of Ad-Hoc Queries, and 243586, INF: Troubleshooting Stored Procedure Recompilation, available at http://support.microsoft.com/ for helpful information.
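You can also read this counter through sysperfinfo. The following sketch illustrates one caveat: like the other per-second counters in sysperfinfo, the stored value is cumulative, so a rate requires two samples. The 10-second delay is an arbitrary choice for illustration.

```sql
-- Sample the cumulative compilation counter twice and divide the
-- difference by the elapsed seconds to approximate the per-second rate.
DECLARE @c1 INT, @c2 INT
SELECT @c1 = cntr_value FROM master..sysperfinfo
WHERE object_name = 'SQLServer:SQL Statistics'
AND counter_name = 'SQL Compilations/sec'

WAITFOR DELAY '00:00:10'

SELECT @c2 = cntr_value FROM master..sysperfinfo
WHERE object_name = 'SQLServer:SQL Statistics'
AND counter_name = 'SQL Compilations/sec'

SELECT (@c2 - @c1)/10 AS 'Approx SQL Compilations/sec'
```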
SQL Server Profiler actually runs a trace in SQL Server and makes it easier for you to manipulate the trace graphically. However, running SQL Server Profiler as a graphical utility incurs system overhead that might be better spent servicing database requests. Set up a trace the hard way, using a script, and you can easily control it from a job while getting the same results.
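A minimal scripted trace looks something like the following sketch. The file path is an assumption you should adjust for your server; event 12 is SQL:BatchCompleted, and columns 1 and 13 are TextData and Duration.

```sql
-- Sketch: create a lightweight server-side trace by script instead of
-- running SQL Server Profiler. Option 2 enables file rollover;
-- @maxfilesize is in megabytes.
DECLARE @TraceID INT, @maxfilesize BIGINT, @on BIT
SET @maxfilesize = 50
SET @on = 1

EXEC sp_trace_create @TraceID OUTPUT, 2, N'C:\Traces\mytrace', @maxfilesize, NULL
EXEC sp_trace_setevent @TraceID, 12, 1, @on   -- TextData
EXEC sp_trace_setevent @TraceID, 12, 13, @on  -- Duration
EXEC sp_trace_setstatus @TraceID, 1           -- start the trace
```

Because the trace is identified by the returned @TraceID, a SQL Server Agent job can stop, restart, or reconfigure it with sp_trace_setstatus at any time.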
|On the CD|| |
There is a stored procedure, LRQ_Trace_Start, in the file LRQ.zip that performs this function. It also imports the trace file automatically and stores it in a table.
One of the more interesting features of LRQ_Trace_Start is its use of ::fn_trace_gettable, a function that allows you to query a trace file directly, even one with multiple rollover files, and view it as you would a table. The script LRQ_permanent_trace.sql also contains more code that manages the process and allows it to work properly in a production environment.
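As a simple illustration of the function, the following sketch pulls long-running statements out of a trace file; the path is an assumption, and passing default as the second parameter tells the function to read the entire rollover file set. Duration in SQL Server 2000 traces is reported in milliseconds.

```sql
-- Query a trace file directly as if it were a table, returning
-- statements that ran for 10 seconds (10000 ms) or more.
SELECT TextData, Duration, StartTime
FROM ::fn_trace_gettable('C:\Traces\mytrace.trc', default)
WHERE Duration >= 10000
ORDER BY Duration DESC
```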
The interesting thing about this script is its reusability. There are a couple of common reasons people run a trace. One is auditing, whether of security or of changes to system objects, usually with a very tightly focused trace. Another is performance tuning; a great deal of the tracing on Microsoft's own internal systems is of this variety. If there is a performance problem, the DBAs in charge of these systems want to be the first to know. Too often, though, a trace is run at full bore only during a crisis: because there is seemingly a problem, SQL Server Profiler is started to see if anything looks wrong. DBAs who use SQL Server Profiler truly effectively are unfortunately in the minority.
The script provided lets you run a lightweight, tightly focused trace to help you identify queries with response times that stray into the unacceptable range. The sample trace is run permanently on many critical production SQL Servers at Microsoft. The traces are made permanent by SQL Server Agent jobs that restart them if they are stopped. Daily reports highlighting the previous day's performance are e-mailed to the application owners, and each report also shows performance trends over the last 30 days for comparison. This helps with performance tuning efforts, and if you archive the data you can chart degradations in performance and act on them before they impact availability. Knowing the trend before the end users feel it is the sign of a sophisticated operations organization, and it does not require high-powered monitoring tools to adopt and implement best practices like this.
It is important to note that this trace is designed specifically to mitigate the performance impact on the server. To this end, the only events included are Transact-SQL batch completed and remote procedure call (RPC) completed. Do not add more events to this trace, such as statement starting or stored procedure statement events; although they are useful, they increase the impact on the server and are not something you want in a permanently running trace.
Filter the trace to queries with durations greater than 10 seconds to further limit the performance impact. That threshold is meant for OLTP systems; reporting systems, where longer queries are the norm, would probably require a higher duration filter. By limiting the events captured and by using a duration filter, you should see negligible performance impact on the server.
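In a scripted trace, this filter is a single call, sketched below. @TraceID is assumed to be the ID returned by sp_trace_create; column 13 is Duration, logical operator 0 is AND, and comparison operator 4 is greater than or equal. The filter value must match the column's data type, which is why a bigint variable is used.

```sql
-- Keep only events whose Duration is at least 10 seconds (10000 ms).
DECLARE @duration BIGINT
SET @duration = 10000
EXEC sp_trace_setfilter @TraceID, 13, 0, 4, @duration
```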
|On the CD|| |
To examine trace data, you can use the procedure Monitor_LRQ, which is created by the file LRQ_Monitor.sql in LRQ.zip. Another useful script in LRQ.zip is Performance_Trend.sql, which reports duration trends over specific time periods going back 30 days.
A by-product of a good application is unavoidable database growth. As databases become larger there will always be some tuning and tweaking that can produce better performance. Some of this takes the form of reoptimizing queries and indexes. Having this report and historical comparisons can help you identify where you can best spend your time optimizing queries.
The approach of monitoring for response time shows you how a SQL Server trace can be used to record baseline performance of the database itself and detect significant deviations in response times expected by the user community.
Sharp-eyed readers will notice that the scripts provided do not monitor response time for every functional piece of the system (see the earlier section, How Available Is Available?). The reason is that we do not recommend doing that on every server you have; where you do need it, you can build on the permanent trace example.
To create a full-scale trace of response times for availability purposes, you can take one of several approaches. First, you could simply define a metric such as all queries run in less than 10 seconds, except for a defined list of exceptions. The advantage is that the permanent trace can enforce it. The caveats are that you must make sure it is understood that from the database side you can monitor only query duration, not the actual time it takes results to reach the user's desktop, and that your duration report must exclude the queries your users agree will take longer than the blanket duration number (10 seconds in this case).
Another way to approach it, based on exactly what is running on some of Microsoft's internal servers, is to select specific servers that are targets for high availability measured by response time and set up a permanent trace on them without the duration filter. You will collect significantly more data with this method, so run it for an hour during peak time, calculate how much data to expect, and plan how you will accommodate it from a storage perspective. For example, you could immediately archive the data to your centralized administration SQL Server or to another server that is not heavily used.
You can also experiment with specific filters, such as those that record only queries whose text data contains what you are monitoring for (for example, select, insert, update, and delete, or the names of specific procedures). Alternatively, you can exclude text data you are not looking for by filtering out specific tables or procedures, such as exec sp_cursorfetch. To do this intelligently, you first need to define which items you are going to watch for (specific queries that map to functional areas). This requires some meticulous work, and we suggest you start with one functional workflow. Once you are very comfortable with it, consider others.
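The sp_cursorfetch exclusion mentioned above can be sketched as a text filter on the trace. @TraceID is again assumed to come from sp_trace_create; column 1 is TextData, and comparison operator 7 is NOT LIKE, so matching rows are dropped from the capture.

```sql
-- Exclude cursor fetch noise from the permanent trace.
EXEC sp_trace_setfilter @TraceID, 1, 0, 7, N'%sp_cursorfetch%'
```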
Extended blocking and deadlocking indicate a design issue in your application that needs to be resolved. This could be due to poorly written statements that escalate to table locks, slow hardware, long transactions that must complete before others can use a resource, and more. If your application shows symptoms of blocking and locking, such as low CPU utilization combined with low throughput, you need to capture these problems to decide what actions to take. Several methods are documented on the Microsoft site, including Knowledge Base article 224453, INF: Understanding and Resolving SQL Server 7.0 or 2000 Blocking Problems, as well as elsewhere. Choose the approach you feel most comfortable with and implement it. Just be sure it has a way to map any issues that occur back to the queries that were running.
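For a quick, ad hoc look before reaching for the heavier tooling, you can query sysprocesses directly; this sketch is a starting point, not a replacement for the methods in the Knowledge Base articles.

```sql
-- List currently blocked sessions; the blocked column holds the spid
-- of the session each one is waiting on.
SELECT spid, blocked, waittime, lastwaittype, cmd, hostname, program_name
FROM master..sysprocesses
WHERE blocked <> 0
```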
Beware of using DBCC INPUTBUFFER to capture running queries: when your application uses sp_executesql, the call to sp_executesql is all you will see in the input buffer, not the underlying statement.
|On the CD|| |
There is a tool to assist you in monitoring blocked processes in SQL Server: use the file Blocking_Monitor.sql in Blocking_Monitor.zip. Should you ever need to call Microsoft Product Support Services about a blocking and deadlocking problem with your queries, they will ask you to follow Knowledge Base article 271509, INF: How to Monitor SQL Server 2000 Blocking.
If your files fill up and cannot grow further, your availability is directly impacted. You also do not want files growing during production hours if you can help it, because a database cannot service requests while a file expansion is in progress. It is a good idea to keep tabs on the space remaining. Here is a sample using sysperfinfo to obtain that information programmatically:
SELECT 'Log File(s) Used Size (KB)' = a.cntr_value,
       'Log File(s) Size (KB)' = b.cntr_value,
       'Log File Space remaining' = b.cntr_value - a.cntr_value,
       'Database' = a.instance_name
FROM master..sysperfinfo a
JOIN master..sysperfinfo b
    ON a.object_name = b.object_name
    AND a.instance_name = b.instance_name
WHERE a.object_name = 'SQLServer:Databases'
AND a.counter_name = 'Log File(s) Used Size (KB)'
AND b.counter_name = 'Log File(s) Size (KB)'
|On the CD|| |
Some tools that might help you with monitoring your database space and disk usage are included in Database_Capacity_and_Disk_Capacity_Monitor.zip.
Properly tuned and managed indexes are crucial to good performance. Incorrect indexing can show up as symptoms such as high processor usage with low throughput, table scans, and queries that initiate a high number of reads and have long durations. Keep your statistics up to date, either automatically or by running UPDATE STATISTICS manually, and keep your indexes defragmented as well.
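The basic maintenance pattern can be sketched as follows; the table name Orders is purely a placeholder for one of your own tables.

```sql
-- Refresh distribution statistics for a hypothetical table, then
-- report its fragmentation; high logical fragmentation suggests a
-- DBCC INDEXDEFRAG or DBCC DBREINDEX is in order.
UPDATE STATISTICS Orders
DBCC SHOWCONTIG ('Orders')
```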
|More Info|| |
See Knowledge Base article 243589, HOW TO: Troubleshoot Slow-Running Queries on SQL Server 7.0 or Later, for some information on index tuning, and also refer to HA_Perf_and_Scale.doc on the accompanying CD. For more information on indexes and fragmentation, see Chapter 14, Administrative Tasks to Increase Availability.
|On the CD|| |
For scripts that might help you deal with your fragmentation, see Index_Analysis_Tool.sql and IndexDefrag_Tool.sql in Defrag.zip.
This type of information can be useful when you want to look at your read:write ratio, or when you are tracking usage of a particular database or of each database file separately:
SELECT dats.name,
       stats.FileId,
       stats.NumberReads,
       stats.NumberWrites,
       stats.BytesRead,
       stats.BytesWritten,
       'UserWaitTime(ms)' = stats.IoStallMS,
       BytesPerRead = CASE WHEN ISNULL(stats.NumberReads, 0) > 0
           THEN stats.BytesRead/stats.NumberReads ELSE 0 END,
       BytesPerWrite = CASE WHEN ISNULL(stats.NumberWrites, 0) > 0
           THEN stats.BytesWritten/stats.NumberWrites ELSE 0 END,
       WaitPerRead = CASE WHEN ISNULL(stats.NumberReads, 0) > 0
           THEN stats.IoStallMS/stats.NumberReads ELSE 0 END,
       WaitPerWrite = CASE WHEN ISNULL(stats.NumberWrites, 0) > 0
           THEN stats.IoStallMS/stats.NumberWrites ELSE 0 END
FROM ::fn_virtualfilestats(-1, -1) AS stats
JOIN master.dbo.sysdatabases dats
    ON stats.DBId = dats.dbid
This can become important when you need to track usage of each file separately, either to determine where a problem lies or to help you redesign your database file layout.