Chapter 22: SQL Server and ADO.NET: Notes on Performance | Applied ADO.NET: Building Data-Driven Solutions

Ever since the early days of relational databases, optimizing database performance has always been as much an art as a science. Even today, you'll often hear of new "secrets" through professional relationships that take on the air of furtive backalley exchanges more reminiscent of a Le Carr novel than a sober discussion between colleagues. This chapter discusses some SQL Server and ADO.NET performance issues. It also discusses some best practices to make sure you get the most out of your code.

Improving SQL Server Performance

The following sections provide some of the "back-alley" secrets we've accumulated over the years, as well as some of the more methodical approaches for examining your database and determining the best course of action to achieve the best performance.

Using Indexes

Properly selecting and managing your indexes can have the single biggest impact on performance using any database—and Microsoft SQL Server is no different. Indexes are like little optimized tables that organize the data you'll most commonly use so that you can find rows in the main table optimally.

To manage the indexes on a table, simply right-click the table in Enterprise Manager, choose All Tasks from the menu that pops up, and then select Manage Indexes (see Figure 22-1).

click to expand
Figure 22-1: The Manage Indexes menu in Enterprise Manager

From the dialog box that appears, you can manage any of the tables in any of the databases running in the current instance of Microsoft SQL Server (see Figure 22-2). You can change databases by choosing a different database from the Database drop-down list, or you can change the table you're managing your indexes on by choosing another table from the Table drop-down list.

click to expand
Figure 22-2: The Manage Indexes dialog box

Once you have the database and table selected that you would like to manage the index on, the current indexes that exist on the table display in the Existing Indexes area.

To add a new index to the table, click the New button at the bottom of this dialog box. This opens the Create New Index dialog box (see Figure 22-3). Having a good understanding of this dialog box and its implications is critical to creating a top-performing application. Determining which columns you should index in different contexts will be covered later in this chapter, so for now you'll focus on the options for determining the behavior of the index you'd like to create.

click to expand
Figure 22-3: The Create New Index dialog box

Checking the Unique Values box tells SQL Server that this index will be unique. If you're creating an index on a single-column identity column or any other type of single column where you know the value will always be unique, check this box. The index then knows that once it has found the item it's looking for, it shouldn't look any further, which makes for a faster index.

The real value of this option comes into play when dealing with a multiple-column index that, when combined, produces a unique value. For example, a complicated join table may not have any single column that represents a unique value, but a combination of two of the columns is unique.

The Pad Index and Fill Factor options are related. When you build an index, the index pages contain as much data as possible by default. However, if you subsequently add data to the table, then it takes a little longer because new index pages have to be created from scratch. Setting the Fill Factor option allows you to control how much spare space is left in each index page; a frequently updated table might have a Fill Factor option as low as 20 or even 10 percent. An index with a lower Fill Factor takes up more space, but it provides better performance. Note, though, that once the spare space is fully used, subsequent index pages are completely filled as if their Fill Factor was zero—regardless of what the setting was when the index was created. At this point, you should recompile the index to recover the performance enhancement. At installation, the default Fill Factor option for a server is set at 0, meaning each index page is completely filled, but you can alter this either through the SQL Server Enterprise Manager Server Properties dialog box or by using the sp_configure system stored procedure.

The Pad Index option performs a similar function and in fact uses the same percentage as the Fill Factor option. SQL Server indexing uses an indexing method called b-trees, which means the index contains a number of nodes where each node acts as a parent to two or more subnodes. The lowest pages of the index, beyond which there are no further nodes, are called leaf pages, and it is these that the Fill Factor option affects. Padding the index allocates spare space in the intermediate nodes, allowing for much greater expansion to the table's data than the Fill Factor option alone would support. Typically, however, setting the Fill Factor option provides enough optimization, and you should pad the index only in high-use scenarios.

The Clustered Index option internally sorts the data in a way that is optimal for a number of queries, especially those involving GROUP BY, ORDER BY, BETWEEN, or join clauses. A clustered index also provides a large boost for queries that involve aggregate functions such as MIN or MAX.

Clustered indexes don't help on columns that contain data with little variability such as bit columns or columns that join to something finite such as values representing states. Unless the country goes on a huge annexation binge, this number will remain relatively small compared to other things that are better candidates for your clustered index.

You can have only one clustered index on any given table, so think about what you want to take advantage of the potential performance boost. This may sound trivial; however, if you closely examine a table, you'll see a number of options. You'll have to determine which ones your queries will use the most.

The Do Not Recompute Statistics option comes with a warning that it is not recommended. In almost all circumstances, heed this warning or you could get some unexpected results. The only situation where you should use this option is where a table is read only and will never be altered.

Choosing Which Columns to Index

Knowing which columns to index is as important as knowing what options to use when indexing them. This is a good place to start: Any column you frequently query is probably a good candidate for an index.

The best candidates for indexes are usually the primary keys of your tables. The reason for this is when you're doing a multiple table join, you'll usually do the join on your primary key. If most of your queries use a particular column, then it's probably optimal to make that column a clustered index. It's never necessary to create an index for your primary keys because of the nature of how SQL Server stores data relating to auto-numbered primary keys.

How SQL Server Stores Your Data

SQL Server works well when storing data. Its overall goal is to make sure that when a row is inserted, it can be placed physically between rows above and below your primary key in terms of sort order. By ensuring that the rows are stored in the order of your primary key, primary key searches are very fast. For tables where you're providing your own value for your primary key, you'll want to use table padding.

If your table uses an auto-numbered primary key, your records will automatically be in the correct order as you insert them because of the auto numbering that SQL Server uses to store your data. In this case, it's not beneficial to use table padding.

Also, when choosing what columns to index, remember that you only get one clustered index. The clustered index groups and orders your index in a manner that it can quickly find what is being queried against.

Also, don't waste your clustered index on a finite state column. This includes bits and numbers with little variance (for example, an integer between 1 and 10). The clustered index shines on columns with broad variances in data, not in situations where most of the data in the column is the same from row to row.

Understanding the SQL Server Profiler

You can learn quite a few things about how to optimize your database by using the SQL Server Profiler. The SQL Server Profiler allows you to see what the database is doing, when it is doing it, and how long it takes.

You can launch the SQL Server Profiler from the Microsoft SQL Server submenu in the Start menu or from the Tools menu of Enterprise Manager. Once you've launched the SQL Server Profiler, you can start a new trace by choosing File New Trace. This opens the Trace Properties dialog box (see Figure 22-4).

click to expand
Figure 22-4: The Trace Properties dialog box

Once you've launched the Trace Properties dialog box, you'll see a number of important options. If you intend on being able to run the optimizer (which can automatically perform a number of optimizations on your database), you'll need to save the output to a trace file. You can do this by checking the Capture to File box and then choosing a file by clicking the folder icon to the right of the box.

The Events tab allows you to manually choose the events you'd like to include in the trace. These events include table scans, stored procedures being called, and more. It's worth your time to look through this menu and identify the types of events in which you're interested.

The Data Columns tab provides you with the ability to choose the kind of information you want captured when each of the selected events occurs. Some of the options included are the Application Name, CPU Time Used, IO, and Duration.

The Filters tab allows you to choose the items you would like excluded from your trace. Generally, you don't need this tab when working against a development database whose traffic is limited to your usage, but it's helpful in building a trace against a large server that has multiple databases on it. In other words, it's helpful to be able to filter out the "noise."

Once you've started your trace, you'll be able to watch the activity on your database in real time and understand how long each query and stored procedure takes to execute (see Figure 22-5).

click to expand
Figure 22-5: The SQL Server Profiler with a trace running

If you're watching a trace on a live production system, you should be able to quickly identify any stored procedures that are taking a long time to run by looking at the Duration column. Any stored procedure taking longer than 50 milliseconds is a potential danger to your application's scalability.

Optimizing Automatically with the SQL Server Profiler

The SQL Server Profiler also provides an outstanding tool that can automatically optimize your database for you. Database veterans are probably cringing as they read this after having been burned by one of the variety of applications that promised to do this and delivered marginal results. However, SQL Server's Index Tuning Wizard provides most of the optimization that applications need as long as you understand a few caveats.

Foremost, it's important to keep in mind that the Index Tuning Wizard does nothing for optimizing the database for queries contained in stored procedures. This is a bit of a problem given that it's a good practice to keep all of your database queries wrapped in stored procedures for optimal performance.

Note

For more information about SQL Server performance, see Brad M. McGehee's site at www.sql-server-performance.com. You can also refer to William R. Vaughn's books, notably ADO.NET and ADO Examples and Best Practices for VB Programmers, Second Edition (Apress, 2002) or the classic Hitchhiker's Guide to Visual Basic and SQL Server (Microsoft Press, 1998), which discusses Visual Basic 6, SQL Server 7, and SQL Server 2000, Microsoft Data Engine (MSDE), and ADO 2.0.

Fortunately, there are some workarounds to this particular problem. You can copy and paste queries from your stored procedures into the Query Analyzer and have it assess the load you're putting on the database and make an informed decision on what indexes should be added.

Once you've pasted your query into the Query Analyzer, choose Query Perform Index Analysis, and the Query Analyzer will use the SQL Server Profiler to show you the best course of action (see Figure 22-6).

click to expand
Figure 22-6: The Query Analyzer's recommendations for the current query

If the Query Analyzer has recommendations on what you should do, a second dialog box will pop up and ask if you want it to automatically implement the recommendations. Often, it recommends adding an index to a particular column that is causing a table scan because table scans take significantly longer to run than indexed queries (see Figure 22-7).

click to expand
Figure 22-7: The Query Analyzer's recommendation to create an index that would help the query run faster

You can also attach the SQL Server Profiler to run a trace against your production database to identify stored procedures that are taking longer to execute than you would like. Then, launch the wizard and choose Profile the Performance of a Stored Procedure; the tool will test that particular stored procedure and recommend what you can do to optimize its performance.

It's largely a matter of personal preference as to how you want to work to improve the performance of your stored procedures. Although the wizard makes suggestions, you don't have the opportunity to tweak and alter the stored procedure the way you do with the Query Analyzer.

Optimizing Your Transact-SQL Queries for Performance

You can tune your Transact-SQL queries and stored procedures for performance in a number of ways. Think of your queries as asking the database a question and think of the optimizations you make on the query as asking the question in a way that requires the database less thought. Listing 22-1 shows an inefficient query that you can optimize.

Listing 22-1: The Query from SelectVideosCheckedIn, Not Optimized

 SELECT VideoTapeID, Title, Description FROM VideoTape WHERE VideoTapeID NOT IN     (     SELECT VideoTapeID FROM VideoCheckOut as vco WHERE VideoTapeID NOT IN          (SELECT VideoTapeID From VideoCheckIn as vci WHERE          vco.VideoCheckOutID=vci.VideoCheckOutID)     )

The query in Listing 22-1 took 15 seconds to return 42,000 rows from the VideoStore2 database available in the downloadable code for this book. Most of this is the time required to pipe out that many rows to the Query Analyzer. But even with factoring that in, there's a way to shave a second off the query time. Figure 22-8 shows the execution plan of Listing 22-1.

click to expand
Figure 22-8: The estimated execution plan of the query in Listing 22-1

By using NOT IN to test the subqueries, SQL Server has to do a comparison to make sure that the VideoTapeID doesn't match at each level of the subquery. To understand what SQL Server is doing to perform this query, it's helpful to be able to see the execution path and examine it for comparisons and table scans that you can eliminate.

You can view this in the Query Analyzer by choosing Query Display Estimated Execution Path. This tool is an especially powerful weapon in the war against inefficient queries.

Next to each item in the execution path is a cost in terms of the percentage of the overall execution time (always adding up to 100 percent). As you follow the execution path and move your mouse over each item, you should pay attention to a few key items in the pop-up text. The three most critical factors to pay attention to are Estimated Number of Executes, Estimated CPU, and Estimated IO.

Estimated Number of Executes lets you know how many times the comparison will be performed. If you can think of a way to keep the logic of your query and still reduce the number of times this comparison must be performed, you can have an excellent impact on performance.

The Estimated CPU and Estimated IO columns let you know how much CPU usage and disk IO is resulting for a single step in the process of executing your stored procedure.

Using the information derived from the estimated execution plan, you can identify a few areas that are ripe for optimization. You can significantly reduce the table scan icon that is taking up 60 percent of the execution time if you use a narrowing item in the WHERE clause.

One of the biggest impacts you can have on the speed of your query execution is keeping the result set as small as possible. To this end, you can modify the SelectVideosCheckedIn stored procedure to include a simple way to reduce the number of rows that must be compared to the subquery. In this case, narrowing the results based upon the CategoryID vastly reduces the number of rows that will be analyzed and returned. Listing 22-2 lists an optimized version of SelectVideosCheckedIn.

Listing 22-2: An Optimized Version of the SelectVideosCheckedIn Query

 SELECT VideoTapeID, Title, Description FROM VideoTape WHERE     CategoryID=5 AND     VideoTapeID NOT IN      (     SELECT VideoTapeID FROM VideoCheckOut as vco WHERE VideoTapeID NOT IN          (SELECT VideoTapeID From VideoCheckIn as vci WHERE vco.VideoCheckOutID=vci.VideoCheckOutID)     )

Another effect of narrowing the results from the VideoTape table before performing the subquery comes from the Estimated Number of Executions that are now expected to run on the Table Spool step and its related substeps. Without narrowing the results by CategoryID first, this step executed 42,000 times—one for each row in the database. With the VideoTape table narrowed by CategoryID, this number drops to a few thousand, providing an enormous boost to the performance of the SQL statement. Figure 22-9 shows the execution plan of Listing 22-2.

click to expand
Figure 22-9: The estimated execution plan of the query in Listing 22-2

A second strategy you can use to reduce the number of rows returned is to use SELECT TOP to reduce the number of rows you expect to be returned. This can provide an enormous performance boost, especially in situations where you don't care precisely what records from the larger result set are required.

An example of where this behavior is beneficial is in an unordered queue situation. For example, if you had a table that tracks tasks for employees to perform and you wanted a query that would return 10 tasks for the employee to do, you could write a query like this one:

 SELECT TOP 10 TaskID, Name FROM Tasks WHERE Assigned=0

The performance increase comes from two elements. First, the number of rows that have to be piped across the network is significantly less. Second, the query can stop running as soon as it has any 10 records that meet the criteria.

If the query had to have the tasks in some type of order, such as Date Due column, then the query would lose the benefit of being able to stop executing as soon as it identified 10 items meeting the criteria. Adding an ORDER BY clause forces the query to completely execute to return the top 10 items in the correct order in which you expect them.

Understanding the execution plans of your queries can help you devise effective strategies for producing better stored procedures.

Avoiding Excessive Stored Procedure Recompilation

One of the reasons stored procedures are faster than simply executing SQL directly against the database is that stored procedures are interpreted and compiled into a structure that SQL Server can execute efficiently. The danger that comes from this is excessive recompilation if SQL Server thinks something has changed in your stored procedure.

A simple way to avoid excessive recompilations is to keep all of your variable and temporary table declarations together in your stored procedure. Some types of declarations, such as the creation of a temporary table, cause the stored procedure to recompile, but if it surrounded by other such statements, they will all be compiled at the same time instead of breaking execution as each statement is compiled.

Another cause of your stored procedure being frequently recompiled is if the contents of the tables the stored procedure is querying change frequently. If you have a table that is constantly changing, using sp_executesql can keep your stored procedure from recompiling (see Listing 22-3). The downside of this particular approach is that SQL Server won't be able to cache the execution plan of the query you run in your stored procedure.

Listing 22-3: Using sp_executesql to Execute a Database Query

 EXEC dbo.sp_executesql N' SELECT VideoTapeID, Title, Description FROM VideoTape WHERE     VideoTapeID NOT IN     (     SELECT VideoTapeID FROM VideoCheckOut as vco WHERE VideoTapeID NOT IN          (SELECT VideoTapeID From VideoCheckIn as vci WHERE vco.VideoCheckOutID=vci.VideoCheckOutID)     )'

To identify what stored procedures in your database are recompiling excessively, use the SQL Server Profiler and make sure you have the SP:Recompile event selected. The SQL Server Profiler shows you when each recompilation occurs. If you're able to identify any stored procedures that recompile often, you'll have an excellent place to start tuning your database performance.

Avoiding Locks

To avoid two threads accessing the same resource at the same time, SQL Server has a locking system. If your CPU usage and IO are low but your database is performing slowly, it's likely you have some bad locking conditions occurring.

If you have a stored procedure that needs frequent recompilation, make sure you include the table owner in all of your statements. When a stored procedure is being recompiled, it's in a locked state and can't be called by another process until it has finished recompiling. If it has to look up who owns each table as it does the recompile, it'll take longer.

Also, avoid using cursors at all costs. A cursor allows row-by-row processing on a result set. Sometimes a cursor is necessary, but it can put a lock on the table while it executes. You can rewrite most queries that use a cursor using either a CASE clause or at least a TABLE variable if you're using SQL Server 2000.

Finally, only use transactions when you need to—and keep transactions short. Because a transaction can't have any data modified outside of the transaction while it's occurring, it'll lock almost everything it touches. If you think through the execution of your transaction and come up with any way to keep your transactional integrity without actually using a transaction, then you can reduce the number of locks that occur.

Designing Your Tables for Optimal Performance

To achieve the best performance from your database, you'll need to start thinking about performance as you design your table schema. In every step of the process, think about how you'll query the data in the table and how you'll run updates on it. Unfortunately, sometimes you'll have to break away from clean design and choose one that is not as clean as you would like it to be to achieve optimal performance.

A common myth in designing a database for performance is that every table must have a primary key. In general, it's a good idea for every table to have a primary key—unless the primary key will never be used. A good example of this is a join table that sits between two tables and expresses a relationship between the two. Adding an extra column to a table that represents a many-to-many relationship is like having a fifth wheel on your car—it won't break anything, but it might hurt your gas mileage just a little. Figure 22-10 shows a join table.

click to expand
Figure 22-10: A join table that doesn't need a primary key

In the table structure shown in Figure 22-10, the VideoTapeType table doesn't need a primary key. If you added another VideoTapeTypeID column as an autonumbered primary key, it would be completely superfluous and add a useless piece of data to the database.

If performance is absolutely critical down to the last cycle and you're willing to commit to an unclean database design, denormalizing your data can be an effective means of achieving that performance. This type of situation comes up fairly often when dealing with a system that allows people to log in and search for companies. In this situation, you're working on a database that has two types of companies—companies with people who can log into your system and companies that are just searchable data. It'll provide better performance for the companies that can log in to have the companies that can't in a separate table. The reason for this is it's likely that the number of companies that can't log in is significantly greater than those that can, but at the same time you'll likely be doing many more queries on the companies that can log in. It's definitely a convoluted situation and bad practice to track the same data in two places. However, if you can keep the amount of data in one table that will be frequently accessed small to the detriment of good design, it can help quite a bit.

Checking the Size of Database Pages

In SQL Server 2000, data is internally stored as a collection of 8KB pages. The more rows that can be stored on a single 8KB page, the faster all queries run on the database will be and the less space the table will take up.

A simple way to find out how many bytes each row takes up is to look at the Size column in design view. If you add up the size of each of these columns and it's perfectly divisible by 8KB, then you have a perfect situation where no space will be wasted in the pages. This doesn't mean you should intentionally make your tables larger just to be divisible by 8KB—but in general smaller tables are better for your application's design. If you have a table with more than 10 columns, make sure your requirements are best served by what you're doing.

Database page size is one of the reasons why it's generally better to have multiple small tables as opposed to a denormalized table that contains a large number of columns.

Understanding Denormalization

The opposite of normalization is denormalization, which essentially amounts to creating large, flat table structures that have all of the columns you need on a single row. Denormalized tables are generally used for reporting, for derived tables, or for data warehousing. In general, it's not recommended to use a denormalized schema unless it's for one of those reasons—generally in a data warehouse or "business intelligence" environment.

The performance benefit of these tables comes from the fact that the database doesn't need to join multiple tables together. Denormalization is useful for situations where you need most of the data that's in the table. If a denormalized table has 40 columns, and you really only need the data in two of the columns the majority of the time, you're likely better off just using a regular normalized table schema to achieve optimal performance.

Understanding the Dangers of Database Design Slippage

You have to maintain a fine balance when creating a database schema between three main elements:

Clean normalized design
Performance
Speed of development

If you're designing a large, complicated application, a good normalized design is probably the way to go. If scalability is the prime consideration, then performance comes first. If the application you're developing is small and doesn't have scalability requirements, obviously speed of development is at the top of your list.

Above all, make sure you're designing the application the way it's needed, and as with all professional applications, make the right tool for the job and avoid architectural "gold plating" (creating a system that is beautifully geeky to a programmer but doesn't solve the business problem any better than a less-geeky solution). The people who pay your salary probably don't know or care that your application uses an absolutely perfect normal form.

With this consideration, you have to be careful when you start slipping from clean design because without discipline it can become a "slippery slope." Sometimes doing the "wrong" thing because it's fast can make development slow later as you work to cover up the sins of your past. Always think in terms of both longterm lazy and short-term lazy (described in Chapters 18 and 19).