SQL Server 2005 Query Engine | MCITP Self-Paced Training Kit Exam 70-442 .Designing and Optimizing Data Access by Using Microsoft SQL Server 2005

Simply put, the SQL Server 2005 query engine, sometimes known as the relational engine, is responsible for efficiently answering requests from users and providing output. The query engine is made up of the following components, as shown in Figure B-12:

Query Optimizer The Query Optimizer is responsible for constructing an efficient "plan of attack" to answer queries.
Procedure Cache Manager The Procedure Cache Manager is responsible for managing the storage and retrieval of query execution plans that the optimizer has constructed.
Statistics Manager The Statistics Manager is responsible for maintaining statistical information on indexes to ensure that the most up-to-date information is used when constructing a query plan.

image from book
Figure B-12: The SQL Server 2005 query engine components

Each of these components works together to construct the most efficient plan, or set of instructions, to provide to the storage engine for data retrieval and storage.

One of the most prevalent design goals for the SQL Server 2005 query engine was to make it invisible to the user. The hope was that developers would never need to know about the internal workings of the query engine, and it would just do its job without interaction. In previous versions of SQL Server, developers of large-scale applications spent a fair amount of time trying to fool the query engine into generating more efficient query plans. This practice led to a number of query hints and options being embedded in T-SQL code that helped instruct the query engine. Unfortunately, due to the improvements in the SQL Server 2005 query engine, this practice can lead to less-efficient queries, and as the applications are upgraded to SQL Server 2005, many developers are finding that their applications are running more slowly. Developers need to understand exactly how the query engine processes data in SQL Server 2005 to make the best choices for creating efficient queries within their applications.

Query Engine Processing

The most basic query type is the SELECT query. SELECT simply instructs the database engine to retrieve certain data based on the criteria specified in the FROM and WHERE clauses. A SELECT statement is nonprocedural, in that it does not state the exact steps that the query engine should use to retrieve the requested data. This means that the query engine must analyze the statement to determine the most efficient way to extract the requested data. This is referred to as query optimization. The component that does this is called the query optimizer. The input to the optimizer consists of the query, the database schema (table and index definitions), and the database statistics. The output of the optimizer is a query plan.

A SELECT statement only defines the following:

The format of the result set. This is specified mostly in the select list. However, other clauses, such as ORDER BY and GROUP BY, also affect the final form of the result set.
The tables that contain the data. This is specified in the FROM clause.
How the tables are logically related. This is defined in the JOIN specifications, which might appear in the WHERE clause or in an ON clause following FROM.
The conditions that the rows in the source tables must satisfy to qualify for the statement. These are specified in the WHERE and HAVING clauses.

A query execution plan is a definition of the following:

The sequence in which the source tables are accessed.
Typically, there are many sequences in which the query engine can access the base tables to build the result set. For example, if the SELECT statement references three tables, the database server could first access Table1, use the data from Table1 to extract matching rows from Table2, and then use the data from Table2 to extract data from Table3. It is also possible that the query engine could choose Table3 first. This is important because database developers must construct queries that do not rely on data being retrieved in any specific order.
The methods used to extract data from each table.

Generally, there are different methods for accessing the data in each table. If only a few rows with specific key values are required, the database server can use an index. If all the rows in the table are required, the database server can ignore the indexes and perform a table scan. If all the rows in a table are required, but there is an index whose key columns are in an ORDER BY, performing an index scan instead of a table scan might save a separate sort of the result set. If a table is very small, table scans might be the most efficient method for almost all access to the table.

The process of selecting one execution plan from potentially many possible plans is referred to as optimization. The query optimizer is one of the most important components of SQL Server 2005. While some overhead is used by the query optimizer to analyze the query and select a plan, this overhead is typically balanced out when the query optimizer picks an efficient execution plan.

The SQL Server query optimizer is a cost-based optimizer. Each possible execution plan has an associated cost in terms of the amount of resources used to answer the query. The query optimizer must analyze the possible plans and choose the one with the lowest estimated cost. SELECT statements can have thousands of possible execution plans. In these cases, the query optimizer does not analyze all possible combinations. Instead, it uses complex algorithms to find an execution plan that has a cost reasonably close to the minimum possible cost.

The SQL Server query optimizer does not choose only the execution plan with the lowest resource cost; it chooses the plan that returns results to the user with a reasonable cost in resources and that returns the results the most quickly. For example, processing a query in parallel typically uses more resources than processing it serially, but it completes the query more quickly. The SQL Server optimizer will use a parallel execution plan to return results if the load on the server will not be adversely affected.

The query optimizer relies on distribution statistics when it estimates the resource costs of different methods for extracting information from a table or index. Distribution statistics are retained for columns and indexes. They indicate the selectivity of the values in a particular index or column. For example, in a table representing people, many people can live in the same city and even have the same last name, but each person has a unique Social Security Number (SSN). An index on the SSN is more selective than an index on the city or last name. If the index statistics are not current, the query optimizer might not make the best choice for the current state of the table.

The query optimizer is important because it enables the database server to adjust dynamically to changing conditions in the database without requiring input from the developer or administrator of the application. This enables developers to focus on developing the query and trusting that the query optimizer will build an efficient execution plan for the state of the database every time the statement is run.

To put it all together, the query engine performs the following steps:

The parser scans the SELECT statement and breaks it into logical units such as keywords, expressions, operators, and identifiers.
A query tree, sometimes referred to as a sequence tree, is built describing the logical steps needed to transform the source data into the format required by the result set.
The query optimizer analyzes different ways the source tables can be accessed. It then selects the series of steps that returns the results the most quickly while using fewer resources. The query tree is updated to record this exact series of steps. The final, optimized version of the query tree is called the execution plan.
The relational engine starts executing the execution plan. As the steps that require data from the base tables are processed, the relational engine requests that the storage engine pass up data from the rowsets requested from the relational engine.
The relational engine processes the data returned from the storage engine into the format defined for the result set and returns the result set to the client.

The basic steps described for processing a SELECT statement apply to other SQL statements such as INSERT, UPDATE, and DELETE. UPDATE and DELETE statements both have to target the set of rows to be modified or deleted. The process of identifying these rows is the same process used to identify the source rows that contribute to the result set of a SELECT statement. The UPDATE and INSERT statements might both contain embedded SELECT statements that provide the data values to be updated or inserted.

SQL Server 2005 has a pool of memory that is used to store both execution plans and data buffers. The percentage of the pool allocated to either execution plans or data buffers fluctuates dynamically, depending on the state of the system. The part of the memory pool that is used to store execution plans is referred to as the procedure cache.

One important aspect of the query execution steps is the fact that once the "best" query plan is generated, it is stored in procedure cache so that subsequent executions of the query do not have to generate the plan.

Query Plan Cache and Retrieval

SQL Server 2005 execution plans have two main components: the query plan and the execution context, as shown in Figure B-13.

image from book
Figure B-13: The SQL Server 2005 Query Execution Plan Architecture

The purpose of these components is as follows:

Query Plan. The bulk of the execution plan is a re-entrant, read-only data structure used by any number of users. This is referred to as the query plan. No user context is stored in the query plan. There are never more than one or two copies of the query plan in memory: one copy for all serial executions, and another for all parallel executions. The parallel copy covers all parallel executions, regardless of their degree of parallelism.
Execution Context. Each user that is currently executing the query has a data structure that holds the data specific to their execution, such as parameter values. This data structure is referred to as the execution context. The execution context data structures are reused. If a user executes a query and one of the structures is not being used, it is reinitialized with the context for the new user.

When any SQL statement is executed in SQL Server 2005, the query engine first looks through the procedure cache to verify that an existing execution plan for the same SQL statement exists. SQL Server 2005 reuses any existing plan it finds, saving the overhead of recompiling the SQL statement. If no existing execution plan exists, SQL Server 2005 generates a new execution plan for the query.

SQL Server 2005 has an efficient algorithm to find any existing execution plans for any specific SQL statement. In most systems, the minimal resources that are used by this scan are less than the resources that are saved by being able to reuse existing plans instead of compiling every SQL statement.

One of the problems with the cache architecture that the SQL Server query engine uses for execution plan storage is that memory is a finite resource. To ensure that the procedure cache memory is utilized in the most effective manner, the query engine must manage the lifetime of query plans stored in procedure cache.

Procedure Cache Storage Management

After an execution plan is generated, it stays in the procedure cache. SQL Server 2005 ages old, unused plans out of the cache only when additional memory is required. Each query plan and execution context has an associated cost factor that indicates how expensive the structure is to compile. These plans also have an age field. Every time the object is referenced by a connection, the age field is incremented by the compilation cost factor. For example, if a query plan has a cost factor of 10 and is referenced two times, its age becomes 20. The lazy writer process periodically scans the list of objects in the procedure cache. The lazy writer then decrements the age field of each object by one on each scan. It will take 20 passes of the lazy writer process to age out a plan with an age of 20. The lazy writer process deallocates an object if the following conditions are met:

The memory manager requires memory and all available memory is currently being used.
The age field for the object is 0.
The object is not currently referenced by a connection.

Because the age field is incremented every time an object is referenced, frequently referenced objects do not have their age fields decremented to 0 and are not aged from the cache. Objects infrequently referenced are soon eligible for deallocation, but are not actually deallocated unless memory is required for other objects.

Certain changes in a database can cause an execution plan to be either inefficient or invalid, based on the new state of the database. The query engine detects the changes that invalidate an execution plan and marks the plan as not valid. A new plan must then be recompiled for the next connection that executes the query. The conditions that cause a new plan to be created are:

Schema changes made to a table or view referenced by the query
Changes to any indexes used by the execution plan
Updates on statistics used by the execution plan, generated either explicitly from a statement, such as UPDATE STATISTICS, or generated automatically
Dropping an index used by the execution plan
An explicit call to sp_recompile
Large numbers of changes to index keys
For tables with triggers, if the number of rows in the inserted or deleted tables grows significantly
Executing a stored procedure using the WITH RECOMPILE option (which means the stored procedure won't take advantage of the procedure cache)

Most recompilations are required either for statement correctness or to obtain potentially faster query execution plans.

In previous versions of SQL Server, whenever a statement within a batch causes recompilation, the whole batch, whether submitted through a stored procedure, trigger, ad-hoc batch, or prepared statement, is recompiled. In SQL Server 2005, only the statement inside the batch that causes recompilation is recompiled. Because of this difference, recompilation counts in SQL Server 2000 and SQL Server 2005 are not comparable. Also, there are more types of recompilations in SQL Server 2005 because of its expanded feature set.

Statement-level recompilation benefits performance because, in most cases, a small number of statements causes recompilations and their associated penalties, in terms of CPU time and locks. These penalties are therefore avoided for the other statements in the batch that do not have to be recompiled.

The SQL Server Profiler SP:Recompile trace event reports statement-level recompilations in SQL Server 2005. This trace event reports only batch recompilations in previous versions and can lead to confusion if not examined closely. Also, in SQL Server 2005, the TextData column of this event is populated so that more information is returned by the trace.

SQL Server 2005 also adds a new trace event called SQL:StmtRecompile that reports statement-level recompilations. This trace event can be used to track and debug recompilations. Whereas SP:Recompile generates only for stored procedures and triggers, SQL:StmtRecompile generates for stored procedures, triggers, ad-hoc batches, batches that are executed by using sp_executesql, prepared queries, and dynamic SQL.

The following profiler trace events are relevant for observing and debugging plan caching, compilation, and recompilation behaviors:

Cursors: CursorRecompile. Useful for observing recompilations caused by cursor-related batches.
Objects: Auto Stats. Useful for observing the statistics updates caused by SQL Server's "auto-stats" feature.
Performance: Show Plan All For Query Compile. Useful for tracing batch compilations. It does not distinguish between a compilation and a recompilation. It produces showplan data in textual format
Performance: Show Plan XML For Query Compile. Useful for tracing batch compilations. It does not distinguish between a compilation and a recompilation. It produces showplan data in XML format
Stored Procedures: SP:Recompile. Useful for detecting when a recompilation occurs.

See Chapter 8, "Improving Database Application Performance," for a more complete discussion of the SQL Server Profiler. For more information on how SQL Server 2005 caches query execution plans, see the TechNet article, "Batch Compilation, Recompilation, and Plan Caching Issues in SQL Server 2005" at http://www.microsoft.com/technet/prodtechnol/sql/2005/recomp.mspx.

You can use the following query to list all query plans that are currently cached by the query engine and how often they are used:

 SELECT    cacheobjtype,    objtype,    pagesused,    usecounts,    sd.[name],    sql FROM sys.syscacheobjects so JOIN sys.sysdatabases sd ON so.dbid = sd.dbid;

The output of this query returns a result set with one row for each object, as well as some statistical information on how the object is used. The query also returns the SQL Statement that was used to generate the plan. This can be very helpful when examining which queries are most frequently run on a busy system.

SQL Server 2005 caches query plans for a variety of statement types submitted to it for execution. Query plan caching causes query plan reuse, avoids compilation penalty, and utilizes plan cache better. Some coding practices hinder query plan caching and reuse, and therefore should be avoided. (This is unlike previous versions of SQL Server, where hints were the only way to ensure the optimizer was doing its job.) SQL Server detects opportunities for query plan reuse. In particular, query plans can become non-reusable for two reasons: Schema of an object appearing in a query plan can change, thereby making the plan invalid; and data in tables, referred to by a query plan, can change enough to make a plan sub-optimal. SQL Server detects these two classes of conditions at query execution time and recompiles a batch or pieces of it as necessary. Bad T-SQL coding practices can increase recompilation frequency and adversely affect SQL Server's performance.