SQL, Query Optimization, and Query Processing


It might seem patronizing to call attention to the query word in Structured Query Language (SQL). It is crucial, though, that you as a PC database developer or administrator understand the special significance of the word in SQL Server, even in SQL Server Desktop Engine, which will replace Jet as the default Access database engine. Most Microsoft Access users, for example, never construct or work with more than the most basic SQL statements, because most of the time they work at a very high, often visual, level above the database, doing little more than storing data and navigating a grid. For client/server databases in general, and SQL Server in particular, everything you do starts and ends with the SQL query.

On the face of it, a query is a question or an inquiry into the data in the DBMS, but a SQL query goes a lot further than being just a question. Combined with the Data Definition Language (DDL) and Data Manipulation Language (DML), SQL can create and manipulate storage objects within a database; add, delete, and transform data in those storage objects; and inquire into and analyze that data. Database management systems are managed and driven by humans using SQL.

Every database manufacturer supports the SQL standard, and each has extended SQL in certain ways to support the custom and proprietary features of its product. For SQL Server, Microsoft created the Transact-SQL extensions to the language, and now Transact-SQL, or rather T-SQL, is the de facto language you use to talk to SQL Server. If you have a background in standard SQL, adopting T-SQL will be easy. T-SQL is covered in Part III.

A good database client application should never place the burden of constructing SQL query statements on application users. I have been developing database applications for many years, and every client eventually requests features that enable users to sort, group, and interpret data. Even the smallest companies I have dealt with, three to four computers max, would tell me they need to know why they are doing better or worse this year than last, and that they need data that can shed light on the health of their businesses. And yet we experienced DBAs tend to think that data analysis is only for the Fortune 500.

How we allow our users to query and analyze data without having to learn SQL is a challenge. And if you think you can get away with giving them simple SQL to use for queries, then forget about the language of OLAP cubes. When I first stumbled onto Multidimensional Expressions (MDX), I tripled my health insurance.

We tackle this tricky subject in later chapters, which cover English Query. A good start on this road to user empowerment is to appreciate that SQL is really a very high-level language. It is the language developers and administrators use to “talk” to a DBMS in a widely adopted syntax, but at the developer or administrator level, not the general user level. We will return to this thread in later chapters.

No DBMS, SQL Server included, uses (or can use) SQL as a procedural or instructional language. SQL conveys to the DBMS what a client requires; once the SQL code has been translated, the server goes off and uses whatever optimal internal language it has been programmed with to satisfy the request.

At a lower level than SQL, the methods used to manipulate and analyze data are rooted in relational algebra and relational calculus. In other words, the procedures that satisfy a query or directive on data are based on mathematical formulations. If we were required to write “raw” relational algebra or relational calculus, there would not be many of us DBAs and SQL developers around, but SQL Server is programmed precisely to deal with that. Consider the following expression:

 TX WHERE EXISTS CREDITX
    (CREDITX < 5 AND
     T (T#: TX, CREDIT: CREDITX, STATE: 'NEW YORK'))

This is relational calculus, and the SQL version is

 SELECT * FROM CUSTOMERS WHERE CreditRating < 5 AND State='NEW YORK'

For most of us, talking to a computer in such mathematical terms is difficult, to say the least. SQL, on the other hand, lets us talk to the computer in more comfortable terms. Hebrew makes one of the best comparisons as a communications language. Ancient Hebrew comprises certain vowels and variations adopted by the rabbis of old, who used the language to talk to God (whether God replied is another matter). Today, however, Hebrew is also written in a form without the vowel markings of the Biblical text, which makes it a better lingua franca for the Jewish people of Israel; and a third variation, very different from the other two, is used for handwriting.

The view from SQL on the underlying data is no different from the view from C# on machine code. You would only need to go down to assembler if you were writing code for the low levels of the operating system and talking directly to hardware. In the same way, a developer would not use the internal language of the database server unless he or she were programming logic into the engine, or creating a new engine.

To sum up, then, SQL offers a wide range of developers, administrators, and advanced analysts a means of conversing with a database management system. SQL is neither intended nor equipped to work directly on the data structures, storage systems, and mechanisms of a database.

But the computer still needs to translate our requirements into its own language, which is better optimized and suited to perform relational algebra and relational calculus at the lower level to carry out the queries. This field of study in databases is known as query optimization.

What Is Query Optimization?

If SQL’s SELECT statement imposed fixed procedures on the database engine, every engine would operate in an identical fashion to obtain a result set. The net result would be an inflexible architecture and an engine unable to take advantage of new technology, new processing power, and zillions of variables.

In the study of analytical problem solving, you learn that there is always more than one solution to a problem, and millions of possible outcomes, palatable or not. The methods used to query data address problems that can be solved in a variety of different ways.

The cliché “there are many ways to skin a cat” describes the problem that query optimization was created to address. A result set can be derived through many different steps, and the optimizer’s job is to evaluate each query and determine the quickest means of solving it for the least cost. Cost analysis, performed by most database servers and certainly by SQL Server, is applied mainly in terms of file I/O (as opposed to I/O on rowsets), with memory and other resources also taken into account. A cost is attached to each step in the query, and an optimum execution plan is the result.

Of course, it is possible for a genius to knock out a theorem that proposes a solution an optimizer might deduce; after all, humans are responsible for the logic a computer uses to make such deductions in the first place. But a computer can solve such puzzles far quicker than any human, no matter his or her IQ. The computer, or its optimizer module, is suitably equipped to perform that optimization with the benefit of addressable memory that tells it about the form the data is in (such as data types and structure), the number of rows in a table, the number of distinct values, and so on. The optimizer takes all this information into account while it determines how best to carry out the whims of its users, expressed as SQL.

Optimization is thus a fundamental database server function, required to give a database server product acceptable performance and to keep bettering that performance as the product evolves. It is also needed because instructions received from humans in the form of SQL (and now XML) sit too far above the server’s native processing architecture to serve any performance function. And the more widely adopted SQL becomes, the more opportunity a database server has to become better at what it does at the low level. Humans do not care how optimal a SQL statement is for a server, but they will notice if a query takes a few milliseconds longer to produce results, milliseconds that can easily turn into seconds, minutes, or hours as the query grows more complex and resources grow scarcer. In other words, SQL Server has to care about optimization. Its life depends on it.

The Optimization Process

Every database server, whether a version of SQL Server or a DBMS produced by Sybase or Oracle, is required to take a standard SQL request and convert it into the procedural language that the product uniquely understands. How each product converts the SQL statement, and what it then does with it, is what makes it unique. Algorithms, techniques, and guile are what acquire market share and set the products apart from each other.

An UPDATE may cause triggers, functions, stored procedures, and integrity constraints to fire. These events can kick off a chain reaction, resulting in the opening of other result sets and complex evaluations of collateral or linked data. A simple UPDATE could thus result in a hefty execution plan, a query domino effect, in a manner of speaking.
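To illustrate the domino effect, here is a minimal sketch; the table, column, and trigger names are invented for this example. A single UPDATE against a table carrying a CHECK constraint and an AFTER trigger forces the engine to evaluate far more than the one row being changed:

 CREATE TABLE dbo.Orders (
    OrderID   int PRIMARY KEY,
    Quantity  int CHECK (Quantity > 0),  -- constraint checked on every update
    Audited   datetime NULL
 )
 GO
 CREATE TRIGGER trg_Orders_Audit ON dbo.Orders
 AFTER UPDATE
 AS
    -- the trigger opens further rowsets (the inserted/deleted tables)
    UPDATE O SET Audited = GETDATE()
    FROM dbo.Orders O INNER JOIN inserted I ON O.OrderID = I.OrderID
 GO
 -- this "simple" UPDATE now also fires the constraint check and the trigger
 UPDATE dbo.Orders SET Quantity = Quantity + 1 WHERE OrderID = 1

The one-line UPDATE at the bottom compiles into a plan that must also cover the constraint evaluation and the trigger’s join against the inserted rowset.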

Although the optimizer is cost-based, it cannot be accurate all the time. What’s more, a point of diminishing returns must come when the optimizer realizes that it cannot optimize any further, and that performance may even suffer from further optimization. So a certain amount of heuristic logic has to be built into a DBMS engine to allow it to determine when to draw the line, or to decide when more is becoming less. SQL Server is no exception.

SQL Server invokes a number of essential steps to optimize a query; the three most important are worth mentioning here:

  • It must first transform the query into the low-level language of the engine, which it does by first parsing the SQL statement.

  • It must then identify the possible procedures it can use to satisfy the query. Remember, the query can be manipulative, resultant, or both. This is essentially the optimization process.

  • It must then generate query plans and choose the one that costs the least.
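The second step is easier to picture with an example. The two statements below are logically equivalent ways of asking the same question (a sketch reusing the Orders and Customers tables that appear later in this chapter); a capable optimizer identifies the candidate procedures behind both formulations and will often reduce them to the same, cheapest plan:

 -- formulation 1: a join
 SELECT DISTINCT C.CustomerID
 FROM dbo.Customers C INNER JOIN dbo.Orders O
    ON C.CustomerID = O.CustomerID

 -- formulation 2: a subquery
 SELECT C.CustomerID
 FROM dbo.Customers C
 WHERE EXISTS (SELECT * FROM dbo.Orders O
               WHERE O.CustomerID = C.CustomerID)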

Let’s look at the first steps taken by the SQL Server optimizer code:

The SELECT statement, like any other SQL statement, is a standard request format; it does not influence procedures or invoke or define the functionality of the engine and its capability to extract or update data. It is merely a means of requesting data, updating data, and presenting results in a certain way. It does not tell the server how to go about getting the data; the server understands what is required of it from the syntax and keywords of the SELECT statement. In other words, the SELECT statement is a “desire” or an end, not a means to obtain the end.

Note 

I mentioned XML earlier, and it is important to see XML both as an extension to T-SQL and as an extended definition language for SQL Server. A powerful XML parser has been built into this product for the past several years.

For the record, the SELECT statement is used to define only the following things:

  • The format of the result set returned to the client. Usually the default format is sufficient for the client because it follows the schematic layout of the columns, but often “qualifiers” such as GROUP BY and ORDER BY clauses instruct the engine to present the data in alternative formats.

  • The objects, mostly tables, that contain the data, or a part of the data, required by the client. The FROM clause is the subordinate component in a SELECT statement that identifies the source of the data. The FROM clause can specify more than one source.

  • How tables and other source objects are connected for a combined result set. The JOIN clause is another subordinate clause that specifies the logical connection of the two or more data sources.

  • Conditions or criteria that row data must meet to be included in the selection process. The subordinate clauses include the WHERE and HAVING statements.
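A single statement can exercise all four of these elements at once. The sketch below follows the Orders and Customers tables used later in this chapter; the OrderDate column is an assumption made for the example:

 SELECT CUST.ContactName, COUNT(*) AS OrderCount   -- result set format
 FROM dbo.Orders OD                                -- data source
    INNER JOIN dbo.Customers CUST                  -- connecting the sources
       ON OD.CustomerID = CUST.CustomerID
 WHERE OD.OrderDate >= '20050101'                  -- row criteria
 GROUP BY CUST.ContactName                         -- alternative format
 HAVING COUNT(*) > 5                               -- group criteria
 ORDER BY OrderCount DESC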

Before the SQL statement (or the services requested by it) is passed to the query optimizer, it must first be validated by the statement parser, which is part of the greater relational engine, as opposed to the kernel we talked about earlier. The parser ensures that the SQL statement is syntactically correct and that it will be fully understood and productive. If the SQL code you send is flawed, the parser will “throw” an exception and return a syntax error to the client application. (The error is sent out-of-band to the client; in other words, it travels via a separate channel and not via the connection established by the client. SQL Server’s error reporting mechanisms are explained more fully in Chapter 4.)

If the syntax “flies,” the parser will break the statement into logical units. Such units include keywords, parameters or arguments, operators, and identifiers. The parser scans the text, inspecting delimiters, and determines what represents actual objects and what represents row values or variables. You can specify how SQL Server should evaluate what you send it. For example, you can change the character that represents a delimiter in a SQL Server statement. This can be done interactively through settings in Management Studio or in T-SQL code (in similar fashion to define statements in C or C++ code). We will deal with such advanced subjects in the programming chapters in Part III.
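One real setting of this kind is the QUOTED_IDENTIFIER option, which changes how the parser treats the double-quote character, either as an identifier delimiter or as a string delimiter (the Customers table follows the examples used elsewhere in this chapter):

 SET QUOTED_IDENTIFIER ON
 -- double quotes now delimit identifiers, so this names a column
 SELECT "CustomerID" FROM dbo.Customers

 SET QUOTED_IDENTIFIER OFF
 -- double quotes now delimit character strings
 SELECT CustomerID FROM dbo.Customers WHERE ContactName = "Smith"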

Note 

Many a SQL Server expert will come to a situation that might require (or tempt) him or her to second-guess the optimizer and suggest a “hint” for consideration. While you might not be so bold, it is important to understand the factors that can influence performance.

When all parsing is complete, the parser performs a grouping function, breaking down complex statements into smaller logical collections of routines and operations. In this fashion, using typical problem-solving techniques, the best query execution plan can be determined. In other words, the last function of the parser is to help the optimizer, which kicks in next, to see the “wood” for the “trees” in a complex request, thus contributing to the best execution plan. It is thus fitting that the format into which the original T-SQL statement is translated is called the query tree.

Note 

A query tree is also known as a syntax tree or a sequence tree.

The kernel finally compiles the query tree into a series of operations to perform against native rowsets stored in the databases. The kernel gains access to the databases via the services of the storage engine. The relational kernel and the storage kernel communicate via the OLE DB API, using standard SQL SELECT statements that OLE DB can pass to the storage engine. (As you are aware, OLE DB is complex to code against, hence ADO and ADO.NET, but it is highly efficient and ideal for the interlocution of two kernels.) The following code represents a simple SELECT query issued by the relational engine to the storage engine:

 SELECT CustID FROM CustDetails

Obviously, the execution plan here is a no-brainer for the kernel because the SELECT statement is of the simplest form. The plan here would thus consist of a simple SELECT statement. This is represented in the simple flow diagram illustrated in Figure 3–2.

Figure 3–2: The steps taken by SQL Server to parse and optimize a T-SQL statement and create an execution plan

The operations are defined against table or index objects, or both, stored in the database. If no index objects are defined against the data, the execution plan is constructed to perform the necessary table scan.

But what takes place when a SELECT statement rivaling Fermat’s Last Theorem arrives over the wire? Here is something a little more complex, but still relatively simple:

 USE MODERNIZE
 SELECT OD.OrderID, OD.CustomerID, CUST.ContactName
 FROM dbo.Orders OD INNER JOIN dbo.Customers CUST
    ON OD.CustomerID = CUST.CustomerID

What you have here is a request for the return of two result sets from two tables that are joined to form a third result set. Although it is still not the most complex SELECT, the kernel breaks the statement into two execution steps, one for each table. It will call the OLE DB OPENROWSET method once for each table and then finally perform the necessary join of each rowset returned. When the relational kernel has finished working with the storage engine, it will format the result set to be sent to the client into the necessary tabular format and place that back onto the wire.

So what intervention, insight, or control do you have over the process I have just described? The good news for DBAs and developers is that SQL Server 2005 comes equipped with some marvelous tools to play with. Besides query hints, SQL Server Management Studio (see Chapter 4) lets you test a query for syntax, display actual execution plans, estimate execution plans, and so forth. It even lets you inspect the metrics derived in the optimization process, in calculations, and in the steps taken to carry out the execution plan. I do not want to go into much detail about these query tools here, because we use them a lot in the chapters in Part III. You will also investigate and learn about the Index Tuning Wizard introduced in Chapter 12.
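You can also inspect a plan from T-SQL itself. The SET SHOWPLAN_TEXT session option (a real option; it must be the only statement in its batch, hence the GO separators) asks the server to return the estimated plan instead of executing the statement. The query below reuses the Orders/Customers join example:

 SET SHOWPLAN_TEXT ON
 GO
 SELECT OD.OrderID, CUST.ContactName
 FROM dbo.Orders OD INNER JOIN dbo.Customers CUST
    ON OD.CustomerID = CUST.CustomerID
 GO
 SET SHOWPLAN_TEXT OFF
 GO

While SHOWPLAN_TEXT is on, no statements are executed; the server returns the plan steps as rows of text.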

How the SQL Server query processor resolves indexes, views, partitioned views, stored procedures, and so on is discussed in the chapters that focus on these areas.

Caching and Reusing Execution Plans

Many SELECT statements sent to a server are identical, or differ only in the values specified in the search expression. Consider the following SELECT statements:

 SELECT * FROM CustOrds WHERE OrdValue < '400'
 SELECT * FROM CustOrds WHERE OrdValue < '1000'
 SELECT * FROM CustOrds WHERE OrdValue < ?

An intelligent optimizer will deduce that the preceding SELECT statements are likely to result in the same or very similar execution plans. It would therefore make no sense for SQL Server to recompile an execution plan if one was already compiled for a previous user not long ago.

SQL Server thus maintains a pool of memory, called the procedure cache, in which it stores both execution plans and data buffers. The preceding examples are very simple, and if SQL Server determines that it would cost more to store such a plan than to recompile it, the plan is not stored. The size of the memory pool varies from system to system and changes with the state of the system. Any number of users can reuse SQL Server execution plans.

To further support a reuse policy, SQL Server also maintains a parameter data structure used to store the parameters sent in by each user. These structures are known as execution contexts. SQL Server places each user, actually each connection, executing a query into an execution context.

An execution plan that matches a user’s query can be used instead of compiling a new execution plan. SQL Server needs only to obtain the parameter or variable value in the SQL statement and combine that with the already compiled execution plan. The result represents significant savings in time and resources, which translates into better response time by the end user.

Any time a statement is sent to the server, the server first inspects the procedure cache to see whether an existing execution plan can satisfy the SQL statement.

Note 

A number of factors can cause SQL Server to drop a plan. A change of data type is such a factor.

The saving comes from obviating the need to build a new execution plan. It should thus be obvious that you can do a lot in your code to ensure that your SQL statements have a higher probability of being matched. Your goal as a DBA and systems programmer is to increase the chance that SQL Server will reuse complex plans that have already been compiled. The use of parameters in stored procedures, and of parameter markers in ADO, OLE DB, and ODBC applications, goes a long way toward ensuring this. No doubt this becomes more difficult when we start dealing with NULL values and three- and four-dimensional values, which are discussed in Chapter 17.
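From T-SQL, the same effect can be had with the sp_executesql system procedure, which separates the statement text from the parameter values so that one cached plan can serve every execution. The sketch reuses the CustOrds table from the earlier example; the money data type for OrdValue is an assumption:

 DECLARE @stmt nvarchar(200), @params nvarchar(100)
 SET @stmt = N'SELECT * FROM CustOrds WHERE OrdValue < @MaxValue'
 SET @params = N'@MaxValue money'

 -- both calls can match the same cached plan; only the parameter differs
 EXEC sp_executesql @stmt, @params, @MaxValue = 400
 EXEC sp_executesql @stmt, @params, @MaxValue = 1000

Because the statement text is byte-for-byte identical on both calls, the server has an easy match against the procedure cache.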

The execution plan and context-matching algorithms have been enhanced in SQL Server 2005, which greatly improves performance over version 2000 and earlier. Like SQL Server 2000, SQL Server 2005 keeps an execution plan in the procedure cache for as long as it may be needed. Old and unused plans are purged only when the space is needed and, obviously, when the system is restarted.

Tip 

DBCC FREEPROCCACHE is a directive you can use to purge the procedure cache and force SQL Server to recompile execution plans the next time their statements are run.
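A quick sketch of the tip in action; the dynamic management view query assumes SQL Server 2005, where sys.dm_exec_cached_plans is available:

 -- count the plans currently held in the procedure cache
 SELECT COUNT(*) AS CachedPlans FROM sys.dm_exec_cached_plans

 -- purge the cache; subsequent statements will be recompiled
 DBCC FREEPROCCACHE

Reserve this for test systems: on a busy server, flushing the cache forces a burst of recompilation that can hurt response times.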

The algorithm maintains an age counter seeded with a replica of the query cost factor. The value in the age field is incremented by the cost each time the object is referenced by a connection. For example, say a query plan is given a cost factor of eight: each time it is referenced, the age counter grows by eight, becoming 16, 24, 32, and so forth. The lazywriter process, discussed earlier in this chapter, periodically scans the procedure cache and decrements each number in the age field by 1. If a query plan is not being used, it is eventually devalued to 0 and thereafter ignored. The plan is kept in the cache until SQL Server needs the memory, at which point any plans at 0 are flushed.

There are other reasons to destroy the execution plans. They include the following:

  • The schema is altered or objects, such as tables, are dropped or changed (the ALTER TABLE or ALTER VIEW statement is an example of such a change that will flush the cache).

  • An index used by a plan is dropped.

  • You explicitly issue a request to recompile (this is done by calling the sp_recompile stored procedure).

  • Keys are being significantly changed. This could result from excessive transaction processing that fundamentally alters the table referenced by a query.

  • Triggers, stored procedures, functions, and so forth are placed on tables referenced by your plan.

  • Any function that causes the table to grow considerably will render an execution plan obsolete.
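The explicit route in the list above looks like this in practice; sp_recompile marks every plan that references the named object (dbo.Orders follows the earlier examples) for recompilation on its next use:

 -- flag all cached plans that reference dbo.Orders for recompilation
 EXEC sp_recompile 'dbo.Orders'

This is handy after a change, such as a large data load, that the automatic invalidation rules might not catch immediately.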

The subject of optimization and execution plans is vast, but it is essential to the fine-tuning and optimization of both systems and applications, for DBMS products in general and SQL Server in particular. In later chapters, you will build on the foundation you have gained here.




Microsoft SQL Server 2005. The Complete Reference
ISBN: 0072261528
Year: 2006
Pages: 239
