The SQL Server Engine | Inside Microsoft SQL Server 7.0 (Mps)

Figure 3-1 shows the general architecture of SQL Server. For simplicity, we've made some minor omissions and simplifications and ignored certain "helper" modules.

Now let's look in detail at the major modules.

The Net-Library

The Net-Library (often called Net-Lib, but in this book we'll use Net-Library) abstraction layer enables SQL Server to read from and write to many different network protocols, and each such protocol (such as TCP/IP sockets) can have a specific driver. The Net-Library layer makes it relatively easy to support many different network protocols without having to change the core server code.

A Net-Library is basically a driver that is specific to a particular network interprocess communication (IPC) mechanism. (Be careful not to confuse driver with device driver .) All code in SQL Server, including Net-Library code, makes calls only to the Microsoft Win32 subsystem. SQL Server uses a common internal interface between Microsoft Open Data Services (ODS), which manages its use of the network, and each Net-Library. If your development project needs to support a new and different network protocol, all network-specific issues can be handled by simply writing a new Net-Library. In addition, you can load multiple Net-Libraries simultaneously , one for each network IPC mechanism in use.

SQL Server uses the Net-Library abstraction layer on both the server and client machines, making it possible to support several clients simultaneously on different networks. Microsoft Windows NT, Windows 95, and Windows 98 allow multiple protocol stacks to be used simultaneously on one system. Net-Libraries are paired. For example, if a client application is using a Named Pipes Net-Library, SQL Server must also be listening on a Named Pipes Net-Library. (You can easily configure multiple Net-Libraries using the SQL Server Network Utility, which is available under Programs\Microsoft SQL Server 7.0 from your Start button.)

click to view at full size.

Figure 3-1. The major components of SQL Server architecture.

It is important to distinguish between the IPC mechanisms and the underlying network protocols. IPC mechanisms used by SQL Server include named pipes, RPC, SPX, and Windows Sockets. Network protocols used include TCP/IP, NetBEUI, NWLink (IPX/SPX), Banyan VINES SPP, and Appletalk ADSP. Two Net-Libraries, Multiprotocol and Named Pipes, can be used simultaneously over multiple network protocols (NetBEUI, NWLink IPX/SPX, and TCP/IP). You can have multiple network protocols in your environment and still use only one Net-Library.

On machines running Windows 95 or Windows 98, you can also use Shared Memory as an IPC mechanism. This is considered both the IPC and the Network protocol and is available only for local connections in which the client is running on the same machine as SQL Server. In addition, SQL Server running on Windows 95 or Windows 98 machines does not support named pipes as an IPC.

The Multiprotocol Net-Library uses the RPC services of Windows NT, Windows 95, or Windows 98. It could just as well have been called the "RPC Net-Library," but "Multiprotocol" better conveys its key benefit. Because the Multiprotocol Net-Library uses the RPC services of the operating system, it can encrypt all traffic (including requests , data, and passwords) between the client application and the SQL Server engine.

Both named pipes and RPC services in Windows NT support impersonation of security contexts to provide an integrated logon capability (also known as Windows NT Authentication). Instead of requiring a separate user ID/password logon each time a connection to SQL Server is requested , SQL Server can impersonate the security context of the user running the application that requests the connection. If that user has sufficient privileges (or is part of a Windows NT domain group that does), the connection is established. Note that Windows NT Authentication is not available when SQL Server is running on Windows 95 or Windows 98. When you connect to SQL Server running on Windows 95 or Windows 98, you must specify a SQL Server logon ID and password.

Some History
If RPCs had been available years ago, Net-Library might never have been invented. When SQL Server ran only on OS/2, it supported only named pipes. The developers wanted to broaden this support to SPX and TCP/IP and potentially other protocols, so they developed Net-Library as an abstraction layer. Now RPC services are available with so many network protocols that RPC alone might have met the need.

Which Net-Library Is Fastest ?
Strictly speaking, the TCP/IP Sockets Net-Library is the fastest Net-Library. In a pure network test that does nothing except throw packets back and forth between Net-Library pairs, it is perhaps 30 percent faster than the slowest Net-Library. But for LAN environments and applications, the speed of the Net-Library probably makes little difference because the network interface is generally not a limiting factor in a well-designed application.

On a LAN, however, turning on encryption with the Multiprotocol Net-Library does cause a performance hit ” it's the slowest Net-Library option when encryption is turned on. But again, most applications probably wouldn't notice the difference. Your best bet is to choose the Net-Library that matches your network protocols and provides the services you need in terms of unified logon, encryption, and dynamic name resolution. (These choices will be explained further in Chapter 4.)

Open Data Services

Open Data Services (ODS) functions as the client manager for SQL Server; it is basically an interface between server Net-Libraries and server-based applications, including SQL Server. ODS manages the network: it listens for new connections, cleans up failed connections, acknowledges "attentions" ( cancellations of commands), coordinates threading services to SQL Server, and returns result sets, messages, and status back to the client.

SQL Server clients and the server speak a private protocol known as tabular data stream ( TDS ). TDS is a self-describing data stream. In other words, it contains tokens that describe column names , datatypes, events (such as cancellations), and status values in the "conversation" between client and server. The server notifies the client that it is sending a result set, indicates the number of columns and datatypes of the result set, and so on ” ; all encoded in TDS. Neither clients nor servers write directly to TDS. Instead, the open interfaces of DB-Library and ODBC at the client emit TDS. Both use a client implementation of the Net-Library.

ODS accepts new connections, and if a client unexpectedly disconnects (for example, if a user reboots the client computer instead of cleanly terminating the application), resources such as locks held by that client are automatically freed.

You can use the ODS open interface to help you write a server application, such as a gateway. (MicroDecisionware, subsequently purchased by Sybase, developed its successful database gateway using ODS.) Such applications are called ODS server applications . SQL Server is an ODS server application, and it uses the same DLL (OPENDS70.DLL) as all other ODS applications.

ODS Read and Write Buffers

After SQL Server puts result sets into a network output buffer that's equal in size to the configured packet size , the Net-Library dispatches the buffer to the client. The first packet is sent as soon as the network output buffer (the write buffer) is full or, if an entire result set fits in one packet, when the batch is completed. In some exceptional operations (such as one that provides progress information for database dumping or provides DBCC messages), the output buffer is flushed and sent even before it is full or before the batch completes.

SQL Server has two input buffers (read buffers) and one output buffer per client. Double-buffering is needed for the reads because while SQL Server reads a stream of data from the client connection, it must also look for a possible attention. (This allows that "Query That Ate Cleveland" to be canceled directly from the issuer. Although the ability to cancel a request is extremely important, it's relatively unusual among client/server products.) Attentions can be thought of as "out-of-band" data, although they can be sent with network protocols that do not explicitly have an out-of- band channel. The SQL Server development team experimented with double-buffering and asynchronous techniques for the write buffers, but these didn't improve performance substantially. The single network output buffer works very nicely . Even though the writes are not posted asynchronously, SQL Server doesn't need to write through the operating system caching for these as it does for writes to disk.

Because the operating system provides caching of network writes, write operations appear to complete immediately with no significant latency. But if several writes are issued to the same client and the client is not currently reading data from the network, the network cache eventually becomes full and the write blocks. This is essentially a throttle . As long as the client application is processing results, SQL Server has a few buffers queued up and ready for the client connection to process. But if the client's queue is already stacked up with results and is not processing them, SQL Server stalls sending them and the network write operation to that connection has to wait. Since the server has only one output buffer per client, data cannot be sent to that client connection until it reads information off the network to free up room for the write to complete. (Writes to other client connections are not held up, however; only those for the laggard client are affected.)

SQL Server adds rows to the output buffer as it retrieves them. Often, SQL Server can still be gathering additional rows that meet the query's criteria while rows already retrieved are being sent to the client.

Stalled network writes can also affect locks. For example, if READ COMMITTED isolation is in effect (the default), a share lock can normally be released after SQL Server has completed its scan of that page of data. (Exclusive locks used for changing data must always be held until the end of the transaction to ensure that the changes can be rolled back.) However, if the scan finds more qualifying data and the output buffer is not free, the scan stalls. When the previous network write completes, the output buffer becomes available and the scan resumes. But, as stated above, that write won't complete until the client connection "drains" (reads) some data to free up some room in the pipe (the virtual circuit between the SQL Server and client connection).

If a client connection delays processing results that are sent to it, concurrency issues can result because locks are held longer than they otherwise would be. A sort of chain reaction occurs: if the client connection has not read several outstanding network packets, further writing of the output buffer at the SQL Server side must wait because the pipe is full. Since the output buffer is not available, the scan for data might also be suspended because no space is available to add qualifying rows. Since the scan is held up, any lock on the data cannot be released. In short, if a client application does not process results in a timely manner, database concurrency can suffer.

The size of the network buffer can also affect the speed at which the client receives the first result set. As mentioned earlier, the output buffer is sent when the batch, not simply the command, is done, even if the buffer is not full. (A batch is one or more commands sent to SQL Server to be parsed and executed together. For example, if you are using OSQL.EXE, a batch is the collection of all the commands that appear before a specific GO command.) If two queries exist in the same batch and the first query has only a small amount of data, its results are not sent back to the client until the second query is done or has supplied enough data to fill the output buffer. If both queries are fast, this is not a problem. But suppose the first query is fast and the second is slow. And suppose the first query returns 1000 bytes of data. If the network packet size is 4096 bytes, the first result set must wait in the output buffer for the second query to fill it. The obvious solution here is either to make the first command its own batch or to make the network packet size smaller. The first solution is probably the best one in this case, since it is typically difficult to fine-tune your application to determine the best buffer size for each command. But this doesn't mean that each command should be its own batch. Quite the contrary. In fact, under normal circumstances, grouping multiple commands into a single batch is most efficient and recommended because it reduces the amount of hand shaking that must occur between client and server.

ODS Default Net-Libraries

By default, SQL Server on Windows NT always listens on named pipes as well as on TCP/IP and Multiprotocol. SQL Server on Windows 95 and Windows 98 listens on the Shared Memory library instead of on named pipes, but it also has TCP/IP and Multiprotocol available. You can add other Net-Library interfaces. On Windows NT, you can also remove any of the Net-Libraries, but it's best not to remove named pipes. All the other Net-Libraries on Windows NT require an actual network. Because named pipe services exist in Windows NT even when no network is present, using named pipes leaves you a back door into SQL Server even if your network becomes totally nonfunctional. Similarly, SQL Server on Windows 95 and Windows 98 always listens over the Shared Memory IPC by default, so you should avoid removing this option. Even with no network, which is a more likely scenario in Windows 95 and Windows 98 than in Windows NT, Shared Memory is still available for interprocess communication.

Figure 3-2 shows the path from the SQL Server client application to the SQL Server engine and shows where the Net-Library interface fits in. On the server side, ODS provides functionality that mirrors that of ODBC, OLE DB, or DB-Library at the client. Calls exist for an ODS server application to describe and send result sets, to convert values between datatypes, to assume the security context associated with the specific connection being managed, and to raise errors and messages to the client application.

ODS uses an event-driven programming model. Requests from servers and clients trigger events that your server application must respond to. Using the ODS API, you create a custom routine, called an event handler, for each possible type of event. Essentially, the ODS library drives a server application by calling its custom event handlers in response to incoming requests.

click to view at full size.

Figure 3-2. The path from a SQL Server client application to the SQL Server engine.

ODS server applications respond to the following events:

Connect events When a connect event occurs, SQL Server initiates a security check to determine whether a connection is allowed. Other ODS applications, such as a gateway to DB/2, have their own logon handlers that determine whether connections are allowed. Events also exist that close a connection, allowing the proper connection cleanup to occur.
Language events When a client sends a command string, such as an SQL statement, SQL Server passes this command along to the command parser. A different ODS application, such as a gateway, would install its own handler that accepts and is responsible for execution of the command.
Remote stored procedure events These events occur each time a client or SQL Server directs a remote stored procedure request to ODS for processing.

ODS also generates events based on certain client activities and application activities. These events allow an ODS server application to respond to changes to the status of the client connection or of the ODS server application.

In addition to handling connections, ODS manages threads (and fibers) for SQL Server. It takes care of thread creation and termination and makes the threads available to the User Mode Scheduler (UMS). Since ODS is an open interface with a full programming API and toolkit, other ISVs writing server applications with ODS get the same benefits that SQL Server derives from this component, including SMP-capable thread management and pooling as well as network handling for multiple simultaneous networks. This multithreaded operation enables ODS server applications to maintain a high level of performance and availability and to transparently use multiple processors under Windows NT because the operating system can schedule any thread on any available processor.

The Relational Engine and the Storage Engine

The SQL Server database engine is made up of two main components, the relational engine and the storage engine. Unlike in earlier versions of SQL Server, these two pieces are clearly separated, and their primary method of communication with each other is through OLE DB. The relational engine comprises all the components necessary to parse and optimize any query. It requests data from the storage engine in terms of OLE DB rowsets and then processes the rowsets returned. The storage engine comprises the components needed to actually access and modify data on disk.

The Command Parser

The command parser handles language events raised by ODS. It checks for proper syntax and translates Transact -SQL commands into an internal format that can be operated on. This internal format is known as a query tree . If the parser does not recognize the syntax, a syntax error is immediately raised. Starting with SQL Server 6, syntax error messages identify where the error occurred. However, non_syntax error messages cannot be explicit about the exact source line that was the cause of the error. Because only the parser can access the source of the statement, the statement is no longer available in source format when the command is actually executed. Exceptions to the calling sequence for the command parser are EXECUTE(" string " ) and cursor operations. Both of these operations can recursively call the parser.

The Optimizer

The optimizer takes the query tree from the command parser and prepares it for execution. This module compiles an entire command batch, optimizes queries, and checks security. The query optimization and compilation result in an execution plan.

The first step in producing such a plan is to normalize the query, which potentially breaks down a single query into multiple, fine-grained queries. After the optimizer normalizes the query, it optimizes it, which means that the optimizer determines a plan for executing that query. Query optimization is cost-based; the optimizer chooses the plan that it determines would cost the least based on internal metrics that include estimated memory requirements, estimated CPU utilization, and the estimated number of required I/Os. It considers the type of statement requested, checks the amount of data in the various tables affected, looks at the indexes available for each table, and then looks at a sampling of the data values kept for each index or column referenced in the query. The sampling of the data values is called statistics and will be discussed in detail in Chapter 14. Based on the available information, the optimizer considers the various access methods and join strategies it could use to resolve a query and chooses the most cost-effective plan. The optimizer also decides which indexes, if any, should be used for each table in the query, and, in the case of a multitable query, the order in which the tables should be accessed and the join strategy to be used.

The optimizer also uses pruning heuristics to ensure that more time isn't spent optimizing a query than it would take to simply choose a plan and execute it. The optimizer does not necessarily do exhaustive optimization. Some products consider every possible plan and then choose the most cost-effective one. The advantage of this exhaustive optimization is that the syntax chosen for a query would theoretically never cause a performance difference, no matter what syntax the user employed. But if you deal with an involved query, it could take much longer to estimate the cost of every conceivable plan than it would to accept a good plan, even if not the best one, and execute it. For example, in one product review, SQL Server (and some other products) consistently executed one complex eight-table join faster than a product whose optimizer produced the same "ideal" execution plan each time, even though SQL Server's execution plan varied somewhat. This was a case in which a pruning technique produced faster results than pure exhaustive optimization. In general, though, you will typically get the same execution plan no matter what equivalent syntax you use to specify the query. Some products have no cost-based optimizer and rely purely on rules to determine the query execution plan. In such cases, the syntax of the query is vitally important. (For example, the execution would start with the first table in the FROM clause.) Such products sometimes claim to have a "rule-based optimizer." This might simply be a euphemism for "no optimizer" ” any optimization was done by the person who wrote the query.

The SQL Server optimizer is cost-based, and with every release it has become " smarter " to handle more special cases and to add more query processing and access method choices. However, by definition, the optimizer relies on probability in choosing its query plan, so sometimes it will be wrong. (Even a 90 percent chance of choosing correctly means that something will be wrong 1 in 10 times.) Recognizing that the optimizer will never be perfect, you can use SQL Server's query hints to direct the optimizer to use a certain index ” for example, to force the optimizer to follow a specific sequence while working with the tables involved, or to use a particular join strategy.

After normalization and optimization are completed, the normalized tree produced by those processes is compiled into the execution plan, which is actually a data structure. Each command included in it specifies exactly which table will be affected, which indexes will be used (if any), which security checks must be made, and which criteria (such as equality to a specified value) must evaluate to TRUE for selection. This execution plan might be considerably more complex than is immediately apparent. In addition to the actual commands, the execution plan includes all the steps necessary to ensure that constraints are checked. (Steps for calling a trigger are a bit different than those for verifying constraints. If a trigger is included for the action being taken, a call to the procedure that comprises the trigger is appended. A trigger has its own plan that is branched to just before the commit. The specific steps for the trigger are not compiled into the execution plan, like those for constraint verification.)

A simple request to insert one row into a table with multiple constraints can result in an execution plan that requires many other tables to also be accessed or expressions to be evaluated. The existence of a trigger can also cause many additional steps to be executed. The step that carries out the actual INSERT statement might be just a small part of the total execution plan necessary to ensure that all actions and constraints associated with adding a row are carried out.

The SQL Manager

The SQL manager is responsible for everything having to do with managing stored procedures and their plans. It determines when a stored procedure needs recompilation based on changes in the underlying objects' schemas, and it manages the caching of procedure plans so that they can be reused by other processes.

The SQL manager also handles autoparameterization of queries. In SQL Server 7, certain kinds of ad hoc queries are treated as if they were parameterized stored procedures, and query plans are generated and saved for them. This can happen if a query uses a simple equality comparison against a constant, as in the following statement:

 SELECT * FROM pubs.dbo.titles  WHERE type = 'business'

This query can be parameterized as if it were a stored procedure with a parameter for the value of type :

 SELECT * FROM pubs.dbo.titles  WHERE type = @param

A subsequent query, differing only in the actual value used for the value of type , can use the same query plan that was generated for the original query.

The Expression Manager

The expression manager handles computation, comparison, and data movement. Suppose your query contains an expression like this one:

 SELECT @myqty = qty * 10 FROM mytable V

The expression service copies the value of qty from the rowset returned by the storage engine, multiplies it by 10, and stores the result in @myqty .

The Query Executor

The query executor runs the execution plan that was produced by the optimizer, acting as a dispatcher for all the commands in the execution plan. This module loops through each command step of the execution plan until the batch is complete. Most of the commands require interaction with the storage engine to modify or retrieve data and to manage transactions and locking.

Communication Between the Relational Engine and the Storage Engine

The relational engine uses OLE DB for most of its communication with the storage engine. The following description of that communication is adapted from the section titled "Database Server" in the SQL Server Books Online. It describes how a SELECT statement that processes data from local tables only is processed :

The relational engine compiles the SELECT statement into an optimized execution plan. The execution plan defines a series of operations against simple rowsets from the individual tables or indexes referenced in the SELECT statement. ( Rowset is the OLE DB term for a result set.) The rowsets requested by the relational engine return the amount of data needed from a table or index to perform one of the operations used to build the SELECT result set. For example, this SELECT statement requires a table scan if it references a table with no indexes:
SELECT * FROM ScanTableV
The relational engine implements the table scan by requesting one rowset containing all the rows from ScanTable . This next SELECT statement needs only information available in an index:
SELECT DISTINCT LastName FROM Northwind.dbo.Employees
The relational engine implements the index scan by requesting one rowset containing the leaf rows from the index that was built on the LastName column. The following SELECT statement needs information from two indexes:
SELECT CompanyName, OrderID, ShippedDate FROM Northwind.dbo.Customers AS Cst JOIN Northwind.dbo.Orders AS Ord ON (Cst.CustomerID = Ord.CustomerID)
The relational engine requests two rowsets: one for the clustered index on Customers and the other for one of the nonclustered indexes on Orders .
The relational engine then uses the OLE DB API to request that the storage engine open the rowsets.
As the relational engine works through the steps of the execution plan and needs data, it uses OLE DB to fetch the individual rows from the rowsets it requested the storage engine to open. The storage engine transfers the data from the data buffers to the relational engine.
The relational engine combines the data from the storage engine rowsets into the final result set transmitted back to the user.

Not all communication between the relational engine and the storage engine uses OLE DB. Some commands cannot be expressed in terms of OLE DB rowsets. The most obvious and common example is when the relational engine processes data definition language (DDL) requests to create a table or other SQL Server object.

The Access Methods Manager

When SQL Server needs to locate data, it calls the access methods manager. The access methods manager sets up and requests scans of data pages and index pages and prepares the OLE DB rowsets to return to the relational engine. It contains services to open a table, retrieve qualified data, and update data. The access methods manager doesn't actually retrieve the pages; it makes the request of the Buffer Manager, which ultimately serves up the page already in its cache or reads it to cache from disk. When the scan is started, a look-ahead mechanism qualifies the rows or index entries on a page. Retrieving rows that meet specified criteria is known as a qualified retrieval . The access methods manager is employed not only for queries (selects) but also for qualified updates and deletes (for example, UPDATE with a WHERE clause).

A session opens a table, requests and evaluates a range of rows against the conditions in the WHERE clause, and then closes the table. A session descriptor data structure (SDES) keeps track of the current row and the search conditions for the object being operated on (which is identified by the object descriptor data structure, or DES).

The Row Operations Manager and the Index Manager

The row operations manager and the index manager can be considered components of the access methods manager because they carry out the actual method of access. Each is responsible for manipulating and maintaining its respective on-disk data structures, namely rows of data or B-tree indexes. They understand and manipulate information on data and index pages.

The Row Operations Manager

The row operations manager retrieves, modifies, and performs operations on individual rows. (For more information on on-disk structures, see Chapter 5; for more on data page and row format, see Chapter 6.) It performs an operation within a row, such as "retrieve column 2" or "write this value to column 3." As a result of the work performed by the access methods manager, lock manager, and transaction manager, the row will have been found and will be appropriately locked and part of a transaction. After formatting or modifying a row in memory, the row operations manager inserts or deletes a row.

The row operations manager also handles updates. SQL Server 7 offers three methods for handling updates. All three are direct, which means that there is no need for two passes through the transaction log, as was the case with deferred updates in earlier versions of SQL Server. SQL Server 7 has no concept of a deferred data modification operation.

The three update modes in SQL Server 7 are:

In-place mode This mode is used to update a heap or clustered index when none of the clustering keys change. The update can be done in place, and the new data is written to the same slot on the data page.
Split mode This mode is used to update nonunique indexes when the index keys change. The update is split into two operations ”a delete followed by an insert ”and these operations are performed independently of each other.
Split with collapse mode This mode is used to update a unique index when the index keys change. After the update is rewritten into delete and insert operations, if the same index key is both deleted and then reinserted with a new value, it is " collapsed " into a single update operation.

In Chapter 8, we'll take a look at some actual examples of each of these types of updates.

If you want to reorganize a table ” for example, to reestablish a FILLFACTOR value or to make data more contiguous after a lot of data modification has occurred ” you can use a clustered index, which makes the reorganization easy. You simply rebuild the clustered index, which rebuilds the entire table. In the case of a delete, if the row deleted is the last row on a data page, that page is deallocated. (The only exception occurs if that page is the only one remaining in the table. A table always contains at least one page, even if it is empty.)

The Index Manager

The index manager maintains and supports searches on B-trees, which are used for SQL Server indexes. An index is structured as a tree , with a root page and intermediate- and lower-level pages (or branches). A B-tree groups records that have similar index keys, thereby allowing fast access to data by searching on a key value. The B-tree's core feature is its ability to balance the index tree. ( B stands for balanced .) Branches of the index tree are spliced together or split apart as necessary so that the search for any given record always traverses the same number of levels and thus requires the same number of page accesses .

The traverse begins at the root page, progresses to intermediate index levels, and finally moves to bottom-level pages called leaf pages . The index is used to find the correct leaf page. On a qualified retrieval or delete, the correct leaf page is the lowest page of the tree at which one or more rows with the specified key or keys reside. SQL Server supports both clustered and nonclustered indexes. In a nonclustered index, shown in Figure 3-3, the leaf level of the tree (the leaf page of the index) contains every key value in the index along with a row locator for each key value. The row locator is also called a bookmark and indicates where to find the referenced data. A row locator can have one of two forms. If the base table has no clustered index, the table is referred to as a heap . The row locators in nonclustered index leaf pages for a heap are pointers to the actual rows in which the data can be found, and these pointers consist of a Row ID (RID), which is a file number, a page number, and a row number on the page. If the base table has a clustered index, the row locators in any nonclustered index leaf pages contain the clustered index key value for the row.

After reaching the leaf level in a nonclustered index, you can find the exact location of the data, although the page on which that data resides must still be separately retrieved. Because you can access the data directly, you don't need to scan all the data pages for a qualifying row. Better yet, in a clustered index, shown in Figure 3-4, the leaf level actually contains the data row, not simply the index key. A clustered index keeps the data in a table physically ordered around the key of the clustered index, and the leaf page of a clustered index is in fact the data page itself.

Because data can be physically ordered in only one way, only one clustered index can exist per table. This makes the selection of the appropriate key value on which to cluster data an important performance consideration.

You can also use indexes to ensure the uniqueness of a particular key value. In fact, the PRIMARY KEY and UNIQUE constraints on a column work by creating a unique index on the column's values. The optimizer can use the knowledge that an index is unique in formulating an effective query plan.

Internally, SQL Server always ensures that clustered indexes are unique by adding a 4-byte "uniqueifier" to clustered index key values that occur more than once. This uniqueifier becomes part of the key and is used in all levels of the clustered index and in references to the clustered index key through all nonclustered indexes.

Since SQL Server maintains ordering in index leaf levels, you do not need to unload and reload data to maintain clustering properties as data is added and moved. SQL Server will always insert rows into the correct page in clustered sequence. For a clustered index, the correct leaf page is the data page in which a row will be inserted. For a nonclustered index, the correct leaf page is the one into which SQL Server inserts a row containing the key value (and data row locator) for the newly inserted row. If data is updated and the key values of an index change, or if the row is moved to a different page, SQL Server's transaction control ensures that all affected indexes are modified to reflect these changes. Under transaction control, index operations are performed as atomic operations. The operations are logged and fully recovered in the event of a system failure.

click to view at full size.

Figure 3-3. A nonclustered index with the leaf level containing row locators: either a RID or a clustered index key value.

click to view at full size.

Figure 3-4. A clustered index with the data located at the leaf level.

Locking and Index Pages
As we'll see in the section on the lock manager, pages of an index use a slightly different locking mechanism than regular data pages. A lightweight lock called a latch is used to lock upper levels of indexes. Latches are not involved in deadlock detection because SQL Server 7 uses "deadlock-proof" algorithms for index maintenance.

You can customize the locking strategy for indexes on a table basis or index basis. The system stored procedure s p_indexoption lets you enable or disable page or row locks with any particular index or, by specifying a table name, for every index on that table. The setable options are AllowPageLocks and AllowRowLocks . If both of these options are set to false for a particular index, only table level locks are applied.

The Page Manager and the Text Manager

The page manager and the text manager cooperate to manage a collection of pages as named databases. Each database is a collection of 8 KB disk pages, which are spread across one or more physical files. (In Chapter 5, you'll find more details about the physical organization of databases.)

SQL Server uses six types of disk pages: data pages, text/image pages, index pages, Page Free Space (PFS) pages, Global Allocation Map (GAM and SGAM) pages, and Index Allocation Map (IAM) pages. All user data, except for the text and image datatypes, are stored on data pages. The text and image datatypes, which are used for storing large objects (up to 2 GB each of text or binary data), use a separate collection of pages, so the data is not stored on regular data pages with the rest of the rows. A pointer on the regular data page identifies the starting page and offset of the text/image data. Index pages store the B-trees that allow fast access to data. PFS pages keep track of which pages in a database are available to hold new data. Allocation pages (GAMs, SGAMs, and IAMs) keep track of the other pages. They contain no database rows and are used only internally. We'll cover all the types of pages extensively in Chapters 5 and 6.

The page manager allocates and deallocates all types of disk pages, organizing extents of eight pages each. An extent can be either a uniform extent, for which all eight pages are allocated to the same object (table or index), or a mixed extent, which can contain pages from multiple objects. If an object uses less than eight pages, the page manager allocates new pages for that object from mixed extents. When the size of the object exceeds eight pages, the page manager allocates new space for that object in units of entire uniform extents. This optimization prevents the overhead of allocation from being incurred every time a new page is required for a large table; it is incurred only every eighth time. Perhaps most important, this optimization forces data of the same table to be contiguous for the most part. At the same time, the ability to use mixed extents keeps SQL Server from wasting too much space if a database contains many small tables.

To determine how contiguous a table's data is, you use the DBCC SHOWCONTIG command. A table with a lot of allocation and deallocation can get fairly fragmented , and rebuilding the clustered index (which also rebuilds the table) can improve performance, especially when a table is accessed frequently using table scans.

The Transaction Manager

A core feature of SQL Server is its ability to ensure that transactions follow the ACID properties (discussed in Chapter 2). Transactions must be atomic ” that is, all or nothing. If a transaction has been committed, it must be recoverable by SQL Server no matter what ” even if a total system failure occurs one millisecond after the commit was acknowledged. In SQL Server, if work was in progress and a system failure occurred before the transaction was committed, all the work is rolled back to the state that existed before the transaction began . Write-ahead logging makes it possible to always roll back work in progress or roll forward committed work that has not yet been applied to the data pages. Write-ahead logging assures that a transaction's changes ”the "before and after" images of data ” are captured on disk in the transaction log before a transaction is acknowledged as committed. Writes to the transaction log are always synchronous ” that is, SQL Server must wait for them to complete. Writes to the data pages can be asynchronous because all the effects can be reconstructed from the log if necessary. The transaction manager coordinates logging, recovery, and buffer management. These topics are discussed later in this chapter; at this point, we'll just look at transactions themselves .

The transaction manager delineates the boundaries of statements that must be grouped together to form an operation. It handles transactions that cross databases within the same SQL Server, and it allows nested transaction sequences. (However, nested transactions simply execute in the context of the first-level transaction; no special action occurs when they are committed. And a rollback specified in a lower level of a nested transaction undoes the entire transaction.) For a distributed transaction to another SQL Server (or to any other resource manager), the transaction manager coordinates with the Microsoft Distributed Transaction Coordinator (MS DTC) service using operating system remote procedure calls. The transaction manager marks savepoints, which let you designate points within a transaction at which work can be partially rolled back or undone.

The transaction manager also coordinates with the lock manager regarding when locks can be released, based on the isolation level in effect. The isolation level in which your transaction runs determines how sensitive your application is to changes made by others and consequently how long your transaction must hold locks to protect against changes made by others. Four isolation-level semantics are available in SQL Server 7: Uncommitted Read (also called "dirty read"), Committed Read, Repeatable Read, and Serializable.

Your transactions' behavior depends on the isolation level. We'll look at these levels now, but a complete understanding of isolation levels also requires an understanding of locking because the topics are so closely related . The next section gives an overview of locking; more detailed information can be found in Chapter 13.

Uncommitted Read

Uncommitted Read, or dirty read (be careful ” do not confuse "dirty read" with "dirty page," which we'll discuss later) lets your transaction read any data that is currently on a data page, whether or not that data has been committed. For example, another user might have a transaction in progress that has updated data, and even though it is holding exclusive locks on the data, your transaction can read it anyway. The other user might then decide to roll back his or her transaction, so logically those changes were never made. If the system is a single-user system and everyone is queued up to access it, the changes would never have been visible to other users. In a multiuser system, however, you read the changes and possibly took action on them. Although this scenario isn't desirable, with Uncommitted Read you won't get stuck waiting for a lock, nor will your reads issue share locks (described below) that might affect others.

When using Uncommitted Read, you give up assurances of having strongly consistent data so that you can have the benefit of high concurrency in the system without users locking each other out. So when should you choose Uncommitted Read? Clearly, you don't want to choose it for financial transactions in which every number must balance. But it might be fine for certain decision-support analyses ” for example, when you look at sales trends ” for which complete precision is not necessary and the tradeoff in higher concurrency makes it worthwhile.

Committed Read

Committed Read is SQL Server's default isolation level. It ensures that an operation will never read data that another application has changed but not yet committed. (That is, it will never read data that logically never existed.) With Committed Read, if a transaction is updating data and consequently has exclusive locks on data rows, your transaction must wait for those locks to be released before you can use that data (whether you are reading or modifying). Also, your transaction must put share locks (at a minimum) on the data that will be visited, which means that data is potentially unavailable to others to use. A share lock doesn't prevent others from reading the data, but it makes them wait to update the data. Share locks can be released after the data has been sent to the calling client ” they do not have to be held for the duration of the transaction.

NOTE
Although you can never read uncommitted data, if a transaction running with Committed Read isolation subsequently revisits the same data, that data might have changed or new rows might suddenly appear that meet the criteria of the original query. Rows that appear in this way are called phantoms .

Repeatable Read

The Repeatable Read isolation level adds to the properties of Committed Read by ensuring that if a transaction revisits data or if a query is reissued, the data will not have changed. In other words, issuing the same query twice within a transaction will not pick up any changes to data values made by another user's transaction.

Preventing nonrepeatable reads from appearing is a desirable safeguard. But there's no free lunch . The cost of this extra safeguard is that all the shared locks in a transaction must be held until the completion (COMMIT or ROLLBACK) of the transaction. (Exclusive locks must always be held until the end of a transaction, no matter what the isolation level, so that a transaction can be rolled back if necessary. If the locks were released sooner, it might be impossible to undo the work.) No other user can modify the data visited by your transaction as long as your transaction is outstanding. Obviously, this can seriously reduce concurrency and degrade performance. If transactions are not kept short or if applications are not written to be aware of such potential lock contention issues, SQL Server can appear to "hang" when it is simply waiting for locks to be released.

NOTE
You can control how long SQL Server waits for a lock to be released by using the session option LOCK_TIMEOUT. We'll discuss this option, as well as many other ways of customizing locking behavior, in Chapter 14.

Serializable

The Serializable isolation level adds to the properties of Repeatable Read by ensuring that if a query is reissued, rows will not have been added in the interim. In other words, phantoms will not appear if the same query is issued twice within a transaction. More precisely, Repeatable Read and Serializable affect sensitivity to another connection's changes, whether or not the user ID of the other connection is the same. Every connection within SQL Server has its own transaction and lock space. We use the term "user" loosely so as to not obscure the central concept.

Preventing phantoms from appearing is another desirable safeguard. But once again, there's no free lunch. The cost of this extra safeguard is similar to that of Repeatable Read ” all the shared locks in a transaction must be held until completion of the transaction. In addition, enforcing the Serializable isolation level requires that you not only lock data that has been read, but also lock data that does not exist! For example, suppose that within a transaction we issue a SELECT statement to read all the customers whose zip code is between 98000 and 98100, and on first execution no rows satisfy that condition. To enforce the Serializable isolation level, we must lock that "range" of potential rows with zip codes between 98000 and 98100 so that if the same query is reissued, there will still be no rows that satisfy the condition. SQL Server handles this by using a special kind of lock called a range lock, which we'll discuss in Chapter 13. The Serializable level gets its name from the fact that running multiple serializable transactions at the same time is the equivalent of running them one at a time ” that is, serially ” without regard to sequence. For example, transactions A, B, and C are serializable only if the result obtained by running all three simultaneously is the same as if they were run one at a time, in any order. Serializable does not imply a known order in which the transactions are to be run. The order is considered a chance event. Even on a single-user system, the order of transactions hitting the queue would be essentially random. If the batch order is important to your application, you should implement it as a pure batch system.

SEE ALSO
For an interesting critique of the formal definitions of the ANSI isolation levels, see the companion CD, which contains a technical report called "A Critique of ANSI SQL Isolation Levels," published by the Microsoft Research Center.

The tough part of transaction management, of course, is dealing with rollback/rollforward and recovery operations. We'll return to the topic of transaction management and recovery a bit later. But first we'll further discuss locking and logging.

The Lock Manager

Locking is a crucial function of a multiuser database system such as SQL Server. Recall from Chapter 2 that SQL Server lets you manage multiple users simultaneously and ensures that the transactions observe the properties of the chosen isolation level. At the highest level, Serializable, SQL Server must make the multiuser system perform like a single-user system ” as though every user is queued up to use the system alone with no other user activity. Locking guards data and the internal resources that make this possible, and it allows many users to simultaneously access the database and not be severely affected by others' use.

The lock manager acquires and releases various types of locks, such as shared read locks, exclusive locks for writing, intent locks to signal a potential "plan" to perform some operation, extent locks for space allocation, and so on. It manages compatibility between the lock types, resolves deadlocks, and escalates locks if needed. The lock manager controls table, page, and row locks as well as system data locks. (System data, such as page headers and indexes, is private to the database system.)

The lock manager provides two separate locking systems. The first enables row locks, page locks, and table locks for all fully shared data tables, data pages and rows, text pages, and leaf-level index pages and index rows. The second locking system is used internally only for restricted system data; it protects root and intermediate index pages while indexes are being traversed. This internal mechanism uses latches, a lightweight, short-term variation of a lock for protecting data that does not need to be locked for the duration of a transaction. Full-blown locks would slow the system down. In addition to protecting upper levels of indexes, latches protect rows while they are being transferred from the storage engine to the relational engine. If you examine locks by using the sp_lock stored procedure or a similar mechanism that gets its information from the syslockinfo table, you won't see or be aware of latches; you'll see only the locks for fully shared data. However, some counters are available in Performance Monitor to monitor latch requests, acquisitions, and releases. These will be discussed in Chapter 15.

Locking is an important aspect of SQL Server. Many developers are keenly interested in locking because of its potential effect on application performance. Chapter 13 is devoted to the subject, so we won't go into it further here.

Other Managers

Also included in the storage engine are managers for controlling utilities such as bulk load, DBCC commands, backup and restore operations, and the Virtual Device Interface (VDI). VDI allows ISVs to write their own backup and restore utilities and to access the SQL Server data structures directly, without going through the relational engine. There is a manager to control sorting operations and one to physically manage the files and backup devices on disk.