SQL Server 2005 Storage Engine


The storage engine is another important aspect of SQL Server 2005 for developers to master. SQL Server 2005 ultimately reads and writes data to disk, and the storage engine is the component responsible for managing those reads and writes across all databases in the SQL Server instance. The storage engine is designed to be "invisible": users are generally unaware of its existence, except, of course, when something goes wrong. The storage engine is responsible for the following tasks:

  • Managing physical data and log files on the disk subsystem

  • Reading data from database storage files into the buffer cache

  • Writing data from the buffer cache to disk

  • Tracking atomicity of all transactions

  • Writing transaction log data to the log file

  • Ensuring the data is in a consistent state when the instance starts

  • Performing database backup operations when instructed

  • Performing database restore operations when instructed

  • Miscellaneous file-management tasks

As you can see, the database storage engine is a very busy component within the SQL Server 2005 architecture. The database storage engine incorporates an industry-standard set of algorithms known as "Algorithms for Recovery and Isolation Exploiting Semantics," or ARIES. ARIES was first developed in the late 1980s by researchers at IBM, and Microsoft later incorporated it into SQL Server. Because the database storage engine is based on such a mature standard, it is very efficient in its operation.

Storage Engine Architecture and Terminology

To understand the vast amount of documentation and information that exists for the database storage engine, database developers should be familiar with the following terms:

  • ACID properties The ACID (Atomicity, Consistency, Isolation, and Durability) properties are a core requirement for SQL Server, and anyone developing applications for SQL Server should understand them. The ACID properties are:

    • Atomicity A transaction must be an atomic unit of work; either all of its data modifications are performed or none of them are performed.

    • Consistency When completed, a transaction must leave all data in a consistent state. In SQL Server 2005, all rules must be applied to the transaction's modifications to maintain data integrity. All internal data structures, such as B-tree indexes or doubly linked lists, must be correct at the end of the transaction.

    • Isolation Modifications made by concurrent transactions must be isolated from modifications made by all other concurrent transactions. A transaction either sees the data in the state it was in before another concurrent transaction modified it, or it sees the data after the second transaction has completed, but it does not see an intermediate state. This is referred to as serializability because it provides the system with the capability to reload the starting data and replay a series of transactions to end up with the data in the same state it was in after the original transactions were performed.

    • Durability After a transaction has completed, its effects are permanently in place in the system. The modifications persist even in the event of a system failure.

  • Point in Time Point in time generally refers to the ability to restore a database to a specified moment in time, typically after a hardware or software failure.

  • Stable Media Stable media is often confused with physical storage. SQL Server defines stable media as storage that can survive system restart or common failure. Many high-end disk subsystems provide high-speed cache facilities to reduce the latency of read and write operations. This cache is often supported by a battery-powered backup facility. The battery backup provides the necessary power to maintain the data in the cache for several days, but implementations vary by manufacturer. Manufacturers can switch out batteries to increase the life of the cache when necessary.

    The key is that after the system problem has been corrected, the pending writes in the cache are treated as if the failure or restart never occurred. Most manufacturers' implementations immediately flush pending writes to physical disk during the restart operations.

  • Write Ordering Write ordering (or write dependency) is the capability of the I/O subsystem to preserve the order of I/O operations. As described previously, stable media can include caching, but at any given point in time, the stable media must reflect the order in which the I/O operations were issued.

    The order of the I/O operations associated with SQL Server must be maintained. The system must maintain write ordering, or it breaks the WAL protocol described in this appendix. (The log records must be written out in correct order and the log records must always be written to stable media before the data pages that the log records represent are written.) After a transaction log record is successfully flushed, the associated data page can then be flushed as well. If the subsystem permits the data page to reach stable media before the log record does, data integrity is breached.

    For example, if a computer running SQL Server had to restart after the data page had reached stable media but before the log record, the database recovery might fail. Because the log record for a page modification does not exist, the recovery process cannot determine the proper transactional state of the page. Worse, the log record did not get flushed to stable media, so the recovery process is not aware that the page requires rollback and it cannot attempt to correct the problem, thereby leaving the database in an unknown state.

  • Multichannel and Load-Balancing Systems Many high-end storage environments, such as Storage Area Networks, implement load-balanced systems that can have multiple channels to support I/O requests. These systems must maintain support for I/O ordering. Many of these systems support I/O ordering with a stable media cache and subsequently combine and/or split I/O requests across available subsystem resources to complete the storing to physical media.

  • Torn I/O or Pages Torn I/O is often referred to as a torn page in SQL Server documentation. A torn I/O occurs when a partial write takes place, leaving the data in an invalid state. SQL Server data pages are 8 KB in size. A torn data page for SQL Server occurs when only a portion of the 8 KB is correctly written to or retrieved from stable media.

    SQL Server always checks I/O completion status for any operating system error conditions and proper data transfer size and then handles errors appropriately. Torn pages generally arise after system outages where the subsystem does not complete the entire 8-KB I/O request.

    Disk drive manufacturers generally limit data transfer guarantees to sector boundaries of 512 bytes, so if the stable media is a physical disk and the controller does not contain a cache with battery backup capabilities, the I/O request is limited to the final spin/transfer rate of the physical drive. Therefore, if an I/O is issued to write 8 KB (for a total of sixteen 512-byte sectors), but only the first three sectors make it to stable media, then the page becomes torn, which results in data corruption. A subsequent read of the 8-KB page would bring in 3 sectors of the new version of the page and 13 sectors of an older version.

    SQL Server can enable torn page detection on a per-database basis. A portion of the first 512-byte sector of the page contains the page header. When torn page detection is enabled, the header member contains information about each of the other 512-byte sectors or segments of the 8-KB page. When the page is read in, the torn page information is used to detect a torn page situation.

    Torn page detection incurs minimal overhead and is a recommended practice for SQL Server installations.
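
    To make the per-sector check concrete, the following C sketch stamps a per-write token into the last byte of each 512-byte sector and records the expected tokens in the page header; a mismatch on read indicates a partial write. This is a conceptual illustration only; the layout, token format, and function names are hypothetical and are not SQL Server's actual on-disk page format.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE   8192
#define SECTOR_SIZE 512
#define SECTORS     (PAGE_SIZE / SECTOR_SIZE)      /* 16 sectors per page */

/* Hypothetical layout: the header records the token each sector should
   carry, and the last byte of every sector carries a copy of that token. */
typedef struct {
    uint8_t expected[SECTORS];   /* token each sector should carry */
} page_header_t;

static void stamp_page(uint8_t page[PAGE_SIZE], uint8_t token)
{
    page_header_t *hdr = (page_header_t *)page;
    for (int s = 0; s < SECTORS; s++) {
        hdr->expected[s] = token;
        page[(s + 1) * SECTOR_SIZE - 1] = token;   /* last byte of sector s */
    }
}

static int page_is_torn(const uint8_t page[PAGE_SIZE])
{
    const page_header_t *hdr = (const page_header_t *)page;
    for (int s = 0; s < SECTORS; s++)
        if (page[(s + 1) * SECTOR_SIZE - 1] != hdr->expected[s])
            return 1;                              /* sector from an older write */
    return 0;
}

int main(void)
{
    static uint8_t disk_page[PAGE_SIZE], new_version[PAGE_SIZE];

    stamp_page(disk_page, 0x01);                   /* version 1 fully written */
    stamp_page(new_version, 0x02);                 /* version 2 prepared      */

    /* Simulate a torn write: only the first 3 sectors of version 2 land. */
    memcpy(disk_page, new_version, 3 * SECTOR_SIZE);

    printf("torn page detected: %d\n", page_is_torn(disk_page));   /* 1 */
    return 0;
}
```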

  • Log Parity Hardware manufacturers guarantee sector-size writes, so SQL Server transaction log files are always written with sector-size alignment. Each sector of the transaction log contains a parity flag. This flag can be used to determine the last sector that was correctly written.

    During a recovery operation, the log file(s) are scanned for the final sector that was written; the log records can then be used to return the database to the appropriate transactional state.

  • Mirroring and Remote Mirroring Mirroring is a common data redundancy and emergency recovery practice. Mirroring can be implemented at a software or hardware level. Mirroring installations have historically been physical implementations of locally attached hardware cabinets. Recent advances in remote mirroring technologies have made it possible to maintain mirrors across long distances.

    Several types of mirroring implementations are on the market. Some implementations are cache-based; others ensure that the I/O takes place at all mirrored locations before the I/O request is considered complete. Whatever the implementation, write ordering must be maintained.

    SQL Server considers a mirror to be a stable-media, point-in-time copy of the primary data. Point in time is an important aspect of this. Strict maintenance of the WAL requirements must occur on the mirrored subsystem to maintain the ACID properties of the data. The mirrored subsystem must mimic the exact points in time as experienced in the primary data.

    For example, many high-end installations contain multiple I/O storage devices. If database log files are placed on one mirror set and data files on another mirror set, write ordering cannot be directly maintained across the separate hardware components. Without extended capabilities, the log and data page write order on the mirror devices cannot be maintained with point-in-time capabilities. Extended mirror capabilities are needed to ensure that write ordering can be maintained across multiple physical mirrored devices. These are often referred to as Mirror Groups or Consistency Groups.

  • Forced Unit Access (FUA) Forced Unit Access (FUA) occurs when a file is opened (by using CreateFile) with the FILE_FLAG_WRITETHROUGH flag. SQL Server opens all database and log files with this flag.

    The flag indicates that on any write request, the FUA bit should be sent with the request to the subsystem. This bit indicates to the subsystem that data must reach stable media before the I/O is to be considered complete and the operating system signaled that the I/O is complete. No intermediate cache should be used that is not considered stable media. In other words, the write should go directly to the stable media; this process is called writethrough.

    This prevents problems that occur when a cache (such as a disk cache that is not battery backed up) accepts the I/O and informs SQL Server that the I/O is complete, when in fact it has not yet been stored to stable media. Without this capability, SQL Server could not rely on the system to support WAL protocol needs.

    Important 

    How FUA requests are handled

    Integrated Drive Electronics (IDE) disk specifications and implementations do not have clear standards for how the FUA request is handled. Small Computer System Interface (SCSI) disk specifications and implementations use the FUA request to disable physical disk caches and other caching mechanisms. In many IDE implementations, the FUA request is simply discarded by the IDE hardware, thereby making this type of subsystem unsafe to use with SQL Server or with any other product that relies on the FUA behavior. Because of the need to honor the FUA setting, some IDE drive manufacturers have utilities that permit IDE drive caches to be disabled, making them safe for use with SQL Server.
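
    A minimal C sketch of opening a file for write-through access follows. Only the use of CreateFile with FILE_FLAG_WRITETHROUGH comes from the text; the file path is hypothetical, and FILE_FLAG_OVERLAPPED is added here only to illustrate the posted (asynchronous) I/O pattern discussed later in this appendix.

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* FILE_FLAG_WRITETHROUGH asks the subsystem to honor FUA semantics:
       a write is not acknowledged from a volatile intermediate cache. */
    HANDLE h = CreateFileA(
        "C:\\Data\\example.mdf",          /* hypothetical data file path   */
        GENERIC_READ | GENERIC_WRITE,
        0,                                /* no sharing                    */
        NULL,                             /* default security              */
        OPEN_EXISTING,
        FILE_FLAG_WRITETHROUGH | FILE_FLAG_OVERLAPPED,
        NULL);

    if (h == INVALID_HANDLE_VALUE) {
        printf("CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    /* ... issue posted writes with OVERLAPPED structures here ... */

    CloseHandle(h);
    return 0;
}
```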

  • Data Pages The SQL Server database page size is 8 KB. Each page contains a header with fields such as Page Number, Object Id, LSN, Index Id, Torn bits, and Types. The actual row data is located on the remaining portion of the page. The internal database structures track the allocation state of the data pages in the database.

  • Page Number A page number is a value from 0 through ((Max File Size/8 KB)-1). The page number multiplied by 8 KB provides the offset in the file to the first byte in the page.

    When a page is read from disk, the page number is immediately checked to ensure that the proper offset was returned (the page number in the header is compared to the expected page number). If the value isn't as expected, SQL Server will generate Error 823.
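
    The offset arithmetic and the header check described above can be sketched in a few lines of C. The function names are illustrative; only the 8-KB page size, the page-number-times-8-KB offset rule, and the Error 823 behavior come from the text.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 8192ULL

/* Byte offset of a page within its data file: page number * 8 KB. */
static uint64_t page_offset(uint64_t page_number)
{
    return page_number * PAGE_SIZE;
}

/* Conceptual post-read check: the page number stored in the header must
   match the page that was requested, or the read is treated as a failure
   (the situation in which SQL Server reports Error 823). */
static int page_number_matches(uint64_t expected, uint64_t found_in_header)
{
    return expected == found_in_header;
}

int main(void)
{
    uint64_t page = 100;
    printf("page %llu starts at byte offset %llu\n",
           (unsigned long long)page,
           (unsigned long long)page_offset(page));        /* 819200 */

    if (!page_number_matches(100, 257))
        printf("header mismatch: this is an Error 823 condition\n");
    return 0;
}
```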

  • Object ID This is the ID of the object to which the page is assigned within the schema of the database. A page can be assigned to only a single object. When the page is read from disk, the object ID is checked on the page. If the object ID does not match the expected object ID, SQL Server will generate Error 605.

  • Extents SQL Server generally allocates space an extent at a time. An extent is eight 8-KB pages, or 64 KB. SQL Server often performs reads in extent-sized (64 KB or 128 KB) boundaries as well.

  • Buffer Pool The buffer pool is also referred to as the BPool. The BPool consumes the majority of the user mode address space, leaving relatively little virtual address range free for thread stacks, dynamic-link libraries (DLLs), and other activities. The buffer pool is reserved in large chunks, but the working page size of 8 KB is maintained to match the database page size.

  • Hardware Read Cache The hardware read cache is commonly a read-ahead cache used by the controllers. Depending on the size of the available cache, the read-ahead cache is used to retrieve more data than the actual read request might have requested.

    The hardware read cache and read-ahead cache will be helpful to an application whose data is commonly of a contiguous nature and is retrieved in a reasonably contiguous manner, such as OnLine Analytical Processing (OLAP) scans or a reporting application.

    Because the hardware read cache removes some of the cache memory that could be used to support write requests, it can have a negative effect on OnLine Transaction Processing (OLTP) installations that require that data be written at high rates.

    Important 

    Read-ahead and size constraints

    Some controllers will not do read-ahead if the size of the read request is larger than 16 KB. If the primary application is SQL Server, the hardware read-ahead activity does not provide any benefit because I/O read requests can occur for sizes larger than 16 KB. Check with your hardware vendor for complete details and recommendations when running SQL Server.

  • Hardware Write Cache The hardware write cache not only caches write requests, but also fulfills read requests if the data is still available in the hardware write cache. This is a commonly used I/O caching mechanism.

    Hardware write cache capabilities can be critical in maintaining OLTP performance goals. With the proper battery backup and algorithms, the hardware write cache can secure the data safely (on stable media) as well as increase the speed of SQL Server by hiding much of the true physical I/O time.

  • Write-Ahead Logging (WAL) Protocol A key to ACID property maintenance is the WAL protocol. The WAL protocol requires that all transaction log records associated with a particular data page be flushed to stable media before the data page can be flushed to stable media.

    SQL Server 2005 uses 8-KB data pages and sector-aligned transaction log buffers.

  • Log Sequence Number The log sequence number (LSN) value is a three-part, uniquely incrementing value. It is used for maintaining the sequence of the transaction log records in the database. This enables SQL Server to maintain the ACID properties and to perform appropriate recovery actions.

    When a modification occurs, a log record is generated with a new LSN value. The same LSN value is stored in the data page header so the last log record and matching data page can be associated with each other.
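
    Building on the WAL and LSN entries, the following C sketch treats an LSN as a three-part, always-increasing value and applies the WAL rule: a dirty page may be flushed only after the log has been hardened up to the LSN recorded in the page header. The field names and widths are assumptions made for illustration, not SQL Server's internal LSN layout.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative three-part LSN; the real component names and sizes are an
   internal detail and are assumed here for the sketch. */
typedef struct {
    uint32_t file_seq;   /* virtual log file sequence number */
    uint32_t block;      /* log block offset                 */
    uint16_t record;     /* record number within the block   */
} lsn_t;

static int lsn_cmp(lsn_t a, lsn_t b)
{
    if (a.file_seq != b.file_seq) return a.file_seq < b.file_seq ? -1 : 1;
    if (a.block    != b.block)    return a.block    < b.block    ? -1 : 1;
    if (a.record   != b.record)   return a.record   < b.record   ? -1 : 1;
    return 0;
}

/* WAL rule: the page's LSN must already be covered by the flushed log. */
static int page_may_be_flushed(lsn_t page_lsn, lsn_t log_flushed_to)
{
    return lsn_cmp(page_lsn, log_flushed_to) <= 0;
}

int main(void)
{
    lsn_t flushed = { 42, 1024, 3 };
    lsn_t page_a  = { 42, 1000, 7 };   /* log already hardened: may flush */
    lsn_t page_b  = { 42, 2048, 1 };   /* log not hardened yet: must wait */

    printf("page A flushable: %d\n", page_may_be_flushed(page_a, flushed)); /* 1 */
    printf("page B flushable: %d\n", page_may_be_flushed(page_b, flushed)); /* 0 */
    return 0;
}
```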

  • Latching SQL Server uses latches to provide data synchronization. A latch is a user-mode reader-writer lock implemented by SQL Server. Each data page in memory has a buffer (BUF) tracking structure. The BUF structure contains status information (Dirty, On LRU, In I/O) as well as a latch structure.

    Locking maintains the appropriate lock activity; latching controls physical access. For example, it is possible for a lock to be held on a page that is not in memory. The latch is only appropriate when the data page is in memory (associated with a BUF).

    SQL Server 2005 utilizes the Win32 APIs WriteFileGather and ReadFileScatter for access to the disk subsystem. The WriteFileGather function gathers data from multiple discontinuous buffers and writes the data to a file. The ReadFileScatter function reads data from a file and distributes the data into multiple discontinuous buffers.

    These APIs enable SQL Server to avoid multiple physical I/O requests. For example, during the checkpoint process, up to sixteen 8-KB pages can be flushed with a single WriteFileGather invocation. Previous versions of SQL Server had to issue an I/O request for each data page, or would have to sort and buffer a larger request itself, leading to possible performance problems on busy systems.

    Important 

    Scatter-Gather I/O

    Scatter-gather capabilities are hardware specific. When the hardware does not support scatter-gather capabilities, the operating system must intervene and issue separate I/O requests. To maximize SQL Server I/O performance, ensure that your subsystem natively supports scatter-gather I/O operations.
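
    As a rough illustration of the scatter-gather usage described above, the following C sketch posts one gathered write of several in-memory buffers with WriteFileGather. It assumes a hypothetical caller supplies a handle opened with FILE_FLAG_NO_BUFFERING and FILE_FLAG_OVERLAPPED (which WriteFileGather requires) and buffers that are each one system memory page in size and page-aligned, for example from VirtualAlloc; an 8-KB database page would therefore occupy two consecutive segment elements on a system with 4-KB memory pages.

```c
#include <windows.h>
#include <string.h>

#define MAX_SEGMENTS 32   /* enough for sixteen 8-KB pages with 4-KB memory pages */

/* Post a single gathered write of nBufs one-page buffers to one contiguous
   file range starting at byte offset 'offset'.  Returns TRUE if the write
   completed or was posted asynchronously (ERROR_IO_PENDING). */
BOOL gather_write(HANDLE hFile, ULONGLONG offset, void *bufs[], DWORD nBufs,
                  DWORD sysPageSize, OVERLAPPED *ov)
{
    FILE_SEGMENT_ELEMENT seg[MAX_SEGMENTS + 1];
    DWORD i;

    if (nBufs > MAX_SEGMENTS)
        return FALSE;

    for (i = 0; i < nBufs; i++)
        seg[i].Buffer = PtrToPtr64(bufs[i]);   /* one memory page per element */
    seg[nBufs].Buffer = NULL;                  /* terminate the array         */

    memset(ov, 0, sizeof(*ov));
    ov->Offset     = (DWORD)(offset & 0xFFFFFFFF);
    ov->OffsetHigh = (DWORD)(offset >> 32);

    if (!WriteFileGather(hFile, seg, nBufs * sysPageSize, NULL, ov))
        return GetLastError() == ERROR_IO_PENDING;   /* posted, not yet done */
    return TRUE;
}
```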

  • Lazy Writer The lazy writer in SQL Server 2005 attempts to locate up to 16 unique pages in each sweep to return to the free list. If a page's reference count has dropped to zero, the page is eligible to be returned to the free list. If the page is marked dirty, the associated log records and page data are first flushed to disk.

    Thus, it is possible for the lazy writer to flush 16 * 16 pages to disk in a single sweep. This is efficient because many of the pages remain in the SQL Server buffer pool but are now in a clean state. The I/O is done in the background from the primary SPID. When the lazy writer needs additional buffers for the free list, buffers that are already clean do not need to be flushed to disk and can simply be returned to the free list.

  • Checkpoint The SQL Server 2005 checkpoint process periodically sweeps the buffer pool for buffers that contain pages from a specified database and flushes all dirty buffers to stable media. This makes recovery quicker because roll-forward operations have less physical work to complete.

    As described earlier, the checkpoint process uses the same I/O approach to post up to 16 pages in a single I/O. Because the I/O is posted (OVERLAPPED), checkpoint does not wait for each I/O request to complete immediately. Checkpoint continues to track posted and completed I/Os, but attempts to maintain a high level of outstanding I/Os (for example, 100 write requests continuously outstanding). This maximizes I/O throughput and reduces checkpoint runtime.

    Before the introduction of WriteFileGather, SQL Server sorted the buffers for a specified database in page order and issued I/O requests in page order. This required many physical I/O requests because the pages being flushed do not occupy contiguous memory locations. However, issuing the I/O in page order often kept the disk mechanisms working in physically close locations, which helped the I/O requests complete faster.

    With the older design, elevator seeking could be a problem: issuing many I/Os in page order generally produces a similar on-disk order, which creates table "hot spots." Hot spots are not as much of a problem in SQL Server 2005.

    With WriteFileGather, SQL Server can sweep the buffer pool without requiring any physical ordering relationship to the pages on disk. By gathering 128 KB (sixteen 8-KB pages) in a group, SQL Server is able to transfer blocks of data with far fewer physical I/O requests. This allows the checkpoint process to maintain its speed, while the random nature of the I/O requests deters any elevator seeking that could affect other I/O operations.

    All databases, except for tempdb, are checkpointed. Tempdb does not require recovery (it is re-created every time SQL Server starts), so flushing data pages to disk is not optimal for tempdb, and SQL Server avoids doing so.

    Checkpoint protects the system from I/O flooding by serializing checkpoint processes. Only one checkpoint at a time can run to completion. The checkpoint and lazy writer processes also communicate with each other to control I/O queue depths.
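
    The outstanding-I/O throttling that checkpoint performs can be sketched as a simple control loop. Everything below is a simulation: post_next_write and harvest_completions are hypothetical stubs standing in for the real gathered-write posting and OVERLAPPED completion processing, and only the idea of keeping roughly 100 write requests continuously outstanding comes from the text.

```c
#include <stdio.h>

#define TARGET_OUTSTANDING 100    /* illustrative depth from the text      */
#define TOTAL_DIRTY        1000   /* hypothetical number of dirty buffers  */

static int posted, completed;

/* Hypothetical stubs for posting one gathered write and for harvesting at
   most one completed write per call. */
static void post_next_write(void)     { posted++; }
static void harvest_completions(void) { if (completed < posted) completed++; }

int main(void)
{
    /* Keep roughly TARGET_OUTSTANDING writes in flight until every dirty
       buffer for the database has been posted and has completed. */
    while (completed < TOTAL_DIRTY) {
        while (posted < TOTAL_DIRTY && posted - completed < TARGET_OUTSTANDING)
            post_next_write();
        harvest_completions();
    }
    printf("checkpoint posted %d writes; all completed\n", posted);
    return 0;
}
```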

  • Eager Write SQL Server 2005 uses eager writes to post data pages associated with minimally logged or bulk operations. This enables the asynchronous capabilities of the I/O patterns to keep dirty pages flowing to disk without dirtying large and unwanted portions of the buffer pool. Eager writes post I/O operations with the same mechanism used by the lazy writer and checkpoint processes.

    Important 

    Efficient I/O

    Lazy writer, checkpoint, and eager write do not wait for the I/O to immediately complete. They always post the I/O with WriteFileGather with the OVERLAPPED option and continue with other work, checking for I/O completion success at a later point in time. This enables SQL Server to maximize both CPU and I/O resources for the appropriate tasks.

SQL Server Database Engine Physical Operations

Understanding how the SQL Server database engine physically reads and writes data to the disk is important for any database developer who wants to understand how his or her application will perform. It is very easy, however, to get lost in the vast sea of information about ARIES technologies and the intricacies of SQL Server implementation. The most important facet for developers to take away is how data and log pages are flushed to disk.

Flushing Data Pages to Disk

The three primary mechanisms shown below trigger the flush of a data page to disk. However, each mechanism uses the same internal routine in the buffer pool to achieve the transfer.

  • Lazy write (least recently used (LRU) and memory-pressure based)

  • Checkpoint (recovery-interval based)

  • Eager write (nonlogged I/O based)

To efficiently flush writes to disk, WriteFileGather is used. This enables SQL Server to bundle consecutive dirty pages into a single write request. SQL Server performs the following steps to flush a single page:

  1. It latches the page to prevent further modifications.

  2. It ensures that log records up to and including the LSN on the page are flushed to stable media.

  3. It establishes proper entries for the WriteFileGather invocation.

SQL Server performs the following steps to set up another page for flushing and repeats for up to 16 total pages (inclusive of the first page):

  1. SQL Server does a hash lookup for the next contiguous page. For example, if the page to be flushed is page 100, SQL Server searches for page 101 in the buffer hash array.

  2. If it does not locate the page, then the end of contiguous I/O block is established and the I/O is posted.

  3. If it locates the page, it acquires the latch to prevent further modifications while it determines whether the page is dirty.

  4. It checks whether the page is dirty and needs to be written. If it is not, SQL Server releases the latch, considers the end of the contiguous I/O block established, and submits the asynchronous I/O request.

If the page is dirty, SQL Server follows the steps described earlier that detail how it flushes a single page. After the set of pages to be flushed is determined, the WriteFileGather function is invoked to post (Async / OVERLAPPED) the I/O request with the associated callback function to complete the I/O operation.
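
The contiguous-page search in the steps above can be sketched as a simple loop over a hypothetical buffer-pool lookup. The structures and function names are illustrative stand-ins for the real buffer hash array and latching logic; only the behavior of walking forward until a page is missing or clean, stopping at 16 pages, comes from the text.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_GATHER 16

/* Hypothetical in-memory page descriptor and lookup; the real buffer pool
   uses a hash array keyed on the page identity, with latching. */
typedef struct { uint32_t page_no; bool in_memory; bool dirty; } buf_t;

static buf_t *hash_lookup(buf_t *pool, int n, uint32_t page_no)
{
    for (int i = 0; i < n; i++)
        if (pool[i].in_memory && pool[i].page_no == page_no)
            return &pool[i];
    return NULL;
}

/* Starting from a dirty page, collect up to 16 contiguous dirty pages to
   hand to a single gathered write. */
static int collect_contiguous(buf_t *pool, int n, uint32_t start,
                              uint32_t out[MAX_GATHER])
{
    int count = 0;
    for (uint32_t p = start; count < MAX_GATHER; p++) {
        buf_t *b = hash_lookup(pool, n, p);
        if (b == NULL || !b->dirty)
            break;                 /* end of the contiguous I/O block */
        /* A real implementation latches the page here, confirms it is still
           dirty, and ensures the log is flushed up to the page's LSN. */
        out[count++] = p;
    }
    return count;
}

int main(void)
{
    buf_t pool[] = {
        {100, true, true}, {101, true, true}, {102, true, false}, {103, true, true},
    };
    uint32_t run[MAX_GATHER];
    int n = collect_contiguous(pool, 4, 100, run);
    printf("gathered %d contiguous dirty pages starting at page 100\n", n); /* 2 */
    return 0;
}
```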

When SQL Server determines that HasOverlappedIoCompleted returns TRUE, GetOverlappedResult is used to gather completion information from the system, and the callback function is invoked. The callback makes an appropriate determination regarding the success of the I/O operation and releases the latch on each page.
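
A minimal sketch of that completion pattern follows, assuming hFile and the OVERLAPPED structure come from an earlier posted WriteFileGather call; the latch-release callback work is represented only by a comment because the real callback is internal to SQL Server.

```c
#include <windows.h>
#include <stdio.h>

/* Poll for completion of a posted write, then harvest the result. */
void complete_when_done(HANDLE hFile, OVERLAPPED *ov)
{
    DWORD bytesWritten = 0;

    while (!HasOverlappedIoCompleted(ov))
        SwitchToThread();              /* real code does other useful work */

    if (GetOverlappedResult(hFile, ov, &bytesWritten, FALSE)) {
        printf("write of %lu bytes completed\n", bytesWritten);
        /* callback work: verify success, release the latch on each page */
    } else {
        printf("write failed: %lu\n", GetLastError());
        /* callback work: surface the I/O error appropriately */
    }
}
```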

Flushing Log Pages to Disk

The process for flushing log records to disk is very similar to the way that data pages are written. The log manager is responsible for all writes for all transaction log records for all databases.

When any worker on the system requests that all log records up to a certain LSN be flushed, the request is queued to the log manager. The worker then waits for the log manager to signal that the I/O has completed successfully. The log manager retrieves the request from the queue, formats it, and then posts the I/O on sector-aligned boundaries.

The I/O is posted with WriteFile using the OVERLAPPED (async) mechanisms. The log manager can then return to service other queued requests. When the I/O is completed, the completion routine is run to check the success of the write. If the write is successful, the waiting workers can be signaled to continue their operations.

Write ordering is critical at this stage. Because multiple log write requests can be posted for the same transaction log, the LSN order must be maintained.

For example, suppose pages 5, 10, and 20 are modified by separate transactions, in that order, and the LSN flush requests are put in motion in the same order. If the log records for pages 5 and 20 reach stable media but the log record for page 10 does not, only the flush up to the LSN for page 5 is considered done, and SQL Server might flush only data page 5. The log record for page 10 must reach stable media before page 10, and then page 20, can be considered flushed to their LSNs.
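
The page 5/10/20 example reduces to a small C illustration of the rule that the durable ("flushed-to") LSN advances only across log writes that have completed in order. The integer LSNs and the completion flags are simplifications for illustration, not the log manager's actual bookkeeping.

```c
#include <stdint.h>
#include <stdio.h>

#define PENDING_WRITES 3

int main(void)
{
    /* Three log writes posted in LSN order for pages 5, 10, and 20. */
    uint64_t write_lsn[PENDING_WRITES] = { 5, 10, 20 };
    int      completed[PENDING_WRITES] = { 1, 0, 1 };   /* 5 and 20 are done */

    /* The flushed-to LSN advances only across completed writes with no gap:
       with the write for LSN 10 still outstanding, only 5 is durable. */
    uint64_t flushed_to = 0;
    for (int i = 0; i < PENDING_WRITES && completed[i]; i++)
        flushed_to = write_lsn[i];

    printf("flushed-to LSN: %llu\n", (unsigned long long)flushed_to);   /* 5 */
    return 0;
}
```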

Read-Ahead

SQL Server uses sophisticated algorithms to retrieve data pages that are going to be used in the immediate future. For example, if you run a query that can use an index to determine applicable rows, a read-ahead might occur on the actual data pages that are needed to complete the select list. As index entries are identified, SQL Server can post OVERLAPPED I/O operations for the data pages that will be used in upcoming steps of the query plan. For example, this is how a query using a bookmark lookup operator uses read-ahead. (More information about the bookmark lookup is included later in this appendix.)

This example is just one of many read-ahead situations that SQL Server can use. Enabling the index searching to continue while the data page I/O is in progress maximizes the CPU and I/O of the system. The I/O is often completed by the time it is needed, so other steps in the plan have direct memory access to the needed data and do not have to stall while waiting on I/O. When a read-ahead is posted, it can be from 1 to 1,024 pages. SQL Server limits a single read-ahead request depth to 1,024 pages.

SQL Server uses the following steps to set up read-ahead:

  1. It obtains the requested amount of buffers from the free list.

  2. For each page:

    1. It determines the in-memory status of the page by doing a hash search.

    2. If it's found to be already in memory, SQL Server sets up the read-ahead request to immediately return the buffer to the free list upon I/O completion.

    3. It establishes the proper I/O request information for ReadFileScatter invocation.

    4. It acquires the I/O latch to protect the buffer from further access.

    5. If the page was not found in the hash search, SQL Server inserts it into the hash table.

  3. It issues the ReadFileScatter operation to read the data.

When the I/O operation is complete, each page is sanity checked for a valid page number and torn page errors. In addition, various other data integrity and safety checks are performed. The I/O latch is then released so the page is available for use if it is located on the hash chain. If the page was determined to be already in memory, the page is immediately discarded to the free list.
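
Mirroring the earlier gathered-write sketch, the following C fragment posts a scattered read with ReadFileScatter for a run of contiguous pages. It assumes the handle was opened with FILE_FLAG_NO_BUFFERING and FILE_FLAG_OVERLAPPED (which ReadFileScatter requires), that the buffers are page-aligned system-memory pages taken from the free list, and that completion is checked later just as on the write path; the segment cap is arbitrary for the sketch.

```c
#include <windows.h>
#include <string.h>

#define MAX_READ_SEGMENTS 256   /* arbitrary cap for this sketch */

/* Post a scattered read of nBufs one-page buffers from one contiguous file
   range starting at byte offset 'offset'.  Returns TRUE if the read
   completed or was posted asynchronously (ERROR_IO_PENDING). */
BOOL post_scatter_read(HANDLE hFile, ULONGLONG offset, void *bufs[],
                       DWORD nBufs, DWORD sysPageSize, OVERLAPPED *ov)
{
    FILE_SEGMENT_ELEMENT seg[MAX_READ_SEGMENTS + 1];
    DWORD i;

    if (nBufs > MAX_READ_SEGMENTS)
        return FALSE;

    for (i = 0; i < nBufs; i++)
        seg[i].Buffer = PtrToPtr64(bufs[i]);   /* one memory page per element */
    seg[nBufs].Buffer = NULL;                  /* terminate the array         */

    memset(ov, 0, sizeof(*ov));
    ov->Offset     = (DWORD)(offset & 0xFFFFFFFF);
    ov->OffsetHigh = (DWORD)(offset >> 32);

    if (!ReadFileScatter(hFile, seg, nBufs * sysPageSize, NULL, ov))
        return GetLastError() == ERROR_IO_PENDING;   /* posted, not yet done */
    return TRUE;
}
```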

This process shows the key factors of SQL Server I/O patterns. Read-ahead can request pages that are already in memory as well as pages that are not yet in the buffer pool. Because SQL Server maintains the in-memory buffers and hash chains, it always knows each page's state. Importantly, read-ahead processing opens the door for overlapping read and write requests at the hardware level.

If a page is already in memory when the read-ahead request is posted, the contiguous read still takes place because it is faster than breaking the request up into multiple physical requests. SQL Server considers the read to be unusable for the page in question, but many of the pages around it might be usable. However, if a write operation is in progress when the read is posted, the subsystem has to determine which image of the read to return. Some implementations return the current version of the page before the write is complete; others make the read wait until the write completes; and yet others return a combination, showing partially new data and partially old data. The key is that SQL Server will discard the read as unusable, but the subsystem needs to maintain the proper image for subsequent read operations. The in-progress write, when completed, must be the next read image returned to the server running SQL Server.

Important 

Read-ahead vs. parallel query plans

Do not confuse read-ahead with parallel query plans. Read-ahead occurs independently of the parallel query plan selection. The parallel plan might drive I/O harder because multiple workers are driving the load, but read-ahead occurs for serial and parallel plans. To ensure that parallel workers do not work on the same data sets, SQL Server implements the parallel page supplier to help segment the data requests.

Having a good understanding of the internal workings and architecture of the SQL Server 2005 database storage engine helps database developers build applications that take advantage of the storage engine's strengths and work around its weaknesses.



