Internal Storage

This section covers system catalogs and the internal data storage of tables. Although you can use SQL Server effectively without understanding the internals, understanding the details of how SQL Server stores data will help you develop efficient applications.

When you create a table, one or more rows are inserted into a number of system tables to manage that table. At a minimum, you can see metadata for your new table in the sys.tables, sys.indexes, and sys.columns catalog views. When you define the new table with one or more constraints, you'll also be able to see information in the sys.check_constraints, sys.default_constraints, sys.key_constraints, or sys.foreign_keys views. For every table created, a single row that containsamong other thingsthe name, object ID, and ID of the schema containing the new table is available through the sys.tables view. Remember that the sys.tables view inherits all the columns from sys.objects (which shows information relevant to all types of objects) and then includes additional columns pertaining only to tables. The sys.columns view shows you one row for each column in the new table, and each row will contain information such as the column name, data type, and length. Each column receives a column ID, which initially corresponds to the order in which you specified the columns when you created the tablethat is, the first column listed in the CREATE TABLE statement will have a column ID of 1, the second column will have a column ID of 2, and so on. Figure 6-5 shows the rows returned by the sys.tables and sys.columns views when you create a table. (Not all columns are shown for each view.)

Figure 6-5. Basic catalog information stored after a table is created

[View full width]

CREATE TABLE dbo.employee ( emp_lname varchar(15) NOT NULL, emp_fname varchar(10) NOT NULL, address varchar(30) NOT NULL, phone char(12) NOT NULL, job_level smallint NOT NULL ) sys.tables object_id name

schema_id type_desc ----------- ------------------

----------- ------------------ 917578307 employee 1

UsER_TABLE sys.columns object_id column_id name

system_type_id max_length ----------- -----------

------------ -------------- ---------- 917578307 1

emp_lname 167 15 917578307 2

emp_fname 167 10 917578307 3 address

167 30 917578307 4 phone

175 12 917578307 5

job_level 52 2

Note

There can be gaps in the column ID sequence if the table is altered to drop columns. However, the information schema view gives you a value called ORDINAL_POSITION because that is what the SQL standard demands. The ordinal position is the order the column will be listed when you SELECT * on the table. So the column_id is not necessarily the ordinal position of that column.

The sys.indexes Catalog View

In addition to sys.columns and sys.tables, the sys.indexes view returns at least one row for each table. In versions of SQL Server prior to SQL Server 2005, the sysindexes table contains all the physical storage information for both tables and indexes, which are the only objects that actually use storage space. The sysindexes table has columns to keep track of the space used by all tables and indexes, the physical location of each index root page, and the first page of each table and index. (In Chapter 7, I'll tell you more about root pages and what the "first" page actually means.) In SQL Server 2005, the compatibility view sys.sysindexes contains much of the same information, but it is incomplete because of changes in the storage organization in SQL Server 2005. The catalog view sys.indexes contains only basic property information about indexes, such as whether the index is clustered or nonclustered, unique or non-unique, and other properties, which I discuss in Chapter 7. To get all the storage information in SQL Server 2005 that previous versions provided in the sysindexes table, we have to look at two other catalog views in addition to sys.indexes: sys.partitions and sys.allocation_units (or alternatively, the undocumented sys.system_internals_allocation_units). I'll discuss the basic contents of these views shortly, but first let's focus on sys.indexes.

You might be aware that if a table has a clustered index, the table's data is actually considered part of the index, so the data rows are actually index rows. For a table with a clustered index, SQL Server has a row in sys.indexes with an index_id value of 1 and the name column in sys.indexes contains the name of the index. The name of the table that is associated with the index can be determined from the object_id column in sys.indexes. If a table has no clustered index, there is no organization to the data itself, and we call such a table a heap. A heap in sys.indexes table has an index_id value of 0, and the name column contains NULL. Every additional index has a row in sys.indexes with an indid (index ID) value between 2 and 250. Because there can be as many as 249 non-clustered indexes on a single table and there is one row for the heap or clustered index, every table has between 1 and 250 rows in the sys.indexes view for relational indexes. A table can have additional rows for XML indexes. Metadata for XML indexes is available in the sys.xml_indexes catalog view, which inherits columns from the sys.indexes view.

In SQL Server 2000, each row in the sysindexes table contains information about how much space that table or index takes up, and also information about where the pages for that structure can be found. LOB data stored in a table is stored in a separate location from the other table data, and it has its own row in sysindexes to reflect the size of the LOB data and its location. A row with an indid value of 255 is always used to keep track of storage information for LOB data.

Two main changes in SQL Server 2005 make it no longer efficient to store all storage information in a single table. First, SQL Server 2005 adds the ability to store a table or index on multiple partitions, so the space used by each partition, as well as its location, must be kept track of separately. Second, LOB data is now much more flexible. Not only is there an additional kind of LOB data called row-overflow data, but unlike in SQL Server 2000, LOB data can now be a part of an index. No longer can one extra row in sysindexes be sufficient for keeping track of LOB data if it occurs in multiple indexes and in two different formats. I'll discuss partitioning tables and indexes in Chapter 7, and I'll discuss the storage of row-overflow data, along with the storage of LOB data, later in this chapter.

Data Storage Metadata

Each table and index has a row in sys.indexes, and each table and index in a SQL Server 2005 database can be stored on multiple partitions. The sys.partitions view contains one row for each partition of each table or index. Every table or index has at least one partition, even if you haven't specifically partitioned the structure, but there can be up to 1,000 partitions for one table or index. So there is a one-to-many relationship between sys.indexes and sys.partitions. The sys.partitions view contains a column called partition_id as well as the object_id, so we can join sys.indexes to sys.partitions on the object_id column to retrieve all the partition ID values for a particular table or index. The term used in SQL Server 2005 to describe a subset of a table or index on a single partition is hobt, which stands for Heap Or B-Tree and is pronounced (you guessed it) "hobbit." (A B-Tree is the storage structure used for indexes.) The sys.partitions view includes a column called hobt_id, and in SQL Server 2005 there is always a one-to-one relationship between partition_id and hobt_id. In fact, in the sys.partitions table you will always see that these two columns have the same value. (Future versions of SQL Server might change the relationship between partitions and hobts so that it is not always one-to-one, and the partition_id and hobt_id might have different values, but any comments about such a future enhancement would be pure speculation at this point.)

Each partition (whether for a table or an index) can have three types of rows, each stored on its own set of pages. These types are called in-row data pages (for our "regular" data or index information), row-overflow data pages, and LOB data pages. A set of pages of one particular type for one particular partition is called an allocation unit, so the final catalog view I need to tell you about is sys.allocation_units. The view sys.allocation_units contains one to three rows per partition because there can be as many as three allocation units for each table or index on each partition. There is always an allocation unit for regular in-row pages, but there might also be an allocation unit for LOB data and one for row-overflow data. Figure 6-6 shows the relationship between sys.indexes, sys.partitions, and sys.allocation_units.

Figure 6-6. The relationship between sys.indexes, sys.partitions, and sys.allocation_units

Querying the Catalog Views

Let's look at a specific example now to see information in these three catalog views. Let's first create the table described earlier in Figure 6-5. You can create it in any database, but I suggest either using tempdb, so the table will be automatically dropped next time you restart your SQL Server, or creating a new database just for testing. Many of my examples will assume a database called test.

CREATE TABLE dbo.employee(                emp_lname  varchar(15)   NOT NULL,                emp_fname  varchar(10)   NOT NULL,                address    varchar(30)   NOT NULL,                phone      char(12)      NOT NULL,                job_level  smallint      NOT NULL )

This table will have one row in sys.indexes and one in sys.partitions, as we can see when we run the following queries. I am including only a few of the columns from sys.indexes, but sys.partitions only has six columns, so I have retrieved them all.

SELECT object_id, name, index_id, type_desc FROM sys.indexes WHERE object_id=object_id('dbo.employee') SELECT * FROM sys.partitions WHERE object_id=object_id('dbo.employee')

Here are my results. Yours might vary slightly because your ID values will most likely be different.

object_id   name  index_id  type_desc ----------- ----- --------- ------------ 5575058     NULL  0         HEAP partition_id      object_id index_id partition_number hobt_id           rows ----------------- --------- -------- ---------------- ----------------- ---- 72057594038779904 5575058   0        1                72057594038779904 0

Each row in the sys.allocation_units view has a unique allocation_unit_id value. Each row also has a value in the column called container_id that can be joined with partition_id in sys.partitions, as shown in this query:

SELECT object_name(object_id) AS name,     partition_id, partition_number AS pnum, rows,     allocation_unit_id AS au_id, type_desc as page_type_desc,     total_pages AS pages FROM sys.partitions p JOIN sys.allocation_units a    ON p.partition_id = a.container_id WHERE object_id=object_id('dbo.employee')

Again, for this simple table, I get only one row because there is only one partition, no nonclustered indexes, and only one type of data (IN_ROW_DATA). I will use this query later in this chapter, and I will refer to it as the "allocation query." (This query might not work in future versions of SQL Server if and when the relationship between hobts and partitions changes. In SQL Server 2005, because the relationship is always one-to-one, partition_id and hobt_id can be used interchangeably.)

name     partition_id      pnum rows au_id             page_type_desc pages -----    -------------     ---- ---- ----------------  -------------- ----- employee 72057594038779904 1    0    72057594043301888 IN_ROW_DATA    0

Now let's add some new columns to the table that will need to be stored on other types of pages. Varchar data can be stored on row-overflow pages if the total row size exceeds the maximum of 8,060 bytes. By default, text data is stored on text pages. For varchar data that stored on row-overflow pages, and for text data, there is additional overhead in the row itself to store a pointer to the off-row data. We'll look at the details of row-overflow and text data storage later in this section, and we'll look at ALTER TABLE at the end of this chapter, but now I just want to look at the additional rows in sys.allocation_units.

ALTER TABLE dbo.employee ADD resume_short varchar(8000) ALTER TABLE dbo.employee ADD resume_long text

If we run the preceding query that joins sys.partitions and sys.allocation_units, we get the following three rows:

[View full width]

name partition_id pnum rows au_id

page_type_desc pages -------- -------------------- ---- ----

----------------- ----------------- ----- employee 72057594038779904 1 0

72057594043301888 IN_ROW_DATA 0 employee 72057594038779904 1 0

72057594043367424 ROW_OVERFLOW_DATA 0 employee 72057594038779904 1 0

72057594043432960 LOB_DATA 0

You might also want to add an index or two and check the contents of these catalog views again. You should notice that just adding a clustered index does not change the number of rows in sys.allocation_units, but it will change the partition_id numbers because the entire table will be rebuilt internally when you create a clustered index. Adding a nonclustered index will add at least one more row to sys.allocation_units to keep track of the pages for that index. The following query joins all three viewssys.indexes, sys.partitions, and sys.allocation_unitsto show you the table name, index name and type, page type, and space usage information for the dbo.employee table:

SELECT  convert(char(8),object_name(i.object_id)) AS table_name,     i.name AS index_name, i.index_id, i.type_desc as index_type,     partition_id, partition_number AS pnum, rows,     allocation_unit_id AS au_id, a.type_desc as page_type_desc,     total_pages AS pages FROM sys.indexes i JOIN sys.partitions p         ON i.object_id = p.object_id AND i.index_id = p.index_id     JOIN sys.allocation_units a         ON p.partition_id = a.container_id WHERE i.object_id=object_id('dbo.employee')

Because I have not inserted any data into this table, you should notice that the values for rows and pages are 0. When I discuss actual page structures, we'll insert data into our tables so we can look at the internal storage of the data at that time. The queries I've run so far do not provide us with any information about the location of pages in the various allocation units. In SQL Server 2000, the sysindexes table contains three columns that indicate where data is located; these columns are called first, root, and firstIAM. These columns are still available in SQL Server 2005 (with slightly different names: first_page, root_page, and first_iam_page), but they can be seen only in an undocumented view called sys.system_internals_allocation_units. This view is identical to sys.allocation_units except for the addition of these three additional columns, so you can replace sys.allocation_units with sys.system_internals_allocation_units in the preceding allocation query and add these three extra columns to the select list. Keep in mind that as an undocumented object, this view (and others starting with sytem_internals) are for internal use only and are subject to change. Forward compatibility is not guaranteed. We'll briefly look at the first_page column in the next section, and the other two columns will be discussed in Chapter 7 when we look in detail at index structures.

Data Pages

Data pages are the structures that contain user data that has been added to a database's tables. As we saw earlier, there are three varieties of data pages, each of which stores data in a different format. There are pages for in-row data, pages for row-overflow data, and pages for LOB data. As with all other types of pages in SQL Server, data pages have a fixed size of 8 KB, or 8,192 bytes. They consist of three major components: the page header, data rows, and the row offset array, as shown in Figure 6-7.

Figure 6-7. The structure of a data page

Page Header

As you can see in Figure 6-7, the page header occupies the first 96 bytes of each data page (leaving 8,096 bytes for data, row overhead, and row offsets). Table 6-4 shows some of the information contained in the page header.

Table 6-4. Information Contained in the Page Header
Field	What It Contains
pageID	File number and page number of this page in the database
nextPage	File number and page number of the next page if this page is in a page chain
prevPage	File number and page number of the previous page if this page is in a page chain
Metadata: ObjectId	ID of the object to which this page belongs
Metadata: PartitionId	ID of the partition that this page is part of
Metadata: AllocUnitId	ID of the allocation unit that contains this page
Lsn	Log sequence number (LSN) value used for changes and updates to this page
slotCnt	Total number of slots (rows) used on this page
Level	Level of this page in an index (always 0 for leaf pages)
indexed	Index ID of this page (always 0 for data pages)
freeData	Byte offset of the first free space on this page
Pminlen	Number of bytes in fixed-length portion of rows
freeCnt	Number of free bytes on page
reservedCnt	Number of bytes reserved by all transactions
xactreserved	Number of bytes reserved by the most recently started transaction
tornBits	1 bit per sector for detecting torn page writes (discussed in Chapters 4 and 5)
flagBits	2-byte bitmap that contains additional information about the page

Data Rows for In-Row Data

Following the page header is the area in which the table's actual data rows are stored. The maximum size of a single data row is 8,060 bytes of in-row data. Rows can also have row-overflow and LOB data stored on separate pages. The number of rows stored on a given page will vary depending on the structure of the table and on the data being stored. A table that has all fixed-length columns will always be able to store the same number of rows per page; variable-length rows can store as many rows as will fit based on the actual length of the data entered. Keeping row length shorter allows more rows to fit on a page, thus reducing I/O and improving the cache-hit ratio.

Row Offset Array

The row offset array is a block of 2-byte entries, each indicating the offset on the page at which the corresponding data row begins. Every row has a 2-byte entry in this array (as discussed earlier, when I mentioned the 10 overhead bytes needed by every row). Although these bytes aren't stored in the row with the data, they do affect the number of rows that will fit on a page.

The row offset array indicates the logical order of rows on a page. For example, if a table has a clustered index, SQL Server stores the rows in the order of the clustered index key. This doesn't mean the rows are physically stored on the page in the order of the clustered index key. Rather, slot 0 in the offset array refers to the first row in the clustered index key order, slot 1 refers to the second row, and so forth. As we'll see shortly when we examine an actual page, the physical location of these rows can be anywhere on the page.

There's no internal global row number for every row in a table. SQL Server uses the combination of file number, page number, and slot number on the page to uniquely identify each row in a table, and we can use those values with an undocumented command to see the actual bytes on a page.

Examining Data Pages

You can view the contents of a data page by using the DBCC PAGE command, which allows you to view the page header, data rows, and row offset table for any given page in a database. Only a system administrator can use DBCC PAGE. But because you typically won't need to view the contents of a data page, you won't find information about DBCC PAGE in the SQL Server documentation. Nevertheless, in case you want to use it, here's the syntax:

DBCC PAGE ({dbid | dbname}, filenum, pagenum[, printopt])

The DBCC PAGE command includes the parameters shown in Table 6-5. Figure 6-8 shows sample output from DBCC PAGE with a printopt value of 1. Note that DBCC TRACEON(3604) instructs SQL Server to return the results to the client. Without this traceflag, no output will be returned for the DBCC PAGE command.

Table 6-5. Parameters of the DBCC PAGE Command
Parameter	Description
Dbid	ID of the database containing the page
Dbname	Name of the database containing the page
Filenum	File number containing the page
Pagenum	Page number within the file
Printopt	Optional print option; takes one of these values: 0 Default; prints the buffer header and page header 1 Prints the buffer header, page header, each row separately, and the row offset table 2 Prints the buffer and page headers, the page as a whole, and the offset table 3 Prints the buffer header, page header, each row separately, and the row offset table; each row is followed by each of its column values listed separately

Figure 6-8. Sample output from DBCC PAGE

[View full width]
DBCC TRACEON(3604) GO DBCC PAGE (pubs, 1, 153, 1) GO PAGE: (1:153) BUFFER: BUF @0x02C156E8 bpage = 0x057CC000 bhash = 0x00000000 bpageno = (1:153) bdbid = 6 breferences = 0 bUse1 = 30779 bstat = 0xc00009 blog = 0x32159 bnext = 0x00000000 PAGE HEADER: Page @0x057CC000 m_pageId = (1:153) m_headerVersion = m_type = 1 m_typeFlagBits = 0x4 m_level = 0 m_flagBits = 0x4200 m_objId (AllocUnitId.idObj) = 67 m_indexId (AllocUnitId.idInd) = 256 Metadata: AllocUnitId = 72057594042318848 Metadata: PartitionId = 72057594038321152 Metadata: IndexId = 1 Metadata: ObjectId = 2073058421 m_prevPage = (0:0) m_nextPage = (0:0) pminlen = 24 m_slotCnt = 23 m_freeCnt = 6010 m_freeData = 2136 m_reservedCnt = 0 m_lsn = (15:356:2) m_xactReserved = 0 m_xdesId = (0:0) m_ghostRecCnt = 0 m_tornBits = -1021426578 Allocation Status GAM (1:2) = ALLOCATED SGAM (1:3) = NOT ALLOCATED PFS (1:1) = 0x60 MIXED_EXT ALLOCATED 0_PCT_FULL DIFF (1:6) = CHANGED ML (1:7) = NOT MIN_LOGGED DATA: Slot 0, Offset 0x631, Length 88, DumpStyle BYTE Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP VARIABLE_COLUMNS Memory Dump @0x5C0FC631 00000000:   30001800 34303820 3439362d 37323233 0 ...408 496-7223 00000010:   43413934 30323501 090000fe 05003300  CA94025.......3. 00000020:   38003f00 4e005800 3137322d 33322d31 8 .?.N.X.172-32-1 00000030:   31373657 68697465 4a6f686e 736f6e31  176WhiteJohnson1 00000040:   30393332 20426967 67652052 642e4d65  0932 Bigge Rd.Me 00000050:   6e6c6f20 5061726b  nlo Park Slot 1, Offset 0xb8, Length 88, DumpStyle BYTE Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP VARIABLE_COLUMNS Memory Dump @0x5C0FC0B8 00000000:   30001800 34313520 3938362d 37303230 0 ...415 986-7020 00000010:   43413934 36313801 090000fe 05003300  CA94618.......3. 00000020:   38004000 51005800 3231332d 34362d38 8 .@.Q.X.213-46-8 00000030:   39313547 7265656e 4d61726a 6f726965  915GreenMarjorie 00000040:   33303920 36337264 2053742e 20233431  309 63rd St. #41 00000050:   314f616b 6c616e64  1Oakland Slot 2, Offset 0x110, Length 85, DumpStyle BYTE Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP VARIABLE_COLUMNS Memory Dump @0x5C0FC110 00000000:   30001800 34313520 3534382d 37373233 0 ...415 548-7723 00000010:   43413934 37303501 090000fe 05003300  CA94705.......3. 00000020:   39003f00 4d005500 3233382d 39352d37 9 .?.M.U.238-95-7 00000030:   37363643 6172736f 6e436865 72796c35  766CarsonCheryl5 00000040:   38392044 61727769 6e204c6e 2e426572  89 Darwin Ln.Ber 00000050:   6b656c65 79keley /* Data for slots 3 through 20 not shown */ Slot 21, Offset 0x1c0, Length 89, DumpStyle BYTE Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP VARIABLE_COLUMNS Memory Dump @0x5C0FC1C0 00000000:   30001800 38303120 3832362d 30373532 0 ...801 826-0752 00000010:   55543834 31353201 090000fe 05003300  UT84152.......3. 00000020:   39003d00 4b005900 3839392d 34362d32 9 .=.K.Y.899-46-2 00000030:   30333552 696e6765 72416e6e 65363720  035RingerAnne67 00000040:   53657665 6e746820 41762e53 616c7420  Seventh Av.Salt 00000050:   4c616b65 20436974  79Lake City Slot 22, Offset 0x165, Length 91, DumpStyle BYTE Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP VARIABLE_COLUMNS Memory Dump @0x5C0FC165 00000000:   30001800 38303120 3832362d 30373532 0 ...801 826-0752 00000010:   55543834 31353201 090000fe 05003300  UT84152.......3. 00000020:   39003f00 4d005b00 3939382d 37322d33 9 .?.M.[.998-72-3 00000030:   35363752 696e6765 72416c62 65727436  567RingerAlbert6 00000040:   37205365 76656e74 68204176 2e53616c 7  Seventh Av.Sal 00000050:   74204c61 6b652043 697479t  Lake City OFFSET TABLE: Row - Offset 22 (0x16) - 357 (0x165) 21 (0x15) - 448 (0x1c0) 20 (0x14) - 711 (0x2c7) 19 (0x13) - 1767 (0x6e7) 18 (0x12) - 619 (0x26b) 17 (0x11) - 970 (0x3ca) 16 (0x10) - 1055 (0x41f) 15 (0xf) - 796 (0x31c) 14 (0xe) - 537 (0x219) 13 (0xd) - 1673 (0x689) 12 (0xc) - 1226 (0x4ca) 11 (0xb) - 1949 (0x79d) 10 (0xa) - 1488 (0x5d0) 9 (0x9) - 1854 (0x73e) 8 (0x8) - 1407 (0x57f) 7 (0x7) - 1144 (0x478) 6 (0x6) - 96 (0x60) 5 (0x5) - 2047 (0x7ff) 4 (0x4) - 884 (0x374) 3 (0x3) - 1314 (0x522) 2 (0x2) - 272 (0x110) 1 (0x1) - 184 (0xb8) 0 (0x0) - 1585 (0x631) DBCC execution completed. If DBCC printed error  messages, contact your system administrator.

As you can see, the output from DBCC PAGE is divided into four main sections: BUFFER, PAGE HEADER, DATA, and OFFSET TABLE (really the offset array). The BUFFER section shows information about the buffer for the given page. A buffer in this context is the in-memory structure that manages a page, and the information in this section is relevant only when the page is in memory.

The PAGE HEADER section in Figure 6-8 displays the data for all the header fields on the page. (Table 6-4 shows the meaning of most of these fields.) The DATA section contains information for each row. When DBCC PAGE is used with a printopt value of 1 or 3, DBCC PAGE indicates the slot position of each row, the offset of the row on the page, and the length of the row. The row data is then divided into three parts. The left column indicates the byte position within the row where the displayed data occurs. The next section contains the actual data stored on the page, displayed in four columns of eight hexadecimal digits each. The right column contains an ASCII character representation of the data. Only character data is readable in this column, although some of the other data might be displayed.

The OFFSET TABLE section shows the contents of the row offset array at the end of the page. In Figure 6-8, you can see that this page contains 23 rows, with the first row (indicated by slot 0) beginning at offset 1585 (0x631). The first row physically stored on the page is actually row 6, with an offset in the row offset array of 96. DBCC PAGE with a printopt value of 1 displays the rows in slot number order, even though, as you can see by the offset of each of the slots, that it isn't the order in which the rows physically exist on the page. If you use DBCC PAGE with a printopt value of 2, you see a dump of all 8,096 bytes of the page (after the header) in the order they are stored on the page.

The Structure of Data Rows

A table's data rows have the general structure shown in Figure 6-9. The data for all fixed-length columns is stored first, followed by the data for all variable-length columns. Table 6-6 shows the information stored in each row.

Figure 6-9. The structure of data rows

Table 6-6. Information Stored in a Table's Data Rows
Information	Mnemonic	Size
Status Bits A	TagA	1 byte
Status Bits B (not used in SQL Server 2000)	TagB	1 byte
Fixed-length size	Fsize	2 bytes
Fixed-length data	Fdata	Fsize 4
Number of columns	Ncol	2 bytes
NULL bitmap (1 bit for each column in table; a 1 indicates that the corresponding column is NULL or that the bit is unused.)	Nullbits	Ceiling (Ncol / 8)
Number of variable-length columns	VarCount	2 bytes
Variable column offset array	VarOffset	2 * VarCount
Variable-length data	VarData	VarOff[VarCount] - (Fsize + 4 + Ceiling (Ncol / 8) + 2 * VarCount)

Status Bits A contains a bitmap indicating properties of the row. The bits have the following meaning:

Bit 0 Versioning information; in SQL Server 2005, it's always 0.
Bits 1 through 3 Taken as a 3-bit value, 0 indicates a primary record, 1 indicates a forwarded record, 2 indicates a forwarding stub, 3 indicates an index record, 4 indicates a blob fragment or row-overflow data, 5 indicates a ghost index record, and 6 indicates a ghost data record. (I'll discuss forwarding and ghost records in Chapter 7.)
Bit 4 Indicates that a NULL bitmap exists; in SQL Server 2005, a NULL bitmap is always present, even if no NULLs are allowed in any column.
Bit 5 Indicates that variable-length columns exist in the row.
Bits 6 and 7 Not used in SQL Server 2005.

Within each block of fixed-length or variable-length data, the data is stored in the column order in which the table was created. For example, suppose a table is created with the following statement:

CREATE TABLE Test1 ( Col1 int              NOT NULL, Col2 char(25)         NOT NULL, Col3 varchar(60)      NULL, Col4 money            NOT NULL, Col5 varchar(20)      NOT NULL )

The fixed-length data portion of this row will contain the data for Col1, followed by the data for Col2, followed by the data for Col4. The variable-length data portion will contain the data for Col3, followed by the data for Col5. For rows that contain only fixed-length data, the following is true:

The first hexadecimal digit of the first byte of the data row will be 1, indicating that no variable-length columns exist. (The first hexadecimal digit comprises bits 4 through 7; bits 6 and 7 are always 0, and if there are no variable-length columns, bit 5 is also 0. Bit 4 is always 1, so the value of the four bits is displayed as 1.)
The data row ends after the NULL bitmap, which follows the fixed-length data (that is, the shaded portion shown in Figure 6-9 won't exist in rows with only fixed-length data).
The total length of every data row will be the same.

Column Offset Arrays

A data row that has all fixed-length columns has no variable column count or column offset array. A data row that has variable-length columns has a column offset array in the data row with a 2-byte entry for each variable-length column, indicating the position within the row where each column ends. (The terms offset and position aren't exactly interchangeable. Offset is 0-based, and position is 1-based. A byte at an offset of 7 is in the eighth byte position in the row.)

The two examples that follow illustrate how fixed-length and variable-length data rows are stored.

Storage of Fixed-Length Rows

First let's look at the simpler case of an all fixed-length row:

CREATE TABLE Fixed ( Col1 char(5)     NOT NULL, Col2 int         NOT NULL, Col3 char(3)     NULL, Col4 char(6)     NOT NULL )

When this table is created, you should be able to execute the following queries against the sys.indexes and sys.columns views to receive the information similar to the results shown:

SELECT object_id, type_desc,     indexproperty(object_id, name, 'minlen') as minlen     FROM sys.indexes where object_id=object_id('fixed') SELECT  column_id, name, system_type_id, max_length FROM sys.columns WHERE object_id=object_id('fixed')  RESULTS: object_id   type_desc  minlen ----------- ---------- ------- 53575229    HEAP       22 column_id   name               system_type_id max_length ----------- ------------------ -------------- ---------- 1           Col1               175            5 2           Col2               56             4 3           Col3               175            3 4           Col4               175            6

Note

The sysindexes table in SQL Server 2000 contains columns called minlen and xmaxlen, which store the minimum and maximum length of a row. In SQL Server 2005, these values are available only by using undocumented parameters to the indexproperty function. As with all undocumented features, keep in mind that they are not supported by Microsoft and future compatibility is not guaranteed.

For tables containing only fixed-length columns, the value returned for minlen by the indexproperty function is equal to the sum of the column lengths (from sys.columns.max_length) plus 4 bytes. It doesn't include the 2 bytes for the number of columns or the bytes for the NULL bitmap.

To look at a specific data row in this table, you must first insert a new row:

INSERT Fixed VALUES ('ABCDE', 123, NULL, 'CCCC')

Figure 6-10 shows this row's actual contents on the data page. To run the DBCC PAGE command, I needed to know what page number was used to store the row for this table. I mentioned earlier that a value for first_page was stored in an undocumented view called sys.system_internals_allocation_units, which is almost identical to the sys.allocation_units view. The following query gives me the value for first_page for the table called fixed:

SELECT object_name(object_id) AS name,     rows, type_desc as page_type_desc,     total_pages AS pages, first_page FROM sys.partitions p JOIN sys.system_internals_allocation_units a    ON p.partition_id = a.container_id WHERE object_id=object_id('dbo.fixed') RESULTS: name     rows  page_type_desc  pages  first_page -----    ----  --------------  -----  -------------- Fixed    1     IN_ROW_DATA     2     0xCF0400000100

Figure 6-10. A data row containing all fixed-length columns

I can then take the value of first_page from the preceding sys.system_internals_allocation_units output (0xCF0400000100) and convert it to a file and page address. In hexadecimal notation, each set of two hexadecimal digits represents a byte. I first had to swap the bytes to get 00 01 00 00 04 CF. The first two groups represent the 2-byte file number; the last four groups represent the page number. So the file is 0x0001, which is 1, and the page number is 0x000004CF, which is 1231 in decimal.

Unless you particularly enjoy playing with hexadecimal conversions, you might want to use one of two other options for determining the actual page numbers associated with your SQL Server tables and indexes. First you can create the function shown here to convert a 6-byte hexadecimal page number value (such as 0xCF0400000100) to a file_number:page_number format.

CREATE FUNCTION convert_page_nums (@page_num binary(6))    RETURNS varchar(11) AS   BEGIN    RETURN(convert(varchar(2), (convert(int, substring(@page_num, 6, 1))           * power(2, 8)) +              (convert(int, substring(@page_num, 5, 1)))) + ':' +                convert(varchar(11),    (convert(int, substring(@page_num, 4, 1)) * power(2, 24)) +    (convert(int, substring(@page_num, 3, 1)) * power(2, 16)) +    (convert(int, substring(@page_num, 2, 1)) * power(2, 8)) +    (convert(int, substring(@page_num, 1, 1)))) )   END

You can then execute this SELECT to call the function:

SELECT dbo.convert_page_nums(0xCF0400000100)

You should get back the result 1:1231.

Warning

SQL Server does not guarantee that the first_page column in sys.system_internals_allocation_units will always indicate the first page of a table. (The view is undocumented, after all.) I've found that first_page is reliable until you begin to perform deletes and updates on the data in the table.

The second option for determining the actual page numbers is to use another undocumented command called DBCC IND. Because most of the information returned is relevant only to indexes, I won't discuss this command in detail until Chapter 7. However, for a sneak preview, you can run the following command and note the values in the first two columns of output (labeled PageFID and PagePID) in the row where PageType = 1, which indicates that the page is a data page.

DBCC IND(test, fixed, -1)

You would replace test with the name of whatever database you were in when you created this table. The values for PageFID and PagePID should be the same value you used when you converted the hexadecimal string for the first_page value. In my case, I see that PageFID value is 1 and the PagePID value is 1231. So those are the values I use when calling DBCC PAGE:

DBCC PAGE(test, 1, 1231, 1)

Reading the output of DBCC PAGE takes a bit of practice. First note that the output shows the data rows in groups of 4 bytes at a time. The shaded area in Figure 6-10 has been expanded to show the bytes in an expanded form.

The first byte is Status Bits A, and its value (0x10) indicates that only bit 3 is on, so the row has no variable-length columns. The second byte in the row remains unused. The third and fourth bytes (1,600) indicate the length of the fixed-length fields, which is also the column offset in which the Ncol value can be found. (As a multi-byte numeric value, this information is stored in a byte-swapped form, so the value is really 0x0016, which translates to 22.) To know where in the row between 4 and 22 each column actually is located, we need to know the offset of each column. In SQL Server 2000, the syscolumns system table has a column indicating the offset within the row. Although in SQL Server 2005 you can still select from the compatibility view called syscolumns, the results you get back are not reliable. The offsets can be found in an undocumented view called sys.system_internals_partition_columns that we can then join to sys.partitions to get the information about the referenced objects and join to sys.columns to get other information about each column.

Here is a query to return basic column information, including the offset within the row for each column. I will use the same query for other tables later in this chapter, and I will refer to it as the "column detail query."

SELECT c.name AS column_name, column_id, max_inrow_length,         pc.system_type_id, leaf_offset  FROM sys.system_internals_partition_columns pc     JOIN sys.partitions p       ON p.partition_id = pc.partition_id     JOIN sys.columns c          ON column_id = partition_column_id             AND c.object_id = p.object_id WHERE p.object_id=object_id('fixed') RESULTS: column_name column_id   max_inrow_length system_type_id leaf_offset ----------- ----------- ---------------- -------------- ----------- Col1        1           5                175            4 Col2        2           4                56             9 Col3        3           3                175            13 Col4        4           6                175            16

So now we can find the data in the row for each column simply by using the offset value in the preceding results: the data for column Col1 begins at offset 4, the data for column Col2 begins at offset 9, and so on. As an int, the data in Col2 (7b000000) must be byte-swapped to give us the value 0x0000007b, which is equivalent to 123 in decimal.

Note that the 3 bytes of data for Col3 are all zeros, representing an actual NULL in the column. Because the row has no variable-length columns, the row ends 3 bytes after the data for column Col4. The 2 bytes starting right after the fixed-length data at offset 22 (0400, which is byte-swapped to yield 0x0004) indicate that four columns are in the row. The last byte is the NULL bitmap. The value of 0xf4 is 11110100 in binary, and bits are shown from high order to low order. The low-order 4 bits represent the four columns in the table, 0100, which indicates that only the third column actually IS NULL. The high-order 4 bits are 1111 because those bits are unused. The null bitmap must have a multiple of 8 bits, and if the number of columns is not a multiple of 8, there will be unused bits.

Storage of Variable-Length Rows

Now let's look at the somewhat more complex case of a table with variable-length data. Each row has three varchar columns and two-fixed length columns:

CREATE TABLE variable ( Col1 char(3)        NOT NULL, Col2 varchar(250)   NOT NULL, Col3 varchar(5)     NULL, Col4 varchar(20)    NOT NULL, Col5 smallint       NULL )

When this table is created, you should be able to execute the following queries against the sys.indexes, sys.partitions, sys.system_internals_partition_columns, and sys.columns views to receive the information similar to the results shown here:

SELECT object_id, type_desc,     indexproperty(object_id, name, 'minlen') as minlen     FROM sys.indexes where object_id=object_id('variable') SELECT name, column_id, max_inrow_length, pc.system_type_id, leaf_offset  FROM sys.system_internals_partition_columns pc     JOIN sys.partitions p           ON p.partition_id = pc.partition_id     JOIN sys.columns c           ON column_id = partition_column_id AND c.object_id = p.object_id WHERE p.object_id=object_id('variable') RESULTS: object_id   type_desc minlen ----------- --------- ----------- 69575286    HEAP      9 column_name column_id   max_inrow_length system_type_id leaf_offset ----------- ----------- ---------------- -------------- ----------- Col1        1           3                175            4 Col2        2           250              167            -1 Col3        3           5                167            -2 Col4        4           20               167            -3 Col5        5           2                52              7

Now you insert a row into the table:

INSERT variable VALUES     ('AAA', REPLICATE('X', 250), NULL, 'ABC', 123)

The REPLICATE function is used here to simplify populating a column; this function builds a string of 250 Xs to be inserted into Col2.

In Figure 6-11, the data for the fixed-length columns can be found by using the leaf_offset value in sys.system_internals_partition_columns, in the preceding query results. In this table, Col1 begins at offset 4 and Col5 begins at offset 7. Variable-length columns are not shown in the query output with fixed offset because the offset can be different in each row. Instead, the row itself holds the ending position of each variable-length column within that row in a part of the row called the Row Offset Array. The query output shows that Col2 has an leaf_offset value of 1, which means that Col2 is the first variable-length column; an offset for Col3 of 2 means that Col3 is the second variable-length column, and an offset of 3 for Col4 means that Col4 is the third variable-length column.

Figure 6-11. A data row with variable-length columns

To find the variable-length columns in the data row itself, you first locate the column offset array in the row. Right after the 2-byte field indicating the total number of columns (0x0500) and the NULL bitmap with the value 0xe4, a 2-byte field exists with the value 0x0300 (or 3, decimal) indicating that three variable-length fields exist. Next comes the column offset array. Three 2-byte values indicate the ending position of each of the three variable-length columns: 0x0e01 is byte-swapped to 0x010e, so the first variable byte column ends at position 270. The next 2-byte offset is also 0x0e01, so that column has no length and has nothing stored in the variable data area. Unlike with fixed-length fields, if a variable-length field has a NULL value, it takes no room in the data row. SQL Server distinguishes between a varchar containing NULL and an empty string by determining whether the bit for the field is 0 or 1 in the NULL bitmap. The third 2-byte offset is 0x1101, which, when byte-swapped, gives us 0x0111. This means the row ends at position 273 (and is a total of 273 bytes in length).

The total storage space needed for a row depends on a number of factors. Variable-length fields add additional overhead to a row, and their actual size is probably unpredictable. Even for fixed-length fields, the number of bytes of overhead can change depending on the number of columns in the table. In the earlier example illustrated in Figure 6-3, I mentioned that 10 bytes of overhead will exist if a row contains all fixed-length columns. For that row, 10 is the correct number. The size of the NULL bitmap needs to be long enough to store a bit for every column in the row. In the Figure 6-3 example, the table has 11 columns, so the NULL bitmap needs to be 2 bytes. In the examples illustrated by Figures 6-9 and 6-10, the tables have fewer than eight columns, so the NULL bitmaps need only a single byte. Don't forget that the total row overhead must also include the 2 bytes for each row in the row offset table at the bottom of the page.

Page Linkage

SQL Server 2005 doesn't connect the individual data pages of a table in a doubly linked list unless the table has a clustered index. Pages at each level of an index are linked together, and because the data is considered the leaf level of a clustered index, SQL Server does maintain the linkage. However, for a heap, there is no such linked list connecting the pages to each other. The only way that SQL Server determines which pages belong to a table is by inspecting the Index Allocation Maps (IAMs) for the table.

If the table has a clustered index, you can use the M_nextPage and M_prevPage values in the page header information to determine the ordering of pages in the list. Alternatively, you can use the DBCC IND command to get a list of all pages that belong to an index, along with a column showing the next page and the previous page, for each page. We'll look at a lot more detailed examples of DBCC IND in Chapter 7.

Row-Overflow Data

SQL Server 2005 lets you store some of a row's data off of the actual data page. This is the first change in the maximum row length in the SQL Server product since version 7.0. For SQL Server 7.0, the size of data pages changed from 2 KB to 8 KB, and along with that increase in page size came a corresponding increase in the maximum size of a data row. The maximum row size in SQL Server 6.5 was 1,962 bytes, and in SQL Server 7.0 it increased to 8,060 bytes; that maximum size stayed the same in SQL Server 2000. Along with the increase in page size, the maximum length of a variable-length column (varchar or varbinary) increased from 255 bytes to 8,000 bytes. In SQL Server 2005, 8,060 is still the maximum size of a row on a data page and 8,000 bytes is still the maximum defined length for a variable-length column. As discussed earlier, this maximum row size value includes several bytes of overhead stored with the row on the physical pages, so the total size of all the table's defined columns must be slightly less than this amount. In fact, SQL Server 2005 changed the error message you get if you try to create a table with more rows than the allowable maximum. If you execute the following CREATE TABLE statement with column definitions that add up to exactly 8,060 bytes, you'll get the error message shown.

CREATE TABLE dbo.bigrows (a char(3000),     b char(3000),     c char(2000),     d char(60) ) Msg 1701, Level 16, State 1, Line 1 Creating or altering table 'bigrows' failed because the minimum row size would be 8067, including 7 bytes of internal overhead. This exceeds the maximum allowable table row size of 8060 bytes.

In this message, you can see the number of overhead bytes (7) that SQL Server 2005 wants to store with the row itself. There is also an additional 2 bytes for the row offset bytes at the end of the page, but those bytes are not included in this total here. The only way to exceed this size limit of 8,060 bytes is to use variable-length columns because for variable-length data, SQL Server 2005 can store the columns in special row-overflow pages if all the fixed-length columns will fit into the regular in-row size limit.

Now let's take a look at a table with all variable-length columns. Note that although my columns are all varchar, columns of other data types can potentially also be stored on row-overflow data pages. These other data types include varbinary, nvarchar, and sqlvariant columns, as well as columns that use CLR user-defined data types. The following code creates a table with rows that have a maximum defined length of much greater than 8,060 bytes.

CREATE TABLE dbo.bigrows      (a varchar(3000),    b varchar(3000),    c varchar(3000),    d varchar(3000) )

In fact, if you run this CREATE TABLE statement in SQL Server 7.0, you get an error and the table will not be created at all. In SQL Server 2000, the table will be created, but you'll get a warning that inserts or updates might fail if the row size exceeds the maximum.

Not only can the preceding dbo.bigrows table be created in SQL Server 2005, but you can insert a row with column sizes that add up to more than 8,060 bytes with a simple INSERT, as shown here:

INSERT INTO dbo.bigrows      SELECT REPLICATE('e', 2100), REPLICATE('f', 2100),       REPLICATE('g', 2100), REPLICATE('h', 2100)

To determine whether SQL Server is storing any data in row-overflow data pages for a particular table, you can run the allocation query shown earlier:

SELECT object_name(object_id) AS name,     partition_id, partition_number AS pnum, rows,     allocation_unit_id AS au_id, type_desc as page_type_desc,     total_pages AS pages FROM sys.partitions p JOIN sys.allocation_units a    ON p.partition_id = a.container_id WHERE object_id=object_id('dbo.bigrows')

This query should return output similar to that shown here:

name    partition_id      pnum rows au_id             page_type_desc   pages ----    ----------------- ---- ---- ----------------- --------------   ----- bigrows 72057594039238656 1    1    72057594043957248 IN_ROW_DATA       2 bigrows 72057594039238656 1    1    72057594044022784 ROW_OVERFLOW_DATA 2

You can see there are two pages for the one row of regular in-row data, and two pages for the one row of row-overflow data. Alternatively, you can use the command DBCC IND(test, bigrows, -1) and see the four pages individually. Two pages are for the row-overflow data, and two are for the in-row data. The PageType values have the following meaning:

Page type 1 = data page
Page type 2 = index page
Page type 3 = LOB or row-overflow page
Page type 10 = IAM page

PageFID PagePID  ObjectID    PartitionID         IAM_chain_type     PageType ------- -------- ---------  -----------------    ---------------    -------- 1       2252     85575343  72057594039238656     Row-overflow data  3 1       2251     85575343  72057594039238656     Row-overflow data  10 1       2254     85575343  72057594039238656     In-row data        1 1       2253     85575343  72057594039238656     In-row data        10

We can see that there is one data page and one IAM page for the in-row data, and one data page and one IAM page for the row-overflow data. With the results from DBCC IND, we could also look at the page contents with DBCC page. On the data page for the in-row data, we would see three of the four varchar column values, and the fourth column would be stored on the data page for the row-overflow data. If you run DBCC PAGE for the data page storing the in-row data (page 1:2254 in my example), you'll notice that it isn't necessarily the fourth column in column order that is stored off the row. I won't show you the entire contents of the row because they are almost the entire page size. When I look at the in-row data page using DBCC PAGE, I see the column with e, the column with g, and the column with h, and it is the column with f that has moved to the new row. In the place of that column, we can see the bytes shown below. I have included the last byte with e (ASCII code 65) and the first byte with g (ASCII code 67), and in between there are 24 other bytes. The 17^th through 22^nd bytes of those 24 bytes are treated as a 6-byte numeric value: cc0800000100. We need to reverse the byte order and break it into a 1-byte hex value for the file number and a 4-byte hex value for the page number. So the file number is 0x01, or 1, for the file number, and 0x000008cc, or 2252, for the page number. This is the same file and page numbers that we saw using DBCC IND.

 65020000 61010000 00c92900 00340800 00cc0800 00010000 0067

SQL Server stores variable-length columns on row-overflow pages only under certain conditions. The determining factor is the length of the row itself. It doesn't matter how full the regular page is into which SQL Server is trying to insert the new row. SQL Server constructs the row normally, and only if the row itself needs more than 8,060 bytes will some of its columns be stored on overflow pages. Each column in the table is either completely on the row or completely off the row. This means that a 4,000-byte variable-length column cannot have half its bytes on the regular data page and half on a row-overflow page. If a row is less than 8,060 bytes and there is no room on the page where SQL Server is trying to insert it, normal page-splitting algorithms (which I'll describe in Chapter 7) are applied.

One row can span many row-overflow pages if it contains many large variable-length columns. For example, you can create the table dbo.hugerows and insert a single row into it:

CREATE TABLE dbo.hugerows      (a varchar(3000),    b varchar(8000),    c varchar(8000),    d varchar(8000)) INSERT INTO dbo.hugerows      SELECT REPLICATE('a', 3000), REPLICATE('b', 8000),          REPLICATE('c', 8000), REPLICATE('d', 8000)

Now if I run the allocation query shown earlier, substituting hugerows for bigrows, I get the results shown here:

name     partition_id      pnum rows au_id             page_type_desc   pages -------- ----------------- ---- ---- ----------------- ----------------- ----- hugerows 72057594039304192 1    1    72057594044088320 IN_ROW_DATA       2 hugerows 72057594039304192 1    1    72057594044153856 ROW_OVERFLOW_DATA 4

There are four pages for the row overflow information, one for the IAM page and three for the columns that didn't fit in the regular row. The number of large variable-length columns that a table can have is not unlimited, although it is quite large. There is a limit of 1,024 columns in any table, so that is definitely a limit there. But another limit will be reached before that. When a column has to be moved off a regular page onto a row-overflow page, SQL Server keeps a pointer to the row-overflow information as part of the original row. The pointer is always 24 bytes, and the row still needs 2 bytes in the row for each variable-length column, whether or not the variable-length column is stored in the row. So it turns out that 308 is the maximum number of overflowing columns we can have, and such a row needs 8,008 bytes just for the 26 overhead bytes for each overflowing column in the row.

Note

Just because SQL Server can store lots of large columns on row-overflow pages doesn't mean it's always a good idea to do so. This capability does allow you more flexibility in the organization of your tables, but you might pay a heavy performance price if many additional pages need to be accessed for every row of data. Row-overflow pages are intended to be a solution in the situation where most rows will fit completely on your data pages and you only occasionally need have row-overflow data. Using row-overflow pages, SQL Server can handle the extra data effectively, without requiring a redesign of your table.

In some cases, if a large variable-length column shrinks, it can be moved back to the regular row. However, for efficiency reasons, if the decrease is just a few bytes, SQL Server will not bother checking. Only when a column stored in a row-overflow page is reduced by more than 1,000 bytes will SQL Server even consider checking to see whether the column can now fit on the regular data page. You can observe this behavior if you previously created the dbo.bigrows table for the earlier example and inserted only the one row with 2,100 characters in each column.

The following update reduces the size of the first column by 500 bytes, reducing the row size to 7,900 bytes, which should all fit on the one data page.

UPDATE bigrows SET a = replicate('a', 1600)

However, if you run the allocation query again, you'll see that there are two row-overflow pages. Now reduce the size of the first column by more than 1,000 bytes and run the allocation query once more.

UPDATE bigrows SET a = 'aaaaa'

You should see only three pages for the table now because there is no longer a row-overflow data page. The IAM for the row-overflow data pages has not been removed, but you will no longer have a data page for row-overflow data.

When a column is stored on a row-overflow page, SQL Server has to store a pointer to the row-overflow page from the regular data row, which includes the offset of the column data on the row-overflow page, which, as we've seen, takes 24 bytes in the original row.

Keep in mind that row-overflow data storage applies only to columns of variable-length data, which are defined to be no longer than the normal variable-length maximum of 8000 bytes per column. In addition, to store a variable-length column on a row-overflow page, you must meet the following conditions:

All the fixed-length columns, including overhead bytes, must add up to no more than 8060 bytes. (The pointer to each row-overflow column adds 24 bytes of overhead to the row.)
The actual length of the variable-length column must be more than 24 bytes.
The column must not be part of the clustered index key.

If you have single columns that might need to store more than 8,000 bytes, you should use either large object (text, image, or ntext) columns or use the varchar(MAX) data type.

Large Object Data

If a table contains LOB data (text, ntext, or image types), by default the actual data is not stored on the regular data pages. Like row-overflow data, LOB data is stored in its own set of pages, and the allocation query shows you pages for LOB data as well as pages for regular in-row data and row-overflow data. For LOB columns, SQL Server stores a 16-byte pointer in the data row that indicates where the actual data can be found. Although the default behavior is to store all the LOB data off the data row, SQL Server 2005 allows you to change the storage mechanism by setting a table option to allow LOB data to be stored in the data row itself. We'll talk about the default storage of LOB data now.

As stated earlier, by default no LOB data is stored in the data row. Instead, the data row contains only a 16-byte pointer to a page (or the first of a set of pages) where the data can be found. These pages are 8 KB in size, like any other page in SQL Server, and individual text, ntext, and image pages aren't limited to holding data for only one occurrence of a text, ntext, or image column. A text, ntext, or image page can hold data from multiple columns and from multiple rows; the page can even have a mix of text, ntext, and image data. However, one text or image page can hold only text or image data from a single table.

The collection of 8-KB pages that make up a LOB column aren't necessarily located next to each other. The pages are logically organized in a B-tree structure so operations starting in the middle of the LOB string are very efficient. The structure of the B-tree will vary slightly depending on whether the amount of data is less than or more than 32 KB. (See Figure 6-12 for the general structure.) I'll discuss B-trees in more detail when I talk about index internals in Chapter 7.

Figure 6-12. A text column pointing to a B-tree that contains the blocks of data

If the amount of data is less than 32 KB, the text pointer in the data row points to an 84-byte text root structure. This forms the root node of the B-tree structure. The root node points to the blocks of text, ntext, or image data. While the data for LOB columns is arranged logically in a B-tree, physically both the root node and the individual blocks of data are spread throughout LOB pages for the table. They're placed wherever space is available. The size of each block of data is determined by the size written by an application. Small blocks of data are combined to fill a page. If the amount of data is less than 64 bytes, it's all stored in the root structure.

If the amount of data for one occurrence of a LOB column exceeds 32 KB, SQL Server starts building intermediate nodes between the data blocks and the root node. The root structure and the data blocks are interleaved throughout the text and image pages. The intermediate nodes, however, are stored in pages that aren't shared between occurrences of text or image columns. Each page storing intermediate nodes contains only intermediate nodes for one text or image column in one data row.

LOB Data Stored in the Data Row

If you store all your LOB data outside of your data pages, every time you access that data SQL Server will need to perform additional page reads, just like it does for row-overflow pages. In some cases, you might notice a performance improvement by allowing some of the LOB data to be stored in the data row. You can enable a table option called text in row for a particular table by setting the option to 'ON' or by specifying a maximum number of bytes to be stored in the data row. The following command enables up to 500 bytes of LOB data to be stored with the regular row data in a table called employee:

sp_tableoption employee, 'text in row', 500

Instead of a number, you can specify the option 'ON' (including the quote marks), which sets the maximum size to 256 bytes. Note that the value is in bytes, not characters. For ntext data, each character needs 2 bytes so that any ntext column will be stored in the data row if it is less than 250 characters. Once you enable the text in row option, you never get just the 16-byte pointer for the LOB data in the row, as is the case when the option is not on. If the data in the LOB field is more than the specified maximum, the row will hold the root structure containing pointers to the separate chunks of LOB data. The minimum size of a root structure is 24 bytes, and the possible range of values that text in row can be set to is 24 to 7,000 bytes.

To disable the text in row option, you can set the value to either 'OFF' or 0. To determine whether a table has the text in row property enabled, you can use the OBJECTPROPERTY function, as shown here:

SELECT OBJECTPROPERTY (object_id('employee'), 'TableTextInRowLimit')

This function returns the maximum number of bytes allowed for storing LOBs in a data row. If a 0 is returned, the text in row option is disabled.

Let's create a table very similar to the one we created to look at row structures, but we'll change the varchar(250) column to the text data type. We'll use almost the same insert statement to insert one row into the table.

CREATE TABLE HasText ( Col1 char(3)       NOT NULL, Col2 varchar(5)    NOT NULL, Col3 text          NOT NULL, Col4 varchar(20)   NOT NULL ) INSERT HasText VALUES     ('AAA', 'BBB', REPLICATE('X', 250), 'CCC')

Now let's find the basic information for this table using the allocation query, and also look at the DBCC IND values for this table:

SELECT convert(char(7), object_name(object_id)) AS name,      partition_id, partition_number AS pnum, rows,      allocation_unit_id AS au_id, convert(char(17), type_desc) as page_type_desc,     total_pages AS pages FROM sys.partitions p JOIN sys.allocation_units a    ON p.partition_id = a.container_id WHERE object_id=object_id('dbo.HasText') DBCC IND (test, HasText, -1) name    partition_id      pnum rows  au_id             page_type_desc  pages ------- ----------------- ---- ----- ----------------- --------------- ----- HasText 72057594039435264 1    1     72057594044350464 IN_ROW_DATA     2 HasText 72057594039435264 1    1     72057594044416000 LOB_DATA        2 PageFID  PagePID  ObjectID   PartitionID        IAM_chain_type     PageType -------  -------  --------   -----------        --------------     -------- 1        2251    133575514   72057594039435264  LOB data           3 1        2264    133575514   72057594039435264  LOB data           10 1        2265    133575514   72057594039435264  In-row data        1 1        2266    133575514   72057594039435264  In-row data        10

You can see that there are two LOB pages (the LOB data page and the LOB IAM page) and two pages for the in-row data (again, the data page and the IAM page). The data page for the in-row data is 2265, and the LOB data is on page 2251. Figure 6-13 shows the output from running DBCC PAGE on page 2265. The row structure is very similar to the row structure in Figure 6-10, except for the text field itself. Bytes 21 to 36 are the 16-byte text pointer, and you can see the value cb08 starting at offset 29. When we reverse the bytes, it becomes 0x08cb, or 2251 decimal, which is the page containing the text data, as we saw in the DBCC IND output.

Figure 6-13. A row containing a text pointer

[View full width]

DATA: Slot 0, Offset 0x60, Length 40, DumpStyle BYTE Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP VARIABLE_COLUMNS Memory Dump @0x5CFEC060 00000000: 30000700 41414104 00400300 15002580

...AAA..@....%. 00000010: 28004242 420000a2 0f000000 00cb0800

(

.BBB........... 00000020: 00010001 00434343

....CCC

Now let's enable text data in the row, for up to 500 bytes:

EXEC sp_tableoption HasText, 'text in row', 500

Enabling this option does not force the text data to be moved into the row. We have to update the text value to actually force the data movement:

UPDATE HasText SET col2 = REPLICATE('Z', 250)

If you run DBCC PAGE on the original data page, you'll see the text column of 250 z's is now in the data row and the row is practically identical to the row containing varchar data that we saw in Figure 6-10.

Although enabling text in row does not move the data immediately, disabling the option does. If you turn off text in row, the LOB data moves immediately back onto its own pages, so you must make sure you don't turn this off for a large table during heavy operations.

A final issue when working with LOB data and the text in row option is when text in row is enabled but for some rows the LOB is longer than the maximum configured length. If you change the maximum length for text in row to 50 for the HasText table we've been working with, this also forces the LOB data for all rows with more than 50 bytes of LOB data to be moved off the page immediately, just as when you disable the option completely.

EXEC sp_tableoption HasText, 'text in row', 50

However, just setting the limit to a smaller value is different than disabling the option, in two ways. First, some of the rows might still have LOB data that is under the limit, and for those rows, the LOB data will be stored completely in the data row. Second, if the LOB data doesn't fit, the information stored in the data row itself will not simply be the 16-byte pointer, as would be the case if text in row were turned off. Instead, for LOB data that doesn't fit in the defined size, the row will contain a root structure for a B-tree that points to chunks of the LOB data. As long as the text in row option is not OFF (or 0), SQL Server will never store the simple 16-byte LOB pointer in the row. It will store either the LOB data itself, if it fits, or the root structure for the LOB data B-tree.

A root structure is at least 24 bytes long (which is why 24 is the minimum size for setting the text in row limit). Other information in the root includes:

Bytes 0 through 1: the type of column; 1 indicates a LOB root
Byte 2: level in the B-tree
Byte 3: unused
Bytes 4 through 7: a value used by optimistic concurrency control for cursors that increases every time a LOB is updated
Bytes 8 through 11: a random value used by DBCC CHECKTABLE that remains unchanged during the lifetime of each LOB
Bytes 12 through 23 and each succeeding group of 12 bytes in the column: links to LOB data on a separate page

As indicated earlier, when you first enable text in row, no data movement occurs until text data is actually updated. The same is true if the limit is increasedthat is, even if the new limit is large enough to accommodate LOB data that was stored outside the row, the LOB data will not be moved onto the row automatically. You must update the actual LOB data first.

Another point to keep in mind is that even if the amount of LOB data is less than the limit, the data will not necessarily be stored in the row. You're still limited to a maximum row size of 8,060 bytes for a single row on a data page, so the amount of LOB data that can be stored in the actual data row might be reduced if the amount of non-LOB data is large. In addition, if a variable-length column needs to grow, it might push LOB data off the page so as not to exceed the 8,060-byte limit. Growth of variable-length columns always has priority over storing LOB data in the row. If no variable-length char fields need to grow during an update operation, SQL Server will check for growth of in-row LOB data, in column offset order. If one LOB needs to grow, others might be pushed off the row.

Finally, you should be aware that SQL Server logs all movement of LOB data, which means that reducing the limit of or turning OFF the text in row option can be a very time-consuming operation for a large table.

Although large data columns using the LOB data types can be stored and managed very efficiently, using them in your tables can be problematic. Data stored as text, ntext, or image cannot always be manipulated using the normal data manipulation commands, and in many cases you'll need to resort to using the operations readtext, writetext, and updatetext, which require dealing with byte offsets and data-length values. Prior to SQL Server 2005, you had to decide whether to limit your columns to a maximum of 8,000 bytes or deal with your large data columns using different operators than you used for your shorter columns. SQL Server 2005 provides a solution that gives you the best of both worlds, as we'll see in the next section.

Storage of varchar(MAX) Data

SQL Server 2005 gives us the option of defining a variable-length field using the MAX specifier. Although this functionality is frequently described by referring only to varchar(MAX), the MAX specifier can also be used with nvarchar and varbinary. You can indicate the MAX specifier instead of an actual size when you define a column, variable, or parameter using one of these types. By using the MAX specifier, you leave it up to SQL Server to determine whether to store the value as a regular varchar, nvarchar, or varbinary value or as a LOB. In general, if the actual length is 8,000 bytes or less, the value will be treated exactly as if it were one of the regular variable-length data types, including possibly overflowing onto the special row-overflow pages as discussed earlier. If the actual length is greater than 8,000 bytes, SQL Server will store and treat the value exactly as if it were text, ntext, or image. Because variable-length columns with the MAX specifier are treated either as regular variable-length columns or as LOB columns, no special discussion of their storage is needed. You can see examples of working with varchar(MAX) data in Inside Microsoft SQL Server 2005: T-SQL Programming.

The size of values specified with MAX can reach the maximum size supported by LOB data, which is currently 2 gigabytes (GB). By using the MAX specifier, though, you are indicating that the maximum size should be the maximum the system supports. If, in the future, you upgrade a table with a varchar(MAX) column to some new SQL Server version, the MAX length will be whatever the new maximum is in the new version.

Storage of sql_variant Data

The sql_variant data type provides support for columns that contain any or all of the SQL Server base data types except LOBs and variable-length columns with the MAX qualifier, rowversion (timestamp), XML, and the types that can't be defined for a column in a table, namely cursor and table. For instance, a column can contain a smallint value in some rows, a float value in others, and a char value in the remainder.

This feature was designed to support what appears to be semi-structured data in products sitting above SQL Server. This semi-structured data exists in conceptual tables that have a fixed number of columns of known data types and one or more optional columns whose type might not be known in advance. An example is e-mail messages in Microsoft Outlook and Microsoft Exchange. With the sql_variant data type, you can pivot a conceptual table into a real, more compact table with sets of property-value pairs. Here is a graphical example: The conceptual table shown in Table 6-7 has three rows of data. The fixed columns are the ones that exist in every row. Each row can also have values for one or more of the three different properties, which have different data types.

Table 6-7. Conceptual Table with an Arbitrary Number of Columns and Data Types
Row	Fixed Columns	Property 1	Property 2	Property 3
row -1	XXXXXX	value-11		value -13
row -2	YYYYYY	value-22
row -3	ZZZZZZ	value-31	value-32

This can be pivoted into Table 6-8, where the fixed columns are repeated for each different property that appears with those columns. The column called value can be represented by sql_variant data and be of a different data type for each different property.

Table 6-8. Semi-Structured Data Stored Using the sql_variant Data Type
Fixed Columns	Property	Value
XXXXXX	property-1	value-11
XXXXXX	property-3	value-13
YYYYYY	property-2	value-22
ZZZZZZ	property-1	value-31
ZZZZZZ	property-2	value-32

Internally, columns of type sql_variant are always considered variable length. Their storage structure depends on the type of data, but the first byte of every sql_variant field always indicates the actual data type being used in that row.

I'll create a simple table with a sql_variant column and insert a few rows into it so we can observe the structure of the sql_variant storage.

CREATE TABLE variant (a int, b sql_variant) GO INSERT INTO variant VALUES (1, 3) INSERT INTO variant VALUES (2, 3000000000) INSERT INTO variant VALUES (3, 'abc') INSERT INTO variant VALUES (4, current_timestamp)

SQL Server decides what data type to use in each row based on the data supplied. For example, the 3 in the first INSERT is assumed to be an integer. In the second INSERT, the 3000000000 is larger than the biggest possible integer, so SQL Server assumes a decimal with a precision of 10 and a scale of 0. (It could have used a bigint, but that would need more storage space.) We can now use DBCC IND to find the first page of the table and use DBCC PAGE to see its contents:

DBCC IND (test, variant, -1) -- (I got a value of file 1, page 2508 for the data page in this table) GO DBCC TRACEON (3604) DBCC PAGE (test, 1, 2508, 1)

Figure 6-14 shows the contents of the four rows. I won't go into the details of every single byte because most are the same as what we've already examined.

Figure 6-14. Rows containing sql_variant data

[View full width]

DATA: Slot 0, Offset 0x60, Length 21, DumpStyle BYTE Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP VARIABLE_COLUMNS Memory Dump @0x5C07C060 00000000: 30000800 01000000 0200fc01 00150038

..............8 00000010: 01030000 00

..... Slot 1, Offset 0x75, Length 24, DumpStyle BYTE Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP VARIABLE_COLUMNS Memory Dump @0x5C07C075 00000000: 30000800 02000000 0200fc01 0018006c

..............l 00000010: 010a0001 005ed0b2

....^.. Slot 2, Offset 0x8d, Length 26, DumpStyle BYTE Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP VARIABLE_COLUMNS Memory Dump @0x5C07C08D 00000000: 30000800 03000000 0200fc01 001a00a7

............... 00000010: 01401f08 d0003461 6263

@....4abc Slot 3, Offset 0xa7, Length 25, DumpStyle BYTE Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP VARIABLE_COLUMNS Memory Dump @0x5C07C0A7 00000000: 30000800 04000000 0200fc01 0019003d

..............= 00000010: 011bd4e6 00bf9700 00

........ --------------------------------------------------

------------------ 00000000: 30000800 01000000 0200fc01 00150038

01030000 00 00000000: 30000800 02000000 0200fc01 0018006c

010a0001 005ed0b2 00000000: 30000800 03000000 0200fc01 001a00a7

01401f08 d0003461 6263 00000000: 30000800 04000000 0200fc01 0019003d

011bd4e6 00bf9700 00

The difference between the three rows starts at bytes 13 to 14, which indicate the position where the first variable-length column ends. Because there is only one variable-length column, this is also the length of the row. The sql_variant data begins at byte 15. Byte 15 is the code for the data type. You can find the codes in the system_type_id column of the sys.types catalog view. I've reproduced the relevant part of that view here:

system_type_id name -------------- ---------------- 34             image 35             text 36             uniqueidentifier 48             tinyint 52             smallint 56             int 58             smalldatetime 59             real 60             money 61             datetime 62             float 98             sql_variant 99             ntext 104            bit 106            decimal 108            numeric 122            smallmoney 127            bigint 165            varbinary 167            varchar 167            tid 167            id 173            binary 175            char 175            empid 189            timestamp 231            sysname 231            nvarchar 239            nchar

In our table, we have the data types 38 hex (which is 56 decimal, which is int), 6C hex (which is 108 decimal, which is numeric), A7 hex (which is 167 decimal, which is varchar), and 3D hex (which is 61 decimal, which is datetime). Following the byte for data type is a byte representing the version of the sql_variant format, and that is always 1 in SQL Server 2005. Following the version, there can be one of the following four sets of bytes:

For numeric and decimal: 1 byte for the precision and 1 byte for the scale
For strings: 2 bytes for the maximum length and 4 bytes for the collation ID
For binary and varbinary: 2 bytes for the maximum length
For all other types: no extra bytes

These bytes are then followed by the actual data in the sql_variant column.

There are many other issues related to working with sql_variant data and comparing data of different types. Some of the issues are discussed in Inside Microsoft SQL Server 2005: T-SQL Programming.

Figure 6-5. Basic catalog information stored after a table is created

The sys.indexes Catalog View

Data Storage Metadata

Figure 6-6. The relationship between sys.indexes, sys.partitions, and sys.allocation_units

Querying the Catalog Views

Data Pages

Figure 6-7. The structure of a data page

Page Header

Table 6-4. Information Contained in the Page Header

Data Rows for In-Row Data

Row Offset Array

Examining Data Pages

Table 6-5. Parameters of the DBCC PAGE Command

Figure 6-8. Sample output from DBCC PAGE

The Structure of Data Rows

Figure 6-9. The structure of data rows

Table 6-6. Information Stored in a Table's Data Rows

Column Offset Arrays

Storage of Fixed-Length Rows

Figure 6-10. A data row containing all fixed-length columns

Storage of Variable-Length Rows

Figure 6-11. A data row with variable-length columns

Page Linkage

Row-Overflow Data

Large Object Data

Figure 6-12. A text column pointing to a B-tree that contains the blocks of data

LOB Data Stored in the Data Row

Figure 6-13. A row containing a text pointer

Storage of varchar(MAX) Data

Storage of sql_variant Data

Table 6-7. Conceptual Table with an Arbitrary Number of Columns and Data Types

Table 6-8. Semi-Structured Data Stored Using the sql_variant Data Type

Figure 6-14. Rows containing sql_variant data