Table and Index Management > Page Structure
You may recall that each database file is divided into fixed size pages. The B+-tree module manages all those pages. Each page is either a tree page (internal or leaf), or an overflow page, or a free page. (Figure 2-1 in the "Database File Structure" section shows that free pages are organized into a single trunk list.) In this section, we study structures of internal, leaf, and overflow pages.
The logical content of each tree page is partitioned into what are called cells: a cell contains a (part of a) payload. Cells are units of space allocation and deallocation on tree pages. The format of a tree page is given in Figure 5-3.
Each tree page is divided into four sections:
The page header
The cell content area
The cell pointer array
The cell pointer array and the cell content area grow toward each other. The cell pointer array acts as the page-directory that helps map logical cell order to their physical cell storage in the cell content area.
A page header contains the management information only for that page. The header is always stored at the beginning of page (low address). (Page 1 is an exception: the first 100 bytes contain the file header.) The structure of the page header is given in Table 5-1. The first two columns are in bytes.
|0||1||Flags. 1: intkey, 2: zerodata, 4: leafdata, 8: leaf|
|1||2||Byte offset to the first free block|
|3||2||Number of cells on this page|
|5||2||First byte of the cell content area|
|7||1||Number of fragmented free bytes|
|8||4||Right child (the Ptr(n) value). Omitted on leaves.|
The flags at offset 0 define the format of the page. If the leaf flag bit is true, it means that the page is a leaf node and has no children. If the zerodata flag bit is true, it means that this page carries only key values and no data. If the intkey flag bit is true, it means that the key is an integer that is stored in the key size entry in the cell header rather than in the payload area (see later in this section). If the leafdata flag bit is true, it means that the tree stores data on leaf nodes only. For internal pages, the header also contains the rightmost child pointer at offset 8.
Cells are stored at the very end of the page (high address), and they grow toward the beginning of the page. The cell pointer array begins on the first byte after the page header, and it contains zero or more cell pointers. Each cell pointer is a 2-byte integer number indicating an offset (from the beginning of the page) to the actual cell within the cell content area. The cell pointers are stored in sorted order (by the corresponding key values), even though cells may be stored unordered. The number of elements in the array is stored in the page header at offset 3.
Because of random inserts and deletes of cells, a page may have cells and free space interspersed (inside the cell content area). The unused space within the cell content area is collected into a linked list of free blocks. The blocks on the list are arranged in ascending order of their addresses. The head pointer (a 2-byte offset) to the list originates in the page header at offset 1. Each free block is at least 4 bytes in size. The first 4 bytes on each free block store control information: the first 2 bytes for the pointer to the next free block (a zero value indicates no next free block), and the other 2 bytes for the size of this free block. Because a free block must be at least 4 bytes in size, any group of 3 or fewer unused bytes (called a fragment) in the cell content area cannot exist on the free block chain. The cumulative size of all fragments is recorded in the page header at offset 7. (The size can be at most 255. Before it reaches the max value, the page is defragmented.) The first byte of the cell content area is recorded in the page header at offset 5. The value acts as the border between the cell content area and the unallocated space.
Cells are variable length byte strings. A cell stores a single payload. The structure of a cell is given in Table 5-2. The size column is in bytes.
|4||Page number of the left child. Omitted if leaf flag bit is set.|
|var(1 9)||Number of bytes of data. Omitted if the zerodata flag bit is set.|
|var(1 9)||Number of bytes of key. Or the key itself if intkey flag bit is set.|
|4||First page of the overflow chain. Omitted if no overflow.|
For internal pages, each cell contains a 4-byte child pointer; for leaf pages, cells do not have child pointers. Next comes the number of bytes in the data image, and the number of bytes in the key image. (The key image does not exist if the intkey flag bit is set in the page header. The key length bytes instead store the integer key value. The data image is nonexistent if the zerodata flag is set in the page header.) Figure 5-4 depicts the cell format: (a) structure of a cell, and (b) structure of payloads. The key or the data or both may be absent in a payload.
SQLite uses variable length integers to represent integer sizes (and integer key values). A variable length integer is 1 to 9 bytes long, where the lower 7 bits of each byte are used for the integer value calculation. The integer consists of all consecutive bytes that have bit 8 set and the first byte with bit 8 clear. The most significant byte of the integer appears first. A variable-length integer is at most 9 bytes long. As a special case, all 8 bits of the 9th byte are used in the value determination. This representation allows SQLite to encode a 64-bit integer in at most 9 bytes. It is called the Huffman code, and it was invented in 1952. SQLite prefers the variable length Huffman encoding over fixed length 64-bit encoding because only one or two bytes are required to store the most common cases, but up to 64-bits of information can he encoded (in 9 bytes) if needed.
You may recall that SQLite puts a restriction in storing payloads on a tree page. A payload might not be stored in its entirety on a page even if there is sufficient free space on the page. The maximum embedded payload fraction is the amount of the total usable space in a page that can be consumed by a single cell of an internal page. This value is available in the database file header at offset 21 (see Table 2-1 in the "Database File Structure" section). If the payload of a cell on an internal page is larger than the max payload, then extra payload is spilled into a chain of overflow pages. Once an overflow page is allocated, as many bytes as possible are moved into the overflow pages without letting the cell size drop below the min embedded payload fraction value (in the database file header at offset 22). The min leaf payload fraction (in the file header at offset 23) is like the min embedded payload fraction, except that it applies to leaf pages. The maximum payload fraction for a leaf page is always 100 percent, and it is not specified in the header.
Multiple small entries can fit on a single tree page, but a large entry can span multiple overflow pages. Overflow pages (for a cell) form a singly linked list. Each overflow page except the last one is completely filled with data of length equal to usable space minus four bytes: the first four bytes store the next overflow page number. The last overflow page can have as little as one byte of data. An overflow page never stores content from two cells.