Page Cache Management
SQLite maintains a separate page cache for each open database file. If a thread opens the same file more than once, the pager creates and initializes a page cache for the file only at the first open call; subsequent opens by the same thread reuse that cache. If two or more threads open the same file, however, each thread gets its own independent cache for the file. In-memory databases are not backed by any external storage device, but they are treated like ordinary native files and are stored entirely within the cache. Thus, the B/B+-tree module uses the same interface to access either type of database.
Page caches reside in the application's memory space. Note that the same pages may also be cached by the native operating system: when an application reads a piece of data from a file, the operating system normally makes its own copy of the data first, and then copies it into the application's address space. We are not concerned here with how the operating system manages its own cache; SQLite's page cache organization and management are independent of the native operating system's.
Management of the page cache is crucial for system performance. The following subsections discuss how the pager organizes and maintains the page cache, and how cache clients read and modify cache elements.
In general, to speed up searching a cache, the items currently held in the cache are kept well organized. SQLite uses a hash table to organize cached pages, and uses page slots to hold pages in the table. The cache is fully associative, that is, any slot can store any page. The hash table is initially empty. As page demand increases, the pager creates new slots and inserts them into the hash table. There is a maximum limit on the number of slots a cache can have. (In-memory databases have no such limit, as long as the native operating system allows the application space to grow.)
Figure 3-2 depicts a typical cache layout. Each page in the hash table is represented by a header object of type PgHdr. The page image is stored immediately after the PgHdr object. The image is followed by a piece of private data that the B+-tree module uses to keep page-specific control information. (In-memory databases have no journal file, so their recovery information is recorded in in-memory objects. Pointers to those objects are stored after the private part; these pointers are used only by the pager.) This additional, non-page space is initialized to zeros when the pager brings the page into the cache. All pages in the cache are accessible via a hash array, namely aHash in the Pager object; the array size is fixed when the SQLite library is compiled. Each array element points to a "bucket" of pages; pages in each bucket are organized in an unordered doubly linked list.
The PgHdr object is visible only to the pager module, not to the B+-tree and higher-level modules. The header has many control variables. The pgno variable identifies the page number of the database page it represents. The injournal variable is true if the page has already been written to the rollback journal. The needSync variable is true if the journal needs a flush before this page can be written back to the database file. (A flush on a file transfers modified parts of the file to the disk surface.) The dirty variable is true if the page has been modified and the new value has not yet been written back to the database file. The inStmt variable is true if the page is in the current statement journal. The nRef variable is the reference count on the page. If the nRef value is greater than zero, the page is in active use, and we say that the page is pinned down; otherwise, the page is unpinned and free. There are many pointer variables in the PgHdr object (not all of them are shown in the figure). The pNextHash and pPrevHash pointers link together pages in the same hash bucket. The pNextStmt and pPrevStmt pointers link together pages that are in the statement journal. (I talk about the statement journal later.) The pNextFree and pPrevFree pointers link together all free pages. Free pages are never taken out of hash buckets, even though the figure may seem to suggest otherwise. All pages (free or not) in the cache are linked together through the pNextAll pointer. The pDirty pointer links together all dirty pages. Note that a free page can be dirty, too.
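The header fields described above can be sketched as a C struct. The field names follow the text, but this is an illustrative reduction; the real PgHdr in SQLite's pager.c has more fields and a different layout:

```c
typedef unsigned int Pgno;   /* database page number, 1-based */

/* Illustrative sketch of a page header; not the real definition. */
typedef struct PgHdr PgHdr;
struct PgHdr {
  Pgno pgno;           /* page number of the database page held in this slot */
  int inJournal;       /* true once the original page image is in the rollback journal */
  int needSync;        /* true if the journal must be flushed before writing this page back */
  int dirty;           /* true if the in-cache image differs from the database file */
  int inStmt;          /* true if the page is in the current statement journal */
  int nRef;            /* reference count; greater than zero means the page is pinned */
  PgHdr *pNextHash, *pPrevHash;   /* neighbors in the same hash bucket */
  PgHdr *pNextStmt, *pPrevStmt;   /* links among pages in the statement journal */
  PgHdr *pNextFree, *pPrevFree;   /* links among unpinned (free) pages */
  PgHdr *pNextAll;                /* links all pages in the cache */
  PgHdr *pDirty;                  /* links all dirty pages */
  /* The page image, followed by the B+-tree module's private data,
  ** is stored in memory immediately after this header. */
};
```

Because the image and private data sit directly after the header, the pager can hand the B+-tree module a pointer just past the PgHdr while keeping the header itself invisible to it.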
The cache is referenced by using the page number as the search key. To read a page, the client B+-tree module invokes the sqlite3pager_get pager API function with the page number. The function performs the following steps for a page P:
It searches the cache space.
It applies a hash function on P and gets an index. (SQLite uses a very simple hash function to determine the index value: page number modulo the size of the aHash array.)
It uses the index to select an element of the aHash array, obtaining a hash bucket.
It searches the bucket by chasing the pNextHash pointers. If P is found there, we say a cache hit has occurred. It pins down (i.e., increments the nRef value by 1) the page and returns the address of the page to the caller.
If P is not found in the cache, a cache miss has occurred. The function looks for a free slot that can be used to load the desired page. (If the cache has not yet reached the maximum limit, it creates a new free slot instead.)
If no free slot is available or can be created, it selects a slot whose current page can be released so that the slot can be reused. This slot is called the victim slot.
If the page in the victim slot is dirty, it writes the page to the database file first. (Following the write-ahead log (WAL) protocol, it flushes the journal file before the write.)
It reads page P from the database file into the free slot, pins down the page (i.e., it sets the nRef value to 1), and returns the address of the page to the caller. If P is greater than the current maximum page number in the file, it does not read the page; instead, it initializes the page image to zeros. It also initializes the trailing private part to zeros, whether or not it reads the page from the file.
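The lookup portion of these steps (hash, bucket scan, pin on a hit) can be sketched as follows. The structure and function names here are simplified stand-ins; the real pager code also handles slot creation, victim selection, and file I/O on a miss:

```c
#include <stddef.h>

typedef unsigned int Pgno;

typedef struct Page Page;
struct Page {
  Pgno pgno;         /* page number this slot holds */
  int nRef;          /* reference count; > 0 means pinned */
  Page *pNextHash;   /* next page in the same hash bucket */
};

#define N_HASH 2048            /* bucket count, fixed at compile time like aHash */

typedef struct Cache {
  Page *aHash[N_HASH];         /* each element heads an unordered bucket list */
} Cache;

/* SQLite's hash function is simply the page number modulo the table size. */
static size_t page_hash(Pgno pgno) {
  return pgno % N_HASH;
}

/* Search the cache for page pgno.  On a hit, pin the page and return it;
** on a miss, return NULL (the real pager would then find or create a free
** slot and read the page from the database file). */
Page *cache_get(Cache *pCache, Pgno pgno) {
  Page *p = pCache->aHash[page_hash(pgno)];
  while (p && p->pgno != pgno) p = p->pNextHash;   /* chase pNextHash */
  if (p) p->nRef++;            /* cache hit: pin the page down */
  return p;                    /* NULL signals a cache miss */
}

/* Insert a page slot into its hash bucket; buckets are unordered,
** so inserting at the head is fine. */
void cache_put(Cache *pCache, Page *pPg) {
  size_t h = page_hash(pPg->pgno);
  pPg->pNextHash = pCache->aHash[h];
  pCache->aHash[h] = pPg;
}
```

Because the hash is a plain modulo, pages that are consecutive in the file land in consecutive buckets, which keeps bucket chains short for typical sequential access patterns.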
SQLite strictly follows a fetch-on-demand policy, which keeps the page fetch logic very simple.
When a page address is returned to the client B+-tree module, the pager does not know when the client actually works on the page. SQLite follows this standard protocol for each page: the client acquires the page, uses the page, and then releases the page. After acquiring a page, the client can directly modify the content of the page, but it must call the sqlite3pager_write pager API function on the page prior to making any modifications there. On return from the call, the client can update the page in place.
The first time the write function is called on a page, the pager writes the original content of the page into the rollback journal file as part of a new log record, and sets the page's injournal and needSync flags. Later, when the log record is flushed onto the disk surface, the pager clears the needSync flag. (SQLite follows the WAL protocol: it does not write a modified page back into the database file until the corresponding needSync flag has been cleared.) Every time the write function is called on a page, the dirty flag is set; it is cleared only when the pager writes the page content back to the database file. Because the pager does not know when the client modifies a page, updates on the page are not immediately propagated to the database file. The pager follows a delayed-write (write-back) page update policy: updates are propagated to the database file only when the pager performs a cache flush or selectively recycles dirty pages.
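Under these rules, the flag transitions along the write path can be sketched like this. The journal_write and journal_flush helpers are hypothetical stand-ins for the pager's journal I/O, and the real sqlite3pager_write does considerably more:

```c
typedef struct Page {
  int inJournal;   /* original image already logged in the rollback journal */
  int needSync;    /* journal record not yet flushed to disk */
  int dirty;       /* in-cache image newer than the database file */
} Page;

/* Hypothetical stand-ins for the pager's journal I/O. */
static void journal_write(Page *pPg) { (void)pPg; /* append the original page image */ }
static void journal_flush(void) { /* fsync the journal file */ }

/* Called by the client before it modifies a page
** (sqlite3pager_write in SQLite). */
void pager_write(Page *pPg) {
  if (!pPg->inJournal) {
    journal_write(pPg);        /* log the original content first (WAL protocol) */
    pPg->inJournal = 1;
    pPg->needSync = 1;         /* cleared later, when the journal is flushed */
  }
  pPg->dirty = 1;              /* set on every write call */
}

/* Write a dirty page back to the database file (delayed, write-back policy). */
void pager_write_back(Page *pPg) {
  if (pPg->needSync) {
    journal_flush();           /* the journal must reach disk before the page */
    pPg->needSync = 0;
  }
  /* ... write the page image to the database file ... */
  pPg->dirty = 0;
}
```

Note that only the first write call on a page pays the journaling cost; subsequent writes merely keep the dirty flag set until the page is written back.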
Cache replacement refers to the activity that takes place when a cache becomes full, and old pages are removed from the cache to make room for new ones. As mentioned in the "Cache read" section, when a requested page is not in the cache and a free slot is not available in the cache, the pager victimizes a slot for replacement. You may recall that the page cache is fully associative, that is, any slot is good for the new page. The slot that is victimized is dictated by the cache replacement policy. SQLite follows a kind of least recently used (LRU) cache replacement policy.
SQLite organizes free pages in a logical queue. When a page is unpinned, the pager appends the page at the tail end of the queue. The victim is chosen from the head end of the queue, but it may not always be the head element, as it would be in pure LRU. Starting at the head, SQLite tries to find a slot on the queue whose recycling would not require flushing the journal file. (Following the WAL protocol, before writing a dirty page to the database file, the pager flushes the journal file. Flushing is a slow operation, and SQLite postpones it as long as possible.) If one or more such slots are found, the one nearest the head is recycled. Otherwise, SQLite flushes the journal file first, and then recycles the slot at the head of the queue. If the victim page is dirty, the pager writes the page to the database file before recycling it.
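This modified-LRU scan can be sketched as follows. The free list here is a simple singly linked queue, a simplification of the doubly linked pNextFree/pPrevFree list described earlier, and pick_victim is a hypothetical name:

```c
#include <stddef.h>

typedef struct Page Page;
struct Page {
  int dirty;        /* page must be written back before the slot is reused */
  int needSync;     /* journal must be flushed before this page can be written back */
  Page *pNextFree;  /* next page toward the tail of the free queue */
};

/* Pick a victim from the free queue (head = least recently unpinned).
** Prefer the first slot whose recycling needs no journal flush: a clean
** page needs no write at all, and a dirty page with needSync clear can be
** written without a flush.  If every free page would force a flush, report
** that a flush is required and fall back to the head slot. */
Page *pick_victim(Page *pHead, int *pNeedFlush) {
  Page *p;
  for (p = pHead; p; p = p->pNextFree) {
    if (!(p->dirty && p->needSync)) {   /* recyclable without a flush */
      *pNeedFlush = 0;
      return p;
    }
  }
  *pNeedFlush = 1;     /* flush the journal, then recycle the head slot */
  return pHead;
}
```

The scan trades a little LRU accuracy for fewer fsync calls: a slightly younger slot is preferred over the head whenever recycling the head would force a journal flush.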
Pinned pages are currently in active use and cannot be recycled. To avoid the scenario in which all pages in the cache are pinned down, SQLite needs a minimum number of slots in the cache so that it always has some cache slot(s) to recycle; the minimum value is 10 as of the SQLite 3.3.6 release.