Transaction Management > Transactional Operations
Like any other DBMS, SQLite's transaction management has two components: (1) normal processing, and (2) recovery processing. During normal processing, the pager saves recovery information in journal files, and it uses the saved information at the recovery processing time if there is a need.
Normal processing involves reading pages from and writing pages into database files, and committing transactions and subtransactions. In addition, the pager flushes the page-cache as a part of normal processing work.
Most transactions and statement subtransactions commit themselves. But occasionally, some transactions or statements abort themselves. In rare cases, there are applications and system failures. In either case, SQLite may need to recover the database into an acceptable consistent state by performing some rollback actions. In the former two cases (statement and transaction aborts), in-memory reliable information might be available at the time of recovery. In the latter case (crashes), the database may be corrupt, and there is no in-memory information.
To act on a database page, the client B+-tree module needs to apply the sqlite3pager_get function on the page number. The client needs to invoke the function, even if the page is nonexistent in the database file: the page will be created by the pager. The function obtains a shared lock on the file if a shared or stronger lock has not already been obtained on the file. If it fails to obtain the shared lock, then some other transaction holds an incompatible lock, and it returns SQLITE_BUSY error code to the caller. Otherwise, it performs a cache read operation (see the "Cache read" section), and returns the page to the caller. The cache read operation pins down the page.
The first time the pager acquires the shared lock on a database file, we say it has started an implicit read-transaction on the file for the caller. At this point it determines whether or not the file needs a recovery. If the file does need a recovery, the pager performs the recovery before returning the page to the caller. I discuss this in the "Recovery from failures" section.
Before modifying a page, the client B+-tree module must have already pinned down the page (by invoking the sqlite3pager_get function). It applies the sqlite3pager_write function on the page to make the page writable. The first time the sqlite3pager_write function is called on a (any) page, the pager needs to acquire a reserved lock on the database file. If the pager is unable to obtain the lock, it means that another transaction already has a reserved or stronger lock on the file. In that case, the write attempt fails, and the pager returns SQLITE_BUSY to the caller.
The first time the pager acquires a reserved lock, we say it escalates the read-transaction into a write-transaction. At this point, the pager creates and opens the rollback journal. It initializes the first segment header record (see Figure 4-4), notes down the original size of the database file in the record, and writes the record in the journal file.
To make a page writable, the pager writes the original content of that page (in a new log record) into the rollback journal. Once writable, the client can modify the page as many times as it wants without informing the pager. Changes to the page are not written to the database file immediately.
Once a page image is copied into the rollback journal, the page will never be in a new log record, even if the current transaction invokes the write function on the page multiple times. One nice property of this logging is that a page can be restored by copying the contents from the journal. Further, the undo is idempotent, and hence undos do not produce any compensating log records.
SQLite never saves a new page (that is added, i.e., appended, to the database file by the current transaction) in the journal because there is no old value for the page. Instead, the journal notes the initial size of the database file in the journal segment header record (see Figure 4-4) when the journal file is created. If the database file is expanded by the transaction, the file will be truncated to its original size on a rollback.
The cache flush is an internal operation of the pager module; the module client can never invoke a cache flush directly. Eventually, the pager wants to write some modified pages back to the database file, either because the cache has filled up and there is a need for cache replacement, or because the transaction is ready to commit its changes. The pager performs the following steps:
It determines whether there is a need to flush the journal file. If the transaction is not asynchronous, and has written new data in the journal file, and the database is not a temporary file,[*] then the pager decides to do a journal flush. In that case, it makes a fsync system call on the journal file to ensure that all log records written so far have actually reached the surface of the disk (and are not just being held in the operating system space or the disk controller's internal memory). At this time, the pager does not write the number of log records (nRec) value in the current log segment header. (The nRec value is a precious resource for rollback operation. When the segment header is formed, the number is set to zero for synchronous transactions and to 0xFFFFFFFF for asynchronous ones.) After the journal is flushed, the pager writes the nRec value in the current log segment header, and flushes the file again. As disk writes are not atomic, it will not rewrite the nRec field any more. It instead creates a new log segment for the new oncoming log records. In these scenarios, SQLite uses multisegment journal files.
[*] For temporary databases, we do not care if we are able to roll back after a power failure, so no journal flush occurs.
 The journal file is flushed twice. The second flush leads to an overwrite of the disk block that stores the nRec field. If we assume that this overwrite is atomic, then we are guaranteed that the journal will not be corrupted at this point of flushing. Otherwise, we are at some minor risk.
It tries to obtain an EXCLUSIVE lock on the database file. (If other transactions are still holding SHARED locks, the lock attempt fails, and it returns SQLITE_BUSY to the caller. The transaction is not aborted.)
It writes all modified pages or selective ones out to the database file. The page writing is done in-place. It marks cache copies of these pages clean. (It does not flush the database file to disk at this time.)
If the reason for writing to the database file is because the page cache is full, then the pager does not commit the transaction right away. Instead, the transaction might continue to make changes to other pages. Before subsequent changes are written to the database file, these three steps are repeated once again.
The EXCLUSIVE lock the pager has obtained to write to the database file is held until the transaction is complete. It means that other applications will not be able to open another (read or write) transaction on the database from the time the pager first writes the database file until the transaction commits or aborts. For short transactions, updates are held in-cache, and the exclusive lock will be acquired only at the commit time for the duration of the commit.
SQLite follows a slightly different commit protocol depending on whether the committing transaction modifies a single database or multiple databases.
Committing a read-transaction is easy. The pager releases the shared lock from the database file, and purges the page cache. To commit a write-transaction, the pager performs the following steps in the order listed:
It obtains an EXCLUSIVE lock on the database file. (If the lock acquisition fails, it returns SQLITE_BUSY to the caller. It cannot commit the transaction now, as other transactions are still reading the database.) It writes all modified pages back to the database file following the algorithmic steps 1 through 3 given in the "Cache flush" section.
The pager makes an fsync system call on the database file to flush the file to the disk.
It then deletes the journal file.
Finally, it releases the EXCLUSIVE lock from the database file, and purges the page cache.
The transaction commit point occurs at the instant the rollback journal file is deleted. Prior to that, if a power failure or crash occurs, the transaction is considered to have failed during the commit processing. The next time SQLite reads the database, it will roll back the transaction's effects from the database.
The commit protocol is a little more involved, and it resembles a transaction commit in distributed systems. The VM (virtual machine) module actually drives the commit protocol as the commit coordinator. Each pager does its own part of "local" commit on its database. For a read-transaction or a write-transaction that modifies a single database file (the temp database is not counted), the protocol executes a normal commit on each database involved. If the transaction modifies more than one database file, the following commit protocol is performed:
Release the SHARED lock from those databases the transaction did not update.
Acquire EXCLUSIVE locks on those databases the transaction has updated.
Create a new master journal file. Fill the master journal with the names of all individual rollback journal files, and flush the master journal and the journal directory to disk. (The temp database name is not included in the master journal.)
Write the name of the master journal file into all individual rollback journals in a master journal record (see Figure 4-6, earlier), and flush the rollback journals. (A pager may not know that it has been a part of multidatabase transaction until the transaction commit time. Only at this point it comes to know that it is a part of a multidatabase transaction.)
Flush individual database files.
Delete the master journal file and flush the journal directory.
Delete all individual rollback journal files.
Release EXCLUSIVE locks from all database files, and purge those page-caches.
The transaction is considered to have been committed when the master journal file is deleted. Prior to that, if a power failure or crash occurs, the transaction is considered to have failed during the commit processing. When SQLite reads these databases next time, it will recover them to their respective states prior to the start of the transaction.
Normal operations at statement subtransaction level are read, write, and commit. These are discussed below.
A statement subtransaction reads pages through the main user transaction. All rules are followed as those for the main transaction.
There are two parts in a write operation: locking and logging. A statement subtransaction acquires locks through the main user transaction. But statement logging is a little different, and is handled by using a separate temporary statement journal file. SQLite writes some log records in the statement journal, and some in the main rollback journal. The pager performs one of the following two alternative actions when a subtransaction tries to make a page writable via the sqlite3pager_write operation:
If the page is not already in the rollback journal, it adds a new log record to the rollback journal.
Otherwise, it adds a new log record to the statement journal if the page is not already in this journal. (The pager creates the journal file when the transaction writes the first log record in the file.)
The pager never flushes a statement journal, because this is never required for failure recovery. If a power loss occurs, the main rollback journal will take care of database recovery. You may note that an in-cache page can be a part of both the rollback journal and the statement journal: the rollback journal has the oldest page image.
Statement commit is very simple. The pager deletes the statement journal file.
Recovery from abort is very simple in SQLite. The pager may or may not need to remove the effects of the transaction from the database file. If the transaction holds only a RESERVED or PENDING lock on the database, it is guaranteed that the file is not modified; the pager deletes the journal file, and discards all dirty pages from the page cache. Otherwise, some pages are written back in the database file by the transaction, and the pager performs the following rollback actions:
The pager reads log records one after another from the rollback journal file, and restores the page images from the records. (You may recall that a database page is logged at most once by the transaction and the log record stores the before image of the page.) Thus, at the end of the journal scan, the database is restored to its original state prior to the start of the transaction. If the transaction has expanded the database, the pager truncates the database to the original size. It (the pager) then flushes the database file first and deletes the rollback journal file next. It releases the exclusive lock, and purges the page cache.
As noted in the "Statement operations" section, a statement may have added log records to both the rollback journal and the statement journal. SQLite needs to roll back all log records from the statement journal, and some from the rollback journal. When a statement subtransaction writes the first log record in the rollback journal, the pager keeps a note of the record position in an in-memory data structure. This record onward, until the end of the rollback journal file, is written by the subtransaction. The pager restores the page images from those log records. It then deletes the statement journal file, but keeps the rollback journal file without changing the content. When a statement subtransaction starts, the pager also records the database size. The pager truncates the database file to that size.
After a crash or system failure, inconsistent data may have been left in a database file. When no application is updating a database, but there is the rollback journal file, it means that the previous transaction may have failed, and SQLite may need to recover the database from the effects of the transaction before the database can be used for normal business. A rollback journal file is said to be hot if the corresponding database file is unlocked or shared locked. A journal becomes hot when a write-transaction is midway toward its completion and a failure prevents the completion. However, a rollback journal is not hot if it is produced by a multidatabase transaction and there is no master journal file; this implies that the transaction is committed by the time the failure had occurred. A hot journal implies that it needs to be rolled back to restore the database consistency.
If no master journal is involved, then a rollback journal is hot if it exists and the database file does not have a reserved or stronger lock. (You may recall that a transaction with a reserved lock creates a rollback journal file; this file is not hot.) If a master journal name appears in the rollback journal, then the rollback journal is hot if the master journal exists and there is no reserved lock on the corresponding database file.
When a database starts, in most DBMSs, the transaction manager initiates a recovery operation on the database immediately. SQLite uses a deferred recovery. As explained in the "Read operation" section, when the first read of a (any) page from the database is performed, the pager goes through the recovery logic and recovers the database only if the rollback journal is hot.
It performs the following sequence of recovery steps before it actually reads a page from the file:
It obtains a SHARED lock on the database file. If it cannot get the lock, it returns SQLITE_BUSY to the application.
It checks if the database has a hot journal. If the database does not have a hot journal, the recovery operation finishes. If there is a hot journal, the journal is rolled back by the subsequent steps of this algorithm.
It acquires an EXCLUSIVE lock (via a PENDING lock) on the database file (see Figure 4-1). (The pager does not acquire a RESERVED lock because that would make other pagers think the journal is no longer hot and read the database. It needs an exclusive lock because it is about to write to the database file as a part of the recovery work.) If it fails to acquire these locks, it means another pager is already trying to do the rollback. In that case, it drops all locks, closes the database file, and returns SQLITE_BUSY to the application.
It reads all log records from the journal file and undoes them. This restores the database to its original state prior to the start of the crashed transaction, and hence, the database is in a consistent state now.
It flushes the database file. This protects the integrity of the database in case another power failure or crash occurs.
It deletes the rollback journal file.
It deletes the master journal file if it is safe to do so. (This step is optional. See the next sidebar for details.)
It drops the EXCLUSIVE (and PENDING) locks, but retains the SHARED lock. (This is because the pager performs the recovery in the sqlite3pager_get function.)
After the preceding algorithm completes successfully, the database file is guaranteed to have been restored to the state as of the start of the failed transaction. It is safe to read from the file now.
A master journal is stale if no individual rollback journal is pointing to it. To figure out whether a master journal is stale, the pager first reads the master journal to obtain the names of all rollback journals. It then examines each of those rollback journals. If any of them exists and points back to the master journal, then the master journal is not stale. If all rollback journals are either missing or refer to other master journals or to no master journal at all, then the master journal is stale, and the pager deletes the journal. There is no requirement that stale master journals be deleted. The only reason for doing so is to free up disk space occupied by them.
To reduce workload at failure recovery time, most DBMSs perform checkpoints on database at regular intervals. You may recall that SQLite can have at most one write-transaction on a database file at a time. The journal file contains log records from only that transaction, and SQLite deletes the journal file when the transaction completes. Consequently, SQLite does not need to perform checkpoints, and it does not have any checkpoint logic embedded in it. When a transaction commits, SQLite makes sure that all updates from the transaction are in the database file before deleting the journal file.