Data Recovery and Transaction Logs


Three of the top 10 questions that Microsoft’s technical support receives are related to the ESE and data recovery. This section discusses the role of transaction logs and describes how they are used in the recovery of your databases in the event of a catastrophe. It also covers why databases fail and looks at some of the common error messages that accompany a database failure. See Chapter 28 for a step-by-step description of how to restore a database.

The Extensible Storage Engine

The Extensible Storage Engine is a transaction logging system that ensures data integrity and consistency in the event of a system crash or media failure. The design of ESE was guided by four criteria. The first was a question: “What happens if there’s a crash?” Every design decision was weighed against whether it would improve recoverability in the event of a disaster. The second criterion was to reduce the number of I/O operations that ESE performs, and every effort was made to do so. Three I/O operations are better (faster) than four, and four are better than five. Even if it means expanding an I/O operation to include additional calculations, eliminating one I/O operation greatly improves performance. The third criterion was for the database engine to be as self-tuning as possible. Finally, ESE is designed to provide uptime as close to 24 hours a day, 7 days a week as possible; the ability to perform maintenance while the databases remain online supports this last goal.

How ESE Works

The main function of ESE is to manage transactions. ESE applies four tests to the databases to ensure their integrity. They are sometimes referred to as the ACID tests:

  • Atomic Either all the operations performed in a transaction must be completed or none will be completed.

  • Consistent A transaction must start with a database in a consistent state and leave the database in a consistent state when finished.

  • Isolated Changes are not visible until all operations within the transaction are completed. When all operations are completed and the database is in a consistent state, the transaction is said to have been committed.

  • Durable Committed transactions are preserved even if the system experiences significant stress such as a system crash.

Note

Durability can be seen when the system crashes during the performance of the operations. If some of the operations were completed before a system crash (for example, if the e-mail was deleted from the Inbox and copied to the Private folder but the item count on each folder was not updated), when the Store.exe process starts on reboot, it will detect that the database is in an inconsistent state and will roll back the operations. This precaution means that the e-mail message cannot be lost while it is being moved, nor will there be two copies of the message upon reboot. ESE ensures that, when restarted, the database is in the same state it was in immediately before the operations began.

start sidebar
Real World What Happens When a Change Is Made to a Page in the Database?

Let’s say that you move an important e-mail message from your Inbox to a private folder named Private. The following operations occur to complete this transaction:

  • Inserting the e-mail message into the Private folder.

  • Deleting the e-mail message from the Inbox folder.

  • Updating the information about each folder to correctly display the number of items in each folder.

  • Committing the transaction in the temporary transaction log file.

Because these operations are performed in a single transaction, Exchange either performs all of them or none of them. This is the Atomic test. The commit operation cannot be carried out until all the operations have been performed successfully. Once the transaction is committed, the Isolated test is passed. And since the database is left in a consistent state, the Consistent test is passed. Finally, after the transaction is committed to the database, the changes will be preserved even if there is a crash. This meets the Durable test.

end sidebar
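The four operations in the sidebar can be sketched as a single atomic transaction. This is illustrative Python, not the actual Store.exe implementation; the folder names, the message value, and the snapshot dictionary (a stand-in for the version store) are all hypothetical.

```python
class TransactionAborted(Exception):
    pass

def move_message(folders, msg, src, dst):
    """Move msg from folders[src] to folders[dst] as one transaction:
    either every operation completes, or the prior state is restored."""
    # Snapshot the folders about to change (a stand-in for the version
    # store, which tracks uncommitted changes in RAM).
    snapshot = {src: list(folders[src]), dst: list(folders[dst])}
    try:
        folders[dst].append(msg)   # insert into the destination folder
        folders[src].remove(msg)   # delete from the source (raises if absent)
        # Item counts are derived with len(), so no separate update step;
        # reaching this point without an exception is the "commit".
    except Exception:
        folders.update(snapshot)   # roll back every operation (Atomic test)
        raise TransactionAborted("move of %r rolled back" % (msg,))

mailbox = {"Inbox": ["important"], "Private": []}
move_message(mailbox, "important", "Inbox", "Private")
print(mailbox)   # → {'Inbox': [], 'Private': ['important']}
```

If the remove step fails partway through, the snapshot is restored and the mailbox is exactly as it was before the move began, which is the behavior the Note above describes for a crash mid-transaction.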

How Data Is Stored Inside an ESE database file, data is organized in 4-KB sections called pages. Information is read from an ESE database file and loaded into memory in the form of a page. Each page contains data definitions, data, indexes, checksums, flags, timestamps, and other B-tree information. The pages are numbered sequentially within the database file to maximize performance. Pages contain either the actual data or pointers to other pages that contain the data. These pointers form a B-tree structure, and rarely is the tree more than three or four levels deep. Hence, the B-tree structure is wide but shallow.

More Info

If you’d like to learn more about the B-tree database structures, you can find plenty of information on the Internet. You can start by going to http://www.bluerwhite.org/btree. For a short definition of a B-Tree structure, go to http://searchdatabase.techtarget.com/sDefinition/0,,sid13_gci508442,00.html. And, as always, a Google search on “B-tree” will yield other sites with more information than you’ll be able to read in a single sitting.

A transaction is a series of modifications to a page in a database. Each modification is called an operation. When a complete series of operations has been performed on an object in a database, a transaction is said to have occurred.

Note

An ESE database can contain up to 2^32, or 4,294,967,296, pages. At 4 KB per page, an ESE database can therefore hold 16 terabytes (4,294,967,296 × 4,096 = 17,592,186,044,416 bytes). Practically speaking, your database size will be limited by hardware space or backup and restore considerations rather than by ESE design.
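The size limit in the Note is simple arithmetic, shown here in Python:

```python
pages = 2 ** 32              # maximum page count in an ESE database
page_size = 4 * 1024         # 4 KB per page
max_bytes = pages * page_size
print(max_bytes)             # → 17592186044416
print(max_bytes // 2 ** 40)  # → 16 (terabytes)
```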

When a page is first read from disk and stored in memory, it is considered clean. Once an operation has modified the page, the page is marked as dirty. Dirty pages are available for further modifications if necessary, and multiple modifications can be made to a dirty page before it is written back to disk. The number of modifications to a page has no bearing on when the page will be written back to disk. This action is determined by other measures, which we will discuss later in this chapter.

While the operations are being performed, they are recorded in the version store. The version store keeps a list of all the changes that have been made to a page but have not yet been committed. If your server loses power before the series of operations can be committed, the version store is referenced when ESE starts again to roll back, or undo, the unfinished operations. The version store is a virtual store—you won’t find a Version Store database on the hard disk. It is held in RAM and consists of versioned copies of pages that have been read from disk into memory. Figure 2-5 illustrates this process.

Figure 2-5: How ESE handles transactions.

To actually commit a transaction, the operations must be written to the transaction log buffer before being written to the transaction logs on disk. ESE uses “write-ahead” logging, which means that before ESE makes a change to the database, it notes what it’s going to do in the log file. Data is written to a cached version of the log in the log buffer area, the page in memory is modified, and a link is created between these two entries. Before the modifications of the page can be written to disk, the change recorded in the log buffer must first be written to the log file on disk.
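The write-ahead rule — the log record must reach stable storage before the page it describes may — can be sketched as follows. These structures are hypothetical simplifications; ESE's real buffer management, group commits, and log-record format are far more involved.

```python
class WriteAheadLog:
    """Minimal write-ahead log: a page may be written to the database
    file only after the log record describing its change is on disk."""

    def __init__(self):
        self.log_buffer = []   # log records cached in RAM (the log buffer)
        self.log_file = []     # log records flushed to the log file on disk

    def record(self, change):
        self.log_buffer.append(change)

    def flush(self):
        self.log_file.extend(self.log_buffer)
        self.log_buffer.clear()

def write_page_to_disk(wal, database, page_no, new_value, change):
    """Write a dirty page to the database file, enforcing the
    write-ahead invariant: flush the linked log record first."""
    if change not in wal.log_file:
        wal.flush()
    database[page_no] = new_value
```

Flushing the log buffer before touching the database file is the whole invariant; the link ESE creates between the cached log entry and the modified page exists to enforce exactly this ordering.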

Tip

One operation can hang or be so large that the version store takes up hundreds of megabytes. This situation could occur if your operation is indexing a large table or writing a very large file to the database. Because the version store keeps track of all changes that have occurred to the database since the oldest transaction began, you might get the following error: “-1069 error (JET_errVersionStoreOutOfMemory).” If this happens, consider moving your databases and stores to another disk with more free disk space, and also consider increasing the RAM on your system.

Often, the cached version of the changes to the pages is not written to disk immediately. This does not present a problem, since the information is recorded in the log files. Should the modifications in memory be lost, when ESE starts, the log files will be replayed (a process discussed in more detail in the section “How Log Files Are Replayed During Recovery,” later in this chapter), and the transactions will be recorded to the disk. Moreover, not writing cached information to the database right away can improve performance. Consider the situation in which a page is loaded from memory and then modified. If it needs to be modified again soon thereafter, it does not need to be reread from the disk because it is already in memory. Thus, the modifications to the database can be batched to increase performance.

Database Files The database itself is a combination of the .EDB and .STM files stored on the hard disk. Eventually, all transactions are written to one of these files. Before a page is written to disk, however, a checksum is calculated for that page, and the checksum is then written to the page along with the data. When the page is read from disk, the checksum is recalculated and the page number is verified to be sure that it matches the requested page number. If the checksum fails or if there is a mismatch on the page number, a -1018 error is generated. This error means that what was written to disk is not what was read by ESE from the disk into memory.

Note

Beginning with Service Pack 2 (SP2) in Exchange Server 5.5 and continuing in Exchange Server 2003, ESE attempts to read the data 16 times before generating a -1018 error, reducing the chance that a transient event might cause a problem. Hence, if you receive a -1018 error, you know that ESE attempted to read the data repeatedly before warning you.
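The checksum-and-page-number check that produces a -1018 error can be sketched like this. The retry count mirrors the 16 attempts mentioned in the Note; `zlib.crc32` merely stands in for ESE's actual checksum algorithm, and the page layout here is a toy.

```python
import zlib

def write_page(disk, page_no, data):
    """Store a page along with its number and a checksum over the data."""
    disk[page_no] = {"page_no": page_no, "data": data,
                     "checksum": zlib.crc32(data)}

def read_page(disk, page_no, attempts=16):
    """Recalculate the checksum and verify the page number on every read.
    After 16 failed attempts, ESE would surface a -1018 error."""
    for _ in range(attempts):
        page = disk[page_no]
        if (zlib.crc32(page["data"]) == page["checksum"]
                and page["page_no"] == page_no):
            return page["data"]
    raise IOError("-1018: checksum or page number mismatch "
                  "on page %d" % page_no)
```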

ESE and Memory Management Before it can load a page into memory, ESE must reserve an area in memory for its own use. Dynamic buffer allocation (DBA) is the process of increasing the size of the database buffer cache before the memory is needed. More than a few Exchange administrators have complained that Exchange eats up all the memory on their servers. This situation is by design, although the design doesn’t necessarily call for using all the memory, nor is the memory allocated to Exchange unavailable for other system processes. If other processes need more memory, Exchange will release the memory to that process so that it can run efficiently. This happens on the fly, and the methods used by ESE are not configurable.

In Exchange 4 and 5.0, the size of the cache was set by the Performance Optimizer. In Exchange 5.5, the process was changed to be dynamic: ESE observes the system and adjusts the size of the database cache as necessary. To observe how much of your RAM is being reserved by the Store process, use the Cache Size performance counter.

At this point, it might be helpful to take a look at the overall design goals of the DBA process. Understanding these will answer any questions you might have about memory management in Exchange Server 2003. The two design goals of DBA are as follows:

  • Maximize system performance The Store process uses the amount of overall paging and I/O activity, among other factors, to determine how much RAM to allocate for the database buffer. Overall system performance really is the focus of this goal. It does no good to have Exchange running quickly if the operating system is constantly paging.

  • Maximize memory utilization Unused system memory is wasted dollars. ESE will allocate to itself as much memory as it can without negatively impacting other applications. If a new application starts that needs additional memory, ESE will release memory so that the other application can run efficiently.

As you can see, you don’t need to be alarmed if you go into Task Manager and see that, for example, out of the 1 GB of RAM on your system, only 200 MB is left and the Store.exe process is using 800 MB of RAM. You’re not running out of memory, and the Store.exe process does not have a memory leak. All it means is that the DBA feature of ESE has allocated additional RAM to increase your system performance. Figures 2-6 and 2-7 illustrate what this looks like in Task Manager. In Figure 2-6, you can see both the Store.exe and Mad.exe processes using more memory than most of the other processes. This screen shot was taken on a server that was not busy, and still the Store.exe process was at the top of the memory usage list. Figure 2-7 shows that only 49,176 KB of physical memory is available; look under the Available value in the Physical Memory (K) box.

Figure 2-6: The Processes tab in Windows Task Manager, showing the memory allocated to Store.exe and Mad.exe.

Figure 2-7: The Performance tab in Windows Task Manager, showing memory usage and availability.

Transaction Log Files In theory, the transaction log file could be one ever-expanding file. But it would grow so big that it would consume large amounts of disk space, thus becoming unmanageable. Hence, the log is broken down into generations—that is, into multiple files, each 5 MB in size and each representing a generation. The generations are named EdbXXXXX.log, where the XXXXX is incremented sequentially, using hexadecimal numbering.

The Edb.log file is the highest generation. When it becomes full, it is renamed with the next hexadecimal number in sequence. As this happens, a temporary log file, Edbtemp.log, is created to hold transactions until the new Edb.log can be created.
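The hexadecimal generation numbering described above can be sketched in a couple of lines. This follows the EdbXXXXX.log naming as the chapter presents it; the function name is, of course, hypothetical.

```python
def log_file_name(generation):
    """Name for a closed log generation: EdbXXXXX.log, where XXXXX is
    a five-digit hexadecimal sequence number."""
    return "Edb%05X.log" % generation

print(log_file_name(1))    # → Edb00001.log
print(log_file_name(26))   # → Edb0001A.log  (26 decimal = 1A hex)
```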

Each log file consists of two sections: the header and the data. The header contains hard-coded paths to the databases that it references. In Exchange Server 2003, multiple databases can use the same log file, since the log files service the entire storage group. From an administrative perspective, this arrangement simplifies recovery. No matter which database in a storage group you’re restoring, you will reference the same log files for that group. The header also contains a signature matched to the database signature. This keeps the log file from being matched to a wrong but identically named database.

You can dump the header information of a log file with the command ESEUTIL /ML (Figure 2-8). The dump displays the generation number, the hard-coded database paths, and the signatures. The data portion of the log file contains the transactional information, such as BeginTransaction, Commit, or Rollback information. The majority of it contains low-level, physical modifications to the database. In other words, it contains the records that say, “This information was inserted on this page at this location.”

click to expand
Figure 2-8: A header dump produced using ESEUTIL /ML.

When a database is modified, several steps occur. First, the page is read into the database cache, and then the timestamp on the page is updated. This timestamp is incremented on a per-database basis. Next, the log record is created, stating what is about to be done to the database. This occurs in the log cache buffer. Then the page is modified and a connection is created between these two entries so that the page cannot be written to disk without the log file entry being written to disk first. This step guarantees that a modification to the database will first be written to the log file on disk before the database on disk is updated.

Hence, there is legitimate concern over the write-back caching that can be enabled on a log file disk controller. Essentially, write-back caching means that the hardware reports back to ESE a successful disk write even though the information is held in the disk buffer of the controller to be written to disk at a later time. Write-back caching, while improving performance, can also ruin the ESE process of writing changes to the log file before they are written to the database. If a controller or disk malfunction of some sort occurs, you could experience a situation in which the page has been written to disk but not recorded in the log file—which will lead to a corrupted database.

How Log Files Are Replayed During Recovery After you have restored your database, the logs will be replayed when you start the Store.exe process. Replaying the logs and then rolling back operations constitute the “starting” of the Store process; this is often referred to as the recovery process. Replaying the transaction logs is the first part of the recovery process, and it consumes most of the time necessary to start the Store.exe process.

Replaying the transaction log files means that for each log record, the page is read out of the database that the record references, and the timestamp on the page read from the database is compared to the timestamp of the log entry that references that page. If, for example, the log entry has a timestamp of 12 and the page read from the database has a timestamp of 11, ESE knows that the modification in the log file has not been written to disk, and so it writes the log entry to the database. However, if the timestamp on the page on disk is equal to or greater than the timestamp on the log entry, ESE does not write that particular log entry to disk and continues with the next log entry in the log file.
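The timestamp comparison that drives replay can be sketched as follows; the record and page layouts are hypothetical stand-ins for ESE's log-record format.

```python
def replay(log_entries, database):
    """Physical redo: apply each log entry to the database only if the
    on-disk page is older than the entry (its change never made it to
    disk). Entries whose changes already reached disk are skipped."""
    applied = 0
    for entry in log_entries:
        page = database[entry["page_no"]]
        if page["timestamp"] < entry["timestamp"]:
            page["data"] = entry["data"]
            page["timestamp"] = entry["timestamp"]
            applied += 1
        # else: the page on disk is as new or newer; skip this entry
    return applied
```

Using the example from the text, a log entry with timestamp 12 is written over a page with timestamp 11, but would be skipped if the page already carried timestamp 12 or higher.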

In the second and last phase of the recovery process, any uncommitted operations are rolled back: If a piece of e-mail was transferred, it is untransferred. If a message was deleted, it is restored. This is called physical redo, logical undo. Recovery runs every time the Store.exe process is started. If you stop and then start the Store process five times, the recovery process runs five times.

Even though the recovery process is run on the log files and not on the databases, if you’ve moved your databases, recovery won’t work because the hard-coded path to the database in the header of the log file will no longer point to the database. At the end of recovery, the process will appear to have been successful, but when you attempt to use that database, you’ll get an error with an event ID of 9519 from MSExchangeIS in the application log, indicating an error in starting your database (Figure 2-9).

Figure 2-9: Message indicating that an error occurred when starting the database.

If you move the database back to the location where the log file is expecting to find it and then start the Store process, you should find that recovery brings your database to a consistent and usable state.

Checkpoint File The checkpoint file is an optimization of the recovery process. It records which entries in the log files have already been written to disk. If all the entries in the log files have been written to disk, the log files don’t need to be replayed during recovery. The checkpoint file can speed up recovery time by telling ESE which log file entries need to be replayed and which do not.
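The checkpoint optimization can be sketched as a filter over the log stream. The log-sequence-number field here is a hypothetical simplification of how the checkpoint actually marks its position.

```python
def entries_to_replay(log_entries, checkpoint):
    """Only entries past the checkpoint need replaying; everything at or
    before it is already known to be in the database file on disk."""
    return [e for e in log_entries if e["lsn"] > checkpoint]
```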

Circular logging is sometimes enabled to make recovery faster. Circular logging deletes log files older than the current checkpoint location. The problem with circular logging is that you lose the ability to roll forward from a backup tape. If some of the log files written since the last full backup have been deleted by circular logging, you’ll be able to recover only to the last full backup. However, if you have all your old log files, restoring the last full backup of your database from tape will allow for a full restore up to the point of your disaster, because all the log files can be replayed into the restored database. Remember that in order for a full restore to work, the database must be in the same physical condition it was in when the log files were written; log files cannot be replayed into a physically corrupted database.

Caution

Never, never, never delete your log files! Here’s why: Assume that log file 9 contains a command to insert a new page at a particular place in the database. Log file 10 contains a command to delete this page. Now suppose that an administrator deletes log file 9, perhaps thinking that the file’s timestamp is too old, and also deletes the checkpoint file. The administrator then needs to reboot the system for unrelated reasons. When the Store.exe process is started, ESE automatically enters recovery mode. Finding no checkpoint file, ESE has no choice but to replay all the log files. When log file 10 is replayed, the delete command will be carried out on that page, and its contents will be destroyed. ESE won’t know that there was an earlier command to insert a new page in that location in the database because log file 9 was deleted. Your database will be corrupted. Whatever you do, do not delete your log files. Furthermore, be aware that write-back caching can have the same effect as deleting log files, because log records the controller has acknowledged might never actually reach the disk. The best practice is to disable write-back caching and never delete your log files or checkpoint file.

How Log Entries Are Written to the Database As we mentioned earlier, modified pages in memory and committed transactions in the log buffer area are not written immediately to disk. Committed transactions in the transaction log file are copied to the database when one of the following occurs:

  • The checkpoint falls too far behind in a previous log file. If the number of committed transactions in the log files reaches a certain threshold, ESE will flush these changes to disk.

  • The number of free pages in memory becomes too low, possibly affecting system performance. In this case, committed transactions in memory are flushed to disk to free up pages in memory for system use.

  • Another service is requesting additional memory and ESE needs to free up some of the memory it is currently using. ESE flushes pages from memory to the database and then updates the checkpoint file.

  • The database service is shutting down. In this case, all updated pages in memory are copied to the database file.

Bear in mind that pages are not copied from memory in any particular order and might not all be copied at the same time. The random order in which the pages are copied back to disk means that if there is a system crash while pages are being written to disk, the database file might have only portions of a committed transaction updated in the actual file. In this event, when the Store.exe process is started, the transaction will be replayed from the transaction log files and the database will be updated completely.
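The four flush triggers in the list above can be condensed into a decision sketch. The threshold values are hypothetical; ESE's actual limits are internal and not configurable.

```python
def should_flush(checkpoint_depth_logs, free_pages, memory_requested,
                 shutting_down, max_checkpoint_depth=20, min_free_pages=128):
    """Decide whether committed-but-unwritten pages should be flushed to
    the database file. Thresholds here are illustrative only."""
    return (checkpoint_depth_logs > max_checkpoint_depth  # checkpoint too far behind
            or free_pages < min_free_pages                # memory pressure
            or memory_requested                           # another service wants RAM
            or shutting_down)                             # clean service shutdown
```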




Microsoft Exchange Server 2003 Administrator's Companion
ISBN: 0735619794
Year: 2005
Pages: 254
