16.7. Recovering OracleSince an Oracle database consists of several interrelated parts, recovering such a database is done through a process of elimination. Identify which pieces work, then recover the pieces that don't work. The following recovery guide follows that logic and works regardless of the chosen backup method. It consists of a number of related steps that can be followed in order. If you're performing user-managed backups, you'll have to perform much of the logic yourself. If you're using rman, you'll be able to skip right through a lot of these steps.
16.7.1. Using This Recovery GuideThe process presented in the following sections for recovering an Oracle database assumes nothing. Specifically, it does not assume that you know the cause of the database failure. By following these steps you'll work through a series of tasks that determine which part(s) of the database are no longer functional. You can then bring the database up as soon as possible while allowing recovery of the pieces that are damaged. ("Damaged" may mean that a file is either missing or corrupted.) Start with Step 1. If it succeeds, it directs you to Step 10. If the "startup mount" fails, it directs you to Step 2. Each of the steps follows a similar pattern, directing you to the appropriate step following the failure or success of the current step. Assume that prior to any SQL commands, we issued the following commands: $ sqlplus /nolog SQL> connect /as sysdba; Prior to running any rman commands, we connected to the appropriate target database and recovery catalog (if we're using one). Consult your manual for the dozens of different ways you can do that. Our examples didn't use a recovery catalog, and we were logging into the target database using OS-level authentication, so we simply had to run the following command: $ rman target / For each sqlplus or rman command, I describe the state the database needs to be in prior to running the command. For example, the paragraph will say something like, "issue the following SQL command against a mounted, closed database." For more detailed information about individual steps, please consult Oracle's documentation, especially the Backup and Recovery Basics documentation. These steps can be used on a Windows machine as well as a Unix machine. Differences are noted when necessary.
16.7.2. Seriously, Think About rmanYou can see that the restore and recovery of an Oracle database can be a very complicated process. If you're not yet using rman, now is the time to think about it. As long as you still have your control files, this 26-step recovery process would be reduced to the following 6 rman commands: connect target...... connect catalog..... startup mount; restore database; recover database; alter database open; rman automatically figures out what's broken, restores it, then applies media recovery against it. 16.7.3. Step 1: Try Startup MountThe first step in verifying the condition of an Oracle database is attempting to mount it. This works because mounting a database (without opening it) reads the control files but does not open the datafiles. If the control files are multiplexed/mirrored, Oracle attempts to open each of the control files listed in the parameter file. If any of them are damaged, the mount fails. To mount an unmounted database, simply run sqlplus /nolog, connect to the database, and enter startup mount: SQL> startup mount; If it succeeds, the output looks something like this: SQL> startup mount; ORACLE instance started. Total System Global Area 5130648 bytes Fixed Size 44924 bytes Variable Size 4151836 bytes Database Buffers 409600 bytes Redo Buffers 524288 bytes Database mounted. If the attempt to mount the database fails, the output looks something like this: SQL> startup mount; Total System Global Area 5130648 bytes Fixed Size 44924 bytes Variable Size 4151836 bytes Database Buffers 409600 bytes Redo Buffers 524288 bytes ORACLE instance started. ORA-00205: error in identifying controlfile, check alert log for more info
16.7.4. Step 2: Are All Control Files Missing?Don't panic if the attempt to mount the database fails. Control files are easily restored if they were multiplexed/mirrored and can even be rebuilt from scratch if necessary. The first important piece of information is that one or more control file is missing. Unfortunately, since Oracle aborts the mount at the first failure it encounters, it could be missing one, two, or all of the control files, but so far you know only about the first missing file. So, before embarking on a course of action, determine the severity of the problem by doing a little research. First, determine the names of all of the control files. You can do that by querying v$parameter, which is a fixed table available from an unmounted (or mounted) database. SQL> select value from v$parameter where name like 'control_files' /db/Oracle/a/oradata/crash/control01.ctl, /db/Oracle/b/oradata/crash/control02.ctl, /db/Oracle/c/oradata/crash/control03.ctl Or on Windows, it might look like this: SQL> select value from v$parameter where name like 'control_files' D:\ORACLE\ORADATA\CRASH\CONTROL01.CTL, E:\ ORACLE\ORADATA\CRASH\CONTROL02.CTL, F:\ ORACLE\ORADATA\CRASH\CONTROL03.CTL It's also important to get the name of the control file that Oracle is complaining about. Find this by looking for the phrase controlfile: in the alert log. (The alert log can be found in the location specified by the background_dump_dest value also available in v$parameter.) Use the technique shown previously to get the locations of the control files to get this value. (Typically, the alert log is in the <ORACLE_BASE>/admin/<ORACLE_SID>/bdump directory.) In that directory, there should be a file called alert_<ORACLE_SID>.log. In that file, there should be an error that looks something like this: Sun Jul 23 18:53:19 PDT 2006 alter database mount exclusive Sun Jul 23 18:53:19 PDT 2006 ORA-00202: controlfile: '/db/a/oradata/crash/control01.ctl' ORA-27037: unable to obtain file status SVR4 Error: 2: No such file or directory
With the names of all of the control files and the name of the damaged file, it's easy to determine the severity of the problem. Do this by listing each of the control files and comparing their size and modification time. (Remember the game "One of these is not like the others" on Sesame Street?) The following scenarios assume that the control files were multiplexed/mirrored to three locations, which is a very common practice. The possible scenarios are:
16.7.5. Step 3: Replace Missing Control File
Please note the careful choice of words here. We are not going to restore the control file, which would imply that we're going to get it back from backups. Remember that restoring the control file forces us to perform an alter database open resetlogs, which is not desirable if we can afford it. We are going to replace it by copying one of the other copies that we multiplexed/mirrored it to. Let's see if we can do that before we actually restore one from backup.
If you're able to identify that at least one of the multiplexed/mirrored copies of the control file is good, then this part is easy. Simply copy another one of the multiplexed/mirrored copies of the control file to the damaged control file's name and location. (The details of this procedure follow.) Once this is done, you can attempt to mount the database again.
The first thing to do is to get the name of the damaged control file. Again, this is relatively easy. Look in the alert log for a section like this one: Sun Jul 23 18:53:19 PDT 2006 alter database mount exclusive Sun Jul 23 18:53:19 PDT 2006 ORA-00202: controlfile: '/db/a/oradata/crash/control01.ctl' ORA-27037: unable to obtain file status SVR4 Error: 2: No such file or directory The next step would be to copy a known good control file to the damaged control file's location. 16.7.5.1. "But I don't have a good control file!"It's possible that there is no known copy of the current control file. If this is the case and you weren't using your control file to store rman backup history, it's probably best to attempt to use the create controlfile script. If you have rman backup history in your control file, it's probably best to actually restore your control file from backup. To use the create controlfile script, proceed to Steps 4 through 7. To restore your control file from backup, proceed to Step 8. If that's not possible or probable, try the following procedure. First, make backups of all the control files. Then, one at a time, try copying every version of each control file to all the other locationsexcluding the one that Oracle has already complained about because it's obviously damaged.
For example, assume there are three control files: /a/control1.ctl, /b/control2.ctl, and /c/control3.ctl. The alert log says that the /c/control3.ctl is damaged, and because /a/control1.ctl and /b/control2.ctl have different modification times, there's no way to know which one is good. Try the following steps: First, make backup copies of all the files: $ cp /a/control1.ctl /a/control1.ctl.sav $ cp /b/control2.ctl /b/control2.ctl.sav $ cp /c/control3.ctl /c/control3.ctl.sav Or, for a Windows system, run the following commands: C:\ copy C:\CONTROL01.CTL C:\CONTROL01.SAV C:\ copy D:\CONTROL02.CTL D:\CONTROL02.SAV C:\ copy E:\CONTROL03.CTL E:\CONTROL03.SAV Second, try copying one file to all locations. No sense in copying from control3.ctl because it's obviously damaged. Try starting with control1.ctl: $ cp /a/control1.ctl /b/control2.ctl $ cp /a/control1.ctl /c/control3.ctl Or, for a Windows system, run the following commands: C:\ copy C:\CONTROL01.CTL D:\CONTROL02.CTL C:\ copy C:\CONTROL01.CTL E:\CONTROL03.CTL Now attempt a startup mount: SQL> startup mount Sun Jul 23 18:53:19 PDT 2006 alter database mount exclusive Sun Jul 23 18:53:19 PDT 2006 ORA-00202: controlfile: '/a/control3.ctl' ORA-27037: unable to obtain file status This error says that the file that was copied to all locations is also damaged. Now try the second file, control2.ctl. This time we have to copy from backup since we overwrote the original with the last step. $ cp /b/control2.ctl.sav /a/control1.ctl $ cp /b/control2.ctl.sav /a/control3.ctl Or, for a Windows system, run the following commands: C:\ copy D:\CONTROL02.SAV C:\CONTROL01.CTL C:\ copy D:\CONTROL02.SAV E:\CONTROL03.CTL Now attempt to do a startup mount: SQL> startup mount; ORACLE instance started. Total System Global Area 5130648 bytes Fixed Size 44924 bytes Variable Size 4151836 bytes Database Buffers 409600 bytes Redo Buffers 524288 bytes Database mounted. It appears that control2.ctl was a good copy of the control file.
16.7.6. Step 4: Are All Datafiles and Redo Logs OK?
In 10g, Oracle enhanced the backup controlfile to trace output to include both a noresetlogs and a resetlogs option in the output. If you want to use the noresetlogs option, all datafiles and online redo logs must be intact. The datafiles can be older versions that were restored from backup because they will be rolled forward by the media recovery. The online redo logs must be current and intact for the noresetlogs version of the create controlfile script to work. If any online redo log group is completely damaged, you will have to use the resetlogs option. Why? Because the rebuild process uses the online redo logs to identify the current SCN, then identifies it as the current SCN in the control file. It's OK if one or more of the datafiles have older SCNs. But if a datafile shows that it has an SCN that is more recent than the available online redo logs, it indicates that we're not dealing with the current online redo logsthe control file rebuild process aborts. If it's likely that one or more of the datafiles or online redo logs is damaged, go to Step 5. If it's more likely that they are all intact, go to Step 6.
16.7.7. Step 5: Restore Damaged Datafiles or Redo LogsIf one or more of the datafiles or online redo logs are definitely damaged, follow all the instructions given here to see if there are any other damaged files. (A little extra effort now saves a lot of frustration later.) If it's possible that all the datafiles and online redo logs are OK, another option is to skip this step and try to recreate the control file now. (An unsuccessful attempt at this does not cause any harm.) If it fails, return to this step. If there is plenty of time, go ahead and perform this step first. To try to recreate the control files now, proceed to Step 6. The first thing to find out is where all the datafiles and redo logs are. To determine this, run the following commands on the mounted, closed database. SQL> select name from v$datafile; /db/Oracle/a/oradata/crash/system01.dbf /db/Oracle/a/oradata/crash/rbs01.dbf /db/Oracle/a/oradata/crash/temp01.dbf /db/Oracle/a/oradata/crash/tools01.dbf /db/Oracle/a/oradata/crash/users01.dbf /db/Oracle/a/oradata/crash/test01.dbf SQL> select group#, member from v$logfile; 1 /db/Oracle/a/oradata/crash/redocrash01.log 3 /db/Oracle/c/oradata/crash/redocrash03.log 2 /db/Oracle/b/oradata/crash/redocrash02.log 1 /db/Oracle/b/oradata/crash/redocrash01.log 2 /db/Oracle/a/oradata/crash/redocrash03.log 3 /db/Oracle/c/oradata/crash/redocrash02.log The Windows output of these commands would look very similar, with different paths, of course. Let's check each of the files shown by the preceding command. First, look at the datafiles. Most of the datafiles probably have the same modification time, or there might be a group of them with one modification time and another group with a different modification time. If there are read-only tablespaces, there may be a few datafiles with modification times a lot older than the others. That's OK. The main thing to look for is a missing file or a zero-length file. Something else to look for is one or more files that have a modification time that is newer than the newest online redo logfile. If a datafile meets any one of these conditions, it must be restored from backup. Redo logfiles, however, are a little different. Each redo logfile within a log group should have the same modification time. For example, the output of the preceding example command shows that /db/Oracle/a/oradata/crash/redocrash01.log and /db/Oracle/b/oradata/crash/redocrash01.log are in log group 1. They should have the same modification time and size. The same should be true for groups 2 and 3. There are a couple of possible scenarios: One or more log groups has at least one good and one damaged log. This is why redo logs are multiplexed/mirrored! Copy the good redo log to the location of the damaged redo log. For example, if /db/Oracle/b/oradata/crash/redocrash01.log is missing, but /db/Oracle/a/oradata/crash/redocrash01.log is intact, issue the following command: $ cp /db/Oracle/a/oradata/crash/redocrash01.log \ /db/Oracle/b/oradata/crash/redocrash01.log. Or, if the errors showed that D:\ORACLE\ORADATA\CRASH\REDOCRASH01.LOG was OK, but E:\ORACLE\ORADATA\CRASH\REDOCRASH01.LOG was missing or corrupt: C:\ COPY D:\ORACLE\ORADATA\CRASH\REDOCRASH01.LOG \ E:\ORACLE\ORADATA\CRASH\REDOCRASH01.LOG. All redo logs in at least one log group are damaged. This is a bad place to be. The create controlfile script in Step 6 requires that all online redo logs be present to use the noresetlogs option. If even one log group is completely damaged, you will have to use the resetlogs portion of the script. If all datafiles are available, but all the control files are missing, proceed to Step 6. If the database will not open for some other reason, proceed to Step 10. 16.7.8. Step 6: Is There a "Backup to Trace" of the Control File?
You'll need the output from the sqlplus command alter database backup controlfile to trace. It creates a trace file that contains two create controlfile scripts. This command should be run (via cron on a Unix/Linux system, a Windows scheduled task, or via the DBMS scheduler) on a regular basis. To find out if there is such a script available, follow these instructions. The first thing to find out is the destination of the trace files. This is specified by the user_dump_dest value in the v$parameter table. (Typically, user_dump_dest is set to $ORACLE_BASE /admin/ $ORACLE_SID /udump.) First cd to that directory, then use grep on Unix or find on Windows to look for the phrase CREATE CONTROLFILE, as shown in Example 16-1. Example 16-1. Locating the most recent create controlfile script
On Windows, try these commands: D:\ cd D:\Oracle\admin\crash\udump D:\ type *.TRC|find 'CREATE CONTROLFILE' In Example 16-1, crash_ora_661.trc is the most recent file to contain the create controlfile script. If there is a create controlfile script, proceed to Step 7. If there is not a create controlfile script, and all the control files are missing, proceed to Step 8. 16.7.9. Step 7: Run the create controlfile ScriptFirst, find the trace file that contains the script. The instructions on how to do that are in Step 6. Once you find it, copy it to another filename, such as rebuild.sql. If your online redo logs are intact, you can use the noresetlogs section; if any online redo log group is completely damaged, you'll need to use the resetlogs section. Take a look at the output, and select the appropriate section, deleting all other comments before and after the section. The file then should look something like the one in Example 16-2. Example 16-2. Example create controlfile script
Once the file looks like Example 16-2, add the following line just above the STARTUP NOMOUNT line: CONNECT /AS SYSDBA; After you add this line, run the following command on the idle instance, substituting rebuild.sql with the appropriate name: $ sqlplus /nolog < rebuild.sql This should work without intervention and completely rebuild the control files. You then open the database when it's done.
16.7.10. Step 8: Restore Control Files and Prepare the Database for Recovery
If the precautions mentioned elsewhere in this chapter were followed, there is really only one scenario that would result in this positionloss of the entire system due to a cataclysmic event. Loss of a disk drive (or even multiple disk drives) is easily handled if the control files are multiplexed/mirrored. Even if all control files are lost, they can be rebuilt using the trace file created by running the backup controlfile to trace command.
Follow the next steps, starting with restoring the control files from backup. Chances are that the database files need to be restored as well. This is because it is difficult to use a control file that is older than the most recent database file. (Oracle complains and aborts if this happens.) To find out if the control file is newer than the datafiles, try the following steps without overwriting the database files and see what happens. 16.7.10.1. 1) Restore control files from backupThe very first step in this process is to find and restore the most recent backup of the control file. If you're using rman with a catalog and you're backing up the control file explicitly, you can use the restore controlfile command: RMAN> restore controlfile If you've set controlfile autobackup to on, connect to rman, and issue the restore controlfile from autobackup command: RMAN> restore controlfile from autobackup If you're performing user-managed backups, you need to issue the backup controlfile to filename command in sqlplus. Locate the backed up control file, and copy it to all of the locations and filenames shown when running a select value from v$parameter where name like "control_files". Again, this backup control file should be more recent than the most recent database file in the instance. If this isn't the case, Oracle will complain. 16.7.10.2. 2) Start up mountTo find out if the control file is valid and has been copied to all the correct locations, attempt to start up the database with the mount option. (This is the same command from Step 1.) If you're running rman, you can simply issue this command after your restore controlfile command: RMAN> startup mount If you're performing user-managed backups, run the following command on the mounted, closed database: SQL> startup mount; Once this step has been completed, proceed to Step 9. 16.7.11. Step 9: Recover the Database
Once the control file is restored with a backup copy, attempt to recover the database using the backup control file. 16.7.11.1. Recover and open the database with rmanIf you're using rman, this section is easy. Just issue recover database as your next command. This single command handles recovering the database even if you're using a backup control file. RMAN> recover database; Here's an area where rman really shines. If recovery requires applying archived redo logfiles that have been backed up, rman automatically restores them from the appropriate location, applies them, then deletes them. The entire process occurs automatically, regardless of where the archived redo logs have to come from. If you did not use a backup control file, you just need to open the database: RMAN> alter database open; If you did use a backup control file, you need to open the database with the resetlogs option: RMAN> alter database open resetlogs; 16.7.11.2. Attempt to recover database manuallySince recovering the database with a backup control file requires the alter database open resetlogs option, it never hurts to try recovering the database normally first: SQL> recover database; If the backup control file option is required, Oracle will complain: ORA-00283: Recover session cancelled due to errors ... ORA-01207: file is more recent than controlfile - old controlfile
If Oracle complains, you need to recover the database using the backup controlfile option. Attempt to recover the database using the following command on the mounted, closed database: SQL> recover database using backup controlfile If it works, the output will look something like Example 16-3. Example 16-3. Sample output of recover database command
Oracle requests all archived redo logs since the time of the oldest restored datafile. For example, if the backup that was used to restore the datafiles was from three days ago, Oracle needs all archived redo logs created since then. Also, the first logfile that it asks for is the oldest logfile that it wants. If you're performing user-managed backups, and you need older archived redo logs, you have to restore them yourself. If you have to find and apply the logs yourself, the most efficient way to roll through the archived redo logs is to have all of them sitting uncompressed in the directory that Oracle suggests as the location of the first file. (This, of course, requires much more storage than the rman method.) If this is the case, simply enter auto at the prompt. Otherwise, specify alternate locations, or press enter as it asks for each one, giving time to compress or remove the files that it no longer needs. If it is able to do so, Oracle automatically rolls through all the archived redo logs and the online redo log. It then says Media recovery complete. However, once Oracle rolls through all the archived redo logs, it may prompt for the online redo log. It does this by prompting for an archived redo log with a sequence number that is higher than the most recent archived redo log available. This means that it is looking for the online redo log. Try answering its prompt with the names of the online redo logfiles that you have. Unfortunately, as soon as you give it a name it doesn't like, it makes you start the recover database using backup controlfile command again. For example, suppose that you have the following three online redo logs: /oracle/data/redolog01.dbf /oracle/data/redolog02.dbf /oracle/data/redolog03.dbf When you are prompting for an archived redo log that has a higher sequence number than the highest numbered archived redo log that you have, answer the prompt with one of these files (e.g., /oracle/data/redolog01.dbf ). If the file that you give it does not contain the sequence number it is looking for, you will see a message like the following: ORA-00310: archived log contains sequence 2; sequence 3 required ORA-00334: archive log: '/oracle/data/redolog01.dbf' Oracle cancels the database recovery, requiring you to start it over. Once you get to the same prompt again, respond with a different filename, such as /oracle/data/redolog02.dbf. If it contains the recovery thread it is looking for, it responds with a message like the following: Log applied. Media recovery complete. If, after trying all the online redo logs, Oracle still asks for a log that you do not have, simply enter cancel. 16.7.11.3. Alter database open resetlogsOnce the media recovery is complete, the next step is to open the database. As mentioned earlier, when recovering the database using a backup control file, it must be opened with the resetlogs option. Do this by entering the following SQL command against the mounted, closed database: SQL> alter database open resetlogs; Later versions of Oracle allow you to create a control file with a resetlogs option. If this was done, you can simply open the database and do not need the resetlogs option here. If you're running a version of Oracle prior to 10g, you must take a backup immediately after recovering the database with the resetlogs option! Even if you're running 10g, it would be good to take a backup as soon as you can. While 10g databases can recover through a resetlogs incident, you will have to redo the restore and recovery you just did. It is best if it is a cold backup after shutting down the database, but a hot backup would be better than nothing. Just remember that if you allow the database to become operational, and something happens before you get it backed up again, there is a risk that:
16.7.12. Step 10: Does "alter database open" Work?
If the startup mount worked, this is actually only the second step that you perform. Mounting the database checks only the presence and consistency of the control files. If that works, opening the database is the next step. Doing so checks the presence and consistency of all datafiles, online redo logfiles, and any rollback segments or undo segments. To open the database, run the following command on the mounted, closed database: SQL> alter database open; If the attempt to open the database worked, Oracle simply says Database Altered. If this is the first attempt to open the database, and no datafiles or rollback segments were taken offline, you're done!
16.7.12.1. Big shortcut for rman usersIf you're using rman, and the attempt to open the database did not work, you have a big choice to make. You can continue to follow this procedure to determine what's wrong and fix things one at a time, or you can issue two commands and be done. If the latter idea sounds better, try these two commands: RMAN> restore database ; RMAN> recover database ; 16.7.12.2. User-managed backups: Read onIf you're performing user-managed backups, or you'd like to perform the rest of these steps manually with rman, you now need to find out why the database wouldn't open. The output varies depending on the condition. Here is a listing of what some of those conditions may be, accompanied by what the error might look like when that condition occurs:
16.7.12.3. Damaged datafileA damaged datafile is actually very easy to recover from. This is a good thing because this occurs more often than any other problem. Remember that there is only one copy of each datafile, unlike online redo logs and control files that can be multiplexed/mirrored. So, statistically speaking, it's easier to lose one datafile than to lose all multiplexed/mirrored copies of a log group or all multiplexed/mirrored copies of the control file. Oracle also can recover parts of the database while other parts of the database are brought online. Unfortunately, this helps only if a partially functioning database is of any use to the users in your environment. Therefore, a database that is completely worthless unless all tables are available will not benefit from the partial online restore feature. However, if the users can use one part of the database while the damaged files are being recovered, this feature may help to save face by allowing at least partial functionality during an outage. (Because doing an online (partial) restore actually makes more work for you, you should seriously investigate whether this option will help if you use itbefore you actually have to use it. That way, when you need to do a large restore, that question will already be answered.) There are three types of datafiles as far as recovery is concerned:
The final type of datafile is a file contained within a required tablespace (system , sysaux, or undo tablespaces in 10g, and system in previous versions). This datafile cannot be recovered with the database online because the database cannot be brought online without it. 16.7.12.4. Damaged log groupIf all members of a log group are damaged, there is great potential for data loss. The entire database may have to be restored, depending on the status of the log group that was damaged and the results of some attempts at fixing it. This may seem like a broken record, but this is why multiplexing/mirroring the log groups is so important. If the error refers to a damaged log group, one option is to proceed directly to Step 17. However, to verify that nothing else is wrong, you should probably read the rest of this step and proceed to the next one. 16.7.12.5. Damaged required tablespaceIf any datafiles belonging to required tablespaces are damaged, then you have to restore and recover them offline. The database cannot be opened without them. If the error refers to a required tablespace, one option is to proceed directly to Step 11. However, to verify that nothing else is wrong, you should probably read the rest of this step and proceed to the next one. 16.7.12.6. Damaged rollback segment
Since Oracle has to open the datafiles that contain this rollback segment before it can verify that the rollback segment is available, this error will not occur unless a datafile has been taken offline. If Oracle encounters a damaged datafile (whether or not it contains a rollback segment), it will complain about that datafile and abort the attempt to open the database. Remember that a rollback segment is a special part of a tablespace that stores rollback information. Rollback information is needed in order to undo (or roll back) an uncommitted transaction. Since a crashed database almost always contains uncommitted transactions, recovering a database with a damaged rollback segment is a little tricky. As previously mentioned, a damaged datafile may be taken offline, but Oracle will not open the database without the rollback segment. If the error indicates that there is a damaged rollback segment, proceed to Step 18. 16.7.12.7. Before going any farther...Remember that Oracle will stop attempting to open the database as soon as it encounters an error with one file. This means, of course, that there could be other damaged files. If there is at least one damaged datafile, now is a good time to see whether other files are damaged. Detailed instructions on how to do that are provided in Step 5. Once you know the names of all the damaged files, you can recover them as described next. 16.7.12.8. How media recovery worksIf any datafiles are restored from backup, the sqlplus or rman recover command is needed. These commands use the archived and online redo logs to redo any transactions that have occurred since the time that the backup of a datafile was taken. You can recover a complete database, a tablespace, or a datafile by issuing the commands recover database, recover tablespace tablespace_name, or recover datafile data_file_name, respectively. With rman or user-managed backups, you can simply issue the command recover database. Both automatically figure out what needs to have media recovery applied to it and perform the appropriate recovery. The difference between rman and user-managed backups comes when you need an archived log that is available only on backup. If rman needs to obtain any archived redo logs from backup, it automates restoring them and deleting them as necessary. If you're performing user-managed backups, you need to perform this step yourself. You can also apply media recovery to an individual datafile by issuing the SQL commands below against a mounted, closed database: SQL> recover datafile '/db/Oracle/a/oradata/crash/datafile01.dbf' This command allows the restore of an older version of a datafile and uses redo to roll it forward to the point of failure. For example, if you took a backup of a datafile on Wednesday night, and that datafile was damaged on Thursday evening, you would restore that datafile from Wednesday night's backup. Of course many transactions would have occurred since Wednesday night, making changes to the datafiles that you restored. Running the command recover [database|tablespace|datafile] reapplies those transactions to the restored datafile, rolling them forward to Thursday evening. This recovery can work in a number of ways. After receiving the recover command, Oracle prompts for the name and location of the first archived redo log that it needs. If that log and all logs that have been made since that log are online, uncompressed, and in their original location, enter the word auto. This tells Oracle to assume that all files that it needs are online. It therefore can automatically roll through each log. In order to do this, all files that Oracle will need must be online. First, get the name of the oldest file, because that is the first file it will need. That filename is displayed immediately after issuing the recover command: ORA-00279: change 18499 generated at 02/21/06 11:49:56 needed for thread 1 ORA-00289: suggestion : /db/Oracle/admin/crash/arch/arch.log1_481.dbf ORA-00280: change 18499 for thread 1 is in sequence #481 Specify log: {<RET>=suggested | filename | AUTO | CANCEL} In the preceding example, the first file that Oracle needs is /db/Oracle/admin/crash/arch/arch.log1_481.dbf. Make sure that this file is online and not compressed or deleted. If it is deleted, restore it from backup. If it is compressed, uncompress it and any archived redo logfiles in that directory that are newer than it, because Oracle may need all of them to complete the media recovery. If you're not using rman, it might be necessary to delete some of the older archived redo logs to make enough room for the files that need to be uncompressed. Once all archived redo logs that are newer than the one requested by Oracle have been restored and uncompressed, enter auto at the Specify log prompt. rman can be configured to restore only one redo log at a time from tape, apply that redo log, then delete it, removing the space issue. It can also be configured to keep a certain amount on disk using the maxsize parameter.
If there isn't enough space for all of the archived redo logs to be uncompressed, a little creativity may be required. Uncompress as many as possible, and then press Enter each time it suggests the next file. (Pressing Enter tells Oracle that the file that it is suggesting is available. If it finds that it is not available, it prompts for the same file again.) Once it has finished with one archive log, compress that log, and uncompress a newer log, because it will be needed shortly. (Obviously, a second window is required, and a third window wouldn't hurt!) At some point, it may ask for an archived redo log that is not available. This could mean that some of the archived redo logs or online redo logs are damaged. If the file cannot be located or restored, enter cancel. More details on media recovery are available in Oracle's documentation. If any of the damaged datafiles is a member of any required tablespaces, proceed to Step 12. If none of them is a member of any required tablespaces, proceed to Step 13. 16.7.13. Step 11: Are There Damaged Datafiles for Required Tablespaces?If the damaged file is part of a required tablespace (system, sysaux, undo, and temp in 10g, or system in previous versions), an offline recovery is required. Unfortunately, Oracle complains only that the datafile is missingwithout saying what kind of datafile it is. Fortunately, even if Oracle is down, there is an easy way to determine which files belong to the required tablespaces. Finding out if the datafile contains a rollback segment is a little more difficult, but it is still possible. If you use rman with a recovery catalog, you can connect to the catalog, set the DBID of the database, and issue the report schema command. This command displays the files, locations, and whether they contain rollback segments. RMAN> report schema; Report of database schema File Size(MB) Tablespace RB segs Datafile Name ---- ---------- ---------------- ------- ------------------- 1 307200 SYSTEM NO /oracle/oradata/trgt/system01.dbf 2 20480 UNDOTBS YES /oracle/oradata/trgt/undotbs01.dbf 3 10240 CWMLITE NO /oracle/oradata/trgt/cwmlite01.dbf 4 10240 DRSYS NO /oracle/oradata/trgt/drsys01.dbf 5 10240 EXAMPLE NO /oracle/oradata/trgt/example01.dbf 6 10240 INDX NO /oracle/oradata/trgt/indx01.dbf 7 10240 TOOLS NO /oracle/oradata/trgt/tools01.dbf 8 10240 USERS NO /oracle/oradata/trgt/users01.dbf To find out which datafiles are in the system tablespace, you'll need to query sys.dba_data_files: SQL> select file_name, tablespace_name from sys.dba_data_files; /oracle/oradata/trgt/system01.dbf SYSTEM /oracle/oradata/trgt/undotbs01.dbf UNDOTBS /oracle/oradata/trgt/cwmlite01.dbf CWMLITE /oracle/oradata/trgt/drsys01.dbf DRSYS /oracle/oradata/trgt/example01.dbf EXAMPLE /oracle/oradata/trgt/indx01.dbf INDX /oracle/oradata/trgt/tools01.dbf TOOLS /oracle/oradata/trgt/users01.dbf USERS This example report shows three datafiles that are members of the system , sysaux, and undo tablespaces. In your configuration, however, there may be multiple datafiles assigned to these tablespaces. If any of the damaged datafiles is a member of a required tablespace, proceed to Step 12. If none of them are members of required tablespaces, proceed to Step 13. 16.7.14. Step 12: Restore All Datafiles in Required TablespacesUnlike other tablespaces, required tablespaces (system, sysaux, undo, and temp in 10g, or system in previous versions) must be available in order to open the database. Therefore, if any members of these tablespaces are damaged, they must be restored now. Before doing this, make sure that the database is not open. (It is OK if it is mounted.) To make sure, run the following command on the mounted, closed database: SQL> select status from v$instance; STATUS ------- MOUNTED 1 row selected. The preceding example shows that this instance is mounted, not open. If the database is not open, restore the damaged files from the most recent backup available. If you're performing user-managed backups, you'll need to figure out where those backups are and restore them. If you're using rman, simply issue a single command: RMAN> restore database; rman automatically figures out which files need to be restored and restores only those files from the most recent backup. It does not restore files that appear to be alright. Once all damaged files in the required tablespace are restored, run the following command on the mounted, closed database. If you want to just bring these tablespaces online and then fix other tablespaces later, you'll need to apply media recovery just against the individual tablespaces. This may be more complex and may take more time than just restoring all damaged datafiles and issuing a single recover database command. To recover just the required tablespaces, run individual recover commands against them. The following examples show media recovery being applied against the system tablespace: RMAN> recover tablespace system; Or in SQL: SQL> recover tablespace system; If you're recovering multiple tablespaces, it's probably best if you just issue the recover database command, which applies media recovery against the entire database. Once this command has completed, the required tablespaces are recovered to the time of failure. If it does complete successfully, and no other datafiles are damaged, return to Step 10. For more information about the recover tablespace command, read the earlier section "How media recovery works" at the end of Step 10. If there are other datafiles to recover, proceed to Step 13. 16.7.15. Step 13: Damaged Nonrequired Datafile?So far, we have mounted the database, which proves that the control files are okay. It may have taken some effort if one or more of the control files were damaged, but it succeeded. We also have verified that any required tablespaces are intact, even if they required a restore and recovery. Most of the rest of this procedure concentrates on disabling damaged parts of the database so that it may be brought online as soon as possible. The process of elimination identifies all damaged datafiles once the database is opened successfully. They can then be easily restored. If there are damaged datafiles that are not members of required tablespaces, proceed to Step 14. If there are no more damaged datafiles, proceed to Step 17. 16.7.16. Step 14: Take Damaged Datafile OfflineTo open a database with a damaged, nonrequired datafile, take the datafile offline.
If this instance is operating in archivelog mode, just take the datafile offline. It can be restored and recovered later, after the instance has been brought online. Here's the command to do this: SQL> alter database datafile 'filename' offline; If the instance is operating in noarchivelog mode, that's a different problem. Oracle does not allow the datafile to be taken offline because it knows it can't be brought back online without media recovery. Without archivelog mode, there is no media recovery. If this is the case, you need to treat any damaged datafiles as required and go back to Step 12. Once any damaged files are taken offline, return to Step 10 and attempt to open the database again. 16.7.17. Step 15: Were Any Datafiles Taken Offline?
This step is really a very simple question: if the database opens without taking any datafiles offline, you're done. If you had to take some datafiles offline to open the database, you need to restore and recover them now. To find out if any datafiles were taken offline, run the following command: SQL> select name from v$datafile where status = 'OFFLINE' ; NAME ------------------ /db/oracle/a/oradata/crash/tools01.dbf /db/oracle/a/oradata/crash/users01.dbf If some datafiles were taken offline to open the database, proceed to Step 16. If you're unsure, proceed to Step 16. 16.7.18. Step 16: Restore and Recover Offline DatafilesIf any datafiles were taken offline, you need to restore them and bring them online. 16.7.18.1. Restore the damaged datafilesOnce the names of the datafiles that need to be restored are determined, restore them from the latest available backup. If you're performing user-managed backups, you have to decide which files to restore and where to restore them from. If you're using rman, you only need to issue the following command: RMAN> restore database ; Once they are restored, recovery within Oracle can be accomplished in three different ways. These ways vary greatly in complexity and flexibility. Examine the following three media recovery methods and choose whichever one is best for you. 16.7.18.2. Datafile recoveryIf there are a small number of datafiles to recover, this may be the easiest option. As each file is restored, issue the recover datafile command against it, and then bring it online. The following commands work in rman or sqlplus: recover datafile 'datafile_name' ; alter database datafile 'datafile_name' online ; The downside to this method is that media recovery may take a while for each datafile. If recovering multiple datafiles within a single tablespace, this is probably wasting time. 16.7.18.3. Tablespace recoveryThis is the hardest of the methods, but it may work faster than the previous method if there are several damaged datafiles within a tablespace. If forced to leave the partially functional database open while recovering the damaged datafiles, and there are several of them to recover, this is probably the best option. First, find the names of all datafiles and the tablespace to which they belong. Since the database is now open, this can be done in one step, as demonstrated in Example 16-4. Example 16-4. Listing dba_data_files
Once all of the datafiles are restored and the names of all the tablespaces that contain these datafiles have been determined, issue the recover tablespace command against each of those tablespaces. Before doing so, however, each tablespace must be taken offline, as shown in Example 16-5. The commands in Example 16-5 work in rman or sqlplus. Example 16-5. Tablespace-based recovery
It's obvious that this method is quite involved! It's not pretty, and it's not easy, but it allows recovery of multiple tablespaces while the instance continues to operate. If a partially functioning database is of any value to the users, this method may be their best friend. 16.7.18.4. Database recoveryThis is actually the easiest method, but it requires that the database be shut down to perform it. After restoring all the database files that were taken offline, close the database, and issue the recover database command. Once all the database files are restored, issue the commands shown in Example 16-6. These commands will work in rman or sqlplus. Example 16-6. Normal database recovery
To make sure that all tablespaces and datafiles have been returned to their proper status, run the commands shown in Example 16-7.
Example 16-7. Obtaining the names of all datafiles, control files, and logfiles
Example 16-7 shows that all datafiles, control files, and logfiles are in good condition. (In the case of the logfiles and control files, no status is good status.)
16.7.19. Step 17: Is There a Damaged Online Log Group?When we refer to a "damaged online log group," we mean that all members of a log group are damaged. If at least one member of a multiplexed/mirrored log group is intact, Oracle opens the database and simply puts an error message in the alert log. However, if all members of a log group are damaged, the database will not open, and the error will look something like this: ORA-00313: open failed for members of log group 2 of thread 1 ORA-00312: online log 2 thread 1: '/db/Oracle/b/oradata/crash/redocrash02.log' ORA-00312: online log 2 thread 1: '/db/Oracle/a/oradata/crash/redocrash03.log'
The first thing that must be determined is the status of the damaged log group. The three statuses to worry about are CURRENT, ACTIVE, and INACTIVE. To determine the status of the damaged log group, run the following command on the mounted, closed database: SQL> select group#, status from v$log; The output looks something like this: GROUP# STATUS ---------- ---------------- 1 INACTIVE 2 CURRENT 3 ACTIVE 3 rows selected. The preceding example shows that log group 1 is inactive, group 2 is current, and group 3 is active. Here is an explanation of these statuses and how they affect recovery:
To determine what action to take next, first get the number of the log group whose logfiles are damaged. The preceding example error reads open failed for members of log group 2. Reference this number against the log groups listed by the select * from v$log command. In the previous example, log group 2 was CURRENT at the time the database crashed.
16.7.20. Step 18: Are Any Rollback Segments Unavailable?
If a rollback segment is damaged, Oracle will complain when attempting to open the database. The error looks like the following: ORA-01545: rollback segment 'USERS_RS' specified not available Cannot open database if all rollback segments are not available. If the preceding error is displayed when attempting to open the database, proceed to Step 19. If not, return to Step 10. 16.7.21. Step 19: Recover Tablespace Containing Unavailable Rollback Segment
The first thing that must be determined is which tablespace the damaged rollback segment is in. Unfortunately, there is no fixed view that contains this information. That means that it will have to be discovered through common sense and deduction.
First, remember that this error is not displayed unless a datafile has been taken offline. To get a complete list of files that were taken offline, run the following command on a mounted, closed database: SQL> select TS#, name from v$datafile where status = 'OFFLINE' ; NAME -------------------------------------------------------------------------------- 5 /db/oracle/a/oradata/crash/test01.dbf 1 row selected. Now, find the name of the tablespace that contains this datafile: SQL> select name from v$tablespace where TS# = '5' ; NAME -------------------------------------------------------------------------------- TEST 1 row selected. Admittedly, the previous example was easy. There was only one datafile that was offline, which made finding its tablespace pretty easy. What if there were multiple datafiles contained within multiple tablespaces? How do you know which one contains the rollback segment? Unfortunately, there is no way to be sure while the database is closed. The rest of this step is simple. Restore any files that were taken offline, and use either the recover datafile or recover tablespace commands to roll them forward in time. If there are only one or two damaged datafiles, it's probably quicker to use the recover datafile command. If there are several damaged datafiles, especially if they are all in one tablespace, the recover tablespace command is probably easiest. Either way will work.
16.7.22. Step 20: Is the Current Online Log Damaged?
If the current online log group is damaged, you'll see a message like the following when you attempt to open the database: ORA-00313: open failed for members of log group 2 of thread 1 ORA-00312: online log 2 thread 1: '/db/Oracle/b/oradata/crash/redocrash02.log' ORA-00312: online log 2 thread 1: '/db/Oracle/a/oradata/crash/redocrash03.log' In the preceding example, a select group#, status from v$log command also would have shown that log group 2 was CURRENT at the time of failure. This is the worst kind of failure to have because there definitely will be data loss. That is because the current online log can be required to restart even a fully functioning database. If the current file contains redo needed to bring the instance online, you will not be able to open the database without a full restore and incomplete recovery. First, try opening the database with an alter database open resetlogs, which induces Oracle to recreate the online redo logs. If that works, you're done. If you cannot open the database with the resetlogs option, the only other step is to restore an older version of the control file. Unfortunately, you can't restore only the control file because the datafiles then would be more recent than the control file. The only remaining option is to restore the entire database.
16.7.23. Step 21: Restore and Recover All Database Files from Backup
Perform this step only after verifying (or rebuilding or restoring) the control files, and verifying that all members of the current online log group are damaged. This procedure is relatively easy. Simply determine the names and locations of all the datafiles, and restore them from their latest backup.
Again, if you're using rman, you can do all of this with two commands: RMAN> restore database ; RMAN> recover database ; If you're performing user-managed backup, you need to use your backup system to restore all datafiles. To determine the names of all the datafiles, run the following command on the mounted, closed database: SQL> select name from v$datafile ; Once all datafiles are restored, you should issue a media recovery against them. SQL> recover database ; Once all datafiles are restored and recovered, proceed to Step 22. 16.7.24. Step 22: Run alter database open resetlogs
The alter database open resetlogs command causes Oracle to open the database after clearing all contents of the online redo logfiles. Since there is no way to undo this step, it is a good idea to make copies of the online redo logfiles now. To find out all their names, run the following command on a mounted, closed database: SQL> select member from v$logfile ; To create an "undo" option, copy each file to <filename>.bak. After making a backup of the online redo logfiles, run the following command on a mounted, closed database: SQL> alter database open resetlogs ; Statement processed. If the database opens, congratulations!
Once that backup is completed, you're done! 16.7.25. Step 23: Is an Active Online Redo Log Damaged?
If an active online log group is damaged, you'll see a message like the following when attempting to open the database: ORA-00313: open failed for members of log group 2 of thread 1 ORA-00312: online log 2 thread 1: '/db/Oracle/b/oradata/crash/redocrash02.log' ORA-00312: online log 2 thread 1: '/db/Oracle/a/oradata/crash/redocrash03.log' In the preceding example, a select group#, status from v$log command also would have shown that log group 2 was active at the time of failure. Remember that an active log is one that is still needed for recovery. Depending on how the database crashed, there might still be data in the buffers. If you can get that data to flush to disk with a checkpoint, you can turn this active log into an inactive log and delete it. To perform a checkpoint, proceed to Step 24. If there are no damaged active online redo logs, proceed to Step 25. 16.7.26. Step 24: Perform a CheckpointThe way to attempt to recover from the scenario in Step 23 is to perform a checkpoint. If it is successful, the database should open successfully. To perform a checkpoint, issue the following command on the mounted, closed database: SQL> alter system checkpoint local ; Statement processed. Be patient. The reason that there is an active log group is probably that the checkpoint took a long time in the first place, and the database crashed during the checkpoint. Wait for Oracle to say that the checkpoint succeeded or failed. If it succeeded, Oracle will simply say, Statement processed. If it failed, there could be any number of Oracle errors. After issuing the checkpoint, even if it is unsuccessful, return to Step 10, and attempt to open the database. If this attempt fails, return to Step 21 and recover the entire database. 16.7.27. Step 25: Is an Inactive Online Redo Log Damaged?
If an inactive online log group is damaged, you'll see a message like this when attempting to open the database: ORA-00313: open failed for members of log group 2 of thread 1 ORA-00312: online log 2 thread 1: '/db/Oracle/b/oradata/crash/redocrash02.log' ORA-00312: online log 2 thread 1: '/db/Oracle/a/oradata/crash/redocrash03.log' In the preceding example, a select group#, status from v$log command also would show that log group 2 was inactive at the time of failure. By comparison, this one should be a breeze. An inactive log is not needed by Oracle. If the log group is not needed, simply drop it and create another to replace it. To drop and add an inactive log group, proceed to Step 26. 16.7.28. Step 26: Drop/Add a Damaged, Inactive Log Group
In all the previous examples, the damaged log group was group 2. Before we drop that group, we should make sure that we can add it back easily. Ensure that all the original redo log locations are still valid. To do this, get the names of all the members of that log group: SQL> select member from v$logfile where GROUP# = 2 ; For this example, Oracle returned the values: /logs1/redolog01.dbf /logs2/redolog01.dbf /logs3/redolog01.dbf Verify that the locations of all these files are still valid. For this example, assume that /logs3 is completely destroyed, and we are relocating all its contents to /logs4. Therefore, the members of log group 2 will be /logs1/redolog01.dbf, /logs2/redolog01.db, and /logs4/redolog01.dbf. To drop log group 2, issue the following command on a mounted, closed database: SQL> alter database drop logfile group 2 ; Once that command completes successfully, add the log group back to the database. To do this, issue the following command (remember that we have replaced /logs3/redolog01.dbf with /logs4/redolog01.dbf ): SQL> alter database add logfile group 2 ('/logs1/redolog01.dbf', ' /logs2/redolog01.dbf', '/logs4/redolog01.dbf') size 500K ; Statement processed. Once this command completes successfully, return to Step 10 and attempt to open the database. 16.7.29. You're Done!If you've made it this far, you're done! All datafiles, control files, and logfiles should be online. Take a backup of the entire database immediately, preferably a cold one with the database down. If that can't be done, then perform a hot backup. |