17.9. Sybase Recovery ProcedureThe first step to recover from a database problem is diagnosing exactly what is wrong with the database. This section provides step-by-step directions for diagnosing and repairing server problems. Here are a few important things to keep in mind when working on a Sybase server:
17.9.1. Step 1: Can You Connect to Your Server Using isql?The first thing you should try after someone has reported a server problem is to connect to your database server using a utility such as isql. If you succeed, go to Step 2. If you fail, go to Step 5. 17.9.2. Step 2: Run the Stored Procedure sp_whoOnce you log in to the database using isql, you should run the stored procedure sp_who. This procedure tells you what users are doing on your server and identifies blocked processes.
If you have any blocked processes, go to Step 3. If you see any processes in LOG SUSPEND state, go to Step 4. If everything looks OK, you can also run the stored procedure sp_lock to do further diagnostics or sp_helplogin loginname to check whether the user is locked out because of security policy (such as too many password attempts). If sp_who shows no obvious problems and you still think something is wrong with your server, go to Step 6, and check the server error log. 17.9.3. Step 3: Blocked ProcessesIf Step 2 revealed that some of your server processes are blocked, you must decide whether there is a problem that requires you to terminate one of them. If you find that a query is blocking other queries, you can kill it by terminating the client program or by using the Sybase T-SQL command kill. 17.9.4. Step 4: Log SuspendIf sp_who shows a process in LOG SUSPEND state, one of your databases has a full transaction log. You need to either manually truncate or extend the log, or kill the process that filled the log. Details on how to proceed are explained in the "Transaction Log " section earlier in this chapter. 17.9.5. Step 5: You Can't Connect Using isqlIf you can't connect to your server using a utility such as isql, you should log in to the system that runs your server (using terminal services or telnet) and go to Step 6. 17.9.6. Step 6: Check the Sybase Server Error LogWhen you have a major problem, it is always useful to check the server's error log. The error log is not always in the same location; it is essentially set in the run file that was used to start your database. But usually the server's error log is in the file $SYBASE/$SYBASE_ASE/install/<SERVER>.log. Review this file, which is in chronological order, for errors.
17.9.7. Step 7: Check Whether Your Server Is RunningUse Procedure 2 to see whether your server is running. If it is, but you cannot connect to it from a remote client (Step 5), go to Step 8. Otherwise, go to Step 9. 17.9.8. Step 8: Running Server but Can't Connect RemotelyAttempt to connect to the server from the system that runs that server. If you cannot, double-check that your environment variables are set correctly; look at the SYBASE.sh file that is in the $SYBASE directory. If you still cannot connect to the server, but the server is running on your system, follow the procedure to shut down the server, and then move to Step 9. 17.9.9. Step 9: Restart Your ServerA full explanation of how to restart your server can be found earlier in this chapter in the section "Procedure 1: How to Start Sybase." The following commands are a quick summary: $ cd $SYBASE/$SYBASE_ASE/install $ ./startserver f run_MYSYBASESERVER Look for the message recovery is complete near the end of a successful recovery process. If the server starts cleanly, you are done. If not, go to Step 10. 17.9.10. Step 10: Startup FailureIf your server does not start, you have a serious problem. Try the following steps; perhaps there is a simple solution to the problem. 17.9.11. Step 11: Contact Sybase Support ImmediatelyIf you have a Sybase support contract, you should call the minute a problem on your production server appears to be serious. Sybase support is excellent and should not be considered optional if your data is critical. It can, however, take at least 30 minutes to get a return phone call even for a priority 1 production issue, so you should call them the minute you realize you have a serious issue. 17.9.12. Step 12: Able to Get Shared Memory?Sybase uses shared memory extensively for interprocess communication and caching data pages for quick access. If Sybase is prevented from acquiring the minimum shared memory it needs, it displays an error message like the following: 00:2006/03/21 23:39:02.78 kernel os_create_region: can't allocate 2147479552 bytes 00:2006/03/21 23:39:02.80 kernel kbcreate: couldn't create kernel region. 00:2006/03/21 23:39:02.80 kernel kistartup: could not create shared memory If your server does not start with a short error message like this, you are having a problem allocating your shared memory. There are a couple of explanations for this. The most obvious reason is that the memory has not been freed up from the last server shutdown. The fix for this is to remove a file <SERVERNAME>.krg that can be found in $SYBASE/$SYBASE_ASE. Remove the <SERVERNAME>.krg file, and try to restart the server again. If this does not work, find out if someone changed the memory configuration of the system or if you changed the amount of memory required by the server at startup. Figure out the amount of memory on your system, and compare it to the amount of memory that the server tried to allocate (2,147,479,552 bytes in the previous error message). If the server is trying to allocate too much memory, that value can be changed in the configuration file <SERVERNAME>.cfg. Finally, Sybase requires that you set a few options in /etc/system. These arguments define the maximum shared memory that the server can allocate. If you have not set these values (first-time install), you should check out the Sybase installation guide for your platform. If you have changed the amount of memory available on your system, the /etc/system values may need to be updated. You can view the shared memory being used using the ipcs command. Check the manpages for the correct usage of the command, but the output should look something like this: ------ Shared Memory Segments -------- shmid owner perms bytes nattch status 129 sybase 600 11964416 1 2 sybase 666 1024 3 56 curtis 606 33334342 3 Here, the curtis process has also taken some of the shared memory for itself. By stopping this process, the shared memory should be freed up. If stopping the process does not free up the memory, it can be freed using the Unix command ipcrm. Again, check the Unix manpages for more information on this command on your operating system.
17.9.13. Step 13: Master Device FailureIf your server does not start because something is wrong with the master database, you will get the following error message during startup: 00:2006/03/22 00:53:48.44 kernel kdconfig: unable to read primary master device 00:2006/03/22 00:53:48.44 kernel kiconfig: read of config block failed The primary master device contains the master database. This message indicates that the dataserver cannot read the master file. You should check the file permissions and ensure that the sybase account can access the file (you are starting the server as sybase, right?). The master device can be readily found by looking at the run_FILE. The Sybase dataserver executable identifies its master device with the d flag. If there are no problems accessing the master data device, the master database might be corrupted. This is an unlikely but fatal scenario; the server will not start if the master device is corrupt. Master device failure is a special case for the Sybase recovery procedure. If this occurs, you should immediately contact Sybase support if you have a maintenance plan. You should also immediately go to the online Sybase troubleshooting guide and look at the section on recovering a failed master device from backups. It gives detailed step-by-step procedures for this scenario. One main reason we do not store user tables in the master database is the critical nature of this database. Because we do not store transactional tables in master, this database should be stable and not change much. The master database is also generally a very small database. The online Sybase Troubleshooting Guide provides recovery steps for corrupt master devices. Basically, you will be given a procedure to recover the system from a backup of master and a separate procedure to recreate the server from scratch. The first procedure is much superior because you will not need to recreate and reload any of your databases. 17.9.14. Step 14: Disk Device FailureAs mentioned earlier, disk failure is the most common problem you will encounter. You should use disk mirroring either at the operating system level (preferred) or at the database level to protect yourself from this problem. If your dataserver error log indicates the failure of a disk device, follow these procedures to check OS device problems, fix them, and maybe replace them. You should first check ownership and permissions on devices. The Sybase account should have read and write access to all disk devices and raw partitions that it needs. Check the specific file listed in the error log. You should also check the system log for device failure messages. Sometimes, because of a change in the system, the operating system might no longer "see" the device. There might be hardware failures or configuration errors that could cause the device not to appear. Use the appropriate OS procedure to check the device's status, and review all error logs. If the disk has gone bad, you have no choice but to rebuild the databases that it contained and reload them from backups. You need a configuration audit to do this. 17.9.14.1. Get a list of the databases that failed to loadThe first thing you need is a list of the bad devices. Your server startup messages list the devices that did not start. It will also list the databases that did not start. Use the T-SQL statement in Example 17-6 to list databases that use a particular disk device. Example 17-6. Locating databases that use a particular device
17.9.14.2. Check your available free spaceIf you lose a 32 GB disk device, you should ensure that you have at least 32 GB of disk space that has been assigned to the server but not allocated to a database. You can use the sp_helpdevice procedure to do this. If you do not have sufficient disk space available to replace the disk that went bad, you need to replace your disk and reassign it with the disk init command. To drop a device, you can run the sp_dropdevice procedure. If you have sufficient disk space, you can skip this step. If not, you will need to add a device to replace the bad disk. The disk init command is used for this. If the original disk's physical name is /dev/dsk/c0t2d1s0 with a disk init command of: disk init name="BusDev1, " physname="/dev/dsk/c0t2d1s0, " vdevno=6, size=2048 the new command using the replacement device /dev/dsk/c1t3d0s1 is: disk init name="BusDev1, " physname="/dev/dsk/c1t3d0s1, " vdevno=10, size=2048 Once the device has been recreated, all the databases that were using that device need to be restored. 17.9.14.3. Get database recreation informationAt this point, you will need to get the information necessary to recreate your database. The commands in Example 17-7 will give you the file allocations used by this database. Example 17-7. Database allocations
Most database tools can reverse-engineer the exact SQL statement that you will need to recreate your databases. If you do not have a tool that does this, you can use the stored procedure sp_ _revdb from www.edbarlow.com. 17.9.14.4. Drop the broken databaseIf your database has a fatal error, you may need to drop the database using this T-SQL command: $ drop database baddbname It is possible that this command will fail. In that case, you must use dbcc to drop the database: $ dbcc repairdb(dropdb,baddbname) To verify the database has been dropped, run the stored procedure sp_helpdb. 17.9.14.5. Recreate the databaseUsing the database allocations determined earlier, recreate the database using the same allocations it had. Here is an example based upon the preceding example output: create database baddbname on device1 = 3 log on logdev1 = 2 alter database baddbname on device3 = 4 17.9.14.6. Reload your databaseNow reload the database using the most recent database and transaction dumps. First, apply the full database backup. For our example database, here is an example load from /dumps/baddbname.dmp: $ load database baddbname from '/dumps/baddbname.dmp' After this completes, apply each transaction log starting with the oldest and finishing with the newest. The system will not allow transaction logs to be loaded out of order. In fact, if any of the logs are missing or corrupt, the rest of the logs cannot be applied. To load the transaction logs for the preceding example, enter the following command, repeating it for each transaction dump: $ load transaction baddbname from '/dumps/baddbname_trn.dmp' For more information on how to load dumps and transaction logs, please refer to the section "Restoring from a Hot Backup," earlier in this chapter. 17.9.14.7. Bring the database onlineAt this point, the database has been recreated, but the system will not bring it online until it is told to. The reason the system does this is that it has no way of knowing if there are any more transaction logs to process. The database should be brought online with the online database baddbname command.
|