Section 17.9. Sybase Recovery Procedure | Backup & Recovery: Inexpensive Backup Solutions for Open Systems

17.9. Sybase Recovery Procedure

The first step to recover from a database problem is diagnosing exactly what is wrong with the database. This section provides step-by-step directions for diagnosing and repairing server problems.

Here are a few important things to keep in mind when working on a Sybase server:

It may take some time to have a support call returned. If you have a support contract, contact Sybase technical support early in the process. Sybase technical support will call you back within an hour of placing a priority 1 call.
Take a deep breath before starting. Errors made during a disaster can have dramatic consequences.
The online Sybase manuals, particularly the Troubleshooting Guide, can be very useful.

17.9.1. Step 1: Can You Connect to Your Server Using isql?

The first thing you should try after someone has reported a server problem is to connect to your database server using a utility such as isql.

If you succeed, go to Step 2. If you fail, go to Step 5.

17.9.2. Step 2: Run the Stored Procedure sp_who

Once you log in to the database using isql, you should run the stored procedure sp_who. This procedure tells you what users are doing on your server and identifies blocked processes.

If you have loaded my extended system stored procedure library, you can use sp_ _whodo instead of sp_who. sp_ _whodo shows you only active processes and, more importantly, clearly shows you blocked processes. sp_who output shows one row per connected user; this can be a lot of data on large systems.

If you have any blocked processes, go to Step 3. If you see any processes in LOG SUSPEND state, go to Step 4. If everything looks OK, you can also run the stored procedure sp_lock to do further diagnostics or sp_helplogin loginname to check whether the user is locked out because of security policy (such as too many password attempts).

If sp_who shows no obvious problems and you still think something is wrong with your server, go to Step 6, and check the server error log.

17.9.3. Step 3: Blocked Processes

If Step 2 revealed that some of your server processes are blocked, you must decide whether there is a problem that requires you to terminate one of them. If you find that a query is blocking other queries, you can kill it by terminating the client program or by using the Sybase T-SQL command kill.

17.9.4. Step 4: Log Suspend

If sp_who shows a process in LOG SUSPEND state, one of your databases has a full transaction log. You need to either manually truncate or extend the log, or kill the process that filled the log. Details on how to proceed are explained in the "Transaction Log " section earlier in this chapter.

17.9.5. Step 5: You Can't Connect Using isql

If you can't connect to your server using a utility such as isql, you should log in to the system that runs your server (using terminal services or telnet) and go to Step 6.

17.9.6. Step 6: Check the Sybase Server Error Log

When you have a major problem, it is always useful to check the server's error log. The error log is not always in the same location; it is essentially set in the run file that was used to start your database. But usually the server's error log is in the file $SYBASE/$SYBASE_ASE/install/<SERVER>.log. Review this file, which is in chronological order, for errors.

Sybase errors usually have a number associated with them. If you have an error in your error log, you should look up the description of the error in the Sybase Troubleshooting Guide, which can be easily found online at http://sybooks.sybase.com. That guide has good explanations of each error and usually provides a step-by-step procedure to rectify the problem.

17.9.7. Step 7: Check Whether Your Server Is Running

Use Procedure 2 to see whether your server is running. If it is, but you cannot connect to it from a remote client (Step 5), go to Step 8. Otherwise, go to Step 9.

17.9.8. Step 8: Running Server but Can't Connect Remotely

Attempt to connect to the server from the system that runs that server. If you cannot, double-check that your environment variables are set correctly; look at the SYBASE.sh file that is in the $SYBASE directory. If you still cannot connect to the server, but the server is running on your system, follow the procedure to shut down the server, and then move to Step 9.

17.9.9. Step 9: Restart Your Server

A full explanation of how to restart your server can be found earlier in this chapter in the section "Procedure 1: How to Start Sybase." The following commands are a quick summary:

$ cd $SYBASE/$SYBASE_ASE/install $ ./startserver f run_MYSYBASESERVER

Look for the message recovery is complete near the end of a successful recovery process. If the server starts cleanly, you are done. If not, go to Step 10.

17.9.10. Step 10: Startup Failure

If your server does not start, you have a serious problem. Try the following steps; perhaps there is a simple solution to the problem.

17.9.11. Step 11: Contact Sybase Support Immediately

If you have a Sybase support contract, you should call the minute a problem on your production server appears to be serious. Sybase support is excellent and should not be considered optional if your data is critical. It can, however, take at least 30 minutes to get a return phone call even for a priority 1 production issue, so you should call them the minute you realize you have a serious issue.

17.9.12. Step 12: Able to Get Shared Memory?

Sybase uses shared memory extensively for interprocess communication and caching data pages for quick access. If Sybase is prevented from acquiring the minimum shared memory it needs, it displays an error message like the following:

00:2006/03/21 23:39:02.78 kernel  os_create_region: can't allocate 2147479552 bytes 00:2006/03/21 23:39:02.80 kernel  kbcreate: couldn't create kernel region. 00:2006/03/21 23:39:02.80 kernel  kistartup: could not create shared memory

If your server does not start with a short error message like this, you are having a problem allocating your shared memory. There are a couple of explanations for this. The most obvious reason is that the memory has not been freed up from the last server shutdown. The fix for this is to remove a file <SERVERNAME>.krg that can be found in $SYBASE/$SYBASE_ASE. Remove the <SERVERNAME>.krg file, and try to restart the server again.

If this does not work, find out if someone changed the memory configuration of the system or if you changed the amount of memory required by the server at startup. Figure out the amount of memory on your system, and compare it to the amount of memory that the server tried to allocate (2,147,479,552 bytes in the previous error message). If the server is trying to allocate too much memory, that value can be changed in the configuration file <SERVERNAME>.cfg.

Finally, Sybase requires that you set a few options in /etc/system. These arguments define the maximum shared memory that the server can allocate. If you have not set these values (first-time install), you should check out the Sybase installation guide for your platform. If you have changed the amount of memory available on your system, the /etc/system values may need to be updated.

You can view the shared memory being used using the ipcs command. Check the manpages for the correct usage of the command, but the output should look something like this:

------ Shared Memory Segments -------- shmid     owner     perms     bytes     nattch    status 129       sybase    600       11964416  1 2         sybase    666       1024      3 56        curtis    606       33334342  3

Here, the curtis process has also taken some of the shared memory for itself. By stopping this process, the shared memory should be freed up. If stopping the process does not free up the memory, it can be freed using the Unix command ipcrm. Again, check the Unix manpages for more information on this command on your operating system.

If possible, make one change to the configuration options at a time. Sybase configuration options interact, some causing the system to need more memory, others less. By making one change at a time, the configuration option that prevents the system from restarting is known and can be adjusted accordingly.

Once you have corrected these problems, return to Step 1.

17.9.13. Step 13: Master Device Failure

If your server does not start because something is wrong with the master database, you will get the following error message during startup:

00:2006/03/22 00:53:48.44 kernel  kdconfig: unable to read primary master device 00:2006/03/22 00:53:48.44 kernel  kiconfig: read of config block failed

The primary master device contains the master database. This message indicates that the dataserver cannot read the master file. You should check the file permissions and ensure that the sybase account can access the file (you are starting the server as sybase, right?). The master device can be readily found by looking at the run_FILE. The Sybase dataserver executable identifies its master device with the d flag.

If there are no problems accessing the master data device, the master database might be corrupted. This is an unlikely but fatal scenario; the server will not start if the master device is corrupt. Master device failure is a special case for the Sybase recovery procedure. If this occurs, you should immediately contact Sybase support if you have a maintenance plan. You should also immediately go to the online Sybase troubleshooting guide and look at the section on recovering a failed master device from backups. It gives detailed step-by-step procedures for this scenario.

One main reason we do not store user tables in the master database is the critical nature of this database. Because we do not store transactional tables in master, this database should be stable and not change much. The master database is also generally a very small database.

The online Sybase Troubleshooting Guide provides recovery steps for corrupt master devices. Basically, you will be given a procedure to recover the system from a backup of master and a separate procedure to recreate the server from scratch. The first procedure is much superior because you will not need to recreate and reload any of your databases.

17.9.14. Step 14: Disk Device Failure

As mentioned earlier, disk failure is the most common problem you will encounter. You should use disk mirroring either at the operating system level (preferred) or at the database level to protect yourself from this problem. If your dataserver error log indicates the failure of a disk device, follow these procedures to check OS device problems, fix them, and maybe replace them. You should first check ownership and permissions on devices. The Sybase account should have read and write access to all disk devices and raw partitions that it needs. Check the specific file listed in the error log. You should also check the system log for device failure messages. Sometimes, because of a change in the system, the operating system might no longer "see" the device. There might be hardware failures or configuration errors that could cause the device not to appear. Use the appropriate OS procedure to check the device's status, and review all error logs.

If the disk has gone bad, you have no choice but to rebuild the databases that it contained and reload them from backups. You need a configuration audit to do this.

17.9.14.1. Get a list of the databases that failed to load

The first thing you need is a list of the bad devices. Your server startup messages list the devices that did not start. It will also list the databases that did not start. Use the T-SQL statement in Example 17-6 to list databases that use a particular disk device.

Example 17-6. Locating databases that use a particular device

select sysdevices.name as DevName, sysdatabases.name as DBName, sysusages.size/512 as Size from sysdatabases, sysusages, sysdevices where sysdevices.name="BadDeviceName" and sysdevices.low <= sysusages.vstart and sysdevices.high >= sysusages.vstart and sysusages.dbid = sysdatabases.dbid Example Output: DevName                        DBName                         Size  ----------------------------- ------------------------------ -----------  BusDev1                        BillingDB                     3  BusDev1                        ClientDB                      2

17.9.14.2. Check your available free space

If you lose a 32 GB disk device, you should ensure that you have at least 32 GB of disk space that has been assigned to the server but not allocated to a database. You can use the sp_helpdevice procedure to do this. If you do not have sufficient disk space available to replace the disk that went bad, you need to replace your disk and reassign it with the disk init command.

To drop a device, you can run the sp_dropdevice procedure.

If you have sufficient disk space, you can skip this step. If not, you will need to add a device to replace the bad disk. The disk init command is used for this. If the original disk's physical name is /dev/dsk/c0t2d1s0 with a disk init command of:

disk init name="BusDev1, " physname="/dev/dsk/c0t2d1s0, " vdevno=6, size=2048

the new command using the replacement device /dev/dsk/c1t3d0s1 is:

disk init name="BusDev1, " physname="/dev/dsk/c1t3d0s1, " vdevno=10, size=2048

Once the device has been recreated, all the databases that were using that device need to be restored.

17.9.14.3. Get database recreation information

At this point, you will need to get the information necessary to recreate your database. The commands in Example 17-7 will give you the file allocations used by this database.

Example 17-7. Database allocations

select sysdevices.name,  size as Blocks,  size/512 as Mbytes from sysusages, sysdevices, sysdatabases where sysdatabases.name = "dbname" and sysusages.dbid = sysdatabases.dbid and sysdevices.low <= sysusages.vstart and sysdevices.high >= sysusages.vstart and sysdevices.cntrltype = 0 order by vstart Example Output: name                   Blocks     Mbytes ---------------------- ---------- --------- device1                1536       3 logdev1                1024       2 device3                2048       4

Most database tools can reverse-engineer the exact SQL statement that you will need to recreate your databases. If you do not have a tool that does this, you can use the stored procedure sp_ _revdb from www.edbarlow.com.

17.9.14.4. Drop the broken database

If your database has a fatal error, you may need to drop the database using this T-SQL command:

$ drop database baddbname

It is possible that this command will fail. In that case, you must use dbcc to drop the database:

$ dbcc repairdb(dropdb,baddbname)

To verify the database has been dropped, run the stored procedure sp_helpdb.

17.9.14.5. Recreate the database

Using the database allocations determined earlier, recreate the database using the same allocations it had. Here is an example based upon the preceding example output:

create database baddbname on device1 = 3 log on logdev1 = 2 alter database baddbname on device3 = 4

17.9.14.6. Reload your database

Now reload the database using the most recent database and transaction dumps. First, apply the full database backup. For our example database, here is an example load from /dumps/baddbname.dmp:

$ load database baddbname from '/dumps/baddbname.dmp'

After this completes, apply each transaction log starting with the oldest and finishing with the newest. The system will not allow transaction logs to be loaded out of order. In fact, if any of the logs are missing or corrupt, the rest of the logs cannot be applied. To load the transaction logs for the preceding example, enter the following command, repeating it for each transaction dump:

$ load transaction baddbname from '/dumps/baddbname_trn.dmp'

For more information on how to load dumps and transaction logs, please refer to the section "Restoring from a Hot Backup," earlier in this chapter.

17.9.14.7. Bring the database online

At this point, the database has been recreated, but the system will not bring it online until it is told to. The reason the system does this is that it has no way of knowing if there are any more transaction logs to process. The database should be brought online with the online database baddbname command.

BackupCentral.com has a wiki page for every chapter in this book. Read or contribute updated information about this chapter at http://www.backupcentral.com.