
Is the system still usable?



If the system reboots itself after a panic, chances are good that the system will be usable, if only for a short while. Some panics and crashes show up once in a blue moon, whereas others, once encountered, increase in frequency. It all depends on the nature of the crash.

Assuming that the system is usable for now, you can use the system to analyze the savecore files that are awaiting you in the savecore directory.

If your system is one that serves several users, whether directly or indirectly as a data server, you may want to notify your user base that the system may be going down unexpectedly in the near future. Although not the best of news, it does give the users the option of backing up their work more frequently. For the moment, though, assume that the system is usable.

If you have not backed up your file systems recently, now would be a good time! However, just to be extra safe, use a different set of tapes, in case damage has already been done and you need to revert to the prior set of backups.


Turn off savecore? (How many dumps will you need?)

Once an image of a system crash has been captured, you need to again assess how you are doing on disk space. Do you have room for a subsequent set of savecore files should the system crash again? If not, you might want to move the files to another file system for analysis, clearing up space for the next crash. If you don't plan to analyze the files yourself, archive them to tape as soon as possible and free up the disk space.
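The disk space question above can be answered with du and df. The following is a minimal sketch, not a standard tool: has_room and check_dump_space are hypothetical helper names, the directory is whatever savecore directory you use, and the sketch assumes that the next crash will need about as much space as the files already there and that df -k prints the usual six-column layout.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the available kilobytes ($1)
# cover the needed kilobytes ($2).
has_room() {
    [ "$1" -ge "$2" ]
}

# Hypothetical helper: compare the space used by the files in a savecore
# directory with the free space left on the file system that holds it.
# Assumption: a second crash needs roughly as much room as the first.
check_dump_space() {
    dir=$1
    # Kilobytes used by everything currently in the directory
    needed_kb=$(du -sk "$dir" | awk '{ print $1 }')
    # Kilobytes available; assumes "avail" is the third field from the end
    avail_kb=$(df -k "$dir" | awk 'NR > 1 { avail = $(NF-2) } END { print avail }')
    if has_room "$avail_kb" "$needed_kb"; then
        echo "ok: $avail_kb KB free, current files use $needed_kb KB"
    else
        echo "low: $avail_kb KB free, current files use $needed_kb KB"
        return 1
    fi
}
```

Run it as check_dump_space followed by the path of your savecore directory. A "low" result suggests moving or archiving the current files before the next crash.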

At this point, the second question to ask yourself is whether you really need another set of postmortem files. To answer this, consider the recent history of the system's performance. Has it been crashing a lot lately, and have you only just enabled savecore to capture one crash? Have the symptoms of the past crashes been reliably predictable?

For example, if your system crashes only when you boot a certain kernel, you probably need only the one set of savecore files and can disable savecore for the time being. If, however, your system has never crashed before, it would be wise to keep savecore enabled for now. It is often a good idea to have at least two sets of crash files for comparison.

Generally speaking, we feel savecore should always be enabled and ready to go in case the worst happens and your system decides to panic.

If you choose to maintain the savecore files on disk, use the UNIX compress command to squeeze them down to a smaller size. This will gain you some disk space. If you've never used compress before, here's an example that might convince you of its worth. The following savecore files are from a large Sun SPARCcenter 2000 server.

Figure 4-1 Compress your savecore files to save disk space

Hiya... ls -l
total 268154
-rw-rw-rw-  1 kbrown   15       1272308 Sep  1 12:28 unix.0
-rw-rw-rw-  1 kbrown   15     135077888 Sep  1 12:29 vmcore.0
Hiya... compress unix.0 vmcore.0
Hiya... ls -l
total 51082
-rw-rw-rw-  1 kbrown   15        669336 Sep  1 12:28 unix.0.Z
-rw-rw-rw-  1 kbrown   15      24592643 Sep  1 12:29 vmcore.0.Z
Hiya...

The 135-megabyte vmcore.0 file compressed to less than 25 megabytes, a huge saving!
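To put a number on the saving, you can divide the compressed size by the original size. The sketch below does the arithmetic with awk, using the byte counts from the ls -l listings in Figure 4-1. The function name percent_of_original is made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the compressed size ($2) as a percentage
# of the original size ($1), to one decimal place.
percent_of_original() {
    echo "$1 $2" | awk '{ printf "%.1f\n", 100 * $2 / $1 }'
}

# Byte counts copied from Figure 4-1
percent_of_original 135077888 24592643   # vmcore.0 versus vmcore.0.Z
percent_of_original 1272308 669336       # unix.0 versus unix.0.Z
```

For these files, vmcore.0.Z comes to roughly 18 percent of the original and unix.0.Z to roughly 53 percent. Memory images tend to compress much better than kernel binaries, although how well any given dump compresses will vary.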


Saving the crash to tape for shipment or archives

When archiving a set of crash dumps to tape, you may wish to first compress the (vm)unix.X and vmcore.X files, again, to use less media space. This also makes life a bit easier for the person who will later read the tape onto his own system to analyze the files, initially allowing him to use less disk space until he is ready to uncompress the files and start the analysis work.

When compressing the files, please use the standard UNIX compress command instead of your favorite public domain or third-party compression utilities. Don't assume that the person to whom you are sending the tapes uses nonstandard programs.

After writing the files to tape, write-protect the tape and, only then, verify that you can read the tape successfully. Too many potentially valuable system crash files have been lost due to faulty tapes!
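The write-then-verify sequence might be scripted along these lines. This is a sketch under assumptions: it uses tar as the archive format, archive_and_verify is a hypothetical helper name, and the tape device you pass in (for example /dev/rmt/0 on many Solaris 2 systems) depends on your hardware. Write-protecting the tape is a physical step that no script can do for you.

```shell
#!/bin/sh
# Hypothetical helper: write the named files to a tar archive, then read
# the archive back and check that every file is listed.
# Usage: archive_and_verify DESTINATION FILE...
# DESTINATION may be a tape device or, for a dry run, an ordinary file.
archive_and_verify() {
    dest=$1
    shift
    tar cf "$dest" "$@" || return 1

    # With a real tape, write-protect the cartridge at this point,
    # before reading it back.

    # Count the members that can be read back from the archive
    listed=$(tar tf "$dest" | wc -l | tr -d ' ')
    if [ "$listed" -eq "$#" ]; then
        echo "verified: $listed files readable from $dest"
    else
        echo "MISMATCH: wrote $# files, read back $listed" >&2
        return 1
    fi
}
```

A listing only proves that the tape can be read and that the file names are there. If you want a stronger check, extract the files into a scratch directory and compare them with the originals using cmp.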

Finally, label the tape!

Once the savecore files are safely archived, you can remove them from the disk. In general, it is a good idea to maintain the bounds file, which contains the next sequence number to use. Not only does it help provide a history of how many crashes have been captured, but it helps prevent you from ending up with a dozen vmcore.0 files on tapes over time. It also, again, makes life just a bit easier for the person you send your crashes to for analysis. He won't have to keep shuffling things around to avoid overwriting the previous crash that had the same sequence number and thus the same file names.
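To see why the bounds file matters, here is a small sketch that reports which file names the next crash would receive. The helper name next_dump_names is made up, the unix.N naming follows Figure 4-1, and the fallback to 0 when bounds is missing is an assumption based on the behavior described above.

```shell
#!/bin/sh
# Hypothetical helper: print the file names the next set of savecore
# files would get, based on the bounds file in the given directory.
next_dump_names() {
    dir=$1
    # bounds holds the next sequence number to use
    seq=$(cat "$dir/bounds" 2>/dev/null)
    # Assumption: with no bounds file, numbering starts over at 0
    seq=${seq:-0}
    echo "unix.$seq vmcore.$seq"
}
```

With a bounds file containing 3, the helper prints unix.3 vmcore.3. Remove bounds and it falls back to unix.0 vmcore.0, which is exactly how duplicate names end up on your tapes.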

If you plan to send the tape to another person for analysis, it is best to provide the following information:

  • System activity as best known at the time of the crash.

  • A brief description of the crash history for this system from the system administrator's point of view.

  • The system configuration and tuning files. From a Solaris 1 system, provide the kernel configuration file and the param.c file. From Solaris 2 systems, provide the /etc/system file.

  • A list of software modifications and installed patches. From a Solaris 2 system, provide showrev -p output.

  • General system and network information, including:

    • Hardware configuration. From a Solaris 1 system, devinfo -vp output is helpful. From a Solaris 2 system, provide prtconf -vp output.

    • List of third-party drivers and applications.

    • Network-based server and client relationships.

  • The /var/adm/messages* files.
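Much of the list above can be gathered by a short script. The following sketch covers the Solaris 2 items only and is not a standard tool: collect_crash_info, the output file names, and the NOTES template are all invented for this illustration. Commands that are not installed are skipped, so the script does no harm on other systems.

```shell
#!/bin/sh
# Hypothetical helper: gather supporting information for a crash analyst
# into one directory. Covers the Solaris 2 items from the list above.
collect_crash_info() {
    out=$1
    mkdir -p "$out" || return 1

    # System configuration and tuning file
    if [ -f /etc/system ]; then
        cp /etc/system "$out/etc_system"
    fi

    # Installed patches
    if command -v showrev >/dev/null 2>&1; then
        showrev -p > "$out/showrev-p.txt" 2>&1
    fi

    # Hardware configuration
    if command -v prtconf >/dev/null 2>&1; then
        prtconf -vp > "$out/prtconf-vp.txt" 2>&1
    fi

    # System message logs
    for f in /var/adm/messages*; do
        if [ -f "$f" ]; then
            cp "$f" "$out/"
        fi
    done

    # Items only the administrator can supply; fill these in by hand
    cat > "$out/NOTES" <<'EOF'
System activity at the time of the crash:
Crash history for this system:
Third-party drivers and applications:
Network-based server and client relationships:
EOF
    return 0
}
```

Call it with a scratch directory name, fill in the NOTES file, and write the directory to the same tape as the crash files.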

The more information you can provide to the person who will analyze the crash files, the better idea he will have of where to start his search for the cause of the problem.