Chapter 4. Hey! We Got One!

Whether you were expecting it or not, you've discovered that your system has panicked. If all went well, savecore did its job and there are now system crash dump files in the savecore directory for you to analyze. However, not every crash goes so well.

Before we move on to analyzing postmortem files, let's discuss a few other issues regarding system crashes.


What to do when your system has crashed

Depending on the cause of the system crash, the system may not have been able to reboot itself successfully. Cases where this would be true include:

  • Catastrophic hardware failure, such as faulty memory or a crashed disk

  • Major kernel configuration faults, such as a buggy device driver

  • Major kernel tuning errors, such as maxusers being much too big

  • Data corruption, including corruption of the operating system files

  • A need for manual intervention, for example, fsck waiting for answers to its queries

Was the system recently tuned?

If you just tuned your system, tried to reboot under the new kernel, and the system panicked, you already have a good idea where to start your search for the cause of the panic. If you named your new, untested kernel /vmunix on your Solaris 1 system, or if you directly edited /etc/system on your Solaris 2 system, you will most likely find the system in an endless boot and panic loop. Rebooting the "generic" kernel for Solaris 1 will get the system back up. For a Solaris 2 system in this scenario, you can use boot -a and choose /dev/null as your /etc/system file to return to a generic kernel.

When tuning systems and testing the new kernel changes, it's a good idea not to use /vmunix or /etc/system until you know the changes are good. Instead, use /vmunix.test or /etc/system.test, for example. That way, should the system panic, it will have a better chance of coming back up under a known good kernel. This is particularly sound advice if you are planning on going on vacation right after tuning a new kernel and booting it up.
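
Before booting with a test file, it can help to see exactly which tunables differ from the known good copy. The following Python is a minimal sketch of that comparison, not a tool from this book. It assumes the common /etc/system conventions that lines beginning with an asterisk are comments and that tuning lines have the form set name=value; other directives are ignored. The function names and the sample values are hypothetical.

```python
def parse_system_settings(text):
    """Return {name: value} for each `set name=value` line in /etc/system-style text."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines and lines starting with '*' (comments) carry no settings.
        if not line or line.startswith("*"):
            continue
        parts = line.split(None, 1)
        # Only the simple `set name=value` form is handled in this sketch.
        if len(parts) == 2 and parts[0] == "set" and "=" in parts[1]:
            name, value = parts[1].split("=", 1)
            settings[name.strip()] = value.strip()
    return settings


def diff_settings(known_good_text, candidate_text):
    """List (name, old_value, new_value) for every setting that differs."""
    old = parse_system_settings(known_good_text)
    new = parse_system_settings(candidate_text)
    changes = []
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changes.append((name, old.get(name), new.get(name)))
    return changes


if __name__ == "__main__":
    # Illustrative contents only; real files would be read from disk.
    good = "* known good settings\nset maxusers=64\n"
    test = "* experimental settings\nset maxusers=2048\nset shmsys:shminfo_shmmax=8388608\n"
    for name, before, after in diff_settings(good, test):
        print(f"{name}: {before} -> {after}")
```

A value of None on either side means the setting is present in only one of the two files, which is often the first thing to check when a freshly tuned kernel panics.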

Has anything else changed recently?

If the system had been running beautifully for the past year, suddenly died, and now won't come back up, you will need to read the messages that appear during the boot attempts. Look for messages that might point to hardware trouble. It would be a good idea to check all of the cables for proper connections. Also, make sure all the disk drives and other peripherals are still getting power. If everything seems to be in order, attempt to run diagnostics on the hardware.
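
If you can capture the console output from a failed boot attempt into a text file, a quick filter can help pick out the lines worth reading first. The Python below is a rough sketch under that assumption. The keyword list is hypothetical and should be adjusted to the messages your own hardware and drivers print, and the sample text in the usage section is invented for illustration rather than taken from a real system.

```python
import re

# Hypothetical keywords that often accompany hardware trouble; tune for your site.
SUSPECT_PATTERNS = [
    r"parity",
    r"timeout",
    r"offline",
    r"not responding",
    r"i/o error",
    r"memory error",
]


def find_suspect_lines(log_text, patterns=SUSPECT_PATTERNS):
    """Return (line_number, line) pairs for lines matching any pattern, ignoring case."""
    regex = re.compile("|".join(patterns), re.IGNORECASE)
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if regex.search(line):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    # Invented sample text, not real console output.
    sample = (
        "boot ok\n"
        "sd3: disk not responding\n"
        "memory OK\n"
        "Parity error on board 1\n"
    )
    for number, line in find_suspect_lines(sample):
        print(f"line {number}: {line}")
```

A filter like this only narrows the search. Messages that match no keyword can still matter, so the full output should be read whenever the filtered lines do not explain the failure.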

On occasion, systems demonstrate sensitivity to their environment. With a workstation sitting on your desk next to your plants and your coffee mug, it's sometimes easy to forget that computers are ultrasensitive electronic devices. Always remember:

  • Proper air flow is required for cooling the electronic components.

  • If the environment is much too hot for you, it is probably also too hot for your computer. Power down your computer equipment if you expect the air-cooling systems in your area to be shut down.

  • Unless protected by an Uninterruptible Power Supply (UPS), your system can suffer damage during electrical storms and interruptions of power.

  • Dirt and dust inside some computers can lead to problems over time. Discuss with your vendor whether preventive maintenance visits are recommended.

  • Unless a system is designed to ruggedized standards, it can be damaged by high vibration and excessive movement.

  • Power down all components of the system whenever you need to do hardware repairs, replacements, or rearrangements. Don't, for example, change SCSI devices while the system is running.

  • Electrostatic discharge will easily damage your computer. Never touch or let anyone else touch the internal workings of your system without proper ESD protection.