Managing Outages

   


System and network outages can seriously affect the availability targets that are normally set as part of a service level agreement. The system manager must be as realistic as possible when agreeing to an acceptable level of availability, taking into account any scheduled maintenance outages that are essential to maintaining the computing environment. These outages might not involve the systems themselves; they could be for periodic maintenance on air conditioning units or the uninterruptible power supply (UPS), for example. On these occasions, it is sometimes necessary to shut down all the systems, although this can often be avoided when there is sufficient redundancy to allow operation to continue while part of the environment is disabled.

Scheduled Outages

Times will undoubtedly arise when a scheduled outage is needed, usually for a major upgrade or for work on the power supply. It is the responsibility of the system manager to coordinate the outage of any system under their control and to ensure that disruption to the operation of the business is kept to an absolute minimum. If there is a "quiet" time, such as the middle of the night or the weekend, the system manager should take advantage of it where possible.

The most important issue, though, is to keep the user community informed of any scheduled outages so that users can plan their activities around them. To this end, three facilities are available on Solaris systems: the "Message of the Day" file, email, and message broadcasting.

  • The Message of the Day (MOTD) file: Whenever a user logs in to a Solaris system, the contents of this file are displayed on the screen. Place notification of any upcoming scheduled outages clearly in this file so that users are kept fully informed. If the outage is cancelled for any reason, update the file to reflect the amended situation. The file is /etc/motd, and it can be edited using any normal editor, such as vi.

  • Email: Using email to inform users of forthcoming scheduled outages is less than ideal because there is no guarantee that every user will open the message, so some users might never know that an outage is planned. Treat email as just one additional means of contacting the user community, not as the only one.

  • Message broadcasting: In this instance, a message is sent to every user logged in to the system through the use of the wall command. The message flashes on the user's screen and causes an audible beep. This kind of message is normally used immediately before an outage, to give the logged-in users the opportunity to save their work and log out. Typically, a message might be sent 30 minutes before an outage, reminding everyone that the outage will shortly commence. Further messages 10 minutes and 5 minutes before the outage remind the users that they should be getting ready to log out. Finally, a message is sent informing users that the system is going down.
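
The reminder sequence described for message broadcasting (30, 10, and 5 minutes before the outage, then a final notice) can be sketched as a small schedule builder. This is only an illustration in Python, not a Solaris facility: the function name build_broadcast_schedule, the REMINDER_OFFSETS tuple, and the message wording are all assumptions made for the example. The sketch computes when each message would be due and what it might say; it does not send anything.

```python
from datetime import datetime, timedelta

# Minutes before the outage at which a reminder is sent. The values follow
# the typical sequence described in the text; 0 is the final notice.
REMINDER_OFFSETS = (30, 10, 5, 0)


def build_broadcast_schedule(outage_start, offsets=REMINDER_OFFSETS):
    """Return a list of (send_time, message) pairs, earliest first.

    Hypothetical helper for illustration only; the message wording is an
    assumption, not text produced by any Solaris tool.
    """
    schedule = []
    # Largest offset first, so the earliest send time comes first.
    for minutes in sorted(set(offsets), reverse=True):
        send_time = outage_start - timedelta(minutes=minutes)
        if minutes == 0:
            message = "The system is going down now. Log out immediately."
        else:
            message = (
                f"Scheduled outage begins in {minutes} minutes "
                f"(at {outage_start:%H:%M}). Save your work and log out."
            )
        schedule.append((send_time, message))
    return schedule


if __name__ == "__main__":
    outage = datetime(2001, 6, 2, 2, 0)  # example date and time only
    for send_time, message in build_broadcast_schedule(outage):
        print(f"{send_time:%Y-%m-%d %H:%M}  {message}")
```

Each message could then be delivered with the wall command, which reads the message text from standard input. Arranging for the sends to happen at the computed times (for example, with at or cron) is deliberately left out of the sketch.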

Unscheduled Outages

When the system is unavailable because of an unscheduled fault, it is difficult (if not impossible) to use the system itself to inform the user community of progress toward a resolution or of an estimated time of return to normal operation.

To combat this situation, many organizations employ other methods of providing essential information at such times:

  • Telephone answering machine: This option is suitable for smaller organizations where users can telephone a specific number to obtain information about the current status of the systems. The IT staff is responsible for keeping the recorded information up to date. If a serious outage occurs, the message should be updated every 5 or 10 minutes, even if there is nothing new to report, because this demonstrates a professional approach to keeping everyone informed.

  • Help desk updates: Inform the help desk that the problem has occurred, and keep the staff updated on progress and the estimated time of recovery. This avoids multiple trouble tickets being raised for the same problem.

  • Voice broadcast: Many companies have a voice broadcast system used for announcements. This facility can be used to convey information about a system outage to all users in the same building. Regular updates keep the user community informed of any progress or expected resolution time.

  • Tickertape system: Larger organizations frequently make use of tickertape systems, which can be centrally controlled and administered. Such a system can display messages to alert users to an incident, and the messages can be quickly updated to reflect an amended status. These systems are also useful for informing users of virus attacks, especially those spread through email attachments. The advantage is that the message can be left on display for extended periods, reminding users of the threat.

Of course, most users have a PC running a corporate email system and could probably be informed of a Solaris system outage by email. The methods listed above apply particularly when some users rely entirely on the Solaris systems to carry out their work.


   


Solaris System Management (New Riders Professional Library)
ISBN: 073571018X
Year: 2001
Pages: 101
Authors: John Philcox
