Daily Procedures | Solaris System Management (New Riders Professional Library)

The system manager has primary responsibility for providing a continued level of service. Therefore, he must ensure that certain essential administration tasks are carried out. Some of these tasks need to be done only once a day, while others must be carried out at regular intervals throughout the day. The purpose of this section is to identify what needs to be done to provide the best possible service to the customers.

Some system managers like to use a checklist approach in which the system administrator manually completes a list of required actions and enters a confirmation either on a hard copy or into a file on the system. This approach is perfect for junior system administrators in the process of learning how to do the job, but it is less appropriate for a senior system administrator, who might resent the implication that he is not trusted to do his job properly. The system manager must make a judgment call on how this is to be handled. My own preference is to use the mandatory checklist only for junior administrators in the first few months because it reinforces and reminds them of the tasks that need to be carried out, and it helps with their own development.

It is worth restating here that the system manager does not expect to have to do the system administrator's job, but he does have to possess a good knowledge of what is involved in keeping the systems running. Of course, the system manager also is responsible for the administrator from a personnel perspective as well as a technical one and will no doubt have to assess his performance at least on an annual basis. Unless he knows what his staff is doing (or is meant to be doing), it is difficult to accurately assess performance.

What Must Be Done?

The following is a list of the tasks that must be carried out. The system manager, together with the system administrator(s), should review the daily procedures on a regular basis to ensure that everything that needs to be done is being done. To some, this may seem like stating the obvious, and sometimes it is. However, the system manager is ultimately responsible, and with companies relying so heavily on the continued availability and performance of the IT systems, he must be sure that they are being monitored in a professional and efficient manner:

Check the systems: The first thing that must be done is to determine the current status of all the systems, ensuring that they are all available and operating correctly. This task is the highest priority because a system that is down or in a hung state will seriously affect the business. A network management utility will likely be running in larger organizations, but it may not detect a struggling or underperforming system. The system messages file, /var/adm/messages, should be examined on each system to see if any system errors have occurred. This check is invaluable because it can highlight the beginning of a problem, such as a memory module giving intermittent (and correctable) errors. In this instance, the module can be replaced before the error has any detrimental effect on the performance or availability of the system.

Of course, most organizations running high-availability environments make use of a monitoring or management tool, such as Solstice Site or Domain Manager (discussed in Chapter 14, "Network Management Tools"), or, for smaller, less critical environments, a monitoring application such as Big Brother (discussed in Chapter 13, "Network Monitoring"). However, these applications still need to be monitored and occasionally modified to reflect changes in procedures or the environment. The downside, if there is one, is that too much reliance on automation products can gradually erode personal skills through lack of use. It doesn't hurt to do it manually once in a while; one day it might be necessary!
Deal with any help desk trouble tickets: See if any incidents have occurred that you should be aware of. This includes general incidents, such as network problems, that could adversely affect the systems and the processing being carried out on them.
Check application logs: An inspection of any application logs, especially for custom applications, is required to see if any errors have occurred. Tools such as Expect and Analog are freely available. They can scan output or very large log files in a fraction of the time a manual inspection would take. These tools also can flag any messages that might be of interest or that require addressing.
Examine the processes: This check looks particularly for "runaway processes," that is, processes that are constantly using up system resources and that have been disconnected from their owning processes. Good examples can be found in organizations that use terminal emulation sessions or client/server database applications. Runaway processes occur most frequently when a user simply turns off a machine at the end of the workday instead of logging out in the proper manner. The processes consume vast amounts of system resources and eventually slow the system to an unacceptable level. The top command is commonly used to monitor the state of processes on the system; it's particularly useful in this instance because its refreshing display highlights processes of this nature.
Check the backup logs: When it has been established that the systems are functioning correctly, check the backup logs to make sure that there aren't any errors that could affect the integrity of the backup. This task, although of very high importance, is normally done after the system status has been verified because the previous checks are relevant to the real-time operation of the system, whereas the backups have already run and involve historical information.
Examine security logs: Examine files such as /var/adm/sulog, /var/adm/loginlog, and sudolog if you're using sudo. These files report any attempts to gain access to the superuser account, repeated login attempts in which the password was not entered correctly, and all instances when the sudo utility was used, both successfully and unsuccessfully.
Monitor disk space: Keep a regular watch on disk space utilization. Space can run out when a process encounters a failure and continually writes to a log file. An entire disk can be filled up very quickly, sometimes in minutes. Again, a monitoring tool such as Big Brother can detect this and page the administrator. A management tool can also be configured to take some predefined action in an attempt to rectify the situation.
Monitor performance: Performance utilities such as vmstat and iostat should be run regularly to check that no performance issues could affect the business's capability to carry out its function. Performance monitoring is discussed in more detail later in this chapter in the section "Performance Monitoring."
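
The log checks in the list above lend themselves to a small script. The following Python sketch flags lines that contain common trouble keywords. The keyword pattern and the function names flag_lines and scan_file are illustrative assumptions, not something prescribed by this chapter, and the pattern would need tuning to the messages a given site actually produces.

```python
import re

# Hypothetical keyword pattern; adjust it to the messages your systems emit.
ALERT_PATTERN = re.compile(
    r"\b(?:errors?|warning|panic|fail(?:ed|ure)?)\b", re.IGNORECASE
)


def flag_lines(lines, pattern=ALERT_PATTERN):
    """Return (line_number, text) pairs for every line matching the pattern."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]


def scan_file(path="/var/adm/messages", pattern=ALERT_PATTERN):
    """Scan a log file on disk and return its flagged lines."""
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with open(path, errors="replace") as handle:
        return flag_lines(handle, pattern)
```

Calling scan_file() with no arguments reads /var/adm/messages; passing another path, such as an application log, reuses the same logic. A script like this supplements a manual read of the log and does not replace it, because it finds only the keywords it has been told to look for.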

This list is not exhaustive: Different organizations will have other priorities, depending on the nature of the business being carried out. However, this list does give a good general appreciation of the kind of things that must be done.
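
As one illustration of how an item from the list might be automated, the disk space check can be sketched in Python with the standard library's shutil.disk_usage. The 90 percent limit and the function names percent_used and over_threshold are assumptions made for this example. The figure computed here can differ slightly from what df reports because reserved blocks are counted differently.

```python
import shutil


def percent_used(path):
    """Return the percentage of space used on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo filesystems report a size of zero.
        return 0.0
    return 100.0 * usage.used / usage.total


def over_threshold(paths, limit=90.0):
    """Return {path: percent_used} for each path at or above the limit."""
    report = {}
    for path in paths:
        percent = percent_used(path)
        if percent >= limit:
            report[path] = round(percent, 1)
    return report
```

A call such as over_threshold(["/", "/var", "/export/home"]) returns an empty dictionary when every filesystem is below the limit, so a scheduled job could stay silent unless there is something to report. The mount points in that call are examples only.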
