Using Standard Commands and Tools | UNIX Fault Management: A Guide for System Administrators


Many UNIX commands exist to check configuration, status, and resource information. These tools generally report on only a snapshot in time. You can write or use custom scripts that incorporate these or other commands and run them periodically so that you can track configuration changes or test the status of system resources over time.

The more commonly used commands are described in this section, organized alphabetically. You may also want to check the online man pages for additional information about each command. Unless otherwise noted, the commands listed in this section are available on multiple UNIX platforms. (Tools that are specific to networking, such as netstat and nfsstat, are discussed in Chapter 6.)

In addition to these commands, you may want to check the system log file, /var/adm/syslog/syslog.log, for error messages if your system is experiencing problems. Messages written to this log file include information regarding the module experiencing the problem and the time that the event occurred, which can be very valuable when troubleshooting.

bdf and df

The bdf and df commands are commonly used to show the amount of disk and swap space used and available. bdf -i reports the number of used and free filesystem structures (inodes) in the kernel.

By default, bdf shows information for all mounted filesystems. If this information is too lengthy, you can also run the command and specify a filesystem as a command-line option. An example is shown in Listing 4-1.
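If you run bdf periodically, you may want to filter its output for filesystems that are filling up. The following is a minimal sketch rather than a standard tool: the function name fs_over_threshold and the default limit of 90 percent are assumptions chosen for illustration. It reads bdf-style output on standard input and prints the mount point and percentage used for each filesystem at or above the limit.

```shell
# Hypothetical helper: report filesystems at or above a usage limit.
# Reads bdf-style output (Filesystem kbytes used avail %used Mounted on)
# from standard input. The 90 percent default is an assumed value.
fs_over_threshold() {
    limit=${1:-90}
    awk -v limit="$limit" '
        NR == 1 { next }        # skip the header line
        NF < 5  { next }        # skip a device name printed alone on a wrapped entry
        {
            pct = $(NF - 1)     # %used is always the second-to-last field
            sub(/%/, "", pct)
            if (pct + 0 >= limit + 0) print $NF, pct "%"
        }'
}
```

For example, bdf | fs_over_threshold 80 would list only the filesystems that are at least 80 percent full.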

ioscan

The ioscan command is used to discover and display the system hardware, usable I/O system devices, or kernel I/O system data structures. The results displayed list the default hardware path to the device, the class of hardware, and a brief description. ioscan includes information on the following hardware: processors, memory, network interface cards, and I/O devices. Listing 4-2 shows how you can check the number of processors on your system by using ioscan.

ioscan is a good tool to use to get a complete picture of your system hardware layout. It reports the status of the installed software, indicating whether the proper drivers are loaded. By storing the command output in files, you can maintain a history of the hardware configuration changes to your system.
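Storing command output so that changes can be tracked might be scripted along the following lines. This is only a sketch: the function name snapshot_and_diff, the SNAPSHOT_DIR variable, and the default directory are all assumptions made for illustration, and only the two most recent snapshots are kept.

```shell
# Hypothetical helper: save the output of a command and compare it with the
# previous snapshot. The directory and file names are assumptions.
# Usage: snapshot_and_diff <name> <command> [args...]
snapshot_and_diff() {
    name=$1
    shift
    dir=${SNAPSHOT_DIR:-/var/tmp/config-history}
    mkdir -p "$dir" || return 2
    current="$dir/$name.current"
    previous="$dir/$name.previous"
    # Keep the last snapshot so that there is something to compare against.
    [ -f "$current" ] && mv "$current" "$previous"
    "$@" > "$current" 2>&1
    if [ -f "$previous" ]; then
        # diff prints the changed lines and exits with status 1 if any exist.
        diff "$previous" "$current"
    fi
}
```

Running snapshot_and_diff ioscan ioscan at regular intervals would then print nothing while the hardware configuration is unchanged, and the differing lines when it changes.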

iostat

iostat reports CPU statistics and I/O statistics for disks and terminals. For disks, it lists the device name, number of bytes transferred per second (bps), number of seeks per second (sps), and milliseconds per average seek (msps). For terminals, it shows the number of characters read and the number of characters written. For the CPU, it shows the percentages of time that the system has spent in user mode, nice mode (low-priority user processes), and system mode. Listing 4-3 shows sample output for a system with only one physical disk.

Listing 4-1 bdf output for a specific filesystem.

 # bdf /dev/vg00/lvol3
 Filesystem          kbytes    used    avail   %used Mounted on
 /dev/vg00/lvol3     126976    33003   93912    26%  /
 #

Listing 4-2 Output from ioscan for a two-processor HP-UX system.

 # ioscan | grep processor
 32          processor              Processor
 34          processor              Processor
 #

Listing 4-3 Output from the iostat command showing performance measures for disks, terminals, and CPU.

 # iostat -t
                    tty             cpu
                  tin tout        us  ni  sy  id
                    0    1         1   0   2  97
   device    bps     sps    msps
   c0t5d0      0     0.0     1.0

You may want to use iostat to compare the activity on different disks, to see whether a load imbalance exists. It is normal for the system disk to have more activity.

ipcs

The ipcs command shows the status of active message queues, shared memory, and system semaphores. Listing 4-4 shows example output from using ipcs. You may want to consult the online man page to see all the available options for this command.

mailstats

If your system is being used as a mail server, you may want to use mailstats to check mail statistics. The mailstats command shows the number of messages and amount of data sent or received for each mailer running on the system.

ps

The ps command is used to display information about all processes on the system. The metrics provided by ps include: Process Identifier (PID), parent PID, process start time, cumulative execution time, process state, priority, physical size (in pages), and the command with its command-line options.

ps is a quick way to get a profile of the processes on your system. It is useful for checking whether a specific application or process is running. For example, Listing 4-5 shows an easy way to display the Network File System (NFS) daemons running on your system. The ps output can also be used to identify runaway processes, in terms of both CPU time and size. Numerous processes in the wait state may be an indication of a system bottleneck.
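A check for whether a specific process is running can also be scripted. The sketch below is an illustration only: count_procs is a hypothetical name, and the parsing assumes the ps -ef layout shown in Listing 4-5, where the cumulative time has the form minutes:seconds and the command follows it. Matching on the command name itself avoids counting the grep process that appears in the listing.

```shell
# Hypothetical helper: count processes whose command name matches exactly.
# Reads unfiltered "ps -ef" output (header included) from standard input.
# Assumes the TIME column looks like 0:00 and the command follows it.
count_procs() {
    awk -v want="$1" '
        NR > 1 {
            cmd = ""
            # STIME may be one field (12:14:48) or two (Dec 22), so locate
            # the TIME field instead of relying on a fixed position.
            for (i = 7; i < NF; i++) {
                if ($i ~ /^[0-9]+:[0-9][0-9]$/) { cmd = $(i + 1); break }
            }
            n = split(cmd, parts, "/")      # strip any leading directory
            if (n > 0 && parts[n] == want) count++
        }
        END { print count + 0 }'
}
```

For example, ps -ef | count_procs nfsd prints the number of nfsd processes, which a monitoring script could compare against the number it expects.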

sar

sar is the System Activity Reporter. It is useful for monitoring system activity and can be used to identify memory, CPU, and kernel bottlenecks. It enables you to specify the polling interval and has the ability to log data to a file (in binary format). It can report on activity from many system resources, including CPU utilization by processor, buffer cache, swapping, disks and tape, run and swap queues, and several system tables. Refer to the online man page for the command-line options.

Listing 4-4 Output from ipcs showing active message queues, shared memory, and semaphores.

 # ipcs
 IPC status from /dev/kmem as of Sun Mar 14 17:47:20 1999
 T      ID     KEY        MODE        OWNER     GROUP
 Message Queues:
 q       0 0x3c1c0330 -Rrw--w--w-     root      root
 q       1 0x3e1c0330 --rw-r--r--     root      root
 Shared Memory:
 m       0 0x2f180002 --rw-------     root       sys
 m     201 0x411c031b --rw-rw-rw-     root       sys
 m     402 0x4e0c0002 --rw-rw-rw-     root       sys
 m     403 0x41201219 --rw-rw-rw-     root       sys
 Semaphores:
 s       0 0x2f180002 --ra-ra-ra-     root       sys
 s      65 0x411c031b --ra-ra-ra-     root       sys
 s     130 0x4e0c0002 --ra-ra-ra-     root       sys
 s     131 0x4120121a --ra-ra-ra-     root       sys
 s       4 0x00446f6e --ra-r--r--     root      root
 s       5 0x00446f6d --ra-r--r--     root      root
 s       6 0x01090522 --ra-r--r--     root      root
 s       7 0x411c1f3a --ra-ra-ra-     root      root
 s       8 0x410c319a --ra-ra-ra-     root      root
 #

Listing 4-5 Finding your NFS daemons.

 # ps -ef | grep -E 'nfs|PPID'
      UID   PID  PPID  C  STIME   TTY       TIME COMMAND
     root   681     1  0  Dec 22  ?         0:00 /usr/sbin/nfsd 4
     root   682   681  0  Dec 22  ?         0:00 /usr/sbin/nfsd 4
     root   686   681  0  Dec 22  ?         0:00 /usr/sbin/nfsd 4
     root   688   681  0  Dec 22  ?         0:00 /usr/sbin/nfsd 4
     root 16761 16718  1 12:14:48 pts/0     0:00 grep nfs
 #

For CPU activity, sar shows CPU utilization broken down into user mode, system mode, idle time waiting for I/O to complete, and idle time, either on a per-processor level or averaged for all processors. Sample output is shown in Listing 4-6.

Listing 4-6 sar output showing system activity.

 # sar 5 5
 HP-UX cadbury B.10.20 A 9000/871    03/15/99
 20:36:32    %usr    %sys    %wio   %idle
 20:36:37       0       1       0      98
 20:36:42       0       1       0      99
 20:36:47       1       1       0      99
 20:36:52       0       1       0      99
 20:36:57       0       1       0      99
 Average        0       1       0      99
 #

By using sar -q, you can look at the average lengths of the run and swap queues, and the percentage of times the queues were occupied. This is shown in Listing 4-7. High CPU utilization and a large run queue may indicate a CPU bottleneck. A large swap queue is one sign of memory contention.
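To use these figures in a script, you may want to extract a single value from the summary row. The sketch below assumes the row is labeled exactly Average, as in Listings 4-6 and 4-7; sar_average is a hypothetical name, and the column number is counted from the left with the row label as column 1.

```shell
# Hypothetical helper: print one column of the "Average" row of sar output
# read on standard input. Column 1 is the row label itself.
# Usage: sar 5 5 | sar_average 5       (%idle in the default CPU report)
#        sar -q 5 5 | sar_average 2    (runq-sz in the queue report)
sar_average() {
    awk -v col="$1" '$1 == "Average" { print $col }'
}
```

A monitoring script could compare the two results against limits of your own choosing, for example treating low %idle together with a large runq-sz as a sign of a possible CPU bottleneck.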

sar can be used to check the effectiveness of buffer cache use. It reports the rates of reads and writes between a disk and the buffer cache. It also reports the rates of logical reads and writes to and from the buffer cache, as well as buffer cache hit ratios.

For swapping activity, you can monitor swap-in rates, swap outs per second, and context switch rates.

sar -v reports the current size, maximum size, and number of overflows of various system tables, including the process table, inode table, and system file table.

Listing 4-7 Output from sar showing queue lengths.

 # sar -q 5 5
 HP-UX cadbury B.10.20 A 9000/871    03/15/99
 20:44:03 runq-sz %runocc swpq-sz %swpocc
 20:44:08     1.0      20     0.0       0
 20:44:13     0.0       0     0.0       0
 20:44:18     0.0       0     0.0       0
 20:44:23     1.0      10     0.0       0
 20:44:28     0.0       0     0.0       0
 Average      1.0       6     0.0       0
 #

swapinfo

swapinfo reports system paging or swapping activity, and memory utilization. On some implementations of UNIX, it is called swap. This command is useful for showing swap space usage and configuration. It displays for each swap type and device the kilobytes (K) available, kilobytes used, kilobytes free, and percentage used. If you have insufficient memory, you may see lots of pages being swapped or high utilization of the swap device. An example using swapinfo is shown in Listing 4-8.

For filesystem swap areas, reserve is the number of 1K blocks reserved for filesystem use by ordinary users. For device swap areas, this value is always "-". Checking swapinfo periodically may help you to schedule additions to your swap capacity.
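A periodic check of swapinfo output could be sketched as follows. The function name swap_over_threshold and the default limit of 80 percent are assumptions for illustration; the column positions follow the layout shown in Listing 4-8, with the percentage used in the fifth column and the device name last.

```shell
# Hypothetical helper: report device swap areas at or above a usage limit.
# Reads swapinfo output from standard input. The 80 percent default is an
# assumed value, not a recommendation from the text.
swap_over_threshold() {
    limit=${1:-80}
    awk -v limit="$limit" '
        $1 == "dev" {
            pct = $5                # PCT USED column, for example 2%
            sub(/%/, "", pct)
            if (pct + 0 >= limit + 0) print $NF, pct "%"
        }'
}
```

For example, swapinfo | swap_over_threshold 75 prints the name and utilization of each device swap area that is at least 75 percent used, and prints nothing otherwise.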

sysdef

The sysdef command, available on HP-UX, reports on a system's tunable kernel parameters. For each kernel parameter, this command shows the current value, value at boot time, and minimum and maximum values allowed for the parameter, as demonstrated in Listing 4-9. This command can be used both to monitor whether the system kernel is configured properly and to track whether certain kernel resource usage is at or approaching its configured limit. You can also use this command, together with ioscan, to track kernel configuration changes.

Listing 4-8 Output from the swapinfo command shows system paging activity.

 # swapinfo
              Kb      Kb      Kb   PCT  START/      Kb
 TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE  PRI  NAME
 dev      524288   12488  511800    2%       0       -    1  /dev/vg00/lvol2
 reserve       -  246876 -246876
 memory   404396  207844  196552   51%

Listing 4-9 Showing current values of kernel-tunable parameters.

 # sysdef
 NAME                     VALUE  BOOT    MIN-MAX         UNITS     FLAGS
 acctresume                   4     -    -100-100                  -
 acctsuspend                  2     -    -100-100                  -
 allocate_fs_swapmap          0     -    -                         -
 bufpages                 10714     -    0-              Pages     -
 create_fastlinks             0     -    -                         -
 dbc_max_pct                 50     -    -                         -
 dbc_min_pct                  5     -    -                         -
 default_disk_ir              0     -    -                         -
 dskless_node                 0     -    0-1                       -
 eisa_io_estimate           768     -    -                         -
 eqmemsize                   15     -    -                         -
 file_pad                    10     -    0-                        -
 fs_async                     0     -    0-1                       -
 hpux_aes_override            0     -    -                         -
 maxdsiz                  16384     -    256-655360      Pages     -
 maxfiles                   120     -    30-2048                   -
 maxfiles_lim              1024     -    30-2048                   -
 maxssiz                   2048     -    256-655360      Pages     -
 maxswapchunks              256     -    1-16384                   -
 maxtsiz                  16384     -    256-655360      Pages     -
 maxuprc                     75     -    3-                        -
 maxvgs                      10     -    -                         -
 msgmap                 2555904     -    3-                        -
 nbuf                      5772     -    0-                        -
 ncallout                   316     -    6-                        -
 ncdnode                    150     -    -                         -
 ndilbuffers                 30     -    1-                        -
 netisr_priority             -1     -    -1-127                    -
 netmemmax             14356480     -    -                         -
 nfile                     1034     -    14-                       -
 nflocks                    200     -    2-                        -
 ninode                     500     -    14-                       -
 no_lvm_disks                 0     -    -                         -
 nproc                      300     -    10-                       -
 npty                        60     -    1-                        -
 nstrpty                     60     -    -                         -
 nswapdev                    10     -    1-25                      -
 nswapfs                     10     -    1-25                      -
 public_shlibs                1     -    -                         -
 remote_nfs_swap              0     -    -                         -
 rtsched_numpri              32     -    -                         -
 sema                         0     -    0-1                       -
 semmap                 4128768     -    4-                        -
 shmem                        0     -    0-1                       -
 shmmni                     200     -    3-1024                    -
 streampipes                  0     -    0-                        -
 swapmem_on                   1     -    -                         -
 swchunk                   2048     -    2048-16384      kBytes    -
 timeslice                   10     -    -1-2147483648   Ticks     -
 unlockable_mem            2158     -    0-              Pages     -

timex

The timex command can be used to measure and report, in seconds, the elapsed time, user CPU time, and system CPU time spent executing a given command. The command to be executed is given on the timex command line. This command reports process accounting data for the command and all of its children, as well as the total system activity during execution of the command. The timex command can give you a crude idea of the impact of a command on the rest of the system.

top

The top command is useful for monitoring the system CPU and memory loads. It also lists the most active processes on the system. top output is displayed in the terminal window and is updated every five seconds, by default.

top shows CPU resource statistics, including load averages (job queues over the last 1 minute, 5 minutes, and 15 minutes), the number of processes in each state (sleeping, waiting, running, starting, zombie, stopped), and the percentage of time spent in each processor state (user, nice, system, idle, interrupt, and swapper) for each processor on the system, as well as the average across all processors in a multiprocessor system.

For memory utilization, top shows virtual and real memory in use, the amount of active memory, and the amount of free memory.

At the process level, top lists the top processes, based on their CPU usage. The process data displayed by top includes the PID, process size (text, data, and stack), resident size of the process (K), process state (sleeping, waiting, running, idle, zombie, or stopped), the number of CPU seconds consumed by the process, and the average CPU utilization of the process. This command can be used to identify processes that may be using large amounts of CPU or memory. Note that top can also be a quick way to check the number of processors on your system. Listing 4-10 shows the output for a four-processor system.

Listing 4-10 Output from the top command showing process activity.

 System: gsyview1                 Fri Feb 12 13:40:24 1999
 Load averages: 0.08, 0.11, 0.16
 616 processes: 614 sleeping, 2 running
 Cpu states:
 CPU LOAD USER  NICE  SYS   IDLE  BLOCK SWAIT  INTR  SSYS
  0  0.30 0.0%  0.0%  1.3% 98.7%  0.0%   0.0%  0.0%  0.0%
  1  0.00 0.0%  0.0%  0.7% 99.3%  0.0%   0.0%  0.0%  0.0%
  2  0.01 0.0%  0.0%  0.2% 99.8%  0.0%   0.0%  0.0%  0.0%
  3  0.02 0.4%  0.0%  7.9% 91.8%  0.0%   0.0%  0.0%  0.0%
 -       -
 avg 0.08 0.0%  0.0%  2.6% 97.4%  0.0%   0.0%  0.0%  0.0%
 Memory: 25754K (2356K)real, 27864K (6144K)virtual, 27838K free  Page# 1/42
 CPU TTY     PID USERNAME PRI NI   SIZE    RES STATE    TIME %WCPU  %CPU COMMAND
   3 pts/4 12555 jsymons  187 20 25992K   568K run      0:02  7.84  5.48 top
   0 rroot    19 root     100 20     0K     0K sleep 1449:04  1.05  1.05 netisr
   1 rroot   494 root     154 20   216K   284K sleep 1479:50  1.03  1.02 syncer
   0 rroot     3 root     128 20     0K     0K sleep  960:56  1.00  0.99 statdaemo
   0 rroot  1432 root      20 20  8120K  6956K sleep  842:38  0.61  0.61 cmcld
   3 rroot    38 root     138 20     0K     0K sleep  336:22  0.32  0.31 vx_iflush
   1 rroot     7 root     -32 20     0K     0K sleep  321:07  0.25  0.25 ttisr
   3 rroot   934 root     154 20  6100K  1436K sleep  297:15  0.22  0.22 rpcd
   1 rroot    40 root     138 20     0K     0K sleep  193:38  0.16  0.16 vx_inacti
   0 rroot 26626 root     154 20   868K   880K sleep  245:15  0.15  0.15 opcle
   2 rroot 26587 root     154 20  2580K  1348K sleep  125:22  0.07  0.07 opcmsga
   1 rroot    39 root     138 20     0K     0K sleep   88:15  0.07  0.07 vx_ifree_
   3 rroot    22 root     100 20     0K     0K sleep  159:58  0.06  0.06 netisr
   1 rroot 26586 root     154 20  8468K  1752K sleep   53:47  0.06  0.06 opcctla

uname

The uname command can be used to display configuration information about your system. This information includes the operating system name, machine model, and operating system version.

You may want to gather this information and store it for later use. This may be useful if you are trying to keep all of your systems on the same release of the operating system, for example.
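One way to gather this information is to print it as a single line per system, which can then be appended to a file and compared across hosts. The sketch below uses only the -n, -s, -r, and -m options of uname; the function name system_release_line and the field order are assumptions made for illustration.

```shell
# Hypothetical helper: print one line describing this system as
# "nodename osname release machine". The field order is an arbitrary choice.
system_release_line() {
    printf '%s %s %s %s\n' "$(uname -n)" "$(uname -s)" "$(uname -r)" "$(uname -m)"
}
```

Collecting these lines from all of your systems into one file makes it easy to spot a host whose release field differs from the others.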

uptime

The uptime command is probably the most commonly used command to check system resources. This command shows the current time, length of time the system has been up, number of users logged on, and the average number of jobs in the run queue for the last 1, 5, and 15 minutes.

Using uptime with the -w option shows a summary of the current activity on the system for each user. As shown in Listing 4-11, you can see the login time, CPU usage, and command activity for each user.
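If you want to record or test the load averages in a script, they can be extracted from the first line of the uptime output. The following is a sketch under the assumption that the line contains the text "load average:" followed by three comma-separated numbers, as in Listing 4-11; load_averages is a hypothetical name.

```shell
# Hypothetical helper: extract the three load averages from uptime output
# read on standard input and print them separated by spaces
# (1-minute, 5-minute, and 15-minute values, in that order).
load_averages() {
    sed -n 's/.*load average[s]*: *\([0-9.]*\), *\([0-9.]*\), *\([0-9.]*\).*/\1 \2 \3/p'
}
```

For example, uptime | load_averages prints the three values on one line, and any per-user lines produced by the -w option are ignored.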

vmstat

The vmstat command provides good information about system resources, including virtual memory and CPU usage, and is useful for detecting whether you are low on memory or swap space.

Listing 4-11 Output from the uptime command showing activity for each user.

 uptime -w
  12:49pm  up 3 days, 2:19, 5 users, load average: 0.49, 0.56, 0.56
 User     tty           login@  idle  JCPU PCPU  what
 jsymons  console      12:32pm 74:17             /usr/sbin
 jsymons  ttyp7        12:18pm                   uptime -w

For monitoring real and virtual memory, vmstat shows page faults and paging activity, including reclaimed pages and swapping rates.

For the CPU, you can see more detailed information with vmstat than that provided by iostat. vmstat shows faults, including device interrupts, system calls, and context switches. vmstat also includes the breakdown of CPU utilization by user, system, and idle time.

For processes, vmstat shows the number of processes in various states, including the following: currently in the run queue, blocked on an I/O operation, and swapped out to disk.

The statistics that you see vary depending on the command option that you specify. By specifying a time interval, you can have vmstat run continuously, so that you can see how the values vary over time. As shown in Listing 4-12, using the -s option prints paging-related activity.
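Because the -s output consists of cumulative counters, a script usually needs to pick out one counter by name. The sketch below assumes each line has the form of a number followed by a description, as in Listing 4-12; vmstat_counter is a hypothetical name, and the description must match exactly.

```shell
# Hypothetical helper: print the value of one named counter from
# "vmstat -s" output read on standard input.
# Usage: vmstat -s | vmstat_counter "page outs"
vmstat_counter() {
    awk -v name="$1" '
        {
            value = $1
            label = $0
            sub(/^[ \t]*[0-9]+[ \t]+/, "", label)   # drop the leading number
            sub(/[ \t]+$/, "", label)               # drop trailing blanks
            if (label == name) print value
        }'
}
```

Sampling the same counter twice and subtracting the first value from the second gives the amount of activity during the interval between the two samples.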

who

The who command tells you who is logged in to the system, and how long each user has been connected. This command can be useful if a performance problem arises, because you can quickly determine whether an increase in the number of concurrent users has occurred. It can also be useful in checking for security intrusions, because you may notice an unexpected user.
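To track the number of concurrent users over time, you can reduce the who output to a single count. The sketch below assumes that the login name is the first field of each line; count_users is a hypothetical name, and a user with several sessions is counted once.

```shell
# Hypothetical helper: count distinct login names in who output read on
# standard input. Blank lines are ignored.
count_users() {
    awk 'NF && !seen[$1]++ { n++ } END { print n + 0 }'
}
```

For example, who | count_users prints the number of distinct users currently logged in, which can be logged periodically and compared with earlier values.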

Listing 4-12 Output from the vmstat command showing paging activity.

 $ vmstat -s
      5431 swap ins
      5431 swap outs
      1376 pages swapped in
       426 pages swapped out
   9704169 total address trans. faults taken
   2159795 page ins
      9236 page outs
    136606 pages paged in
     21451 pages paged out
   2064504 reclaims from free list
   2097094 total page reclaims
       773 intransit blocking page faults
   6040874 zero fill pages created
   3925703 zero fill page faults
   1457303 executable fill pages created
     76804 executable fill page faults
         0 swap text pages found in free list
    735656 inode text pages found in free list
       185 revolutions of the clock hand
    105428 pages scanned for page out
     16850 pages freed by the clock daemon
  50286274 cpu context switches
  90662460 device interrupts
   2732863 traps
 229976779 system calls
