Using System Instrumentation | UNIX Fault Management: A Guide for System Administrators

Standards for network and system management, such as Simple Network Management Protocol (SNMP) and Desktop Management Interface (DMI), have been developed to help make management easier. They provide industry-standard ways to build instrumentation and to interface into the instrumentation. SNMP is used to access Management Information Bases (MIBs), and DMI is used to access Management Information Formats (MIFs).

Standard MIBs and MIFs define the metrics that can be instrumented by any vendor. Vendor-specific MIBs and MIFs provide vendor-specific instrumentation. This section looks at some of the system instrumentation available through each of these standards, which can be used to obtain information about your application.

Many tools already exist for accessing this instrumentation. Several vendors offer browsers and monitoring capabilities that use a common interface for accessing instrumented objects from different hardware platforms and operating systems. For example, the common enterprise management frameworks, such as HP Network Node Manager (NNM), include a MIB Browser tool to access MIB data. They may also include tools that can be used to monitor MIB data on remote systems from the enterprise management platform. Toolkits also exist that provide an interface with which users can write their own tools to monitor or track this information. Other toolkits enable users to create their own instrumentation.

SNMP

A MIB is a standard way of representing information of a certain category. For example, MIB-II provides useful information about a system, such as the number of active TCP connections, system hardware and version information, and so forth. OpenView IT/O, discussed later in this chapter, provides a MIB Browser that helps you to discover which MIBs are available and to see the information being provided by each MIB. With the MIB Browser tool, you can check the value of anything contained in a MIB. If you find a MIB that contains some useful fields, you can use the MIB Browser to gather that data from the target system. The resulting data is displayed onscreen in the MIB Browser's output window. By browsing through available MIBs and querying values of selected MIB fields, you can gather specific information needed to monitor systems and troubleshoot problems.

The SNMP interface provides access to objects stored in various MIBs. MIB-II is a standard MIB that has been implemented on most Unix systems. On HP-UX systems, the HP-UNIX MIB defines various metrics for monitoring system resources. Other vendors, such as Sun, have vendor-specific MIBs that provide similar information. You can find complete MIB definitions in Appendix A.

The HP-UNIX MIB contains information about each of the processes running on the system. Useful information includes the PID, CPU utilization, priority, and resident set size. If numerous processes are active on a system, searching through the process table in the MIB may be difficult. If you want information about only one specific process, you may prefer to write a small program that uses the pstat interface, covered later in this section, which provides the same information as the MIB process table, as well as additional fields. However, the MIB is a useful source for collecting data about multiple processes, especially if you do not want to write your own application.

If you are using MC/ServiceGuard to protect your application, you may also want to use the HP MC/ServiceGuard MIB, which contains information about each configured MC/ServiceGuard application, or package. Information about a package includes the name, description, current status, system on which it is currently running, processes making up the package, and alternate nodes for the package. Note that the MIB does not show any EMS resource dependencies for a package. You have to use the MC/ServiceGuard command cmviewcl -v to obtain this additional information.

The Network Services MIB and Relational Database Management System MIB are used for database applications. From these MIBs, you can learn the database version, database status, length of time the application has been running, and number of current users. Some resource and performance information is also available.

DMI

System resource information can also be retrieved by using DMI, which is another standard for storing and accessing management information. Management information is represented in a text file written in the Management Information Format (MIF). It is divided into components, each of which has a Service Provider (SP) that is responsible for providing DMI information to the management applications that request it.

Several system platforms, including HP-UX, provide instrumentation for the System MIF and Software MIF. Appendix A contains a complete listing of these MIFs.

Like MIB-II, the System MIF can be used to get generic system information. It includes the system name, boot time, uptime, contact information, number of users, and some information about the filesystem and disks.

The Software MIF provides information about the software products and product bundles installed on a system, and can be a useful tool after the discovery of a problem with an application. By using a MIF Browser, you can examine the Software MIF to see whether a problem might have been caused by a bad patch or modified file. The MIF contains revision information, including creation and modification times, for each product. Version information can be checked to see whether a compatibility problem exists. Finally, the product's vendor information is provided, in case you need to contact the product's support personnel.

Vendors such as Hewlett-Packard are working to provide enterprise-wide repositories of software configuration information. Tools are needed to compare application versions on different systems and update systems to the same revision level.

`pstat`

Occasionally, you may want to write your own program to get information about specific processes. HP-UX provides a programming interface for this purpose, referred to as pstat. Your program can generate output for only the processes that interest you, which can be much easier than wading through top or ps output looking for important information.

Listings 7-2 and 7-3, respectively, show a program that obtains pstat statistics and the output from running the program. The program displays information for the process(es) matching the names that you enter. The pstat interface can also show overall system information, as demonstrated in Listing 7-2.

For the example in Listing 7-3, the concern is that an application named memhog is using too much memory, so the program is being used to check memhog's memory usage at various times, and to track the overall system load. This can be more convenient than running other tools, such as top or GlancePlus, and sifting through the output.

Listing 7-2 Program code using pstat.

    /* pstat example */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <sys/pstat.h>

    /* Translate a pst_stat value into a readable string. */
    static void convert_status(char *status, long stat)
    {
        switch (stat) {
        case 1:  strcpy(status, "sleeping");      break;
        case 2:  strcpy(status, "running");       break;
        case 3:  strcpy(status, "stopped");       break;
        case 4:  strcpy(status, "dead (zombie)"); break;
        case 5:  strcpy(status, "other");         break;
        case 6:  strcpy(status, "idle");          break;
        default: strcpy(status, "unknown");       break;
        }
    }

    int main(void)
    {
        struct pst_status pst;
        struct pst_dynamic dyn;
        struct pst_static mystatic;
        struct timeval temp;
        struct timezone dummy;
        char target[80];
        char status[14];
        int indx = 0;
        long duration;
        long now;

        gettimeofday(&temp, &dummy);
        printf("Enter process name:");
        if (scanf("%79s", target) != 1)   /* bounded read into target[80] */
            return 1;
        now = (long)temp.tv_sec;

        /* Walk the process table one entry at a time. */
        while (pstat_getproc(&pst, sizeof(struct pst_status), 1, indx) > 0) {
            if (strcmp(target, pst.pst_ucomm) == 0) {
                duration = now - (long)pst.pst_start;
                convert_status(status, (long)pst.pst_stat);
                printf("PID %8ld  Status: %s\n", (long)pst.pst_pid, status);
                printf("Started %ld seconds ago\n", duration);
                printf("Real pages(DATA %ld  TEXT %ld  STACK %ld)\n",
                       (long)pst.pst_dsize, (long)pst.pst_tsize,
                       (long)pst.pst_ssize);
                printf("RSS %ld   MAX RSS(hwm) %ld\n",
                       (long)pst.pst_rssize, (long)pst.pst_maxrss);
                printf("Number of swaps: %ld\n", (long)pst.pst_nswap);
            }
            indx = pst.pst_idx + 1;   /* continue after the entry just read */
        }  /* end while loop */

        printf("Overall System Info:\n");
        /* Request the single system-wide entry: element count 1, index 0. */
        pstat_getstatic(&mystatic, sizeof(struct pst_static), 1, 0);
        pstat_getdynamic(&dyn, sizeof(struct pst_dynamic), 1, 0);
        printf("Run queue len(1- %f,", dyn.psd_avg_1_min);
        printf("5- %f,", dyn.psd_avg_5_min);
        printf("15- %f)\n", dyn.psd_avg_15_min);
        printf("Physical Memory: %ld\n", (long)mystatic.physical_memory);
        printf("Active Real Memory: %ld  Free pages: %ld bytes\n",
               (long)dyn.psd_arm,
               (long)dyn.psd_free * (long)mystatic.page_size);
        return 0;
    }

Listing 7-3 Program output showing pstat information.

    # ./pstatex
    Enter process name:memhog
    PID    24549  Status: running
    Started 247 seconds ago
    Real pages(DATA 1955  TEXT 3  STACK 3)
    RSS 1979   MAX RSS(hwm) 1979
    Number of swaps: 0
    Overall System Info:
    Run queue len(1- 0.000000,5- 0.000000,15- 0.000000)
    Physical Memory: 393216
    Active Real Memory: 0  Free pages: 0 bytes
    #

The example in Listing 7-3 shows mostly memory information, but a variety of additional metrics are also available. Here is a summary of the information available for each process:

User ID, PID, effective user ID, real and effective group ID, and parent PID
Number of real pages used for data, text, or stack
Number of real pages used for shared memory, memory mapped files, and so forth
Number of virtual pages used for data, text, stack, shared memory, and so forth
Priority and nice value
Terminal device ID
Process group and PRM group ID
Address of process in memory
User and system time spent executing
Time process started
Process status and status flags
Processor last used by process
Command line and executable base name for process
CPU time used
Current and high-water mark of resident set size
Number of swaps, page faults, and page reclaims
Number of signals or socket messages received
Number of socket messages sent
Scheduling policy of process
Session ID
File ID of process' root directory, current directory, and executable
Highest file descriptor currently opened
Number of characters read and written

More information about the process and system metrics available from this interface can be found in the system include file, /usr/include/sys/pstat.h.
