Identifying Process Hangs | Real World Mac Maintenance and Backups

Now that we have covered the building blocks of processes and threads, it is time to address process hangs and their potential causes. There is little hope that killing an offending process with -9 (sigkill) will lead to discovering the root cause of a process hang. Neither will rebooting the OS unless you are dealing with a stale file handle. Furthermore, these steps will not prevent these anomalies from reoccurring. However, by applying the knowledge of how processes are created and how resources are used, the root cause can be identified.

When a process appears hung, the first step toward a solution is determining certain critical information about the task. Using the ps command, start by determining the task's state, threads, priority, parent, and wait channel. In addition, identify the cumulative CPU time and the initial start time of the task. A holistic approach is needed because although a single ps command is a good start, it will not deliver all the needed information.

Let us first determine whether the process is hung because sometimes a process appears to be blocked when it actually is in the middle of a computation or non-blocking I/O. If cumulative CPU time constantly grows, the task's state will most likely be R. In this state, the process is on the run queue and does not have a wait channel. Monitor its cumulative CPU time. If the process remains in the run queue, it might be performing some calculation that takes a while. Even the fastest computers in the world take a while to calculate an infinite loop! Nevertheless, note that a process in the run state could be the normal operation of that program, an application "feature," or a driver problem.

If an offending process is consuming system resources at an extraordinary rate and starving production applications, killing the offending process is justified if the process can be killed. However, sometimes a process cannot be killed. When a process has exhausted its timeslice, it is put to sleep() with a given priority. When the priority of the process falls below PZERO, it is in an uninterruptible state and cannot be signaled; however, signals can be queued, and for some operations, this is normal. For others, where the program has hung and never returns, the cause is usually located in the driver or hardware. If the process has a state of D (blocked on I/O), it is uninterruptible and cannot be killed. For example, a process accessing a file over a failed hard NFS mount would be in a state of D while attempting to stat() a file or directory.

Uninterruptible processes usually take place when entering I/O calls, at which point the process has called into the kernel, which is in driver code, during which the process cannot receive signals from user space. In this state, a command cannot be signaled even by a SIGKILL (kill -9). It is important to note that signals are queued if not ignored by a sigmask and executed after the code returns from kernel space. Some signals cannot be masked; see the signal man page for more details.

Here is an excerpt from the signal man page:

... Using a signal handler function for a signal is called "catching the signal". The signals SIGKILL and SIGSTOP cannot be caught or ignored. ...

A zombie process is another process that a user cannot kill. These processes, however, should not be consuming any CPU cycles or memory resources other than the overhead of having the task structure in the kernel's Virtual Address Space (VAS). The main goal of troubleshooting a zombie process is determining why the parent died without reaping its children. In short, you should focus on why and how the parent dies.

Listed next are process state codes pulled right out of the source code.

./linux/fs/proc/array.c /*  * The task state array is a strange "bitmap" of  * reasons to sleep. Thus "running" is zero, and  * you can test for combinations of others with  * simple bit tests.  */ static const char *task_state_array[] = {         "R (running)",          /*  0 */         "S (sleeping)",         /*  1 */         "D (disk sleep)",       /*  2 */         "Z (zombie)",           /*  4 */         "T (stopped)",          /*  8 */         "W (paging)"            /* 16 */ }; ...

In Scenario 8-1, we demonstrate an instance in which a process cannot be killed.

Scenario 8-1: Troubleshooting a Process That Does Not Respond to kill

A user begins rewinding a tape but realizes that the wrong tape is in the drive. The user tries to kill the job but must wait for the process to finish.

Why?

The mt command has made an ioctl call to the SCSI tape driver (st) and must wait for the driver to release the process back to user space so that use signals will be handled.

# mt -f /dev/st0 rewind # ps -emo state,pid,ppid,pri,size,stime,time,comm,wchan | grep mt D  9225  8916  24 112 20:46 00:00:00 mt             wait_for_completion [root@atlorca2 root]# kill -9 9225 [root@atlorca2 root]# echo $?   # This produces the return code for the previous command.  0 = success 0 [root@atlorca2 root]# ps -elf | grep 9225 0 D root     9225 8916  0  24   0   -    112 wait_f 20:46 pts/1 00:00:00 mt -f /dev/st0

The mt command has entered a wait channel, and after the code returns from the driver, the signal will be processed.

Let's check the pending signals:

cat ../9225/status Name:   mt State:  D (disk sleep) Tgid:   9225 Pid:    9225 PPid:   8916 TracerPid:      0 Uid:    0       0      0     0 Gid:    0       0      0     0 FDSize: 256 Groups: 0 1 2 3 4 6 10 VmSize:    2800 kB VmLck:        0 kB VmRSS:      640 kB VmData:      96 kB VmStk:       16 kB VmExe:       32 kB VmLib:     2560 kB SigPnd: 0000000000000100 <-- SigPnd is a bit mask which indicates the value of the pending signal. Each byte accounts for 4 bits. In this case, the pending signal has a value of 9, so the first bit on the 3rd byte is set. This algorithm is detailed in linux/fs/proc/array.c under the render_sigset_t() function. The following table illustrates this function. Signal    : 1 2 3 4 . 5 6 7 8 . 9 10 11 12 . 13 14 15 16 bit value : 1 2 4 8 . 1 2 4 8 . 1  2  4  8 . 1  2  4  8 kill -3 yields bit mask 0000000000000004 kill -9 yields bit mask 0000000000000100 ShdPnd: 0000000000000100 SigBlk: 0000000000000000 SigIgn: 0000000000000000 SigCgt: 0000000000000000 CapInh: 0000000000000000 CapPrm: 00000000fffffeff CapEff: 00000000fffffeff

Troubleshooting the hung process involves these steps:

1.	Identify all the tasks (threads) for the program.
2.	Assess the hanging process. Is it easily reproducible?
3.	Assess the other things going on. What else is the machine doing? Check load and other applications' response time.

The following scenario demonstrates a way of troubleshooting a process that periodically hangs and then continues.

Scenario 8-2: Troubleshooting a Hanging Web Browser

A user complains that her machine is working great except for her Web browsing application. When accessing Web sites, the browser hangs for a few minutes every so often. The user has installed several different versions and tried other browsers to no avail.

You ask her the following question: Has it ever worked? The reply is "Yes . . . several days ago it was fine."

For the sake of simplicity, we attempt to find the problem with a light Web browser with little overhead.

# strace -f -F -r -T -o /tmp/lynx.strace_2 lynx http://www.hp.com

Using the vi editor and greping for network calls, such as poll(), we can identify what seems to be a problem right away:

:g/poll/p 3660       0.000085 poll([{fd=0, events=POLLIN}], 1, 0) = 0 <0.000020> 3660       0.000186 poll([{fd=0, events=POLLIN}], 1, 0) = 0 <0.000008> 3660       0.000049 poll([{fd=3, events=POLLIN}], 1, 5000) = 0 <5.005154> 3660       0.000043 poll([{fd=3, events=POLLIN}], 1, 5000) = 0 <5.009763> 3660       0.000042 poll([{fd=3, events=POLLIN}], 1, 5000) = 0 <5.008875> 3660       0.000043 poll([{fd=3, events=POLLIN}], 1, 5000) = 0 <5.009264> 3660       0.000042 poll([{fd=3, events=POLLIN}], 1, 5000) = 0 <5.009216> 3660       0.000043 poll([{fd=3, events=POLLIN, revents=POLLIN}], 1, \            5000) = 1 <0.001146> 3660       0.000081 poll([{fd=0, events=POLLIN}], 1, 0) = 0 <0.000017> 3660       0.000088 poll([{fd=0, events=POLLIN}], 1, 0) = 0 <0.000008> 3660       0.000088 poll([{fd=0, events=POLLIN}], 1, 0) = 0 <0.000022>

We see that some poll() calls took over five seconds each. That would explain the Web browser hanging and taking a long time to browse sites.

Focusing on the trace, we see the following:

3660      0.000254 socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 3 <0.000044> 3660      0.000095 connect(3, {sa_family=AF_INET, sin_port=htons(53), \           sin_addr=inet_addr("15.50.74.40")}, 28) = 0 <0.000017> 3660      0.000108 send(3, "\245`\1\0\0\1\0\0\0\0\0\0\3www\2hp\3com\0\0\           34\0\1", 28, 0) = 28 <0.000404> 3660      0.000476 gettimeofday({1097369839, 928119}, NULL) = 0 <0.000005> 3660      0.000049 poll([{fd=3, events=POLLIN}], 1, 5000) = 0 <5.005154> 3660      5.005262 send(3, "\245`\1\0\0\1\0\0\0\0\0\0\3www\2hp\3com\0\0\           34\0\1", 28, 0) = 28 <0.000426>

Checking the man page on poll, we see that it is waiting for an event to take place on a file descriptor.

# ls -al /proc/3660/fd/3 -> socket:[52013]

This confirms a network issue. After reviewing the /etc/resolv.conf file, we see that 15.50.64.40 is the nameserver.

The user contacted her IT department and found that the nameserver configuration for her network had changed. Switching to a different nameserver in the resolv.conf file alleviated the five-second poll() call and resolved the problem.

Commands commonly used to troubleshoot a hung process include the following:

ps Concentrate on pid, ppid, state, comm, time, pri, size, and wchan flags.
lsof Determine the open files on the system.
pstree Focus on processes and how they relate to each other.
strace Flags most commonly used: -f -F -r -T -o <outfile>.
man pages Believe it or not, the manual helps.
source code Use it to determine what the application is doing.
/proc filesystem It offers a wealth of information.

In Scenario 8-3, we show a process that appears to be hung but is not.

Scenario 8-3: Troubleshooting an Apparent Process Hang

In this scenario, a user's goal is to create a file that takes data, automatically compresses it, and sends it to a new file. To perform this task, the user creates a named pipe and issues gzip, redirecting input from the pipe to a new file. The odd part is that the gzip process seems to hang, and the user cannot find the gzip process when searching ps -ef.

So you devise an action plan: Re-create the event and trace the process involved.


1.	Create a named pipe. $ mknod /tmp/named_pipe p $ ll /tmp/named_pipe prw-r--r-- 1 chris chris 0 Oct 9 16:53 /tmp/named_pipe \|
2.	Acquire the current process ID. $ echo $$ # note that the current PID = the shell 5032
3.	From the same shell window, start the `gzip` process on the named pipe. $ gzip < /tmp/named_pipe > /tmp/pipe.out.gz
4.	Find the process with a parent of 5032. $ ps -emo pid,ppid,state,comm,time,pri,size,wchan \| grep 5032 5236 5032 S bash 00:00:00 30 1040 pipe_wait Notice that the command name is `bash`, and it is in the sleep state, sleeping on wait channel `pipe_wait`. Yet `gzip` was the command executed.
5.	In another shell window, start a trace on the parent before executing the `gzip` command. $ strace -o /tmp/pipe.strace -f -F -r -T -v -p 5032 Process 5032 attached - interrupt to quit ........Parent shell \ process Process 5236 attached ....................................The gzip \ process being forked Process 5032 suspended As mentioned earlier, `fork()` essentially creates a process structure by copying the parent. Until `execve()` executes the binary, the new executable is not loaded into memory, so `ps -ef \| grep gzip` does not show the process. In this case, the `gzip` process waits for something to be sent to the pipe before executing `gzip`.

6.	A review of the trace explains why the `ps -ef \| grep gzip` command does not show the process. PID Time call() ... 5032 0.000079 fork() = 5236 <0.000252>.........."GZIP was executed at command line" 5032 0.000678 setpgid(5236, 5236) = 0 <0.000008> 5032 0.000130 rt_sigprocmask(SIG_SETMASK, [RTMIN], NULL, 8) = 0 <0.000007> ... 5032 0.000074 waitpid(-1, <unfinished ...>..........."man waitpid: -1 means wait on child" 5236 0.000322 --- SIGSTOP (Stopped (signal)) @ 0 (0) --- 5236 0.000078 getpid() = 5236 <0.000006> 5236 0.000050 rt_sigprocmask(SIG_SETMASK, [RTMIN], NULL, 8) = 0 <0.000007> 5236 0.000067 rt_sigaction(SIGTSTP, {SIG_DFL}, {SIG_IGN}, 8) = 0 <0.000009> 5236 0.000060 rt_sigaction(SIGTTIN, {SIG_DFL}, {SIG_IGN}, 8) = 0 <0.000007> 5236 0.000057 rt_sigaction(SIGTTOU, {SIG_DFL}, {SIG_IGN}, 8) = 0 <0.000007> 5236 0.000055 setpgid(5236, 5236) = 0 <0.000008> 5236 0.000044 rt_sigprocmask(SIG_BLOCK, [CHLD TSTP TTIN TTOU], [RTMIN], 8) = 0 <0.000007> 5236 0.000071 ioctl(255, TIOCSPGRP, [5236]) = 0 <0.000058> 5236 0.000102 rt_sigprocmask(SIG_SETMASK, [RTMIN], NULL, 8) = 0 <0.000007> 5236 0.000060 rt_sigaction(SIGINT, {SIG_DFL}, {0x8087030, [], SA_RESTORER, 0x4005aca8}, 8) = 0 <0.000007> 5236 0.000075 rt_sigaction(SIGQUIT, {SIG_DFL}, {SIG_IGN}, 8) = 0 <0.000007> 5236 0.000057 rt_sigaction(SIGTERM, {SIG_DFL}, {SIG_IGN}, 8) = 0 <0.000007> 5236 0.000058 rt_sigaction(SIGCHLD, {SIG_DFL}, {0x80776a0, [], SA_RESTORER, 0x4005aca8}, 8) = 0 <0.000007> 5236 0.000262 open("/tmp/named_pipe", O_RDONLY\|O_LARGEFILE) = 3 <141.798572> 5236 141.798719 dup2(3, 0) = 0 <0.000008> 5236 0.000051 close(3) = 0 <0.000008> 5236 0.000167 open("/tmp/pipe.out.gz", O_WRONLY\|O_CREAT\|O_TRUNC\|O_LARGEFILE, 0666) = 3 <0.000329> 5236 0.000394 dup2(3, 1) = 1 <0.000007> 5236 0.000042 close(3) = 0 <0.000008> 5236 0.000127 execve("/usr//bin/gzip", ["gzip"] So 141.79 seconds after opening, the named pipe data was received, `evecve()` executed `gzip`, and the data was compressed and redirected to the file `/tmp/pipe.out.gz`. Only at this point would the `gzip` process show up in the `ps` listing. So what was initially thought to be a hung process is simply a sleeping process waiting on data.
7.	Now `ps -ef \| grep gzip` works. $ ps -ef \| grep 5236 chris 5236 5032 0 17:01 pts/4 00:00:00 gzip