Hack 69. Find Resource Hogs with Standard Commands

You don't need fancy, third-party software or log analyzers to find and deal with a crazed user on a resource binge.

There are times when users will consume more than their fair share of system resources, be it CPU, memory, disk space, file handles, or network bandwidth. In environments where users are logging in on the console (or invoking the login utility by some other means), you can use pam_limits or the ulimit utility to keep them from going overboard.

In other environments, neither of these is particularly useful. On development servers, for example, you could be hosting 50 developers on a single machine where they all test their code before moving it further along toward a production rollout. Machines of this nature are generally set up to allow for things like cron jobs to run. While it's probably technically possible to limit the resources the cron utility can consume, that might be asking for trouble, especially when you consider that there are many jobs that run out of cron on behalf of the system, such as makewhatis and LogWatch.

In general, the developers don't want to hog resources. Really, they don't. It makes their work take longer, and it causes their coworkers to unleash a ration of grief on them. On top of that, it annoys the system administrators, who they know can make their lives, well, "challenging." That said, resource hogging is generally not a daily or even weekly occurrence, and it hardly justifies the cost of third-party software, or jumping through hoops to configure for every conceivable method of resource consumption.

Usually, you find out about resource contention either through a monitoring tool's alert email or from user email complaining about slow response times or login shells hanging. The first thing you can do is log into the machine and run the top command, which will show you the number of tasks currently running, the amount of memory in use, swap space consumption, and how busy the CPUs are. It also shows a list of the top resource consumers, and all of this data updates itself every few seconds for your convenience. Here's some sample output from top:

 top - 21:17:48 up 26 days, 6:37, 2 users, load average: 0.18, 0.09, 0.03
 Tasks: 87 total, 2 running, 83 sleeping, 2 stopped, 0 zombie
 Cpu(s): 14.6% us, 20.6% sy, 0.0% ni, 64.1% id, 0.0% wa, 0.3% hi, 0.3% si
 Mem: 2075860k total, 1343220k used, 732640k free, 216800k buffers
 Swap: 4785868k total, 0k used, 4785868k free, 781120k cached

   PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
  3098 jonesy   25   0  4004 1240  956 S  8.7  0.1  0:11.42 hog.sh
 30033 jonesy   15   0  6400 2100 1656 S  0.7  0.1  0:02.57 sshd
  8083 jonesy   16   0  2060 1064  848 R  0.3  0.1  0:00.06 top
     1 root     16   0  1500  516  456 S  0.0  0.0  0:01.91 init

As you can see, the top resource consumer is my hog.sh script. It has used about 11 seconds of CPU time so far (shown in the TIME+ column), has a process ID of 3098, and uses 1240K of physical memory. A key field here is the NI field, which shows the nice value. Users can use the renice utility to give their jobs lower priorities, to help ensure that they do not get in the way of other jobs scheduled to be run by the kernel scheduler. The kernel runs jobs based on their priorities, which are indicated in the PR field. As an administrator trying to fix problems without stepping on your users' toes, a first step in saving resources might be to renice the hog.sh script. You'll need to run top as root to renice a process you don't own. You can do this by pressing r, at which point top will ask you which process to reprioritize:

 top - 21:19:07 up 26 days, 6:38, 2 users, load average: 0.68, 0.26, 0.09
 Tasks: 88 total, 4 running, 82 sleeping, 2 stopped, 0 zombie
 Cpu(s): 19.6% us, 28.9% sy, 0.0% ni, 49.8% id, 0.0% wa, 1.0% hi, 0.7% si
 Mem: 2075860k total, 1343156k used, 732704k free, 216800k buffers
 Swap: 4785868k total, 0k used, 4785868k free, 781120k cached
 PID to renice: 3098
   PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
  3098 jonesy   25   0  4004 1240  956 R 14.3  0.1  0:22.37 hog.sh

Typing in the process ID and pressing Enter will cause top to ask you what value you'd like to nice the process to. I typed in 15 here. On the next refresh, notice the change in my script's statistics:

 top - 21:20:22 up 26 days, 6:39, 2 users, load average: 1.03, 0.46, 0.18
 Tasks: 87 total, 1 running, 84 sleeping, 2 stopped, 0 zombie
 Cpu(s): 1.3% us, 22.3% sy, 13.6% ni, 61.5% id, 0.0% wa, 0.7% hi, 0.7% si
 Mem: 2075860k total, 1343220k used, 732640k free, 216800k buffers
 Swap: 4785868k total, 0k used, 4785868k free, 781120k cached

   PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
  3098 jonesy   39  15  4004 1240  956 S 12.0  0.1  0:31.34 hog.sh

Renicing a process is a safety precaution. Since you don't know what the code does, you don't know how much pain it will cause the user if you kill it outright. Renicing will help make sure the process doesn't render the system unusable while you try to dig for more information.
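If you would rather review candidates from a shell prompt than from inside top, the following is a minimal sketch under a few assumptions. The function name suggest_renice, the 50% CPU threshold, and the choice to skip root-owned processes are all illustrative and not part of the original hack. It only prints renice commands; it never runs them. On Linux, the util-linux renice normally treats the -n value as an absolute nice value, which is assumed here.

```shell
# Print (but do not run) a renice command for each process whose %CPU
# exceeds a threshold. Input: "pid user pcpu command" lines on stdin.
# $1 = CPU threshold in percent (hypothetical default: 50)
# $2 = nice value to suggest (default 15, the value typed into top above)
suggest_renice() {
    awk -v t="${1:-50}" -v n="${2:-15}" '
        # Skip root-owned processes; system jobs deserve a closer look first.
        $3 + 0 > t + 0 && $2 != "root" {
            printf "renice -n %s -p %s  # %s (%s)\n", n, $1, $4, $2
        }'
}

# Possible use on a live system (not run here):
#   ps -e -o pid= -o user= -o pcpu= -o comm= | suggest_renice 50 15
```

Printing the commands instead of executing them keeps a human in the loop, which fits the cautious spirit of renicing before killing.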

The next thing to check out is the good old ps command. There are actually multiple ways to find out what else a given user is running. Try this one:

 $ ps -ef | grep jonesy
 jonesy   28820     1  0 Jul31 ?        00:00:00 SCREEN
 jonesy   28821 28820  0 Jul31 pts/3    00:00:00 /bin/bash
 jonesy   30203 28821  0 Jul31 pts/3    00:00:00 vim XF86Config
 jonesy   30803     1  0 Jul31 ?        00:00:00 SCREEN
 jonesy   30804 30803  0 Jul31 pts/4    00:00:00 /bin/bash
 jonesy   30818     1  0 Jul31 ?        00:00:00 SCREEN -l
 jonesy   30819 30818  0 Jul31 pts/5    00:00:00 /bin/bash

This returns a full listing of all processes that contain the string jonesy. Note that I'm not selecting by user here, so if some other user is running a script called "jonesy-is-a-horrible-admin," I'll know about it. Here I can see that the user jonesy is also running a bunch of other programs. The PID of each process is listed in the second column, and the parent PID (PPID) of each process is listed in the third column. This is useful, because I can tell, for example, that PID 28821 was actually started by PID 28820, so I can see here that I'm running an instance of the bash shell inside of a screen session.
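Tracing a process back through its parents by eye gets tedious once the chain is more than two levels deep. As a rough sketch (the function name ancestry and the output format are made up for illustration), a few lines of awk can follow the PPID links for you, given "pid ppid" pairs on standard input:

```shell
# Print the chain of parents for one PID.
# Input: one "pid ppid" pair per line on stdin. $1 = PID to start from.
ancestry() {
    awk -v start="$1" '
        { parent[$1] = $2 }          # remember each process and its parent
        END {
            pid = start
            line = pid
            # Walk upward until the parent is unknown or is PID 0.
            while ((pid in parent) && parent[pid] != pid && parent[pid] != 0) {
                pid = parent[pid]
                line = line " <- " pid
            }
            print line
        }'
}

# Possible use on a live system (not run here):
#   ps -e -o pid= -o ppid= | ancestry 28821
```

Fed the PIDs from the listing above, this would report that 28821 descends from 28820, which in turn was started by PID 1.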

To get an even better picture that shows more clearly the relationship between child and parent processes, try this command:

 $ ps -fHU jonesy

This will show the processes owned by user jonesy in hierarchical form, like this:

 UID        PID  PPID  C STIME TTY          TIME CMD
 jonesy   25760 25758  0 15:34 ?        00:00:00 sshd: jonesy@notty
 jonesy   25446 25444  0 Jul29 ?        00:00:06 sshd: jonesy@notty
 jonesy   20761 20758  0 16:28 ?        00:00:03 sshd: jonesy@pts/0
 jonesy   20812 20761  0 16:28 pts/0    00:00:00   -tcsh
 jonesy   12543 12533  0 12:11 ?        00:00:00 sshd: jonesy@notty
 jonesy   12588 12543  0 12:11 ?        00:00:00   tcsh -c /usr/local/libexec/sft
 jonesy   12612 12588  0 12:11 ?        00:00:00     /usr/local/libexec/sftp-serv
 jonesy   12106 12104  0 10:49 ?        00:00:01 sshd: jonesy@pts/29
 jonesy   12135 12106  0 10:49 pts/29   00:00:00   -tcsh
 jonesy   12173 12135  0 10:49 pts/29   00:00:01     ssh livid
 jonesy   10643 10641  0 Jul28 ?        00:00:07 sshd: jonesy@pts/41
 jonesy   10674 10643  0 Jul28 pts/41   00:00:00   -tcsh
 jonesy     845 10674  0 15:49 pts/41   00:00:06     ssh newhotness
 jonesy    7011  6965  0 10:15 ?        00:01:39 sshd: jonesy@pts/21
 jonesy    7033  7011  0 10:15 pts/21   00:00:00   -tcsh
 jonesy   17276  7033  0 11:01 pts/21   00:00:00     -tcsh
 jonesy   17279 17276  0 11:01 pts/21   00:00:00       make
 jonesy   17280 17279  0 11:01 pts/21   00:00:00         /bin/sh -c bibtex paper;
 jonesy   17282 17280  0 11:01 pts/21   00:00:00           latex paper
 jonesy   17297  7033  0 11:01 pts/21   00:00:00     -tcsh
 jonesy   17300 17297  0 11:01 pts/21   00:00:00       make
 jonesy   17301 17300  0 11:01 pts/21   00:00:00         /bin/sh -c bibtex paper;
 jonesy   17303 17301  0 11:01 pts/21   00:00:00           latex paper
 jonesy    6820  6816  0 Jul28 ?        00:00:03 sshd: jonesy@notty
 jonesy    6209  6203  0 22:15 ?        00:00:01 sshd: jonesy@pts/31
 jonesy    6227  6209  0 22:15 pts/31   00:00:00   -tcsh

As you can see, I have a lot going on! These processes look fairly benign, but this may not always be the case. In the event that a user is really spawning lots of resource-intensive processes, one thing you can do is renice every process owned by that user in one fell swoop. For example, to change the priority of everything owned by user jonesy to run only when nothing else is running, I'd run the following command:

 $ renice 20 -u jonesy
 1001: old priority 0, new priority 19

Nice values stop at 19, which is why the request for 20 is reported as a new priority of 19. Doing this to a user who has caused the system load to jump to 50 or so can usually get you back down to a level that makes the system usable again.
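Before renicing everything a user owns, it helps to confirm which user is actually responsible for the load. Here is a minimal sketch that totals CPU and memory percentages per user; the function name sum_by_user is hypothetical, and the input is assumed to be "user pcpu pmem" lines such as ps can produce with custom output columns:

```shell
# Sum %CPU and %MEM per user and list the heaviest CPU consumers first.
# Input: "user pcpu pmem" lines on stdin.
sum_by_user() {
    awk '{ cpu[$1] += $2; mem[$1] += $3 }
         END {
             for (u in cpu) printf "%s %.1f %.1f\n", u, cpu[u], mem[u]
         }' |
        sort -k2,2nr     # numeric, descending, on the CPU total
}

# Possible use on a live system (not run here):
#   ps -e -o user= -o pcpu= -o pmem= | sum_by_user | head -n 5
```

The per-process percentages reported by ps are averages over each process's lifetime, so treat the totals as a rough ranking and not as an exact measurement.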

8.2.1. What About Disk Hogs?

The previous commands will not help you with users hogging disk space. If your user home directories are all on the same partition and you're not enforcing quotas, anything from a runaway program to a penchant for music downloads can quickly fill up the entire partition. This will cause common applications such as email to stop working altogether. If your mail server is set up to mount the user home directories and deliver mail to folders in the home directories, it won't be amused!

When a user calls to say email is not working, the first command you'll want to run is this one:

 $ df -h
 Filesystem                Size  Used Avail Use% Mounted on
 fileserver:/export/homes  323G  323G    0G 100% /.autofs/u

Well, that's a full filesystem if I ever saw one! The df command shows disk usage/free disk statistics for all mounted filesystems by default, or for whatever filesystems it receives as arguments. Now, to find out the identity of our disk hog, become root, and we'll turn to the du command:

 # du -s -B 1024K /home/* | sort -n

The du command above produces a summary (-s) for each directory under /home, presenting the disk usage of each directory in 1024K (1 MB) blocks. We then pipe the output of the command to the sort command, which we've told to sort it numerically instead of alphabetically by feeding it the -n flag. With this output, you can see right away where the most disk space is being used, and you can then take action in some appropriate fashion (either by contacting the owner of a huge file or directory, or by deleting or truncating an out-of-control log file [Hack #51]).
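When there are dozens of home directories, raw block counts are easier to act on if each one is shown as a share of the total. The sketch below is one way to do that, assuming tab-separated "size path" lines as du -s prints them; the function name disk_share and the default of five entries are illustrative choices.

```shell
# Show the N largest entries with each one's share of the total.
# Input: "size<TAB>path" lines on stdin, as printed by `du -s`.
# $1 = number of entries to show (hypothetical default: 5)
disk_share() {
    awk -F'\t' '
        { size[$2] = $1; total += $1 }
        END {
            for (d in size) {
                pct = (total > 0) ? 100 * size[d] / total : 0
                printf "%d\t%.1f%%\t%s\n", size[d], pct, d
            }
        }' |
        sort -k1,1nr | head -n "${1:-5}"
}

# Possible use on a live system, as root (not run here):
#   du -s -B 1024K /home/* | disk_share 10
```

The percentage is relative to the sum of the directories measured, not to the size of the filesystem, so it answers "who is the biggest of this group" and nothing more.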

8.2.2. Bandwidth Hogging

Users who are hogging network bandwidth are rarely difficult to spot using the tools we've already discussed. However, if the culprit isn't obvious for some reason, you can lean on a fundamental truth about Unix-like systems that goes back decades: everything is a file.

You can probe anything that can be represented as a file with the lsof command. To get a list of all network files (sockets, open connections, open ports), sorted by username, try this command:

 $ lsof -i -P | sort -k3

The -i flag to lsof says to select only network-related files. The -P flag says to show the port numbers instead of trying to map them to service names. We then pipe the output to our old friend sort, which we've told this time to sort based on the third field or "key," which is the username. Here's some output:

 sshd      1859 root 3u IPv6    5428 TCP *:22 (LISTEN)
 httpd     1914 root 3u IPv6    5597 TCP *:80 (LISTEN)
 sendmail 16643 root 4u IPv4  404617 TCP localhost.localdomain:25 (LISTEN)
 httpd     1914 root 4u IPv6    5598 TCP *:443 (LISTEN)
 dhcpd     5417 root 6u IPv4   97449 UDP *:67
 sshd     24916 root 8u IPv4 4660907 TCP localhost.localdomain:6010 (LISTEN)
 nmbd      7812 root 9u IPv4  161622 UDP *:137
 snmpd    25213 root 9u IPv4 4454614 TCP *:199 (LISTEN)
 sshd     24916 root 9u IPv6 4660908 TCP localhost:6010 (LISTEN)
 COMMAND    PID USER FD TYPE DEVICE SIZE NODE NAME

These are all common services, of course, but in the event that you catch a port or service here that you don't recognize, you can move on to using tools such as an MRTG graph [Hack #79], ngrep, tcpdump, or snmpget/snmpwalk [Hack #81] to try to figure out what the program is doing, where its traffic is headed, how long it has been running, and so on. Also, since lsof shows you which processes are holding open which ports, problems that need immediate attention can be dealt with using standard commands to renice or kill the offending process.
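On a busy machine the lsof listing can run to hundreds of lines, so a per-user tally is a quick way to see who holds the most sockets. This is only a sketch: the function name net_files_by_user is made up, and a large count is a hint about where to look, not a measurement of bandwidth.

```shell
# Count open network files per user.
# Input: `lsof -i -P` output on stdin (the header line may appear anywhere).
net_files_by_user() {
    awk '$1 != "COMMAND" { count[$3]++ }    # field 3 is the USER column
         END { for (u in count) printf "%d %s\n", count[u], u }' |
        sort -k1,1nr
}

# Possible use on a live system (not run here):
#   lsof -i -P | net_files_by_user
```

Skipping the header by its first word, and not by line number, means the function also works on output that has already been passed through sort.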



Linux Server Hacks (Vol. 2)