18.3 Stuck Daemons and Deliveries

Some of the most frustrating problems are due to background daemons that don't do what they're supposed to do. Fortunately the daemontools package makes daemon debugging relatively straightforward.

18.3.1 Daemons Won't Start, or They Start and Crash Every Few Seconds

Starting a daemon under svscan and supervise is simple in concept, although the details can bite you. The super-daemon is started at system boot time by running /command/svscanboot. It runs svscan to control daemons and the useful but obscure readproctitle, which takes any error messages from svscan and puts them into its command area so that ps will show it.^[1]

^[1] This odd way of displaying error messages is intended to work even in the presence of serious configuration screwups like disks that should be mounted but aren't and directories that are supposed to be writable but aren't.

Every five seconds svscan looks at all of the subdirectories of /service and starts up a supervise process on any that don't have one running. In the usual case that the subdirectory in turn has a subdirectory called log, it starts a second supervise process in the subdirectory and pipes the output from the first process to the second.

When supervise starts up a daemon, it runs the file run in the daemon's directory. That file has to be a runnable program that either is or, more commonly, exec's the daemon itself. That means that run has to have its execute bits set and, if it's a shell script, start with #!/bin/sh so that it's runnable. If either of those isn't the case, there is a failed attempt to start the daemon every five seconds. A ps l that shows readproctitle should reveal the error messages and give hints about what needs to be fixed.

The run script generally sets up the program environment and then exec's the actual daemon. If you become super-user and type ./run, the daemon should start. If that works, the daemon still doesn't start, and you don't use full program paths in the run file, the problem is most likely that the search path that supervise uses isn't the same as the one you're using. Look at /command/svscanboot to see the search patch that it uses. Most notably, it does not include /var/qmail/bin unless you edit the file yourself to include it.

18.3.2 Nothing Gets Logged

Sometimes the daemon runs but nothing's going into the log files. This generally is due to either file protection problems or an incorrect set of multilog options. The usual way to run multilog is to create a subdirectory called main in which it rotates log files. It's safer to run daemons as a user other than root, so when possible, use qmaill, the qmail log user. A common error is to forget to change the ownership of the log file directory to qmaill (or whatever the log user is). When multilog starts successfully, it creates a current log file in the directory, so if there's no main/current, the most likely problem is directory ownership or protection.

If multilog is running but there's nothing logged, the most likely problems are that the daemon isn't sending anything to log, or that multilog's options are telling it to discard everything. Because the daemon and the logger are connected with a regular Unix pipe, only messages sent to the daemon's standard output go to the logger. In particular, anything sent to standard error shows up in readproctitle, not the log. If, as is usually the case, you want to log the errors a daemon reports, just redirect the error output to the standard output in the run script with the standard shell redirect 2>&1. (That redirect is at the end of just about every run script example in this book.)

If the daemon is a program originally intended to run as a standalone daemon rather than under daemontools, it probably sends its reports to syslog, not to standard output or standard error. In most cases, there is an option to send messages to stdout or stderr.

If you are using multilog options to select what to log, be sure that you're selecting what you think you are. In particular, its pattern language resembles shell wildcards but is in fact considerably weaker because it doesn't move ahead or back up on a failed match. (Patterns do resemble shell wildcards closely enough that they should always be quoted to keep the shell from messing with them.) The pattern must match the whole line, and stars stop matching the moment they see the following character in the pattern. If a pattern is, say, +'+*: status: *', it will match one: status: two, but it will not match one: two: status: three, because the star will stop at the first colon and won't look for the second one. If the pattern didn't have the star at the end, it wouldn't match anything useful because it wouldn't match any lines with anything after the status:. In practice, most log file messages have a pretty simple syntax, and it's not hard to come up with adequate patterns if you keep in mind the limitations of the pattern-matching language. For debugging, start with no patterns to be sure that the stream of messages going into the log files contains what you expect, then add one or two patterns at a time and restart multilog with svc -t and see what's going into main/current each time until it looks right.

18.3.3 Daemons Are Running but Making No Progress

One of the most baffling problems occurs when the daemon seems OK, the logger seems OK, but the daemon's not doing anything. What's wrong? Usually the problem is that the disk to which the log files are written has filled up or is mounted read-only. Because multilog is designed not to lose any log data, if it can't write to the disk, it just waits and retries until it can. This means that the pipe between the daemon and multilog fills up and the daemon stalls waiting to be able to write to the pipe. The solution is to delete some files and fix whatever it was that filled up the disk so it doesn't happen again. If the disk is full of files written by various multilog loggers, adding or adjusting s and n options to set the maximum size and number of log files can help.

18.3.4 Mail Rejected with Stray Newline Reports

The SMTP spec says that the way that each line of text in an SMTP session ends is with a carriage return/line feed pair (0d 0a in hex or \r\n in C.) Some buggy MUAs and MTAs only try to send mail that contains linefeeds with no preceding carriage return. Qmail's SMTP daemon normally rejects such mail with a log message like Stray newline from 10.2.3.4 because there's no way to tell whether the bare linefeed is just missing a carriage return or it's some kind of malformed binary data.

If you're seeing stray newline entries in your logs and you're reasonably sure that they're being sent by MTAs or MUAs that intend them to be handled as an end-of-line, use the fixcrio program from the ucspi-tcp package to placate the SMTP daemon. Modify the run script for qmail-smtpd so that it pipes mail through fixcrio, as shown in Example 18-1:

Example 18-1. SMTP daemon that forgives stray newlines

 1. #!/bin/sh  2. limit datasize 3m  3. exec tcpserver \  4.    -u000 -g000 -v -p -R \  5.      0 25 \  6. /usr/local/bin/fixcrio | /var/qmail/bin/qmail-smtpd" 2>&1

Line 6 is the modified one, starting up fixcrio and qmail-smtpd. When fixcrio runs, it passes the input and output of qmail-smtpd through pipes so it can add missing carriage returns in front of newlines as needed. In the longer run, see if you can persuade your correspondents to upgrade their SMTP clients to newer, less buggy versions.