13.8 Deadlocks and Other Pesky Problems

Programs that use synchronization constructs have the potential for deadlocks that may not be detected by implementations of the POSIX base standard. For example, suppose that a thread executes pthread_mutex_lock on a mutex that it already holds (from a previous successful pthread_mutex_lock). The POSIX base standard states that pthread_mutex_lock may fail and return EDEADLK under such circumstances, but the standard does not require the function to do so. POSIX takes the position that implementations of the base standard are not required to sacrifice efficiency to protect programmers from their own programming errors. Several extensions to POSIX allow more extensive error checking and deadlock detection.
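One such facility is the error-checking mutex type, PTHREAD_MUTEX_ERRORCHECK, which was an XSI extension in older editions of the standard. The following is a minimal sketch, assuming the implementation supports that type; the function names init_errorcheck_mutex and relock_result are hypothetical and chosen only for illustration. With an error-checking mutex, a second lock attempt by the owning thread is reported as EDEADLK instead of hanging.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper: initialize *mutex as an error-checking mutex.
   Returns 0 on success or a pthread error code. */
int init_errorcheck_mutex(pthread_mutex_t *mutex) {
   pthread_mutexattr_t attr;
   int error;

   if ((error = pthread_mutexattr_init(&attr)) != 0)
      return error;
   error = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
   if (error == 0)
      error = pthread_mutex_init(mutex, &attr);
   (void)pthread_mutexattr_destroy(&attr);
   return error;
}

/* Hypothetical demonstration: lock the same mutex twice from one thread.
   Returns the error code produced by the second lock attempt. */
int relock_result(void) {
   pthread_mutex_t mutex;
   int error;

   if ((error = init_errorcheck_mutex(&mutex)) != 0)
      return error;
   if ((error = pthread_mutex_lock(&mutex)) != 0) {
      (void)pthread_mutex_destroy(&mutex);
      return error;
   }
   error = pthread_mutex_lock(&mutex);    /* relock by the owner is reported */
   (void)pthread_mutex_unlock(&mutex);    /* release the one lock we hold */
   (void)pthread_mutex_destroy(&mutex);
   return error;
}
```

A caller can compare the return value of relock_result with EDEADLK. With a default mutex the same sequence might block forever, so error-checking mutexes are mainly useful while testing and debugging.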

Another type of problem arises when a thread that holds a lock encounters an error. You must take care to release the lock before returning from the thread, or other threads might be blocked.
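One way to follow this advice is to record the error while the lock is held and to release the lock on a single exit path. The sketch below is illustrative only; the shared counter, the function names add_to_count and count_lock_is_free, and the use of ERANGE for a rejected update are all assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t countlock = PTHREAD_MUTEX_INITIALIZER;
static int count = 0;

/* Hypothetical example: add delta to a shared count but refuse to let the
   count go negative. Every path taken after a successful lock passes
   through the unlock before returning. */
int add_to_count(int delta) {
   int error;
   int result = 0;

   if ((error = pthread_mutex_lock(&countlock)) != 0)
      return error;                 /* lock not held, nothing to release */
   if (count + delta < 0)
      result = ERANGE;              /* error detected while holding the lock */
   else
      count += delta;
   error = pthread_mutex_unlock(&countlock);   /* single release point */
   return result ? result : error;
}

/* Hypothetical check: returns 0 if the mutex is currently free, or the
   error from pthread_mutex_trylock (EBUSY if it is still held). */
int count_lock_is_free(void) {
   int error;

   if ((error = pthread_mutex_trylock(&countlock)) != 0)
      return error;
   return pthread_mutex_unlock(&countlock);
}
```

After a call such as add_to_count(-10) fails with ERANGE, count_lock_is_free should still return 0, which shows that the error path did not leave the mutex locked.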

Threads with priorities can also complicate matters. A famous example occurred in the Mars Pathfinder mission. The Pathfinder executed a "flawless" Martian landing on July 4, 1997, and began gathering and transmitting large quantities of scientific data to Earth [34]. A few days after landing, the spacecraft started experiencing total system resets, each of which delayed data collection by a day. Several accounts of the underlying causes and the resolution of the problem have appeared, starting with a keynote address at the IEEE Real-Time Systems Symposium on Dec. 3, 1997, by David Wilner, Chief Technical Officer of Wind River [61].

Program 13.17 strerror_r.c

Async-signal-safe, thread-safe versions of strerror and perror.

    #include <errno.h>
    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    int strerror_r(int errnum, char *strerrbuf, size_t buflen) {
       char *buf;
       int error1;
       int error2;
       int error3;
       sigset_t maskblock;
       sigset_t maskold;

       /* block all signals so that a handler cannot interrupt the critical section */
       if ((sigfillset(&maskblock) == -1) ||
           (sigprocmask(SIG_SETMASK, &maskblock, &maskold) == -1))
          return errno;
       if ((error1 = pthread_mutex_lock(&lock)) != 0) {
          (void)sigprocmask(SIG_SETMASK, &maskold, NULL);
          return error1;
       }
       buf = strerror(errnum);
       if (strlen(buf) >= buflen)
          error1 = ERANGE;
       else
          (void)strcpy(strerrbuf, buf);
       /* release the mutex and restore the signal mask even if an error occurred */
       error2 = pthread_mutex_unlock(&lock);
       error3 = sigprocmask(SIG_SETMASK, &maskold, NULL);
       return error1 ? error1 : (error2 ? error2 : error3);
    }

    int perror_r(const char *s) {
       int error1;
       int error2;
       sigset_t maskblock;
       sigset_t maskold;

       if ((sigfillset(&maskblock) == -1) ||
           (sigprocmask(SIG_SETMASK, &maskblock, &maskold) == -1))
          return errno;
       if ((error1 = pthread_mutex_lock(&lock)) != 0) {
          (void)sigprocmask(SIG_SETMASK, &maskold, NULL);
          return error1;
       }
       perror(s);
       error1 = pthread_mutex_unlock(&lock);
       error2 = sigprocmask(SIG_SETMASK, &maskold, NULL);
       return error1 ? error1 : error2;
    }

The Mars Pathfinder flaw was found to be a priority inversion on a mutex [105]. A thread whose job was gathering meteorological data ran periodically at low priority. This thread would acquire the mutex for the data bus to publish its data. A periodic high-priority information thread also acquired the mutex, and occasionally it would block, waiting for the low-priority thread to release the mutex. Each of these threads needed the mutex only for a short time, so on the surface there could be no problem. Unfortunately, a long-running, medium-priority communication thread occasionally preempted the low-priority thread while the low-priority thread held the mutex, causing the high-priority thread to be delayed for a long time.
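Priority inheritance is a standard remedy for this kind of inversion: while a high-priority thread waits on the mutex, the holder temporarily runs at the waiter's priority, so a medium-priority thread cannot preempt it. POSIX exposes the protocol through a mutex attribute on systems that support the priority inheritance option. The following is a minimal sketch under that assumption; the function name init_inherit_mutex is hypothetical, and the sketch is not a description of the Pathfinder code itself.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper: initialize *mutex with the priority inheritance
   protocol. Returns 0 on success or a pthread error code, which may be
   ENOTSUP if the system does not support the protocol. */
int init_inherit_mutex(pthread_mutex_t *mutex) {
   pthread_mutexattr_t attr;
   int error;

   if ((error = pthread_mutexattr_init(&attr)) != 0)
      return error;
   error = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
   if (error == 0)
      error = pthread_mutex_init(mutex, &attr);
   (void)pthread_mutexattr_destroy(&attr);
   return error;
}
```

A mutex initialized this way is locked and unlocked with the usual pthread_mutex_lock and pthread_mutex_unlock calls. The protocol only matters when threads run under a real-time scheduling policy with distinct priorities.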

A second aspect of the problem was the system reaction to the error. The system expected the periodic high-priority thread to regularly use the data bus. A watchdog timer thread would notice if the data bus was not being used, assume that a serious problem had occurred, and initiate a system reboot. The high-priority thread should have been blocked only for a short time when the low-priority thread held the mutex. In this case, the high-priority thread was blocked for a long time because the low-priority thread held the mutex and the long-running, medium-priority thread had preempted it.

A third aspect was the test and debugging of the code. The Mars Pathfinder system had debugging code that could be turned on to run real-time diagnostics. The software team used an identical setup in the lab to run in debug mode (since they didn't want to debug on Mars). After 18 hours, the laboratory version reproduced the problem, and the engineers were able to devise a patch. Glenn Reeves [93], leader of the Mars Pathfinder software team, was quoted as saying "We strongly believe in the 'test what you fly and fly what you test' philosophy." The same ideas apply here on Earth too. At a minimum, you should always think about instrumenting code with test and debugging functions that can be turned on or off by conditional compilation. When possible, allow debugging functions to be turned on dynamically at runtime.
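The two techniques can be combined, as in the following sketch. The names DEBUG, DPRINTF, set_debug and debug_printf are hypothetical and not taken from any library. Compiling with -DDEBUG includes the diagnostic calls, and set_debug switches them on or off while the program runs. A mutex protects the flag and also keeps messages from different threads from interleaving.

```c
#include <assert.h>
#include <pthread.h>
#include <stdarg.h>
#include <stdio.h>

static pthread_mutex_t debuglock = PTHREAD_MUTEX_INITIALIZER;
static int debugon = 0;             /* runtime switch, initially off */

/* Turn debugging output on (nonzero) or off (0) at runtime.
   Returns 0 on success or a pthread error code. */
int set_debug(int on) {
   int error;

   if ((error = pthread_mutex_lock(&debuglock)) != 0)
      return error;
   debugon = on;
   return pthread_mutex_unlock(&debuglock);
}

/* Write a formatted message to standard error if debugging is on.
   Returns the number of characters written, 0 if debugging is off,
   or -1 if the mutex could not be locked. */
int debug_printf(const char *fmt, ...) {
   va_list ap;
   int n = 0;

   if (pthread_mutex_lock(&debuglock) != 0)
      return -1;
   if (debugon) {
      va_start(ap, fmt);
      n = vfprintf(stderr, fmt, ap);
      va_end(ap);
   }
   (void)pthread_mutex_unlock(&debuglock);
   return n;
}

/* Compile-time switch: without -DDEBUG the calls disappear entirely. */
#ifdef DEBUG
#define DPRINTF(...) debug_printf(__VA_ARGS__)
#else
#define DPRINTF(...) ((void)0)
#endif
```

A thread might call DPRINTF("bus idle for %d ms\n", idle) at points of interest. In a production build the macro expands to nothing, while a debug build pays only for a mutex lock and a flag test when the runtime switch is off.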

A final aspect of this story is timing. In some ways, the Mars Pathfinder was a victim of its own success. The software team did extensive testing within the parameters of the mission. They actually saw the system reset problem once or twice during testing, but did not track it down. The reset problem was exacerbated by high data rates that caused the medium-priority communication thread to run longer than expected. Prelaunch testing was limited to "best case" high data rates. In the words of Glenn Reeves, "We did not expect nor test the 'better than we could have ever imagined' case." Threaded programs should never rely on quirks of timing to work; they must work under all possible timings.

UNIX Systems Programming: Communication, Concurrency and Threads
ISBN: 0130424110
Year: 2003
Pages: 274
