22.11 Performance Studies

This section provides guidelines for doing a performance study and points out common pitfalls. We focus on the problem of comparing the performance of thread-per-request and worker-pool implementations for servers that do no disk I/O. You are asked to evaluate connection time and response times for the two approaches and to assess the influence of message size on the results. While this book is about UNIX, not performance evaluation, performance-based tuning is often necessary in such systems. In our experience, many excellent programmers do not have a good sense of what to measure, how to measure it, and what they have actually measured after doing the performance study.

22.11.1 Baseline measurements

All real computer performance studies face the same problem: a large number of hard-to-control variables whose influence on the result is highly nonlinear. Therefore, it is essential to understand the factors that might affect the results before starting to measure.

The first rule of performance measurement is to establish a baseline before varying any parameters. Do you expect the results to be on the order of seconds? Milliseconds? Microseconds? How much will the results vary from measurement to measurement? What influences variability besides the experimental parameters that you are explicitly varying?

Since you are trying to measure the difference in performance between two different strategies, a natural baseline is the time for exchanging a single message stream of the same type as will be used in testing the threaded servers. For example, you might take the reflectclient.c of Program 18.4 and the reflectserver.c of Exercise 18.15 as a starting point for your preliminary measurements. Measure the connection times and times to send and receive messages of different sizes in order to establish the baseline or control for comparing threaded servers. These measurements give a lower bound on the times and the variability of the measurements in the environment that you are working in. Establishing the baseline is an important step in understanding your measurements.
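The following is a minimal sketch of such a baseline driver, assuming the UICI u_connect function and the restart library (r_write, r_read) developed earlier in the book. The host and port come from the command line, the message size is fixed at 1K, and the timediff_ms helper is introduced here for illustration (clock_gettime may require linking with -lrt on older systems). A production driver would repeat the exchange many times and would loop until the full response arrives (see Exercise 22.23).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include "restart.h"
#include "uici.h"

#define BLKSIZE 1024

/* Return the elapsed time between two timespec values in milliseconds. */
static double timediff_ms(struct timespec *start, struct timespec *end) {
   return (end->tv_sec - start->tv_sec)*1000.0 +
          (end->tv_nsec - start->tv_nsec)/1000000.0;
}

int main(int argc, char *argv[]) {
   char buf[BLKSIZE];
   int communfd;
   struct timespec tstart, tconnect, tsend, trecv;

   if (argc != 3) {
      fprintf(stderr, "Usage: %s host port\n", argv[0]);
      return 1;
   }
   memset(buf, 'a', BLKSIZE);
   clock_gettime(CLOCK_REALTIME, &tstart);
   if ((communfd = u_connect((u_port_t)atoi(argv[2]), argv[1])) == -1) {
      perror("Failed to connect");
      return 1;
   }
   clock_gettime(CLOCK_REALTIME, &tconnect);
   clock_gettime(CLOCK_REALTIME, &tsend);
   if ((r_write(communfd, buf, BLKSIZE) == -1) ||
       (r_read(communfd, buf, BLKSIZE) == -1))    /* may read fewer than BLKSIZE bytes */
      perror("Failed to exchange message");
   clock_gettime(CLOCK_REALTIME, &trecv);
   printf("connect: %.3f ms, round trip: %.3f ms\n",
          timediff_ms(&tstart, &tconnect), timediff_ms(&tsend, &trecv));
   return 0;
}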

Exercise 22.15

We modified the reflecting client of Program 18.4 to measure the time to establish a connection to the reflection server of Exercise 18.15 and to send and receive a 1K message. The client and server were running on two Sun Microsystems Ultra-10 machines with 440 MHz processors that were connected by 100 Mbit/sec Ethernet through a switch. The first run gave a connect time of 120 ms and a round trip response time of 152 ms. Subsequent runs gave connect times of around 3 ms and round trip times of about 1 ms. Can you explain these results?

Answer:

A quick look at u_connect and u_accept suggested that DNS lookup was probably the culprit in the long times on the first run. The u_connect function calls name2addr before calling connect. After return from accept, u_accept also contacts DNS to obtain the hostname of the client. Once the names are in the local DNS cache, retrieval is much faster. These results suggest that UICI should probably be modified when it is used for timing measurements.

22.11.2 Sources of variability

Clearly, the underlying system variability that you observe in single-threaded measurements confounds your ability to distinguish performance differences between the two threading approaches. You can reduce variability by carefully selecting the conditions under which you take measurements. If you have control over the machines in question, you can make sure that no one else is using those machines during your measurements. In many situations, however, you do not have sufficient control of the resources to restrict access. Two other steps are essential in obtaining meaningful answers. First, you should record the conditions under which you performed the measurements and make sure that they did not change significantly over the course of the experiments. Second, when the confounding factors vary significantly over time or you can't quantify how much they are varying, you need to take many more measurements over extended periods to be sure that your numbers are valid.

Exercise 22.16

How might system load contribute to the variability of single-threaded client server communication?

Answer:

Relevant system load parameters are CPU usage, memory usage and network subsystem usage. If the virtual memory system does not have enough pages to accommodate the working sets of the processes running on the system, the system will spend a lot of time swapping disk pages in and out. All network communication on a host passes through the same subsystems, so other processes that are doing network I/O or disk I/O compete for subsystem resources.

Exercise 22.17

Investigate the tools for measuring system load on your system. How can you use these tools to characterize the environment for your measurements?

Answer:

System load is hard to control unless you have control over the machines on which the clients and server are running. At a minimum, you should record the system loads immediately before and after your measurements. For long-running measurements, you should periodically record the system load during the run. The UNIX uptime command supplies information about system load. You might also investigate vendor-specific tools such as Sun Microsystems' perfmeter. The rstatd(1M) service allows remote access to system performance information.
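As a minimal sketch (not from the book), the following function records the standard 1-, 5- and 15-minute load averages with a timestamp. The getloadavg function is not part of POSIX but is available on Solaris, BSD and Linux; on Solaris it is declared in sys/loadavg.h rather than stdlib.h.

#include <stdio.h>
#include <stdlib.h>          /* declares getloadavg on BSD and Linux */
#include <time.h>

/* Append a tagged, timestamped load-average record to a log stream. */
static void log_load(FILE *logfp, const char *tag) {
   double loadavg[3];
   time_t now = time(NULL);

   if (getloadavg(loadavg, 3) == -1) {
      fprintf(logfp, "%s %ld: load average unavailable\n", tag, (long)now);
      return;
   }
   fprintf(logfp, "%s %ld: load %.2f %.2f %.2f\n",
           tag, (long)now, loadavg[0], loadavg[1], loadavg[2]);
}

A driver might call log_load(logfp, "before") and log_load(logfp, "after") around each run, and call it periodically during long runs.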

22.11.3 Measurement errors

Measurement errors result from side effects whose times are significant compared with the event times that are to be measured (e.g., printing in the timing loop).

Exercise 22.18

We measured the time to execute a single

 fprintf(stderr, "this is a test"); 

displaying to the screen in unbuffered form on the system described in Exercise 22.15. The single fprintf took about 0.25 ms, while an fprintf that output five double values took about 0.4 ms. However, we found that the time for 10,000 executions of the first print statement was highly variable, ranging from 1 to 10 seconds. Give possible explanations for the variability.

Answer:

Although standard error is not buffered from the user's perspective, the screen device driver buffers output to match the speed of the output device. As the buffer fills up and empties, the time to return from fprintf varies significantly.

Given that the request-response cycle for a 1K packet is about 1 ms for the system and that we are trying to measure additional overhead incurred by threading, the time to execute extraneous print statements can be significant. The sprintf statements may also incur significant overhead for formatting strings. To do careful measurements, you should avoid all printing in timing loops. The next two examples show two common timing-loop errors.

Exercise 22.19

What timing errors occur in the following pseudocode for measuring the connection and response times of a server? What happens if you omit the last assignment statement?

get time1
connect to the server
get time2
output time2 - time1
loop
   write request to the server
   read response from the server
   get time3
   output time3 - time2
   time2 = time3

Answer:

The output of time3 - time2 occurs between the measurement of two successive time3 values, hence this statement is in the timing loop. The program should also not output time2 - time1 between the connect and the first write. A better approach would be to save the times in an array and output them after the measurements are complete. If you omit the time2 = time3 statement, all times are measured from the beginning of the session. The estimates for the request-response cycle won't mean anything. If you want to measure the time for the total response, move the statement to get the ending time outside the loop.
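A minimal sketch of such a loop follows, reusing the timediff_ms helper from the earlier baseline sketch; NUMREQUESTS and the placeholder comments stand in for whatever exchange the driver actually performs, and the fragment would sit inside the driver's main function.

#define NUMREQUESTS 100

double connecttime;                        /* milliseconds */
double responsetime[NUMREQUESTS];          /* milliseconds */
struct timespec tprev, tnow, tstart;
int i;

clock_gettime(CLOCK_REALTIME, &tstart);
/* connect to the server here */
clock_gettime(CLOCK_REALTIME, &tprev);
connecttime = timediff_ms(&tstart, &tprev);
for (i = 0; i < NUMREQUESTS; i++) {
   /* write the request and read the complete response here */
   clock_gettime(CLOCK_REALTIME, &tnow);
   responsetime[i] = timediff_ms(&tprev, &tnow);
   tprev = tnow;
}
/* all output happens after the measurements are complete */
printf("connect time: %.3f ms\n", connecttime);
for (i = 0; i < NUMREQUESTS; i++)
   printf("%.3f\n", responsetime[i]);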

Exercise 22.20

Would outputting to disk during the timing be better or worse than outputting to screen?

Answer:

The outcome is a little hard to predict, but either way it cannot be good. Disk access times are on the order of 10 ms. However, a disk write does not actually go directly to the disk but is usually buffered or cached. If the disk is not local but mounted through NFS, the output introduces network traffic as well as delay. For I/O that must be done during the measurements in such an environment, it is better to use /tmp , which is likely to be located on a local disk.

Exercise 22.21

What is wrong with measuring the sending of the request and the receiving of the response individually, such as in the following?

get time1
write request
get time2
read response
get time3
sendtime = time2 - time1
receivetime = time3 - time2

Answer:

The sendtime is not the time for the message to reach its destination, but the time to copy the information from the user's variable to system buffers so that the network subsystem can send it. This copying time is usually not meaningful in the context of client-server performance.

Printing inside the timing loop can also occur in the server, as illustrated by the pseudocode in the next example. Direct screen output by the threads effectively synchronizes all the threads on each request-response, eliminating parallelism; the effect gets worse as the number of threads grows. Use flags and conditional compilation to handle debugging statements.

Exercise 22.22

Why does the following pseudocode for a server thread using thread-per-request present a problem for timing measurements?

loop until error:
   read request
   write response
   output a message summarizing the response
close connection

Answer:

The output statement, although executed by the server, is effectively in the client's timing loop. Print statements on the server side have the added problem of implicitly synchronizing the threads on a shared device.

Another inefficiency that can affect timing is the use of an unnecessary select statement in the worker threads. You do not need to use select for request-response situations unless you must control timeouts.

Exercise 22.23

What is wrong with the following code segment for writing a block of size BLKSIZE followed by reading a block of the same size?

if (r_write(communfd, buf, BLKSIZE) < 0)
   perror("Failed to write");
else if (r_read(communfd, buf, BLKSIZE) < 0)
   perror("Failed to read");

Answer:

The r_write function calls write in a loop until the entire BLKSIZE buffer is written. The r_read function only executes one successful read , so the entire BLKSIZE response may not be read. Thus, a client driver that uses a single r_read call may not correctly time this request-response, particularly for large packets on a wide area network. Worse, the next time the client times a request-response for the connection, it will read the response from the previous request.
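One fix is to read in a loop until the entire block arrives, as in the following sketch. The restart library provides a similar block-reading wrapper; the version here is written out for illustration and assumes the server sends exactly size bytes per response.

#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Read exactly size bytes from fd, restarting on EINTR.
   Returns size on success, 0 if the connection closes early, -1 on error. */
ssize_t read_full_block(int fd, void *buf, size_t size) {
   char *bufp = buf;
   size_t remaining = size;
   ssize_t nread;

   while (remaining > 0) {
      nread = read(fd, bufp, remaining);
      if ((nread == -1) && (errno == EINTR))
         continue;                         /* interrupted by a signal; retry */
      if (nread == -1)
         return -1;                        /* real error */
      if (nread == 0)
         return 0;                         /* peer closed before size bytes arrived */
      bufp += nread;
      remaining -= nread;
   }
   return (ssize_t)size;
}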

22.11.4 Synchronization

The thread-per-request server does not require explicit synchronization, so in theory synchronization isn't an issue for this server. However, implicit synchronization can occur even for thread-per-request whenever threads share a common resource, such as the screen. Avoid print statements in your server except for debugging or for warning of a serious error condition. Debugging statements should always be enclosed in a conditional compilation clause.

Example 22.24

The following statement is compiled in the program because DEBUG has been defined.

#define DEBUG 1

#ifdef DEBUG
   fprintf(stderr, "Sending the message....\n");
#endif

To eliminate the fprintf, comment out the #define statement or remove it entirely. In the latter case, you can still define DEBUG by using the -D option on compilation.

The synchronization issues for the worker pool are more complex. The three common implementations for the worker-pool model have different synchronization characteristics. In the most straightforward implementation, each worker thread blocks on accept . This mechanism relies on the availability of a thread-safe accept function with synchronization handled by the library function itself. POSIX specifies that accept should be thread-safe, but not all OS implementations provide a reliable thread-safe accept . A second implementation of worker pool protects accept with a mutex lock, as illustrated schematically in Example 22.25.

Example 22.25

In the following pseudocode for a worker-pool implementation, the mutex lock effectively forms a barrier allowing one thread at a time to pass through and block on accept .

loop
   mutex lock (if error, output message to log, clean up and exit)
   accept (if error, release lock and continue)
   mutex unlock (if error, output message to log, clean up and exit)
   process request (if error, output message to log, clean up and continue)

The pseudocode of Example 22.25 indicates what to do in case of error. A common mistake is failing to release the lock when an error occurs on accept. In that case, the server deadlocks because no other worker can acquire the mutex lock.
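A minimal C sketch of this worker loop follows. It assumes the UICI u_accept and restart r_close functions from earlier chapters; listenfd, acceptlock and the handle_connection request handler are illustrative names set up elsewhere in the server.

#include <limits.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include "restart.h"
#include "uici.h"

extern int listenfd;                       /* listening descriptor created by main */
extern pthread_mutex_t acceptlock;         /* initialized by main */
void handle_connection(int communfd);      /* illustrative request handler */

/* Worker thread: one thread at a time blocks in u_accept. */
void *worker(void *arg) {
   char client[MAX_CANON];
   int communfd;
   int error;

   for ( ; ; ) {
      if ((error = pthread_mutex_lock(&acceptlock))) {
         fprintf(stderr, "Worker failed to lock: %s\n", strerror(error));
         break;                            /* clean up and exit */
      }
      communfd = u_accept(listenfd, client, MAX_CANON);
      if ((error = pthread_mutex_unlock(&acceptlock))) {   /* release even if accept failed */
         fprintf(stderr, "Worker failed to unlock: %s\n", strerror(error));
         break;
      }
      if (communfd == -1)                  /* accept failed; lock already released */
         continue;
      handle_connection(communfd);
      r_close(communfd);
   }
   return NULL;
}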

The buffer implementation of the worker pool is prone to other performance bottlenecks. For example, if the master producer thread executes pthread_cond_broadcast rather than pthread_cond_signal when it puts an item in the buffer, all waiting threads are awakened and must contend for the mutex that controls the items. This implementation puts a significant synchronization load on the server, even for moderate numbers of workers. The producer should signal rather than broadcast on items, and consumers should signal rather than broadcast on slots.
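The following sketch shows the producer side of such a bounded buffer of communication descriptors; the buffer layout, counter and condition-variable names are illustrative rather than the book's exact implementation.

#include <pthread.h>

#define BUFSLOTS 8

static int buffer[BUFSLOTS];               /* holds communication descriptors */
static int nitems = 0;                     /* number of filled slots */
static int putindex = 0;
static pthread_mutex_t bufferlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t items = PTHREAD_COND_INITIALIZER;   /* signaled when an item is added */
static pthread_cond_t slots = PTHREAD_COND_INITIALIZER;   /* signaled when a slot is freed */

/* Producer: insert a descriptor and wake at most one waiting consumer.
   Returns 0 on success or an error number. */
int put_item(int communfd) {
   int error;

   if ((error = pthread_mutex_lock(&bufferlock)))
      return error;
   while (nitems == BUFSLOTS)              /* wait for a free slot */
      if ((error = pthread_cond_wait(&slots, &bufferlock))) {
         pthread_mutex_unlock(&bufferlock);
         return error;
      }
   buffer[putindex] = communfd;
   putindex = (putindex + 1) % BUFSLOTS;
   nitems++;
   pthread_cond_signal(&items);            /* one waiter suffices; avoid broadcast */
   return pthread_mutex_unlock(&bufferlock);
}

The consumer is symmetric: it waits on items and signals slots after removing a descriptor.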

22.11.5 Just plain errors

You can't rely on timing results from a program that doesn't work correctly. It is important to check the return values of all library functions, including thread calls. Use the lint utility on your source and pay attention to the output. In particular, do not ignore messages saying that a function is implicitly assumed to return int; they suggest that you are missing header files.

Because the threads are executing in the environment of their parent, threaded servers are prone to memory leaks that are not a problem for servers that fork children. If a thread calls pthread_exit without freeing buffers or closing its communication file descriptor, the server will be saddled with the remnants for the remainder of its lifetime.

Exercise 22.26

What memory leaks are possible in the following code?

loop
   malloc space for communfd
   if malloc fails
      quit
   accept a client connection
   if accept fails
      continue
   create a thread to handle the communication
   if the thread create fails
      continue

Answer:

If accept fails, the space for the communication file descriptor leaks. If the thread create fails, the server leaves an open file descriptor as well as allocated memory.

Exercise 22.27

What assumptions does the following code make in casting communfd ?

int communfd;

if ((communfd = u_accept(listenfd, client, MAX_CANON)) == -1)
   return -1;
if (pthread_create(&tid, NULL, process_request, (void *)communfd))
   return -1;

Answer:

The code implicitly assumes that an int can be correctly cast to void * , an assumption that may not be true for all machines.
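A minimal sketch of a more portable accept-and-dispatch loop, which also avoids the leaks of Exercise 22.26, allocates space for each descriptor and cleans up on every failure path. The process_request handler is illustrative and is assumed to close the descriptor and free its argument.

#include <limits.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#include "uici.h"

void *process_request(void *arg);          /* illustrative; closes *arg and frees arg */

int accept_and_dispatch(int listenfd) {
   char client[MAX_CANON];
   int *communfdp;
   pthread_t tid;

   for ( ; ; ) {
      if ((communfdp = malloc(sizeof(int))) == NULL)
         return -1;                        /* quit on malloc failure */
      if ((*communfdp = u_accept(listenfd, client, MAX_CANON)) == -1) {
         free(communfdp);                  /* don't leak the allocated int */
         continue;
      }
      if (pthread_create(&tid, NULL, process_request, communfdp)) {
         close(*communfdp);                /* thread creation failed: close and free */
         free(communfdp);
         continue;
      }
      pthread_detach(tid);                 /* avoid accumulating unjoined threads */
   }
}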

Memory leaks for threaded servers can occur if any path of execution doesn't free resources. The thread-per-request threads must free any space that they allocated or that was allocated on their behalf by their parent thread before creation. In addition, they must close the communication file descriptor even if an error occurred.

The worker-pool implementations do not need to allocate memory space for the communication file descriptors, and often they allocate buffers only once. However, the explicit synchronization introduces its own quagmire of error possibilities. Using a single mutex lock for mutual exclusion on the buffer and for tracking items and slots can result in incorrect or extremely delayed synchronization. Failure to synchronize empty slots can result in the server overwriting file descriptors before they are consumed.

Another resource management problem can occur in thread-per-request. When a thread exits, it leaves state and must be waited for unless it is a detached thread. These "zombie" threads are a leak for a long-running server. Finally, you should think seriously about the legitimate causes for a server to exit. In general, a client should not be able to cause a server to exit. The server should only exit if an irrecoverable error due to resources (memory, descriptors, etc.) would jeopardize future correct execution. Remember the Mars Pathfinder!

22.11.6 What to measure?

In most computer performance studies there are too many parameters to vary simultaneously, so usually you can't run exhaustive tests. If you could, the results would be hard to handle and make sense of. The specific problem that we are considering here has relatively few variables for a performance problem, but even it is complex. Random testing of such a problem generally does not produce insight, and you should avoid it except for debugging. As a first step in formulating testable hypotheses, you should write down the factors that might influence the performance, their probable effect, plausible limits for their sizes, and how these tests should compare with baseline tests.

Example 22.28

The performance of the thread-per-request server without disk I/O depends on the number of simultaneous requests, the duration of these requests, and the I/O that must be performed during the processing of the request and the response. While the I/O costs probably depend on both the number of messages that are exchanged and their sizes, to first order the total number of bytes exchanged is probably the most important cost. Plausible limits are just that: guesses. One might guess that a server should be able to handle 10 simultaneous streams without a problem. Whether it could handle 100 or 1000 simultaneous streams is anyone's guess, but these ranges give a starting point for the measurements.

Exercise 22.29

Give performance factors for the worker pool implemented with a mutex lock protecting accept .

Answer:

The factors specified in Example 22.28 are relevant. In addition, the number of threads in the worker pool relative to the number of simultaneous connections should also be important.

The preceding examples and exercises suggest that the most important control variable is the number of simultaneous connections that a server can handle. To measure the server capacity, you will need to be able to control the number of simultaneous connections offered by your driver programs. The client-driver program of Section 22.4 offers parallel loads. Such a client driver running on a single machine might reasonably offer 5 or 10 parallel streams, but is unlikely to sustain 100 parallel streams. Suppose you want to test your server with 10 and 100 parallel streams. A reasonable approach to generating the 100 parallel streams might be to have 10 different hosts generate 10 streams each.

Exercise 22.30

Describe the load offered by the client-driver program of Section 22.4 if it forks 10 children that each make 10 connections. Suppose each connection consists of 10 request/response pairs of 100 bytes each.

Answer:

The number of connections per child is far too low to offer a sustained load of 10 simultaneous streams. Forking the 10 children takes sufficiently long and the request streams are sufficiently short that some of the first children will finish or nearly finish before the later children start execution.

Many beginning analysts typically do not take enough measurements to make their studies meaningful and do not account for transient behavior. One approach to eliminating transients is for the loading programs to sustain the load longer than needed and discard the beginning and the end of the record. You can decrease or eliminate the amount that needs to be discarded by synchronizing the children before starting. Children of a single parent can call sigsuspend. The parent can then send a wake-up signal to the process group. For clusters of driver processes running on different machines, the parents can listen for a synchronization server, whose sole job is to initiate connections to the parent drivers. Section 22.5 describes the approach in detail.
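The following is a minimal sketch of the single-parent case, not the book's driver itself: the parent blocks SIGUSR1 before forking so the wake-up cannot be lost, the children suspend until the signal arrives, and the parent then signals the whole process group.

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NUMCHILDREN 10

static void wakeup(int signo) { }          /* empty handler; just interrupts sigsuspend */

int main(void) {
   struct sigaction act;
   sigset_t blockusr1, oldmask;
   pid_t child;
   int i;

   act.sa_handler = wakeup;                /* make SIGUSR1 nonfatal */
   act.sa_flags = 0;
   sigemptyset(&act.sa_mask);
   if (sigaction(SIGUSR1, &act, NULL) == -1)
      return 1;
   sigemptyset(&blockusr1);
   sigaddset(&blockusr1, SIGUSR1);
   if (sigprocmask(SIG_BLOCK, &blockusr1, &oldmask) == -1)  /* block before forking */
      return 1;
   for (i = 0; i < NUMCHILDREN; i++) {
      if ((child = fork()) == -1)
         return 1;
      if (child == 0) {                    /* child waits for the start signal */
         sigsuspend(&oldmask);             /* atomically unblock SIGUSR1 and wait */
         /* offer the load here: connect, exchange requests, record times */
         return 0;
      }
   }
   kill(0, SIGUSR1);                       /* parent wakes the entire process group */
   while (wait(NULL) > 0) ;                /* wait for all the children */
   return 0;
}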

To pick parameter values that make sense, you must understand the relationship among the processes, connections and messages values. The number of processes roughly corresponds to the number of parallel connections that are established. However, this assumes steady state. If each client process makes only two connections and sends two messages on each connection, some client processes will probably complete before the parent driver finishes forking all the child processes. The actual length of a run needed to accurately estimate performance is a statistical question beyond the scope of this text. Roughly, the larger the variability in the values, the more measurements you need.

Generally, if the number of threads in the worker pool is greater than the number of request streams, you would expect a worker pool to consistently outperform thread-per-request because it should have less overhead. If the number of request streams exceeds the number of workers, thread-per-request might do better, provided that the system has enough resources. Therefore, if the main variable is offered load, be sure to vary the number of simultaneous request streams from 1 to a value well beyond the number of worker-pool threads. Look, too, for discontinuities in behavior as the number of request streams approaches and exceeds the number of worker-pool threads.

For parameters that influence the system in a highly nonlinear way, it is often useful to measure a few widely separated values. For example, to understand the influence of message size on the performance, you might decide to measure the response as a function of offered load for two different message sizes. Choosing message sizes of 32 bytes and 64 bytes to compare does not give meaningful results because each of these messages always fits into a single physical packet. Although one message is twice as big as the other, the messages are essentially the same size as far as the network is concerned. The network headers on these messages might be comparable to the data in size. You would get more useful information by picking message sizes of 512 bytes and 20 kilobytes, typical sizes for a simple web page and an image, respectively. In addition to being physically meaningful, these sizes exercise different characteristics of the underlying network protocols. A 512-byte message should traverse the network in a single packet even on a wide area network. The 20K message is larger than the typical 8K slow-start limit for TCP, so its transmission should experience some congestion control, at least on a wide area network.

22.11.7 Data analysis and presentation

Simple statistical measures such as the mean, median and standard deviation are useful characterizations of behavior. The median is less sensitive to outliers and is often used in network measurements. In general, medians should be smaller and more stable than means for these distributions. If your medians don't reflect this, you probably are not computing the statistics correctly. If your medians and means are consistently different by an order of magnitude, you should worry! Also, when combining results from multiple clients, don't take the median of the medians and present that as the median.
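As a minimal sketch, the following function computes these statistics for an array of timing samples; when combining results from multiple clients, pass the merged array of raw samples rather than per-client summaries (link with -lm for sqrt).

#include <math.h>
#include <stdlib.h>

/* Comparison function for qsort on doubles. */
static int cmpdouble(const void *a, const void *b) {
   double x = *(const double *)a;
   double y = *(const double *)b;
   return (x > y) - (x < y);
}

/* Compute the mean, median and (population) standard deviation of n
   timing samples; sorts the array in place. Assumes n > 0. */
static void summarize(double *t, int n, double *mean, double *median, double *stddev) {
   double sum = 0.0, sumsq = 0.0;
   int i;

   qsort(t, n, sizeof(double), cmpdouble);
   *median = (n % 2) ? t[n/2] : (t[n/2 - 1] + t[n/2]) / 2.0;
   for (i = 0; i < n; i++)
      sum += t[i];
   *mean = sum / n;
   for (i = 0; i < n; i++)
      sumsq += (t[i] - *mean) * (t[i] - *mean);
   *stddev = sqrt(sumsq / n);
}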

Think about how to analyze the data before designing an output format. If you plan to import the data into a spreadsheet, your output format should be spreadsheet-friendly so that you don't have to manually edit the data before analysis. You may want to output the results in multiple formats, for example, as tables without intermediate text so that the values fall into columns. Combine the numbers from all the client processes for final analysis. Standard deviation or quartiles are good indications of data variability.

You should also consider whether a table of results conveys the message better than a graph. Tables work well when the test consists of a few measurements or when some results are close together while others vary significantly. You can present more than one version if the results are meaningful.
