Algorithm: Achieving Performance Through Design Choices

All the compiler optimizations and hand-coded tuning methods in a programmer's toolbox do not matter as much to a program's performance as the proper choice of solution for a given problem. The rest of this section examines a sample network-based program and some of the design choices that can lead to a well-performing application.

Problems and Solution Possibilities

To understand some of these choices, we will describe a sample program that demonstrates several of the issues surrounding programming for optimal performance.

Programmers working on network-based solutions must deal with a number of issues, some of which we will demonstrate here by focusing on a simplified multithreaded file I/O Internet server.

We intend to demonstrate the use of threading and sockets, show reasonably high-performance programming techniques, and measure the performance of the resulting Internet page server. Note that we are spelling internet with a lowercase i here because we are not processing HTTP.

A Problem: How Fast Can We Connect, Create a File, Read It, and Disconnect?

Our program will consist of two parts: a client part and a server part. The server will initialize itself and await a command from a client. The server will have the following usage:

 server [-t nnn] [-pooling] [ip_address]:portnum 

where -t indicates the number of threads to use (the default is 1). The -pooling option instructs the server to use a pool of threads rather than create and destroy a thread for each incoming work item.

The client has the following usage:

 c [-t nnn] -filesize nnn -blksize mmm -qsize jjj [ip_address]:portnum 

Here, we can instruct the client to use nnn threads and create files of size nnn bytes using a block size of mmm bytes. Finally, we instruct the client to do this jjj times. Both client and server use the TCP/IP address [ip_address]:portnum. If the ip_address is unspecified, the program (server or c) uses the local machine address.

Thus, the following command starts the server program, allows the use of up to eight worker threads, and listens on port 4001:

 server -t 8 :4001 

The command

 c -t 4 -qsize 1000 -filesize 2m -blksize 8192 :4001 

connects to the server located at port 4001 on the local machine to perform 1,000 operations, each consisting of a file create, writing 2MB (2*1024*1024 bytes), reading the 2MB back (8192 bytes at a time), sending the 2MB (8K at a time) over the TCP/IP connection, and, finally, closing the file and deleting it. By specifying a complete TCP address as follows:

 c -t 4 -qsize 1000 -filesize 2m -blksize 8192 10.0.0.4:4001 

the client program can be run from any computer that can communicate via TCP/IP with the server program.

The server and client could be written in most of the languages listed earlier and others not listed. However, rather than digress into the reasons why one language is or is not appropriate, let me just say that the C and C++ languages give access to the system entry APIs in Linux (the system calls). These "system call" APIs provide operating system services used by high-performance programs. We will direct our attention to some of the system call primitives documented in section 2 of the manual.

The Program

Our program's responsibility will be to accept the command-line parameterization, set up to accept socket connections and distribute the resulting requests across the available threads. Further, we will design the server portion and the client portion to simply be parts of the same source code. When compiled, if the resulting executable is named server, it performs the server actions; otherwise, it performs client actions.

The server portion requires several threads. The first thread initializes all other threads and then listens for incoming requests and queues them. The second thread schedules work found in the queue. The third thread cleans up threads that terminate.

Pseudocode for the first server thread would look like the code shown here:

Server Main
  Parse options
  Create listening socket
  Start cleanup thread
  Start thread scheduler thread
  Forever {
     Accept new socket
     If quit break
     Queue new socket
  }
  exit

Here, the actual socket file descriptor is the object that is queued. Two other threads must be described. The cleanup thread has pseudocode, as shown here:

Thread Cleanup
  While threadactive {
     Locate active thread
     "wait" for thread   // cleans up
  }
  exit

The thread scheduler has pseudocode, as shown here:

Thread scheduler
  While qsize > 0 {
     Locate inactive thread
     Start inactive thread with work item
  }
  exit

The thread scheduler sets a thread to work on the queued socket. Surrounding the basic work to be done is a simple protocol that accepts a text command from the client (over the socket), parses the command, executes the actual work, and finally sends a "DONE" message back to the client. The code that wraps around the actual worker thread is shown here:

Thread wrapper
  Accept work assignment from socket
  Parse text command
  Mark myself as "active"
  Do real work
  Mark myself as "inactive"
  Reply to socket with a "DONE" message

We have described the server-side pseudocode that supports an almost arbitrary threaded work item. Before going into the details of what is involved in using Linux and gcc to create a program, let me also describe the client that issues the "commands" to this server and produces the timed results.

The client must be able to cause all this work to happen and report the timing of the results. If the server is multithreaded, then on a multiprocessor the results for a given number of threads should improve as we add processors. Our server program will demonstrate that by scaling as we add processors. The client has the following command-line behavior:

 client -t 8 -qsize 1000 -filesize 2m -blksize 8192 :4001 

The command-line options have the same meaning as for the server. When the client receives the last "DONE" message from a server thread, it prints a summary timing of the entire operation. The client pseudocode is shown here:

Client
  Parse command-line options
  Put all work into a queue
  Start each thread; each will pull something off the queue
    until the queue is empty.

We will demonstrate some of the performance choices we must make to write a program on Linux. From this brief description, we will go into more detail on the programming aspects and how to measure the resulting performance. The client thread has the pseudocode shown here:

Client Thread
  While there is work to be done
    Get work item off the work queue
    Send text string describing work to server
    Read and check data from server until done
    Wait for "DONE"
    Send 'Q'
    Close socket

The client performs all the timing and reports the results.
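As a rough sketch of how that timing might be wrapped around the whole run, the fragment below uses the tstart()/tend()/tval() routines and the TVAL/INT64 shortcuts described later in this section. The helper run_client_threads(), the qsize and filesize variables, and the byte accounting are illustrative assumptions, not the book's exact code:

 // Hypothetical wrapper around the whole client run; assumes the timing
 // routines (tstart/tend/tval), the TVAL/INT64 shortcuts, and the qsize
 // and filesize values parsed from the command line.
 void report_client_timing(void)
 {
     TVAL ts, te;
     double secs;
     INT64 totalbytes = (INT64)qsize * (INT64)filesize;

     tstart(&ts);
     run_client_threads();   // hypothetical helper: start -t threads, wait for them
     tend(&te);

     secs = tval(&ts, &te);
     printf("%d operations in %.2f seconds (%.2f MB/sec)\n",
            qsize, secs, ((double)totalbytes / (1024.0 * 1024.0)) / secs);
 }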

Designing the Code

The choice of C or C++ rather than other languages is due to the personal experiences of the author.

Others who have experience in other languages might use FORTRAN or Python. However, more than the individual experiences of one person should be considered. Most programs are part of a product and require multiple people to write. Programming language selection should be based on the experiences of the participants in the programming team, the supportability of the language, and the debugging tools available. C and C++ are safe choices. On Linux, the C compiler that comes with Linux is gcc, the GNU Compiler Collection.

We will go through the various issues one by one and include snapshots of code. The complete program is included in the appendix.

The Server

Our program is designed as a single source file. When invoked, it asks whether it is named server. If it is, it performs the server functions. If it is not named server, it behaves like a client. The point of including the entire source in a single program is to simplify the building of the program and copying it to other systems. It is not necessarily the best choice. The question is asked as shown here:

 char *p;
 if(strrchr(av[0], '/'))
     p = strrchr(av[0], '/') + 1;   // av[0] is the program name; skip past the last '/'
 else
     p = av[0];
 if(equal(p, "server"))
     ServerMain();
 else
     ClientMain();

This coding style is a convenience. The example brings us to another simplification that really represents personalization of code. Many programmers have idiosyncratic mechanisms that are included in every program they write. There are also reasons to define shortcuts of common names for things that are different on different platforms. In multiple-person developments, the shortcuts can interfere with rapid understanding of the code if they become too plentiful. This shortcut list is reasonably small, but to have a group of people embrace it would require a discussion and consensus. Following are the shortcuts we used:

 #   define SLASHC       '/'
 #   define SLASHSTR     "/"
 #   define SOCKTYPE     int
 #   define SLEEP(x)     sleep(x)
 #   define Errno        errno
 #   define BADSOCK      -1
 #   define LCK          pthread_mutex_t
 #   define SEMA_T       sem_t       // (man sem_init)
 #   define YIELD        Yield()
 #   define SOCKET       int
 #   define SOCKERR      -1
 #   define EXITTHREAD() return
 #   define INT64        long long
 #   define UINT64       unsigned long long
     typedef pthread_t   THREAD_T;
 #   define TVAL         struct timeval
 #   define equal        !strcmp
 #   define equaln       !strncmp

It can be noted that converting these shortcuts to other platforms that support C/C++ is trivial.

Timing

The tstart(), tend(), and tval() routines provide a mechanism for recording time in the microsecond range. These routines are implemented by using the gettimeofday() routine on Linux. The documentation for gettimeofday() is accessed with the following command:

 man gettimeofday 

Our version here is reentrant. tstart() and tend() record their values in a location specified by an input pointer. The design supports reentrancy, which is important when designing a multiple-thread application.

One issue with timing routines is the possibility that during a measurement session, someone or some program changes the system time. If that happens, you could get results that are not repeatable. For that reason, we make two recommendations:

  1. Turn off all NTP (Network Time Protocol) servers. Using netstat -a, you can determine the list of open ports on the local machine. Make sure that port 123 (the NTP port) is not in use. If it is, stop the service that is using it. If no NTP server is running, there is little likelihood that anything will change the system time.

  2. Run performance measurement tests multiple times to ensure that the results are repeatable.

With these two guidelines, it is unlikely that you will get into a lot of trouble.
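Another option, not used in the book's code but worth knowing, is to time with a clock that is immune to changes in the wall-clock time. A minimal sketch using clock_gettime() with CLOCK_MONOTONIC follows (on older systems you may need to link with -lrt):

 #include <stdio.h>
 #include <time.h>

 // Returns elapsed seconds for fn() using a monotonic clock, which is
 // unaffected by settimeofday() jumps made by an administrator or NTP.
 double time_monotonic(void (*fn)(void))
 {
     struct timespec t1, t2;
     clock_gettime(CLOCK_MONOTONIC, &t1);
     fn();
     clock_gettime(CLOCK_MONOTONIC, &t2);
     return (double)(t2.tv_sec - t1.tv_sec) +
            (double)(t2.tv_nsec - t1.tv_nsec) / 1e9;
 }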

The timing routines work like a stopwatch. To time something, the following sequence is used:

 TVAL ts, te;
 double t;
 tstart(&ts);
 do_something();
 tend(&te);
 t = tval(&ts, &te);
 printf("do_something() took %8.5f seconds.\n", t);

The code for the timing routines looks like this:

 void tstart(struct timeval *t)
 {
     gettimeofday(t, NULL);
 }

 void tend(struct timeval *t)
 {
     gettimeofday(t, NULL);
 }

 double tval(struct timeval *tv1, struct timeval *tv2)
 {
     double t1, t2;
     t1 = (double)tv1->tv_sec + (double)tv1->tv_usec/(1000*1000);
     t2 = (double)tv2->tv_sec + (double)tv2->tv_usec/(1000*1000);
     return t2 - t1;
 }

Using the gettimeofday() system call allows resolution to one millionth of a second on an Intel-based Linux machine.

Sockets

Linux supports Berkeley sockets. Our program takes an input command-line parameter and converts it to an IP address and port number, suitable for use with the sockets library. For connection (client) or listening (server), both an IP address and a port number are required. The TcpParse(), ipaddress(), and portnum() routines convert an ASCII string to an IP address or port number. All three require reasonably precise input. If any encounters an error, it prints an error message and causes the program to exit.
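The actual routines are in the appendix. As an illustration only, a conversion along these lines can be built from the standard inet_addr() and htons() calls; the parse_addr() helper below is a hypothetical stand-in for the book's TcpParse()/ipaddress()/portnum(), not their actual code:

 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <arpa/inet.h>
 #include <netinet/in.h>

 // Parse "[ip_address]:portnum" into a network-byte-order address and port.
 void parse_addr(const char *arg, in_addr_t *naddr, in_port_t *port)
 {
     const char *colon = strrchr(arg, ':');
     if (colon == NULL) {
         printf("bad address <%s>: expected [ip]:port\n", arg);
         exit(1);
     }
     *port = htons((unsigned short)atoi(colon + 1));
     if (colon == arg) {                     // ":4001" -> no address given
         // One interpretation of "use the local machine": a server binds
         // to INADDR_ANY; a client would use the loopback address instead.
         *naddr = htonl(INADDR_ANY);
     } else {
         char buf[64];
         size_t n = (size_t)(colon - arg);
         if (n >= sizeof(buf)) n = sizeof(buf) - 1;
         memcpy(buf, arg, n);
         buf[n] = '\0';
         *naddr = inet_addr(buf);
         if (*naddr == (in_addr_t)-1) {
             printf("bad IP address <%s>\n", buf);
             exit(1);
         }
     }
 }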

Our discussion won't be a tutorial on sockets. We assume you are familiar with the basics of programming sockets using the AF_INET family of protocols. We also assume our interests here are in stream-oriented sockets as opposed to datagrams (TCP versus UDP).

There are two ways to use sockets. The first is to listen for incoming connections. The second is to initiate an outgoing connection. On Linux, sockets can be used between two threads on the same machine or can be directed to a program on another machine; the distinction is only at the command-line level, where the IP address and port number of the server software are located. For our testing and demonstration purposes, we confine ourselves to a single machine. When the program is fully described, we make test runs between two different machines.

The server portion of our program creates a socket and listens for connections. It does this in a listen and accept thread, whose only responsibility is to start new thread work. The server code creates a socket, performs a listen() on it, and then calls accept() to take new connections. Each accept() creates a new socket that is handed off to a thread to perform work. The newly created socket is a bidirectional communication channel (a full socket). The following shows the code for the listen and accept routine:

 void listen_and_accept(SOCKTYPE *sock)
 {
     int rc;
     SOCKTYPE sock3;
     static SOCKTYPE sock1;
     static struct sockaddr_in addr1;
     struct sockaddr_in addr2;
     int addr2len;
     static int first = 1;

     if(first) {
         addr1.sin_family      = AF_INET;
         addr1.sin_addr.s_addr = naddr;   // global
         addr1.sin_port        = port;    // global
         sock1 = socket(AF_INET, SOCK_STREAM, 0);
         if(sock1 == BADSOCK) {
             printf("socket FAILED: err=%d\n", Errno);
             exit(1);
         }
         rc = bind(sock1, (const struct sockaddr *)&addr1, sizeof(addr1));
         if(rc == SOCKERR) {
             printf("bind FAILED: err=%d\n", Errno);
             exit(1);
         }
         rc = listen(sock1, 5);
         if(rc) {
             printf("Listen FAILED: err=%d\n", Errno);
             exit(1);
         }
         first = 0;
     }
     addr2len = sizeof(addr2);
     sock3 = accept(sock1, (struct sockaddr *)&addr2, (socklen_t *)&addr2len);
     if(sock3 == BADSOCK) {
         printf("Accept FAILED: err=%d\n", Errno);
         exit(1);
     }
     *sock = sock3;
 }

In this code segment, we have decided that all unexpected results should cause program termination. Without a detailed analysis of why each error might occur, this approach makes for a predictable program, even in the presence of errors. Continuing after any of the preceding errors is difficult; passing a failure code back to the calling routine simply compounds the problem by forcing the caller to discover what went wrong. By printing a message and exiting immediately, an unexpected result focuses programming effort precisely where the problem occurred. We have found over the years that correctness and debuggability are more important than performance, and our code tries to reflect this point of view.
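This print-and-exit pattern recurs throughout the code. If you prefer not to repeat it, a small helper macro can centralize it; the DIE() macro below is our own sketch, not part of the book's source:

 #include <stdio.h>
 #include <stdlib.h>
 #include <errno.h>

 // Print where the failure happened and the current errno, then exit.
 // Usage:  if(rc == SOCKERR) DIE("bind");
 #define DIE(what)                                        \
     do {                                                 \
         printf("%s FAILED: file=%s line=%d err=%d\n",    \
                (what), __FILE__, __LINE__, errno);       \
         exit(1);                                         \
     } while(0)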

Servers listen and accept while clients connect. The following shows another wrapper program to perform the details of establishing a connection:

 extern int econnrefuseretries;

 SOCKTYPE clientconnect(int *per_client_refusecnt)
 {
     SOCKTYPE sock2;
     struct sockaddr_in addr1;
     int refusedcount;

     addr1.sin_family      = AF_INET;
     addr1.sin_addr.s_addr = naddr;
     addr1.sin_port        = port;
     sock2 = socket(AF_INET, SOCK_STREAM, 0);
     if(sock2 == BADSOCK) {
         printf("socket FAILED: err=%d\n", Errno);
         exit(1);
     }
     refusedcount = 0;
     while(connect(sock2, (struct sockaddr *)&addr1, sizeof(addr1))) {
         int err;
         err = Errno;
         if(err != ECONNREFUSED) {
             printf("connect FAILED: err=%d\n", Errno);
             exit(1);
         }
         SLEEP(2); // Be polite
         (*per_client_refusecnt)++;   // increment the count, not the pointer
         if(refusedcount++ >= econnrefuseretries) {
             printf("connect FAILED: ");
             printf("after %d ECONNREFUSED attempts\n", econnrefuseretries);
             exit(1);
         }
     }
     return sock2;
 }

These two code segments bear some discussion. The listen_and_accept() routine is written to be used as simply as possible; it either returns a usable socket or prints an error and exits. The clientconnect() routine either returns a usable socket or prints an error and exits. It processes the ECONNREFUSED error return in an attempt to deal with a server that is too busy to accept the socket request. That happens to any listen and accept loop when the queue of requests in the operating system is longer than the default of 5. For instance, if a server issues a listen_and_accept(), receives an incoming socket, and then simply goes to sleep, succeeding incoming requests are queued by the operating system until there are five of them. The sixth connection request is refused, and the ECONNREFUSED error is reported by the client issuing a connect request (clientconnect() in our case). Our code simply sleeps for 2 seconds and tries again. We have allowed, by default, four retries. The number can be changed by editing the program and recompiling. Alternatively, it would be trivial to add a new option that allows the value to be set on the command line.

We have described the actions required to create socket connections. The details of socket connections are seldom the source of performance problems. It is what happens after the socket is created that becomes interesting.

Threads

Because our program uses threads and demonstrates thread pooling, we must describe how threads work and our usage of them. We don't claim that any of the following code is the best-performing code. We do claim that the code demonstrates how threads can be used in both a pooled and nonpooled manner, and that the code is reasonably good. No doubt, improvements are possible.

Recall our earlier description of the server main program, where a scheduler thread and a cleanup thread are created; we will now show these two modules. There is a third module we haven't yet mentioned: the thread that listens for new work to queue. The work is the newly created socket, and it is the socket that is queued. Our worker program reads a text string from the socket containing the file size and block size to use in doing the work. Further experimentation could replace our worker thread with one written for almost any purpose. To summarize, the following takes place:

  1. The Listener listens for new connections and inserts work (the newly created socket) into the work queue.

  2. The thread scheduler waits for work in the queue and starts a thread, passing the socket file descriptor to it. The scheduler can use pooled threads (the -pooling command-line option) or create a new thread for each work item (the default).

  3. The work thread does all the work and finally closes the socket.

  4. The thread cleanup awaits thread deaths and cleans up after them.

Surrounding the worker thread are the fixtures to report back to the client the completion of the task.

The thread listener is coded as shown here:

 void ServerMain()
 {
     SOCKTYPE sock;

     initq(&workq);
     newThread((void(*)(void *))threadScheduler, 0, &schedulerT);
     newThread((void(*)(void *))threadCleanup,   0, &cleanupT);
     newThread((void(*)(void *))threadDbg,       0, &dbgT);
     //
     // Server waits in a listen/accept sequence and hands off
     // the request to a thread. Threads are either dynamically
     // created or there is a pool.
     //
     // listen_and_accept creates sockets
     //
     for(;;) {
         listen_and_accept(&sock);
         enqueue(&workq, sock);
         vsema(&workq.sema);
     }
 }

The thread scheduler reads information from a queue. The queue is protected by synchronization primitives, which we discuss later in the chapter. The protections allow multiple threads to update the queue without making the queue metadata inconsistent. The thread scheduler is coded as shown here:

 void threadScheduler()
 {
     SOCKTYPE sock;
     int rc;
     //
     // threadScheduler decrements availableThreads,
     // threadCleanup   increments availableThreads.
     //
     for(;;) {
         schedulerstate = 1;
         psema(&workq.sema);
         schedulerstate = 2;
         psema(&availableThreads);
         schedulerstate = 3;
         if(dequeue(&workq, (int *)&sock)) {
             schedulerstate = 4;
             rc = threadStart(sock); // starting a thread
             if(rc == -1) {
                 printf("threadStart FAILED: maxthreads=%d\n", maxthreads);
                 fflush(stdout);
                 exit(1);
             }
         }
         else {
             printf("Workq Sema count wrong\n");
             exit(1);
         }
     }
 }

Thread cleanup is coded as shown here:

 void threadCleanup()
 {
     int j = 0;
     //
     // For pooled threads, this loop will last forever.
     //
     for(;;) {
         cleanupstate = 1;
         psema(&ActiveT);
         cleanupstate = 2;
         for(j = 0; j < maxthreads; j++) {
             if(threads[j].exists) {
                 threadWait(&threads[j].thrd);
                 threads[j].exists = 0;
                 threads[j].active = 0;
                 nexists--;
                 vsema(&availableThreads);
             }
         }
     }
 }

These three modules require some background discussion. The listener module simply accepts new sockets and queues them. It is simple, and the only thing to think about is that it must queue the work consistently; therefore, it uses the enqueue() routine. The thread scheduler must create threads or allocate existing (pooled) threads. Threads are created with a pthread_create() call. Because our program wants to start a thread that is either already created (a pooled thread) or create a fresh one, we have chosen to abstract the thread-starting process with the threadStart() routine. threadStart() uses the pooling global variable to determine whether to use a pooled thread or to simply start a new thread. The pooling global defaults to no pooling and is set by a command-line option.

The command line tells the server and client how many threads to use. The default is one. The server runs no more than the number of threads specified on the command line at one time. This works in the same way for the client. Each server thread performs its work and ends. A pooled thread ends by marking itself as inactive and blocking on a lock. An unpooled thread simply returns, thus destroying the thread. For nonpooled threads, the cleanup routine is essential if the system is not to run out of resources. It is the cleanup routine that returns the resources used by nonpooled threads to the system.
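The full threadStart() is in the appendix. To make the pooled/nonpooled distinction concrete, here is a simplified sketch of what such a routine might look like. The threads[] bookkeeping fields follow the fragments above, but the wakeup semaphore, the sock field, and the workerThread() name are our own illustration, not the book's code:

 // Simplified sketch: hand a socket to a worker thread.
 // Returns -1 if no thread slot is available.
 int threadStart(SOCKTYPE sock)
 {
     int j;
     for(j = 0; j < maxthreads; j++) {
         if(!threads[j].active) {
             threads[j].active = 1;
             threads[j].sock   = sock;      // assumed per-thread work field
             if(pooling && threads[j].exists) {
                 // Pooled: the thread already exists and is blocked on its
                 // private semaphore; wake it with the new work item.
                 vsema(&threads[j].wakeup);
             }
             else {
                 // Nonpooled: create a fresh thread for this work item.
                 threads[j].exists = 1;
                 nexists++;
                 newThread((void(*)(void *))workerThread,
                           (void *)(long)j, &threads[j].thrd);
             }
             return 0;
         }
     }
     return -1;
 }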

We add one further observation about our thread code. Writing threaded code that schedules work using pooled threads is not exactly trivial. Determining why things are not working properly can be time-consuming. We demonstrate here one method of making the debugging task simpler. Our thread management routines (scheduler and cleanup) each have a global state variable. The state variables represent the current state of the routine and are accessible from all threads in the program. The variables are changed just prior to any function call that could block. By printing these two state variables, we can determine exactly where in the code the scheduler and cleanup threads are currently blocked. We wrote a trivial debug thread that prints the values of these state variables, and it significantly accelerated the debugging task.
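The debug thread itself can be as simple as a loop that periodically dumps the state variables. A sketch along those lines, using the names from the code above (the two-second interval and output format are our own choices):

 // Periodically print where the scheduler and cleanup threads are blocked.
 void threadDbg()
 {
     for(;;) {
         SLEEP(2);
         printf("dbg: schedulerstate=%d cleanupstate=%d nexists=%d\n",
                schedulerstate, cleanupstate, nexists);
         fflush(stdout);
     }
 }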

Synchronization

The scheduler uses synchronization primitives (semaphores) to block when there is nothing to do. Semaphores are used as a synchronized counting mechanism: if there are things to do, the number of things to do is reflected in the value of the semaphore. If the count is greater than zero, the psema() routine decrements the count by one and returns to the calling program. If the value of the semaphore is zero, the psema() routine blocks. The vsema() operation increments the semaphore and never blocks. Our psema() and vsema() routines are based on the Posix semaphore interfaces documented in section 3 of the manual (man sem_init). Our interfaces take a name that enables us to debug them with more clarity; we can print the name of the semaphore we are examining by inserting appropriate print statements. We use one semaphore to count the number of available threads (either pooled or nonpooled), one to count the number of tasks (jobs, work items, or whatever we want to call them), and one to count the number of active threads.

The semaphore interfaces are defined as shown here:

 void initsema(PVSEMA *s, int initvalue, int hi, char *name)
 {
     if(sem_init(&s->pv, 0, initvalue) == -1) {
         printf("sem_init() FAILED: sema=<%s> err=%d\n", name, Errno);
         exit(1);
     }
     s->name = name;
 }

 void psema(PVSEMA *s)    // decrements, or blocks if the count is 0
 {
     sem_wait(&s->pv);
 }

 void vsema(PVSEMA *s)    // increments
 {
     int rc;
     rc = sem_post(&s->pv);
     if(rc == -1) {
         printf("sem_post(<%s>) FAILED: err=%d\n", s->name, Errno);
         exit(1);
     }
 }

Finally, there are the queuing operations. Their job is to protect the data structures describing the queue of jobs coming in. The queue is operated with enqueue() and dequeue() operations. To support the queuing and dequeuing of data, lck and unlck implement locks on specified objects (the work queue in this case). We will show the enqueue routine and the associated dequeue routine.

You might ask why there are two counting mechanisms. The semaphore counting mechanism is specifically for counting. Because the numbers of threads (active and available) are strictly numbers, the only mechanism we need is counting. The queuing routines were written for more general queuing where objects could be queued. In this case, they require more than a simple count. For our particular demonstration here, they have been detuned to queue only integers (socket file descriptors); thus, they are similar in function to the semaphore operations.

That said, the following shows the queue initialization and the enqueuing routines:

 void initq(queue_t *q)
 {
     initsema(&q->sema, 0, workqsize, qsema);
     q->val = (int *)Malloc(workqsize*sizeof(int));
     memset(q->val, '\0', workqsize*sizeof(int));
     q->qmax = workqsize;
     initlck(&q->lck, workqname);
     q->head = q->tail = 0;
 }

 //
 // This version stops queuing when it bumps
 // into its own tail.
 //
 int enqueue(queue_t *q, int val)
 {
     int ret = 1;
     int h;

     lck(&q->lck);
     h = q->head + 1;
     if(h == q->qmax)
         h = 0;
     if(h != q->tail) {
         q->val[h] = val;
         q->head = h;
         q->qcnt++;
     }
     else
         ret = 0;
     unlck(&q->lck);
     return ret;
 }

The dequeue routine is similar.
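For completeness, a dequeue() that mirrors the enqueue() above might look like the following; this is a sketch consistent with the ring-buffer fields shown, not necessarily the appendix version:

 // Remove the oldest entry; returns 0 if the queue is empty.
 int dequeue(queue_t *q, int *val)
 {
     int ret = 1;
     int t;

     lck(&q->lck);
     if(q->tail == q->head)       // nothing queued
         ret = 0;
     else {
         t = q->tail + 1;
         if(t == q->qmax)
             t = 0;
         *val = q->val[t];
         q->tail = t;
         q->qcnt--;
     }
     unlck(&q->lck);
     return ret;
 }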

The locking (lck and unlck) routines are built from Posix thread mutexes. Mutexes are efficient thread synchronization primitives defined in the Posix thread library. They support mutual exclusion between threads, but not between processes. The Linux Posix thread library supports three kinds of mutex:

  • PTHREAD_MUTEX_FAST_NP

  • PTHREAD_MUTEX_RECURSIVE_NP

  • PTHREAD_MUTEX_ERRORCHECK_NP

The FAST variant does not allow reentry into the mutex by the calling thread. Thus, if thread A locks using a pthread_mutex_lock and it attempts to lock the same mutex a second time, it blocks. Depending on the program's design, this might produce a deadlock. The RECURSIVE kind allows multiple pthread_mutex_lock() calls within a thread. However, for each pthread_mutex_lock() call, there must be a corresponding pthread_mutex_unlock() call. Finally, the ERRORCHECK kind returns an error if thread A attempts to lock a mutex more than once.

Our lock and unlock calls use the FAST kind because our design is such that if a thread calls a mutex lock more than once, it is an error in our logic. Debugging such a design is more easily accomplished if the ERRORCHECK kind is substituted during development. The following shows the lock and unlock primitives:

 void initlck(LCK *l, char *name)
 {
     //
     // The Linux default is a "fast" mutex. A "fast" mutex
     // blocks if the same thread tries to lock it twice.
     //
     pthread_mutex_init(l, NULL);
 }

 void lck(LCK *l)
 {
     int err;
     err = pthread_mutex_lock(l);
     if(err != 0) {
         printf("pthread_mutex_lck FAILED: err=%d\n", err);
         exit(1);
     }
 }

 void unlck(LCK *l)
 {
     pthread_mutex_unlock(l);
 }

 int islocked(LCK *l)
 {
     // trylock returns EBUSY if the mutex is already held;
     // if we acquired it here, release it before reporting "not locked".
     if(pthread_mutex_trylock(l) == EBUSY)
         return 1;
     pthread_mutex_unlock(l);
     return 0;
 }
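As noted above, debugging is easier if the ERRORCHECK kind is substituted during development. Switching kinds requires initializing the mutex with an attribute object rather than the defaults. A sketch of an initlck() variant that could be dropped in during development is shown here; it uses the LinuxThreads PTHREAD_MUTEX_ERRORCHECK_NP constant named earlier (newer, portable code would use PTHREAD_MUTEX_ERRORCHECK):

 #include <pthread.h>

 void initlck_errorcheck(LCK *l, char *name)
 {
     pthread_mutexattr_t attr;

     pthread_mutexattr_init(&attr);
     // Relocking an already-held mutex now returns EDEADLK
     // instead of silently blocking the calling thread.
     pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK_NP);
     pthread_mutex_init(l, &attr);
     pthread_mutexattr_destroy(&attr);
 }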

Of these primitives, only the mutexes from the thread library are based on the mutual exclusion instructions of the native processor. They should be significantly faster than interprocess synchronization primitives, because frequently no system call is required. Contrast that with a semaphore operation, which must maintain a counter that is visible to all processes; to increment or decrement that counter, the interface must issue a system call, necessitating a transition into the operating system proper. For this reason, mutexes are generally recognized as delivering higher performance, or, to put it another way, as taking less time to execute.

Using the timing primitives described previously, it is trivial to write a program that executes millions of calls to each interface and prints the time it takes to do so. One of the authors has done this and published the results for Red Hat 7.0. The results are reproduced here:

Interface            Linux 2.4.2 (microseconds per call)
SVR5 semaphores      1.828
Posix semaphores     0.487
Pthread mutex        0.262

This measurement was made on a ThinkPad 600X (650MHz, 512MB memory). (SVR5 semaphores are an older variety of semaphore and, as shown in the table, a bit slower.) Documentation for SVR5 semaphores can be seen with the man semop command. Documentation for the Posix semaphores can be found with man sem_init, and documentation for the Posix thread mutexes can be seen using the man pthread_mutex_init command.
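A sketch of such a micro-benchmark for the mutex case is shown below. It assumes the TVAL shortcut and the tstart()/tend()/tval() routines from earlier are available; the iteration count and output format are our own choices, and it times an uncontended lock/unlock pair rather than a single call:

 #include <stdio.h>
 #include <pthread.h>

 #define ITERATIONS (10*1000*1000)

 // Time uncontended lock/unlock pairs and report microseconds per pair.
 int main(void)
 {
     pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
     TVAL ts, te;
     double secs;
     int i;

     tstart(&ts);
     for(i = 0; i < ITERATIONS; i++) {
         pthread_mutex_lock(&m);
         pthread_mutex_unlock(&m);
     }
     tend(&te);
     secs = tval(&ts, &te);
     printf("pthread_mutex: %.3f usec per lock/unlock pair\n",
            (secs * 1000.0 * 1000.0) / ITERATIONS);
     return 0;
 }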

File I/O

Our file I/O is quite simple. We want to create a file of arbitrary length, write data into it, read it back, and check the data. As we read the data from the file, it is sent over the internet to the client who requested it. We invent data on the fly. Basically, we write blocks of data into a file until it reaches the requisite size. We also write the page number or the block number into each block. Thus, each block is self-identifying to some extent. The following shows the code within the worker thread that creates, writes, reads, closes, and removes the file:

 int fd;

 fd = open(namebuf, O_RDWR|O_CREAT, S_IRWXU);
 if(fd == -1) {
     printf("open <%s> FAILED: err=%d\n", namebuf, Errno);
     exit(1);
 }

 //writeFile();
 pageno = 0;
 bytesleft = fsz;
 while(bytesleft > 0) {
     cnt = (bytesleft < bsz) ? bytesleft : bsz;
     if(cnt > 4)
         memcpy(buf, &pageno, sizeof(pageno));
     if(write(fd, buf, cnt) != cnt) {
         printf("write %d bytes FAILED: err=%d\n", cnt, Errno);
         exit(1);
     }
     pageno++;
     bytesleft -= cnt;
 }

 //readFile();
 lseek(fd, 0L, SEEK_SET);
 bytesleft = fsz;
 rpageno = 0;
 while(bytesleft > 0) {
     cnt = (bytesleft < bsz) ? bytesleft : bsz;
     if(read(fd, buf, cnt) != cnt) {
         printf("read %d bytes FAILED: err=%d\n", cnt, Errno);
         exit(1);
     }
     if(cnt > 4) {
         if(0 != memcmp(buf, &rpageno, sizeof(rpageno))) {
             printf("Read Compare ERROR: rpageno=%d", rpageno);
             printf(" buf[0] = %x %x %x %x\n",
                 buf[0]&0xFF,
                 buf[1]&0xFF,
                 buf[2]&0xFF,
                 buf[3]&0xFF);
             exit(1);
         }
     }
     rc = Send(sock, buf, cnt, 0);
     if(rc != cnt) {
         printf("th[%d]: SERVER: Send FAILED: rc=%d err=%d\n",
             th, rc, Errno);
         exit(1);
     }
     rpageno++;
     bytesleft -= cnt;
 }

 //closeFile();
 if(close(fd) == -1) {
     printf("close FAILED: err=%d\n", Errno);
     exit(1);
 }

 //deleteFile();
 if(unlink(namebuf) == -1) {
     printf("unlink <%s> FAILED: err=%d\n", namebuf, Errno);
     exit(1);
 }

As you can see from this example, the code is straightforward. After writing all the data to the file, the program uses lseek() to return to the beginning of the file, where it begins reading the file. Each block of the file is checked for correctness (trivial check, admittedly) and then is transmitted to the client program (the client program also checks the data). There is nothing complex about this code, other than the fact that the file size and the block size can be parameterized.

The Client

The client portion of the code uses some of the facilities previously discussed. Whether or not the server configures itself to use pooled threads, the client simply starts the number of threads specified on the command line and waits until all the work is completed. Each client thread runs in a loop: it takes a single item off the work queue, sends it to the server, receives all the data the server transmits back, checks the data as it arrives, and finally closes the socket. The client thread then gets another item off the work queue, starting the same process all over again. Each client thread continues these operations until the work queue is empty. As each client thread finds the queue empty, it exits.

This particular design methodology represents an asynchronous approach to problem solving. A thread is dedicated to each work item. If any particular work item takes a longer amount of time, the remaining threads continue emptying the queue. Using this methodology, computing fractal pictures could easily be optimized where some pixels take millions of iterations to complete and some take less than 100. The difficult pixels would occupy a thread, while many trivial pixels (fractal points with a small number of iterations) would be completed by the remaining threads.

The client code loop looks like this:

 while(workqcnt > 0) {
     lck(&workqcntL);
     if(workqcnt == 0) {
         unlck(&workqcntL);
         break;
     }
     workqcnt--;
     unlck(&workqcntL);
     sock = clientconnect(&tp->refusecnt);
     rpageno = 0;

     // Send command to server.
     sprintf(tp->cbuf, "F,%d,%d", filesize, fileblksz);
     rc = Send(sock, (char *)tp->cbuf, CBUFSIZE, 0);
     if(rc != CBUFSIZE) {
         printf("\tCLIENT[%d]: CBUF Send FAILED: err=%d\n", th, Errno);
         exit(1);
     }
     tp->sndbytes += CBUFSIZE;

     // Set buffer size using thread-safe reMalloc().
     if(fileblksz > threads[th].bufmax) {
         threads[th].buf = (char *)reMalloc(threads[th].buf, fileblksz);
         threads[th].bufmax = fileblksz;
     }
     threads[th].bufsiz = fileblksz;
     buf = threads[th].buf;

     // Receive filesize bytes from Server and check contents.
     bytesleft = filesize;
     while(bytesleft > 0) {
         cnt = (bytesleft < fileblksz) ? bytesleft : fileblksz;
         rc = Recv(sock, buf, cnt, 0);
         if(rc == SOCKERR) {
             printf("CLIENT: th[%d]: Recv failed: rc=%d err=%d\n",
                 th, rc, Errno);
             exit(1);
         }
         else if(rc == 0)
             break;
         if(rc > 4) {
             if(0 != memcmp(&rpageno, buf, 4)) {
                 printf("CLIENT: compare error on pageno %d", rpageno);
                 printf(" buf[0] = %x %x %x %x\n",
                     buf[0]&0xFF,
                     buf[1]&0xFF,
                     buf[2]&0xFF,
                     buf[3]&0xFF);
                 exit(1);
             }
         }
         tp->rcvbytes += rc;
         rpageno++;
         bytesleft -= rc;
     }

     // Wait for "DONE" from Server.
     rc = Recv(sock, (char *)tp->cbuf, 4, 0);
     if(rc != 4 || !equaln(tp->cbuf, "DONE", 4)) {
         printf("\tCLIENT[%d]: Recv 'DONE' FAILED: rc=%d err=%d\n",
             th, rc, Errno);
         fflush(stdout);
         exit(1);
     }
     tp->rcvbytes += 4;

     // Send 'Q'.
     tp->cbuf[0] = 'Q';
     rc = Send(sock, (char *)tp->cbuf, 1, 0);
     if(rc != 1) {
         printf("\tCLIENT[%d]: Send 'Q' FAILED: rc=%d err=%d\n",
             th, rc, Errno);
         fflush(stdout);
         exit(1);
     }
     tp->sndbytes += 1;

     rc = CLOSESOCK(sock);
     if(rc != 0) {
         printf("\tCLIENT[%d]: close socket %d FAILED: Errno=%d\n",
             th, sock, Errno);
         exit(1);
     }
 }

Code Discussion

A number of different synchronization mechanisms have been used here. Counting semaphores are used to count available resources; a psema() call blocks when none is available. Both the thread scheduler and the cleanup thread use a semaphore to know when something needs to be done. An earlier cleanup design simply looked for active threads every 2 seconds. Although the performance difference is probably negligible, the resulting design using semaphores leaves the system completely idle when there is nothing to do. (psema, vsema, and initsema are based on Posix semaphores.)

Locking primitives are used to count the number of work items in the client. They are based on a memory variable and a critical-section lock. This design evolved from the queuing primitives described in the next paragraph: the desire to queue millions of items suggested that each item should take no memory, so this interface decrements a counter as the mechanism for dequeuing an object. (initlck, lck, and unlck are based on Posix thread mutexes.)

Finally, the third kind of synchronization is used to queue objects. A Posix pthread mutex guards an actual memory queue, each element of which can contain a single integer. The integer in our case is a socket file descriptor received from the listen_and_accept() routine in the server's main thread. (initq, enqueue, and dequeue are based on the previously described locking primitives, which in turn are based on Posix pthread mutexes.)

Our design is asynchronous. The server consumes no CPU cycles if there is nothing to do. The client either has something to do or it exits. The client's responsibility is to pass all the work to be done to the server, wait until the server completes all the work, and finally print timing and performance results.

Compilation Options

After we have written our program, we want to compile it and run it. Our program is called srv3.cpp. To compile it, the following command line is used:

 g++ -O2 -Wall srv3.cpp -lpthread -o server && cp server c 

The g++ command is used to ensure we are compiling with the strong typing of C++. This particular program uses almost no C++ features, but the strong typing of C++ is valuable. In the command line above, the -Wall option instructs the compiler to print all warnings. Demanding the strictest possible conformance to good programming standards produces code that requires less debugging and less support. One of the authors has seen software projects remove previously unknown bugs from programs simply by changing the compilation options to emit more warnings and then changing the code to remove the warnings. If a program of any size compiles and executes properly but has never endured the removal of all warnings, we challenge you to go through the effort once. If after the effort you aren't convinced that bugs were removed, we would be quite surprised.

The command line is intended to produce two executables. The first is a program called server, and the second is a program called c. The first is our server program, and the other is our client program. (We didn't name it client for fear of colliding with an existing program that might be called client.)

A useful option to the GNU C compiler is -v, which causes g++ to print all the intermediate steps it takes to produce the executables. When using the verbose option, g++ does not instruct the linker to also produce verbose output. To do that, the following addition is required:

 -Xlinker -verbose 

Thus, the following produces the most output (into a file called xx). From it you can see what the compiler and linker are doing:

 g++ -O2 -Wall -v -Xlinker -verbose srv3.cpp -lpthread -o server 2>xx 

Libraries

We compiled our program against the Posix thread dynamic link library by specifying -lpthread on the command line. We could, in principle, have used statically linked libraries; as installed, however, our system could not compile the program against static libraries. That said, why would we want to?

Static versus dynamic libraries is a question whose answer is surprising. Dynamic link libraries have the following benefits:

  • GPL independence. If your programs are linked to dynamic link libraries, they are not encumbered with GPL (GNU General Public License) provisions. The GPL requires anyone who embeds GPL code in a program to make the source code available. The current understanding is that statically linking a program is a form of embedding GPL code in it. Therefore, development teams that statically link against GPL code could be required to publish the source code of the entire program. This reason alone is generally enough to eliminate the thought of statically linking programs.

  • Dynamic linking allows bugs to be fixed independently of the program. If a bug shows up in a dynamic link library, shipping a new library can fix your program. Given the heightened awareness of security issues, updated dynamic link libraries are quite likely to appear.

  • Dynamic linking produces smaller programs. Programs dynamically linked contain only stubs of APIs needed to execute the program. Clearly, stubs are much smaller than the actual code to implement an API or even an entire suite of functions.

  • An increase in portability. The emerging Linux Standard Base specification for Linux operating systems requires applications to be dynamically linked to ensure proper use of local system services. A static version of a system library may no longer work properly on future revisions of the OS.

The reasons seem compelling, but the other side of the picture leaves the issue open to design. Here are some counterpoints:

  • GPL independence can also be achieved by simply buying the appropriate libraries to use with your product. That becomes difficult with system libraries and may not be possible. The Posix threading library is a case in point, where obtaining a GPL-free version might be exceedingly difficult.

  • Static libraries mean that when a bug is fixed, your program is unaffected. When your program uses a single API in a library containing possibly thousands of APIs, the likelihood of the library's being updated is high. Each update to the library is a risk to your program over which you have no control. Such risks are worth investigating.

  • Because the size of disk drives has increased much faster than program size, on-disk footprint is less of an issue.

The general recommendation is to use dynamic link libraries where possible, using static linking for libraries that may not be found on the destination platform and that cannot be distributed with your application.
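As an illustration of that mixed approach (not part of the book's build), GNU ld lets a program stay mostly dynamic while pulling one library in statically by bracketing it with -Wl,-Bstatic and -Wl,-Bdynamic; here -lsomelib stands for a hypothetical third-party library that might be missing on the destination system:

 g++ -O2 -Wall srv3.cpp -o server -Wl,-Bstatic -lsomelib -Wl,-Bdynamic -lpthread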
