7.3 Parameter Studies

One straightforward application of task parallelism is the "parameter study", in which the same sequential program is run multiple times with different sets of input parameters. Since the program to be run is sequential, there is no communication among the workers, and the manager can be a simple script that communicates with the workers by means of the arguments to the sequential program and its standard output. We can start the workers with ssh and collect the output by using the popen system call, which returns a file descriptor we can select on and read the remote process's stdout from.

Although both the algorithm we use and its implementation are general, we present here a concrete example. We explore the parameter space of compiler options for the default Linux C compiler gcc. The man page for gcc conveniently lists in one place all the options that can be passed to gcc to cause it to produce faster machine code. Here is an excerpt from the man page:

     Optimization Options         -fcaller-saves -fcse-follow-jumps -fcse-skip-blocks         -fdelayed-branch -felide-constructors         -fexpensive-optimizations -ffast-math -ffloat-store         -fforce-addr -fforce-mem -finline-functions         -fkeep-inline-functions -fmemoize-lookups         -fno-default-inline -fno-defer-pop         -fno-function-cse -fno-inline -fno-peephole         -fomit-frame-pointer -frerun-cse-after-loop         -fschedule-insns -fschedule-insns2         -fstrength-reduce -fthread-jumps -funroll-all-loops         -funroll-loops -0 -02 -03

For the matrix-matrix multiply program we are going to test with, which has no function calls, only some of these look useful. Here is a subset of the above list containing optimization flags that might have an effect on the speed of the program:

     -fexpensive-optimizations     -ffast-math     -ffloat-store     -fno-peephole     -fschedule-insns     -fschedule-insns2     -fstrength-reduce     -funroll-all-loops     -funroll-loops     -0     -02     -03

Since there are twelve switches that can be either present or absent, there are 2¹² possible combinations. These are not completely independent, since some switch settings imply others, especially the three -0 flags, but we will ignore thus fact for the sake of simplifying our example, and just try all 4096 combinations, Indeed, which switch settings are redundant in the presence of others should be deducible from our results!

Our plan will be to take a simple test program and compile it with all possible switch combinations and run it, reporting back the times. Since we have 4096 jobs to run, the use of a cluster will make a big difference, even if the individual tasks are short.

For our test program, we will use a straightforward matrix-matrix multiply program, shown in Figure 7.10. It multiples two 300 300 matrices, timing the calculation, this may not be the highest performing way to do this, but it will do for our purposes. The program echoes its command line arguments, which it does not otherwise use; we will use them to help us record the arguments used to compile the program.

 #include <stdio.h> #include <sys/time.h> #include <unistd.h> #define SIZE 300 main(int argc, char *argv[]) {     double a[SIZE][SIZE], b[SIZE][SIZE], c[SIZE][SIZE];     int i, j, k;     struct timeval tv;     double starttime, endtime;     for (i = 0; i < SIZE; i++)         for (j = 0; j < SIZE; j++)             a[i][j] = (double) (i + j);     for (i = 0; i < SIZE; i++)         for (j = 0; j < SIZE; j++)             b[i][j] = (double) (i + j);     for (i = 0; i < SIZE; i++)         for (j = 0; j < SIZE; j++)             c[i][j] = 0.0;     gettimeofday( &tv, ( struct timezone * ) 0 );     starttime = tv.tv_sec + (tv.tv_usec / 1000000.0);     for (i = 0; i < SIZE; i++) {         for (j = 0; j < SIZE; j++) {             for (k = 0; k < SIZE; k++) {                 c[i][j] = c[i][j] + a[i][k] * b [k][j];             }         }     }     gettimeofday( &tv, ( struct timezone * ) 0 );     endtime = tv.tv_sec + (tv.tv_usec / 1000000.0);     printf("%f seconds for", endtime - starttime);     for (i = 1; i < argc; i++)         printf(" %s", argv[i]);     printf("\n"); }

Figure 7.10: Matrix-matrix multiply program

Our worker programs will just be the sequence

     gcc <flags> -o matmult matmult.c     matmult

and the manager will start them with ssh, on hosts whose names are in a file. The other argument to our manager is a file of possible arguments. It contains exactly the twelve lines listed above. The manager just steps through the numbers from 0 up to the total number of runs (in our case 4096) treating each number as a binary number where a 1 bit represent the presence of the compiler switch corresponding to that position in the binary number. Thus we will run through all possible combinations.

The overall plan is to loop through the parameter space represented by the binary numbers represented by the binary numbers from 0 to 2¹². If there is a free host (no worker is working there) we assign it the next task; if not we select on the sockets that are open to currently working workers. When one of them reports back, we add it back to the list of free hosts. At the end, after all the work has been assigned, we still have to wait for the last tasks to complete.

Let us step through the code in Figure 7.11 in detail. First we read in the list of hosts (initial value of the list freeHosts) and the list of possible arguments (parmList). We initialize the set of sockets to select on to empty since there are no workers yet, and create an empty Python dictionary (fd2host) where we will keep track of busy hosts and the connections to them. We set numParmSets to the number of subtasks, which we can calculate from the number of possible compiler flags in the input file. Then we enter the main loop, which runs until we have assigned all the work and there are no outstanding workers working. If there is a subtask still to do and a free host to do it on, we construct the parameter list corresponding to the next task (in ParmSet), and pick the first host from the list of free hosts, temporarily removing it from the list. We then build a string containing the specification of the subtask. The Popen3 command forks a process that runs the ssh program locally, which runs the gcc-matmult sequence remotely. The ssh's, and therefore the remote processes, run in parallel.

 #!/usr/bin/python from sys    import argv from popen2 import Popen3 from select import select, error hostFile = open(argv[1]) parmsFile = open(argv[2]) freeHosts = [ line.strip() for line in hostFile.readlines() ] parmList = [ line.strip() for line in parmsFile.readlines() ] lenParmList = len(parmList) socketsToSelect = [] fd2host = {} numParmSets = 2 ** lenParmList pcnt = 0 while pcnt < numParmSets  or  socketsToSelect:     if pcnt < numParmSets  and  freeHosts:         parmSet = []         for i in range(0,lenParmList):             bit = 1 << i             if bit & pcnt:                 parmSet.append(parmList[lenParmList-i-1])         host = freeHosts[0]         freeHosts.remove(host)         cmd = ("ssh -l lusk -x -n %s 'gcc %s -o matmult matmult.c; " +                "matmult %s'") % (host,' '.join(parmSet),' '.join(parmSet))         runner = Popen3(cmd)         runfd = runner.fromchild         socketsToSelect.append(runfd)         fd2host[runfd] = host         pcnt += 1     else:         readyFDs = 0         (readyFDs,None,None) = select(socketsToSelect,[],[],30)         for fd in readyFDs:             line = fd.readline()             if line:                 print '%s on %s' % (line.strip(),fd2host[fd])             else:                 freeHosts.append(fd2host[fd])                 socketsToSelect.remove(fd)                 fd.close()

Figure 7.11: Manager for parameter study

We set runfd to the stdout of the ssh, which collects the stdout from the matmult. Each line of stdout will contain the time followed by the compiler flags used. Then we add this fd to the list of sockets available for selecting on and enter into the list fd2host the host attached to that fd.

If there are no free hosts or we have already assigned all the subtasks, then we select on the sockets connected to busy workers. When one those sockets becomes active, it means that the associated worker has set us a line of output. We read it and print it, or if the read fails (the worker exited, which sends an EOF on the socket), we close that socket, take it out of the list of sockets to select on, and add the corresponding host to the list of free hosts, since it can mow be assigned another subtask.

The manager exits once all the subtasks have been done and all the workers have completed. If we run this with

     parmstudy.py hostfile parmfile | sort -n

then we will get the best combinations at the top of the list. Try it!