Parallel Computing on Heterogeneous Networks, by Alexey Lastovetsky
ISBN 0-471-22982-2 Copyright 2003 by John Wiley & Sons, Inc.
CORBA is widely used to develop and integrate highly complex distributed applications in industries as diverse as health care, telecommunications, banking, and manufacturing. CORBA is supported on almost every combination of hardware and operating system in existence, is available from a large number of vendors, and supports a large number of programming languages. CORBA-based distributed applications fall within the client/server programming paradigm: a typical CORBA-based application server provides a number of operations that can be invoked by remote clients.
A particularly important quality of the service provided by a remote server is the execution time of its remote operations. The total execution time of a remote operation includes the time of communication between the client and the server and the time of computation on the server side. For computationally intensive operations, the greatest effect comes from accelerating the computations on the server-side hardware, which is typically a heterogeneous network of diverse computers.
There are two ways that a speedup of execution is usually achieved in remote operations on a network of computers. The first is to balance the workload of available computers. This means that the CORBA implementation starts up the server providing the requested operation on the computer that is the fastest at the moment of the receipt of the request. The second is a multithreaded implementation of the remote operation and its execution on a shared memory multiprocessor computer if the latter is available.
Yet another way to implement a remote operation efficiently is to execute it in parallel on the network of computers. Theoretically, this could provide speedup independent of the availability of a multiprocessor computer. In practice, this option is not used for two good reasons. The first is the lack of experience-based integration of distributed memory parallel computing into CORBA-based distributed applications. The second, and more important, is that the technologies and tools for parallel computing on heterogeneous networks are only in their early stages, and so are not as mature and widespread as those for homogeneous multiprocessors.
In this section we show that computationally intensive remote operations in CORBA-based distributed applications can be easily and greatly accelerated with the help of tools for parallel computing on heterogeneous clusters. IONA's Orbix is used as a particular implementation of CORBA, and the mpC language is used for parallel implementation of such remote operations.
Consider a distributed application dealing with a chain of supermarkets. Let us assume that each cash register in the supermarkets sends information about all baskets of items purchased by customers to a central information center where the information is accumulated and stored.
A single basket of items is stored in the form of a file record, so that each individual file contains data about a fixed number of baskets. Taken together, the files hold the full information about the contents of customer baskets during a certain period of time. That data store is used for extraction of diverse useful information, and any supermarket can contact the central information center for one or another piece of information.
That service is implemented by a CORBA-based application server providing a set of corresponding remote operations. In particular, a supermarket can inquire of the central information center about the optimal distribution of items over a given number of sections. A fragment of the CORBA IDL specification of the full service relevant to this particular request is as follows:
typedef short Item;
typedef sequence<Item> Basket;
typedef sequence<Item> Section;
typedef sequence<Section> Distribution;

interface central_office {
    void BasketOfItems(in Basket b);
    Distribution getDistribution(in short number_of_sections);
    void Hello();
    void Bye();
    ...
};
Say a client invokes operation BasketOfItems to add a basket to the server’s data store. Each physical item in the basket is represented by its numerical code, so that the basket is represented by a sequence of codes of purchased items.
To get an optimal distribution of items over a given number of sections, the client invokes the operation getDistribution, passing the number of sections as an input parameter. Each individual section is represented by a sequence of items belonging to the section so that each item appears in one and only one section. The operation returns a sequence of sequences of items representing the requested distribution.
To initialize a session with the remote server, the client invokes operation Hello; to finalize the session, it invokes operation Bye.
Operation getDistribution is computationally intensive and implements the following algorithm:
First it reads all the files one by one and computes a vector, S, and a matrix, P, representing the mappings S : I → N and P : I × I → N respectively, where
I is a set of all stock items,
N is a set of positive integers,
S(i) is the total number of baskets containing item i, and
P(i,j) is the total number of baskets containing both item i and item j.
Then it uses the mappings to divide set I into m nonintersecting subsets I_{0},..., I_{m-1}, where m is the total number of sections. The m most frequently bought items, i_{0},..., i_{m-1}, each head a separate section. Distribution of the remaining items over the sections can be described by the following pseudocode:
delete i_0, ..., i_{m-1} from I
for (k = 0; I is not empty; k = (k+1)%m) {
    find i in I so that P(i, i_k) == max{P(j, i_k) | j in I}
    add i to I_k
    delete i from I
}
Intuitively this algorithm tries to include in each section at least one very popular item surrounded by the items that most often accompany this popular choice in customers’ baskets. It is assumed that such a distribution will stimulate customers to buy items of secondary necessity and reduce the total shopping duration. As the data store is very large and input–output operations are relatively slow, the execution time of the algorithm is practically equal to the time of computation of the mappings S and P.
Note that a more accurate distribution could be obtained by considering not only the popularity of single items and their pairs but also that of triplets, quadruples, and so on. But such consideration leads to significantly more computation and hence much slower algorithms. Thus, although the presented algorithm is computationally intensive, it performs the minimal volume of computation necessary to solve the problem.
Both the client and server parts of the described distributed application were originally implemented in the Orbix 3 C++ programming system.
A traditional, purely serial implementation of the application server makes the operation getDistribution very slow from the client's point of view. Yet the application server normally runs on a network of computers whose total performance is quite high. Therefore a parallel implementation of the operation getDistribution, one that effectively uses all of the available performance potential, could substantially accelerate the operation.
The problem of computing the mappings S and P can be parallelized by partitioning the entire set of baskets into nonintersecting subsets. Each subset is processed independently. The resulting mappings are then obtained by a simple combination of the mappings computed for each individual subset.
Let us assume that the data store consists of a large number of files, each containing the same number of basket records. Let m be the total number of stock items. Then the algorithm of parallel computation of the mappings S and P on a heterogeneous cluster of p processors can be summarized as follows:
The entire set of n files is partitioned into p nonintersecting subsets. There is one-to-one mapping between these subsets and the processors. The number of files in the ith subset is proportional to the relative speed of the ith processor.
The ith processor computes an m-element vector of the popularity of single items, S_{i}, and an m x m matrix of the popularity of pairs of items, P_{i}, by processing its subset of the entire set of files.
Vectors S_{i} and matrices P_{i} are gathered to the host-processor.
The host-processor computes
the resulting vector of the popularity of single items, S, as the sum of the vectors S_{i}, and
the resulting matrix of the popularity of pairs of items, P, as the sum of the matrices P_{i}.
The parallel algorithm was easily implemented in mpC. As the mpC language is a strict extension of ANSI C, the corresponding parallel mpC code is obtained by a very slim modification of the original serial C code used in the Orbix C++ implementation of the application server.
The key fragments of the parallel mpC code appear as follows:
nettype ParallelDataMining(int p, int f[p]) {
    coord I=p;
    node {I>=0: f[I];};
};
...
repl p, num_of_files, *files;
repl double *speeds;
...
recon TestCode();
MPC_Get_processors_info(&p, speeds);
Partition(p, speeds, files, num_of_files);
{
    net ParallelDataMining(p, files) pdm;
    ...
}
In this code the network type definition describes a performance model of the implemented parallel algorithm. It introduces the following components:
The name ParallelDataMining of the network type.
A list of parameters, including the integer scalar parameter p and the vector parameter f of p integers.
A coordinate variable, I, ranging from 0 to p-1.
Finally, it associates the abstract processors with this coordinate system and declares the relative volumes of computations to be performed by each of the processors. It is assumed that the ith element of vector f is equal to the number of files processed by the ith abstract processor.
The execution of the recon statement causes all the physical processors running the program to execute some relevant test code in parallel; the time elapsed on each of the real processors is used to refresh the estimate of its performance. The library function MPC_Get_processors_info returns the number of available physical processors (in the variable p) and their relative speeds (in the array speeds). Based on the number and relative performances of the actual processors, function Partition computes how many files of the data store each actual processor will handle. After this call, files[i] will be equal to the number of files processed by the ith actual processor.
The next key line of the code defines the abstract network pdm of the type ParallelDataMining with actual parameters p—the actual number of physical processors, and files—an integer array of p elements containing actual numbers of files to be processed by the processors. The remaining computations and communications will be performed on this abstract network. The mpC programming system maps the abstract processors of pdm to the real parallel processes of the program. This mapping is based, on the one hand, on the current estimation of the speed of physical processors and, on the other hand, on the performance model of the parallel algorithm. The programming system does the mapping at runtime and tries to minimize the execution time of the parallel algorithm. In order to guarantee that all physical processors will be involved in computations, we assume that the mpC program has been started so that each physical processor runs precisely one process of the program.
The remaining modifications of the original Orbix implementation of the application server are minor and rather technical. They are aimed at smooth integration of the mpC parallel environment into the Orbix distributed environment.
Code implementing operations Hello and Bye is modified to initialize and finalize the mpC programming environment respectively. Code implementing operation getDistribution is additionally modified to enable the passing of input data (the number of sections) from the Orbix framework of the application server to the mpC inserted component, and output data (the recommended distribution) from the mpC component back to the Orbix layer. The input data are passed to the mpC program as an external argument, and a temporary file is used for passing the result computed by the mpC program to the main Orbix body of the application server.
In general, the modifications integrating the mpC parallel application into the Orbix distributed application are fairly obvious and easy to make. Although a deeper integration of the two technologies, say, on the language level, may be possible, this does not seem to be a reasonable course of action. A deeper integration might prove to be more sophisticated, but it would not provide any visible improvement in the quality of services compared to the light integration scheme we used above.
Now we look at some results of experiments with supermarket chain applications.
A small network of workstations was used in the experiments. The client ran on an IBM RS6000 workstation, the serial application server ran on a 4-processor Sun E450 workstation, and the parallel application server ran on a network of two 4-processor Sun E450 workstations and one 6-processor HP 9000/K570 workstation. Table 10.1 shows relative speeds of these computers obtained by executing the serial test code specified by the recon statement of the mpC program.
Workstation Number | Model | Number of Processors | Relative Speed |
---|---|---|---|
1 | Sun E450 | 4 | 2658 |
2 | Sun E450 | 4 | 3280 |
3 | HP 9000/K570 | 6 | 6067 |
The data store consisted of 60 files, each containing 8000 basket records. Up to 100 different stock items could appear in a single basket. The client code invoked the remote operation getDistribution to get an optimal distribution of the 100 items over five sections and measured the execution time of the operation. The times obtained for different configurations of the application server are presented in Table 10.2.
Workstations Executing the Remote Operation | Mode of Execution | Execution Time (s) |
---|---|---|
1 | Serial | 332 |
1 | Parallel | 168 |
1, 2 | Parallel | 96 |
1, 2, 3 | Parallel | 42 |
There were four configurations, which differed only in the way the operation getDistribution was executed:
The original serial version of the operation was executed on a 4-processor Sun workstation.
The mpC parallel version of the operation was executed on the same 4-processor Sun workstation.
The mpC version was executed on the cluster of the two 4-processor Sun workstations.
The mpC version was executed on the cluster of the two 4-processor Sun workstations and one 6-processor HP workstation.
As is obvious, the parallel configurations of the application server provided better performance.