ADVANTAGES OF STRUCTURED PARALLELISM

Chapter V - Parallel and Distributed Data Mining through Parallel Skeletons and Distributed Objects
Data Mining: Opportunities and Challenges
by John Wang (ed) 
Idea Group Publishing 2003

Table 1 reports some software cost measures from our experiments, which we review to underline the qualities of the structured approach: fast code development, code portability, and performance portability.

Table 1: Software development costs for Apriori, DBSCAN and C4.5 (number of lines and kind of code, development effort, best speed-up on different target machines)
 

                                  APRIORI               DBSCAN
Sequential code                   2900 lines, C++       10138 lines, C++
Kind of parallelization           SkIE                  SkIE
Modularization, l. of code        630, C++              493, C++
Parallel structure, l. of code    350, SkIE-CL, C++     793, SkIE-CL, C++
Effort (man-months)               3                     2.5
Best speed-up and (parallelism)
  CS2                             20 (40)               -
  COW                             9.4 (10)              6 (9)
                                  3.73 (4)              -

                                  C4.5
Sequential code                   8179 lines, non-ANSI C, uses global variables
Kind of parallelization           SkIE                  SkIE + Shared Tree    MPI
Modularization, l. of code        977, C, C++           977, C, C++           1087, C, C++
Parallel structure, l. of code    303, SkIE-CL          380, SkIE-CL, C++     431, MPI, C++
Effort (man-months)               4                     5                     5
Best speed-up and (parallelism)
  CS2                             2.5 (7)               5 (14)                -
  COW                             2.45 (10)             -                     2.77 (9)

Development Costs and Code Expressiveness

When restructuring existing sequential code into a parallel program, most of the work goes into making the code modular. The amount of sequential code needed to develop the building blocks for structured parallel applications is reported in Table 1 as modularization, separate from the true parallel code. Once modularization has been accomplished, several prototypes for different parallel structures are usually developed and evaluated. The skeleton description of a parallel structure is shorter, quicker to write and far more readable than its equivalent written in MPI. As a test, starting from the same sequential modules, we developed an MPI version of C4.5. Though it exploits simpler solutions (Master-Slave, no pipelined communications) than the skeleton program, the MPI code is longer, more complex and more error-prone than the structured version. Yet the speed-up results show no significant gain from the additional programming effort (on the COW, 2.77 for MPI against 2.45 for SkIE; see Table 1).
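The separation between modularized sequential code and parallel structure can be illustrated with a minimal C++ sketch. This is not SkIE-CL syntax and not the code used in our experiments: the `farm` template, the `count_support` module and the `Transaction`/`Partition` types are hypothetical names introduced only for illustration, and the farm naively starts one asynchronous task per partition rather than using a fixed pool of workers.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical "farm" skeleton: applies a sequential worker to independent
// tasks concurrently and returns the results in task order.
// Simplification: one asynchronous task per input element.
// The worker must be safe to call from several threads at once.
template <typename Task, typename Worker>
auto farm(const std::vector<Task>& tasks, Worker worker)
    -> std::vector<decltype(worker(tasks.front()))> {
    using Result = decltype(worker(tasks.front()));
    std::vector<std::future<Result>> pending;
    pending.reserve(tasks.size());
    for (const Task& t : tasks) {
        // `tasks` and `worker` outlive every future: all are joined below.
        pending.push_back(std::async(std::launch::async,
                                     [&worker, &t] { return worker(t); }));
    }
    std::vector<Result> results;
    results.reserve(pending.size());
    for (auto& f : pending) {
        results.push_back(f.get());
    }
    return results;
}

// Illustrative data types (assumptions, not taken from the original code).
using Transaction = std::vector<int>;
using Partition = std::vector<Transaction>;

// Sequential module: counts the transactions of one partition containing `item`.
std::size_t count_support(const Partition& part, int item) {
    std::size_t n = 0;
    for (const Transaction& t : part) {
        if (std::find(t.begin(), t.end(), item) != t.end()) {
            ++n;
        }
    }
    return n;
}

// Parallel structure: a few lines that compose the skeleton with the module.
std::size_t parallel_support(const std::vector<Partition>& parts, int item) {
    auto counts = farm(parts, [item](const Partition& p) {
        return count_support(p, item);
    });
    return std::accumulate(counts.begin(), counts.end(), std::size_t{0});
}
```

In this sketch all communication and synchronization live inside `farm`, so trying a different parallel structure means replacing the skeleton while `count_support` stays untouched. An MPI version would instead spell out the sends, receives and termination handling by hand.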

Performance

The speed-up and scale-up results of the applications we have shown are not all breakthroughs, but they are comparable to those of similar solutions built with unstructured parallel programming (e.g., MPI). The Partitioned Apriori is fully scalable with respect to database size, like count-distribution implementations. The C4.5 prototype behaves better than other pure task-parallel implementations, but it still suffers from the limits of this parallelization scheme because support for external objects is incomplete. We know of no other results on spatial clustering that use our approach to the parallelization of cluster expansion.

Code and Performance Portability

Skeleton code is by definition portable over all the architectures that support the programming environment. Since the SkIE two-level parallel compiler uses standard compilation tools to build the final application, the intermediate code and the run-time support of the language can exploit all the advantages of parallel communication libraries. We can enhance the parallel support with architecture-specific facilities when the performance gain justifies it, but as long as the intermediate code complies with industry standards, the applications are portable to a broad set of architectures. The SMP and T3E tests of the ARM (association rule mining) prototype were performed this way, with no extra development time, by compiling the MPI and C++ code produced by SkIE on the target machine. These results also show a good degree of performance portability.

ISBN: 1591400511