STRUCTURED PARALLEL DATA MINING ALGORITHMS | Data Mining: Opportunities and Challenges

data mining: opportunities and challenges

Chapter V - Parallel and Distributed Data Mining through Parallel Skeletons and Distributed Objects
Data Mining: Opportunities and Challenges
by John Wang (ed)
Idea Group Publishing 2003


	Brought to you by Team-Fly

This second part of the chapter is devoted to the analysis of three parallel DM applications developed using the SkIE environment, also including a distributed-tree external object library in the case of C4.5. The descriptions of the problems are self-contained, but we also suggest looking for more results and literature references in the papers of Carletti and Coppola (2002) and Coppola and Vanneschi (2002). We will use the symbol to denote the input database. is viewed as a square table of N records, which are the rows, and where the number of fields in a record is a. By horizontal representation (or partitioning), we mean that the input is represented (or partitioned) using a row-wise database layout, keeping each record as a whole unit. In vertical representation, the input is instead represented in terms of columns of data, each containing the values of corresponding fields from different input records. An example of the two data representations is given in the next section about association rules.

Association Rule Mining: Partitioned Apriori

The problem of association rule mining (ARM) was proposed back in 1993, and its classical application is market basket analysis. From a sell database, we want to detect rules of the form , meaning that a customer who buys both objects A and B also buys C with some minimum probability. We refer the reader to the complete description of the problem given in Agrawal, Mannila, Ramakrishnan, Toivonen, and Verkamo (1996), while we concentrate on the computationally hard subproblem of finding frequent sets. In the ARM terminology the database is made up of transactions (the rows), each one consisting of a unique identifier and a number of boolean attributes from a set I. The attributes are called items, and a k-itemset contained in a transaction r is a set of k items that are true in r. The support σ(X) of an itemset X is the proportion of transactions that contain X. Given , the set of items I, and a fixed real number 0 < s < 1, called minimum support, the solution of the frequent set problem is the collection of all itemsets that have at least that minimum support. The support information of the frequent sets can be used to infer all the valid association rules in the input. The power set (I ) of the set of items has a lattice structure, which is naturally defined by the set inclusion relation. In Figure 2, we see a simple representation of a lattice of all subsets of {A,B,C,D}, with the itemset ABD and its subsets evidenced. A level in this lattice is a set of all itemsets with equal number of elements. The minimum support property is anti-monotonic in this lattice, i.e., it is preserved over decreasing chains:

click to expand
Figure 2: Different representations for boolean transaction data. The lattice of itemsets, with the subsets of frequent set ABD put in evidence.

To put the frequent set problem in the right perspective, we must remember that the support is taken to be the probability of itemsets. Two interestingness measures for association rules (in the form , where A and B are generic itemsets) are defined from itemset support, the confidence of a rule, which is the conditional probability of B given A, and the support of a rule. It is easy to use the information about frequent sets to compute all association rules that satisfy minimum significance requirements.

Sequential Apriori

Computing the support count for a single itemset requires a linear scan of . The database is often in the order of gigabytes, and the number of potentially frequent itemsets is 2^|I|, which easily exceeds the available memory. To efficiently compute the frequent sets, their structure and properties have to be exploited. We classify algorithms for ARM according to their lattice exploration strategy. Sequential and parallel solutions differ in the way they arrange the exploration, in the fraction of the itemset lattice that they actually have to explore, and in how they distribute the data structures to minimize computation, I/O, and memory requirements. In the following we will essentially restrict the attention to the Apriori algorithm and its direct evolutions.

Apriori (see Figure 3) builds the lattice level-wise and bottom-up, starting from the 1-itemsets and using the fact that non-frequent itemsets cannot have frequent supersets as a pruning heuristic. From each level L_k of frequent itemsets, a set of candidates C_k+1 is derived. For the sake of conciseness we give a declarative definition of the generate and computesupport procedures in Figure 3, but clearly the actual algorithm implementation is critical for performance. The candidates are all the itemsets that satisfy the anti-monotonic property, used as a filtering heuristic. The support for all the current candidates is verified in a single scan of the input, extracting the next level of frequent itemsets L_k+1 from the set of candidates. The Apriori algorithm is a breakthrough with respect to a naive approach (the amount of work done for support computation is greatly reduced), but some issues arise when applying it to huge databases. A linear scan of is required for each level of the solution. The computation can thus be slow, depending on the maximum length of frequent itemsets in the input data. Apriori is based on the assumption that itemsets in C_k are much fewer than all the possible k-itemsets, but this is often false for k=2,3, because the pruning heuristic is not yet fully effective. If C_k is large or doesn't even fit in memory, computing the support values for the candidates becomes quite hard.

click to expand
Figure 3: Apriori pseudo-code for frequent itemsets.

Related Work

Some design choices mainly distinguish sequential and parallel algorithms for ARM. In the following, we mention the distinguishing features of the different algorithms.

The way data and candidates are partitioned and distributed. The data and candidate distribution schemes obviously interact with and constrain parallel work decomposition.
The lattice exploration approach. Most Apriori evolutions work level-wise, but the lattice structure of the frequent itemsets can be also exploited using random or deterministic sampling and graph properties to quickly gain knowledge of the solution structure.
The horizontal or vertical approach to database representation. The transaction records may be the basic elements, or we may use TID lists (lists of all identifiers of transactions to which each item belongs). Different implementation solutions depend on this choice; see for instance Savasere, Omiecinski and Navathe (1995), and Zaki (2000).

In the following, we will analyze the main parallel techniques. A more comprehensive report about them can be found in Joshi, Han, Karypis and Kumar (2000).

Following Agrawal and Shafer (1996), we can classify the parallel implementations of Apriori into three main classes, Count, Data, and Candidate Distribution, according to the interplay of the partitioning schemes for the input and the C_k sets. Count Distribution solutions horizontally partition the input among the processors, distributing the database scan for support computation. In order to compute global candidate support counts for each new level of the solution, all the processors have to synchronize and exchange support information with each other. Data Distribution solutions keep the C_k set distributed, allowing the management of huge amounts of candidates. The downside of this approach is the need to send local data to most of the processors, if not to fully replicate the database, to let each processor compute the correct support of its own candidates. The Candidate Distribution strategy tries to coordinately partition the candidate sets and the input in order to minimize data communication and balance the workload. The approximate information needed to choose a good distribution is gathered in the first steps of the algorithm, or by analyzing a sample of the data.

The three previous solutions share the level-by-level approach of Apriori. This usually involves keeping in memory a set of candidate itemsets with their support counts, using data structures like the hash tree to encode part of the itemset lattice. Each node of the hash tree is labeled with an item, and the path to a node at level k corresponds to a k-itemset. Frequent itemsets are kept in the tree, exploiting the lexicographic order of items and using hash tables at the nodes and additional pointers to reach an average access time proportional to the depth of the tree.

When switching to a vertical data representation, the Apriori internal operations can be rewritten into TID-lists intersections (see the example in Figure 2). The TID-list for an itemset is the intersection of the lists of its items, and the support count is the length of the list. List intersections can be computed efficiently, and also offer the advantage of easily changing the lattice exploration strategy from breadth first to a more general one. Information on support counts of a few long itemsets can be exploited, in the first part of the algorithm, to quickly compute an approximation of the frontier of the set of frequent sets. Zaki (2000) applied similar techniques to the partitioning of 2-itemset TID lists in order to enhance the locality of data access in parallel ARM computation.

Parallel Partitioned Apriori

We studied the partitioned algorithm for ARM introduced in Savasere et al. (1995), which is a two-phase algorithm. The data are horizontally partitioned into blocks that fit inside the available memory, and frequent sets are identified separately in each block, with the same relative value of s. It is easy to show that a frequent itemset in must be locally frequent in at least one of the partitions (the converse is not true, not all the itemsets that are locally frequent are also frequent in the original dataset ).

The algorithm, shown in Figure 4, is divided in two phases, the first one being to solve the frequent set problem working in memory, separately for each block. Here parallelism is exploited from all the independent computations. The union of the frequent sets for all the blocks is a superset of the globally frequent sets. In a general case, the information gathered so far is partial. It contains a certain amount of false positives (itemsets that are only locally frequent in some of the blocks) and contains incomplete support information for those frequent itemsets that are not locally frequent in every partition (they are local false negatives in some blocks).

click to expand
Figure 4: Pseudo-code of partitioned ARM with mapping to parallel modules.

The second phase is a linear scan of to compute the correct support counts for all the elements in the approximate solution, and to discard false positives. Here parallelism on different blocks can be exploited too, if we sum support information once at the end of Phase II. As in the work of Savasere et al., we obtain the frequent sets with only two I/O scans. Phase II is efficient, and so the whole algorithm, if the approximation built in Phase I is not too coarse, i.e., the amount of false positives is not overwhelming. The degenerate behavior happens if the data distribution is too skewed, which we can often avoid by a preliminary step of random data permutation. On the other hand, data skew cannot generally be avoided if the input partitions are too small with respect to .

We have applied the two-phase partitioned scheme without the vertical representation described in the original work. This sacrificed some advantages of the vertical representation, but allowed us to reuse an existing sequential implementation of Apriori as the core of the first phase. This parallel-partitioned scheme is more asynchronous and efficient than parallel Count Distribution, because it avoids both I/O and global communication before each new level of the solution. Nevertheless, our implementation asymptotically behaves like Count Distribution with respect to the parameters of the algorithm. It is quite scalable with the size of , but cannot deal with huge candidate or frequent sets, i.e., it is not scalable with lower and lower values of the s support parameter. Its essential limits are that both the intermediate solution and a block of data have to fit in memory, and that too small a block size causes data skew. The clear advantages are that almost all work is done independently on each partition, with two I/O passes, and that the scheme can also exploit parallel I/O. We will now show that the algorithm structure can be easily mapped to a skeleton composition, producing an efficient implementation.

Skeleton Implementation

The structure of the Partitioned algorithm is clearly reflected in the skeleton composition we have used, which is shown in Figure 5 and Figure 6. The two phases are connected within a pipe skeleton. Since there is no parallel activity between them, they are in fact mapped on the same set of processors. Phase II, if is huge, can be easily and usefully parallelized over separate partitions, too. The common internal scheme of the two phases is a three-stage pipeline. The first module within the inner pipe reads the input and controls the computation, generating a stream of tasks that are actually partitions of the input data. The second module is a farm containing p seq modules running the Apriori code. The internal results of Phase I are hash-tree structures containing all the locally frequent sets. The third module contains sequential code to perform a stream reduction over these results. They are summed up to produce a hash tree containing the union of the local solutions. Phase II broadcasts the approximate solution to all the workers at the beginning. The worker modules contain a simpler code that only gathers support counts for the elements in the approximate solution. The results are arrays of support values that are added together by a second stream reduction to compute the global support for all the frequent itemsets.

click to expand
Figure 5: SkIE code for the parallel APriori.

click to expand
Figure 6: Parallel structure of Partitioned Apriori.

The skeleton structure described so far allows the application of two different I/O schemes, from which we choose depending on the underlying architecture and on the kind of database interface we want to exploit. It is important to note that the two phases have the same structure and that the second begins after the completion of the first one. We can thus map them on the same set of processors to use the same I/O approach and resources twice. If we cannot assume the availability of parallel, independent access to disk-resident data, we use the sequential modules in the beginning of each phase to read the data partitions and to distribute them to the following modules. This is the case if the interprocess transfer bandwidth is higher than the I/O one (as in some SMP and parallel architectures), or if we have no other choice than a file system or a database server that provides single-point access to the data for architectural or performance reasons. This approach is scalable as long as the single point of I/O is not a bottleneck.

The case is different if we can afford to replicate the data, exploiting massively parallel I/O to local disks, or if we can access the data from multiple points with no heavy performance penalty, by means of a parallel file system, for instance. Data replication may be needed if the network performance is inadequate to that of the processors and local disks, e.g., on a cluster of workstation. We can implement distributed I/O directly in the farm workers, using the first sequential module only as supervisor. We still profit from the load balancing properties of the farm skeleton, which takes care of work distribution and synchronization, but we avoid the physical data transfer among distinct modules. This second approach relies on distributed I/O capabilities, but it is much more scalable.

Classification: C4.5 and Tree Induction-Based Classifiers

Classification is one of the most important tasks in DM. The input database is a set of records called cases, each one having a fixed set of attributes. All the cases have an assigned class label. A classification model is a knowledge model that describes membership of cases to a class in terms of other attribute values. Each attribute, i.e., a column in the data, is either continuous (a numeric value) or categorical (a label). The class attribute is assumed to be categorical. Most classification models are both descriptive and predictive, thus they can classify unlabelled data. We use an inductive process to look for the model. It is a common practice to exploit part of the data, the training set, to generate a model, and use the remaining data, the test set, to evaluate the model by comparing the predicted class with the real one.

Many widely used classifiers are based on decision trees, among them the C4.5 algorithm by Quinlan (1993). A decision tree (see Figure 7) recursively partitions the input set until the partitions consist mostly of data from a same class. The root of the tree corresponds to all the input, and each decision node splits the cases according to some test on their attributes. The leaves are class-homogeneous, disjoint subsets of the input.

click to expand
Figure 7: Example of decision tree.

A path from the root to any given leaf thus defines a series of tests. All cases in the leaf satisfy these tests, and they belong to a certain class (there are only two classes in the example, Yes and Not). The tree describes the structure of the classes, and it is easy to use to predict the class of new, unseen data.

Apart from some implementation issues, the general form of the tree induction process is mainly a divide-and-conquer (D&C) computation, proceeding from the root and recursively classifying smaller and smaller partitions of the input. However, tree induction-based classifiers differ in significant details, like the kind of tests that are used, how they are chosen, what is the stopping criterion of the recursion (the homogeneity condition), and how the data is actually partitioned. The C4.5 algorithm will be the reference for our discussion.

Sequential C4.5

C4.5 is made up of two phases, the building one, which is the actual tree induction process, and the pruning and evaluation phase. We will focus on the former as it is the actual model search, and because its sequential and parallel implementation for huge databases is challenging.

Following the general scheme of tree induction, the building phase proceeds from the root by choosing a new test at each node (step Div1 in Figure 8). C4.5 decision nodes employ tests over a single attribute. Boolean tests of the form x< threshold are used for continuous attributes, while multiple-choice tests create a son node for each different value of a categorical attribute. Each decision node thus splits its data into two or more son nodes (steps Div3 or Div2, respectively). The test for a node is selected by evaluating the Information Gain (IG) cost function over all the possible attributes. The IG is essentially a measure of diversity of a set of values, used to recognize which attribute best separates cases of current node into class-homogeneous subsets.

click to expand
Figure 8: Pseudocode of C4.5, tree-building phase.

Tree construction is recursive (steps Conq1, Conq3), and branch expansion stops at nearly homogeneous, possibly empty partitions (step Conq2). The model search is locally exhaustive, but globally greedy; no backtracking happens in the building phase.

Each node split requires several operations on all the data contained in the local partition, some of them to evaluate the IG, and other ones to split the data according to the selected attribute. The tree itself is a compact knowledge model, but the data partitions for a node can be as large as the whole input. Ensuring efficiency and locality of data accesses is the main issue in building the decision tree. Assuming that the data fit in memory, to evaluate the IG for a categorical attribute A, histograms of the couples (class, A) in the current partition are computed. This operation requires O(n) operations per column, where n is the size of current partition. IG computation for continuous attributes needs the class column to be sorted according to the attribute. The cost of repeated sorting (operations) at each node expansion accounts for most of the running time of the algorithm. Once the split test has been selected, a further step O(n) is required to partition the data accordingly.

The serial algorithm as described is not practically scalable. For out-of-core partitions, the complexity given above is in terms of I/O operations and virtual memory page faults. The in-core algorithm quickly becomes unusable, and explicit external-memory solutions are needed to overcome the limitation. Sequential and parallel classifiers address the problem by using clever techniques to evaluate the IG, by turning to less costly, possibly approximate cost functions, or by decomposing the computation to reduce the amount of useless data transfers. In the original formulation of C4.5, the selection of a continuous attribute for the split in step Div3 also requires a linear search for a global threshold T, which is done over all the input data. This O(N) search clearly breaks the D&C paradigm, both with respect to locality of data accesses and with respect to the expected computation time of subproblems. However, the exact threshold value T is not needed to split a node, because it is used only later, during the evaluation phase. All the thresholds in the generated tree can be computed in an amortized manner at the end of the building phase. As a consequence, locality of data access is enhanced, and the computational cost of split operations lowers from to O (max (N, n log n)) to O(n log n).

Related Work

Several different parallel strategies for classification have been explored in the literature. Three of them can be considered as basic paradigms, which are combined and specialized in real algorithms. Attribute parallelism vertically partitions the data and distributes to different processors the IG calculation over different columns. Data parallelism employs horizontal partitioning of the data and coordinates computation of all processors to build each node. Task parallelism is the independent classification of separate nodes and subtrees. These fundamental approaches may use replicated or partitioned data structures, do static or dynamic load balancing and computation grain optimization.

We have seen that the computation of IG accounts for much of the complexity of C4.5. Some alternative split evaluation functions have been proposed that do not require the data to be sorted and to be memory-resident, but in the following, we will concentrate on the works based on the same definition of information gain used in C4.5. Much of the research effort has been made to avoid sorting the partitions to evaluate the IG and to split the data using a reasonable number of I/O operations or communications. A common variation is to use a vertical representation, each attribute stored in a separate data structure, keeping the columns of data in sorted order. The drawback is that horizontal partitioning is done at each node split, so most of the algorithm design is devoted to split the data according to one column, while maintaining order information in the other ones. Many parallel algorithms expand the classification tree breadth-first and employ a binary tree classification model. Binary splits require some extra processing to form two groups of values from each categorical attribute, but simplify dealing with the data and make the tree structure more regular. The algorithms SLIQ and SPRINT use these two solutions (Shafer, Agrawal & Mehta, 1996). ScalParC (Joshi, Karypis & Kumar, 1998) also builds a binary tree breadth-first, but with a level-synchronous approach. It employs a custom parallel hashing and communication scheme, reducing the memory requirements and the amount of data transfers. ScalParC is memory-scalable and has a better average split communication cost than the former algorithms, even if its worst-case communication cost is O(N) for a whole level of the tree.

Skeleton Parallel C4.5

Our research has focused on developing a structured parallel classifier based on a D&C formulation. Instead of balancing the computation and communications for a whole level, we aim at a better exploitation of the locality properties of the algorithm. A similar approach is the one reported in Sreenivas, AlSabti and Ranka (2001) for the parallel CLOUDS algorithm. The sequential classifier CLOUDS uses a different and computationally more efficient way of evaluating the split gain on numeric attributes, which results in lower I/O requirements than other classifiers. Sreenivas et al. (2001) propose, as a general technique for parallel solution of D&C problems, a mixed approach of data-parallel and task-parallel computation. Substantially, in pCLOUDS, all the nodes above a certain size are computed in a data-parallel fashion by all the processors. The smaller nodes are then classified using a simple task parallelization.

The problem of locality exploitation has also been addressed in Srivastava, Han, Kumar and Singh (1999) with a Hybrid Parallelization. A level-synchronous approach is used here, but as the amount of communications exceeds the estimated cost of data reorganization, the available processors are split in two groups that operate on separate sets of subtrees.

We started from a task parallelization approach instead. Each node classification operation is a task, which generates as subtasks the input partitions for the child nodes. Each processor receives a task and executes one or more recursive calls to the classifying procedure. The resulting partitions that are not homogeneous become new tasks to compute. We used a skeleton composition that allows tasks to loop back through a classifier module (see Figure 9 and Figure 10), which is internally parallel. Each task requires a certain amount of data to be communicated, which in a first implementation is proportional to the size of the node.

click to expand
Figure 9: SkIE code for parallel C4.5.

click to expand
Figure 10: Block Structure of parallel C4.5.

To throttle the computation grain size (i.e., to balance the amount of communications with enough local computation), we vary the amount of computation done. A single task computation may execute a single classification recursive call and return the set of sons of the given node. We can also expand a node to a subtree of more than one level, and return as subtasks all the nodes in the frontier of the subtree. Very small nodes are completely classified locally. To control how deep a subtree is generated for each task (how many recursive calls are executed), we use a task expansion policy, balancing the communication and computation times of the workers. We made a choice similar to that of Sreenivas et al. (2000) in distinguishing the nodes according to their size. In our case, we balance the task communication and computation times, which influence dynamic load balancing, by using three different classes of tasks. The base heuristic is that large tasks are expanded to one level only to increase available parallelism, while small ones are fully computed sequentially. Intermediate size tasks are expanded to incomplete subtrees up to a given number of nodes and within computation time bounds. The actual size and time limits were tuned following the same experimental approach described in our previous work (Carletti & Coppola, 2002).

We verified that if threshold calculation for continuous attributes is delayed until the pruning phase, the distribution of computation load for different tasks in the D&C becomes more regular and can be better exploited by means of application-level parallel policies. In our case, the task selection policy that is most effective is to schedule the tasks in size order, first expanding large tasks that generate more parallelism, then the smaller ones. Note that this overall scheme does not relieve us from the task of resorting data at each node, which has not been addressed yet.

Test Implementations with Skeletons and External Objects

The skeleton structure in Figure 9 implements the recursive expansion of nodes by letting tasks circulate inside a loop skeleton. The anonymous workers in the farm skeleton expand each incoming node to a separate subtree. The template underlying the farm skeleton takes care of load balancing. Its efficiency depends on the available parallelism and the computation to communication ratio, and the sequential code in the workers apply the task expansion policy we described before. The second stage in the pipe is a sequential Conquer process coordinating the computation. C4.5 is a D&C algorithm with a very simple conquer step that simply consists of merging subtrees back into the classification tree. In our implementations, the conquer module takes care of the task selection policy by ordering waiting tasks according to their size.

The structure described so far has been implemented in two variants. In a pure skeleton based version, all the input data were replicated in the workers, while the decision tree structure was local to the Conquer module. Each task consists of index structures that allow the workers to select the data for the node. Similar information has to flow through the Conquer module to allow decision tree building and task scheduling. Regardless of the good exploitation of task parallelism, the scalability of such a simple approach is limited by memory requirements, by communications issues, and by the bottleneck due to tree management.

The situation is different if we employ external objects inside the skeleton program. We have designed a Shared Tree (ST) library, an implementation of a general tree object in shared memory. We have added it to the previous prototype, using it to represent the decision tree, and we have performed some experiments. The ST is a shared, distributed object whose nodes and leaves can contain arbitrary data. Since data locality follows the evolution of the decision tree, in our solution the whole input is held inside the ST, distributed over the frontier of the expanding tree, and is immediately accessible from each process in the application. All the operations required by the algorithm are done in the sequential workers of the farm. They access the shared structure to fetch their input data, then create the resulting subtree and store back the data partitions on the frontier of the tree.

The Conquer module no longer manages the tree structure and the contained data. It only applies the task selection policy, resulting in a clear separation in the code between the sequential computation and the details of parallelism exploitation. A simple priority queue is used to give precedence to larger tasks, leading to a data-driven expansion scheme of the tree, in contrast to the depth-first scheme of sequential C4.5 and to the level-synchronous approach of ScalParC.

The external object Shared Tree lets parallel modules operate on out-of-core data in a virtual shared memory. Following the approach we have described in the first part of the chapter, the next steps are to extend the ST support with methods that implement the most common elementary operations of C4.5 (sort, scan, summarization), using external memory algorithms when needed. Once this is accomplished, the external object will have the option to choose at run-time the best available sharing support (shared/virtually shared memory, parallel file systems, memory mapped I/O, database servers). Such a technique, which is to be supported in the new ASSIST project, can enhance the scalability of parallel applications to out-of-core datasets, exploiting globally available memory resources, without losing the advantages of structured high-level programming.

Clustering: DBSCAN Density-Based Approach

Clustering is the problem of grouping input data into sets in such a way that a similarity measure is high for objects in the same cluster, and low elsewhere. Many different clustering models and similarity measures have been defined to work on various kinds of input data. For the sake of Spatial Clustering, the input data are seen as points in a suitable space R^a, and discovered clusters should describe their spatial distribution. Many kinds of data can be represented this way, and their similarity in the feature space can be mapped to a concrete meaning, e.g., for spectral data to the similarity of two real-world signals. A high dimension a of the data space is quite common and can lead to performance problems (Beyer, Goldstein, Ramakrishnan & Shaft, 1999). Usually, the spatial structure of the data has to be exploited by means of appropriate index structures to enhance the locality of data accesses. Clustering methods based on distances and cluster representatives have the basic limit that the shape of clusters is geometrically biased by the distance measure being used, so clusters whose shape is not convex are easily missed. On the other hand, relying solely on point-to-point distance means using cluster evaluation functions that require computing a quadratic number of distances, making the algorithm unpractical. Density-based clustering identifies clusters from the density of objects in the feature space. Compared to other spatial clustering methods, density-based ones still make use of the concept of distance, but only in a local sense, so that the global shape of clusters is less influenced by the chosen distance measure. One of the advantages of density-based methods is the ability to discover clusters of almost any shape.

Sequential DBSCAN

DBSCAN is a density-based spatial clustering technique introduced in Ester, Kriegel, Sander and Xu (1996), whose parallel form we recently studied. DBSCAN measures densities in R^a by counting the points inside a given region of the space. The key concept of the algorithm is that of core point, a point belonging to a locally dense part of the input set. Having fixed two user parameters ε and MinPts, a core point must have at least MinPts other data points within a neighborhood of radius ε. A suitable relation can be defined among the core points, which allows us to identify dense clusters made up of core points. The points that fail the density test are either assigned to the boundary of a neighboring cluster, or labeled as noise.

To assign cluster labels to all the points, DBSCAN repeatedly applies a simple strategy it searches for a core point, and then it explores the whole cluster it belongs to (Figure 11). The process of cluster expansion performed by the ExpandCluster procedure is quite similar to a graph visit where connected points are those closer than ε, and the visit recursively explores all reached core points. When a point in the cluster is considered as a candidate, we first check if it is a core point; if it has enough neighbors, it is labeled with the current cluster identifier, and its neighbors are also placed in the candidate queue. DBSCAN holds the whole input set inside the R*-tree spatial index structure. The R*-tree is a secondary memory tree, with an ad hoc directory organization and algorithms for building, updating, and searching designed to efficiently access spatial data. The data are kept in the leaves, while interior nodes contain bounding boxes for the son subtrees, used by the management algorithms. Holding some conditions, the R*-tree can answer to spatial queries (which are the points in a given region) with time and I/O complexity proportional to the depth of the tree, O(log N). Since for each point in the input there is exactly one neighborhood retrieval operation, the expected complexity of DBSCAN is O(N log N).

click to expand
Figure 11: Pseudo-code of DBSCAN.

We need the hypothesis that almost all regions involved in the queries are small with respect to the dataset, so that the search algorithm needs to examine only a small number of leaves of the R*-tree. We can assume that the ε parameter is not set to a neighborhood radius comparable to that of the whole dataset. But we have no guarantee that a suitable value for ε exists. It is well known that all spatial data structures lose efficiency as the dimension a of the space grows, in some cases already for a > 10. The R*-tree can be easily replaced with any improved spatial index that supports neighborhood queries, but for a high value of a this could not lead to an efficient implementation anyway. It has been argued in Beyer et al. (1999) and is still a matter of debate, that for higher and higher dimensional data the concept of neighborhood of fixed radius progressively loses its meaning for the sake of spatial organization of the data. As a consequence, for some distributions of the input, the worst-case performance of good spatial index structures is that of a linear scan of the data (Bertchold, Keim & Kriegel, 1996).

Skeleton Parallel DBSCAN

We develop a parallel implementation of DBSCAN that is a practical way to make it scalable with N, when the O (N log N) sequential cost is too high. This can be due to large constant terms in the real running time, or because of the spatial access overhead for a large value of the spatial dimensionality a. Our parallel implementation does not presently aim at being fully scalable with respect to a. We have to modify the sequential algorithms to turn it into a parallel one, but a simple and efficient solution is obtained by applying standard forms of parallelism.

When looking at DBSCAN performance, it is clear from previous analysis that we have to exploit parallelism in the spatial queries. A very simple way to enhance the service time of a large number of operations is to perform them in parallel. In the original formulation of the algorithm, the candidate points, which are the centers of the following spatial queries, are queued as they are generated. We decouple the tasks of checking results, assigning labels and selecting new candidates, from the computation of neighborhoods, which relies on the R*-tree.

The resulting decomposition in a pair of modules is shown in Figure 12. A Master module executes the sequential algorithm, demanding all spatial tree operations to a retrieve module which is internally parallel. This decoupled scheme exposes two kinds of parallelism. There is pipeline parallelism between the Master and the Slave modules, because the Slave can start processing the stream of queries that in the sequential algorithm would have been queued. Moreover, we are able to exploit farm parallelism over this stream of independent tasks (the retrieve module hosts several Slave modules). Since DBSCAN cluster definition is insensitive to the order of cluster expansion, out-of order answers can be immediately used, and we take advantage of the load-balancing properties of the farm template. Two factors make the overall structure effective:

The sequential operations in the Master are not hard and do not actually need to use a spatial access structure. Cluster labels are integers; even if we use a spatial structure for storing them, it doesn't need to be the R*-tree used to store the whole database.
A lot of independent queries are generated when a cluster starts growing in all directions, and the Slaves need no information about cluster labeling to process these queries (we will come back to this assumption). On the other hand, we have made the non-trivial hypothesis that all the Slaves can access the R*-tree structure to perform their task.

click to expand
Figure 12: Parallel decomposition of DBSCAN.

The computation is clearly data-flow like, with queries and answers being the elementary tasks that flow across the modules. The structure described is plainly mapped to a set of skeleton reported in Figure 13 and Figure 14, where we find a pipe of the Master and Slave modules, with a set of Slave-independent modules contained in a farm. A loop skeleton is used to let the answers return to the Master until all clusters have been explored. The amount of computation in the Master module depends on the amount of points returned by the spatial queries, and we have modified their behavior by moving them to separate modules. The process of oblivious expansion of different parts of a cluster exploits parallelism in the Slave module, but may repeatedly generate the same candidates. While in sequential DBSCAN neighbor points are counted, and only unlabeled ones are selected as new candidates, now there is no label information to allow this kind of immediate pruning. But if we send all neighborhood sets for each query each time, we end up in receiving exponentially many results in the Master.

click to expand
Figure 13: The SkIE code of parallel DBSCAN.

click to expand
Figure 14: Block structure of parallel DBSCAN.

A simple filtering heuristics allows the Slaves to send each result once at most. The heuristic is reported in Figure 12 as a modified pseudo-code for a FilteringSlave module. Each sub-Slave can detect a sufficient condition for not resending points already queued as candidates, while still providing the Master with the exact number of neighbors discovered in each query (∣Set_n∣ is this number), which is needed by the clustering algorithm. The effect of this distributed filtering is reported in Figure 15, where we can see that the ratio of duplicates to results is bounded by the degree of parallelism.

click to expand
Figure 15: Average number of points per query answer, versus parallelism and epsilon.

Like its sequential counterpart, the parallel structure we have described for DBSCAN is general with respect to the spatial index, and can be applied to different spatial data structures. It is thus easy to exploit any improvement or customization of the data management that may speed up the computation. Further investigation is needed to evaluate the practical scalability limits of our heuristics at higher degrees of parallelism, and possibly to devise better ones. To improve the filtering ratio, more informed heuristics could exploit non-local information gained by communications among the slaves, or between the Master and the Slave module. Another solution is to use a separate, parallel module to apply global filtering to the results returning to the Master.

A final comment must be made about the availability of the R*-tree to all the components of the Slave module. In our tests, the data structure was actually replicated, which is a consistent waste of disk space. This could be a limit for practical scalability, but the parallel algorithm does not actually need replication. Since the R*-tree is used as a read-only index structure, it can be shared among all the Slaves by means of a networked/parallel file system. The R*-tree is a secondary memory structure, so its access time is not subject to a sharp degradation. A single copy on a networked sequential file system would become a bottleneck if shared among a large number of Slaves. Sharing a single copy across a bounded number of processors could help reduce the amount of replication, and may be the only choice in a DDM setting. Using instead a parallel file system to store the data is not subject to the same limitations, and may actually increase the performance of the accesses, thanks to the fact that the size of parallel file system caches increases with the size of the employed parallel architecture.

Related Work

We compare our approach with the PDBSCAN one described in Xu, Jager and Kriegel (1999). That work also addressed the issue of speeding up the region queries by means of parallel computation. They develop a Master-Slave scheme in which the Slaves independently run a slightly modified version of sequential DBSCAN on separate partitions of the multi-dimensional input space. The data are held in a distributed version of the R*-Tree, which partitions the data pages of the R*-Tree (the leaves) among available processors and fully replicates the index structure (the interior part of the tree). This approach relies on the same constraint we exploited, i.e., that the spatial structure is essentially read-only. Slaves have to communicate with each other to answer spatial queries near partition boundaries, so the distribution of data pages among processors is performed following a Hilbert space-filling curve, a commonplace tool to enhance locality for huge spatial data structures and problems. The Master module is idle during the parallel clustering phase, but the cluster identifiers assigned in distinct partitions are completely unrelated. A merging phase is run after the clustering to map local cluster IDs to global ones. Information collected at run time by the Slaves is used to match the different parts of clusters that span partition boundaries.

In the Master-Slave decomposition of PDBSCAN, the Master has little control on the first part of the computation, and there is no bottleneck, so the application shows a good speed up for a small number of processors. On the other hand, when increasing the degree of parallelism, the amount of communications among the Slaves raises quickly, and it is not clear how high could be the cost of the merging phase an entirely sequential task of the Master processor.

PDBSCAN required the definition of an ad hoc distributed data structure, the dR*-Tree, and its implementation using a low-level approach (C++ and PVM). The dataset is carefully partitioned among the processors, and this is a strong assumption in a distributed computing environment. On the contrary, our solution does not rely on an exact partitioning of the data and is implemented in a high-level fashion reusing the code of the sequential application; it assumes that the data is accessible from every processor, at least in secondary memory. Both solutions seem to perform well on small clusters, with datasets larger than one hundred-thousand points. The distinction between the two approaches is partly apparent, anyway. As we have shown with parallel C4.5, special purpose distributed data structures can be easily integrated into high-level applications by encapsulating them into external objects, and this opportunity is certainly available also for the structured parallel DBSCAN. In our view this is the best way to make the most of the effort put in developing complex solutions.

Test Results

We have actually implemented the described parallel structures using SkIE, and several prototypes have been run on a range of different parallel architectures. We present some test results and we analyze them in terms of speed-up and efficiency. Results over different machines show the characteristics of performance portability of structured parallel applications. We used

Clusters of LINUX workstations on a LAN network;
Beowulf clusters with dedicated Fast Ethernet;
Small SMP workstations; and
Parallel, distributed-memory machines, like the Meiko CS-2 and the Cray T3E.

This is a quite heterogeneous set of parallel architectures. The different number of processors, their technology, the bandwidth and latency of the interconnection network result in variable communication/computation ratios for the same problem on different hardware platforms. The T3E and recent LINUX workstations offer the highest raw CPU speed, but from the point of view of communication versus computation speed, the SMP and the distributed memory multiprocessors offer the highest figures. I/O bandwidth and scalability are crucial to DM applications, and the test platforms offer quite different performances, with the CS-2 and the cluster platforms exploiting a higher degree of I/O parallelism. We note that the T3E platform of our tests had a single RAID disk server, which is a bottleneck at high degree of parallelism. Moreover, slower processor architectures are less sensitive to the I/O speed gap.

The Partitioned Apriori has been tested on the full range of hardware platforms. Test results are reported and discussed in full in previous works. Now we just point out that the application exhibits high efficiency and it is scalable with the size of data. In Figure 16, we see the efficiency on the CS-2 and a Beowulf cluster. The dataset is obtained with the most commonly used test set generator (see Agrawal et al., 1996), and tests with centralized (*) and distributed I/O are reported, with the latter ensuring higher performance. Figure 17 shows speed-up figures for a LINUX cluster at a higher computational load. The completion time on a T3E (Figure 18) evidences the bottleneck of a single disk server at higher degrees of parallelism. Here the I/O time dominates execution time, unless the computational load of the sequential algorithm is very high.

click to expand
Figure 16: Apriori efficiency versus parallelism (* = centralized I/O).

click to expand
Figure 17: Apriori speedup versus parallelism.

click to expand
Figure 18: Parallel Apriori, T3E completion time.

We obtained comparable results from the DBSCAN prototype. A common behavior of these two programs is a performance loss when the input is too small. We can explain this behavior in terms of the startup phase that the farm parallelism has to go through, when computing a stream of tasks, before reaching the steady state. For the Apriori prototype, small data files mean few partitions (we cannot make them small because of data skew), so the steady state of the computation is too short to balance the length of the startup phase. In the case of DBSCAN, the startup phase occurs in the beginning and at the end of the expansion of each cluster, when not enough candidate points are produced to exploit the degree of parallelism in the Slave. Of course, small datasets and small clusters prevent the farm module from ever reaching the steady state.

The skeleton implementation of C4.5 performs well with respect to other task parallel schemes, but it is limited precisely by the task parallel approach. The first target of this design was to experiment with the option of exploiting the aggregate memory, by means of the Shared Tree external object. The addition of the shared structure enhances the performance of the program, whose sequential code still works in main memory. The implementation of out-of-core and low-level parallel functionalities in the external objects is the following step in our research about parallel languages. Merging the task-parallel approach and the data-parallel one within the same program will clearly enhance the performance of this kind of D&C applications.


	Brought to you by Team-Fly