25.11. Chapter Summary

 
[Page 761 ( continued )]

23.7. External Sort

All the sort algorithms discussed in the preceding sections assume that all data to be sorted is available at one time in internal memory such as an array. To sort data stored in an external file, you may first bring data to the memory, then sort it internally. However, if the file is too large, all data in the file cannot be brought to memory at one time. This section discusses how to sort data in a large external file.


[Page 762]

For simplicity, assume that two million int values are stored in a binary file named large-data.dat. This file was created using the following program:

Listing 23.14. CreateLargeFile.java
 1   import   java.io.*; 2 3   public class   CreateLargeFile { 4   public static void   main(String[] args)   throws   Exception { 5  DataOutputStream output =   new   DataOutputStream(  6    new   BufferedOutputStream  7  (   new   FileOutputStream(   "largedata.dat"   )));  8 9   for   (   int   i =     ; i <   2000000   ; i++) 10  output.writeInt((   int   )(Math.random() x   1000000   ));  11 12  output.close();  13 } 14 } 

A variation of merge sort can be used to sort this file in two phases:

PHASE I : Repeatedly bring data from the file to an array, sort the array using an internal sorting algorithm, and output the data from the array to a temporary file. This process is shown in Figure 23.9. Ideally, you want to create a large array, but the maximum size of the array is dependent on how much memory is allocated to the JVM by the operating system. Assume that the maximum array size is of 100000 int values. In the temporary file, every 100000 int values are sorted. They are denoted as S 1 , S 2 , ..., and S k , where S k may contain less than 100000 values.

Figure 23.9. The original file is sorted by pieces.

PHASE II : Merge a pair of sorted segments (e.g., with with and so on) into a larger sorted segment and save the new segment into a new temporary file. Continue the same process until one sorted segment results. Figure 23.10 shows how to merge eight segments.

Figure 23.10. Sorted segments are merged iteratively.
(This item is displayed on page 763 in the print version)

Note

It is not necessary to merge two successive segments. For example, you may merge S 1 with S 5 , S 2 with S 6 , S 3 with S 7 and S 4 with S 8 , in the first merge step.



[Page 763]

23.7.1. Implementing Phase I

Assume MAX_ARRAY_SIZE is declared as a constant 100000. Listing 23.15 gives the method that sorts every 100000 in largedata.dat and stores the sorted segments into a new file named f1.dat. The method returns the number of segments.

Listing 23.15. Creating Initial Sorted Segments
 1  /** Sort original file into sorted segments */  2   private static int   initializeSegments (   int   segmentSize, 3 String originalFile, String f1)   throws   Exception { 4   int   [] list =   new int   [segmentSize]; 5 DataInputStream input =   new   DataInputStream( 6   new   BufferedInputStream(   new   FileInputStream(originalFile))); 7 DataOutputStream output =   new   DataOutputStream( 8   new   BufferedOutputStream(   new   FileOutputStream(f1))); 9 10   int   numberOfSegments =     ; 11   while   (input.available() >     ) { 12 numberOfSegments++; 13   int   i =     ; 14   for   ( ; input.available() >     && i < segmentSize; i++) { 15 list[i] = input.readInt(); 16 } 17 18  // Sort an array list[0..i-1]  19 java.util.Arrays.sort(list,     , i); 20 21  // Write the array to f1.dat  22   for   (   int   j =     ; j < i; j++) { 23 output.writeInt(list[j]); 24 } 25 } 26 27 input.close(); 28 output.close(); 29   return   numberOfSegments; 30 } 

The method declares an array with the max size in line 5, declares a data input stream for the original file in line 6, and declares a data output stream for a temporary file in line 8. Buffered streams are used to improve performance. Assume BFFER_SIZE is a constant 50000.


[Page 764]

Lines 15 “17 read a segment of data from the file into the array. Line 20 sorts the array. Lines 23 “25 write the data in the array to the temporary file.

The number of the segments is returned in line 37. Note that every segment has MAX_ARRAY_SIZE number of elements except the last segment that may have a smaller number of elements.

23.7.2. Implementing Phase II

Each merge step merges two sorted segments to form a new segment. The new segment doubles the number elements. So the number of segments is reduced by half after each merge step. A segment is too large to be brought to an array in memory. To implement a merge step, copy half the number of segments from file f1.dat to a temporary file f2.dat. Then merge the first remaining segment in f1.dat with the first segment in f2.dat into a temporary file named f3.dat, as shown in Figure 23.11.

Figure 23.11. Sorted segments are merged iteratively.

Note

f1.dat may have more than one segment than f2.dat. If so, move the last segment into f3.dat after the merge.


Listing 23.16 gives a method that copies the first half of the segments in f1.dat to f2.dat. Listing 23.17 gives a method that merges a pair of segments in f1.dat and f2.dat. Listing 23.18 gives a method that merges two segments.

Listing 23.16. Copying First Half Segments
 1   private static void   copyHalfToF2(   int   numberOfSegments, 2   int   segmentSize, DataInputStream f1, DataOutputStream f2) 3   throws   Exception { 4   for   (   int   i =     ; i < (numberOfSegments /   2   ) * segmentSize; i++) { 5 f2.writeInt(f1.readInt()); 6 } 7 } 

Listing 23.17. Merging All Segments
(This item is displayed on pages 764 - 765 in the print version)
 1   private static void   mergeSegments(   int   numberOfSegments, 2   int   segmentSize, DataInputStream f1, DataInputStream f2, 3 DataOutputStream f3)   throws   Exception { 4   for   (   int   i =     ; i < numberOfSegments; i++) { 5 mergeTwoSegments(segmentSize, f1, f2, f3); 6 } 7 

[Page 765]
 8  // f1 may have one extra segment, copy it to f3  9   while   (f1.available() >     ) { 10 f3.writeInt(f1.readInt()); 11 } 12 } 

Listing 23.18. Merging Two Segments
 1   private static void   mergeTwoSegments(   int   segmentSize, 2 DataInputStream f1, DataInputStream f2, 3 DataOutputStream f3)   throws   Exception { 4   int   intFromF1 = f1.readInt(); 5   int   intFromF2 = f2.readInt(); 6   int   f1Count =   1   ; 7   int   f2Count =   1   ; 8 9   while   (   true   ) { 10   if   (intFromF1 < intFromF2) { 11 f3.writeInt(intFromF1); 12   if   (f1.available() ==     f1Count++ >= segmentSize) { 13 f3.writeInt(intFromF2); 14   break   ; 15 } 16   else   { 17 intFromF1 = f1.readInt(); 18 } 19 } 20   else   { 21 f3.writeInt(intFromF2); 22   if   (f2.available() ==     f2Count++ >= segmentSize) { 23 f3.writeInt(intFromF1); 24   break   ; 25 } 26   else   { 27 intFromF2 = f2.readInt(); 28 } 29 } 30 } 31 32   while   (f1.available() >     && f1Count++ < segmentSize) { 33 f3.writeInt(f1.readInt()); 34 } 35 36   while   (f2.available() >     && f2Count++ < segmentSize) { 37 f3.writeInt(f2.readInt()); 38 } 39 } 

23.7.3. Combining Two Phases

Listing 23.19 gives the complete program for sorting int values in largedata.dat and storing the sorted data in sortedlargedata.dat.

Listing 23.19. SortLargeFile.java
(This item is displayed on pages 765 - 767 in the print version)
 1   import   java.io.*; 2 

[Page 766]
 3   public class   SortLargeFile { 4   public static final   int MAX_ARRAY_SIZE =   100000   ; 5   public static final   int BFFER_SIZE =   100000   ; 6 7   public static void   main(String[] args)   throws   Exception { 8  // Implement Phase 1: Create initial segments  9    int   numberOfSegments =  10  initializeSegments(MAX_ARRAY_SIZE,   "largedata.dat", "f1.dat");    11 12  // Implement Phase 2: Merge segments recursively  13  merge(numberOfSegments, MAX_ARRAY_SIZE,  14    "f1.dat", "f2.dat", "f3.dat");    15 } 16 17  /** Sort original file into sorted segments */  18   private static int   initializeSegments (   int   segmentSize, 19 String originalFile, String f1)   throws   Exception { 20  // Same as Listing 23.14, so omitted  21 } 22 23   private static void   merge(   int   numberOfSegments,   int   segmentSize, 24 String f1, String f2, String f3)   throws   Exception { 25   if   (numberOfSegments >   1   ) { 26  mergeOneStep(numberOfSegments, segmentSize, f1, f2, f3);  27  merge((numberOfSegments +   1   ) /   2   , segmentSize * 2, f3, f1, f2);  28 } 29   else   {  // rename f1 as the final sorted file  30 File sortedFile =   new   File(   "sortedlargedata.dat"   ); 31   if   (sortedFile.exists()) sortedFile.delete(); 32   new   File(f1).renameTo(sortedFile); 33 } 34 } 35 36   private static void   mergeOneStep(   int   numberOfSegments, 37   int   segmentSize, String f1, String f2, String f3)   throws   38 Exception { 39 DataInputStream f1Input =   new   DataInputStream( 40   new   BufferedInputStream(   new   FileInputStream(f1), BFFER_SIZE)); 41 DataOutputStream f2Output =   new   DataOutputStream( 42   new   BufferedOutputStream(   new   FileOutputStream(f2), BFFER_SIZE)); 43 44  // Copy half number of segments from f1.dat to f2.dat  45  copyHalfToF2(numberOfSegments, segmentSize, f1Input, f2Output);  46 f2Output.close(); 47 48  // Merge remaining segments in f1 with segments in f2 into f3  49 DataInputStream f2Input =   new   DataInputStream( 50   new   BufferedInputStream(   new   FileInputStream(f2), BFFER_SIZE)); 51 DataOutputStream f3Output =   new   DataOutputStream( 52   new   BufferedOutputStream(   new   FileOutputStream(f3), BFFER_SIZE)); 53 54  mergeSegments(numberOfSegments /   2   ,  55  segmentSize, f1Input, f2Input, f3Output);  56 57 f1Input.close(); 58 f2Input.close(); 59 f3Output.close(); 60 } 61 

[Page 767]
 62  /** Copy first half number of segments from f1.dat to f2.dat */  63   private static void   copyHalfToF2(   int   numberOfSegments, 64   int   segmentSize, DataInputStream f1, DataOutputStream f2) 65   throws   Exception { 66  // Same as Listing 23.15, so omitted  67 } 68 69  /** Merge all segments */  70   private static void   mergeSegments(   int   numberOfSegments, 71   int   segmentSize, DataInputStream f1, DataInputStream f2, 72 DataOutputStream f3)   throws   Exception { 73  // Same as Listing 23.16, so omitted  74 } 75 76  /** Merge two segments */  77   private static void   mergeTwoSegments(   int   segmentSize, 78 DataInputStream f1, DataInputStream f2, 79 DataOutputStream f3)   throws   Exception { 80  // Same as Listing 23.17, so omitted  81 } 82 } 

Line 10 creates initial segments from the original array and stores the sorted segments in a new file f1.dat. Line 13 produces a sorted file in sortedlargedata.dat. The merge method

 merge(   int   numberOfSegments,   int   segmentSize, String f1, String f2, String f3) 

merges the segments in f1 into f3 using f2 to assist the merge. The merge method is invoked recursively with many merge steps. Each merge step reduces the numberOfSegments by half and doubles the sorted segment size. After completing one merge step, the next merge step merges the new segments in f3 to f2 using f1 to assist the merge. So the statement to invoke the new merge method is

 merge((numberOfSegments +   1   ) /   2   , segmentSize *   2   , f3, f1, f2); 

The numberOfSegments for the next merge step is (numberOfSegments + 1) / 2 . For example, if numberOfSegments is 5, numberOfSegments is 3 for the next merge step, because every two segments are merged but there is one left unmerged.

The recursive merge method ends when numberOfSegments is 1. In this case, f1 contains sorted data. Rename it to sortedlargedata.dat in line 32.

 


Introduction to Java Programming-Comprehensive Version
Introduction to Java Programming-Comprehensive Version (6th Edition)
ISBN: B000ONFLUM
EAN: N/A
Year: 2004
Pages: 503

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net