All the sort algorithms discussed in the preceding sections assume that all data to be sorted is available at one time in internal memory such as an array. To sort data stored in an external file, you may first bring data to the memory, then sort it internally. However, if the file is too large, all data in the file cannot be brought to memory at one time. This section discusses how to sort data in a large external file.
For simplicity, assume that two million int values are stored in a binary file named large-data.dat. This file was created using the following program:
1 import java.io.*; 2 3 public class CreateLargeFile { 4 public static void main(String[] args) throws Exception { 5 DataOutputStream output = new DataOutputStream( 6 new BufferedOutputStream 7 ( new FileOutputStream( "largedata.dat" ))); 8 9 for ( int i = ; i < 2000000 ; i++) 10 output.writeInt(( int )(Math.random() x 1000000 )); 11 12 output.close(); 13 } 14 } |
A variation of merge sort can be used to sort this file in two phases:
PHASE I : Repeatedly bring data from the file to an array, sort the array using an internal sorting algorithm, and output the data from the array to a temporary file. This process is shown in Figure 23.9. Ideally, you want to create a large array, but the maximum size of the array is dependent on how much memory is allocated to the JVM by the operating system. Assume that the maximum array size is of 100000 int values. In the temporary file, every 100000 int values are sorted. They are denoted as S 1 , S 2 , ..., and S k , where S k may contain less than 100000 values.
PHASE II : Merge a pair of sorted segments (e.g., with with and so on) into a larger sorted segment and save the new segment into a new temporary file. Continue the same process until one sorted segment results. Figure 23.10 shows how to merge eight segments.
Note
It is not necessary to merge two successive segments. For example, you may merge S 1 with S 5 , S 2 with S 6 , S 3 with S 7 and S 4 with S 8 , in the first merge step. |
Assume MAX_ARRAY_SIZE is declared as a constant 100000. Listing 23.15 gives the method that sorts every 100000 in largedata.dat and stores the sorted segments into a new file named f1.dat. The method returns the number of segments.
1 /** Sort original file into sorted segments */ 2 private static int initializeSegments ( int segmentSize, 3 String originalFile, String f1) throws Exception { 4 int [] list = new int [segmentSize]; 5 DataInputStream input = new DataInputStream( 6 new BufferedInputStream( new FileInputStream(originalFile))); 7 DataOutputStream output = new DataOutputStream( 8 new BufferedOutputStream( new FileOutputStream(f1))); 9 10 int numberOfSegments = ; 11 while (input.available() > ) { 12 numberOfSegments++; 13 int i = ; 14 for ( ; input.available() > && i < segmentSize; i++) { 15 list[i] = input.readInt(); 16 } 17 18 // Sort an array list[0..i-1] 19 java.util.Arrays.sort(list, , i); 20 21 // Write the array to f1.dat 22 for ( int j = ; j < i; j++) { 23 output.writeInt(list[j]); 24 } 25 } 26 27 input.close(); 28 output.close(); 29 return numberOfSegments; 30 } |
The method declares an array with the max size in line 5, declares a data input stream for the original file in line 6, and declares a data output stream for a temporary file in line 8. Buffered streams are used to improve performance. Assume BFFER_SIZE is a constant 50000.
Lines 15 “17 read a segment of data from the file into the array. Line 20 sorts the array. Lines 23 “25 write the data in the array to the temporary file.
The number of the segments is returned in line 37. Note that every segment has MAX_ARRAY_SIZE number of elements except the last segment that may have a smaller number of elements.
Each merge step merges two sorted segments to form a new segment. The new segment doubles the number elements. So the number of segments is reduced by half after each merge step. A segment is too large to be brought to an array in memory. To implement a merge step, copy half the number of segments from file f1.dat to a temporary file f2.dat. Then merge the first remaining segment in f1.dat with the first segment in f2.dat into a temporary file named f3.dat, as shown in Figure 23.11.
Note
f1.dat may have more than one segment than f2.dat. If so, move the last segment into f3.dat after the merge. |
Listing 23.16 gives a method that copies the first half of the segments in f1.dat to f2.dat. Listing 23.17 gives a method that merges a pair of segments in f1.dat and f2.dat. Listing 23.18 gives a method that merges two segments.
1 private static void copyHalfToF2( int numberOfSegments, 2 int segmentSize, DataInputStream f1, DataOutputStream f2) 3 throws Exception { 4 for ( int i = ; i < (numberOfSegments / 2 ) * segmentSize; i++) { 5 f2.writeInt(f1.readInt()); 6 } 7 } |
1 private static void mergeSegments( int numberOfSegments, 2 int segmentSize, DataInputStream f1, DataInputStream f2, 3 DataOutputStream f3) throws Exception { 4 for ( int i = ; i < numberOfSegments; i++) { 5 mergeTwoSegments(segmentSize, f1, f2, f3); 6 } 7 8 // f1 may have one extra segment, copy it to f3 9 while (f1.available() > ) { 10 f3.writeInt(f1.readInt()); 11 } 12 } |
1 private static void mergeTwoSegments( int segmentSize, 2 DataInputStream f1, DataInputStream f2, 3 DataOutputStream f3) throws Exception { 4 int intFromF1 = f1.readInt(); 5 int intFromF2 = f2.readInt(); 6 int f1Count = 1 ; 7 int f2Count = 1 ; 8 9 while ( true ) { 10 if (intFromF1 < intFromF2) { 11 f3.writeInt(intFromF1); 12 if (f1.available() == f1Count++ >= segmentSize) { 13 f3.writeInt(intFromF2); 14 break ; 15 } 16 else { 17 intFromF1 = f1.readInt(); 18 } 19 } 20 else { 21 f3.writeInt(intFromF2); 22 if (f2.available() == f2Count++ >= segmentSize) { 23 f3.writeInt(intFromF1); 24 break ; 25 } 26 else { 27 intFromF2 = f2.readInt(); 28 } 29 } 30 } 31 32 while (f1.available() > && f1Count++ < segmentSize) { 33 f3.writeInt(f1.readInt()); 34 } 35 36 while (f2.available() > && f2Count++ < segmentSize) { 37 f3.writeInt(f2.readInt()); 38 } 39 } |
Listing 23.19 gives the complete program for sorting int values in largedata.dat and storing the sorted data in sortedlargedata.dat.
1 import java.io.*; 2 3 public class SortLargeFile { 4 public static final int MAX_ARRAY_SIZE = 100000 ; 5 public static final int BFFER_SIZE = 100000 ; 6 7 public static void main(String[] args) throws Exception { 8 // Implement Phase 1: Create initial segments 9 int numberOfSegments = 10 initializeSegments(MAX_ARRAY_SIZE, "largedata.dat", "f1.dat"); 11 12 // Implement Phase 2: Merge segments recursively 13 merge(numberOfSegments, MAX_ARRAY_SIZE, 14 "f1.dat", "f2.dat", "f3.dat"); 15 } 16 17 /** Sort original file into sorted segments */ 18 private static int initializeSegments ( int segmentSize, 19 String originalFile, String f1) throws Exception { 20 // Same as Listing 23.14, so omitted 21 } 22 23 private static void merge( int numberOfSegments, int segmentSize, 24 String f1, String f2, String f3) throws Exception { 25 if (numberOfSegments > 1 ) { 26 mergeOneStep(numberOfSegments, segmentSize, f1, f2, f3); 27 merge((numberOfSegments + 1 ) / 2 , segmentSize * 2, f3, f1, f2); 28 } 29 else { // rename f1 as the final sorted file 30 File sortedFile = new File( "sortedlargedata.dat" ); 31 if (sortedFile.exists()) sortedFile.delete(); 32 new File(f1).renameTo(sortedFile); 33 } 34 } 35 36 private static void mergeOneStep( int numberOfSegments, 37 int segmentSize, String f1, String f2, String f3) throws 38 Exception { 39 DataInputStream f1Input = new DataInputStream( 40 new BufferedInputStream( new FileInputStream(f1), BFFER_SIZE)); 41 DataOutputStream f2Output = new DataOutputStream( 42 new BufferedOutputStream( new FileOutputStream(f2), BFFER_SIZE)); 43 44 // Copy half number of segments from f1.dat to f2.dat 45 copyHalfToF2(numberOfSegments, segmentSize, f1Input, f2Output); 46 f2Output.close(); 47 48 // Merge remaining segments in f1 with segments in f2 into f3 49 DataInputStream f2Input = new DataInputStream( 50 new BufferedInputStream( new FileInputStream(f2), BFFER_SIZE)); 51 DataOutputStream f3Output = new DataOutputStream( 52 new BufferedOutputStream( new FileOutputStream(f3), BFFER_SIZE)); 53 54 mergeSegments(numberOfSegments / 2 , 55 segmentSize, f1Input, f2Input, f3Output); 56 57 f1Input.close(); 58 f2Input.close(); 59 f3Output.close(); 60 } 61 62 /** Copy first half number of segments from f1.dat to f2.dat */ 63 private static void copyHalfToF2( int numberOfSegments, 64 int segmentSize, DataInputStream f1, DataOutputStream f2) 65 throws Exception { 66 // Same as Listing 23.15, so omitted 67 } 68 69 /** Merge all segments */ 70 private static void mergeSegments( int numberOfSegments, 71 int segmentSize, DataInputStream f1, DataInputStream f2, 72 DataOutputStream f3) throws Exception { 73 // Same as Listing 23.16, so omitted 74 } 75 76 /** Merge two segments */ 77 private static void mergeTwoSegments( int segmentSize, 78 DataInputStream f1, DataInputStream f2, 79 DataOutputStream f3) throws Exception { 80 // Same as Listing 23.17, so omitted 81 } 82 } |
Line 10 creates initial segments from the original array and stores the sorted segments in a new file f1.dat. Line 13 produces a sorted file in sortedlargedata.dat. The merge method
merge( int numberOfSegments, int segmentSize, String f1, String f2, String f3)
merges the segments in f1 into f3 using f2 to assist the merge. The merge method is invoked recursively with many merge steps. Each merge step reduces the numberOfSegments by half and doubles the sorted segment size. After completing one merge step, the next merge step merges the new segments in f3 to f2 using f1 to assist the merge. So the statement to invoke the new merge method is
merge((numberOfSegments + 1 ) / 2 , segmentSize * 2 , f3, f1, f2);
The numberOfSegments for the next merge step is (numberOfSegments + 1) / 2 . For example, if numberOfSegments is 5, numberOfSegments is 3 for the next merge step, because every two segments are merged but there is one left unmerged.
The recursive merge method ends when numberOfSegments is 1. In this case, f1 contains sorted data. Rename it to sortedlargedata.dat in line 32.