8.5 ISAM (Indexed Sequential Access Method) Files

ISAM is a trick that attempts to allow random access to variable length records in a sequential file. This is a technique employed by IBM on its mainframe data bases in the 1960s and 1970s. Back then, disk space was very precious (remember why we wound up with the Y2K problem?), and IBM's engineers did everything they could to save space. At that time disks held about five megabytes or so, were the size of washing machines, and cost tens of thousands of dollars. You can appreciate why they wanted to make every byte count. Today, database designers have disk drives with hundreds of gigabytes per drive and RAID^[3] devices with dozens of these drives installed. They don't bother trying to conserve space at all ("Heck, I don't know how big the person's name can get, so I'll allocate 256 bytes for it!"). Nevertheless, even with large disk arrays, saving space is often a wise idea. Not everyone has a terabyte (1,000 gigabytes) at their disposal, and a user of your application may not appreciate your decision to waste their disk space. Therefore, techniques like ISAM that can reduce disk storage requirements are still important today.

ISAM is actually a very simple concept. Somewhere, the program saves the offset to the start of every record in a file. Because offsets are four bytes long, an array of double words will work quite nicely.^[4] Generally, as you construct the file you fill in the list (array) of offsets and keep track of the number of records in the file. For example, if you were creating a text file and you wanted to be able to quickly locate any line in the file, you would save the offset into the file of each line you wrote to the file. The following code fragment shows how you could do this:

 static      outputLine: string;      ISAMarray: dword[ 128*1024 ]; // allow up to 128K records.               .               .               .      mov( 0, ecx ); // Keep record count here.      forever         << create a line of text in "outputLine" >>         fileio.position( fileHandle );         mov( eax, ISAMarray[ecx*4] ); // Save away current record offset.         fileio.put( fileHandle, outputLine, nl ); // Write the record.         inc( ecx ); // Advance to next element of ISAMarray.         << determine if we're done and BREAK if we are >>      endfor;      << At this point, ECX contains the number of records and >>      << ISAMarray[0]..ISAMarray[ecx-1] contain the offsets to >>      << each of the records in the file.           >>

After building the file using the code above, you can quickly jump to an arbitrary line of text by fetching the index for that line from the ISAMarray list. The following code demonstrates how you could read line recordNumber from the file:

 mov( recordNumber, ebx ); fileio.seek( fileHandle, ISAMarray[ ebx*4 ] ); fileio.a_gets( fileHandle, inputString );

As long as you've precalculated the ISAMarray list, accessing an arbitrary line in this text file is a trivial matter.

Of course, back in the days when IBM programmers were trying to squeeze every byte from their databases as possible so they would fit on a 5MB disk drive, they didn't have 512 kilobytes of RAM to hold 128K entries in the ISAMarray list. Although a half a megabyte is no big deal today, there are a couple of reasons why keeping the ISAMarray list in a memory-based array might not be such a good idea. First, databases are much larger these days. Some databases have hundreds of millions of entries. While setting aside a half a megabyte for an ISAM table might not be a bad thing, few people are willing to set aside a half a gigabyte for this purpose. Even if your database isn't amazingly big, there is another reason why you might not want to keep your ISAMarray in main memory; it's the same reason you don't keep the file in memory: Memory is volatile, and the data is lost whenever the application quits or the user removes power from the system. The solution is exactly the same as for the file data: You store the ISAMarray data in its own file. A program that builds the ISAM table while writing the file is a simple modification to the previous ISAM generation program. The trick is to open two files concurrently and write the ISAM data to one file while you're writing the text to the other file:

 static      fileHandle: dword;    // file handle for the text file.      outputLine: string;   // file handle for the ISAM file.      CurrentOffset: dword; // Holds the current offset into the text file.                   .                   .                   .      forever           << create a line of text in "outputLine" >>           // Get the offset of the next record in the text file           // and write this offset (sequentially) to the ISAM file.           fileio.position( fileHandle );           mov( eax, CurrentOffset );           fileio.write( isamHandle, (type byte CurrentOffset), 4 );           // Okay, write the actual text data to the text file:           fileio.put( fileHandle, outputLine, nl ); // Write the record.           << determine if we're done and BREAK if we are >> endfor;

If necessary, you can count the number of records as before. You might write this value to the first record of the ISAM file (because you know the first record of the text file is always at offset zero, you can use the first element of the ISAM list to hold the count of ISAM/text file records).

Because the ISAM file is just a sequence of four-byte integers, each record in the file (i.e., an integer) has the same length. Therefore, we can easily access any value in the ISAM file using the random access file I/O mechanism. In order to read a particular line of text from the text file, the first task is to read the offset from the ISAM file and then use that offset to read the desired line from the text file. The code to accomplish this is as follows:

      // Assume we want to read the line specified by the "lineNumber" variable.      if( lineNumber <> 0 ) then      // If not record number zero, then fetch the offset to the desired      // line from the ISAM file:      intmul( 4, lineNumber, eax ); // Compute the index into the ISAM file.      fileio.seek( isamHandle, eax );      fileio.read( isamHandle, (type byte CurrentOffset), 4 ); // Read offset else      mov( 0, eax ); // Special case for record zero because the file                     // contains the record count in this position. endif; fileio.seek( fileHandle, CurrentOffset ); // Set text file position. fileio.a_gets( fileHandle, inputLine );   // Read the line of text.

This operation runs at about half the speed of having the ISAM array in memory (because it takes four file accesses rather than two to read the line of text from the file), but the data is non-volatile and is not limited by the amount of available RAM.

If you decide to use a memory-based array for your ISAM table, it's still a good idea to keep that data in a file somewhere so you don't have to recompute it (by reading the entire file) every time your application starts. If the data is present in a file, all you've got to do is read that file data into your ISAMarray list. Assuming you've stored the number of records in element number zero of the ISAM array, you could use the following code to read your ISAM data into the ISAMarray variable:

 static      isamSize: uns32;      isamHandle: dword;      fileHandle: dword;      ISAMarray: dword[ 128*1024 ];            .            .            .      // Read the first record of the ISAM file into the isamSize variable:      fileio.read( isamHandle, (type byte isamSize), 4 );      // Now read the remaining data from the ISAM file into the ISAMarray      // variable:      if( isamSize >= 128*1024 ) then           raise( ex.ValueOutOfRange );      endif;      intmul( 4, isamSize, ecx ); // #records * 4 is number of bytes to read.      fileio.read( isamHandle, (type byte ISAMarray), ecx );      // At this point, ISAMarray[0]..ISAMarray[isamSize-1] contain the indexes      // into the text file for each line of text.

^[3]Redundant array of inexpensive disks. RAID is a mechanism for combining lots of cheap disk drives together to form the equivalent of a really large disk drive.

^[4]This assumes, of course, that your files have a maximum size of four gigabytes.