15.2. The Remembrance of Data Passed Study
In August 1998, I was chief technology officer of a computer security start-up. One of my jobs involved setting up a test bed of modem-equipped computers that would answer incoming phone calls and respond with a variety of different prompts. Instead of purchasing new computers for this somewhat mundane task, I bought 10 used machines at $20 each from a small-town computer store. Most of the computers had been sitting on a shelf for more than a year, and the store's owner didn't even know if they worked. My plan was to mix and match the components until I had five or six operational systems. When I got the computers back to my house and started to inventory the parts, I discovered that the computer store had neglected to sanitize the hard drives prior to selling me the machines. Intrigued, I inventoried the drives and discovered the following:
All of this information was visible once the computers were turned on; no special disk recovery software was needed at all. I called the store's owner. Once he got over his shock and embarrassment, he asked me to wipe the systems as a favor. Apparently, he had meant to sanitize the machines before he sold them, but he had forgotten to do so.
15.2.1. Other Anecdotal Information
My experience with data left on disks that were subsequently sold on the secondary market is hardly unique. In recent years, there have been numerous reports of such cases, including:
While these cases are certainly notable, they represent a tiny fraction of the number of hard disks that are being repurposed, recycled, or otherwise resold on the secondary market. According to the market research firm Dataquest,[6] nearly 150 million disk drives will be retired in 2002, up from 130 million in 2001. Dataquest estimates that 7 disk drives will be retired for every 10 drives that ship in the year 2002; this is up from a 3-for-10 rate of retirement in 1997. Thus, more and more drives are being retired every year!
But the term retired is something of a misnomer. As the experience at the VA Hospital demonstrates, many disk drives that are "retired" by one organization can appear elsewhere. Indeed, mainstream businesses are increasingly turning to used equipment in an effort to cut costs; the editors at CIO Magazine even ran a cover story giving their readers advice on finding the best deals.[7]
These anecdotal reports are interesting both because of their similarity to each other and because of their relative scarcity. Clearly, confidential information has been disclosed through computers sold on the secondary market more than a few times. Why, then, have there been so few reports of unintended disclosure? In the initial publication detailing this study,[8] Shelat and I proposed three possible hypotheses to answer this question:
This chapter argues that the third hypothesis is correct. Based on a combination of the information found on the drives and interviews conducted with some of the original data owners, it seems that most confidential information on these "retired" drives is erased but not overwritten. As a result, I believe that many repurposed drives contain significant amounts of personal or confidential information, but few of the drives' current users are aware of this fact.
15.2.2. Study Methodology
Between January 1999 and January 2003, I purchased 235 used hard drives on the secondary market in an effort to determine what information they contained and what, if any, measures were taken to clean the drives before they were discarded. Initially the drives were purchased at used computer stores such as WeirdStuff in Sunnyvale, California, and PC Recycle in Bellevue, Washington. The majority of drives were purchased as the result of winning bids on the eBay online auction web site. Most purchases consisted of between 3 and 5 drives; in no case were more than 20 drives purchased at a time from the same vendor.
Modern hard disks store information in individually addressable blocks, with each block being 512 bytes in length. A 50-gigabyte disk thus has approximately 100 million blocks.
On receipt, each drive was cataloged and entered into a database. Each drive was then attached to a computer running the FreeBSD operating system and the contents were copied off, block for block, using the command:
    dd if=/dev/ad2 of=NNN.img conv=noerror,sync
where /dev/ad2 is the raw device of the disk, noerror instructs the dd command to continue copying data even if an error is encountered, and sync specifies that error-containing blocks should be written to the output stream as all zeros.
A filesystem is the piece of a computer's operating system that controls the allocation of disk blocks to individual files.
Popular filesystems are FAT16 and FAT32 (used by Windows 3.1, Windows 95, and Windows 98), NTFS (used by Windows NT, 2000, and XP), FFS (used by BSD Unix), and ext2fs (used by Linux). The following discussion is for the FAT32 filesystem, but it applies to all modern filesystems with only minor changes.
Once the images were created, they were mounted with FreeBSD's "memory disk" driver. I then attempted to read the data in the image using FreeBSD's native filesystem implementations for the FAT, NTFS, Novell, and Unix filesystems.
Of the 235 disks, 59 were dead on arrival, and the remaining 176 had data that could be read, for a total of 125 gigabytes of image files. Of these drives, 11 disks contained no data at all; that is, every block on these disks had been overwritten with ASCII NUL bytes. Another 22 disks appeared to have been overwritten completely and then formatted using the Windows FORMAT command. On these 22 disks, more than 99% of the blocks were blank.
For the majority of the remaining disks, it appeared that little if anything had been done to remove the data of their previous owners. Further examination appeared to contradict this conclusion. The remaining disks contained relatively large amounts of recoverable data. Nevertheless, a relatively small percentage of this data seemed to actually reside in files. There were only 168,459 files on the 176 readable drives, accounting for just 38,296,903[9] of the 190,681,765 non-zero disk blocks. Examining the files by file type, I found just 783 Microsoft Word files, 184 Microsoft Excel files, 30 Microsoft PowerPoint files, and just 11 Outlook PST files, numbers that seemed suspiciously low given that these were used disk drives.
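The block counts quoted above (zeroed versus non-zero blocks) can be gathered with a simple pass over an image file. The following is a minimal sketch, not the tooling actually used in the study: the function names count_blocks and demo are hypothetical, and the only detail taken from the text is the 512-byte block size. The demo builds a tiny synthetic image rather than reading a real drive image.

```python
import os
import tempfile

BLOCK_SIZE = 512  # bytes per block, as stated in the text


def count_blocks(image_path, block_size=BLOCK_SIZE):
    """Return (total_blocks, zero_blocks) for a raw disk image file."""
    zero_block = bytes(block_size)
    total = zero = 0
    with open(image_path, "rb") as img:
        while True:
            block = img.read(block_size)
            if not block:
                break
            total += 1
            # A short final block is compared against zeros of the same length.
            if block == zero_block[:len(block)]:
                zero += 1
    return total, zero


def demo():
    """Build a 4-block synthetic image (2 zeroed, 2 with data) and count it."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes(BLOCK_SIZE))                           # zeroed block
        tmp.write(b"A" * BLOCK_SIZE)                           # data block
        tmp.write(bytes(BLOCK_SIZE))                           # zeroed block
        tmp.write(b"patient record".ljust(BLOCK_SIZE, b"\0"))  # data block
        path = tmp.name
    try:
        return count_blocks(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    total, zero = demo()
    print(f"{total} blocks, {zero} zeroed, {total - zero} non-zero")
```

Pointing count_blocks at an image produced by dd would give the same kind of totals reported for the study's drives, although a real analysis would also need to separate blocks that belong to files from blocks that do not.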
Typical of the disks recovered was Disk #70, an IBM DALA 3540 that was purchased for $5 on eBay from a Massachusetts retail store. The disk contained 541 megabytes of data in 1,057,392 disk blocks (each disk block holds 512 bytes). Only 6% of the disk blocks were filled with ASCII NUL bytes; the rest contained data. Yet when the disk was mounted, just three files were observed, two of which were marked as "hidden" by the operating system:
    IO.SYS (hidden)
    MSDOS.SYS (hidden)
    COMMAND.COM
Where was the rest of the data?
15.2.3. FORMAT Doesn't Format
Broadly speaking, modern disk drives have the ability to store two kinds of information. The majority of information stored by the device is directly addressable user data: these are the actual blocks that are written by the computer's operating system onto the drive's media in response to WRITE commands, and read back into the computer in response to READ commands. The second kind of information stored on the disk drive is hidden data that is used for the proper operation of the disk drive itself. This information includes the disk's firmware and spare blocks that the drive will use when blocks containing directly addressable user data begin to fail.
When a manufacturer delivers a drive to the computer maker or end user, all blocks that will be used to hold directly addressable user data are filled with the ASCII NUL character; that is, the blocks are zeroed. (The hidden blocks generally are not zeroed, but they cannot be accessed by the computer's operating system; for most practical purposes, these blocks do not exist.)
When a disk is formatted with the FAT filesystem, the Windows FORMAT command scans the entire disk, reading every block to make sure that the block is functioning. The FORMAT command then writes down boot blocks, the disk's root directory, and finally a file allocation table that is used to distinguish blocks that are in use by the filesystem from those that are not.
This process typically takes between 10 and 20 minutes, owing to the time required to read every block on the drive. Once the root directory is written out, any information that was previously on the disk is rendered inaccessible. The data is still on the disk, but it cannot be retrieved using Windows because the files and directories of the disk cannot be reached by starting at the disk's now empty root directory. Thus, the Windows FORMAT command doesn't really erase the contents of the disk: it actually reads the entire disk and writes a new root. (Overwriting the FAT does make it more difficult to reassemble files that have been fragmented, that is, written partially in one location and partially in one or more others. This tends to make it harder, although not impossible, to recover large files.)
The failure of the FORMAT command to zero or otherwise initialize a hard drive has an interesting history. The first version of DOS, MS-DOS 1.0, worked only with floppy disks. At the time, floppies were sold without any track or sector information on their magnetic surface and they needed to be "formatted" before they could be used. In the process of formatting the disk, any bad blocks were detected and noted in the disk's FAT so that they would not be used accidentally. If a floppy disk containing data was formatted, the information that it contained would necessarily be overwritten. This process took a few minutes. Thus, the initial meaning of "format" to PC users in 1981 was "a process that initializes a piece of magnetic media, making it usable, and destroying any data that the media might contain in the process."
With the introduction of DOS 2.0, the first version of DOS that directly supported hard-disk drives, FORMAT of hard disks was made nondestructive.
Because hard drives were sold already initialized, it was only necessary for the FORMAT command to literally write a format of data structures into the disk's logical blocks so that the disk could be used with the operating system. But the FORMAT command continued to scan the entire disk for bad blocks, a process that might take between 10 and 30 minutes. Thus, the FORMAT command gave the impression that it was overwriting the entire disk because it took a long time and because the resulting disk appeared to contain no data. But, in fact, no such overwriting took place.
Not only does the FORMAT command turn visible data into invisible data, but it furthermore does so in a manner that is misleading. Equally misleading is the warning that the command displays:
    A:\>format c:
    WARNING, ALL DATA ON NON-REMOVABLE DISK
    DRIVE C: WILL BE LOST!
    proceed with Format (Y/N)?y
    Formatting 1,007.96M
    100 percent completed.
    Writing out file allocation table
    Complete.
    Calculating free space (this may take several minutes)...
    Complete.
    Volume label (11 characters, ENTER for none)?
    1,054,851,072 bytes total disk space
    1,054,851,072 bytes available on disk
    4,096 bytes in each allocation unit.
    257,531 allocation units available on disk.
    Volume Serial Number is 4026-1EFC
    A:\>
The DOS 2.0 FORMAT command could have overwritten the entire disk, but this would have doubled the amount of time that the command required to prepare a new hard drive because every block would have needed to be both written and then read. The program's creators appear to have made a tradeoff here between usability and security, increasing one while decreasing the other. Unfortunately, it was an invisible, undocumented tradeoff.
Microsoft could have done things differently. For example, the program's creators could have put in a command-line switch that would have forced the program to first overwrite each block with NULs before it was read back.
Then, the program could have been modified so that it would display one of two different messages. The "ALL DATA ... WILL BE LOST" message could have been used when the disk was actually overwritten, and a different message could have been used for the less severe option.
One reason that Microsoft's engineers may not have gone in this direction is that the hard drives that were sold in the 1980s generally came with their own separately packaged "disk utilities." Invariably, one of the "utilities" was a program that performed a so-called "low-level format" on the physical disk. The details of what a "low-level format" actually did varied from manufacturer to manufacturer and from drive to drive, but it generally was viewed as destroying all of the user-addressable information that the disk might contain. Mueller's 1991 book, Que's Guide to Data Recovery, noted that the key difference between a low-level format and a high-level format was that "you can recover data (unformat) from a high-level format."[10] Nevertheless, such knowledge did not diffuse into the general computer-user population.
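The overwrite-then-read-back behavior that the text imagines for a hypothetical FORMAT switch can be sketched in a few lines. This is an illustration only, not how FORMAT or any vendor utility works: the function names overwrite_and_verify and demo are invented here, the code operates on an ordinary file standing in for a disk, and reading back through the operating system's cache does not prove that the physical media was actually overwritten.

```python
import os
import tempfile

BLOCK_SIZE = 512  # bytes per block, as stated in the text


def overwrite_and_verify(path, block_size=BLOCK_SIZE):
    """Overwrite every block of the file at `path` with NULs, then read each
    block back and confirm it is all zeros. Returns the number of blocks
    verified. `path` is a stand-in for a raw device (an assumption)."""
    size = os.path.getsize(path)
    zero_block = bytes(block_size)
    with open(path, "r+b") as f:
        # Pass 1: write NULs over the existing contents, block by block.
        remaining = size
        while remaining > 0:
            n = min(block_size, remaining)
            f.write(zero_block[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())
        # Pass 2: read everything back and check that no data survived.
        f.seek(0)
        verified = 0
        while True:
            block = f.read(block_size)
            if not block:
                break
            if any(block):
                raise IOError(f"block {verified} still contains data")
            verified += 1
    return verified


def demo():
    """Fill a 3-block scratch file with data, wipe it, and inspect the result."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"CONFIDENTIAL".ljust(BLOCK_SIZE, b"x") * 3)
        path = tmp.name
    try:
        verified = overwrite_and_verify(path)
        with open(path, "rb") as f:
            leftover = any(f.read())
        return verified, leftover
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

The two passes make the cost discussed in the text visible: every block is touched twice, once for the write and once for the read.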
It is incredibly misleading for an operating system to give the impression that all of the information has been removed from a disk when, in fact, the information has merely been made inaccessible to users who have not obtained special data recovery tools. Such a situation is an invitation for mishap: given a freshly formatted hard disk, there is no way for a user to audit the disk and determine if it is, in fact, clean, or if it has a treasure-trove of hidden, confidential information.
Modern versions of the Windows FORMAT command also have the ability to "quick format" a disk, which omits the media scan step. In this case, the entire disk can be formatted in just a few seconds. When Microsoft created the "quick format" option, the company could have gone back and changed the behavior of FORMAT when the "quick" option wasn't selected. Ideally, a non-quick format would actually overwrite the data on the disk. This would have aligned once again the internal workings of the commands with the effects that are visible to the user. Unfortunately, Microsoft left the behavior of the command as it was.
15.2.4. DELETE Doesn't Delete
Just as today's FORMAT command doesn't actually format disks, it turns out that commands for erasing individual files do not actually perform that function, either. Instead of overwriting the actual data, commands like DELETE and ERASE simply remove the entry in the file's containing directory and return the file's blocks to the free list. What happens after the file is deleted depends upon many factors, including the amount of free space on the disk and the system's pattern of usage.
Once again, the usability problem is that the operating system gives the user the appearance that the data has been removed from the computer when, in fact, the data has merely been made inaccessible by ordinary means. The usability problem for end users is compounded by the fact that there is no mention of this behavior in the Microsoft documentation.
For example, the Windows built-in help for the DELETE command simply states that DEL "deletes one or more files." As before, this systematic deception on the part of DELETE and ERASE wasn't exactly secret: a 1987 advertisement for the Mace Utilities appearing in The New York Times noted that the $59.95 program could "Unformat, Undelete, Diagnose & Remedy" and much more.[11] But mention that files could be undeleted did not appear in a feature article until 1990, and then only in Peter Lewis's "Executive Computer" column on the 11th page of the Business section of The Times.[12]
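The behavior described in this section and the previous one, where DELETE only drops a directory entry and returns blocks to the free list and FORMAT only writes fresh bookkeeping structures, can be illustrated with a toy model. The ToyDisk class below is hypothetical and drastically simplified: it is not the FAT on-disk format, and the file names and contents in the demo are made up.

```python
class ToyDisk:
    """A toy block device with a directory and a free list (not real FAT)."""

    def __init__(self, n_blocks, block_size=512):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(n_blocks)]
        self.directory = {}                 # file name -> list of block numbers
        self.free = list(range(n_blocks))   # blocks available for allocation

    def write_file(self, name, data):
        """Store data in the first free blocks and record them in the directory."""
        used = []
        for off in range(0, len(data), self.block_size):
            blk = self.free.pop(0)
            chunk = data[off:off + self.block_size]
            self.blocks[blk] = chunk.ljust(self.block_size, b"\0")
            used.append(blk)
        self.directory[name] = used

    def delete(self, name):
        """Remove the directory entry and free the blocks; data is untouched."""
        self.free.extend(self.directory.pop(name))

    def reformat(self):
        """Write an empty directory and a fresh free list; data is untouched."""
        self.directory = {}
        self.free = list(range(len(self.blocks)))

    def list_files(self):
        """What an ordinary user sees."""
        return sorted(self.directory)

    def scan(self, needle):
        """What a block-level tool sees: blocks that contain `needle`."""
        return [i for i, b in enumerate(self.blocks) if needle in b]


def demo():
    disk = ToyDisk(n_blocks=8)
    disk.write_file("COMMAND.COM", b"program code")
    disk.write_file("RX.DAT", b"ALLERGY ALERT for patient")
    disk.delete("RX.DAT")
    after_delete = (disk.list_files(), disk.scan(b"ALLERGY"))
    disk.reformat()
    after_format = (disk.list_files(), disk.scan(b"ALLERGY"))
    return after_delete, after_format


if __name__ == "__main__":
    print(demo())
```

In both cases the file listing shrinks while the raw block scan still finds the data, which is the gap between appearance and reality that the text describes.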
15.2.5. A Taxonomy of Sanitized Recovered Data
Now we have an explanation for what happened to the data on Disk #70: the disk was formatted with the Windows FORMAT command before it was resold. Indeed, running the Unix strings(1) command over the disk's image file reveals many interesting things about the disk's previous owner, including the fact that the disk had a copy of IBM AntiVirus Trial Edition installed (Example 15-1) and that the disk was used in some kind of medical application (Example 15-2). Additional investigation revealed that this disk had been used in a computer that belonged to a mail-order pharmacy.
Example 15-1. The contents of block #854420 from Disk #70
Displaying block 854420
Notes to Users of IBM AntiVirus version 3.0 build 307..= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =....This file conta ins important notes for all users of IBM AntiVirus,..including a summary of highlights in this release and last-minute..changes to the printed documentation. It is divided into these..section s:.... Introduction.. Highlights of release 300.. Highligh ts of release 301.. Highlights of release 302.. Highlights o f build 304.. Highlights of build 306.. Highlights of build
Example 15-2. The contents of blocks 315782 and 315783 from Disk #70
Displaying block 315782
*.......&@......u@.ALLERGY ALERT.......@.DPC5.......&@..... u@.D RUG TO DRUG INTERACTION.......@.DPC4.......&@.....0u@.THERAPEUTI C DUPLICATION.......@.DPC,.......&@.....@u@.HIGH DOSE ALERT..... ..@.DPC-.......&@.....Pu@.TOO EARLY REFILL.......@.DPC/........@ .....'u@.EXCESSIVE DURATION.......@.CMB=.......&@.....pu@ INFERR ED DRUG DISEASE PRECAUTION.......@.DPC(.......&@......u@.DRUG GE NDER.......@.DPC0.......&@......u@.DRUG AGE PRECAUTION.......@.D PC+.......&@......u@.LOW DOSE ALERT.......@.DPC........*@.......
Displaying block 315783
09/30/1981 03:00 DUPLICAT.ION @.SUSPENDED LICENSE.......@.RTP0........@.......@.DIABETIC STRIP S - C.......@.CMB0........@.......@.DIABETIC STRIPS - B.......@. CMB7.......$@.......@.GENERIC PROD. SUBST-REFILL.......@.DAW>... ....$@.......@!GENERIC PROD. SUBST-NEW & REFILLS.......@.DAW;... ....&@......v@.BENEFICIARY NOT ELIGIBLE PRIME.......@.DPC2...... .0@.......@.MNFR. SPECIFIED ON RX.......@.NIS........&@..... v@. DUPLICATION CLAIM.......@.DPC-.......&@.....0v@.REQUIRES RECIEPT .......@.DPC/.......&@.....@v@.DRUG NOT AVAILABLE.......@.DPC-..
In order to facilitate the discussion of sanitization tools and practices, Shelat and I created a sanitization taxonomy (see Table 15-1). Using this taxonomy to discuss Disk #70, we can say that the disk contained one Level 0 file (COMMAND.COM) and two Level 1 files (IO.SYS and MSDOS.SYS, both files that were "hidden") and approximately 508 MB of Level 3 data.
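The strings(1) pass over Disk #70's image mentioned earlier in this section can be approximated in a few lines. This is a rough sketch rather than a reimplementation of strings(1): it looks only for runs of printable ASCII characters, the function names strings_with_blocks and demo are hypothetical, and the sample bytes in the demo are synthetic text loosely modeled on Example 15-2 rather than real disk contents.

```python
import re


def strings_with_blocks(data, min_len=4, block_size=512):
    """Find runs of at least `min_len` printable ASCII characters in `data`.
    Returns (block_number, text) pairs, where block_number is the block in
    which the run starts."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [
        (m.start() // block_size, m.group().decode("ascii"))
        for m in pattern.finditer(data)
    ]


def demo():
    """Scan a synthetic 3-block image: one zeroed block, two with fragments."""
    block0 = bytes(512)
    block1 = b"\x01\x02ALLERGY ALERT\x00".ljust(512, b"\0")
    block2 = b"DPC5\x00ab\x00".ljust(512, b"\0")  # "ab" is too short to report
    return strings_with_blocks(block0 + block1 + block2)


if __name__ == "__main__":
    for block_number, text in demo():
        print(f"block {block_number}: {text}")
```

Because the scan ignores the filesystem entirely, it finds text in any block regardless of whether a directory entry still points to it, which is why this kind of tool exposes data that a mounted, freshly formatted disk appears not to contain.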
The combination of the taxonomy and the statistical analysis of the operational disks provides a simple answer to the questions posed earlier in this chapter. Although the disks that were purchased contained large amounts of personal information, most of this information consisted of Level 2 and Level 3 files: a casual examination of the disks showed disks that were either formatted or had the user files deleted, leaving only the program files. Most potential recipients of disks sold on the secondary market, lacking tools for accessing Level 2 and Level 3 information, probably never encounter the confidential information on disks that they purchase.
This answer was confirmed, in part, by a series of interviews conducted between December 2003 and October 2004 with the previous owners of 16 of the drives. In some cases (Drives #7, #11, #73, #74, #75, #77, #94, and #134), the organization had a procedure in place for sanitizing the drives, but that procedure was not sufficient to do the job. In other cases (Drives #21 and #44), there was no formal procedure in place. Many owners that were not sophisticated had trusted their reseller to perform the sanitization process, a trust that was betrayed (Drives #54, #193, and #205). In the remaining cases of drives that were traced back to their owners, no determination could be made (Drives #6 and #128).