Chapter 10: File System Analysis | Windows Forensics: The Field Guide for Corporate Computer Investigations

Searching, validating, recovering, and analyzing the contents of an imaged drive are the most common forensic tasks performed by an examiner. Since the largest portion of evidence generally resides on a hard disk, this is where the most effort is spent and the most rewards are found in a large number of forensic scenarios.

File system analysis covers the examination of Windows-compatible file systems (NTFS, FAT, CDFS, and so on), as well as non-file space on a drive (slack space and unallocated space). Many of the listed techniques may also be applied to non-Windows file systems and un-partitioned areas. These are noted where appropriate.

Searching

The most common forensic activity is searching a hard disk for strings of data. The searching can be file-based or slack-space-based, and there are even searches of unallocated space. As in other forensic tasks, the context of the investigation determines the search type used. There are two primary search methods: index-based searching and bitwise searching.

Index-based searching generates a keyword index on the first pass through a series of files. Bitwise searching performs a full, regular expression-based search on the raw data, file-specific or not. An index-based search may be used to provide quick, repeated searches with new terms on files copied from a shared drive. Conversely, a full bitwise search may be more relevant if a hard disk is being searched for deleted files or residual fragments of their contents.

Tip	Most of the techniques noted work with re-formatted drives as well. When Windows formats a partition (FAT or NTFS), the actual partition contents are not touched. Only the partition boot sector and file system metafiles are altered.

Index-based Searching

Index-based searches rely on the creation of an index of keywords based on the contents of files. A search tool generally opens all files on a drive/share/image/partition, searches them for repeating strings of printable characters, and creates a table of the repeated strings with pointers to the original content. The initial indexing can take hours or days. However, when completed, searching the index can be done in near-instant time.
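The index-then-search idea can be sketched in a few lines of Python. This is a minimal illustration, not how any particular commercial tool works: the function names, the three-character minimum term length, and the sample documents are all assumptions made for the example, and every term is indexed rather than only repeated ones.

```python
import re
from collections import defaultdict

# Runs of 3+ word characters count as indexable terms (an assumed threshold).
TOKEN = re.compile(r"\w{3,}")

def build_index(documents):
    """Map each lowercase term to a list of (doc_name, offset) pointers."""
    index = defaultdict(list)
    for name, text in documents.items():
        for match in TOKEN.finditer(text):
            index[match.group().lower()].append((name, match.start()))
    return index

def search(index, term):
    """Look up a term; no file content is re-read at query time."""
    return index.get(term.lower(), [])

# Hypothetical sample files standing in for the contents of an image.
docs = {
    "memo.txt": "Transfer the payroll file before the audit.",
    "notes.txt": "Audit scheduled; payroll totals attached.",
}
idx = build_index(docs)
print(search(idx, "payroll"))
```

The slow part is the single pass in build_index; each later call to search is a dictionary lookup, which is why repeated queries with new terms are cheap once the index exists.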

Index searching has two primary advantages over bitwise searching. First, because it is file-based, index searching can utilize hooks into various file types to index their contents in their native format. This allows the proper searching of contents for applications like Excel (XLS files) and Acrobat (PDF files), which store data in a modified format, and compressed files such as WinZip (ZIP files). Bitwise searches on these file types are ineffective as the data is not stored in a directly readable ASCII format. Secondly, after the initial indexing, searching is extremely fast. Searching for tangential terms based on initial search results and browsing of indexed words for similar spellings is extremely effective with index searching.

REGULAR EXPRESSIONS

Regular expressions are a symbolic method for representing strings of text for the purposes of pattern matching. Familiar to anyone who has used Perl or grep (named for the ed editor command g/re/p: globally search for a regular expression and print), regular expressions can perhaps best be described as wildcards on steroids.

Standard wildcard-based text searches are limited in their forensic uses, basic full string and substring matches being the most common. Forensic searches tend to require more complex searches such as looking for IP addresses in log files, finding any credit card numbers on a hard disk, or culling phone numbers from an email file. Regular expressions, when used with supporting programs, enable all of these searches.

A regular expression may look unusual when first encountered. To match a date in the format xx/xx/xxxx for example, a simple regular expression may look like \d+\/\d+\/\d+, which to the untrained eye looks like an obscure form of ASCII art. There are several books written specifically on regular expression construction and use (the O'Reilly books excel in this space), but constructing basic regular expressions can be done with a few simple metacharacters. Any character not specifically listed as a metacharacter will be matched as the character itself. The following table lists basic regular expression rules.

CHARACTER(S)	DESCRIPTION
\	Match the metacharacter immediately following it as if it were an ordinary character. This character is called Escape character.
.	Match any single character.
[]	Match any of the characters in the listed range. [abc] will match a, b, or c. [a-zA-Z0-9] will match any alphabetic or numeric character.
[^n]	Match any character except n. [^0-9] matches any nondigit character.
()	Used for grouping characters together.
{n}	Match exactly n instances of a character.
{n,m}	Match a minimum of n and a maximum of m instances of a character.
|	Or; used to match one or another character. (\/|-) will match a single "/" or a single "-" character.
*	Match zero or more of the preceding character or pattern.
+	Match one or more of the preceding character or pattern.
?	Match zero or one of the preceding character or pattern.
$	Match the end of a line.
^	Match the beginning of a line.
\w	Match any alphanumeric character or "_".
\W	Match any character other than an alphanumeric character or "_".
\s	Match any whitespace character.
\S	Match any non-whitespace character.
\d	Match any digit character.
\D	Match any non-digit character.

A few example regular expressions frequently used in forensics include the following:

\s*(\(\d{3}\))?(\s|-)*\d{3}(\s|-)*\d{4}	Matches common U.S. phone number formats, including (xxx)xxx-xxxx, xxx xxxx, xxx-xxxx, and (xxx) xxx xxxx.
ftp\:\/\/|http\:\/\/|mailto\:|telnet\:\/\/	Matches the protocol portion of URLs.
[\w\.-]+\@\w+\.\w+	Matches common email addresses.
\d{3}-\d{2}-\d{4}	Matches social security numbers.
(\d{4}(\s|-)*){3}\d{4}	Matches common Visa or MasterCard credit card numbers.
\d{4}(\s|-)*\d{6}(\s|-)*\d{5}	Matches common American Express credit card numbers.
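As a rough illustration of how such expressions are applied, the sketch below runs phone, email, and social security number patterns over a line of text using Python's re module. The pattern dictionary, the find_hits helper, and the sample text are assumptions made for the example; the separator alternation is written out explicitly as (\s|-), and the phone pattern omits the leading \s* so that matches do not begin with whitespace.

```python
import re

# Alternation is written out explicitly as (\s|-) so a separator may be
# whitespace or a hyphen; these are illustrative, not exhaustive, patterns.
PATTERNS = {
    "phone": re.compile(r"(\(\d{3}\))?(\s|-)*\d{3}(\s|-)*\d{4}"),
    "email": re.compile(r"[\w.-]+@\w+\.\w+"),
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
}

def find_hits(text):
    """Return {pattern_name: [matched strings]} for every pattern with a hit."""
    hits = {}
    for name, pattern in PATTERNS.items():
        found = [m.group() for m in pattern.finditer(text)]
        if found:
            hits[name] = found
    return hits

# Hypothetical sample text; the values are placeholders, not real data.
sample = "Call (215)555-1212 or mail jdoe@example.com about 078-05-1120."
print(find_hits(sample))
```

Patterns this loose will also match strings that are not phone numbers or account identifiers, so hits are best treated as leads to review rather than confirmed findings.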

EnCase Enterprise supports regular expressions on Windows, as do the ports of Perl and grep. Windows XP also includes an additional text search tool, Findstr, which supports basic regular expressions. To search for a phone number in any files in the current directory, the following would be used:

C:\>findstr ([0-9]*)[0-9]*-[0-9]* *
test.txt:Here is a phone number (215)555-1212.
C:\>

In the preceding example, Findstr does not support \d, so [0-9] is used. Similarly, parentheses are not supported as a special character, and the + character is not supported, resulting in a weaker but still useful regular expression for phone numbers.

Note

There are many ways to write the same regular expression, and there are subtle differences between regular expression parsers. The date expression shown earlier, \d+\/\d+\/\d+, is not all that complete. It will also match non-dates that fit a similar format, such as 99/999/99999. A better regular expression for dates might be \d{1,2}\/\d{1,2}\/\d{4}. Further refinement can limit the specific numbers entered into each location. How restrictive the expression is determines the number of matches it returns.
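The effect of tightening a date expression can be checked directly. The sketch below compares three Python patterns: the loose form, the width-limited form, and a further refinement that restricts the month to 1 through 12 and the day to 1 through 31. The third pattern and the helper name are assumptions added for illustration; it still accepts impossible dates such as 02/31/2004.

```python
import re

LOOSE = re.compile(r"\d+/\d+/\d+")                # any digits between slashes
BOUNDED = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")    # field widths limited
# Assumed refinement: month 1-12, day 1-31, anchored on word boundaries.
RANGED = re.compile(r"\b(0?[1-9]|1[0-2])/(0?[1-9]|[12]\d|3[01])/\d{4}\b")

def matching_patterns(text):
    """Names of the patterns that find at least one match in text."""
    checks = {"loose": LOOSE, "bounded": BOUNDED, "ranged": RANGED}
    return [name for name, pat in checks.items() if pat.search(text)]

for candidate in ("04/15/2004", "99/999/99999", "13/45/2004"):
    print(candidate, matching_patterns(candidate))
```

Each refinement trades recall for precision: the stricter the pattern, the fewer false hits to review, but the greater the chance of missing a date written in an unexpected way.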

One of the simplest and most popular tools for index searching is Google Desktop. Utilizing Google search engine technology, Google Desktop installs a spider on the desktop that can index any attached drives and automatically updates based on changes to those drives. This requires images to be mounted as actual drives by restoring them to a hard disk, preferably with a software or hardware write blocker. The Google Desktop search tool will automatically index those drives and, after several hours, instant searching with a browser interface will be possible. Figure 10-1 shows a search results sample.

Figure 10-1: Google Desktop search results

Tip	In Preferences, exclude any non-evidence drives from Google Search to index faster.

The advantages to Google Desktop are its pricing structure (free), speed, and familiar interface. The same query formats used in the online version of the Google search engine can be used locally, and the results are returned in a familiar, ranked format with highlighted text. The Google Desktop search tool supports web history and AOL Instant Messenger searches also, providing a view into the activities of a suspect not available through standard searching. The major limitation of the Google Desktop tool is format support. Binary files and unrecognized file formats will not be searched and indexed.

For more comprehensive index-based searches, dtSearch Desktop from dtSearch provides a commercial grade toolset for forensic searching. The indexing capabilities in the dtSearch tool are similar to those found in Google, with enhanced support for compressed and non-standard files. The ability to see the index provides for human analysis of similar terms and misspellings, and both fuzzy and phonic search capabilities return words similar to those typed, a helpful feature when dealing with place names or personal names that have alternate spellings. dtSearch uses custom filters for the most common file formats. By removing all files from the Exclude list and enabling the non-filter search, dtSearch can perform searches on binary and other files for text strings. dtSearch offers all of the advantages of Google Desktop with more powerful search and indexing capabilities and at a reasonable price. Sample output from dtSearch Desktop is shown in Figure 10-2.

Figure 10-2: dtSearch output

Warning

It is fairly common to see multiple spellings of the same name in Western character sets when the name originates in a non-Western language. Direct transliterations of names from Kanji, Arabic, and Cyrillic character sets may not have 1-to-1 mappings to Western characters, making alternative spellings common.

Although direct searching of non-file content is not supported by these tools, indirect searching is accomplished through the creative use of dd images. After creating a dd image of a drive's contents, that image file (uncompressed) is placed on the file system indexed by dtSearch (the same technique can be used to search a memory dump). The raw image contents, including slack space and unallocated space included in the image, will be searched for text strings. The inability to easily map the contents found in these locations back to files is a negative, but the ability to search non-file content in real time outweighs this drawback. A slightly better option, the forensic version of WinHex, supports the dumping of slack and unallocated space to individual files for searching.

Tip	As a general rule, keep the image file sizes to 1GB for indexing purposes. Files larger than that tend to index fairly slowly, and dtSearch does not directly support files over 2GB.

Bitwise Searching

Bitwise searches look for simple text strings or regular expression matches in any sectors on a drive including both sectors that are currently unallocated and those residing in OS slack space. The ability to do regular expression searches enables the examiner to search for complex text terms as well as non-text (binary) values such as file headers. This precludes indexing, and as a result the entire contents of the drive must be searched for each term, making bitwise searches significantly slower than index searches.

WinHex provides basic text and binary (in Hexadecimal) searches of physical drive contents. While not supporting full regular expressions, wildcard searching is supported and all drive contents, including unallocated and slack space, are searched. Figure 10-3 shows a search for GIF87, one of the file headers for the GIF image file format (along with GIF89). This search will reveal not only any standalone or embedded GIF files present on the system but also previously deleted and intentionally renamed GIF files. A standard index-based search would not necessarily return any results with the same search. The headers of files would likely not be searched and the unallocated space where deleted files reside would not be searched at all.

Figure 10-3: WinHex search for GIF87
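A header search of this kind can be approximated in Python by scanning a raw image for the GIF signatures GIF87a and GIF89a. This is a minimal sketch under stated assumptions: the image is treated as a plain byte stream (for example, an uncompressed dd image), the function name and chunk size are hypothetical, and a small overlap is carried between reads so a signature split across two chunks is still found.

```python
import io

SIGNATURES = (b"GIF87a", b"GIF89a")

def find_signatures(stream, signatures=SIGNATURES, chunk_size=1 << 20):
    """Yield (absolute_offset, signature) for each hit in a binary stream."""
    overlap = max(len(s) for s in signatures) - 1
    tail = b""
    base = 0  # absolute offset of the first byte of `tail + chunk`
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        window = tail + chunk
        for sig in signatures:
            pos = window.find(sig)
            while pos != -1:
                # Skip hits lying wholly inside the carried-over tail;
                # they were already reported on the previous pass.
                if pos + len(sig) > len(tail):
                    yield base + pos, sig
                pos = window.find(sig, pos + 1)
        new_tail = window[-overlap:] if overlap else b""
        base += len(window) - len(new_tail)
        tail = new_tail

# Hypothetical 47-byte "image" with two embedded GIF headers.
image = b"\x00" * 10 + b"GIF87a" + b"\x00" * 20 + b"GIF89a" + b"\x00" * 5
hits = sorted(find_signatures(io.BytesIO(image), chunk_size=8))
print(hits)
```

Because the scan ignores the file system entirely, it reports the same offsets whether the header belongs to a live file, a renamed file, or a deleted file whose sectors have not been overwritten.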

For full regular expression-based searching, EnCase Enterprise supports GREP-based regular expression parsing in addition to ASCII, Unicode, and hex-based searches. As with WinHex noted previously, the hex searches allow the tool to be used to find binary signatures. Unicode permits searching for characters outside of the standard (and extended) ASCII character set. When conducting international investigations, Unicode support is a must.

Tip	Search terms can and should be grouped together to minimize the traversals of the drive needed. Although the number of CPU comparisons is not necessarily reduced, the critical path read times from the drive are.
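One way to picture grouped searching is to combine the terms into a single alternation so the data is traversed once instead of once per term. The Python sketch below is an illustration only: the helper names and sample bytes are assumptions, the data is held in memory for brevity, and overlapping terms (one term being a prefix of another) would need extra handling because an alternation reports non-overlapping matches.

```python
import re

def compile_terms(terms):
    """Combine literal byte terms into one alternation so data is read once."""
    return re.compile(b"|".join(re.escape(t) for t in terms))

def single_pass_search(data, terms):
    """Return {term: [offsets]} from one traversal of the raw bytes."""
    hits = {t: [] for t in terms}
    for m in compile_terms(terms).finditer(data):
        hits[m.group()].append(m.start())
    return hits

# Hypothetical raw bytes standing in for a region of a drive image.
raw = b"\x00\x01invoice 4471\x00payroll.xls\x00\x00GIF89a\x00invoice"
terms = [b"invoice", b"payroll", b"GIF89a"]
print(single_pass_search(raw, terms))
```

The number of comparisons is similar to running three separate searches, but the bytes are fetched only once, which is the saving that matters when the bottleneck is reading from the drive.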

EnCase can search image files that have been acquired with or imported into the software and can perform raw (and remote) searches for keywords. With remote searching, multiple drives can be analyzed simultaneously using their local resources in a forensically sound manner. Instead of the entire drive images, only the search hits can then be acquired. This is a key feature for bandwidth-limited remote investigations. An example of the EnCase search feature is shown in Figure 10-4.

Figure 10-4: EnCase Enterprise searching

As noted previously, the two biggest limitations of bitwise searches are speed and non-standard file formats. A bitwise search can take multiple hours (up to days) depending on the size of the drive(s) or images searched. Consecutive searching (searches based on previous search results) can therefore take weeks to complete successfully and cannot be done on the fly. In addition to the speed concern, text stored in file formats such as XLS (Excel) and PDF (Acrobat), in compressed files, and in encrypted files will not be found by a standard bitwise search. Even simple obfuscation techniques such as ROT-13 will foil a bitwise search.

Note	The simplest form of a Caesar Cipher, ROT-13 rotates each of the alphabet characters 13 spaces and replaces them with the new character in that location. It is sometimes used to bypass simple mail filters or search tools.
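The effect of ROT-13 on a keyword search is easy to demonstrate with Python's built-in rot_13 text codec. The helper name and sample sentence are assumptions for the example. Since applying ROT-13 twice restores the original text, one possible mitigation is to search for the rotated form of each keyword as well as the plain form.

```python
import codecs

def rot13(text):
    """Rotate each ASCII letter 13 places; other characters are unchanged."""
    return codecs.encode(text, "rot_13")

keyword = "password"
hidden = rot13("the password is swordfish")  # hypothetical obfuscated content

print(hidden)
print(keyword in hidden)          # plain keyword search against the content
print(rot13(keyword) in hidden)   # search using the rotated keyword
```

The plain keyword no longer appears in the obfuscated text, while its rotated form does, so adding the rotated variant to the keyword list costs little and covers this one scheme. It does nothing for other substitutions, compression, or real encryption.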

A final but less likely way in which bitwise searches can fail is due to file boundaries. On a heavily fragmented disk, many files are stored in non-contiguous sectors. Any text on the boundary of one of these sectors (that is, spanning two sectors) will not be accurately identified by most bitwise tools unless the sectors are contiguous. The odds of missing a keyword based on sector boundaries increase as fragmentation increases and are inversely proportional to cluster size. Since a cluster is the smallest amount of information an OS can address, intact files will have all sectors within a single cluster as contiguous.
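The boundary problem can be shown with a toy model in Python. Everything here is an assumption for illustration: a 16-byte cluster size (real clusters are far larger), a two-cluster file, and a single unrelated cluster placed between the file's clusters to simulate fragmentation.

```python
CLUSTER = 16  # toy cluster size in bytes; real clusters are much larger

def to_clusters(data, size=CLUSTER):
    """Split file content into fixed-size clusters, zero-padding the last one."""
    padded = data + b"\x00" * (-len(data) % size)
    return [padded[i:i + size] for i in range(0, len(padded), size)]

# The keyword "account" straddles the boundary between the two clusters.
file_data = b"send funds to account 9921 today"
clusters = to_clusters(file_data)

# Hypothetical fragmented layout: an unrelated cluster sits between the
# file's first and second clusters on the "disk".
other = b"X" * CLUSTER
disk = clusters[0] + other + b"".join(clusters[1:])

keyword = b"account"
raw_hit = keyword in disk                    # scan in physical order
logical_hit = keyword in b"".join(clusters)  # scan in file (logical) order
print(raw_hit, logical_hit)
```

The physical-order scan misses the keyword because its two halves are no longer adjacent, while a search over the reassembled file finds it. Larger clusters reduce the number of boundaries per file, which is why the odds of a miss fall as cluster size grows.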

Search Methodology

Many investigations call for a combination of searching techniques, and the methodology applied to a particular case needs to be context specific. Factors affecting the choice and order of searching include:

Awareness of suspect. If the suspect is aware that he is under investigation, file-based content may have been deleted, which leans toward bitwise searching. If the content is likely to still be present on the drive intact, index-based searching may be more effective.
Likely data format. If there is a chance that the content resides in PDF, XLS, zipped, gzipped, or Windows-compressed files, index-based searching will be more thorough. A preliminary bitwise search for the header bytes from these file types and subsequent recovery of deleted files before the index-based search will combine both techniques for the maximum effectiveness.
Time constraints. For a single keyword or group of keywords, a bitwise search will be slightly faster in most circumstances due to the overhead created by indexing. If subsequent or real-time searches are expected, which is common when performing an investigation interactively with non-IT investigators or Subject Matter Experts, index-based searches will be faster overall.
Search complexity. When searching for very complex regular expressions (for example, looking for all strings that match a credit card number or phone number), bitwise search tools will have more success. Searches based on synonyms of words, phonetically similar words, and fuzzy spellings of words will be more successful with index-based searches.

Tip	On a sparsely populated drive where content is not expected to be present in unallocated space, index-based searching will be significantly faster.

All considerations being equal, for ongoing investigations generating an index for subsequent searching can almost always be performed during downtime. Generating an index overnight immediately following an acquisition allows the investigator to decide what type of search to perform the next day. Since the indexing is complete, an index-based search or searches can be done simultaneously with direct batched, bitwise searches running in parallel.