Hash Analysis

Hashing is the cryptographic term for generating a mathematically unique fingerprint from a given set of contents. In forensic work, those contents can be a single file or an entire drive. Hashes are used extensively in forensics for both analysis and validation (the latter previously described using the MD5 hash function).

A good hash algorithm has two qualities: it is one way, and it produces a very limited number of collisions. A one-way function has a known algorithm that produces and reproduces a constant result given constant inputs, but the original inputs cannot be accurately reverse engineered from that result. The simplest example of this is the Exclusive Or (XOR) function. XOR takes two binary inputs; if one and only one input is 1, the result is 1, and otherwise it is 0. Even when the output is known (0 or 1), there is no way to determine definitively what the two inputs were.
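The XOR property can be demonstrated in a few lines. This is a minimal sketch, not part of any forensic tool: it enumerates every input pair consistent with an observed XOR output, showing that the output alone cannot identify the inputs.

```python
# Minimal illustration of why XOR is one way: given only the output,
# the original inputs cannot be recovered.
a, b = 1, 0
out = a ^ b  # XOR of 1 and 0 is 1

# Enumerate every input pair that produces the same observed output:
candidates = [(x, y) for x in (0, 1) for y in (0, 1) if (x ^ y) == out]
print(candidates)  # two equally valid input pairs remain
```

Both (0, 1) and (1, 0) survive the check, so an observer holding only the output cannot say which pair was used.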

Collisions are a more complex issue in hash functions and an area in which a great deal of research is currently being conducted. A collision occurs when two distinct messages produce the same hash value. From a forensic standpoint, this means an image could in principle be altered in such a way that it generates the same hash value as the original, allowing altered data to carry a valid fingerprint. The most commonly used hash algorithm in forensics, MD5, has been shown to have collision issues. At this time they are of minor import in proving integrity (see the sidebar "Hash Algorithm Security" for details).

In addition to generating and validating the forensic integrity of evidence, the two major hash operations performed on a file system are positive comparisons and negative comparisons. Positive comparisons look for files matching a known hash value; negative comparisons look for files that do not match any known hash value.

HASH ALGORITHM SECURITY

The MD5 hash algorithm, based on the earlier MD4 algorithm (which was faster but potentially less cryptographically sound), was developed over a decade ago by esteemed cryptographer Ron Rivest (the "R" of RSA fame). The algorithm generates a 128-bit hash value using block-based XOR-ing of the source data. The probability of finding two messages with the same hash value was postulated to be one in 2^64, the target of a collision attack (based on the birthday paradox from probability theory). The harder second-preimage attack, which is most relevant to forensic hashing (given a message, find another message that has the same hash value), has a theoretical probability of one in 2^128. This makes it computationally infeasible with current hardware.
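The gap between the 2^64 collision figure and the 2^128 second-preimage figure comes straight from the birthday paradox: a collision becomes likely after roughly the square root of the output space has been sampled. A quick back-of-the-envelope sketch (the 1.1774 constant is the standard approximation for a 50% collision probability):

```python
import math

# Birthday-bound sketch: for an n-bit hash, a collision among random
# messages reaches ~50% probability after about 1.1774 * sqrt(2^n) tries.
def birthday_bound(bits):
    return 1.1774 * math.sqrt(2 ** bits)

# Express the workloads as powers of two for comparison with the text:
print(f"MD5 (128-bit): ~2^{math.log2(birthday_bound(128)):.1f} messages")
print(f"SHA-256:       ~2^{math.log2(birthday_bound(256)):.1f} messages")
```

For a 128-bit hash this works out to roughly 2^64 messages, matching the collision probability quoted above, while a second preimage still requires on the order of 2^128 attempts.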

Building on the MD5 design, the U.S. government created two additional algorithms: SHA-0, which was found to have fundamental flaws, and SHA-1. Both algorithms, as well as MD5, are now known to be vulnerable to collisions at a much higher probability than originally thought, one that is computationally feasible. The later and more robust SHA-256 and SHA-512 algorithms have much lower probabilities of collision and have not suffered from the same issue, though they still have a theoretically weakened probability relative to the expected one-in-2^256 and one-in-2^512 estimates.

What does this mean for forensic use? First, for signature matching of known files, a mismatch is extremely unlikely, even at the lowest bounds of the probability scale, now projected to be around one in 2^30 for MD5. This means that the odds of two files producing a positive mismatch (one being misidentified as the other) are around one in one billion. While this may seem a small number (and many hash sets of known files use MD5), it is not a critical technical issue for investigators, for the following reasons:

  • Any individual trying to match the signature of a known good file to hide content would need to pad it in such a way that it still made logical sense, in essence hiding the content within specific padding so that it matched the hash of an expected good file. This would require a second-preimage attack, which is significantly more difficult than the collision probabilities noted previously, although it may become theoretically possible in the near future.

  • Accidental matches, the more likely scenario, would be extremely few in number. If one were unlucky enough to encounter a collision match during a positive search, a manual review of the files (which would be done anyway) would reveal the differences.

  • On a negative search to exclude files, there is the possibility, however remote, that a file of interest is excluded. Since negative searches are generally performed from a gold build of one's own making, the analyst can use a more secure hash algorithm (SHA-512 would be the best bet at this time).

  • The in-practice probability of a collision is significantly lower than the theoretical probability noted previously.

The other issue is the potential for alteration of evidence by an investigator. Unlike a collision, this type of attack is a second-preimage attack and has significantly higher complexity. To accomplish it, the investigator would need to do the following:

  1. Obtain a valid evidence image with the hash written down.

  2. Alter that image to contain the planted evidence.

  3. Alter the remainder of the file in such a way that it results in the same hash value.

Step 3 is where things quickly become infeasible in practice. First, even with MD5 this is computationally impractical. The use of SHA-512 makes it nearly computationally impossible with current hardware and attack methods. Second, the alteration would need to be performed in a way that kept the data logically sound (for example, without breaking file headers and directory information). This increases the order of difficulty exponentially for a drive, and even for a single file. Because of this, there is no reason to worry in the near future about the reliability of hashing. However, the forensic examiner should do the following:

  • Use the SHA-512 algorithm (available in tools such as fsum) for future cases.

  • Generate or obtain new hash sets (positive and negative) that use the SHA-512 algorithm.

  • Keep an eye on the hash algorithm space for newer algorithms that build on the lessons learned from MD5 and SHA, but do not become an early adopter.

 

Positive Hash Analysis

Positive hash analysis relies on the examiner having a known hash value to search for. A drive is searched for known content by hashing every file present and comparing the results to a list of known hashes. Because a hash value is unique to a specific piece of content, that content can be matched even if the file metadata (for example, name, attributes, dates) has been altered, either intentionally or accidentally.
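The walk-and-compare procedure can be sketched with Python's standard hashlib. This is an illustrative sketch, not a production forensic tool; the function names are invented, and the known-hash value shown is simply the MD5 digest of empty content, standing in for a real hash set.

```python
import hashlib
import os

# Hypothetical known-hash set (hex digests). The value below is the
# MD5 of empty content, used here purely as a placeholder.
KNOWN_BAD = {"d41d8cd98f00b204e9800998ecf8427e"}

def md5_of(path, chunk=65536):
    """Hash a file's content in chunks so large files fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def positive_search(root, known):
    """Return paths under root whose content matches a known hash.

    Only content is compared; names, attributes, and dates are ignored,
    so renamed or re-dated copies are still matched.
    """
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if md5_of(path) in known:
                hits.append(path)
    return hits
```

Because only the digest is compared, a contraband file renamed to `notes.txt` with reset timestamps is found just as readily as the original.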

Tip 

One way to tackle inappropriate material that propagates through an organization is to add it to an internal database of hashes. Subsequent investigations can then use those hashes to search for the same content, or organization-wide searches for the content can be performed. This can also help with discovery motions when you need to identify everyone who has a particular file. EnCase Enterprise excels in this area from a distributed standpoint. The same hashes can also be used on mail and file gateways. For software hash libraries, including cracking tools, Mares offers a nice subscription-based product at http://www.dmares.com/maresware/hash_cd.htm. It incorporates the free NIST hash set from the National Software Reference Library at http://www.nsrl.nist.gov.

Investigators frequently search for binary content such as images, movies, cracking tools, and music files. Because these files may contain no text or easily identified unique characteristics to search on, positive hash searching is used. A hash is generated from either files in the investigator's possession or files from a hash library. For large hash sets, it is generally easier to create a hash list of all files on a drive and then compare that list to the list of known hashes. For smaller lists, the files can be compared in real time.

Tip 

Importing into an Access database is an easy way to maintain a listing of known hashes.

To generate a hash list of all files on a given system, the fsum command-line tool comes in handy. It calculates file hashes in all of the major formats and has a built-in comparison capability. To recursively generate a hash list (which would typically be redirected into a file), fsum can be used as follows:

 C:\>fsum -r *
 SlavaSoft Optimizing Checksum Utility - fsum 2.51
 Implemented using SlavaSoft QuickHash Library <www.slavasoft.com>
 Copyright (C) SlavaSoft Inc. 1999-2004. All rights reserved.
 ; SlavaSoft Optimizing Checksum Utility - fsum 2.51 <www.slavasoft.com>
 ;
 ; Generated on 03/17/05 at 15:52:22
 ;
 437b9b514e75dcf0b2ef09121230bf0d *anadisk\ADCONFIG.EXE
 96484b8ae5acbe0d2d9c924115a91475 *anadisk\ANADISK.DOC
 a512353b65fccc078a95fe9564ac4390 *anadisk\ANADISK.EXE
 83d1a9bbb0af9f6b3bb251c4977a3641 *anadisk\CRCHECK.EXE
 9fa15cfdbc9806f5eac4f1e0ee9cc15b *anadisk\ORDER.FRM
 976e33f7df994978ed2e99b0d0a7cd38 *anadisk\READ.ME
 6065cce665b3017454792eda15b84bc3 *anadisk\WHATS.NEW
 f4d42be7372564cadc89e164f7326b21 *analog 5.91beta1\docs\acknow.html
 e3ef9c205a658d16dc5409293d6bae2a *analog 5.91beta1\docs\alias.html
 ...

The MD5 hash values shown previously are the most commonly found in commercial and public-domain hash collections. For internal hash collections, SHA-512 is recommended; fsum can generate SHA-512 hashes via the -sha512 command-line switch.

After comparing the known hash set with the list of files on a given system, any abnormal content (unauthorized software, inappropriate material, and so on) can be quickly uncovered for further analysis.
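The comparison step itself is mechanical once the fsum listing exists. Below is a sketch, with invented helper names, that parses the fsum output format shown above ("digest *path", with comment lines starting with ";") and reports which files match a known hash set:

```python
def parse_fsum_list(lines):
    """Parse fsum-style output lines into {path: digest}.

    Lines beginning with ";" are comments; data lines have the form
    "<hex digest> *<path>".
    """
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue
        digest, _, path = line.partition(" *")
        entries[path] = digest.lower()
    return entries

def find_matches(drive_list, known_hashes):
    """Return paths whose digests appear in the known hash set."""
    return [p for p, d in drive_list.items() if d in known_hashes]
```

Feeding in the listing above with a known-hash set containing `437b9b514e75dcf0b2ef09121230bf0d` would flag `anadisk\ADCONFIG.EXE` for review.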

Negative Hash Analysis

As the name implies, negative hash analysis looks for files that are not on a known list. This is a common technique for performing quick searches on a disk by eliminating all known-good files, as well as for identifying files that have been modified by a user. The NSRL provides hash values for most common software applications (multiple versions of each) and can be used as a starting point, but a gold-build hash set is more valuable.

A gold-build hash set is generated from the common installation (gold build) used in a particular environment. Corporations may have multiple gold builds, and these may be updated frequently to incorporate patches or additions. The latest version of the gold build can be hashed en masse using the fsum tool mentioned previously to generate a database of all installed-by-default files.

Note 

Gold build refers to the original color of CD-R media. The first gold build was created in-house and then sent out to be pressed on commercial equipment.

When applied to a user environment, the hash set generated from the gold build can be used to perform a negative hash analysis based on deltas: only files that are not present in the gold build (or that have been modified from it) are flagged for analysis. The gold-build hash files are generated using the same procedure as for positive hash searches, noted previously.
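The delta logic reduces to two set operations over path-to-digest maps. This is an illustrative sketch with invented names, assuming both the gold build and the target system have already been hashed into dictionaries of relative path to hex digest:

```python
def delta_analysis(current, gold):
    """Return files absent from or modified relative to the gold build.

    current and gold are dicts of {relative_path: hex_digest}.
    Files with a path not in the gold build are "new"; files whose
    path exists but whose digest differs are "modified". Everything
    else matches the build and is excluded from review.
    """
    new = [p for p in current if p not in gold]
    modified = [p for p in current if p in gold and current[p] != gold[p]]
    return new, modified
```

A brief usage example with hypothetical digests:

```python
gold    = {"win/a.dll": "aa", "win/b.dll": "bb"}
current = {"win/a.dll": "aa", "win/b.dll": "ff", "tmp/x.exe": "cc"}
new, modified = delta_analysis(current, gold)
# tmp/x.exe is new; win/b.dll was modified; win/a.dll is excluded
```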

Tip 

Many configuration and personalized files change in the normal course of operation. These cannot be excluded automatically, since someone could rename a file of interest to match a configuration file name and avoid analysis. That said, they can be prioritized appropriately if a manual analysis occurs.

EnCase provides negative hash exclusions as an integrated feature of its search tool. Other utilities can have the list of new files added to their include lists (for example, dtSearch), or those files can be reviewed manually. They can also be removed from full directory listings when showing what user files were present, significantly reducing the amount of data presented to just the critical components.



Windows Forensics: The Field Guide for Corporate Computer Investigations
ISBN: 0470038624
Year: 2006
Authors: Chad Steel