Hashing | Advanced .NET Programming

Hashing is a way of providing a check on message or file integrity. It is not the same as encryption but, as we'll see soon, it plays an important part in the process of signing code. It is based on a similar concept to the old parity and cyclic redundancy checks that were common in programming many years ago (parity and cyclic redundancy checks are still common, but only in areas such as communications, where there is no time to perform more sophisticated checks).

We will illustrate the principle by quickly defining our own hashing algorithm. Let's suppose we want to check that a file hasn't been corrupted, and we supply an extra byte at the end of the file for this purpose. The byte is calculated as follows: we examine every existing byte of data in the file and count how many of these bytes have the least significant bit set to 1. If this number is odd, we set the least significant bit of the check byte to 1. If it's even, we set that bit to 0. Then we repeat the process for each of the other seven bit positions in turn. For example, a short file containing three bytes would give the following result:

File Pointer	Value (binary)	Value (hex)
First byte	01001101	0x4d
Second byte	00011000	0x18
Third byte	11110011	0xf3
CHECK BYTE	10100110	0xa6
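
Counting whether an odd or even number of bytes have a given bit set is the same as XORing all the bytes together, so the check byte in the table can be reproduced in a few lines. The following is a minimal sketch in Python, chosen only for brevity; the function name `check_byte` is our own invention and not part of any library.

```python
from functools import reduce


def check_byte(data: bytes) -> int:
    """Return the one-byte parity hash described above.

    Bit i of the result is 1 when an odd number of bytes in `data`
    have bit i set, which is exactly the XOR of all the bytes.
    """
    return reduce(lambda acc, b: acc ^ b, data, 0)


# The three bytes from the table above.
sample = bytes([0x4D, 0x18, 0xF3])
result = check_byte(sample)
print(f"{result:08b} 0x{result:02x}")  # 10100110 0xa6
```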

The check byte is our hash of the data. Notice that it has the following properties: it is of fixed length, independent of the size of the file, and there is no way of working out what the file contents are from the hash. It really is impossible, because so much data is lost in calculating the hash that many different files produce the same value (you should contrast this with the process of working out a private key given the public key: that's not impossible, but would simply take too long to be practical). The advantage of hashing the file is that it provides an easy check on whether the file has been accidentally corrupted. For files such as assemblies, the hash is placed somewhere in the file. Then, when an application reads the file, it can independently calculate the hash value and verify that it matches the value stored in the file. (Obviously, for this to work, we need an agreed file format that allows the application to locate the hash. Also, the area of the file where the hash is stored cannot be used in calculating the hash, since those bytes get overwritten when the hash is placed there.)
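
To make the store-and-verify cycle concrete, here is a small sketch in Python. The file format is purely hypothetical: we assume the check byte lives in the first byte of the file (`HASH_OFFSET`), and the names `compute_hash`, `stamp`, and `verify` are made up for illustration. Note how the reserved slot is left out of the calculation, for the reason given above.

```python
from functools import reduce

HASH_OFFSET = 0  # hypothetical format: the check byte occupies the first byte


def compute_hash(blob: bytes) -> int:
    """XOR every byte except the slot reserved for the hash itself."""
    covered = blob[:HASH_OFFSET] + blob[HASH_OFFSET + 1:]
    return reduce(lambda acc, b: acc ^ b, covered, 0)


def stamp(blob: bytes) -> bytes:
    """Return a copy of `blob` with the hash written into its reserved slot."""
    out = bytearray(blob)
    out[HASH_OFFSET] = compute_hash(blob)
    return bytes(out)


def verify(blob: bytes) -> bool:
    """Recalculate the hash and compare it with the stored value."""
    return len(blob) > HASH_OFFSET and blob[HASH_OFFSET] == compute_hash(blob)


# A placeholder byte for the hash, followed by the three data bytes.
stamped = stamp(bytes([0x00, 0x4D, 0x18, 0xF3]))
print(hex(stamped[HASH_OFFSET]))  # 0xa6
print(verify(stamped))            # True

# Flip a single bit in the data: the stored hash no longer matches.
damaged = stamped[:2] + bytes([stamped[2] ^ 0x04]) + stamped[3:]
print(verify(damaged))            # False
```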

The scheme I've just presented is very simple, and has the disadvantage that the hash is only one byte long. That means that if some random corruption happened, there's still (depending on the types of errors that are likely to occur) roughly a 1 in 256 chance that the corrupted file will generate the same hash, so the error won't be detected. In .NET assemblies, the algorithm used for the hash is known as SHA-1. This algorithm was developed by the US Government National Institute of Standards and Technology (NIST) and the National Security Agency (NSA). The hash generated by SHA-1 contains 160 bits. That means that the chance of a random error going undetected is of the order of 1 in 2^160, which is so small that you can forget it. You might also hear of other standard hash algorithms. The ones supported by the .NET Framework are HMACSHA1, MACTripleDES, MD5, SHA-1, SHA-256, SHA-384, and SHA-512.
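
The difference between the one-byte scheme and SHA-1 is easy to see by experiment. The sketch below, again in Python and using its standard `hashlib` module rather than the .NET classes, takes the three bytes from our earlier example and swaps the first two. The helper `xor_check_byte` is our own name for the simple scheme.

```python
import hashlib
from functools import reduce


def xor_check_byte(data: bytes) -> int:
    """The simple one-byte hash from earlier: XOR of all the bytes."""
    return reduce(lambda acc, b: acc ^ b, data, 0)


original = bytes([0x4D, 0x18, 0xF3])
swapped = bytes([0x18, 0x4D, 0xF3])  # same bytes, first two exchanged

# The one-byte hash cannot tell the two files apart...
same_xor = xor_check_byte(original) == xor_check_byte(swapped)
# ...whereas the SHA-1 digests differ.
same_sha1 = hashlib.sha1(original).digest() == hashlib.sha1(swapped).digest()
# SHA-1 always produces 20 bytes, that is, 160 bits.
sha1_bits = hashlib.sha1().digest_size * 8

print(same_xor)   # True
print(same_sha1)  # False
print(sha1_bits)  # 160
```

Reordering bytes is just one kind of change the XOR scheme misses; any corruption that flips an even number of bits in the same bit position slips through as well.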

It's also worth bearing in mind that these days, computer systems are sophisticated enough that files don't get randomly corrupted as often as was once the case. However, what could happen is that someone accidentally replaces one of the modules in an assembly with a different file, perhaps a different version of the same module. This is a more likely scenario, and the hash will detect it, since the hash covers all files in the assembly.

On the other hand, the hash won't protect you against malicious tampering with the file, since someone who tampers with it and knows the file format will simply calculate a correct new hash and store it in the file once they have finished.
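
The following sketch shows why. It assumes a hypothetical file layout in which a SHA-1 digest is simply appended to the payload (real assemblies use a different layout), and the names `stamp` and `verify` are illustrative only. An accidental change is caught, but anyone who can run `stamp` themselves can produce a tampered file that passes verification.

```python
import hashlib

DIGEST_LEN = hashlib.sha1().digest_size  # 20 bytes for SHA-1


def stamp(payload: bytes) -> bytes:
    """Hypothetical format: the payload followed by its SHA-1 digest."""
    return payload + hashlib.sha1(payload).digest()


def verify(blob: bytes) -> bool:
    """Recalculate the digest over the payload and compare with the stored one."""
    if len(blob) < DIGEST_LEN:
        return False
    payload, stored = blob[:-DIGEST_LEN], blob[-DIGEST_LEN:]
    return hashlib.sha1(payload).digest() == stored


genuine = stamp(b"original module contents")

# Accidental corruption: one byte changes, the stored digest does not.
corrupted = b"X" + genuine[1:]

# Malicious tampering: the attacker changes the payload AND recomputes the digest.
tampered = stamp(b"malicious module contents")

print(verify(genuine))    # True
print(verify(corrupted))  # False
print(verify(tampered))   # True
```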