Sometimes it's essential to know whether data has changed. For instance, crackers invading Unix systems often replace crucial files like /etc/passwd or /usr/ucb/cc with their own hacked versions that enable them to regain access to the system if the original hole they entered through is plugged. Therefore, if you discover your system has been penetrated, one of the first things you need to do is reinstall any changed files. Of course, this raises the question of how you identify the changed files, especially since anybody who's capable of replacing system executables is more than capable of resetting the last-modified date of the files. You can keep an offline copy of the system files; but this is costly and difficult, especially since multiple copies need to be stored for long periods of time. If you don't discover a penetration until several months after it occurred, you may need to roll back the system files to that point in time. Recent backups are likely to have been made after the penetration occurred and thus also likely to be compromised.
As a less threatening example, suppose you want to be notified whenever a particular web page changes. It's not hard to write a robot that connects to the site at periodic intervals, downloads the page, and compares it to a previously retrieved copy for changes. However, if you need to do this for hundreds or thousands of web pages, the space to store the pages becomes prohibitive. Peer-to-peer file sharing applications such as BitTorrent have similar needs. They need to know who's sharing which files without transferring the complete files. After a file is transferred, it must be checked to make sure it wasn't corrupted in transit.
All these tasks need a way to compare files at different times without storing the files themselves. A hash function reads an indefinite number of sequential bytes and assigns a number to that sequence of bytes. This number is called a hash code or digest. The size of the number depends on the hash function. It is not necessarily the same size as any Java primitive data type like int or long. For instance, digests calculated with the SHA algorithm are 20-byte numbers. You can store the digests of the files, then compare the digests. The digests are generally much smaller than the files themselves.
Hash functions are also used in digital signatures. To prove that you authored a document, you first calculate the hash function for the message, then encrypt the hash code with your private key. To check your signature, the recipient of the message decrypts the hash code with your public key and compares it to a new hash code they calculate for the document with the same hash function. If the two codes match, then only someone who knew your private key could have signed the message. Although you could simply encrypt the entire message with your private key rather than a hash code, public key algorithms are rather slow, and encrypting a 20-byte hash code is much faster than encrypting even a short email message. In Java, digital signatures are implemented through the java.security.Signature class. I won't talk much about that class in this book, but it is dependent on the MessageDigest classes I will discuss.
12.1.1. Requirements for Hash Functions
There are better and worse hash functions. Strong hash functions make it extremely unlikely that two different documents share a hash value. Furthermore, hash functions used for cryptography must be one-waythat is, given a hash code, you should not be able to create a document with that hash code. A strong one-way hash function must meet several related criteria. These criteria include:
The same document always has the same hash code. The hash code does not depend on the time it's calculated, a random number, or anything other than the sequence of bytes in the document. Without this requirement, the same document could have different hash codes at different times, indicating that documents had changed when in fact they hadn't.
Given any sample of the documents you wish to track, all hash codes are equally likely. For instance, if the hash code is a 64-bit long, even and odd numbers should be equally likely.
Impossible to reverse engineer
There should be no means easier than brute force to produce a document that matches a certain hash code. For instance, if I know the hash code is 9,423,456,789, I shouldn't be able to then create a file that happens to have that exact hash code.
It should be difficult to find two documents that share a hash code. You cannot easily find two documents with the same hash code, regardless of what that hash code is. The previous criterion means that you can't change the document to match a hash code. This criterion says you can't change two documents to match each other.
Sensitive dependence on initial conditions
Small changes in a document produce large changes in its hash code. Without this requirement, somebody attempting to create a document with a given hash code could modify the document a little at a time until the hash code matched, much as you might adjust the hot and cold water faucets gradually until the water reaches a desired temperature. A hash function should act more like a faucet that can scald or freeze you after the tiniest nudge.
The hash code does not say anything about the document it represents. The one-way hash function is not even partially invertible. For instance, knowing that the hash code is even should not suggest that the document being hashed contains an even number of bytes. Nor should it suggest that the document being hashed is 60% more likely to contain an even number of bytes than an odd number. While one-way hash functions need to be reproduciblethat is, the same document always has the same hash codethey should otherwise be completely random. It is extremely hard, perhaps impossible, to prove that any function meets this criterion. Nonetheless, stronger functions come closer than weaker functions; and years of experience among cryptographers allow them to make reasonable guesses about what are and are not strong hash functions, even if their hunches can't be proved to a mathematical certainty.
The proper design of one-way hash functions is a well-studied field. It's easy to create weak one-way hash functions. However, it is much harder to create truly strong, reliable, one-way hash functions. Nonexperts tend to make nonobvious but serious mistakes when implementing hash functions. Therefore, this is a task that's best left to the experts. Fortunately, the Java core API contains some hash functions designed by experts that the rest of us can use without earning a PhD in applied mathematics first.
Streams in Memory
The File System
Working with Files
File Dialogs and Choosers
Character Sets and Unicode
Readers and Writers
Formatted I/O with java.text
The Java Communications API
The J2ME Generic Connection Framework