Prerequisites | Professional XML Development with Apache Tools: Xerces, Xalan, FOP, Cocoon, Axis, Xindice (Wrox Professional Guides)

To get the most out of this chapter, you need to be familiar with XML, XML namespaces, and the DOM APIs. You also need to know some of the basics of cryptography. We’ll go over the concepts at a basic level so you have an idea of what’s going on. If you’re familiar with cryptographic concepts, then you can skip ahead to the next section.

One-Way Hashing

A hash function is a function that takes an array consisting of an arbitrary number of bytes as input and converts it to an output array of fixed size. The output array is usually shorter than the input array. This is a loose definition, but if you’ve dealt with hashing, this is exactly the way hash functions work.

A one-way hash function is a hash function that works in only one direction: it should be very difficult, if not almost impossible, to obtain the input array from the hash value. Because the output has a fixed size, different inputs that share a hash value must exist, but with a good one-way hash function it’s computationally infeasible to find two such inputs. This makes one-way hash functions a good way to summarize or fingerprint some data.

In the cryptography world, the terms one-way hash function, cryptographic checksum, and message digest all mean the same thing. These functions are used to check whether data has been tampered with. You do this by taking the data and computing the hash value for it. Then you transmit the data and the hash value to someone else. The recipient takes the data, computes the hash value for it (the hash algorithm is usually public), and compares the computed hash value with the received hash value. If they’re the same, the data hasn’t been altered. If they’re different, then the data has been altered, because it’s infeasible to find a second input that yields the same hash value.
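The check described above can be sketched in a few lines. This is a minimal illustration that uses SHA1 from Python’s standard `hashlib` module; the function names `fingerprint` and `is_unaltered` and the sample messages are assumptions made for the example, not part of any library.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Return the SHA1 hash value of data as a hex string."""
    return hashlib.sha1(data).hexdigest()

def is_unaltered(data: bytes, received_digest: str) -> bool:
    """Recompute the hash value and compare it with the one received."""
    return fingerprint(data) == received_digest

# Sender side: compute the hash value and send it along with the data.
message = b"Transfer 100 dollars to Alice"
digest = fingerprint(message)

# Recipient side: recompute and compare.
print(len(digest) * 4)                                         # 160 (bits)
print(is_unaltered(message, digest))                           # True
print(is_unaltered(b"Transfer 900 dollars to Alice", digest))  # False
```

Note that a bare hash value only helps if an attacker can’t replace the hash value along with the data, which is what keyed hashes and signatures address.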

You may also see the term message authentication code (MAC). This is a one-way hash function that’s combined with a secret key. Unless the recipient has the secret key, they can’t verify the hash value. An easy way to implement a MAC is to compute a one-way hash value and then encrypt the hash value with a symmetric key encryption scheme.
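As a rough sketch of a MAC in use, the following example relies on HMAC, the standard keyed-hash construction in Python’s `hmac` module, rather than the hash-then-encrypt approach mentioned above. The helper names `make_mac` and `verify_mac` and the sample key are assumptions chosen for illustration.

```python
import hashlib
import hmac

def make_mac(key: bytes, data: bytes) -> bytes:
    """Compute an HMAC-SHA1 tag over data using the shared secret key."""
    return hmac.new(key, data, hashlib.sha1).digest()

def verify_mac(key: bytes, data: bytes, received_mac: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(make_mac(key, data), received_mac)

shared_key = b"a secret both parties know"   # hypothetical key
data = b"Ship 40 units to warehouse 7"
tag = make_mac(shared_key, data)

print(verify_mac(shared_key, data, tag))          # True
print(verify_mac(b"some other key", data, tag))   # False
```

Anyone without the key can still read the data, but can’t produce a tag that the recipient will accept.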

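Common one-way hashing algorithms are Secure Hash Algorithm 1 (SHA1) and Message-Digest algorithm 5 (MD5). SHA1 has been the algorithm of choice because MD5 is based on the MD4 algorithm, which has been broken. SHA1 produces a 160-bit hash value, whereas MD5 produces a 128-bit hash value. Practical collision attacks have since been demonstrated against both MD5 and SHA1, so new designs generally prefer the SHA-2 family.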

Symmetric Key Encryption

Symmetric key encryption is also known as secret key encryption. With these encryption schemes, you can exchange data securely with anyone who knows the encryption key. The details of the encryption algorithm can be well known, but as long as the key is a secret, your data will be safe. When people say encryption, they’re usually referring to symmetric key encryption. For a well-designed algorithm, strength depends largely on the size of the encryption key: the longer the key, the harder a brute-force search becomes.

Some common symmetric key algorithms are DES, Triple DES (or 3-DES), Advanced Encryption Standard (AES), CAST, International Data Encryption Algorithm (IDEA), and Blowfish. DES is the U.S. Government’s Data Encryption Standard, which until recently was the algorithm approved for all U.S. Government encryption. Triple DES is just what it sounds like: The data to be encrypted is processed with the DES algorithm three times. DES was replaced by the Advanced Encryption Standard (AES), which NIST selected in 2000 and published as a standard in 2001. IDEA is considered a very secure symmetric key algorithm; CAST was designed in Canada and has been used in some commercial products. Blowfish was designed by Bruce Schneier of Counterpane and is in the public domain. Blowfish has been widely implemented and will appear in the Linux 2.6 series kernels.

Public Key Encryption

Public key encryption uses two keys: a private key, which is kept secret, and a public key, which is distributed to whoever wants it. In public key systems, the private key and public key are generated together. Public key encryption works like this: A person who wants to send a message to the holder of a private key takes that person’s public key and uses it to encrypt a message. The only way to decrypt a message that has been encrypted with that public key is to use the corresponding private key. If everyone has a public key and a private key, then you have a system in which a person can send secret messages to any other person.
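To make the two-key idea concrete, here is a toy sketch of “textbook” RSA using tiny primes. It’s for illustration only: the primes 61 and 53, the exponent 17, and the function names are assumptions chosen to keep the numbers readable, and real systems use very large primes together with a padding scheme.

```python
# Toy "textbook" RSA. Not secure: tiny primes, no padding.

def generate_key_pair(p: int, q: int, e: int = 17):
    """Derive a (public, private) key pair from two primes p and q."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)        # modular inverse of e (needs Python 3.8+)
    return (e, n), (d, n)

def encrypt(public_key, m: int) -> int:
    """Anyone holding the public key can encrypt a number m < n."""
    e, n = public_key
    return pow(m, e, n)

def decrypt(private_key, c: int) -> int:
    """Only the private key holder can recover m."""
    d, n = private_key
    return pow(c, d, n)

public_key, private_key = generate_key_pair(61, 53)
ciphertext = encrypt(public_key, 65)
print(ciphertext)                        # 2790
print(decrypt(private_key, ciphertext))  # 65
```

The two keys are generated together from the same primes, which is why the private key can undo what the public key does.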

Public key encryption algorithms are much slower than symmetric key algorithms. For this reason, in practice public key methods are used to safely communicate a shared key for a symmetric encryption algorithm. Once the symmetric key has been safely transmitted, then the two parties can communicate via symmetric key encryption. Two of the most popular public key encryption algorithms are the RSA algorithm originally developed at MIT and the ElGamal algorithm.
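The key-exchange pattern described above can be sketched as follows. Python’s standard library has no AES or RSA implementation, so both parts are toy stand-ins and should be read as assumptions: textbook RSA built from two small Mersenne primes, and a made-up XOR keystream derived from SHA-256 in place of a real symmetric cipher. The function names `send` and `receive` are hypothetical as well.

```python
import hashlib
import secrets

# Toy RSA key pair for the recipient (illustration only, not secure).
P, Q, E = 2**61 - 1, 2**31 - 1, 65537
N = P * Q
D = pow(E, -1, (P - 1) * (Q - 1))        # private exponent (Python 3.8+)

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher: XOR data with a SHA-256-derived keystream.
    The same call both encrypts and decrypts."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(a ^ b for a, b in zip(chunk, block))
    return bytes(out)

def send(message: bytes):
    """Sender: pick a fresh symmetric key, wrap it with the public key."""
    session_key = secrets.randbelow(N - 2) + 2
    wrapped_key = pow(session_key, E, N)             # slow step, done once
    ciphertext = keystream_xor(session_key.to_bytes(12, "big"), message)
    return wrapped_key, ciphertext

def receive(wrapped_key: int, ciphertext: bytes) -> bytes:
    """Recipient: unwrap the symmetric key with the private key, then decrypt."""
    session_key = pow(wrapped_key, D, N)
    return keystream_xor(session_key.to_bytes(12, "big"), ciphertext)

wrapped, ct = send(b"Meet me at the usual place at noon.")
print(receive(wrapped, ct))
```

Only the short session key goes through the expensive public key operation; the message itself, however long, is handled by the fast symmetric step.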

Digital Signatures

The goals of digital signature algorithms are fairly straightforward. The recipient of digitally signed data must be sure the signature is actually the signature of the person who claims to have signed the document, and that signature should be difficult, if not impossible, to forge or reuse. It should be impossible for someone to digitally sign a document and then later claim that they didn’t sign the document. Finally, a digital signature is no good unless you can be sure the data it’s attached to hasn’t been tampered with.

One method of implementing digital signatures is to use public key encryption. We’ll explain how this works using public key encryption terminology, but you should be aware that not every public key encryption algorithm can be used this way. In some public key algorithms, you can use the private key to encrypt data, which can only be decrypted using the corresponding public key (this is in addition to the normal usage of the public and private keys).

The digital signature protocol works like this. To digitally sign some data, the signer encrypts the data using their private key. Verifying the signature is simple: The recipient of the signed data uses the sender’s public key to decrypt the data.

This neatly satisfies all the requirements for a digital signature algorithm: Only one private key can be used to encrypt the data and have it still be decryptable with the public key. This means you know which key signed the data. (This is the best you can do. If the signer leaves their computer unprotected, then it’s conceivable that someone else could use their private key to generate a signature; such attacks are beyond what digital signatures can protect against.) You can’t generate the signature without having possession of the private key, making it difficult to forge the signature. Because the value of the signature depends on the data being encrypted by the private key, the signature can’t be reused for other data. It’s hard to claim that a document wasn’t signed by a particular private key, because verification with the public key indicates which private key was used to generate the signature. Finally, modifying the encrypted data means that it can’t be verified with the public key, so you’ll be able to tell whether the data has been tampered with.

One important issue that arises from the use of public key encryption for digital signatures is how to obtain and trust a public key. The distribution of public keys is a big area, and we’ll just point out that the ITU’s X.509 standard is helping to solve this problem.

The issue of how to trust a public key takes us to the topic of certificates. A digital certificate is a public key, together with information identifying its owner, that has been digitally signed by a trusted third party. This third party is often known as a Certification Authority (CA). The CAs are supposed to be well known and highly trustworthy entities. You’ve probably heard of VeriSign, Entrust, or Thawte, names that appear when you’re dealing with SSL certificates.

Remember that public key encryption is computationally expensive. One way to improve the practical efficiency of computing digital signatures is to sign a one-way hash value of the data being signed. The one-way hash is usually much smaller than the data being signed, and it’s less expensive to sign the hash value than to sign the data. Using this method, verifying a signature involves the following steps: independently generate the hash value for the received data, verify the signed hash value, and test the two hash values for equality.
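The hash-then-sign steps above can be sketched with the same kind of toy arithmetic. This is an illustration under stated assumptions, not a usable signature scheme: the key is textbook RSA built from two small Mersenne primes, the SHA1 hash value is simply reduced modulo the toy modulus, and the names `hash_to_int`, `sign`, and `verify` are hypothetical.

```python
import hashlib

# Toy RSA key pair for the signer (illustration only, not secure).
P, Q, E = 2**61 - 1, 2**31 - 1, 65537
N = P * Q
D = pow(E, -1, (P - 1) * (Q - 1))        # private exponent (Python 3.8+)

def hash_to_int(data: bytes) -> int:
    """SHA1 hash value of data, reduced so it fits the toy modulus."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big") % N

def sign(data: bytes) -> int:
    """Signer: apply the private key to the hash value, not to the data."""
    return pow(hash_to_int(data), D, N)

def verify(data: bytes, signature: int) -> bool:
    """Verifier: hash the received data, recover the signed hash value
    with the public key, and test the two for equality."""
    return pow(signature, E, N) == hash_to_int(data)

document = b"I agree to pay 100 dollars."
signature = sign(document)

print(verify(document, signature))                         # True
print(verify(b"I agree to pay 900 dollars.", signature))   # False
```

However long the document is, the private key operation always works on a value the size of the hash, which is where the efficiency gain comes from.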

Two digital signature algorithms are in wide use today: the RSA public key cryptosystem and the Digital Signature Algorithm (DSA) proposed by the U.S. National Institute of Standards and Technology (NIST).