[Page 346 (continued)] In this appendix, we derive the mathematical justification for the birthday attack. We begin with a related problem and then look at the problem from which the name "birthday attack" is derived. Related Problem A general problem relating to hash functions is the following. Given a hash function H, with n possible outputs and a specific value H(x), if H is applied to k random inputs, what must be the value of k so that the probability that at least one input y satisfies H(y) = H(x) is 0.5? For a single value of y, the probability that H(y) = H(x) is just 1/n. Conversely, the probability that H(y) H(n)]. If we generate k random values of y, then the probability that none of them match is just the product of the probabilities that each individual value does not match, or [1 (1/n)]k. Thus, the probability that there is at least one match is 1 [1 (1/n)]k. The binomial theorem can be stated as follows: For very small values of a, this can be approximated as (1 ka). Thus, the probability of at least one match is approximated as 1 [1 (1/n)]k 1 [1 (n)] = k/n. For a probability of 0.5, we have k = n/2. In particular, for an m-bit hash code, the number of possible codes is 2m and the value of k that produces a probability of one-half is Equation 11-1 [Page 347]The Birthday Paradox The birthday paradox is often presented in elementary probability courses to demonstrate that probability results are sometimes counterintuitive. The problem can be stated as follows: What is the minimum value of k such that the probability is greater than 0.5 that at least two people in a group of k people have the same birthday? Ignore February 29 and assume that each birthday is equally likely. To answer, let us define P(n, k) = Pr[at least one duplicate in k items, with each item able to take on one of n equally likely values between 1 and n] Thus, we are looking for the smallest value of k such that P(365, k) 0.5. It is easier first to derive the probability that there are no duplicates, which we designate as Q(365, k 365, then it is impossible for all values to be different. So we assume 365. Now consider the number of different ways, k values with no duplicates. We may choose any of the 365 values for the first item, any of the remaining 364 numbers for the second item, and so on. Hence, the number of different ways is Equation 11-2 If we remove the restriction that there are no duplicates, then each item can be any of 365 values, and the total number of possibilities is 365k. So the probability of no duplicates is simply the fraction of sets of values that have no duplicates out of all possible sets of values: and Equation 11-3 This function is plotted in Figure 11.10. The probabilities may seem surprisingly large to anyone who has not considered the problem before. Many people would guess that to have a probability greater than 0.5 that there is at least one duplicate, the number of people in the group would have to be about 100. In fact, the number is 23, with P(365, 23) = 0.5073. For k = 100, the probability of at least one duplicate is 0.9999997. Figure 11.10. The Birthday Paradox (This item is displayed on page 348 in the print version) Perhaps the reason that the result seems so surprising is that if you consider a particular person in a group, the probability that some other person in the group has the same birthday is small. But the probability that we are concerned with is the probability that any pair of people in the group has the same birthday. In a group of 23, there are (23(23 1))/2 = 253 different pairs of people. Hence the high probabilities. Useful Inequality Before developing a generalization of the birthday problem, we derive an inequality that will be needed: Equation 11-4 [Page 348]Figure 11.11 illustrates the inequality. To see that the inequality holds, note that the lower line is the tangent to ex at x = 0. at The slope of that line is just the derivative of ex at x = 0; Figure 11.11. A Useful Inequality [Page 349]The tangent is a straight line of the form ax + b, with a = 1, and the tangent at x = o must equal eo Thus, the tangent is the function (1 x), confirming the inequality of Equation (11.4). Further, note that for small x, we have (1 x) eThe General Case of Duplications The birthday problem can be generalized to the following problem: Given a random variable that is an integer with uniform distribution between 1 and n and a selection of k instances (k n, k), that there is at least one duplicate? The birthday problem is just the special case with n = 365. By the same reasoning as before, we have the following generalization of Equation (11.3): Equation 11-5 We can rewrite as Using the inequality of Equation (11.4): Now let us pose the question: What value of k is required such that P(n, k) 0.5? To satisfy the requirement, we have For large k, we can replace k x (k 1) by k2, and we get Equation 11-6 As a reality check, for n = 365, we get which is very close to the correct answer of 23. We can now state the basis of the birthday attack in the following terms. Suppose we have a function H, with 2m possible outputs (i.e., an m-bit output). If H is applied to k random inputs, what must be the value of k so that there is the probability of at least one duplicate [i.e., H(x) = H(y) for some inputs x, y)]? Using the approximation in Equation (11.6): [Page 350] Equation 11-7 Overlap between Two Sets There is a problem related to the general case of duplications that is also of relevance for our discussions. The problem is this: Given an integer random variable with uniform distribution between 1 and n and two sets of k instances (k n, k), that the two sets are not disjoint; that is, what is the probability that there is at least one value found in both sets? Let us call the two sets X and Y, with elements {x1, x2,..., xk} and {y1, y2,..., yk}, respectively. Given the value of x1, the probability that y1 = x1 is just 1/n, and therefore probability that does not match x1 is [1 (1/n)]. If we generate the k random values in Y, the probability that none of these values is equal to is [1 (1/n)]k. Thus, the probability that there is at least one match to x1 is 1 [1 (1/n)]k. To proceed, let us make the assumption that all the elements of X are distinct. If n is large and if k is also large (e.g., on the order of ), then this is a good approximation. In fact, there may be a few duplications, but most of the values will be distinct. With that assumption, we can make the following derivation: Using the inequality of Equation (11.4): R(n, k) > 1 (e1/n)k2 R(n, k) > 1 (ek2/n) Let us pose the question: What value of k is required such that R(n, k) > 0.5? To satisfy the requirement, we have Equation 11-8 We can state this in terms related to birthday attacks as follows. Suppose we have a function H, with 2m possible outputs (i.e., an m-bit output). Apply H to k random inputs to produce the set X and again to k additional random inputs to produce the set Y. What must be the value of k so that there is the probability of at least 0.5 that there is a match between the two sets (i.e., H(x) = H(y) for some inputs x X, Y)? Using the approximation in Equation (11.8): |