4.2 Multiplication

Team-Fly

If the individual summands n₁, n₂, n₃, …, n_r are all equal to one and the same integer n, then one calls the addition "multiplication of the integer n by the multiplier r" and sets n₁ + n₂ + n₃ + ··· + n_r = rn.

—Leopold Kronecker, On the Idea of Number

Multiplication is one of the most critical functions of the entire FLINT/C package due to the computation time required for its execution, since together with division it determines the execution time of many algorithms. In contrast to our experience heretofore with addition and subtraction, the classical algorithms for multiplication and division have execution times that are quadratic in the number of digits of the arguments, and it is not for nothing that Donald Knuth asks in one of his chapter headings, "How fast can we multiply?"

In the literature there have been published various procedures for rapid multiplication of large and very large integers, among which are some rather difficult methods. An example of this is the procedure developed by A. Schönhage and V. Strassen for multiplying large numbers by application of fast Fourier transforms over finite fields. The running time in terms of the number of digits n in the arguments is bounded above by O(n log n log log n) (see [Knut], Section 4.3.3). These techniques encompass the fastest known multiplication algorithms, but their advantage in speed over the classical O (n²) methods comes into play only when the number of binary digits is in the range 8,000–10,000. Based on the demands of cryptographic systems, such numbers, at least for the present, are far beyond the range envisioned in the application domain of our functions.

For our realization of multiplication for the FLINT/C package we would like first to use as a basis the grade school method based on "Algorithm M" given by Knuth (see [Knut], Section 4.3.1), and we shall make an effort to achieve as efficient an implementation of this procedure as possible. Then we shall occupy ourselves in a close examination of the calculation of squares, which offers great potential for savings, and for both cases we shall finally look at the multiplication procedure of Karatsuba, which is asymptotically better than O (n²).^[2] The Karatsuba multiplication arouses our curiosity, since it seems simple, and one could pleasantly occupy a (preferably rainy) Sunday afternoon trying it out. We shall see whether this procedure has anything to contribute to the FLINT/C library.

4.2.1 The Grade School Method

We are considering multiplication of two numbers a and b with representations

click to expand

to the base B. According to the procedure that we learned in school, the product ab can be computed for m = n = 3 as shown in Figure 4.1.

First, the partial products (a₂a₁a₀)_B · b_j for j = 0, 1, 2, are calculated: The values a_ib_j are the least-significant digits of the terms (a_ib_j + carry) with the inner products a_ib_j, and the c_2j are the more-significant digits of p_2j. The partial products are summed at the end to form the product p = (p₅p₄p₃p₂p₁p₀)_B.

In the general case the product p = ab has the value

The result of a multiplication of two operands with m and n digits has at least m + n − 1 and at most m + n digits. The number of elementary multiplication steps (that is, multiplications by factors smaller than the base B) is mn.

A multiplication function that followed exactly the schema outlined above would first calculate all partial products, store these values, and then sum them up, each provided with the appropriate scaling factor. This school method is quite suitable for calculating with pencil and paper, but for the possibilities of a computer program it is somewhat cumbersome. A more efficient alternative consists in adding the inner products a_ib_j at once to the accumulated values in the result digit p_i+j, to which are added the carries c from previous steps. The resulting value for each pair (i, j) is assigned to a variable t:

click to expand
Figure 4.1: Calculations for multiplication

where t can be represented as

and we then have

The current value of the result digit is taken by the assignment p_i+j ← l from this representation of t. As the new carry we set c ← k.

The multiplication algorithm thus consists entirely of an outer loop for calculating the partial products a_i(b_{n − 1} b_{n − 2} … b₀)_B and an inner loop for calculating the inner products a_ib_j, j = 0, …, n − 1, and the values t and p_i + j. The algorithm then appears as follows.

Algorithm for multiplication

Set p_i ← 0 for i = 0, …, n − 1.
Set i ← 0.
Set j ← 0 and c ← 0.
Set t ← p_i+j + a_ib_j + c, p_i+j ← t mod B, and c ← ⌊t/B⌋.
Set j ← j + 1; if j ≤ n − 1, go to step 4.
Set p_i+n ← c.
Set i ← i + 1; if i ≤ m − 1, go to step 3.
Output p = (p_m+n−1p_m+n−2…p₀)_B.

The following implementation of multiplication contains at its core this main loop. Corresponding to the above estimate, in step 4 the lossless representation of a value less than B² in the variable t is required. Analogously to how we proceeded in the case of addition, the inner products t are thus represented as ULONG types. The variable t is nonetheless not used explicitly, and the setting of the result digits p_i+j and the carry c occurs rather within a single expression, analogous to the process already mentioned in connection with the addition function (see page 23). For initialization a more efficient procedure will be chosen than the one shown in step 1 of the algorithm.

Function:	multiplication
Syntax:	int mul_l (CLINT f1_l, CLINT f2_l, CLINT pp_l);
Input:	f1_l, f2_l (factors)
Output:	pp_l (product)
Return:	E_CLINT_OK if all is ok E_CLINT_OFL if overflow

 int mul_l (CLINT f1_l, CLINT f2_l, CLINT pp_l) {   register clint *pptr_l, *bptr_l;   CLINT aa_l, bb_l;   CLINTD p_l;   clint *a_l, *b_l, *aptr_l, *csptr_l, *msdptra_l, *msdptrb_l;   USHORT av;   ULONG carry;   int OFL = E_CLINT_OK;

First the variables are declared; p_l will hold the result and thus is of double length. The ULONG variable carry will hold the carry. In the first step the case is dealt with in which one of the factors, and therefore the product, is zero. Then the factors are copied into the workspaces a_l and b_l, and leading zeros are purged.

  if (EQZ_L (f1_l) || EQZ_L (f2_l))    {      SETZERO_L (pp_l);      return E_CLINT_OK;    }  cpy_l (aa_l, f1_l);  cpy_l (bb_l, f2_l);

According to the declarations the pointers a_l and b_l are given the addresses of aa_l and bb_l, where a logical transposition occurs if the number of digits of aa_l is smaller than that of bb_l. The pointer a_l always points to the operand with the larger number of digits.

    if (DIGITS_L (aa_l) < DIGITS_L (bb_l))      {         a_l = bb_l;         b_l = aa_l;      }    else      {         a_l = aa_l;         b_l = bb_l;      }    msdptra_l = a_l + *a_l;    msdptrb_l = b_l + *b_l;

To save time in the computation, instead of the initialization required above, the partial product (b_{n − 1}b_{n − 2} … b₀)_B · a₀ is calculated in a loop and stored in p_n, p_{n − 1}, …, p₀.

 carry = 0;   av = *LSDPTR_L (a_l);   for (bptr_l = LSDPTR_L (b_l), pptr_l = LSDPTR_L (p_l);                  bptr_l <= msdptrb_l; bptr_l++, pptr_l++)     {       *pptr_l = (USHORT)(carry = (ULONG)av * (ULONG)*bptr_l +                        (ULONG)(USHORT)(carry >> BITPERDGT));     }   *pptr_l = (USHORT)(carry >> BITPERDGT);

Next follows the nested multiplication loop, beginning with the digit a_l[2] of a_l.

    for (csptr_l = LSDPTR_L (p_l) + 1, aptr_l = LSDPTR_L (a_l) + 1;                  aptr_l <= msdptra_l; csptr_l++, aptr_l++)      {        carry = 0;        av = *aptr_l;        for (bptr_l = LSDPTR_L (b_l), pptr_l = csptr_l;                 bptr_l <= msdptrb_l; bptr_l++, pptr_l++)        {           *pptr_l = (USHORT)(carry = (ULONG)av * (ULONG)*bptr_l +               (ULONG)*pptr_l + (ULONG)(USHORT)(carry >> BITPERDGT));        }      *pptr_l = (USHORT)(carry >> BITPERDGT);    }

The largest possible length of the result is the sum of the numbers of digits of a_l and b_l. If the result has one digit fewer, this is determined by the macro RMLDZRS_L.

  SETDIGITS_L (p_l, DIGITS_L (a_l) + DIGITS_L (b_l));  RMLDZRS_L (p_l);

If the result is larger than can be accommodated in a CLINT object, it is reduced, and the error flag OFL is set to the value E_CLINT_OFL. Then the reduced result is assigned to the object pp_l.

   if (DIGITS_L (p_l) > (USHORT)CLINTMAXDIGIT)   /* overflow ? */     {       ANDMAX_L (p_l);   /* reduce modulo (Nmax + 1) */       OFL = E_CLINT_OFL;     }   cpy_l (pp_l, p_l);   return OFL; }

With t = O(mn) the run time t for the multiplication is proportional to the product of the numbers of digits m and n of the operands. For multiplication, too, the analogous mixed function is implemented, which processes a CLINT type and as second argument a USHORT type. This short version of CLINT multiplication requires O(n) CPU multiplications, which is the result not of any particular refinement of the algorithm, but of the shortness of the USHORT argument. Later, we shall set this function implicitly within a special exponentiation routine for USHORT bases (see Chapter 6, the function wmexp_l()).

For the implementation of the umul_l() function we return primarily to a code segment of the mul_l() function and reuse it with a few modifications.

Function:	multiplication of a CLINT type by a USHORT
Syntax:	int umul_l (CLINT aa_l, USHORT b, CLINT pp_l);
Input:	aa_l, b (factors)
Output:	pp_l (product)
Return:	E_CLINT_OK if all is ok E_CLINT_OFL if overflow

 int umul_l (CLINT aa_l, USHORT b, CLINT pp_l) {   register clint *aptr_l, *pptr_l;   CLINT a_l;   CLINTD p_l;   clint *msdptra_l;   ULONG carry;   int OFL = E_CLINT_OK;   cpy_l (a_l, aa_l);   if (EQZ_L (a_l) || 0 == b)     {       SETZERO_L (pp_l);       return E_CLINT_OK;     }

After these preliminaries, the LINT factor is multiplied in a pass through a loop by the USHORT factor, and at the end the carry is stored in the most-significant USHORT digit of the CLINT value.

  msdptra_l = MSDPTR_L (a_l);  carry = 0;  for (aptr_l = LSDPTR_L (a_l), pptr_l = LSDPTR_l (p_l);                  aptr_l <= msdptra_l; aptr_l++, pptr_l++)    {      *pptr_l = (USHORT)(carry = (ULONG)b * (ULONG)*aptr_l +                     (ULONG)(USHORT)(carry >> BITPERDGT));    }  *pptr_l = (USHORT)(carry >> BITPERDGT);    SETDIGITS_L (p_l, DIGITS_L (a_l) + 1);    RMLDZRS_L (p_l);    if (DIGITS_L (p_l) > (USHORT)CLINTMAXDIGIT)   /* overflow ? */      {        ANDMAX_L (p_l);      /* reduce modulo (Nmax + 1) */        OFL = E_CLINT_OFL;      }    cpy_l (pp_l, p_l);    return OFL; }

4.2.2 Squaring Is Faster

The calculation of a large square is accomplished with significantly fewer multiplications than in the case of the multiplication of large numbers. This is a result of the symmetry in the multiplication of identical operands. This observation is very important, since when it comes to exponentiation, which involves not one, but hundreds, of squarings, we shall be able to achieve considerable savings in speed. We again look at the well-known multiplication schema, this time with two identical factors (a₂a₁a₀)_B (see Figure 4.2).

We recognize that the inner products a_ia_j for i = j appear once (in boldface in Figure 4.2) and twice for i ≠ j (in boxes in the figure). Thus we can save three out of nine multiplications by multiplying the sum a_ia_jB^i+j for i < j by 2. The sum of the inner products of a square can then be written as

The number of required elementary multiplications is thus reduced with respect to the school method from n² to n(n + 1)/2.

click to expand
Figure 4.2: Calculations for squaring

A natural algorithmic representation of squaring calculates the above expression with the two summands in two nested loops.

Algorithm 1 for squaring

Set p_i ← 0 for i = 0, …, n − 1.
Set i ← 0.
Set t ← p_2i + , p_2i ← t mod B, and c ← ⌊t/B⌋.
Set j ← i + 1.
Set t ← p_i+j + 2a_ia_j + c, p_i+j ← t (mod B), and c ← ⌊t/B⌋.
Set j ← j + 1; if j ≤ n − 1, go to step 5.
Set p_i+n ← c.
Set i ← i + 1; if i ≤ n − 1, go to step 3.
Output p = (p_{2n − 1}p_{2n − 2} … p₀)_B.

In selecting the necessary data types for the representation of the variables we must note that t can assume the value

(in step 5 of the algorithm). But this means that for representing t to base B more than two digits to base B will be needed, since we also have B² − 1 < 2B² − 2B < 2B² − 1, and so a ULONG will not suffice for representing t (the inequality above is derived from the fact that one additional binary digit is needed). While this poses no problem for an assembler implementation, in which one has access to the carry bit of the CPU, it is difficult in C to handle the additional binary digit. To get around this dilemma, we alter the algorithm in such a way that in step 5 the required multiplication by 2 is carried out in a separate loop. It is then required that step 3 be carried out in its own loop, whereby for a slight extra expenditure of effort in loop management we are spared the additional binary digit. The altered algorithm is as follows.

Algorithm 2 for squaring

Initialization: Set p_i ← 0 for i = 0, …, n − 1.
Calculate the product of digits of unequal index: Set i ← 0.
Set j ← i + 1 and c ← 0.
Set t ← p_i+j + a_ia_j + c, p_i+j ← t mod B, and c ← ⌊t/B⌋.
Set j ← j + 1; if j ≤ n − 1, go to step 4.
Set p_i+n ← c.
Set i ← i + 1; if i ≤ n − 2, go to step 3.
Multiplication of inner products by 2: Set i ← 0 and c ← 0.
Set t ← 2p_i + c, p_i ← t mod B, and c ← ⌊t/B⌋.
Set i ← i + 1; if i ≤ 2_n − 2, go to step 9.
Addition of the inner squares: Set i ← 0 and c ← 0.
Set t ← p_2i + + c, p_2i ← t mod B, and c ← ⌊t/B⌋.
Set t ← p_2i+1 + c, p_2i+1 ← t mod B, and c ← ⌊t/B⌋.
Set i ← i + 1; if i ≤ n − 1, go to step 9.
Set p_2n−1 c; output p = (p_{2n − 1}p_{2n − 2} … p₀)_B.

In the C function for squaring the initialization in step 1 is likewise, in analogy to multiplication, replaced by the calculation and storing of the first partial product a₀(a_{n − 1}a_{n − 2} … a₁)_B.

Function:	squaring
Syntax:	int sqr_l (CLINT f_l, CLINT pp_l);
Input:	f_l (factor)
Output:	pp_l (square)
Return:	E_CLINT_OK if all is ok E_CLINT_OFL if overflow

 int sqr_l (CLINT f_l, CLINT pp_l) {   register clint *pptr_l, *bptr_l;   CLINT a_l;   CLINTD p_l;   clint *aptr_l, *csptr_l, *msdptra_l, *msdptrb_l, *msdptrc_l;   USHORT av;   ULONG carry;   int OFL = E_CLINT_OK;   cpy_l (a_l, f_l);   if (EQZ_L (a_l))     {       SETZERO_L (pp_l);       return E_CLINT_OK;     }   msdptrb_l = MSDPTR_L (a_l);   msdptra_l = msdptrb_l - 1;

The initialization of the result vector addressed by pptr_l is carried out by means of the partial product a₀(a_{n − 1}a_{n − 2} … a₁)_B, in analogy with multiplication. The digit p₀ is here not assigned; it must be set to zero.

  *LSDPTR_L (p_l) = 0;  carry = 0;  av = *LSDPTR_L (a_l);  for (bptr_l = LSDPTR_L (a_l) + 1, pptr_l = LSDPTR_L (p_l) + 1;                  bptr_l <= msdptrb_l; bptr_l++, pptr_l++)    {      *pptr_l = (USHORT)(carry = (ULONG)av * (ULONG)*bptr_l +                       (ULONG)(USHORT)(carry >> BITPERDGT));    }  *pptr_l = (USHORT)(carry >> BITPERDGT);

The loop for summing the inner products a_ia_j.

  for (aptr_l = LSDPTR_L (a_l) + 1, csptr_l = LSDPTR_L (p_l) + 3;                           aptr_l <= msdptra_l; aptr_l++, csptr_l += 2)    {      carry = 0;      av = *aptr_l;      for (bptr_l = aptr_l + 1, pptr_l = csptr_l; bptr_l <= msdptrb_l;                  bptr_l++, pptr_l++)         {           *pptr_l = (USHORT)(carry = (ULONG)av * (ULONG)*bptr_l +                (ULONG)*pptr_l + (ULONG)(USHORT)(carry >> BITPERDGT));         }      *pptr_l = (USHORT)(carry >> BITPERDGT);    }  msdptrc_l = pptr_l;

Then comes multiplication of the intermediate result in pptr_l by 2 via shift operations (see also Section 7.1).

  carry = 0;  for (pptr_l = LSDPTR_L (p_l); pptr_l <= msdptrc_l; pptr_l++)    {      *pptr_l = (USHORT)(carry = (((ULONG)*pptr_l) << 1) +                           (ULONG)(USHORT)(carry >> BITPERDGT));    }  *pptr_l = (USHORT)(carry >> BITPERDGT);

Now we compute the "main diagonal."

    carry = 0;    for (bptr_l = LSDPTR_L (a_l), pptr_l = LSDPTR_L (p_l);                  bptr_l <= msdptrb_l; bptr_l++, pptr_l++)    {      *pptr_l = (USHORT)(carry = (ULONG)*bptr_l * (ULONG)*bptr_l +                     (ULONG)*pptr_l + (ULONG)(USHORT)(carry >> BITPERDGT));      pptr_l++;      *pptr_l = (USHORT)(carry = (ULONG)*pptr_l + (carry >> BITPERDGT));    }

All the rest follows in analogy to multiplication.

  SETDIGITS_L (p_l, DIGITS_L (a_l) << 1);  RMLDZRS_L (p_l);  if (DIGITS_L (p_l) > (USHORT)CLINTMAXDIGIT)   /* overflow ? */    {      ANDMAX_L (p_l);    /* reduce modulo (Nmax + 1) */      OFL = E_CLINT_OFL;    }  cpy_l (pp_l, p_l);  return OFL;  }

The run time for squaring is, with O (n²), likewise quadratic in the number of digits of the operators, but with n(n + 1)/2 elementary multiplications it is about twice as fast as multiplication.

4.2.3 Do Things Go Better with Karatsuba?

The antispirit of multiplication and division deconstructed everything and then focused only on a specific part of the whole.

—Sten Nadolny (trans. Breon Mitchell), God of Impertinence

As announced, we shall now consider a method of multiplication named for the Russian mathematician A. Karatsuba, who has published several variants of it (See [Knut], Section 4.3.3). We assume that a and b are natural numbers with n = 2k digits to base B, so that we can write a = and b = with digits a₀ and a₁, respectively b₀ and b₀, to base B^k. Were we to multiply a and b in the traditional manner, then we would obtain the expression

with four multiplications to base B^k and thus n² = 4k² elementary multiplications to base B. However, if we set

then we have

For calculating ab it now appears that only three more multiplications by numbers to base B^k, or 3k² multiplications to base B, are necessary, in addition to some additions and shifting operations (multiplication by B^k can be accomplished by left shifting by k digits to base B; see Section 7.1). Let us assume that the number of digits n of our factors a and b is a power of 2, with the result that by recursive application of the procedure on the remaining partial products we can end with having to carry out only elementary multiplications to base B, and this yields a total of elementary multiplications, as opposed to n² in the classical procedure, in addition to the time for additions and shift operations.

For squaring, this process can be simplified somewhat: With

we have

Furthermore, to our advantage, the factors in the squaring always have the same number of digits, which is not generally the case in multiplication. With all these advantages, we should, however, mention that recursion within a program function always costs something, so that we may hope to experience a savings in time over the classical method, which manages without the added burden of recursion, only when the numbers get large.

To obtain information on the actual time performance of the Karatsuba procedure the functions kmul() and ksqr() are provided. The division of the factors into two halves takes place in situ, and so a copying of the halves is unnecessary. But it is necessary that the functions be passed pointers to the least-significant digits of the factors and that the numbers of digits be passed separately.

The functions presented below in experimental form use the recursive procedure for factors having more than a certain number of digits determined by a macro, while for smaller factors we turn to conventional multiplication or squaring. For the case of nonrecursive multiplication the functions kmul() and ksqr() use the auxiliary functions mult() and sqr(), in which multiplication and squaring are implemented as kernel functions without the support of identical argument addresses (accumulator mode) or reduction in the case of overflow.

Function:	Karatsuba multiplication of two numbers a_l and b_l with 2k digits each to base B
Syntax:	void kmul (clint aptr_l, clint bptr_l, int len_a, int len_b, CLINT p_l);
Input:	aptr_l (pointer to the least-significant digit of the factor a_l) bptr_l (pointer to the least-significant digit of the factor b_l) len_a (number of digits of a_l) len_b (number of digits of b_l)
Output:	p_l (product)

 void kmul (clint *aptr_l, clint *bptr_l, int len_a, int len_b, CLINT p_l) {   CLINT c01_l, c10_l;   clint c0_l[CLINTMAXSHORT + 2];   clint c1_l[CLINTMAXSHORT + 2];   clint c2_l[CLINTMAXSHORT + 2];   CLINTD tmp_l;   clint *a1ptr_l, *b1ptr_l;   int l2;   if ((len_a == len_b) && (len_a >= MUL_THRESHOLD)                   && (0 == (len_a & 1)) )     {

If both factors possess the same even number of digits above the value MUL_THRESHOLD, then recursion is entered with the splitting of the factors into two halves. The pointers aptr_l, a1ptr_l, bptr_l, b1ptr_l are passed to the corresponding least-significant digits of one of the halves. By not copying the halves, we save valuable time. The values c₀ and c₁ are calculated by recursively calling kmul() and then stored in the CLINT variables c0_l and c1_l.

        l2 = len_a/2;        a1ptr_l = aptr_l + l2;        b1ptr_l = bptr_l + l2;        kmul (aptr_l, bptr_l, l2, l2, c0_l);        kmul (a1ptr_l, b1ptr_l, l2, l2, c1_l);

The value c₂:= (a₀ + a₁) (b₀ + b₁) − c₀ − c₁ is computed with two additions, a call to kmul(), and two subtractions. The auxiliary function addkar() takes pointers to the least-significant digits of two equally long summands together with their number of digits, and outputs the sum of the two as a CLINT value.

        addkar (a1ptr_l, aptr_l, l2, c01_l);        addkar (b1ptr_l, bptr_l, l2, c10_l);        kmul (LSDPTR_L (c01_l), LSDPTR_L (c10_l),                    DIGITS_L (c01_l), DIGITS_L (c10_l), c2_l);        sub (c2_l, c1_l, tmp_l);        sub (tmp_l, c0_l, c2_l);

The function branch ends with the calculation of B^k (B^kc₁ + c₂) + c₀, which used the auxiliary function shiftadd(), which during the addition left shifts the first of the two CLINT summands by a given number of places to base B.

         shiftadd (c1_l, c2_l, l2, tmp_l);         shiftadd (tmp_l, c0_l, l2, p_l);      }

If one of the input conditions is not fulfilled, the recursion is interrupted and the nonrecursive multiplication mult() is called. As a requirement for calling mult() the two factor halves in aptr_l and bptr_l are brought into CLINT format.

  else    {      memcpy (LSDPTR_L (c1_l), aptr_l, len_a * sizeof (clint));      memcpy (LSDPTR_L (c2_l), bptr_l, len_b * sizeof (clint));      SETDIGITS_L (c1_l, len_a);      SETDIGITS_L (c2_l, len_b);      mult (c1_l, c2_l, p_l);      RMLDZRS_L (p_l);    } }

The Karatsuba squaring process proceeds analogously to this and will not be described in detail. For calling kmul() and ksqr() we make use of the functions kmul_l() and ksqr_l(), which are equipped with the standard interface.

Function:	Karatsuba multiplication and squaring
Syntax:	int kmul_l (CLINT a_l, CLINT b_l, CLINT p_l); int ksqr_l (CLINT a_l, CLINT p_l);
Input:	a_l, b_l (factors)
Output:	p_l (product or square)
Return:	E_CLINT_OK if all is ok E_CLINT_OFL if overflow

The implementation of the Karatsuba functions are contained in the source file kmul.c on the included CD-ROM.

Extensive tests with these functions (on a Pentium III processor at 500 MHz under Linux) have given best results when the nonrecursive multiplication routine is called for a digit count under 40 (corresponding to 640 binary digits). The computation time of our implementation appears in Figure 4.3).

click to expand
Figure 4.3: CPU time for Karatsuba multiplication

We conclude from this overview what we expected, that between standard multiplication and squaring there is a difference in performance of about 40 percent, and that for numbers of over 2000 binary digits a pronounced spread of the measured times becomes noticeable, with the Karatsuba routine in the lead. It is interesting to note that "normal" squaring sqr_l() is noticeably faster than Karatsuba multiplication, and Karatsuba squaring ksqr_l() takes the lead only above 3000 binary digits.

The large drop in performance of the Karatsuba functions for smaller numbers that was remarked on in the first edition of this book has in the meantime been eliminated. Yet there is still potential for improvement. The observable discontinuity in the calculation times of kmul_l() indicates that the recursion breaks off earlier than specified by the threshold value if the factors of a recursion step do not have an even number of digits. In the worst case this occurs right at the beginning of the multiplication, and even for very large numbers we are no better off than we were in the standard case. It would seem worthwhile, then, to extend the Karatsuba functions to be able to process arguments with differing numbers of digits and odd numbers of digits.

At the Max Planck Institute in Saarbrücken, J. Ziegler [Zieg] developed a portable implementation of Karatsuba multiplication and squaring for a 64-bit CPU (Sun Ultra-1) that overtakes the conventional method at 640 binary digits. With squaring an improvement in performance of 10% occurred at 1024 binary digits and 23% at 2048 binary digits.

Once again, the Karatsuba functions have no particular advantage for our cryptographic applications without considerable changes, for which reason we shall prefer to fall back on the functions mul_l() and sqr_l(), which realize the conventional procedures. For applications for which the Karatsuba functions seem suited one could simply substitute those functions for mul_l() and sqr_l().

^[2]When we say that the calculation time is asymptotically better, we mean that the larger the numbers in question, the greater the effect. One should not fall victim to premature euphoria, and for our purposes such an improvement may have no significance whatsoever.

Team-Fly