6.5 Integer Multiplication and Division

In Section 6.1.6, we demonstrated how addition can be based on logic. Choosing two's complement representation for signed integers makes subtraction a special case of addition that is relatively easy to implement at the hardware level. That choice avoids complications associated with positive and negative variants of zero in sign and magnitude representation.

In this section, we extend the exploration of arithmetic founded on logic. First we look at an early algorithm that accomplishes multiplication by using addition. Then we look at division problems, where the divisor is a constant. The divisor is converted into a reciprocal to be used in a multiplication operation that is mathematically equivalent to the division.

6.5.1 Booth's Algorithm for Multiplication

Repeated addition can accomplish multiplication (N x X = X + X + X for N = 3, etc.), but it requires an accumulator that is double the width of the operands. Repeated addition using a counted loop takes a long time for very large values of N, and each addition can still generate carries.

Performing "long multiplication" on decimal numbers involves a combination of individual multiplication steps and an overall addition, with multiple carries, that is guided by positional weights (Section 1.8.1). Long multiplication for binary numbers works analogously. The multiplication steps are trivial (either 0 x X or 1 x X), but the "troublesome" carries (Section 6.1.6) are still present.

During the course of his development of an early computer architecture, Andrew Donald Booth first explained a method in which the carries cannot build up. More importantly, the algorithm itself can proceed in identical fashion regardless of the signs or order of the two's complement operands (X = multiplicand, N = multiplier).

Our presentation of Booth's algorithm uses a notational approach introduced by John Peatman, which we demonstrate with 8-bit numbers. The multiplier can be written as:

N = -128ns + 64n6 + 32n5 + 16n4 + 8n3 + 4n2 + 2n1 + n0

where ns is the sign bit. This expression can be rewritten in the following "clever" way:

N = -128(ns - n6) - 64(n6 - n5) - 32(n5 - n4) - 16(n4 - n3) - 8(n3 - n2) - 4(n2 - n1) - 2(n1 - n0) - (n0 - 0)

where all of the terms now have the same form. In an 8-bit number, the sign bit ns is obviously n7.

Since each ni bit is a coefficient taking a value of 0 or 1 in a weighted positional notation, each symbolic difference in parentheses will be +1, -1, or 0. These values constitute the recipe for adding, subtracting, or ignoring each multiple 2^i X needed to compose the product P = N x X.
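
The rewritten expression is easy to verify mechanically. The following illustrative C sketch (not part of the original text) recomputes an 8-bit two's complement value from its Booth digits -(ni - ni-1), taking the bit below n0 as zero:

 #include <assert.h>
 #include <stdint.h>

 /* Reconstruct an 8-bit two's complement value from its Booth recoding:
    N = sum over i of -(ni - ni-1) * 2^i, with the bit below n0 taken as 0. */
 int booth_recoded_value(uint8_t n)
 {
     int value = 0;
     int prev = 0;                            /* plays the role of ni-1 */
     for (int i = 0; i < 8; i++) {
         int bit = (n >> i) & 1;              /* ni */
         value += -(bit - prev) * (1 << i);   /* Booth digit: +1, -1, or 0 */
         prev = bit;
     }
     return value;
 }

 int main(void)
 {
     for (int n = 0; n < 256; n++)            /* every pattern recodes to its signed value */
         assert(booth_recoded_value((uint8_t)n) == (n < 128 ? n : n - 256));
     return 0;
 }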

It is less obvious, but much more important, to observe that adjacent nonignorable operations alternate in sign (addition or subtraction) as a partial product P is being accumulated. Consider the number N = 0b00001011 with any X:

 Criterion
            Starting condition:        P = 0
 10         -1(n0 - 0) = -1            P = -1X
 11         -2(n1 - n0) = 0            no change
 01         -4(n2 - n1) = +4           P = +3X
 10         -8(n3 - n2) = -8           P = -5X
 01         -16(n4 - n3) = +16         P = +11X
 00         -2^i(ni - ni-1) = 0        all the rest yield 0

where the criterion column (examined below) juxtaposes bits ni and ni-1. Note also the special first case of n0 paired with 0. The correct sum is produced for the multiplier N = 11 (decimal) and any X.

Recall that addition of numbers with unlike signs can never exceed the bit-width of the representation. The strict alternation of signs (that is, alternation between addition and subtraction) keeps the partial product P from growing any larger than the width of N. Stated differently, a widening P coupled with a narrowing N always fits in 2w bits, where w is the width of the numeric representation.

The schematic arrangement for the Booth algorithm in Figure 6-4 has the following features: one register of width w bits to hold the multiplicand X; a double register of width 2w bits, split into L and R halves, where L is initialized to zero and R initially contains the multiplier N; and additional storage for a single bit, ni-1, which is the bit most recently shifted out of the LR double register. That single bit is initialized to zero.

Figure 6-4. Register scheme for the Booth algorithm


Shifting L (and R) to the right, relative to a stationary X, is equivalent to writing down the partial products of X that are shifted to the left in standard long multiplication.

The criterion for addition or subtraction can be recognized as the appropriate entry in Table 6-4, corresponding to the current multiplier bit ni and the previous multiplier bit ni-1.

Table 6-4. Operation at Step i in the Booth Algorithm

 Bit ni    Bit ni-1    Action
 0         0           do nothing; then shift right
 0         1           add X; then shift right
 1         0           subtract X; then shift right
 1         1           do nothing; then shift right

With the stated initializations and an understanding of the presentation in Figure 6-4 and Table 6-4, Booth's algorithm simply becomes:

 repeat w times:
     isolate the current ni for use in the next iteration
     L := L+X, L := L-X, or leave L alone    // Table 6-4
     shift the pattern in LR to the right one bit position
     maintain sign extension for the pattern in LR at all times

Carries from the leftmost bit of L are ignored. For N = 11 as given above, this plays out as follows for X = 13 = 0b00001101:

 iteration   L          R          ni-1
             00000000   00001011   0      initialization
 1           11110011   00001011   0      10: subtract X from L
             11111001   10000101   1      shift LR (same sign); update ni-1
 2           11111001   10000101   1      11: do nothing
             11111100   11000010   1      shift LR (same sign); update ni-1
 3           00001001   11000010   1      01: add X to L
             00000100   11100001   0      shift LR (same sign); update ni-1
 4           11110111   11100001   0      10: subtract X from L
             11111011   11110000   1      shift LR (same sign); update ni-1
 5           00001000   11110000   1      01: add X to L
             00000100   01111000   0      shift LR (same sign); update ni-1
 6           00000100   01111000   0      00: do nothing
             00000010   00111100   0      shift LR (same sign); update ni-1
 7           00000010   00111100   0      00: do nothing
             00000001   00011110   0      shift LR (same sign); update ni-1
 8           00000001   00011110   0      00: do nothing
             00000000   10001111   0      shift LR (same sign); update ni-1

with a final result of 0b10001111 = 143 (decimal) = 13 x 11. The L portion is zero only because this particular product is a small positive number.

Can we now convert this algorithm into Itanium instructions using 64-bit operands? Our outline suggests roles for a double-width register, shifting of a double-width intermediate quantity that eventually becomes the product, and a criterion for predicating addition, subtraction, or neither at each stage in a counted loop. Representing the single-bit memory location with register rn, initialized to zero, we express Itanium instructions for the body of the loop:

         and         rt=0x1,rr;;   // Isolate lowest bit of R
         xor         ri=rn,rt;;    // ri = whether to act
         cmp.ne      pi,p0=0,ri    // pi = whether to act
         mov         rn=rt;;       // Save bit for next time
 (pi)    cmp.eq.unc  pa,ps=0,rt;;  // Add, subtract, nop?
 (pa)    add         rl=rl,rx      // Add X to L
 (ps)    sub         rl=rl,rx;;    // Subtract X from L
         shrp        rr=rl,rr,1    // New R of shifted LR
         shr         rl=rl,1       // New L of shifted LR

There are countless ways to implement Booth's algorithm in any language or architecture. Some may offer shorter code sequences based on tight iterative loops, but EPIC principles can significantly improve performance by eliminating the branches inside those loops.
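
As one illustration of this point, a compact C sketch of the same register scheme for 8-bit operands might look as follows (the names are ours, not part of the Itanium code above); l, r, and prev play the roles of L, R, and ni-1 from Figure 6-4:

 #include <assert.h>
 #include <stdint.h>

 /* Booth multiplication of 8-bit two's complement operands using the
    L/R register scheme of Figure 6-4. Carries out of L are discarded. */
 int16_t booth_multiply(int8_t x, int8_t n)
 {
     uint8_t l = 0;                 /* L half, initialized to zero */
     uint8_t r = (uint8_t)n;        /* R half holds the multiplier N */
     unsigned prev = 0;             /* ni-1, initialized to zero */

     for (int i = 0; i < 8; i++) {
         unsigned cur = r & 1;                      /* current multiplier bit ni */
         if (cur == 1 && prev == 0)
             l -= (uint8_t)x;                       /* 10: subtract X from L */
         else if (cur == 0 && prev == 1)
             l += (uint8_t)x;                       /* 01: add X to L */
         prev = cur;                                /* save ni for the next iteration */

         r = (uint8_t)((r >> 1) | ((l & 1) << 7));  /* shift LR right one bit */
         l = (uint8_t)((l >> 1) | (l & 0x80));      /* keep the sign extension of L */
     }
     return (int16_t)(((uint16_t)l << 8) | r);      /* 2w-bit product */
 }

 int main(void)
 {
     assert(booth_multiply(13, 11) == 143);         /* the worked example above */
     assert(booth_multiply(-13, 11) == -143);
     return 0;
 }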

6.5.2 Unsigned Multiplication

We can easily adapt the implementation of Booth's algorithm for circumstances where we know that the two operands are unsigned, rather than signed, integers. In order to see how to make this correction, look back at the leading term in the algebraic sum for the value of the multiplier N (see the beginning of Section 6.5.1).

When the operands are signed, as with the normal implementation of Booth's algorithm, the leading term has the general weighted value -2^(w-1) nw-1, where w is the bit width of the integer representation. When the operands are unsigned, that leading term should instead have the weighted value +2^(w-1) nw-1. The result from the normal implementation of Booth's algorithm therefore needs to be corrected by +2 x 2^(w-1) nw-1 = +2^w nw-1.

In pseudocode, X is added to L after the last shift operation of the loop if and only if the leading bit nw-1 of N is 1:

                                // Unsigned multiplication only
         tbit.nz   pn,p0=rn,0   // If bit 63 of N (the bit saved in rn) is 1,
 (pn)    add       rl=rl,rx     //   add X after last shift
                                // Unsigned multiplication only

Since our development treated X symbolically throughout, the leading (and no longer sign) bit of X can be ignored, as can any carry-out to the left in L.
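
As a quick check of the correction term, the illustrative C sketch below (not from the original text) compares a signed product plus the +2^w nw-1 adjustment against a true unsigned product for 8-bit operands; to keep the check within the scope of this discussion, the multiplicand X is restricted to values with a clear leading bit, as in the divide-by-ten example of the next section:

 #include <assert.h>
 #include <stdint.h>

 /* Verify that (signed product) + 2^8 * n7 * X equals the unsigned product
    for every 8-bit N and every X whose leading bit is clear. */
 int main(void)
 {
     for (unsigned n = 0; n < 256; n++) {
         for (unsigned x = 0; x < 128; x++) {               /* leading bit of X clear */
             int ns = (n < 128) ? (int)n : (int)n - 256;    /* N interpreted as signed */
             uint16_t corrected = (uint16_t)(ns * (int)x);  /* signed Booth-style product */
             if (n & 0x80)                                  /* nw-1 == 1 */
                 corrected += (uint16_t)(x << 8);           /* add X to the high half L */
             assert(corrected == (uint16_t)(n * x));        /* matches the unsigned product */
         }
     }
     return 0;
 }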

6.5.3 Division Using Known Reciprocals

When we introduced integer arithmetic instructions in Chapter 4, we pointed out that Alpha, Itanium, and some other high-performance architectures lack an integer divide instruction. As Bhandarkar has noted, several methods can be used to implement integer division:

  • Use of floating-point hardware to convert from integer into floating-point representation, to divide, and then to convert back. Accuracy might be limited by the precision of the floating-point representation (not a problem for the Itanium ISA);

  • Iterative testing and subtraction to implement algorithms analogous to decimal long division; and

  • Multiplication by the reciprocal of the divisor, and then a shift instruction to normalize the result.

This last method, using a reciprocal, is especially applicable for division by a constant that is known at the time the program is assembled or compiled.

Consider the very common need to divide a binary integer by 10 as part of an algorithm for formatting decimal numbers for printing. The reciprocal of 10 is the repeating fraction 0.0001100110011... (binary). In order to preserve maximum precision, we can instead use the normalized 64-bit fraction 0.CCCCCCCCCCCCCCCD (hexadecimal representation of 0.8) for the multiplication, with a subsequent division by 8. The final hexadecimal digit of the fraction is D rather than C in order to ensure normal rounding, since shifting as a means of division truncates, rather than rounds, the result. According to standard algebra, this division by 8 can be deferred until an extended-precision product has been formed through multiplication of the source number by the normalized fraction. That is, we can use the simple mathematical identities x/10 = 0.1x = 0.8x/8.

We now explain this method in more detail. Since one-tenth is less than one-eighth but greater than one-sixteenth, the representation of 0.1 (decimal) as a binary fraction begins with three zero bits before the first one bit (that is, no one-half, no one-quarter, no one-eighth, one one-sixteenth, etc.). The full binary representation for one-tenth is then found, by a continuation of this process, to contain a repeating bit pattern, …1100…, which corresponds to the hexadecimal character C. With binary numbers and fractions, each shift to the left by one bit position corresponds to a multiplication of the value by a factor of two, and each shift to the right by one bit position corresponds to a division of the value by a factor of two. If we shift the bit pattern representing one-tenth three bit positions to the left, so that those first three zeros are gone, the value becomes 2^3 = 8 times greater; thus the resulting 0.CCC… hexadecimal pattern represents the value eight tenths.
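
That continuation can be carried out mechanically, exactly as in long division by hand: double the remainder, emit a 1 bit whenever it reaches ten, and subtract ten. A small illustrative C sketch (not from the original text):

 #include <stdio.h>

 /* Print the first 20 bits of the binary fraction for 1/10 by repeatedly
    doubling the remainder, as in long division by hand. */
 int main(void)
 {
     unsigned remainder = 1;                /* numerator of 1/10 */
     printf("0.");
     for (int i = 0; i < 20; i++) {
         remainder *= 2;                    /* move to the next fraction bit */
         if (remainder >= 10) {
             putchar('1');
             remainder -= 10;
         } else {
             putchar('0');
         }
     }
     putchar('\n');                         /* prints 0.00011001100110011001 */
     return 0;
 }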

When we wish to divide a value x by ten, we first multiply that value by 0.8 to form an intermediate product 0.8x of highest precision. Then we divide the intermediate product 0.8x by 8 to obtain a result equivalent to x/10. Division by 8 is easily accomplished on any computer that supports an arithmetic shift to the right by three bit positions.

This algorithm is quite easily expressed using Itanium instructions, with the source number in register rx. We imagine some function, umpy8hi, which implements either Booth's algorithm or a sequence of Itanium floating-point instructions to perform 8-byte unsigned multiplication while retaining the high-order portion of a full 128-bit product. For reasons that will unfold below, we need an instruction that yields the high portion. Bhandarkar's suggestion can be expressed as:

 DOT8 = 0xCCCCCCCCCCCCCCCD      // 0.8 (finite approx.)
                                // (code fragment)
         movl     rf = DOT8;;   // Get reciprocal factor
         umpy8hi  rl = rf,rx;;  // rl = high 64 bits
         shr      rq = rl,3;;   // Divide by 8

We now have the desired quotient in register rq. The original source number is still in register rx. If we are doing a modulo calculation and need to know the remainder, we can go on to compute it from rx - 10*rq.
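
The same fragment can be sketched in C for readers who want to experiment, assuming a compiler that provides the unsigned __int128 extension to stand in for umpy8hi (keeping the high 64 bits of the 128-bit product); the function name div10 is ours:

 #include <assert.h>
 #include <stdint.h>

 #define DOT8 0xCCCCCCCCCCCCCCCDULL        /* 0.8 as a 64-bit binary fraction */

 /* Unsigned division by ten: multiply by 0.8, keep the high 64 bits of the
    128-bit product (the role of umpy8hi), then divide by 8 with a shift. */
 static uint64_t div10(uint64_t x)
 {
     uint64_t high = (uint64_t)(((unsigned __int128)x * DOT8) >> 64);
     return high >> 3;
 }

 int main(void)
 {
     assert(div10(256) == 25);              /* the worked example that follows */
     uint64_t q = div10(12345);
     uint64_t r = 12345 - 10 * q;           /* remainder from rx - 10*rq */
     assert(q == 1234 && r == 5);
     return 0;
 }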

We can illustrate this method by asking for the result of integer division of 256 by 10 using this program fragment. Since 256 = 2^8, the multiplication of DOT8 by 256 will result in a 128-bit product based on a shift to the left by 8 bits, or 2 hexadecimal digits, where

 low-order 64 bits are:    CCCCCCCCCCCCCD00 (hexadecimal)
 high-order 64 bits are:   00000000000000CC (hexadecimal)

Remember that our version of Booth's algorithm retains the low-order 64 bits in R and the high-order 64 bits in L (register rl in the pseudocode just above). Continuing, we now divide the high-order portion by 8,

 CC (hexadecimal) = 1100 1100 (binary)  -->  0001 1001 (binary) = 19 (hexadecimal) = 25 (decimal)

thus obtaining the proper result of 25; integer division discards the remainder.

In developing this methodology, we chose to treat the reciprocal of the known divisor as a binary fraction and the multiplicand (dividend) as a whole number. Accordingly, the split between the whole number and the fraction was located exactly between the high- and low-order 64-bit portions of the full product. Picking off the integer part of the quotient then simply meant choosing the contents of register L in our illustration of Booth's algorithm. Among computer architectures with machine instructions for multiplication, some provide the low- and high-order segments in separate registers from the execution of a single instruction, while others have two different opcodes that specify which segment is placed into a destination register.


