6.5 Machine and Arithmetic Idioms

An idiom is an idiosyncrasy. Several arithmetic operations and 80x86 instructions have idiosyncrasies that you can take advantage of when writing assembly language code. Some people refer to the use of machine and arithmetic idioms as "tricky programming" that you should always avoid in well-written programs. While it is wise to avoid tricks just for the sake of tricks, many machine and arithmetic idioms are well known and commonly found in assembly language programs. Some of them can be really tricky, but a good number of them are simply "tricks of the trade." This text cannot even begin to present all of the idioms in common use today; they are too numerous and the list is constantly changing. Nevertheless, there are some very important idioms that you will see all the time, so it makes sense to discuss those.

6.5.1 Multiplying Without MUL, IMUL, or INTMUL

If you take a quick look at the timing for the multiply instruction, you'll notice that the execution time for this instruction is often long.^[1] When multiplying by a constant, you can sometimes avoid the performance penalty of the mul, imul, and intmul instructions by using shifts, additions, and subtractions to perform the multiplication.

Remember, a shl instruction computes the same result as multiplying the specified operand by two. Shifting to the left two bit positions multiplies the operand by four. Shifting to the left three bit positions multiplies the operand by eight. In general, shifting an operand to the left n bits multiplies it by 2n. You can multiply any value by some constant using a series of shifts and adds or shifts and subtractions. For example, to multiply the AX register by ten, you need only multiply it by 8 and then add in two times the original value. That is, 10*AX = 8*AX + 2*AX. The code to accomplish this is

      shl( 1, ax );           // Multiply AX by two.      mov( ax, bx);           // Save 2*AX for later.      shl( 2, ax );           // Multiply ax by eight (*4 really, but it                              // contains *2).      add( bx, ax );          // Add in AX*2 to AX*8 to get AX*10.

Many x86 processors can multiply the AX register (or just about any register, for that matter) by various constant values much faster using shl than by using the mul instruction. This may seem hard to believe because it only takes one instruction to compute this product:

      intmul( 10, ax );

However, if you look at the timings, the shift and add example above requires fewer clock cycles on many processors in the 80x86 family than the mul instruction. Of course, the code is somewhat larger (by a few bytes), but the performance improvement is usually worth it. Of course, on the later 80x86 processors, the multiply instructions are quite a bit faster than the earlier processors, but the shift and add scheme is often faster on these processors as well.

You can also use subtraction with shifts to perform a multiplication operation. Consider the following multiplication by seven:

      mov( eax, ebx );    // Save EAX * 1      shl( 3, eax );      // EAX = EAX * 8      sub( ebx, eax );    // EAX*8 - EAX*1 is EAX*7

This follows directly from the fact that EAX*7 = (EAX*8)-EAX. A common error beginning assembly language programmers make is subtracting or adding one or two rather than EAX*1 or EAX*2. The following does not compute EAX*7:

      shl( 3, eax );      sub( 1, eax );

It computes (8*EAX)-1, something entirely different (unless, of course, EAX = 1). Beware of this pitfall when using shifts, additions, and subtractions to perform multiplication operations.

You can also use the lea instruction to compute certain products. The trick is to use the scaled index addressing modes. The following examples demonstrate some simple cases:

      lea( eax, [ecx][ecx] );      // EAX := ECX * 2      lea( eax, [eax][eax*2] );    // EAX := EAX * 3      lea( eax, [eax*4] );         // EAX := EAX * 4      lea( eax, [ebx][ebx*4] );    // EAX := EBX * 5      lea( eax, [eax*8] );         // EAX := EAX * 8      lea( eax, [edx][edx*8] );    // EAX := EDX * 9

6.5.2 Division Without DIV or IDIV

Much as the shl instruction can be used for simulating a multiplication by some power of two, you may use the shr and sar instructions to simulate a division by a power of two. Unfortunately, you cannot use shifts, additions, and subtractions to perform a division by an arbitrary constant as easily as you can use these instructions to perform a multiplication operation. Therefore, keep in mind that this trick is useful only when dividing by powers of two. Also, don't forget that the sar instruction rounds toward negative infinity rather than toward zero; this is not the way the idiv instruction operates (it rounds toward zero).

Another way to perform division is to use the multiply instructions. You can divide by some value by multiplying by its reciprocal. Because the multiply instruction is faster than the divide instruction, multiplying by a reciprocal is usually faster than division.

Now you're probably wondering, "How does one multiply by a reciprocal when the values we're dealing with are all integers?" The answer, of course, is that we must cheat to do this. If you want to multiply by one-tenth, there is no way you can load the value 1/10th into an 80x86 register prior to performing the multiplication. However, we could multiply 1/10th by 10, perform the multiplication, and then divide the result by ten to get the final result. Of course, this wouldn't buy you anything at all; in fact it would make things worse because you're now doing a multiplication by ten as well as a division by ten. However, suppose you multiply 1/10th by 65,536 (6553), perform the multiplication, and then divide by 65,536. This would still perform the correct operation and, as it turns out, if you set up the problem correctly, you can get the division operation for free. Consider the following code that divides AX by ten:

      mov( 6554, dx );           // 6554 = round( 65,536/10 ).      mul( dx, ax );

This code leaves AX/10 in the DX register.

To understand how this works, consider what happens when you multiply AX by 65,536 ($10000). This simply moves AX into DX and sets AX to zero (a multiply by $10000 is equivalent to a shift left by 16 bits). Multiplying by 6,554 (65,536 divided by ten) puts AX divided by ten into the DX register. Because mul is faster than div, this technique runs a little faster than using a straight division.

Multiplying by the reciprocal works well when you need to divide by a constant. You could even use it to divide by a variable, but the overhead to compute the reciprocal only pays off if you perform the division many, many times (by the same value).

6.5.3 Implementing Modulo-N Counters with AND

If you want to implement a counter variable that counts up to 2ⁿ−1 and then resets to zero, simply use the following code:

      inc( CounterVar );      and( nBits, CounterVar );

where nBits is a binary value containing n one bits right justified in the number. For example, to create a counter that cycles between 0 and 15, you could use the following:

      inc( CounterVar );      and( %00001111, CounterVar );

6.5.4 Careless Use of Machine Idioms

One problem with using machine idioms is that the machines change over time. The DOS/16-bit version of this text recommends the use of several machine idioms in addition to those this chapter presents. Unfortunately, as time passed Intel improved the processor and tricks that used to provide a performance benefit are actually slower on the newer processors. Therefore, you should be careful about employing common "tricks" you pick up; they may not actually improve the code. On the wide range of 80x86 processors available today, it is very difficult to find a machine idiom that provides an improvement on all CPUs. An idiom that works great on a Pentium II may be lousy on a Pentium IV. A trick that works great on a Pentium IV may not produce the best results on an Athlon CPU. So be careful; that tricky code you're writing may actually run slower than the straight-forward code on many CPUs.

^[1]Actually, this is specific to a given processor. Some processors execute the intmul instruction fairly fast.