Flylib.com

Books Software

 
 
 

26.5. Routines

 < Free Open Study > 

26.5. Routines

One of the most powerful tools in code tuning is a good routine decomposition. Small, well-defined routines save space because they take the place of doing jobs separately in multiple places. They make a program easy to optimize because you can refactor code in one routine and thus improve every routine that calls it. Small routines are relatively easy to rewrite in a low-level language. Long, tortuous routines are hard enough to understand on their own; in a low-level language like assembler, they're impossible .

Cross-Reference

For details on working with routines, see Chapter 7, "High-Quality Routines."


Rewrite Routines Inline

In the early days of computer programming, some machines imposed prohibitive performance penalties for calling a routine. A call to a routine meant that the operating system had to swap out the program, swap in a directory of routines, swap in the particular routine, execute the routine, swap out the routine, and swap the calling routine back in. All this swapping chewed up resources and made the program slow.

Modern computers collect a far smaller toll for calling a routine. Here are the results of putting a string-copy routine inline:

Language

Routine Time

Inline-Code Time

Time Savings

C++

0.471

0.431

8%

Java

13.1

14.4

-10%


In some cases, you might be able to save a few nanoseconds by putting the code from a routine into the program directly where it's needed via a language feature like C++'s inline keyword. If you're working in a language that doesn't support inline directly but that does have a macro preprocessor, you can use a macro to put the code in, switching it in and out as needed. But modern machines—and "modern" means any machine you're ever likely to work on—impose virtually no penalty for calling a routine. As the example shows, you're as likely to degrade performance by keeping code inline as to optimize it.

 < Free Open Study > 
 < Free Open Study > 

26.6. Recoding in a Low-Level Language

One long-standing piece of conventional wisdom that shouldn't be left unmentioned is the advice that when you run into a performance bottleneck, you should recode in a low-level language. If you're coding in C++, the low-level language might be assembler. If you're coding in Python, the low-level language might be C. Recoding in a lowlevel language tends to improve both speed and code size . Here is a typical approach to optimizing with a low-level language:

  1. Write 100 percent of an application in a high-level language.

  2. Fully test the application, and verify that it's correct.

  3. If performance improvements are needed after that, profile the application to identify hot spots. Since about 5 percent of a program usually accounts for about 50 percent of the running time, you can usually identify small pieces of the program as hot spots.

    Cross-Reference

    For details on the phenomenon of a small percentage of a program accounting for most of its run time, see "The Pareto Principle" in Section 25.2.


  4. Recode a few small pieces in a low-level language to improve overall performance.

Whether you follow this well-beaten path depends on how comfortable you are with low-level languages, how well-suited the problem is to low-level languages, and on your level of desperation. I got my first exposure to this technique on the Data Encryption Standard program I mentioned in the previous chapter. I had tried every optimization I'd ever heard of, and the program was still twice as slow as the speed goal. Recoding part of the program in assembler was the only remaining option. As an assembler novice, about all I could do was make a straight translation from a high-level language to assembler, but I got a 50 percent improvement even at that rudimentary level.

Suppose you have a routine that converts binary data to uppercase ASCII characters . The next example shows the Delphi code to do it:

Delphi Example of Code That's Better Suited to Assembler
procedure HexExpand(var source: ByteArray;

   var target: WordArray;

   byteCount: word);

var

   index: integer;

   lowerByte: byte;

   upperByte: byte;

   targetIndex: integer;

begin

   targetIndex := 1;

   for index := 1 to byteCount do begin

      target[ targetIndex ] := ((source[ index ] and $F0) shr 4) + ;

      target[ targetIndex+1 ] := (source[ index ] and
procedure HexExpand(var source: ByteArray; var target: WordArray; byteCount: word); var index: integer; lowerByte: byte; upperByte: byte; targetIndex: integer; begin targetIndex := 1; for index := 1 to byteCount do begin target[ targetIndex ] := ((source[ index ] and $F0) shr 4) + $41; target[ targetIndex+1 ] := (source[ index ] and $0f) + $41; targetIndex := targetIndex + 2; end; end;
f) + ; targetIndex := targetIndex + 2; end; end;

Although it's hard to see where the fat is in this code, it contains a lot of bit manipulation, which isn't exactly Delphi's forte. Bit manipulation is assembler's forte, however, so this code is a good candidate for recoding. Here's the assembler code:

Example of a Routine Recoded in Assembler
procedure HexExpand(var source;

   var target;

   byteCount : Integer);

   label

   EXPAND;



   asm

         MOV   ECX,byteCount      // load number of bytes to expand

         MOV   ESI,source         // source offset

         MOV   EDI,target         // target offset

         XOR   EAX,EAX            // zero out array offset



   EXPAND:

         MOV   EBX,EAX            // array offset

         MOV   DL,[ESI+EBX]       // get source byte

         MOV   DH,DL              // copy source byte



         AND   DH,$F              // get msbs

         ADD   DH,             // add 65 to make upper case



         SHR   DL,4               // move lsbs into position

         AND   DL,$F              // get lsbs

         ADD   DL,             // add 65 to make upper case

         SHL   BX,1               // double offset for target array offset

         MOV   [EDI+EBX],DX       // put target word



         INC   EAX                // increment array offset

         LOOP  EXPAND             // repeat until finished

   end;

Rewriting in assembler in this case was profitable, resulting in a time savings of 41 percent. It's logical to assume that code in a language that's more suited to bit manipulation C++, for instancewould have less to gain than Delphi code would. Here are the results:

Language

High-Level Time

Assembler Time

Time Savings

C++

4.25

3.02

29%

Delphi

5.18

3.04

41%


The "before" picture in these measurements reflects the two languages' strengths at bit manipulation. The "after" picture looks virtually identical, and it appears that the assembler code has minimized the initial performance differences between Delphi and C++.

The assembler routine shows that rewriting in assembler doesn't have to produce a huge, ugly routine. Such routines are often quite modest, as this one is. Sometimes assembler code is almost as compact as its high-level-language equivalent.

A relatively easy and effective strategy for recoding in assembler is to start with a compiler that generates assembler listings as a byproduct of compilation. Extract the assembler code for the routine you need to tune, and save it in a separate source file. Using the compiler's assembler code as a base, hand-optimize the code, checking for correctness and measuring improvements at each step. Some compilers intersperse the high-level-language statements as comments in the assembler code. If yours does, you might keep them in the assembler code as documentation.

cc2e.com/2672

Checklist: Code-Tuning Techniques

Improve Both Speed and Size

  • Substitute table lookups for complicated logic.

  • Jam loops .

  • Use integer instead of floating-point variables .

  • Initialize data at compile time.

  • Use constants of the correct type.

  • Precompute results.

  • Eliminate common subexpressions .

  • Translate key routines to a low-level language.

Improve Speed Only

  • Stop testing when you know the answer.

  • Order tests in case statements and if-then-else chains by frequency.

  • Compare performance of similar logic structures.

  • Use lazy evaluation.

  • Unswitch loops that contain if tests.

  • Unroll loops.

  • Minimize work performed inside loops.

  • Use sentinels in search loops.

  • Put the busiest loop on the inside of nested loops.

  • Reduce the strength of operations performed inside loops.

  • Change multiple-dimension arrays to a single dimension.

  • Minimize array references.

  • Augment data types with indexes.

  • Cache frequently used values.

  • Exploit algebraic identities.

  • Reduce strength in logical and mathematical expressions.

  • Be wary of system routines.

  • Rewrite routines inline.


 < Free Open Study >