This book covers issues of using assembler language to optimize C++ software created in the Visual Studio .NET environment. In terms of software development and debugging, optimization means improving some of the software product's performance characteristics. In addition, the term often implies a system of measures aimed at improving software performance.
The optimization process itself may be performed either by the developer (manual optimization) or automatically (by the compiler of the development environment in which the application is built). It is also possible for the developer to use a third-party debugger to debug and optimize the program.
Most developers are aware that under the pressure of tough competition, performance issues have become a crucial factor determining the success or failure of an application in the software market. So without serious work on improving a program code's performance, it is impossible to ensure that the application will be competitive. And although everyone recognizes the necessity and importance of software optimization, it still remains a controversial issue. Disputes in this area are mainly related to the following question: Is it really necessary for a developer to optimize his or her application manually when there are ready-made, dedicated hardware and software tools for this task?
Some developers consider it impossible to improve an application's performance beyond what the optimization functionality of the compiler itself provides, especially given that all modern compilers have built-in tools for optimizing program code. In part, this really is the case, as today all existing development tools apply optimizing algorithms when generating an executable module.
It is possible to rely completely on the compiler ("everything has been done in advance") and expect it to generate optimal code without making any effort to improve the program quality. In many cases, the code needs no further revision at all. For example, small office applications or network testing utilities usually need no optimization.
But in most cases, you cannot rely completely on the standard compiler features and skip manual optimization of the program. Whether you like it or not, you will have to face the problem of improving performance when developing more serious applications, such as databases or all sorts of client-server and network applications. In most cases of this type, your development environment's optimizing compiler will not make a big difference.
If you develop real-time applications such as hardware drivers, system services, or industrial applications, the task cannot be completed at all without serious manual code optimization to ensure the best possible performance. This is not because the development tools are imperfect or fail to provide the required level of optimization, but because any complex program includes a great number of interrelated parameters, which no development tool can improve better than the developer can. The optimization process is more akin to an art than to pure programming, and it is therefore difficult to describe in terms of a universal procedure.
The process of improving an application's performance is usually difficult and time-consuming. There is no single criterion by which optimization can be characterized. Moreover, the optimization process itself is full of trade-offs: when you manage to reduce the program's memory usage, for example, you often achieve this at the cost of its speed.
No program can be extremely fast, have minimum size, and provide the user with full-scale functionality at the same time. It is impossible to write such an ideal application, although it is possible to bring the application close to this ideal.
In good applications, these characteristics are usually combined in reasonable proportions, depending on what is more important for the particular project: speed, program size (meaning both the size of the application file and its memory usage), or, say, a convenient user interface.
For most office applications, an extremely important factor is a convenient user interface and as much functionality as possible. For example, for a person using an electronic telephone directory, a response that is 10% faster or slower does not make a big difference. The size of such an application does not generally matter much either, as hard drive capacities are now large enough to hold dozens and even hundreds of such electronic database systems. The working program may need dozens of megabytes of RAM, but this does not present a problem today either. What is crucial for such an application is to provide the user with convenient ways to manipulate the data.
For an application using the client/server model for data processing and user interaction (for example, most network applications), the optimization criteria will be different. In this case, priority will be given to issues of memory usage (in particular, for the server side of the application) and optimization of client-side interaction via the network.
With real-time applications, the crucial point is synchronization in receiving, processing, and possibly transferring data within reasonable time intervals. As a rule, in such programs you will need to optimize the level of CPU usage and synchronization with the operating system. If you are a system programmer developing drivers or services for an operating system such as Windows 2000, then inefficient program code will at best slow down the performance of the whole system, and at worst the consequences could be beyond imagination.
As you can see, an application's performance may be determined by different factors. In each case, the optimization criteria are selected depending on the application's purpose.
Let's now focus on optimization methods, and compare different approaches that can help you increase application performance.
The simplest way to make an application run faster is to upgrade your computer to a faster processor or add more memory. When you upgrade the hardware, the performance problem is resolved by itself.
Using this approach, you will most probably reach a dead end, as you will always depend on hardware solutions. Incidentally, many of the expectations placed on new-generation processors, new memory types, and new system bus architectures have been greatly exaggerated. In practical work, their performance is not as good as the manufacturers' declarations. For instance, new memory chips usually have greater data-storage capacities, but are not much faster than preceding models. The same goes for hard drives: their performance is improving more slowly than their capacity.
If you develop commercial applications, you should take into account that users will not necessarily have the latest processor model and fast memory chips. In addition, many of them will not be willing to invest in a new computer if they are quite satisfied with what they have.
So you can hardly rely on solving software problems solely by acquiring new equipment.
For this reason, let's now turn to methods of increasing performance using only algorithmic and programming methods.
When optimizing an application, you will need to consider the following issues:
Thorough elaboration of the algorithm of the program you are developing
Available computer hardware and getting the most out of it
Tools provided by the high-level language of the environment in which you are developing the application
Using the low-level assembler language
Making use of specific processor characteristics
Let s now look at each of these issues in greater detail.
Developing the algorithm for your future application is the most complicated part of the whole lifecycle of the program. The depth at which you think through all the aspects of your task will largely influence how successfully it is implemented as program code. Generally, changes in the structure of the program itself can produce a much greater effect than fine-tuning the program code. There are no ideal solutions, so there are always some mistakes or defects that may occur when the algorithm is developed. Here, it is important to find the algorithm bottlenecks that have the greatest effect upon the program's performance.
Moreover, practical experience shows that in almost all cases, you can find a way to improve the program algorithm after it is ready. It is certainly much better if you work out the algorithm thoroughly at the very beginning of the development process, as this will save you a great deal of trouble revising program code fragments later. So do not try to save time on developing the program algorithm; this will spare you headaches when debugging and testing the program, thus saving time later on.
You should also bear in mind that an algorithm that is efficient in terms of application performance will never correspond completely to the task specification, and vice versa. Many well-structured and legible algorithms turn out to be inefficient when implemented as program code. One reason is that the developer tries to simplify the overall structure of the program by using multiple-level nested calculation structures wherever possible, and in this case the simpler algorithm inevitably leads to a loss of application performance.
When you start to develop an algorithm, it is difficult to envisage what the program code will look like. To develop a program algorithm correctly, you should stick to the following simple guidelines:
Study the application s purpose thoroughly.
Determine the main requirements of the application and present them in a formalized way.
Decide how to represent incoming and outgoing data, as well as its structure and possible limitations.
Based on these parameters, work out the program version (or model) for implementing the task.
Choose how you will implement the task.
Develop an algorithm to implement the program code. Be careful not to confuse the algorithm for solving the problem with the algorithm for implementing the program code. Generally, these algorithms never coincide. This is the most critical stage in developing a software product!
Develop the source code of the program according to the algorithm for implementing the program code.
Debug and test the program code of the application.
You should not stick to these guidelines rigidly, however. In every project, the developer is free to choose how to develop the application. Some stages may be subdivided into further steps, and some of them may be skipped. For minor tasks, you can simply work out an algorithm, then correct it slightly to implement the program code, and debug the program.
When creating large applications, you may need to develop and test several isolated fragments of program code, meaning that you will have to add more detail to the program algorithm.
There are a number of resources that can help you to create the correct algorithm. The principles for building efficient algorithms are already well explored, and there are a lot of good books that cover these issues, such as The Art of Computer Programming by D. Knuth.
Software developers usually want to ensure that application performance depends as little as possible on computer hardware. Therefore, you should also consider the worst-case scenario, in which the user is working on a very old computer. In this case, revising the hardware operation often allows you to find resources to improve the application's performance.
The first thing you need to do is to examine the performance of the hardware components that the program is supposed to use. If you know what works faster and what is slower, this can help you in developing the program. By analyzing system performance, you can find bottlenecks and make the right decision.
Throughput differs for different components of the computer. The fastest are the CPU and RAM, while hard drives and CD-ROMs are relatively slow. Slowest of all are peripherals, such as printers, plotters, or scanners.
Most Windows applications employ a graphical user interface (GUI), and therefore make active use of the computer's graphics features. In this case, when developing an application, you should consider the bandwidth of the system bus and of the computer's graphics subsystem.
Virtually all applications make use of hard disk resources. In most cases, the performance of the disk subsystem has a great effect upon application performance. If your program uses hard-disk resources intensively (for example, if it writes and moves files quite frequently), then a slow hard drive will inevitably be an obstacle to performance.
One more example: favoring CPU registers over memory accesses may help you increase performance by reducing system bus traffic when the program works with RAM. In many cases, you can improve application performance by caching data. A data cache may be helpful for disk operations, or when working with the mouse, a printing device, etc.
If you are developing a commercial application, you should determine the lowest hardware configuration on which your program can run. This configuration should be taken into account when planning any optimization measures.
Using this method of optimization usually involves analyzing the program code to find any bottlenecks in the operation of the program. Finding the points at which the program slows down considerably is often a difficult task. In this case, dedicated programs called profilers may be helpful.
The purpose of profilers is to determine the performance of an application, help you debug the program, and find points where performance drops considerably. One of the best programs of this kind is Intel's VTune Performance Analyzer, which I recommend for debugging and optimizing your applications.
High-level languages also contain built-in debugging tools. Modern compilers help you detect errors, but give you no information as to the efficiency of a program fragment. That is why it is a good idea to have a helpful profiler at hand.
Many developers prefer to debug their programs manually. This is not the worst option if you have a clear idea of how the application works. Anyway, regardless of how you are debugging, it is worth considering the following factors that affect application performance:
The number of calculations performed by the program. One factor improving application performance is reducing the number of calculations. When running, the program should not calculate the same value twice: it should calculate every value only once and store it in memory for future use. You can achieve considerably better performance by replacing calculations with simple accesses to pre-generated value tables.
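As a sketch of the table technique in C++ (the bit-counting task and function names here are illustrative assumptions, not an example from the book): the counts for all 256 byte values are generated once, and later queries become plain array accesses instead of loops.

```cpp
#include <array>
#include <cstdint>

// Precomputed table of bit counts for every possible byte value,
// built once at startup instead of recomputing on every call.
static const std::array<uint8_t, 256> kBitCount = [] {
    std::array<uint8_t, 256> t{};
    for (int i = 0; i < 256; ++i)
        t[i] = static_cast<uint8_t>(t[i >> 1] + (i & 1));
    return t;
}();

// Counts set bits in a 32-bit value with four table lookups
// instead of a 32-iteration loop.
int CountBits(uint32_t v) {
    return kBitCount[v & 0xFF] + kBitCount[(v >> 8) & 0xFF] +
           kBitCount[(v >> 16) & 0xFF] + kBitCount[(v >> 24) & 0xFF];
}
```

The table costs 256 bytes of memory; the speedup is bought at the cost of that space, which is exactly the kind of trade-off discussed earlier.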
Use of mathematical operations. Any application uses mathematical operations in one way or another. Analyzing the efficiency of these calculations is quite a complicated task, and in different cases it can depend on different factors. Better performance can be achieved by using simpler arithmetic operations. Thus, you can replace multiplication and division operations with the corresponding blocks of addition and subtraction commands whenever possible. If the program uses floating-point operations, try to avoid mixing them with integer commands, as the switches between the two will slow down performance. There is one more nuance: if possible, try to reduce the number of division operations. Performance also drops when mathematical operations are used in loops. Instead of multiplying by a power of 2, you can use the commands for left-shifting bits.
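A minimal C++ sketch of the shift substitution (the function names are hypothetical, and modern compilers usually perform this replacement automatically for constant powers of two):

```cpp
#include <cstdint>

// Multiplication by a power of two replaced with a left shift.
// On older processors a shift took far fewer cycles than a multiply.
uint32_t TimesEight(uint32_t x) {
    return x << 3;       // equivalent to x * 8
}

// Division by a power of two as a right shift. This is only equivalent
// for unsigned values; a plain shift rounds differently for negative
// signed numbers.
uint32_t DivideBySixteen(uint32_t x) {
    return x >> 4;       // equivalent to x / 16 for unsigned x
}
```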
Use of loop calculations and nested structures. This concerns the use of control structures such as WHILE, FOR, SWITCH, and IF. Such constructs help you simplify the structure of the program, but at the same time reduce its performance. Take a close look at the program code to find calculations using nested structures and loops. The following rules may be helpful for optimizing loops:
Never use a loop to do what can easily be done without a loop.
If possible, try to avoid using the jump commands within loops.
You can achieve better performance even by bringing just one or two operators outside the loop. There are some more things you can do to increase program efficiency. For example, you can calculate invariant values outside loops. You can unroll loops, or combine separate loops with the same number of iterations into a single loop. You should also try to reduce the number of commands used in the body of the loop. Also try to reduce the number of cases in which a procedure or a subroutine is called from within the loop body, as the processor may slow down when calculating their effective addresses.
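The hoisting and unrolling techniques can be sketched in C++ as follows (the function names are illustrative; an optimizing compiler may perform both transformations itself):

```cpp
#include <cstddef>

// Loop-invariant hoisting: scale * offset does not change inside the
// loop, so it is computed once before the loop rather than on every
// iteration.
double SumScaled(const double* data, std::size_t n,
                 double scale, double offset) {
    const double bias = scale * offset;   // hoisted invariant
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += data[i] * scale + bias;    // was: data[i]*scale + scale*offset
    return sum;
}

// Manual 4x unrolling reduces the loop-control overhead (compare and
// jump) incurred per element processed.
double Sum4(const double* data, std::size_t n) {
    double sum = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        sum += data[i] + data[i + 1] + data[i + 2] + data[i + 3];
    for (; i < n; ++i)                    // leftover elements
        sum += data[i];
    return sum;
}
```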
It is also useful to reduce the number of jump commands in the program. To do so, you can, for example, restructure conditional blocks so that the jump condition evaluates to TRUE much less often than to FALSE. It is also a good idea to place the more general conditions at the start of the program's branching sequence. If your program contains calls that are immediately followed by returns to the program, it is better to transform them into jumps.
In summary, it is desirable to reduce the number of jumps and calls wherever possible, especially at those points of the program where performance is determined only by the processor. To do this, you should organize the program so that it can be executed in a direct (linear) sequence with a minimal number of jump points.
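A small C++ illustration of these two ideas; the categories and their assumed frequencies are hypothetical:

```cpp
// Ordering tests by frequency: the assumed common case is checked
// first, so on most calls no further comparisons are executed.
int Classify(int ch) {
    if (ch >= 'a' && ch <= 'z') return 0;  // most frequent case first
    if (ch >= '0' && ch <= '9') return 1;
    return 2;                              // rare case last
}

// Branchless selection: compilers typically turn a simple conditional
// expression like this into a cmov or setcc sequence, avoiding a
// conditional jump entirely.
int Max(int a, int b) {
    return a > b ? a : b;
}
```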
Implementation of multithreading. If used correctly, this technique can produce better performance, but otherwise it may slow down the program. Practical experience shows that the use of multithreading is efficient for large applications, whereas smaller programs with multithreading tend to slow down. The possibility of breaking the executed process into several threads is provided by the Windows architecture. Multithreading can be helpful for optimizing programs, but you should bear in mind that every thread requires additional memory and processor resources, so this method is unlikely to be effective if hardware performance is not high enough (e.g., if the system has a slow processor or not enough memory).
Allocation of similar and frequently repeated calculations into separate subroutines (procedures). There is a widespread opinion that the use of subroutines always increases application performance, making it possible to reuse the same code fragment for performing similar calculations at different points of the program. This is partially true, as it makes the program easily readable and the algorithm easier to understand. But from the point of view of the processor, an algorithm with a linear sequence is always (!) more efficient than the use of procedures. Every time a procedure is invoked, the program jumps to another memory address, while storing on the stack the address where it should return to the main program. This always slows down the program. This does not imply that you should reject subroutines or procedures completely: you should just use them within reason.
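The trade-off can be seen in a small C++ sketch (the names are illustrative): declaring a tiny helper inline lets the compiler substitute its body at each call site, removing the call/return and stack traffic described above while keeping the source structured.

```cpp
// A tiny helper declared inline: the compiler can paste its body into
// the caller, so the loop below runs as a linear sequence with no
// call/ret overhead, yet the source still reads as structured code.
inline long Square(long x) { return x * x; }

long SumOfSquares(const long* v, int n) {
    long s = 0;
    for (int i = 0; i < n; ++i)
        s += Square(v[i]);   // no call overhead once inlined
    return s;
}
```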
Using assembler is one of the most efficient methods of program optimization, and its optimization techniques are largely similar to those used with high-level languages. But assembler provides the programmer with a number of additional options. Without repeating those issues that are similar to optimization in high-level languages, here we shall focus on techniques characteristic only of assembler.
Using assembler is in many respects a good way to eliminate the problem of redundant program code. Assembler code is more compact than its high-level analog. To see this, you can simply compare the disassembled listings of the same program written in assembler and in a high-level language. The assembler code generated by a high-level language compiler, even with optimization options applied, does not solve the problem of redundant program code. At the same time, assembler lets you develop short, efficient code.
As a rule, assembler program modules perform better than programs written in a high-level language. This is due to a smaller number of commands needed to implement the code fragment. It takes the processor less time to execute a smaller set of commands, thus increasing application performance.
You can develop individual modules completely in assembler, and then link them to high-level language programs. You can also make use of built-in tools in high-level languages to write assembler procedures directly into the body of your program. This feature is supported by all high-level languages. By using the built-in assembler, you can obtain greater efficiency. This is most effective when used to optimize mathematical expressions, program loops, and data-array processing blocks in the main program.
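As a minimal sketch of the idea: a function whose inner operation is written in inline assembly. Visual C++ .NET expresses this with an __asm { ... } block; GCC extended-asm syntax is used here instead so the example is self-contained, and x86 hardware is assumed for the assembly path, with a plain C++ fallback elsewhere.

```cpp
#include <cstdint>

// Adds two numbers with the inner operation written in inline assembly.
uint32_t AddAsm(uint32_t a, uint32_t b) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    asm("addl %1, %0"        // a += b, performed directly in a register
        : "+r"(a)            // a is read and written
        : "r"(b));           // b is read only
    return a;
#else
    return a + b;            // fallback where x86 inline asm is unavailable
#endif
}
```

In a real program, the payoff comes from applying this to hot spots such as the mathematical expressions and array-processing loops mentioned above, not to a trivial addition.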
For optimization based on the specific properties of the processor, you need to take into account the architectural peculiarities of every particular Intel processor. This method extends to assembler optimization.
Here, we shall only consider optimization options for Pentium processors. Every new processor model usually has some new improvements in its architecture. At the same time, all Pentium processors have some characteristics in common. So processor-level optimization may be based both on the common properties of the whole family and on the specifics of each model.
Processor-level optimization of program code lets you enhance the performance of both high-level language programs and assembler procedures. Developers who use high-level languages are often unaware of this method, so it is seldom used, even though it can provide virtually unlimited possibilities. And those who develop assembler programs and procedures sometimes make use of the properties of new processor models.
It should be noted that even earlier Intel processors included additional commands. Though rarely used by developers, these commands allow the program code to be made more efficient.
So what processor properties can be used for optimization? First of all, it is useful to align data and addresses on 32-bit word boundaries. Besides, all processors from the 80386 onward support enhanced calculation features, which you can use for optimizing programs. These features were added through supplementary commands and expanded operand-addressing options. To improve program performance, you can use the following:
Transfer commands with zero or sign extension (movzx and movsx).
Commands that set a byte to TRUE or FALSE depending on the contents of the CPU flags (for example, setz, setc, etc.). These let you avoid using conditional jump commands.
Commands for bit checking, setting, resetting, and scanning (bt, btc, btr, bts, bsf, bsr).
Extended index addressing and addressing modes with index scaling.
Quick multiplication using the lea command with scaled index addressing.
Multiplication of 32-bit numbers and division of a 64-bit number by a 32-bit one.
Operations for processing multibyte data arrays and strings.
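To illustrate, here are C++ constructs that optimizing compilers typically map onto some of the instructions listed above (an illustrative sketch; actual code generation depends on the compiler and its settings):

```cpp
#include <cstdint>

// Zero extension of a byte to 32 bits: typically a single movzx.
uint32_t ZeroExtend(uint8_t b) { return b; }

// A flag-based byte result: typically compiled to a setcc sequence
// instead of a conditional jump.
int IsGreater(int a, int b) { return a > b; }

// Multiplication by 9 via base + scaled index: typically a single
// lea instruction of the form lea eax, [eax + eax*8].
uint32_t TimesNine(uint32_t x) { return x + x * 8; }
```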
Processor commands for copying and moving multibyte data arrays require a smaller number of processor cycles than classical commands of this type. Beginning with MMX processors, complex commands have been added that combine several functions previously performed by separate commands. There is now a considerably larger set of commands for bit operations. These commands are also complex, and let you perform several operations at once. The options provided by these commands will be covered in Chapter 10, which explores built-in tools in high-level languages.
As has already been seen, using the properties of the processor's hardware architecture has great potential for optimization. This is quite a complicated business, requiring knowledge of data-processing methods and of how processor commands are performed at the hardware level. I can assert in all confidence that this domain contains virtually unlimited potential for program optimization.
Naturally, processor-level optimization has its own peculiarities. For instance, if your program is meant to run on systems with processors of several different generations, then you should optimize the program based on the common features of all those devices.
In addition, there are also many other options for optimizing application code. As you can see, the program itself has a great deal of optimization potential. This book focuses mainly on optimization using assembler, considering possible solutions to this task in greater detail.
Assembler is widely used as a tool for optimizing the performance of high-level language applications. By combining assembler and high-level language modules reasonably, you can achieve both higher performance and smaller executable code. This combination is now used so frequently that the interface between high-level language programs and assembler modules has become a special concern for compiler vendors. As a rule, modern compilers come with a built-in assembler.
In practical work, there are two basic options for combining assembler with high-level languages.
The first approach is to use a separate object module file with one or several procedures for data processing. The procedures are called from within a program created in a high-level development environment such as Visual C++ .NET.
In the source code of the high-level language application, you need to declare the assembler procedure accordingly, and can then call it from any point of the main program. During the build, the external object module (written in assembler) is linked to the main program.
The file containing the source code of the procedure usually has the ASM extension. To compile it, you can resort to one of the widely used packages such as Microsoft Macro Assembler (MASM), Borland Turbo Assembler (TASM 5.0), or Netwide Assembler (NASM), which is more powerful than the first two but not as widely used.
Compiling separate assembler modules has a number of advantages. First of all, you can use this program code in applications written in different high-level languages and even in different operating environments. It is also important that you can develop and debug the program code of the procedures separately. Among possible drawbacks are certain difficulties in integrating such a module with the main high-level language program. When using this approach, you should have a clear idea of the mechanism for calling external procedures and sending the parameters to the procedure you are calling. This approach also enables you to use assembler object modules or function libraries repeatedly. In this case, you should take care of the interface for interaction between the assembler module and the high-level language program. Issues of integrating assembler modules with C++ programs will be covered in more detail in Chapter 7 .
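The C++ side of this interface can be sketched as follows. The procedure name, signature, and MASM command line are hypothetical; in a real project only the declaration would appear in the C++ source, with the body assembled separately.

```cpp
#include <cstddef>

// extern "C" suppresses C++ name mangling so the linker can match the
// symbol exported by the separately assembled module (built, for
// example, with: ml /c /coff sum_array.asm).
// A C++ stand-in body is provided here so the sketch is self-contained;
// in the real project the assembler object module supplies it at link
// time instead.
extern "C" long sum_array(const long* data, std::size_t count) {
    long s = 0;
    for (std::size_t i = 0; i < count; ++i)
        s += data[i];
    return s;
}
```

The key point of the pattern is that both sides must agree on the calling convention and parameter order; that contract is the "interface for interaction" mentioned above.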
The second approach is based on the use of built-in assembler. Using built-in assembler to develop procedures is convenient, first of all, due to fast debugging. Since the procedure is developed within the body of the main program, you do not need any dedicated tools to integrate this procedure with the program that calls it. Nor do you have to worry about the order of sending the parameters into the procedure you are calling, or about restoring the stack. Possible drawbacks of this approach include certain limitations imposed by the development environment on the operation of the assembler modules. In addition, procedures developed in built-in assembler cannot be transformed into external modules for repeated use.
Like the high-level languages, all modern assembler development tools come with an integrated debugger. Although such a debugger may offer somewhat fewer features than those of high-level languages, its capabilities are quite sufficient for analyzing program code.
It is true that many developers consider assembler to be just a supplementary tool for improving programs. But in spite of this, the role of assembler has changed considerably in recent years, and it is also regarded as an independent tool for developing highly efficient applications.
Until recently, there existed a persistent stereotype about the use of assembler for application development. Many programmers working with high-level languages believed that assembler is complicated, that assembler software cannot be well structured, and that assembler code is hardly portable to other platforms. Many may remember the days of developing assembler programs in MS-DOS, which really was difficult. Besides, the lack of modern development tools at that time hindered the development of complicated projects.
In recent years, this situation has changed with the appearance of completely new and highly efficient tools that let you develop assembler programs quickly. These are dedicated Rapid Application Development (RAD) systems such as MASM32, Visual Assembler, and RADASM. The size and performance of a window-based SDI (Single-Document Interface) application written in the assembler language is really impressive!
As a rule, such development tools come with resource compilers, large libraries of ready-to-use functions, and powerful debugging tools. So it is fair to say that developing programs in assembler has become as easy as developing them in high-level languages.
Thus, the main reason that kept developers from using assembler widely (i.e., the lack of Rapid Application Development tools) has been eliminated. And what applications can be developed in the assembler language? It is much easier to say for which projects you should not use it. Small and medium-sized 32-bit Windows applications can be written completely in assembler. But if you need to develop a complicated program requiring the use of advanced technologies, then you would be better off choosing a high-level language, and then using assembler to optimize certain fragments of the program.
There is one more difficulty in using assembler: It is intended for developing procedural applications, and does not use the object-oriented programming (OOP) methodology. This causes certain limitations on its usage. Nevertheless, this in no way prevents you from using assembler for writing classical procedural Windows applications.
Modern assembler development tools enable you to create a graphical user interface (GUI) while retaining the fundamental advantage of assembler: The size of your executable module will be incredibly small. Short, fast assembler applications are useful when code size and program speed are crucial factors, for example, in real-time applications, system utilities and programs, as well as hardware drivers.
Assembler programs let you control both the peripherals of the personal computer and the non-standard devices connected to it. The minimal size of the executable program code ensures high performance of these devices. Real-time applications of this kind are widely used in industrial control systems, in scientific and laboratory research, and also in military applications.
As to the system programs and utilities, their peculiarity is close interaction with the operating system, and so the speed of such applications may have a considerable effect upon the overall performance of the whole operating system. This is also largely applicable to the development of hardware drivers and system services.
Assembler development tools also let you create fast console (command line) utilities. By using Windows calls in such utilities, you can implement a lot of very complicated functions (copying files, searching and sorting, processing and analysis of mathematical expressions, etc.) at an extremely high level of performance.
Another important use of assembler is to develop drivers for computer-controlled non-standard and specialized devices. For these tasks, assembler may be very efficient. The huge number of examples of this sort of usage includes computer-based data-processing systems with external devices (such as microcontroller and digital processor devices used in technological processes, as well as smart-card terminals and all kinds of analyzers), single-board computers using flash memory, and systems for diagnostics and testing all kinds of equipment.
There is one more, rather exotic way of using assembler, in which assembler is used for the main program and a high-level language (say, C++) for the supplementary modules. As a rule, such a program uses the powerful library functions of the high-level language, such as mathematical or string functions. In addition, if you develop the interface by calling the Windows API (Application Programming Interface) directly, you can obtain an extremely powerful program. But this technique demands outstanding knowledge of both assembler and the high-level language.
Apart from the techniques considered above, there are a number of other methods for improving the quality of the software. Experienced developers resort to a lot of tricks and hacks to improve an application's performance level.
As mentioned above, program optimization is a creative process, and developers may have individual preferences when choosing how to debug and optimize applications.