Chapter 12: Optimizing Multithread Applications with Assembly Language | Visual C++ Optimization with Assembly Code

In Win32, a thread is the basic unit of execution. An application (a process) can contain several threads, which share its address space and other resources, and each thread runs independently within the process. Using threads can simplify the structure of your application and lets you take advantage of parallel processing. Ordinary applications run as a single thread, which slows down work that could proceed simultaneously (such as searching files or sorting several arrays). Note that creating a thread itself consumes processor resources. It therefore makes sense to use multithreading only when each thread runs long enough to justify that cost.

Programming multithread applications is covered quite comprehensively in the specialized literature, so we will not focus on their design. Rather, we will demonstrate how the C++ .NET 2003 inline assembler can be used to improve the performance of multithread applications. The main advantages of the assembler (high data-processing speed and small code size) can be successfully used to optimize multithread applications.

We will illustrate this with a specific example. Consider a task of simultaneously searching for and counting the number of particular characters in a text file.

Suppose you need to find the number of the characters 'r', 't', and 'f' in the testchar text file and display the result. It would be convenient to implement this task as three independent threads. One of the threads could look for the 'r' characters and count them, and the second and third threads could do the same for the 't' and 'f' characters.

The source code of the program is shown in Listing 12.1.

Listing 12.1: A three-thread application for counting the number of characters

    // THREAD_EXAMPLE_FPU.cpp : Defines the entry point for the console
    // application.
    #include <stdio.h>
    #include <stdlib.h>     // exit
    #include <tchar.h>      // _tmain, _TCHAR
    #include <windows.h>
    #include <process.h>    // _beginthread

    FILE *fp;
    char buf[128];
    int i, numread;
    int num_t, num_f, num_r;
    HANDLE tTimer = NULL;
    HANDLE rTimer = NULL;
    HANDLE fTimer = NULL;
    LARGE_INTEGER liDueTime;

    // Counts the 't' characters in the first 70 bytes of buf.
    void char_t(void*)
    {
      _asm {
          mov  EDX, 0               ; EDX = number of matches
          lea  ESI, buf             ; ESI -> current character
          mov  ECX, 70              ; ECX = characters left to check
          mov  AL, 't'              ; AL  = character to look for
      next_char:
          cmp  AL, BYTE PTR [ESI]
          jne  inc_address
          inc  EDX                  ; match found
      inc_address:
          inc  ESI
          dec  ECX
          jnz  next_char
          lea  EAX, num_t
          mov  DWORD PTR [EAX], EDX ; store the result in num_t
      }
      // Notify the main thread through a manual-reset waitable timer.
      tTimer = CreateWaitableTimer(NULL, TRUE, "tTimer");
      SetWaitableTimer(tTimer, &liDueTime, 0, NULL, NULL, 0);
    }

    // Counts the 'f' characters in the first 70 bytes of buf.
    void char_f(void*)
    {
      _asm {
          mov  EDX, 0
          lea  ESI, buf
          mov  ECX, 70
          mov  AL, 'f'
      next_char:
          cmp  AL, BYTE PTR [ESI]
          jne  inc_address
          inc  EDX
      inc_address:
          inc  ESI
          dec  ECX
          jnz  next_char
          lea  EAX, num_f
          mov  DWORD PTR [EAX], EDX
      }
      fTimer = CreateWaitableTimer(NULL, TRUE, "fTimer");
      SetWaitableTimer(fTimer, &liDueTime, 0, NULL, NULL, 0);
    }

    // Counts the 'r' characters in the first 70 bytes of buf.
    void char_r(void*)
    {
      _asm {
          mov  EDX, 0
          lea  ESI, buf
          mov  ECX, 70
          mov  AL, 'r'
      next_char:
          cmp  AL, BYTE PTR [ESI]
          jne  inc_address
          inc  EDX
      inc_address:
          inc  ESI
          dec  ECX
          jnz  next_char
          lea  EAX, num_r
          mov  DWORD PTR [EAX], EDX
      }
      rTimer = CreateWaitableTimer(NULL, TRUE, "rTimer");
      SetWaitableTimer(rTimer, &liDueTime, 0, NULL, NULL, 0);
    }

    int _tmain(int argc, _TCHAR* argv[])
    {
      printf("OPTIMIZING OF MULTITHREADING APPLICATION WITH ASM DEMO\n\n");
      if ((fp = fopen("d:\\testchar", "r")) == NULL)
      {
        printf("The file 'testchar' was not opened\n");
        exit(1);
      }
      numread = fread(buf, sizeof(char), 70, fp);
      fclose(fp);
      printf("           First 70 chars :\n\n");
      printf("%.70s\n", buf);

      // Relative due time of 10 ms: 100-ns units, negative means relative.
      liDueTime.QuadPart = -100000;

      _beginthread(char_f, 0, NULL);
      _beginthread(char_t, 0, NULL);
      _beginthread(char_r, 0, NULL);

      // Each loop retries until the wait succeeds; until a thread has created
      // its timer, the handle is still NULL and the wait fails at once.
      while (WaitForSingleObject(fTimer, INFINITE) != WAIT_OBJECT_0);
      CancelWaitableTimer(fTimer);
      while (WaitForSingleObject(tTimer, INFINITE) != WAIT_OBJECT_0);
      CancelWaitableTimer(tTimer);
      while (WaitForSingleObject(rTimer, INFINITE) != WAIT_OBJECT_0);
      CancelWaitableTimer(rTimer);

      printf("\nNumber of 'f' characters=%d\n", num_f);
      printf("\nNumber of 't' characters=%d\n", num_t);
      printf("\nNumber of 'r' characters=%d\n", num_r);
      MessageBox(NULL, "Searching completed!", "FIND CHARS", MB_OK);
      return 0;
    }

Each of the three threads is started with the _beginthread function. The functions char_f, char_t, and char_r, which are nearly identical and written almost entirely in the assembler, are passed to this function in turn as the first parameter. Waitable timer objects are used to synchronize the threads. Why?

When several threads are started in an application, a thread may not finish processing its data by the time the application needs the result, and the application then cannot run correctly. To synchronize data processing between the application and its individual threads, each thread must inform the application that it has completed data processing and that the data are ready.

This and other issues of synchronizing applications and threads are related to system programming and are rather complicated. We will not focus on these issues in too much detail; however, we will concentrate on one of the possible solutions to the synchronization problem using the waitable timer, specifically on its practical implementation.

First, a waitable timer object is created with the CreateWaitableTimer function. The function returns a handle to the waitable timer object. This handle can be passed to so-called wait functions, which block the calling thread. Although this is a simplified explanation, it will help you understand the key concepts. A wait function does not return control to the calling code until a certain condition is satisfied. Most often, that condition is that the synchronization object becomes signaled; the object is then said to be in the signaled state. In our case, the synchronization object is the waitable timer.

Now, we will look at how the main process interacts with one of its threads in our example:

A waitable timer object is created with the CreateWaitableTimer function in the thread and started with SetWaitableTimer. Once the specified time interval elapses (in our listings, the timer is started after the thread has completed data processing), the waitable timer object enters the signaled state.
The wait function of the main thread or process (WaitForSingleObject in our example) checks the synchronization object (the waitable timer). If the object has entered the signaled state, the wait function returns the WAIT_OBJECT_0 value. After receiving this value, the main process can consider the called thread complete and start processing the received data. A timer that is no longer needed is then deactivated with the CancelWaitableTimer function; the handle itself is released with CloseHandle.
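
The two steps above can be sketched without the Win32 API. The following is a minimal portable analogy, not the technique used in the listings: a mutex and a condition variable from standard C++11 stand in for the waitable timer, and the names CompletionSignal and run_and_wait are hypothetical. One deliberate difference is that the synchronization object is created before the worker thread starts, so the waiting side never sees an object that does not exist yet.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Portable stand-in for a manual-reset synchronization object:
// once set, it stays signaled and every later wait returns at once.
class CompletionSignal {
public:
    // Moves the object to the signaled state and wakes all waiters.
    void set()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            signaled_ = true;
        }
        changed_.notify_all();
    }

    // Blocks the caller until the object is signaled.
    void wait()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        changed_.wait(lock, [this] { return signaled_; });
    }

private:
    std::mutex mutex_;
    std::condition_variable changed_;
    bool signaled_ = false;
};

// Runs `work` on a worker thread and returns its result after the
// worker has signaled that the data are ready.
int run_and_wait(int (*work)())
{
    CompletionSignal done;      // created BEFORE the thread starts
    int result = 0;

    std::thread worker([&result, &done, work] {
        result = work();        // step 1: process the data...
        done.set();             // ...then enter the signaled state
    });

    done.wait();                // step 2: the main thread waits here
    worker.join();              // release the thread itself
    return result;
}
```

The mutex inside the signal also orders the write to result before the read in the waiting thread, which is the guarantee the main program relies on when it starts processing the received data.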

In practice, the procedure is as follows. A new waitable timer object is created with the CreateWaitableTimer function in each of the threads. The function returns the object handle that is used later to access the waitable timer object. After that, the timer is activated with the SetWaitableTimer function. One of this function's parameters is a pointer to a variable that contains the due time, at which the timer object will enter the signaled state. In our case, this value is written to the liDueTime variable. It is measured in units of 100 nanoseconds (0.1 microsecond), and its sign matters: a negative value is an interval relative to the call, while a positive value is an absolute time. The length of the interval is chosen at will; a relative interval of 10 milliseconds corresponds to a value of -100000.
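
Because the unit and the sign are easy to get wrong, the conversion can be isolated in a small helper. This is only a sketch: the function name relative_due_time_from_ms is hypothetical, and the code assumes nothing beyond the 100-nanosecond unit and the negative-means-relative convention of SetWaitableTimer.

```cpp
#include <cstdint>

// Converts a delay in milliseconds into the value expected in
// LARGE_INTEGER::QuadPart for a RELATIVE due time:
// the delay expressed in 100-nanosecond units, negated.
std::int64_t relative_due_time_from_ms(std::int64_t milliseconds)
{
    const std::int64_t units_per_ms = 10000;   // 1 ms = 10,000 x 100 ns
    return -(milliseconds * units_per_ms);
}
```

In the listings, the result would be assigned as liDueTime.QuadPart = relative_due_time_from_ms(10); before the threads are started.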

The code fragments for each of the threads are almost the same:

    . . .
          jnz  next
          fwait
      };
      yTimer = CreateWaitableTimer(NULL, TRUE, "yTimer");
      SetWaitableTimer(yTimer, &liDueTime, 0, NULL, NULL, 0);
    }
    . . .

In our program, the time interval after which the timer of the auxiliary thread becomes signaled is meant to be 10 milliseconds. Although this value was chosen at will, remember that when developing such applications this parameter should depend on the performance of the thread: if a timer is started before the work is done, the required operations may not all finish before the signal is set. (In the listings, each thread starts its timer only after its assembly block has finished, so the interval merely delays the notification.) It is not by chance that the assembler is used intensively in such applications. Because assembly code is fast, it lets several threads work with a minimum synchronization time.

The character search algorithm implemented in the assembly block is rather simple, and you will easily make sense of it.
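
For readers who want to check the assembly block against plain C++, the loop can be written as follows. This is an equivalent sketch rather than code from the listing: the function name count_char is hypothetical, the comments map each statement to the instruction it mirrors, and the result is returned instead of being stored in a global such as num_t.

```cpp
#include <cstddef>

// Counts how many of the first `length` bytes of `text` equal `target`.
std::size_t count_char(const char* text, std::size_t length, char target)
{
    std::size_t count = 0;                        // mov EDX, 0
    for (std::size_t i = 0; i < length; ++i) {    // inc ESI / dec ECX / jnz
        if (text[i] == target) {                  // cmp AL, BYTE PTR [ESI]
            ++count;                              // inc EDX
        }
    }
    return count;                                 // the listing stores EDX in memory
}
```

Unlike the assembly loop, which always executes at least once, this version also handles a length of zero.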

When the WaitForSingleObject wait function is called by the main thread or process, it checks whether the object is signaled. If it is not, the calling thread enters the wait state, in which it uses little or no processor time. When the object becomes signaled, the application resumes its work. As soon as the wait function returns, the waitable timer object is canceled with the CancelWaitableTimer function, which takes the handle of the waitable timer object as a parameter. Here is a piece of code of the main program:

    . . .
    while (WaitForSingleObject(tTimer, INFINITE) != WAIT_OBJECT_0);
    CancelWaitableTimer(tTimer);
    . . .

This program requires the multithreaded run-time library. For example, if the program is in the mythreads.cpp file and you compile it from the command line, type

    cl /MT /D "_X86_" mythreads.cpp

If you compile in the Visual C++ .NET 2003 environment, set the /MT option manually using the following procedure:

Select Properties in the Project menu.
Select the Configuration Properties / C/C++ / Code Generation / Runtime Library page.
Set the /MT compiler option.

The window of the program is shown in Fig. 12.1.

Fig. 12.1: Window of a program demonstrating three threads

Here is another example of a multithread application. The application computes the quotient of the square root of one variable divided by the square root of another. The two variables occupy the same position in two equal-length arrays of floating-point numbers.

The computation is implemented as follows:

One thread computes the square root of the variable from the first array (f1).
The other thread computes the square root of the variable from the second array (f2).
After the threads complete computation, the main program uses the obtained data for further computation (division).
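
The three steps above can be expressed in portable C++ as a reference against which the output of the assembly version can be compared. This is a sketch under stated assumptions: std::thread and join (standard C++11) are used here in place of _beginthread and the waitable timers, and the function names sqrt_in_place and sqrt_quotients are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Replaces every element with its square root (the job of one worker thread).
void sqrt_in_place(std::vector<float>& values)
{
    for (float& v : values) {
        v = std::sqrt(v);
    }
}

// Computes sqrt(f1[i]) / sqrt(f2[i]) for every index both arrays share.
// The arrays are taken by value, so the caller's data stay unchanged.
std::vector<float> sqrt_quotients(std::vector<float> f1, std::vector<float> f2)
{
    // Steps 1 and 2: one thread per array.
    std::thread first(sqrt_in_place, std::ref(f1));
    std::thread second(sqrt_in_place, std::ref(f2));

    // join() returns only after the thread has finished, so the
    // square roots are guaranteed to be ready before the division.
    first.join();
    second.join();

    // Step 3: the main thread divides the results element by element.
    const std::size_t count = std::min(f1.size(), f2.size());
    std::vector<float> quotients(count);
    for (std::size_t i = 0; i < count; ++i) {
        quotients[i] = f1[i] / f2[i];
    }
    return quotients;
}
```

Waiting on the threads themselves removes the need to choose a time interval, which is why this form is convenient for verifying results even if the timer-based version is kept for the chapter's purposes.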

This example is similar to the previous one, but it demonstrates how floating-point variables can be processed. Since floating-point operations take much longer than operations on characters, be sure to carefully select the time interval for synchronizing the waitable timer objects with the main process. The source code of the console application is shown in Listing 12.2.

Listing 12.2: Performing mathematical operations in a two-thread application

    // THREAD_EXAMPLE_FPU.cpp : Defines the entry point for the console
    // application.
    // SQRT(f1)/SQRT(f2);
    #include "stdafx.h"
    #include <stdio.h>      // printf
    #include <tchar.h>      // _tmain, _TCHAR
    #include <windows.h>
    #include <process.h>    // _beginthread

    FILE *fp;
    float f1[7] = {34.13, 96.03, 234.1, 954.25, 54.103, 3.14, 8.33};
    float f2[7] = {67.11, 23.12, 5.87, 76.32, 19.43, 67.11, 5.09};
    float fdv[7];
    HANDLE xTimer = NULL;
    HANDLE yTimer = NULL;
    LARGE_INTEGER liDueTime;

    // Replaces each element of f1 with its square root.
    void sqrt_x(void*)
    {
      _asm {
          lea  ESI, f1
          mov  ECX, 7               ; number of elements
          finit
          fldz
      next:
          fld  DWORD PTR [ESI]      ; load the element
          fsqrt
          fstp DWORD PTR [ESI]      ; store the root in place
          add  ESI, 4               ; next float
          dec  ECX
          jnz  next
          fwait
      };
      xTimer = CreateWaitableTimer(NULL, TRUE, "xTimer");
      SetWaitableTimer(xTimer, &liDueTime, 0, NULL, NULL, 0);
    }

    // Replaces each element of f2 with its square root.
    void sqrt_y(void*)
    {
      _asm {
          lea  EDI, f2
          mov  ECX, 7
          finit
          fldz
      next:
          fld  DWORD PTR [EDI]
          fsqrt
          fstp DWORD PTR [EDI]
          add  EDI, 4
          dec  ECX
          jnz  next
          fwait
      };
      yTimer = CreateWaitableTimer(NULL, TRUE, "yTimer");
      SetWaitableTimer(yTimer, &liDueTime, 0, NULL, NULL, 0);
    }

    int _tmain(int argc, _TCHAR* argv[])
    {
      printf("OPTIMIZING OF MULTITHREADING APPLICATION WITH ASM FPU \n\n");
      printf("\nf1  : ");
      for (int cnt = 0; cnt < 7; cnt++)
        printf("%.3f ", f1[cnt]);
      printf("\nf2 : ");
      for (int cnt = 0; cnt < 7; cnt++)
        printf("%.3f ", f2[cnt]);

      // Relative due time of 10 ms: 100-ns units, negative means relative.
      liDueTime.QuadPart = -100000;

      _beginthread(sqrt_x, 0, NULL);
      _beginthread(sqrt_y, 0, NULL);

      while (WaitForSingleObject(xTimer, INFINITE) != WAIT_OBJECT_0);
      CancelWaitableTimer(xTimer);
      while (WaitForSingleObject(yTimer, INFINITE) != WAIT_OBJECT_0);
      CancelWaitableTimer(yTimer);

      // Divide the roots element by element: fdv[i] = f1[i] / f2[i].
      _asm {
          lea  ESI, DWORD PTR f1
          lea  EDI, DWORD PTR f2
          lea  EDX, DWORD PTR fdv
          mov  ECX, 7
          finit
          fldz
      next:
          fld  DWORD PTR [ESI]
          fld  DWORD PTR [EDI]
          fdiv                      ; ST(1) / ST(0), then pop
          fstp DWORD PTR [EDX]
          add  ESI, 4
          add  EDI, 4
          add  EDX, 4
          dec  ECX
          jnz  next
          fwait
      };

      printf("\n\nSQRT(f1)  :");
      for (int cnt = 0; cnt < 7; cnt++)
        printf("%.3f ", f1[cnt]);
      printf("\nSQRT(f2) : ");
      for (int cnt = 0; cnt < 7; cnt++)
        printf("%.3f ", f2[cnt]);
      printf("\n\nSQRT(f1) / SQRT(f2) : ");
      for (int cnt = 0; cnt < 7; cnt++)
        printf("%.3f ", fdv[cnt]);
      MessageBox(NULL, "Calculations completed!", "FIND SQRT", MB_OK);
      return 0;
    }

Computation is done by the main process and two auxiliary threads. The threads compute the square roots of the elements of two arrays, and the main process finds the quotient of division of one value by the other. The threads are started with the following statements.

    _beginthread(sqrt_x, 0, NULL);
    _beginthread(sqrt_y, 0, NULL);

One of the threads executes the sqrt_x function, while the other executes the sqrt_y function. After finishing its computation, each function creates a waitable timer object and starts it; when the timer becomes signaled, the corresponding WaitForSingleObject call in the main thread returns. The main thread then sets each waitable timer to the inactive state with CancelWaitableTimer:

    while (WaitForSingleObject(xTimer, INFINITE) != WAIT_OBJECT_0);
    CancelWaitableTimer(xTimer);
    while (WaitForSingleObject(yTimer, INFINITE) != WAIT_OBJECT_0);
    CancelWaitableTimer(yTimer);

The results of the auxiliary threads' work are used by the main process for further computation in the assembly block.

The window of the program is shown in Fig. 12.2.

Fig. 12.2: Window of a program demonstrating mathematical operations in two threads

Multithreading is a complex topic, and it can be implemented in various ways. In this chapter, we described one of them: synchronization with waitable timers. Using the assembler for the computational core of such programs can noticeably improve their performance.

These examples of simple programs demonstrate the use of the assembler and can help programmers when writing applications with intensive computation requiring parallel execution of operations.