Data Alignment

Processors work most efficiently with data that is properly aligned. In the case of the SSE or better instruction set, a misaligned 128-bit load is not one load but two. Processors have been designed for efficient operation, so internally the data is never loaded misaligned; it is loaded 128-bit aligned, and the requested 128 bits are then assembled by shifting and combining the halves of two aligned 128-bit loads. This means that instead of loading only the needed 16 bytes, the processor loads 32 bytes. This misalignment is very inefficient and time consuming!
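To make the cost concrete, the following is a minimal sketch in C that counts how many aligned blocks a memory access touches. The function name blocks_touched and its parameters are made up for illustration; it models only the address arithmetic, not what any particular processor does internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: how many aligned blocks of `block` bytes does an
   access of `size` bytes starting at address `addr` touch?
   `block` must be a power of two. */
static unsigned blocks_touched(uintptr_t addr, size_t size, size_t block)
{
    assert(size > 0 && block > 0 && (block & (block - 1)) == 0);

    uintptr_t mask  = ~(uintptr_t)(block - 1);
    uintptr_t first = addr & mask;               /* block holding the first byte */
    uintptr_t last  = (addr + size - 1) & mask;  /* block holding the last byte  */

    return (unsigned)((last - first) / block) + 1;
}
```

A 16-byte access at an address that is a multiple of 16 touches one block; the same access starting 4 bytes later straddles two blocks, which corresponds to the 32 bytes described above.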

The first item on the agenda is the alignment of data values. Pointers are typically 4-byte aligned on a 32-bit processor; 64-bit data requires 8-byte alignment; 128-bit data requires 16-byte alignment.
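One way to check whether a given address meets such a requirement is to test its low bits. This is a sketch only; is_aligned is a hypothetical name, and the alignment passed in is assumed to be a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if pointer p sits on an `align`-byte boundary.
   `align` must be a power of two (4, 8, 16, ...). */
static bool is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (uintptr_t)(align - 1)) == 0;
}
```

For example, a buffer that starts on a 16-byte boundary passes is_aligned(buf, 16), while buf + 4 passes only the 4-byte test.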

 #define ALIGN2(len) (((len) + 1) & ~1)     // round up to 2 items
 #define ALIGN4(len) (((len) + 3) & ~3)     // round up to 4 items
 #define ALIGN8(len) (((len) + 7) & ~7)     // round up to 8 items
 #define ALIGN16(len) (((len) + 15) & ~15)  // round up to 16 items

These can easily be used to align a count of bytes (and, by implication, a count of bits in multiples of 8). So to align to 16 bytes:

 nWidth = ALIGN16(nWidth); // 128-bit alignment! 
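As a quick illustration of what the rounding does to actual values, here is a small sketch. The helper padded_row_bytes is hypothetical and stands in for any width that needs 16-byte padding; the macros are repeated so the fragment compiles on its own.

```c
#include <assert.h>

#define ALIGN4(len)  (((len) + 3) & ~3)     /* round up to a multiple of 4  */
#define ALIGN16(len) (((len) + 15) & ~15)   /* round up to a multiple of 16 */

/* Hypothetical helper: width of a row once padded to a 16-byte multiple. */
static unsigned padded_row_bytes(unsigned nWidth)
{
    unsigned nPadded = ALIGN16(nWidth);

    /* Sanity check: never shrinks, always a multiple of 16
       (fails if nWidth is so large that the addition wraps around). */
    assert(nPadded >= nWidth && nPadded % 16 == 0);
    return nPadded;
}
```

A width of 0 stays 0, widths 1 through 16 become 16, and 17 becomes 32. Values that are already aligned are left unchanged.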

Some of you may note that the basic equation of these macros:

 (A, X)    (((A) + (X)) & ~(X)) 

relies on a byte count of 2^N so that a logical AND can be taken to advantage and could possibly be replaced with:

 (A, X)    (((A) + (X)) - (((A) + (X)) % ((X) + 1)))

and be easier to read, but that would be giving too much credit to the C compiler: some will perform a division for the modulus, while others will recognize the binary mask and use a mere logical AND operation. Even though this latter code may read more clearly, it may not be compiled as fast code. If you are not sure what your compiler does, set a breakpoint at a macro, then either expand the macro or display mixed C and assembly code. The division or logical AND will be right near where your instruction pointer (IP) is pointing within the code.

A macro using this alternate method would be:

 #define ALIGN(len, bytes) (((len) + ((bytes)-1)) - (((len) + ((bytes)-1)) % (bytes)))
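To see that a mask-based and a modulus-based round-up agree when the byte count is 2^N, the two can be written as functions and compared. This is a sketch with hypothetical names (align_mask, align_mod). The modulus version bumps the length by bytes - 1 and then subtracts the remainder, so it also works for counts that are not powers of two.

```c
#include <assert.h>

/* Round len up to a multiple of bytes using a bit mask.
   Only valid when bytes is a power of two. */
static unsigned align_mask(unsigned len, unsigned bytes)
{
    assert(bytes != 0 && (bytes & (bytes - 1)) == 0);
    return (len + (bytes - 1)) & ~(bytes - 1);
}

/* Round len up to a multiple of bytes using a modulus.
   Valid for any nonzero bytes. */
static unsigned align_mod(unsigned len, unsigned bytes)
{
    assert(bytes != 0);
    unsigned bumped = len + (bytes - 1);
    return bumped - (bumped % bytes);   /* strip the remainder */
}
```

Functions are used here rather than macros so that each argument is evaluated only once.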

This is a little obscure and typically unknown by non-console developers, but CD sector size alignment is needed for all files destined to be loaded directly from a CD or DVD, as they are typically loaded by number of sectors rather than number of bytes, and a sector is typically 2048 or 2336 bytes in size. All these require some sort of alignment correction jig.

 // round up 1 CD sector
 #define ALIGN2048(len) (((len) + 2047) & ~2047)

Sometimes CD sectors are not 2048 bytes but 2336. Since this is not 2^N, a modulus (%) must be used since simple bit masking will not work.

 #define ALIGN2336(len) (((len) + 2335) - (((len) + 2335) % 2336))
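Since files are read by the sector, it can help to compute both the sector count and the padded size. The sketch below uses hypothetical helper names (sectors_needed, padded_file_size) and works for either sector size because it divides instead of masking.

```c
#include <assert.h>

/* Hypothetical helper: whole sectors needed to hold len bytes. */
static unsigned long sectors_needed(unsigned long len, unsigned long sector_size)
{
    assert(sector_size != 0);
    return (len + sector_size - 1) / sector_size;
}

/* Hypothetical helper: file size once padded out to whole sectors. */
static unsigned long padded_file_size(unsigned long len, unsigned long sector_size)
{
    return sectors_needed(len, sector_size) * sector_size;
}
```

A 5000-byte file needs three sectors at either size, which pads it to 6144 bytes with 2048-byte sectors and to 7008 bytes with 2336-byte sectors.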

The correction is as simple as:

 void *foo(uint nReqSize)
 {
    uint nSize;

    nSize = ALIGN16(nReqSize);
      :
      :
    // Insert your other code here!
 }
Tip  

There is a simple trick to see if your value is 2^N.

 (A & (A - 1)) == 0 ?

Subtracting a value of 1 before a logical AND would skew the bits. If only one bit is set, then the result of the AND would be 0. If more than one bit is set, the result would be non-zero. (A value of zero also produces 0, so it must be ruled out separately.)
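The trick translates directly into C. This sketch uses the hypothetical names is_power_of_two and mask_for, and it excludes zero explicitly, since zero has no bits set and would otherwise pass the AND test.

```c
#include <assert.h>
#include <stdbool.h>

/* True when exactly one bit of a is set, that is, a is 2^N. */
static bool is_power_of_two(unsigned a)
{
    return a != 0 && (a & (a - 1)) == 0;
}

/* Hypothetical helper: mask usable for AND-based rounding to `bytes`.
   Refuses byte counts that are not 2^N. */
static unsigned mask_for(unsigned bytes)
{
    assert(is_power_of_two(bytes));
    return bytes - 1;
}
```

A sector size of 2048 passes the test, while 2336 does not, which is why the latter needs the modulus form of alignment.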

The requested size is stretched to the appropriate sized block. This really comes in handy when building dedicated relational databases in tools for use in games.

Goal  

Ensure properly aligned data.

I have had several incidents over the years with compilers and misaligned data. Calls to the C++ new operator or the C function malloc() returned memory on a 4-byte alignment, but when working with 64-bit MMX or some 128-bit SSE instructions there would be unaligned memory stall problems. Some instructions cause a misalignment exception error, while others just cause memory stalls. The 80x86 is more forgiving than other processors, as most of its memory accesses can be misaligned without raising an exception (though not without a cost in speed), but there are SIMD instructions that require proper memory alignment or an exception will occur. Thus it is always best to ensure memory is properly aligned. The PowerPC and MIPS processors require that memory be properly aligned. For cross-platform portability, it is very important to ensure that your data is properly aligned at all times, whether you know your application will be ported or not.
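When the allocator cannot be trusted to return 16-byte aligned memory, one common workaround is to over-allocate and round the pointer up. The following is a minimal sketch under that assumption. The names aligned_malloc_sketch and aligned_free_sketch are hypothetical, the alignment must be 2^N, and overflow of the size calculation is not checked.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns at least `size` bytes aligned to `align`
   (a power of two), or NULL on failure. Release with aligned_free_sketch(). */
static void *aligned_malloc_sketch(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    /* Extra room: worst-case shift plus a slot for the raw pointer. */
    void *raw = malloc(size + (align - 1) + sizeof(void *));
    if (raw == NULL)
        return NULL;

    uintptr_t start   = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + (align - 1)) & ~(uintptr_t)(align - 1);

    ((void **)aligned)[-1] = raw;   /* remember what malloc() gave us */
    return (void *)aligned;
}

/* Frees a block obtained from aligned_malloc_sketch(). */
static void aligned_free_sketch(void *p)
{
    if (p != NULL)
        free(((void **)p)[-1]);
}
```

Where they are available, standard facilities such as C11 aligned_alloc() or POSIX posix_memalign() serve the same purpose without the bookkeeping.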

The 80x86 has an alignment mask (AM) bit in the CR0 register that, together with the alignment check (AC) flag in EFLAGS, can be enabled to verify all memory is aligned properly. (Use with caution unless you are writing your own board support package.)

The first half of the remedy is easy. Just make sure your vectors are a data type with a block size set to 16 bytes and that your compiler is set to 16-byte alignment, not 1-byte alignment or the default, even if using 64-bit MMX-based instructions. The smart move is to ensure that all data structures are padded so they will still be aligned properly even if the compiler alignment gets set to 1 byte. This is a safety factor that will make your life easier, especially if the code is ever ported to other platforms such as UNIX. Normally one would manually pack the data elements by their size to ensure proper alignment and insert (future) pad bytes where appropriate, but by also adjusting the alignment in the compiler you can ensure that ported applications using different compilers will export proper data files and network messages.
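Manual padding of this kind can be verified at compile time. The sketch below uses made-up structures (Vec4, Entity) and assumes a 4-byte float and a C11 compiler for static_assert. Every member sits at an offset that is a multiple of its own size, so the layout stays the same even under 1-byte packing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 128-bit vector: four 4-byte floats, 16 bytes total. */
typedef struct {
    float x, y, z, w;
} Vec4;

/* Hypothetical record, ordered by size and padded by hand so that the
   vector lands on a 16-byte offset and the total size is a multiple of 16. */
typedef struct {
    uint32_t id;       /* offset 0                             */
    uint16_t flags;    /* offset 4                             */
    uint8_t  kind;     /* offset 6                             */
    uint8_t  pad[9];   /* offsets 7..15, explicit (future) pad */
    Vec4     pos;      /* offset 16                            */
} Entity;

/* Compile-time checks: the build fails if the layout ever drifts. */
static_assert(sizeof(Vec4) == 16, "Vec4 must be 16 bytes");
static_assert(offsetof(Entity, pos) == 16, "pos must start on a 16-byte offset");
static_assert(sizeof(Entity) % 16 == 0, "Entity size must be a multiple of 16");
```

These checks cover offsets and sizes within the structure only; the address at which each Entity is placed must still be aligned by the allocator or the compiler.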

Notice the "Struct member alignment" field in the following property page for Project Settings in Visual Studio version 6 and Visual C++.NET. The default is 8 bytes, which is denoted by the asterisk, but 16 bytes is required for vector programming.

Figure 2-1: Visual C++ (version 6) memory alignment property page
Figure 2-2: Visual C++ .NET (version 7) memory alignment property page

You should get into the habit of always setting your memory alignment to 16 for any new project. It will help your application even if it uses only scalar instructions and no SIMD-based instructions.



32/64-Bit 80x86 Assembly Language Architecture
ISBN: 1598220020
Year: 2003
Pages: 191
