Stacks and Vectors | 32/64-Bit 80x86 Assembly Language Architecture

Never, never, never pass packed data on the stack as a data argument; always pass a pointer to the packed data instead. Even setting aside issues such as faster stack consumption, the need for more stack space, and security concerns such as code being pushed onto the stack, there is no guarantee that the data will be properly aligned. There are exceptions, but they are not portable. Do not declare packed data as local stack variables either. In assembly language, the stack pointer can be "corrected" on entry and then "uncorrected" before returning. A better (portable, cross-platform) solution is to declare a buffer large enough for the 16-byte aligned data elements and pad it with an extra 12 bytes of memory. The vectors are then aligned to 128-bit (16-byte) data blocks within that buffer.

This is only one way of aligning memory. Padding the buffer by a fixed amount, so that an address that is a multiple of 16 can always be found inside it, corrects the alignment and improves processing speed as well.

I realize that padding or aligning memory to 16 bytes initially appears to be crude, but it delivers the functionality you require through the use of pointers, and it is cross-platform compatible.

3D Vector (Floating-Point)

    typedef struct vmp3DVector
    {
        float x;
        float y;
        float z;
    } vmp3DVector;

    // Three 96-bit vectors aligned to 128 bits each, thus four floats each,
    // so 3 * 4 = 12 floats, but add three extra floats (+3) to handle a
    // possible misaligned offset of {4, 8, 12}. Once the first is
    // aligned, all other four-float blocks will be aligned as well!

    float vecbuf[(3 * 4) + 3];    // enough space for 3 full vectors
    vmp3DVector *pvA, *pvB, *pvD;

    // Force proper alignment. The casts use uintptr_t (<stdint.h>) rather
    // than int so that pointers are not truncated in 64-bit builds.
    pvA = (vmp3DVector*) ALIGN16((uintptr_t)(vecbuf));
    pvB = (vmp3DVector*) ALIGN16((uintptr_t)(pvA+1));    // +4 float
    pvD = (vmp3DVector*) ALIGN16((uintptr_t)(pvB+1));    // +4 float
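The listing above relies on an ALIGN16 macro that is not defined in this section. The following is a minimal sketch of how such a macro and the buffer carving might look; the macro body and the helper name carve_aligned are assumptions made for illustration, not the author's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed definition: round an address up to the next multiple of 16. */
#define ALIGN16(addr) (((uintptr_t)(addr) + 15u) & ~(uintptr_t)15u)

typedef struct vmp3DVector { float x, y, z; } vmp3DVector;

/* Hypothetical helper: carve `count` 16-byte slots out of a float buffer
   declared with (count * 4) + 3 floats. Returns the number of padding
   bytes skipped at the front (0, 4, 8, or 12 for a 4-byte aligned buffer). */
static unsigned carve_aligned(float *buf, vmp3DVector **out, int count)
{
    uintptr_t base = ALIGN16(buf);
    for (int i = 0; i < count; i++) {
        /* Each vector owns a full 16-byte slot, so every slot stays aligned. */
        out[i] = (vmp3DVector *)(base + (uintptr_t)i * 16u);
        assert(((uintptr_t)out[i] & 15u) == 0);
    }
    return (unsigned)(base - (uintptr_t)buf);
}
```

With float vecbuf[(3 * 4) + 3] and an array of three pointers, carve_aligned(vecbuf, ptrs, 3) yields three aligned slots, and the three extra floats guarantee that the last slot still ends inside the buffer.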

Of course, if you are dealing with quad vectors, align only the first one. All the others, which are the same data type and are already 16 bytes in size, will automatically be aligned.

3D Quad Vector (Floating-Point)

    typedef struct vmp3DQVector
    {
        float x;
        float y;
        float z;
        float w;
    } vmp3DQVector;

    vmp3DQVector *pvC, *pvE, *pvF;

    // Force proper alignment (uintptr_t, from <stdint.h>, is 64-bit safe)
    pvC = (vmp3DQVector*) ALIGN16((uintptr_t)(vecbuf));
    pvE = pvC+1;
    pvF = pvE+1;

The same applies to 4×4 matrices. The following is just a quick and dirty demonstration of aligned three-element vector data structures.

    // Copy vectors to aligned memory
    pvA->x=vA.x;   pvA->y=vA.y;   pvA->z=vA.z;
    pvB->x=vB.x;   pvB->y=vB.y;   pvB->z=vB.z;

    vmp_SIMDEntry();      // x86 FPU/MMX switching

    // if (most likely) non-aligned memory
    vmp_CrossProduct0(&vD, &vA, &vB);

    // if (guaranteed) aligned memory
    vmp_CrossProduct(pvD, pvA, pvB);

    vmp_SIMDExit();       // x86 FPU/MMX switching

Note the convention of the appended zero that distinguishes vmp_CrossProduct0 from vmp_CrossProduct. The zero denotes that the data passed to the function is not guaranteed to be aligned; that is, an address n may leave a nonzero remainder for (n mod 16).
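To make the naming convention concrete, here is a scalar sketch of the two entry points. The function names cross_product0 and cross_product, the helper is_aligned16, and the plain C bodies are stand-ins chosen for illustration; the library's real routines are SIMD implementations whose source is not shown in this section.

```c
#include <assert.h>
#include <stdint.h>

typedef struct vmp3DVector { float x, y, z; } vmp3DVector;

/* Nonzero when the address is a multiple of 16. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* "0" variant: makes no assumption about alignment. */
static void cross_product0(vmp3DVector *d,
                           const vmp3DVector *a, const vmp3DVector *b)
{
    vmp3DVector t;                 /* temporary, so d may alias a or b */
    t.x = a->y * b->z - a->z * b->y;
    t.y = a->z * b->x - a->x * b->z;
    t.z = a->x * b->y - a->y * b->x;
    *d = t;
}

/* Aligned variant: the caller promises 16-byte aligned operands.
   A real SIMD version would use aligned 128-bit loads and stores here. */
static void cross_product(vmp3DVector *d,
                          const vmp3DVector *a, const vmp3DVector *b)
{
    assert(is_aligned16(d) && is_aligned16(a) && is_aligned16(b));
    cross_product0(d, a, b);
}
```

The assert in the aligned variant turns a silent misalignment into an immediate failure in debug builds, which is usually easier to diagnose than a processor alignment fault.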

Another item to keep in mind is that a vector is 12 bytes in length (it is made up of three floats, and a float is four bytes in size), but it is read and written as 16 bytes on a processor with a 128-bit data width. The extra 4-byte float must be preserved. If the trick of 128-bit memory allocation is used, an out-of-bounds error will not occur, since the data is advanced in 16-byte blocks. This fourth float is scratch data and as such is harmless. (No access past the end of the buffer!)
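The 16-byte slot idea can be sketched in plain C as follows. The names SLOT_BYTES, slot_buffer_bytes, and load_slot are hypothetical, and memcpy merely stands in for a 128-bit load; the point is only that a full 16-byte read of the last vector stays inside a buffer sized in 16-byte slots.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct vmp3DVector { float x, y, z; } vmp3DVector;

enum { SLOT_BYTES = 16 };   /* 12 bytes of vector + 4 bytes of scratch */

/* Bytes needed so that each of `count` vectors owns a full 16-byte slot. */
static size_t slot_buffer_bytes(size_t count)
{
    return count * SLOT_BYTES;
}

/* Mimic a 128-bit load: copy all 16 bytes of slot `i`, including the
   scratch float, into out[0..3]. */
static void load_slot(const unsigned char *slots, size_t count,
                      size_t i, float out[4])
{
    assert(i < count);      /* the whole 16-byte read is inside the buffer */
    memcpy(out, slots + i * SLOT_BYTES, SLOT_BYTES);
}
```

Had the buffer been sized as count * sizeof(vmp3DVector) instead, the same 16-byte read of the final element would run 4 bytes past the end.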

There are always exceptions to the rule, and that is the case here as well. Compilers for Motorola's AltiVec instruction set, typically found in Macintosh PowerPC computers, accept the following local stack declaration:

    void Foo(void)
    {
        vector float vD, vA, vB;
    }

Trivia

The PowerPC's AltiVec SIMD instructions never raise an alignment exception, as the four lower address bits, A0 through A3, are always forced to zero. So if your memory is misaligned, your data access will not be. But it definitely will not be where you expected it!
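The effect of forcing the four lower address bits to zero can be modeled with ordinary integer arithmetic. This sketch is not AltiVec code, and the function names are invented for illustration; it only shows which address a masked load would actually touch.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the address a masked 128-bit access really uses:
   the low four bits of the requested address are cleared. */
static uintptr_t altivec_effective_address(uintptr_t requested)
{
    return requested & ~(uintptr_t)15u;
}

/* How many bytes before the requested address the access starts. */
static unsigned altivec_offset_lost(uintptr_t requested)
{
    uintptr_t lost = requested - altivec_effective_address(requested);
    assert(lost < 16);
    return (unsigned)lost;
}
```

For a requested address of 0x100C, the access starts at 0x1000, so the data arrives shifted by 12 bytes relative to what the programmer expected.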

The following vector declaration automatically aligns the data on a 16-byte boundary. The GNU C compiler (GCC) accepts the following definition:

    typedef float FVECTOR[4]
                __attribute__((aligned (16)));

    void Foo(void)
    {
        FVECTOR vD, vA, vB;
    }

I am sorry to say that there is only one method for obtaining a 16-byte aligned stack frame of data within the Visual C++ environment for the 80x86-based Win32 environment, and unfortunately it only works with Visual C++ .NET (version 7.0) or higher, or with version 6 and a processor pack. The following is a snippet from the DirectX header file d3dx8math.h:

    #if _MSC_VER >= 1300    // Visual C++ ver. 7
    #define _ALIGN_16 __declspec(align(16))
    #else
    #define _ALIGN_16       // Earlier compiler may not understand
    #endif                  // this; do nothing.

So that the following could be used:

    vmp3DVector vA;
    __declspec(align(16)) vmp3DVector vB;
    _ALIGN_16 vmp3DVector vC;

The alignment of vA cannot be guaranteed, but vB and vC are aligned on a 16-byte boundary. Codeplay's Vector C and Intel's C++ compilers also support this declaration.
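The GCC and Visual C++ declarations can be folded into one macro so that the same source compiles under either compiler. The macro name VMP_ALIGN16 and the helper functions are hypothetical, and the C11 _Alignas fallback is an addition not mentioned in the text; treat this as a sketch rather than a tested cross-compiler header.

```c
#include <assert.h>
#include <stdint.h>

#if defined(_MSC_VER)
#define VMP_ALIGN16 __declspec(align(16))          /* Visual C++ 7.0+ */
#elif defined(__GNUC__)
#define VMP_ALIGN16 __attribute__((aligned(16)))   /* GCC and compatibles */
#else
#define VMP_ALIGN16 _Alignas(16)                   /* assumed C11 fallback */
#endif

typedef struct vmp3DVector { float x, y, z; } vmp3DVector;

/* Nonzero when the address is a multiple of 16. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Declares an aligned local and verifies that the compiler honored it. */
static float sum_components_aligned(void)
{
    VMP_ALIGN16 vmp3DVector v = { 1.0f, 2.0f, 3.0f };
    assert(is_aligned16(&v));
    return v.x + v.y + v.z;
}
```

Checking the address at run time, as the assert does here, is a cheap way to confirm that a given compiler and set of build flags really deliver the alignment the declaration asks for.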

There is, however, the Macro Assembler (MASM), which has the following:

 align 16

followed by a 16-byte aligned data declaration for Real4 (float).

 vHalf Real4 0.5,0.5,0.5,0.5

Another item that can easily be implemented from within assembly language is a stack correction for 16-byte memory alignment. The stack pointer moves down through memory as items are added to the stack, so by saving it in a secondary stack frame pointer, the stack can be corrected and later restored.

    push ebx
    mov  ebx,esp            ; ebx references passed arguments
    and  esp,0fffffff0h     ; 16-byte align

    ; Insert your code here, referencing locals by [esp-x]

    mov  esp,ebx
    pop  ebx

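The line of assembly, and esp,0fffffff0h, actually snaps the stack pointer down the stack to a 16-byte boundary; thus the next local stack variables will be 16-byte aligned, e.g., [esp-16], [esp-32], etc. Be aware that memory below esp can be overwritten by a push, a call, or an interrupt, so in practice it is safer to reserve the space first (for example, sub esp,32) and address it as [esp] and [esp+16].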
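The arithmetic behind the stack correction can be checked with a small C model of a 32-bit stack pointer. The function name is hypothetical, and the subtraction that reserves the locals is an assumption layered on top of the listing above, which shows only the and instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Model of: and esp,0fffffff0h  followed by  sub esp,16*count.
   Returns the new stack pointer, which addresses the lowest local. */
static uint32_t reserve_aligned_locals(uint32_t esp, uint32_t count)
{
    uint32_t snapped = esp & 0xFFFFFFF0u;      /* snap down to a 16-byte boundary */
    uint32_t new_esp = snapped - 16u * count;  /* room for `count` 128-bit locals */
    assert((new_esp & 15u) == 0);              /* every local is 16-byte aligned */
    assert(new_esp <= esp);                    /* the stack only grows downward */
    return new_esp;
}
```

Note that masking rounds the address down, which suits a stack that grows toward lower addresses, whereas an ALIGN16-style macro applied to a buffer has to round up so that the result stays inside the buffer.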