The more recent Visual C++ and Intel compilers support a method of programming in assembly language referred to as intrinsics. With intrinsics, the functionality of the SIMD instructions is wrapped in C-callable functions, which the compiler expands into inline code. Let us examine the following example:
    void test(float *c, float a, float b)
    {
        *c = a + b;
    }
Intrinsics are a quick way to get code up and running. The following code uses intrinsics with the __m128 data type, which maps onto the XMM registers, and the SSE single-precision floating-point instructions. It looks more complicated than the original only because I chose a simple scalar expression to resolve.
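    #include <xmmintrin.h>

    void test(float *c, float a, float b)
    {
        __m128 ta, tb;

        /* _mm_load_ps is a 128-bit aligned load: it reads four floats
           starting at the given address, which is expected to be
           16-byte aligned. Only the first float is a (or b). */
        ta = _mm_load_ps(&a);
        tb = _mm_load_ps(&b);

        ta = _mm_add_ps(ta, tb);    /* four single-precision adds at once */

        /* _mm_store_ps writes all four floats, so c must point to
           16-byte-aligned storage for four floats. */
        _mm_store_ps(c, ta);
    }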
But underneath, in the pure assembly code generated by the compiler, this breaks down to something similar to the following:
        push    ebx
        mov     ebx,esp
        sub     esp,8
        and     esp,0FFFFFFF0h          ; 16-byte align stack
        add     esp,4
        push    ebp
        mov     ebp,dword ptr [ebx+4]
        mov     dword ptr [esp+4],ebp
        mov     ebp,esp
        sub     esp,98h
        push    esi
        push    edi

    ; __m128 ta, tb
    ; ta = _mm_load_ps(&a);
        lea     eax,[ebx+0Ch]
        movaps  xmm0,xmmword ptr [eax]
        movaps  xmmword ptr [ebp-30h],xmm0
        movaps  xmm0,xmmword ptr [ebp-30h]
        movaps  xmmword ptr [ebp-10h],xmm0

    ; tb = _mm_load_ps(&b);
        lea     eax,[ebx+10h]
        movaps  xmm0,xmmword ptr [eax]
        movaps  xmmword ptr [ebp-40h],xmm0
        movaps  xmm0,xmmword ptr [ebp-40h]
        movaps  xmmword ptr [ebp-20h],xmm0

    ; ta = _mm_add_ps(ta, tb);
        movaps  xmm0,xmmword ptr [ebp-20h]
        movaps  xmm1,xmmword ptr [ebp-10h]
        addps   xmm1,xmm0
        movaps  xmmword ptr [ebp-50h],xmm1
        movaps  xmm0,xmmword ptr [ebp-50h]
        movaps  xmmword ptr [ebp-10h],xmm0

    ; _mm_store_ps(c, ta);
        movaps  xmm0,xmmword ptr [ebp-10h]
        mov     eax,dword ptr [ebx+8]
        movaps  xmmword ptr [eax],xmm0

        pop     edi
        pop     esi
        mov     esp,ebp
        pop     ebp
        mov     esp,ebx
        pop     ebx
        ret
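The listing uses the packed instructions (movaps, addps), which move and add four floats at a time, even though the C expression involves a single float. As a minimal sketch of the two ways this could be written more carefully, the following code shows a scalar form built on the `_ss` intrinsics and a packed form that works on whole groups of four floats. The function names add_scalar and add_packed are hypothetical and chosen only for illustration; the alignment assertions are an added safeguard, not part of the original example.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_*_ss, _mm_*_ps */

/* Scalar form (hypothetical name): only the lowest lane is set, added,
   and stored, so exactly one float is written to *c and no alignment
   is required. */
void add_scalar(float *c, float a, float b)
{
    __m128 ta = _mm_set_ss(a);              /* lanes: a, 0, 0, 0 */
    __m128 tb = _mm_set_ss(b);              /* lanes: b, 0, 0, 0 */

    _mm_store_ss(c, _mm_add_ss(ta, tb));    /* store lowest lane only */
}

/* Packed form (hypothetical name): four additions in one instruction.
   Each pointer must refer to four floats at a 16-byte-aligned address,
   as _mm_load_ps and _mm_store_ps require. */
void add_packed(float *c, const float *a, const float *b)
{
    assert(((uintptr_t)a & 15) == 0);
    assert(((uintptr_t)b & 15) == 0);
    assert(((uintptr_t)c & 15) == 0);

    __m128 ta = _mm_load_ps(a);             /* a[0..3] */
    __m128 tb = _mm_load_ps(b);             /* b[0..3] */

    _mm_store_ps(c, _mm_add_ps(ta, tb));    /* c[i] = a[i] + b[i] */
}
```

A caller of add_packed would declare its buffers with an explicit alignment, for example `_Alignas(16) float a[4]` in C11 or `__declspec(align(16))` in Visual C++. The scalar form is the closer match to the original `*c = a + b;`, while the packed form is where the SIMD instructions pay off, since one add handles four elements.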