VERTEX SHADER INSTRUCTIONS | Real-Time Shader Programming (The Morgan Kaufmann Series in Computer Graphics)

Vertex shaders version 1.0 and 1.1 can be up to 128 instructions long, while DirectX 9 shaders can be up to 256 instructions long. These instructions are a collection of general instructions, macroinstructions, and version and constant definition instructions. They are instructions that operate on input registers or constants and can selectively choose which of the register's four float elements to use. These elements in turn can be swizzled, negated, or masked.
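As a rough illustration of what swizzling, negating, and masking mean, the following C++ sketch models a register as four floats. The `Reg` type and the helper names are hypothetical and exist only for this illustration; they are not part of any DirectX API.

```cpp
#include <array>

// Hypothetical model of a four-float shader register: x, y, z, w.
using Reg = std::array<float, 4>;

// Source swizzle: for each output element, pick which source element feeds it.
// Indices 0..3 stand for x, y, z, w, so {1, 2, 0, 3} models ".yzxw".
Reg swizzle(const Reg& src, const std::array<int, 4>& sel) {
    return {src[sel[0]], src[sel[1]], src[sel[2]], src[sel[3]]};
}

// Source negate: models writing "-r0" as an operand.
Reg negate(const Reg& src) {
    return {-src[0], -src[1], -src[2], -src[3]};
}

// Destination write mask: only elements whose flag is true are overwritten.
// {true, false, false, true} models ".xw".
void write_masked(Reg& dest, const Reg& value, const std::array<bool, 4>& mask) {
    for (int i = 0; i < 4; ++i) {
        if (mask[i]) dest[i] = value[i];
    }
}
```

In this model an operand such as `-r1.yzxw` would be `negate(swizzle(r1, {1, 2, 0, 3}))`, and a destination such as `r0.xw` would be a `write_masked` call with the x and w flags set.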

The philosophy of shaders is to keep them simple and straightforward. Thus you'll see a very limited set of instructions, which, with a bit of cleverness, can actually be expanded to a wider variety of uses, as we'll see later. Generally (with a few notable exceptions), each shader instruction corresponds to one clock cycle of execution time. This means that the longer your shader program, the longer it'll take to execute with almost a direct linear correspondence. Macroinstructions are just that, and they can expand up to 12 general instructions, so remember that when you're calculating the shader size and number of clock cycles. On the other hand, the register masks and swizzles add no execution time and should be used freely.

abs(macro)

vs 2.0

This macro computes the absolute value of the input register.

One slot

 _______________________________________________________________________________ abs Dest0, Source0 _______________________________________________________________________________

This macro is equivalent to

 _______________________________________________________________________________ max Dest0, Source0, -Source0 _______________________________________________________________________________

which you can use if you're writing a shader for a version earlier than vertex shader 2.0. In either case, you'll end up with the absolute value of Source0 in Dest0.

Setup One source register, Source0.

Results Dest0 is filled with the absolute value of Source0.

  abs r0 , r0
  abs r0.z, r0.z

  SetSourceRegisters();
  // Simulate the abs macro
  TempReg.x = abs( Source0.x );
  TempReg.y = abs( Source0.y );
  TempReg.z = abs( Source0.z );
  TempReg.w = abs( Source0.w );
  WriteDestinationRegisters();
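The equivalence between abs and max with a negated source can be checked with a small C++ sketch. The `Reg` type and both function names are hypothetical stand-ins for the shader registers and instructions.

```cpp
#include <algorithm>
#include <array>
#include <cmath>

// Hypothetical model of a four-float shader register.
using Reg = std::array<float, 4>;

// Models: abs Dest0, Source0
Reg abs_macro(const Reg& s) {
    return {std::fabs(s[0]), std::fabs(s[1]), std::fabs(s[2]), std::fabs(s[3])};
}

// Models: max Dest0, Source0, -Source0
// For each element, the larger of x and -x is the absolute value of x.
Reg abs_via_max(const Reg& s) {
    return {std::max(s[0], -s[0]), std::max(s[1], -s[1]),
            std::max(s[2], -s[2]), std::max(s[3], -s[3])};
}
```

Both functions return the same register for any finite input, which is why the max form works as a drop-in replacement on shader versions without abs.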

add

vs 1.0, 1.1, 2.0

Adds two sources into the destination register.

One slot

 _______________________________________________________________________________ add Dest0, Source0, Source1 _______________________________________________________________________________

Adds the Source0 and Source1 registers and places the result in the Dest0 register.

Setup Two source registers, Source0 and Source1.

Results Each element of Dest0 is filled with the element-by-element addition of the elements of Source0 and Source1.

  add r0 , r0 , c2
  add r0.z, r0.z, -r0.z

  SetSourceRegisters();
  // Simulate the add instruction
  TempReg.x = Source0.x + Source1.x;
  TempReg.y = Source0.y + Source1.y;
  TempReg.z = Source0.z + Source1.z;
  TempReg.w = Source0.w + Source1.w;
  WriteDestinationRegisters();

call

vs 2.0

Makes an unconditional function call to the instruction label.

One slot

 _______________________________________________________________________________ call_InstructionLabelID _______________________________________________________________________________

Pushes the address of the following instruction onto the internal shader stack and then sets the current instruction address to the address of the instruction that follows the label instruction with the name InstructionLabelID. The instruction label ID will be an integer in the range [1, 16]. Calls cannot be nested.

Typically, you'd create a shader subroutine that terminates with the ret instruction.

Setup Requires a valid, existing instruction label.

Results The shader execution is transferred to the instruction following the instruction label.

 call_1
 call_16
 call_Fred // Error! Invalid label
 call_0    // Error! Invalid label (out of range)

 // Simulate the call instruction
 // make a cast to a bare function pointer
 typedef void (*fp)(void);
 // take address of the label
 fp pFP = (fp)InstructionLabelID;
 pFP(); // call the function
 // returns here only when ret is executed

callnz

vs 2.0

Call if Not Zero. Makes a function call to the instruction label.

One slot

 _______________________________________________________________________________ callnz InstructionLabelID BoolSource0 _______________________________________________________________________________

If the boolean register Source0 is not zero, then the address of the following instruction is pushed onto the internal shader stack, and then the current instruction address is set to the address of the instruction that follows the label instruction with the name InstructionLabelID. The instruction label ID will be an integer in the range [1, 16]. Calls cannot be nested.

Typically, you'd create a shader subroutine that terminates with the ret instruction.

Setup Source0 is a boolean register. Requires a valid, existing instruction label.

Results If the source register is not zero, the shader execution is transferred to the instruction following the instruction label.

   callnz 1 b0 // transfer execution to label 1 if 0 != b0
   callnz 2 r0 // Error! Not a boolean register

   // Simulate the callnz instruction
   // make a cast to a bare function pointer
   typedef void (*fp)(void);
   if ( 0 != BoolSource0 )
      {
      fp pFP = (fp)InstructionLabelID;
      pFP(); // call the function
      }

crs(macro)

vs 2.0

Computes the three-component cross product.

Two slots

 _______________________________________________________________________________ crs Dest0, Source0, Source1 _______________________________________________________________________________

Computes the three-component cross product using the right-hand rule. There are fairly severe restrictions on the use of swizzles. The w elements of all registers are ignored.

This macro is equivalent to the following sequence of instructions:

 _______________________________________________________________________________
 mul Dest0.xyz,  Source0.yzxw, Source1.zxyw
 mad Dest0.xyz, -Source1.yzxw, Source0.zxyw, Dest0
 _______________________________________________________________________________

Setup Two source registers, Source0 and Source1. These registers must not be the same as the destination register. The source registers must not have any swizzles.

The destination register must have a destination mask, and that mask must not contain a reference to the w element of the destination register.

Results The cross product of the two input registers is stored in the specified elements of the destination register.

   crs r0.xyz, r1, r2 // fill r0.xyz with the cross product of r1 and r2

   SetSourceRegisters();
   // Simulate the crs macro
   TempReg.x = Source0.y * Source1.z - Source0.z * Source1.y;
   TempReg.y = Source0.z * Source1.x - Source0.x * Source1.z;
   TempReg.z = Source0.x * Source1.y - Source0.y * Source1.x;
   // note w component ignored
   WriteDestinationRegisters();
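To see that the two-instruction mul/mad expansion really produces the cross product, here is a minimal C++ sketch that evaluates it both ways. The `Reg` type and function names are hypothetical; the sketch sets w to 0, whereas the macro simply does not write w.

```cpp
#include <array>

// Hypothetical model of a four-float shader register: x, y, z, w.
using Reg = std::array<float, 4>;

// Textbook definition of the three-component cross product a x b.
Reg cross_direct(const Reg& a, const Reg& b) {
    return {a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0],
            0.0f};
}

// The same result built from the two-instruction expansion.
Reg cross_mul_mad(const Reg& a, const Reg& b) {
    Reg d = {0.0f, 0.0f, 0.0f, 0.0f};
    // mul Dest0.xyz, Source0.yzxw, Source1.zxyw
    d[0] = a[1] * b[2];
    d[1] = a[2] * b[0];
    d[2] = a[0] * b[1];
    // mad Dest0.xyz, -Source1.yzxw, Source0.zxyw, Dest0
    d[0] = -b[1] * a[2] + d[0];
    d[1] = -b[2] * a[0] + d[1];
    d[2] = -b[0] * a[1] + d[2];
    return d;
}
```

The mad step reads Dest0 as its third source, which helps explain why the source registers must not be the same as the destination: the first instruction would overwrite a value the second still needs.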

dcl

vs 2.0

Declare. Map a vertex element to an input register.

Takes no slots

 _______________________________________________________________________________ dcl Dest0 _______________________________________________________________________________

To make it easier to optimize and verify shaders, VS 2.0 now requires a declaration statement on all input registers. Thus all texture or vertex input registers must be declared before use in the shader. Dest0 will be a specific input register. The partial precision modifier (_pp) can be applied to the declaration statement to indicate that a lower precision is acceptable when using this register. You must supply a component mask on Dest0 to indicate which elements are in use and valid. dcl statements must appear before the first executable instruction.

 dcl    t1.rg // using a 2D texture
 dcl    t2    // using a 4D texture (default mask)
 dcl_pp t3    // indicate partial precision is OK

def

vs 1.0, 1.1, 2.0

Sets the value of vertex shader floating point constants. In DirectX 8 it is up to the programmer to insert these into the shader code.

No slot

 _______________________________________________________________________________ def   Dest0,   value0,   value1,   value2,   value3 _______________________________________________________________________________

Stores four floating point values in the elements of the Dest0 register. If these instructions are used in a shader, the instructions must follow the vs instruction and precede any other instructions.

Setup Four floating point values separated by commas.

Results In DirectX 8 this instruction has no effect upon the shader code to follow. You must manually insert the returned code fragment into your shader. If you use the def in a shader, then when the shader is compiled, you will have to use the fourth parameter returned from D3DXAssembleShader(). This parameter will contain an ID3DXBuffer interface, which will contain a compiled shader code fragment. You will have to manually insert this fragment into your shader declaration.

In DirectX 9, this instruction causes the register to immediately assume the values specified. Previous values are restored when the shader exits.

 def     c0, 0.0f, 0.5f, 0.25f, -1.0f
 def     c1, 1.0f, 2.0f, 5.0f, 10.0f

defi

vs 2.0

Sets the value of vertex shader integer constants.

No slot

 _______________________________________________________________________________ defi IntDest0, value0, value1, value2, value3 _______________________________________________________________________________

Stores four integer values in the elements of the IntDest0 register for use in this shader.

Setup Four integer values separated by commas.

Results Locally sets these values into the register. A local call takes precedence over an external SetVertexShaderConstantI() call to set a shader constant. The previous values of the register are restored upon exit from the shader.

 defi i0, 0, 2, 4, 8
 defi i1, -2, -1, 1, 2

defb

vs 2.0

Sets the value of vertex shader boolean constants.

No slot

 _______________________________________________________________________________ defb BoolDest0, value0, value1, value2, value3 _______________________________________________________________________________

Stores four boolean values in the elements of BoolDest0 register for use in this shader. Zero indicates false. Nonzero indicates true.

Setup Four booleans separated by commas.

Results Locally sets these values into the register. A local call takes precedence over an external SetVertexShaderConstantB() call to set a shader constant. The previous values of the register are restored upon exit from the shader.

 defb b0, 0, 1, 0, 2 // false, true, false, true

dp3

vs 1.0, 1.1, 2.0

Three-component dot product (dot product 3) is computed and the result replicated in all specified channels of the destination register.

One slot

 _______________________________________________________________________________ dp3 Dest0, Source0, Source1 _______________________________________________________________________________

Computes the dot product of the Source0 and Source1 registers, and places the result in the Dest0 register. Only the x, y, and z values are used to compute the dot product; the w component is ignored.

Setup Two source registers, Source0 and Source1.

Results Unless otherwise masked, each element of Dest0 is filled with the dot product of the first three elements of registers Source0 and Source1.

   dp3     r0 , v3, c2 // fill r0 with dp3
   dp3     r1.x, v3, c2 // just fill r1.x

   SetSourceRegisters();
   // Simulate the dp3 instruction
   TempReg.x = TempReg.y = TempReg.z = TempReg.w =
      Source0.x * Source1.x +
      Source0.y * Source1.y +
      Source0.z * Source1.z;
      // note w component ignored
   WriteDestinationRegisters();

dp4

vs 1.0, 1.1, 2.0

Four-component dot product (dot product 4) is computed and the result stored in all specified channels of the destination register.

One slot

 _______________________________________________________________________________ dp4 Dest0, Source0, Source1 _______________________________________________________________________________

Computes the dot product of the Source0 and Source1 registers, and places the result in the Dest0 register. If no mask is specified on the destination, then the entire register is filled with the dot product.

Setup Two source registers, Source0 and Source1.

Results Unless otherwise masked, each element of Dest0 is filled with the dot product of the four elements of registers Source0 and Source1.

   dp4     r0,    v3,  c2
   dp4     r1.x,  v3,  c2 // just fill r1.x

   SetSourceRegisters();
   // Simulate the dp4 instruction
   TempReg.x = TempReg.y = TempReg.z = TempReg.w =
      Source0.x * Source1.x +
      Source0.y * Source1.y +
      Source0.z * Source1.z +
      Source0.w * Source1.w;
   WriteDestinationRegisters();

dst

vs 1.0, 1.1

Computes a distance vector in the format typically used for attenuated lighting calculations.

One slot

 _______________________________________________________________________________ dst   Dest0, Source0, Source1 _______________________________________________________________________________

Creates a distance vector from a set of distance squared and reciprocal distance values, and puts them in a format that can be used for attenuated lighting calculations.

Setup Two source registers are required to be set up. Source0 should be set up as [n/a, d², d², n/a]. Source1 should be set up as [n/a, 1/d, n/a, 1/d]. Elements noted as n/a are not used, and their values are ignored.

Results Dest0 will be filled with elements that correspond to [1, d, d², 1/d]. Dest0.y is computed from the product of Source0.y and Source1.y.

 dst     r2, r0, r1

   SetSourceRegisters();
   // Simulate the dst instruction
   TempReg.x = 1;
   TempReg.y = Source0.y * Source1.y;
   TempReg.z = Source0.z;
   TempReg.w = Source1.w;
   WriteDestinationRegisters();
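One way the [1, d, d², 1/d] layout can be used is sketched below in C++: a three-component dot product of the dst result with a set of attenuation coefficients gives a0 + a1·d + a2·d². The coefficient names a0, a1, a2 and the `attenuation` helper are assumptions made for this illustration and are not defined by the instruction itself.

```cpp
#include <array>

// Hypothetical model of a four-float shader register: x, y, z, w.
using Reg = std::array<float, 4>;

// Models: dst Dest0, Source0, Source1
// Source0 = [n/a, d*d, d*d, n/a], Source1 = [n/a, 1/d, n/a, 1/d]
Reg dst(const Reg& s0, const Reg& s1) {
    return {1.0f, s0[1] * s1[1], s0[2], s1[3]};  // [1, d, d*d, 1/d]
}

// Assumed attenuation model: 1 / (a0 + a1*d + a2*d*d), with d > 0.
float attenuation(float d, float a0, float a1, float a2) {
    const float d2 = d * d;
    const float rd = 1.0f / d;
    const Reg v = dst({0.0f, d2, d2, 0.0f}, {0.0f, rd, 0.0f, rd});
    // Equivalent of a dp3 between the dst result and (a0, a1, a2).
    const float denom = v[0] * a0 + v[1] * a1 + v[2] * a2;
    return 1.0f / denom;
}
```

Note that the distance d itself is never passed in directly; it is recovered as d² × (1/d), which is what the instruction computes for Dest0.y.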

else

vs 2.0

Provides an alternate path of execution for an if-else-endif block.

One slot

 _______________________________________________________________________________ else _______________________________________________________________________________

Must be inside of an if-endif block. If the boolean argument of the if statement is false, then the execution will skip to the else instruction and continue to the terminating endif statement. If the boolean was true then execution will skip over the code enclosed by the else-endif block. There can be only one else statement in an if-endif block.

Setup The else statement must be between an if and endif statement.

Results If the argument provided to the if statement was false, then the code inside the else-endif block will be executed.

 else

endif

vs 2.0

The termination point for an if-endif or ifc-endif block.

Zero slots

 _______________________________________________________________________________ endif _______________________________________________________________________________

When used with the if or ifc instruction, creates a block of instructions whose execution is conditional on the argument of that instruction.

Setup You must have an if or ifc instruction in your shader prior to this instruction.

Results Execution is controlled by the if or ifc instruction that precedes this instruction. When the argument of that statement is false, execution will jump to the statement following the endif (or to the else statement, if one is present).

 if b1
   // if b1 != 0, this section gets executed
 else // optional else statement
   // if b1 = 0, this section gets executed
 endif

endloop

vs 2.0

The termination point for a loop-endloop block.

One slot

 _______________________________________________________________________________ endloop _______________________________________________________________________________

When used with the loop instruction, creates a block of instructions over which execution can be specified a variable number of times.

Setup You must have a loop instruction in your shader prior to this instruction.

Results When the loop reaches the endloop instruction, the loop counter (specified in the loop instruction) is incremented by the increment value (also specified in the loop instruction). See the loop instruction for the pseudocode of a loop-endloop block.

 endloop

   // simulate the endloop instruction
   // assume that LoopCounter, LoopStep, LoopIterator
   // were defined in the loop instruction and
   // StartLoopOffset is the instruction following
   // the loop instruction
   LoopCounter += LoopStep;
   --LoopIterator;
   if ( LoopIterator > 0 )
      goto StartLoopOffset;
   // fall through

endrep

vs 2.0

The termination point for a rep-endrep block.

Zero slots

 _______________________________________________________________________________ endrep _______________________________________________________________________________

When used with the rep instruction, creates a block of instructions that is executed a specified number of times.

Setup You must have a rep instruction in your shader prior to this instruction.

Results Execution is controlled by the rep instruction that precedes this instruction. When the iteration count of that statement is zero then execution will jump to the statement following the endrep. See the rep instruction for simulation code.

 defi i0, 20, 0, 0, 0
 rep    i0 // i0.x is used = 20
    // this section gets executed 20 times
 endrep

exp(macro)

vs 1.0, 1.1, 2.0

This macro computes power of two to at least 20 bits of precision. By default, only the source register's w element is used. The results are replicated in the entire destination register. Note that the expp instruction sets the destination's w element to 1.

At least 12 instruction slots

 _______________________________________________________________________________ exp    Dest0,    Source0 _______________________________________________________________________________

Calculates 2^Source0.w and writes the result in Dest0. Unless otherwise specified, Source0.w is the input value, and all elements of Dest0 are written with the result. This is somewhat different from the expp instruction, which always sets Dest0.w to 1. A replicate swizzle is required on the source register.

 exp     r0,   c1.w     // fill all of r0 with exp2 (c1.w)
 exp     r0.x, c1.y     // store exp2 (c1.y) in r0.x

   SetSourceRegisters();
   // Simulate the exp macro
   TempReg.x =
      TempReg.y =
      TempReg.z =
      TempReg.w = ::pow (2, Source0.w);
   WriteDestinationRegisters();

expp

vs 1.0, 1.1, 2.0

Computes partial precision power of two. For DirectX 8 the results are broken into a partial precision part and a higher precision integer and fractional parts. This allows you to use the lower precision single element or a more complicated integer/fraction calculation when you need higher precision. The destination's w element is set to 1. Only the integer part of the source register's w element is used. If Source0.w < 0 then the results are undefined.

For DirectX 9, the partial precision result fills the destination register.

Note

Don't confuse this with the exp macro!

One slot

 _______________________________________________________________________________ expp   Dest0, Source0 _______________________________________________________________________________

The DirectX 8 version computes low- and higher precision values for 2^Source0.w, where Dest0.z contains the low-precision single-element approximation, Dest0.x and Dest0.y contain the integer and fractional parts, and Dest0.w is set to 1.

You have a choice in which part of the results to use. The low-precision part will contain 2 raised to the input value, to 10 bits of precision. The two-part higher precision result contains 2 raised to the integer part of the input value, and the fractional part of the input value. To use it, you will have to provide a function that computes 2ⁿ for 0 <= n < 1 to your desired precision, and then multiply that by the integer part's result, since 2^(i + f) = 2^i × 2^f.

The DirectX 9 version just computes the low precision part.

Setup Store the value you want the exponent of in Source0.w. The value should be positive. The other register elements are ignored. A replicate swizzle is required on the source register.

Results Dest0.z will contain a low-precision exponential value. Dest0.x will contain the exponential of the integer part of the input. Dest0.y will contain the fractional part of the input, not the exponential of the fractional part. You have to do the conversion yourself. Dest0.w is set to 1.0.

 expp        r0,  r1.w

 // DirectX 8 version
 SetSourceRegisters();
 // Simulate the expp instruction
 float wWhole = Source0.w;           // take all
 float wInt = (float)(int)Source0.w; // take integer part
 // compute the higher precision parts
 TempReg.x = pow(2,wInt);
 TempReg.y = Source0.w - wInt; // fractional part of w
 // calculate the 2^(Source0.w) then chop
 // to 10 bits precision by masking the float's bits
 // (assumes a 32-bit unsigned long)
 float full = (float)pow(2,wWhole);
 unsigned long bits = *(unsigned long*)&full & 0xffffff00;
 TempReg.z = *(float*)&bits;
 // set w to 1
 TempReg.w = 1;
 WriteDestinationRegisters();

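 // DirectX 9 version
 SetSourceRegisters();
 // Simulate the expp instruction
 float wWhole = Source0.w; // take all
 // calculate the 2^(Source0.w) then chop the float's bits
 // (assumes a 32-bit unsigned long)
 float full = (float)pow(2,wWhole);
 unsigned long bits = *(unsigned long*)&full & 0xffffff00;
 TempReg.x = TempReg.y =
     TempReg.z = TempReg.w =
     *(float*)&bits;
 WriteDestinationRegisters();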
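The following C++ sketch shows one way the DirectX 8 two-part result could be recombined into a higher precision value. All names here are hypothetical, and `exp2_frac` is only a stand-in for whatever approximation of 2^f you would supply yourself; it uses the standard library so that the sketch stays short.

```cpp
#include <cmath>

// Hypothetical holder for the two higher precision outputs of expp:
// int_pow models Dest0.x (2 raised to the integer part of w),
// frac models Dest0.y (the fractional part of w, NOT 2^frac).
struct ExppParts {
    float int_pow;
    float frac;
};

// Split a non-negative input w the way the DirectX 8 expp results are described.
ExppParts expp_parts(float w) {
    const float whole = std::floor(w);
    return {std::exp2(whole), w - whole};
}

// Stand-in for a user-supplied approximation of 2^f on [0, 1).
float exp2_frac(float f) {
    return std::exp2(f);
}

// Recombine: 2^(i + f) = 2^i * 2^f.
float exp2_full(float w) {
    const ExppParts p = expp_parts(w);
    return p.int_pow * exp2_frac(p.frac);
}
```

The recombination is a multiplication rather than an addition because the two parts are exponents of the same base.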

frc(macro)

vs 1.0, 1.1, 2.0

This macro removes the integer part of the input register's elements, and places the fractional remainder into the destination register's elements. The sign of the results is always positive.

Three instruction slots

 _______________________________________________________________________________ frc Dest0, Source0 _______________________________________________________________________________

Takes the fractional parts of Source0's elements and places them in Dest0's elements. The sign of the input arguments is ignored. Version 1.x requires an .xy write mask.

 frc    r0.xy, r1 // use r1.xy and store fractions in r0.xy
 // use r1.x and store fraction in r0.y, r0.x (and z & w)
 // remain unchanged
 frc    r0.y , r1.x
 // this has no effect on the results, since the
 // sign is ignored
 frc    r0.y , -r1.x
 frc    r0, r1 // Error! No write mask.

 SetSourceRegisters();
 // Simulate the frc macro
 TempReg.x = abs( Source0.x );
 TempReg.y = abs( Source0.y );
 TempReg.z = abs( Source0.z );
 TempReg.w = abs( Source0.w );
 TempReg.x = TempReg.x - (int)TempReg.x;
 TempReg.y = TempReg.y - (int)TempReg.y;
 TempReg.z = TempReg.z - (int)TempReg.z;
 TempReg.w = TempReg.w - (int)TempReg.w;
 WriteDestinationRegisters();

if

vs 2.0

The start of an if-else-endif block. Conditionally execute a block of code.

One slot

 _______________________________________________________________________________ if  BoolReg0 _______________________________________________________________________________

The argument must be a boolean constant register. There must be a terminating endif that follows the if instruction. The else instruction is optional and must be between the if and endif statements. If the boolean argument is true, then execution will continue immediately after the if statement, until either the else or the endif statement is reached.

if blocks can be placed inside an if-endif or a loop-endloop block, but they must be entirely inside them.

Setup The argument must be a boolean register. You must have an endif instruction in your shader following this instruction.

Results Execution is controlled by the if instruction. When the boolean of that statement is false, execution will jump past the statements following the if to the matching else statement or, if there is no else, to the endif statement.

 if b1
   // if b1 != 0, this section gets executed
 else // optional else statement
   // if b1 = 0, this section gets executed
 endif

label

vs 2.0

Defines a label for use with a call or callnz instruction.

Zero slots

 _______________________________________________________________________________ label <n> _______________________________________________________________________________

The label instruction marks the next instruction as having the specified label, thus making it a target for a subroutine call. The argument <n> must be an integer label in the range [1,16]; that is, there can be a total of sixteen labels.

Setup The argument must be an integer.

Results When a call or callnz instruction calls the integer label, execution immediately (and conditionally for the callnz instruction) goes to the instruction following the label statement. Execution will return when a ret instruction is encountered.

 // VS 2.0
 call 12  // somewhere in shader call subroutine
 label 12 // label 12
 // the subroutine instructions go here
 ret      // execution returns after the call

lit

vs 1.0, 1.1, 2.0

Computes the traditional diffuse and specular lighting coefficients when passed the dot products n • l and n • h and a power coefficient.

One slot

 _______________________________________________________________________________ lit    Dest0, Source0 _______________________________________________________________________________

You'll need to calculate the normalized n • l and n • h dot products, and specify a specular power value, prior to using this instruction. The results will be the traditional diffuse component in Dest0.y, which is max(n • l, 0), and the traditional specular component (i.e., Blinn's equation) in Dest0.z, which is (n • h)^power when both dot products are positive and 0 otherwise.

Note that there are no k parameters in the equations. If you need them, you'll have to do the multiplication in your shader.

Setup Source0.x should contain the normalized dot product between the normal and the direction from the vertex to the light. Source0.y should contain the normalized dot product between the normal and the half-angle vector. Source0.w should contain the power value in the range −128 to +128. Source0.z is ignored.

Results Dest0.x and Dest0.w are set to 1. If Source0.x (n • l) is positive, it's stored in Dest0.y; else Dest0.y is set to 0. If both Source0.x and Source0.y (n • h) are positive, then Dest0.z is set to Source0.y raised to the Source0.w power; else it is set to zero. The power value (Source0.w) is clamped to the range [−128,128].

Note

Early versions of the SDK documentation incorrectly stated that negative exponential values would cause undefined results.

 lit     r0,    r1

   SetSourceRegisters();
   // Simulate the lit instruction
   // these are constants
   TempReg.x = TempReg.w = 1;
   // if n dot l is positive...
   if ( Source0.x > 0 )
      {
      TempReg.y = Source0.x;
      // if n dot h is positive
      if ( Source0.y > 0 )
         {
         // clamp the power value to an 8.8
         // fixed point representation of the
         // maximum allowable value
         const float kPowerMax = 127.9961f;
         float ClampedPower = Source0.w;
         if      (ClampedPower < -kPowerMax )
            ClampedPower = -kPowerMax;
         else if (ClampedPower > kPowerMax )
            ClampedPower = kPowerMax;
         // actual value in shader math is only
         // good to seven fractional bits of
         // precision
         TempReg.z = pow( Source0.y, ClampedPower );
         }
      else
         {
         TempReg.z = 0; // if n dot h was negative/zero
         }
      }
   else
      {
      TempReg.y = 0; // if n dot l was negative/zero
      TempReg.z = 0; // no specular without diffuse
      }
   WriteDestinationRegisters();
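Because lit leaves out the k parameters, a shader has to scale the coefficients itself. The C++ sketch below models lit and then applies a diffuse and a specular weight; the names `kd`, `ks`, and `shade` are hypothetical and stand for whatever material constants your shader uses.

```cpp
#include <algorithm>
#include <array>
#include <cmath>

// Hypothetical model of a four-float shader register: x, y, z, w.
using Reg = std::array<float, 4>;

// Models: lit Dest0, Source0
// Source0 = [n.l, n.h, ignored, power]; result = [1, diffuse, specular, 1].
Reg lit(const Reg& s) {
    const float kPowerMax = 127.9961f;
    Reg d = {1.0f, 0.0f, 0.0f, 1.0f};
    if (s[0] > 0.0f) {
        d[1] = s[0];
        if (s[1] > 0.0f) {
            const float power = std::min(std::max(s[3], -kPowerMax), kPowerMax);
            d[2] = std::pow(s[1], power);
        }
    }
    return d;
}

// Assumed combination: kd scales the diffuse term, ks the specular term.
float shade(const Reg& lit_result, float kd, float ks) {
    return kd * lit_result[1] + ks * lit_result[2];
}
```

In shader code the same scaling would typically be a mul or mad against a constant register holding the material weights.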

log(macro)

vs 1.0, 1.1, 2.0

This macro computes log₂ of the input argument in at least 20-bit precision. The absolute value of the source register's w element is used. Unlike the logp instruction, the destination's w element is not set to 1.

Note

Don't confuse this with the logp instruction.

At least 12 instruction slots

 _______________________________________________________________________________ log   Dest0, Source0 _______________________________________________________________________________

Computes log₂ of the absolute value of the Source0.w element (unless otherwise specified) and places the result into all elements of Dest0. If the argument is equal to zero, all result registers are set to minus infinity. This is somewhat different from the logp instruction, which always sets Dest0.w to 1.

 log     r0,     r1

   SetSourceRegisters();
   // Simulate the log macro
   if ( 0 == Source0.w )
      {
      TempReg.x = TempReg.y = TempReg.z = TempReg.w =
         MINUS_INFINITY;
      }
   else
      {
      // note we use absolute value
      TempReg.x = TempReg.y = TempReg.z = TempReg.w =
         log(abs(Source0.w))/log(2);
      }
   WriteDestinationRegisters();

logp

vs 1.0, 1.1, 2.0

Computes partial precision log₂. In DirectX 8 the results are broken into a single 10-bit precision part and a higher precision dual-element part. This allows you to use the lower precision single element or a more complicated integer/ fractional calculation if you need higher precision.

For DirectX9 the partial result fills the entire destination register.

One slot

 _______________________________________________________________________________ logp  Dest0, Source0 _______________________________________________________________________________

The DirectX 8 version computes low- and higher precision values for log₂(Source0.w). The destination's w element is set to 1. Only the source register's w element is used. If Source0.w is negative, then its absolute value is used. If Source0.w is zero, then the results are negative infinity in Dest0.x and Dest0.z, and 1.0 in Dest0.y.

You have a choice in which part of the results to use. The low-precision part will contain the log₂ of the input value to 10 bits of precision.

The two-part higher precision part represents the exponent and mantissa. This allows you to use the lower precision single element or a more complicated exponent/mantissa calculation when you need higher precision.

To use the high-precision results, you'll need to provide a function that computes log₂ in the range [1,2) with your desired precision. You'd apply it to the mantissa in Dest0.y and then add the result to the exponent returned in Dest0.x to get the log₂ of your input value.

The DirectX 9 version just computes the low-precision part.

Note

Don't confuse this with the log macro!

Setup For DirectX 8, store the value you want the log₂ of in Source0.w. The value should be positive. The other register elements are ignored. For DirectX 9, indicate the element by using a (required) replicate swizzle.

Results DirectX 8: Dest0.z contains the low precision (10-bit) single-element approximation. Dest0.x contains the most significant part of the dual-element result. This value can be negative. Dest0.y contains the mantissa of the dual-element result in the form of an exponented value in the range [1,2). You have to do the conversion yourself.

DirectX 9: All destination elements will have the low-precision result.

 logp     r0,     r1

   // DirectX 8 version:
   SetSourceRegisters();
   // Simulate the logp instruction
   // (bit casts below assume a 32-bit unsigned long)
   float v = abs(Source0.w); // only positive values
   TempReg.y = TempReg.w = 1.0f;
   if ( 0 == v )
      {
      TempReg.x = TempReg.z = MINUS_INFINITY;
      }
   else
      {
      float logValue = (float)(log(v)/log(2));
      // store exponent
      TempReg.x = (float)(int)::floor( logValue );
      // store mantissa: keep the mantissa bits of v and
      // force the exponent so the value lies in [1,2)
      unsigned long p = (*(unsigned long*)&v
         & 0x7FFFFF) | 0x3F800000;
      TempReg.y = *(float*)&p;
      // store low-precision part to 10 bits
      unsigned long temp = *(unsigned long*)&logValue;
      temp &= 0xFFFFFF00;
      TempReg.z = *(float*)&temp;
      }
   WriteDestinationRegisters();

   // DirectX 9 version
   SetSourceRegisters();
   // Simulate the logp instruction
   // (bit casts below assume a 32-bit unsigned long)
   float v = abs(Source0.w); // only positive values
   float logValue;
   if ( 0 == v )
      {
      logValue = MINUS_INFINITY;
      }
   else
      {
      logValue = (float)(log(v)/log(2));
      // store low-precision part to 10 bits
      unsigned long temp = *(unsigned long*)&logValue;
      temp &= 0xFFFFFF00;
      logValue = *(float*)&temp;
      }
   TempReg.x = TempReg.y =
      TempReg.z = TempReg.w =
      logValue;
   WriteDestinationRegisters();
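As a sketch of how the DirectX 8 exponent and mantissa pair could be turned back into a full log₂ value, the C++ below splits a nonzero input with `std::frexp` and recombines the parts. The struct and function names are hypothetical, and `log2_mantissa` is only a stand-in for the [1,2) approximation you would write yourself.

```cpp
#include <cmath>

// Hypothetical holder for the two higher precision outputs of logp:
// exponent models Dest0.x, mantissa models Dest0.y (a value in [1, 2)).
struct LogpParts {
    float exponent;
    float mantissa;
};

// Split |w| into exponent and mantissa. Assumes w is nonzero.
LogpParts logp_parts(float w) {
    int e = 0;
    // frexp returns m in [0.5, 1) with |w| = m * 2^e; rescale m to [1, 2).
    const float m = std::frexp(std::fabs(w), &e);
    return {static_cast<float>(e - 1), m * 2.0f};
}

// Stand-in for a user-supplied approximation of log2 on [1, 2).
float log2_mantissa(float m) {
    return std::log2(m);
}

// Recombine: log2(m * 2^e) = e + log2(m).
float log2_full(float w) {
    const LogpParts p = logp_parts(w);
    return p.exponent + log2_mantissa(p.mantissa);
}
```

Here the recombination is an addition, in contrast to expp, because the logarithm of a product is the sum of the logarithms.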

loop

vs 2.0

The starting point for a loop-endloop block. Iterate over a block of code a number of times.

One slot

 _______________________________________________________________________________ loop  IntSource0 _______________________________________________________________________________

When used with the endloop instruction, creates a block of instructions that is executed a variable number of times. Each time through the loop, the loop counter is incremented by the specified amount. Compare this to the rep instruction, which does not maintain a loop counter.

Setup The argument must be an integer register. IntSource0.x holds the number of times the loop is to execute. The loop counter register gets incremented at the endloop.

The counter can be used to index into the constant register array. IntSource0.y is the initial value for the loop counter register. IntSource0.z specifies the step for the loop counter register. Execution goes to the statement following the matching endloop instruction once the remaining iteration count is zero or less. The loop counter register, aL, is available inside the loop.

You must not nest loops, nor jump either out of or into a loop-endloop block. The loop instruction must precede its matching endloop instruction.

Results The instructions in the loop-endloop block are executed IntSource0.x times with the loop counter getting incremented each time through.

 loop i0
   // assume i0.x, i0.y, and i0.z are set up
   // aL is available and incremented each time
   // through the loop
 endloop

   // simulate the loop instruction
   // -- loop statement begin
   aL = IntSource0.y; // loop counter
   StartLoopOffset: // <- label
   if ( IntSource0.x <= 0 )
      goto EndLoopOffset;
   // -- loop statement end
   // section of code between the loop/endloop
   // -- endloop statement begin
   aL += IntSource0.z; // increment loop counter
   IntSource0.x--;     // decrement
   goto StartLoopOffset;
   EndLoopOffset:      // <- label
   // -- endloop statement end
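To show how the loop counter can index the constant register array, here is a small C++ sketch of a loop-endloop block whose body accumulates c[aL]. The `IntReg` struct and the function `sumIndexedConstants` are hypothetical names for illustration, and the constant array is modeled as a plain vector of floats.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an integer register:
// x = iteration count, y = initial aL, z = aL step.
struct IntReg { int x, y, z; };

// Sum every constant the loop would visit through aL.
float sumIndexedConstants(const std::vector<float>& c, IntReg i0) {
    float sum = 0.0f;
    int aL = i0.y;                                   // loop: initialize the counter
    for (int n = i0.x; n > 0; --n) {                 // body runs i0.x times
        sum += c.at(static_cast<std::size_t>(aL));   // body: read c[aL]
        aL += i0.z;                                  // endloop: step the counter
    }
    return sum;
}
```

With a count of 3, a start of 1, and a step of 2, the body reads c[1], c[3], and c[5].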

lrp(macro)

vs 2.0

Linear interpolation between two registers (a.k.a. "lerp") using a fraction specified in a third register. This is done on an element-by-element basis.

Two slots

 _______________________________________________________________________________ lrp Dest0, Source0, Source1, Source2 _______________________________________________________________________________

This macro instruction interpolates between two floating point numbers, Source1 and Source2, based upon the fraction in Source0. When Source0 is zero, Source2 is placed in the destination. When Source0 is one, Source1 is placed in the destination. Values in the [0,1] range interpolate between Source2 and Source1. If the value is outside the range [0,1] the result is indeterminate. Dest0 must be a temporary register and cannot be the same as Source0 or Source2, because the expansion overwrites Dest0 before those registers are read.

The macro expands to the following code:

 _______________________________________________________________________________
 add Dest0, Source1, -Source2
 mad Dest0, Dest0, Source0, Source2
 _______________________________________________________________________________

Setup Dest0 must be a temporary register, and Source0 should be in the range [0,1].

Results Dest0 contains Source0 * (Source1 - Source2) + Source2, that is, the value interpolated between Source2 and Source1 by the fraction in Source0.

 lrp r0, c14, r1, r2
 lrp r0, r3.x, r1, r2 // lrp using a single value as the fraction

   SetSourceRegisters();
   // simulate lrp instruction
   TempReg.x = Source0.x * (Source1.x - Source2.x) + Source2.x;
   TempReg.y = Source0.y * (Source1.y - Source2.y) + Source2.y;
   TempReg.z = Source0.z * (Source1.z - Source2.z) + Source2.z;
   TempReg.w = Source0.w * (Source1.w - Source2.w) + Source2.w;
   WriteDestinationRegisters();
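As a cross-check of the macro expansion (an add followed by a mad), the following C++ sketch evaluates Source0 * (Source1 - Source2) + Source2 per element. `Vec4` and `lrp` are hypothetical names used only for this illustration.

```cpp
#include <cassert>

struct Vec4 { float x, y, z, w; };

// add t, s1, -s2  followed by  mad dest, t, s0, s2
Vec4 lrp(const Vec4& s0, const Vec4& s1, const Vec4& s2) {
    return { s0.x * (s1.x - s2.x) + s2.x,
             s0.y * (s1.y - s2.y) + s2.y,
             s0.z * (s1.z - s2.z) + s2.z,
             s0.w * (s1.w - s2.w) + s2.w };
}
```

Because each element has its own fraction, one call can blend the four elements by different amounts.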

m3x2(macro)

vs 1.0, 1.1, 2.0

Matrix 3 by 2. Performs a matrix multiply on the input vector and input matrix and stores the result. This macro is typically used for 2D transformation calculations.

Two instruction slots

 _______________________________________________________________________________ m3x2  Dest0, Source0, Source1 _______________________________________________________________________________

Does a matrix multiply assuming that Source0 is the input vector and the matrix starts at element Source1 and there are the correct number of registers available after Source1. Only those elements actually used in the calculations are read; only those that are calculated are written to the destination registers. The w elements in the source matrix and vector are unused, and only the x and y elements of the destination are written.

This macro

 _______________________________________________________________________________ m3x2 Dest0, Source0, Source1 _______________________________________________________________________________

expands to the following:

 _______________________________________________________________________________
 dp3      Dest0.x, Source0, Source1
 dp3      Dest0.y, Source0, Source2
 _______________________________________________________________________________

Note

You can use the swizzle or negate modifier if you expand this macro yourself.

Warning

Make sure that your Dest0 and Source0 registers aren't the same registers. It will compile and run, but your results will be incorrect.

Note

You are not allowed to use the swizzle or negate modifier on Source1.

 m3x2 r0, v0, c6 ; // will use c7 as well
 m3x2 r0, v0, c6.yzxw // Error! Can't use swizzle

   SetSourceRegisters();
   // Simulate the m3x2 macro
   TempReg.x = Source0.x * Source1.x +
               Source0.y * Source1.y +
               Source0.z * Source1.z;
   TempReg.y = Source0.x * Source2.x +
               Source0.y * Source2.y +
               Source0.z * Source2.z;
   WriteDestinationRegisters();

m3x3(macro)

vs 1.0, 1.1, 2.0

Matrix 3 by 3. Performs a matrix multiply on the input vector and input matrix and stores the result. This macro typically is used for normal transformations during lighting calculations.

Three instruction slots

 _______________________________________________________________________________ m3x3   Dest0, Source0, Source1 _______________________________________________________________________________

This macro

 _______________________________________________________________________________ m3x3   Dest0, Source0, Source1 _______________________________________________________________________________

expands to the following:

 _______________________________________________________________________________
 dp3          Dest0.x,  Source0,  Source1
 dp3          Dest0.y,  Source0,  Source2
 dp3          Dest0.z,  Source0,  Source3
 _______________________________________________________________________________

Note

You can use the swizzle or negate modifier if you expand this macro yourself.

Warning

Make sure that your Dest0 and Source0 registers are different. It will compile, but your results will be incorrect.

Note

You are not allowed to use the swizzle or negate modifier on Source1.

 m3x3 r0, v0, c6 ; // will use c7 & c8 as well
 m3x3 r0, v0, c6.yzxw // Error! Can't use swizzle

   SetSourceRegisters();
   // Simulate the m3x3 macro
   TempReg.x = Source0.x * Source1.x +
               Source0.y * Source1.y +
               Source0.z * Source1.z;
   TempReg.y = Source0.x * Source2.x +
               Source0.y * Source2.y +
               Source0.z * Source2.z;
   TempReg.z = Source0.x * Source3.x +
               Source0.y * Source3.y +
               Source0.z * Source3.z;
   WriteDestinationRegisters();
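The warning about Dest0 and Source0 follows from the expansion: each dp3 writes one destination element before the next dp3 reads the source. The C++ sketch below mimics that instruction-by-instruction behavior with references, so passing the same object as destination and source reproduces the problem. `Vec4`, `dp3`, and `m3x3Expanded` are hypothetical names.

```cpp
#include <cassert>

struct Vec4 { float x, y, z, w; };

float dp3(const Vec4& a, const Vec4& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Expanded m3x3: one destination element is written per dp3,
// so a later dp3 sees earlier writes if dest and src alias.
void m3x3Expanded(Vec4& dest, const Vec4& src, const Vec4 rows[3]) {
    dest.x = dp3(src, rows[0]);
    dest.y = dp3(src, rows[1]);
    dest.z = dp3(src, rows[2]);
}
```

With a matrix that swaps x and y, the input (1, 2, 3) should give (2, 1, 3). When the destination is also the source, the second dp3 reads the x value that was just overwritten and the result becomes (2, 2, 3).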

m3x4(macro)

vs 1.0, 1.1, 2.0

Matrix 3 by 4. Performs a matrix multiply on the input vector and input matrix and stores the result.

Four instruction slots

 _______________________________________________________________________________ m3x4  Dest0, Source0, Source1 _______________________________________________________________________________

Does a matrix multiply assuming that Source0 is the input vector and the matrix starts at element Source1 and there are the correct number of registers available after Source1. Only those elements actually used in the calculations are read; only those that are calculated are written to the destination registers. The w elements in the source matrix and vector are unused.

This macro

 _______________________________________________________________________________ m3x4   Dest0, Source0, Source1 _______________________________________________________________________________

expands to the following:

 _______________________________________________________________________________
 dp3             Dest0.x, Source0, Source1
 dp3             Dest0.y, Source0, Source2
 dp3             Dest0.z, Source0, Source3
 dp3             Dest0.w, Source0, Source4
 _______________________________________________________________________________

Note

You can use the swizzle or negate modifier if you expand this macro yourself.

Warning

Make sure that your Dest0 and Source0 registers are different. It will compile, but your results will be incorrect.

Note

You are not allowed to use the swizzle or negate modifier on Source1.

 m3x4   r0,  v0, c6 ; // will use c7, c8, & c9 as well
 m3x4   r0,  v0, c6.yzxw // Error! Can't use swizzle

   SetSourceRegisters();
   // Simulate the m3x4 macro
   TempReg.x = Source0.x * Source1.x +
               Source0.y * Source1.y +
               Source0.z * Source1.z;
   TempReg.y = Source0.x * Source2.x +
               Source0.y * Source2.y +
               Source0.z * Source2.z;
   TempReg.z = Source0.x * Source3.x +
               Source0.y * Source3.y +
               Source0.z * Source3.z;
   TempReg.w = Source0.x * Source4.x +
               Source0.y * Source4.y +
               Source0.z * Source4.z;
   WriteDestinationRegisters();

m4x3(macro)

vs 1.0, 1.1, 2.0

Matrix 4 by 3. Performs a matrix multiply on the input vector and input matrix and stores the result.

Three instruction slots

 _______________________________________________________________________________ m4x3  Dest0, Source0, Source1 _______________________________________________________________________________

Does a matrix multiply assuming that Source0 is the input vector and the matrix starts at element Source1 and there are the correct number of registers available after Source1. All source register elements are used, but Dest0.w will be unmodified.

This macro

 _______________________________________________________________________________ m4x3   Dest0, Source0, Source1 _______________________________________________________________________________

expands to the following:

 _______________________________________________________________________________
 dp4             Dest0.x, Source0, Source1
 dp4             Dest0.y, Source0, Source2
 dp4             Dest0.z, Source0, Source3
 _______________________________________________________________________________

Note

You can use the swizzle or negate modifier if you expand this macro yourself.

Warning

Make sure that your Dest0 and Source0 registers are different. It will compile, but your results will be incorrect.

Note

You are not allowed to use the swizzle or negate modifier on Source1.

 m4x3  r0,  v0, c6 ; // will use c7 & c8 as well
 m4x3  r0,  v0, c6.yzxw // Error! Can't use swizzle

   SetSourceRegisters();
   // Simulate the m4x3 macro
   TempReg.x = Source0.x * Source1.x +
               Source0.y * Source1.y +
               Source0.z * Source1.z +
               Source0.w * Source1.w;
   TempReg.y = Source0.x * Source2.x +
               Source0.y * Source2.y +
               Source0.z * Source2.z +
               Source0.w * Source2.w;
   TempReg.z = Source0.x * Source3.x +
               Source0.y * Source3.y +
               Source0.z * Source3.z +
               Source0.w * Source3.w;
   WriteDestinationRegisters();

m4x4(macro)

vs 1.0, 1.1, 2.0

Matrix 4 by 4. Performs a matrix multiply on the input vector and input matrix and stores the result.

Four instruction slots

 _______________________________________________________________________________ m4x4  Dest0, Source0, Source1 _______________________________________________________________________________

This macro

 _______________________________________________________________________________ m4x4  Dest0, Source0, Source1 _______________________________________________________________________________

expands to the following:

 _______________________________________________________________________________
 dp4                 Dest0.x, Source0, Source1
 dp4                 Dest0.y, Source0, Source2
 dp4                 Dest0.z, Source0, Source3
 dp4                 Dest0.w, Source0, Source4
 _______________________________________________________________________________

Note

You can use the swizzle or negate modifier if you expand this macro yourself.

Warning

Make sure that your Dest0 and Source0 registers are different. It will compile, but your results will be incorrect.

Note

You are not allowed to use the swizzle or negate modifier on Source1.

 m4x4   r0, v0, c6 ; // will use c7, c8, & c9 as well
 m4x4   r0, v0, c6.yzxw // Error! Can't use swizzle

   SetSourceRegisters();
   // Simulate the m4x4 macro
   TempReg.x = Source0.x * Source1.x +
               Source0.y * Source1.y +
               Source0.z * Source1.z +
               Source0.w * Source1.w;
   TempReg.y = Source0.x * Source2.x +
               Source0.y * Source2.y +
               Source0.z * Source2.z +
               Source0.w * Source2.w;
   TempReg.z = Source0.x * Source3.x +
               Source0.y * Source3.y +
               Source0.z * Source3.z +
               Source0.w * Source3.w;
   TempReg.w = Source0.x * Source4.x +
               Source0.y * Source4.y +
               Source0.z * Source4.z +
               Source0.w * Source4.w;
   WriteDestinationRegisters();
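Since each dp4 consumes one constant register, the matrix has to be laid out one row per register, starting at the register named in the macro. The following C++ sketch models that layout, using a 16-entry array as a hypothetical stand-in for the constant registers. `Vec4`, `dp4`, and `m4x4` are names introduced only for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

struct Vec4 { float x, y, z, w; };

float dp4(const Vec4& a, const Vec4& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

// m4x4 dest, src, c[base]: reads c[base] .. c[base + 3], one matrix row each.
Vec4 m4x4(const Vec4& src, const std::array<Vec4, 16>& c, std::size_t base) {
    assert(base + 3 < c.size());   // four consecutive registers must exist
    return { dp4(src, c[base]),     dp4(src, c[base + 1]),
             dp4(src, c[base + 2]), dp4(src, c[base + 3]) };
}
```

Loading a translation by (10, 20, 30) into c6 through c9, with the offsets in the w element of the first three rows, moves the point (1, 2, 3, 1) to (11, 22, 33, 1).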

mad

vs 1.0, 1.1, 2.0

Multiply and add. Multiplies two registers, then adds a third to the result, and then stores the result.

One slot

 _______________________________________________________________________________ mad   Dest0, Source0, Source1, Source2 _______________________________________________________________________________

Multiplies Source0 by Source1, then adds Source2 to the result. The final result is stored in Dest0.

Setup Source0 and Source1 are the registers to be multiplied. Source2 is the register to be added to the result of the multiplication of Source0 and Source1.

Results Dest0 contains (Source0 * Source1) + Source2.

 mad     r0, r0, r1, r2

   SetSourceRegisters();
   // Simulate the mad instruction
   TempReg.x = Source0.x * Source1.x + Source2.x;
   TempReg.y = Source0.y * Source1.y + Source2.y;
   TempReg.z = Source0.z * Source1.z + Source2.z;
   TempReg.w = Source0.w * Source1.w + Source2.w;
   WriteDestinationRegisters();

max

vs 1.0, 1.1, 2.0

Stores the maximum value from comparing two source registers element by element into the destination register.

One slot

 _______________________________________________________________________________ max   Dest0, Source0, Source1 _______________________________________________________________________________

Finds the maximum between elements of Source0 and Source1, then stores the results in elements Dest0. The resulting register may not be equal to either input register since the comparison is per element.

Setup Source0 and Source1 are the registers to be compared.

Results Dest0 contains the maximum elements of the two input registers done on an element-by-element basis.

 max     r0, r1, r2

   SetSourceRegisters();
   // Simulate the max instruction
   TempReg.x = Source0.x > Source1.x ?
       Source0.x : Source1.x;
   TempReg.y = Source0.y > Source1.y ?
       Source0.y : Source1.y;
   TempReg.z = Source0.z > Source1.z ?
       Source0.z : Source1.z;
   TempReg.w = Source0.w > Source1.w ?
       Source0.w : Source1.w;
   WriteDestinationRegisters();

min

vs 1.0, 1.1, 2.0

Stores the minimum value from comparing two source registers element by element into the destination register.

One slot

 _______________________________________________________________________________ min   Dest0, Source0, Source1 _______________________________________________________________________________

Finds the minimum between elements of Source0 and Source1, then stores the results in elements of Dest0. The resulting register may not be equal to either input register since the comparison is per element.

Setup Source0 and Source1 are the registers to be compared.

Results Dest0 contains the minimum of the two input registers done on an element-by-element basis.

 min     r0, r1, r2

   SetSourceRegisters();
   // Simulate the min instruction
   TempReg.x = Source0.x < Source1.x ?
       Source0.x : Source1.x;
   TempReg.y = Source0.y < Source1.y ?
       Source0.y : Source1.y;
   TempReg.z = Source0.z < Source1.z ?
       Source0.z : Source1.z;
   TempReg.w = Source0.w < Source1.w ?
       Source0.w : Source1.w;
   WriteDestinationRegisters();
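One common way to combine the two instructions, offered here as an illustration rather than something the instruction set prescribes, is to clamp a register to a range with a max followed by a min. The C++ sketch below uses the hypothetical names `Vec4`, `maxv`, `minv`, and `clampv`. It also shows that the result can mix elements from different inputs.

```cpp
#include <algorithm>
#include <cassert>

struct Vec4 { float x, y, z, w; };

// Per-element maximum, as max does.
Vec4 maxv(const Vec4& a, const Vec4& b) {
    return { std::max(a.x, b.x), std::max(a.y, b.y),
             std::max(a.z, b.z), std::max(a.w, b.w) };
}

// Per-element minimum, as min does.
Vec4 minv(const Vec4& a, const Vec4& b) {
    return { std::min(a.x, b.x), std::min(a.y, b.y),
             std::min(a.z, b.z), std::min(a.w, b.w) };
}

// max t, v, lo  followed by  min dest, t, hi
Vec4 clampv(const Vec4& v, const Vec4& lo, const Vec4& hi) {
    return minv(maxv(v, lo), hi);
}
```

Clamping (-1, 0.5, 2, 1) to the range [0,1] yields (0, 0.5, 1, 1): the first element comes from the lower bound, the third from the upper bound, and the others from the input.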

mov

vs 1.0, 1.1, 2.0

Copies the source register into the destination register. Useful for moving from a temporary register into an output register or for swizzling. The source and destination registers can be the same.

For DirectX 8, the mov instruction is the only instruction that can use the address register as a destination and only in vertex shaders version 1.1 or later. If the address register is the destination, then the value is rounded to the integer value that is less than or equal to the initial value. In VS 2.0 you must use the mova instruction to set the address register.

One slot

 _______________________________________________________________________________ mov   Dest0, Source0 _______________________________________________________________________________

Moves Source0 into Dest0. A special case is in DirectX 8 when Dest0 is an address register. In this case, the value stored is the closest integer value that is less than or equal to the initial value. This means that it rounds the number toward negative infinity. Thus 1.5 would get stored as 1, while −1.5 would get stored as −2. In DirectX 9, only floating point data can be moved.

Setup Source0 is the register to be copied.

Results Dest0 contains a copy of Source0, unless it's the address register in a DirectX 8 shader, in which case it is the nearest integer value that's less than or equal to the initial value in the register. If the destination is the address register, then, unless otherwise specified, only the Source0.x element is used.

 mov  r0   , r1
 mov  a0.x , c1.w // initializing address register, DirectX 8

   SetSourceRegisters();
   // Simulate the mov instruction
   // the destination is the address register in a DirectX 8 shader
   if ( Dest0 == a0 )
      {
      // use only integer part
      TempReg.x = (int) ::floor( Source0.x );
      }
   else // DirectX 9 shader or Dest0 is a float register
      {
      TempReg.x = Source0.x;
      TempReg.y = Source0.y;
      TempReg.z = Source0.z;
      TempReg.w = Source0.w;
      }
   WriteDestinationRegisters();

mova

vs 2.0

Moves data from a floating point register into the address register.

One slot

 _______________________________________________________________________________ mova  Dest0, Source0 _______________________________________________________________________________

This instruction rounds Source0 to the nearest integer and places the result in Dest0. Dest0 must be the address register. Rounding is to nearest even, though this is not exactly specified and applications should not rely on this behavior. That is, for values equidistant between two integers, some implementations may round up, down, or randomly pick a direction. The _sat modifier is not supported.

Setup Source0 is the floating point register to be rounded, then placed in the address register.

Results The rounded value from Source0 is placed in the address register.

 mova a0.x, c1.w // move one element
 mova a0,   c1   // move all

   SetSourceRegisters();
   // use only integer part
   // Note: RoundToNearestInteger() is
   // implementation dependent
   a0.x = RoundToNearestInteger( Source0.x );
   a0.y = RoundToNearestInteger( Source0.y );
   a0.z = RoundToNearestInteger( Source0.z );
   a0.w = RoundToNearestInteger( Source0.w );
   WriteDestinationRegisters();
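The difference between loading the address register with mov (DirectX 8) and with mova (vs 2.0) comes down to the rounding rule. The C++ sketch below contrasts the two. The function names are hypothetical, and `std::lround` (ties away from zero) is only one possible tie-breaking behavior, since the text above notes that ties are implementation dependent.

```cpp
#include <cassert>
#include <cmath>

// mov a0.x, src (vs 1.1): round toward negative infinity.
int movToAddress(float v) {
    return static_cast<int>(std::floor(v));
}

// mova a0.x, src (vs 2.0): round to the nearest integer.
// Assumption: ties are broken away from zero here; real
// implementations may differ, so avoid relying on x.5 inputs.
int movaToAddress(float v) {
    return static_cast<int>(std::lround(v));
}
```

An index computed as 1.6 selects register offset 1 through mov but offset 2 through mova, which matters when porting indexed constant lookups from version 1.1 to 2.0.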

mul

vs 1.0, 1.1, 2.0

Multiplies the two source registers element by element and stores them in the destination register.

One slot

 _______________________________________________________________________________ mul  Dest0, Source0, Source1 _______________________________________________________________________________

Multiplies Source0 by Source1 and stores the result in Dest0.

Setup Source0 and Source1 are the two registers to be multiplied.

Results Dest0 contains the result of the multiplication of Source0 and Source1.

 mul    r0, r1, r2

   SetSourceRegisters();
   // Simulate the mul instruction
   TempReg.x = Source0.x * Source1.x;
   TempReg.y = Source0.y * Source1.y;
   TempReg.z = Source0.z * Source1.z;
   TempReg.w = Source0.w * Source1.w;
   WriteDestinationRegisters();

nop

vs 1.0, 1.1, 2.0

Defines the null instruction (No-Operation).

One slot, possibly optimized out.

 _______________________________________________________________________________ nop _______________________________________________________________________________

You can use it to create a shader that does nothing but take up slots and/or time as it executes to see how a shader of that length would affect your rendering. It's possible that a driver might optimize away this instruction.

Setup None.

Results Nothing.

nop

nrm(macro)

vs 2.0

This macro will normalize all elements of a register.

Three slots

 _______________________________________________________________________________ nrm  Dest0, Source0 _______________________________________________________________________________

This macro will take all elements of Source0 and normalize them so that the square root of the sum of squares of all elements in Dest0 is one. Dest0 cannot be the same register as Source0.

This macro

 _______________________________________________________________________________ nrm Dest0, Source0 _______________________________________________________________________________

is equivalent to the following:

 _______________________________________________________________________________
 dp4     Dest0.x, Source0, Source0
 rsq     Dest0.x, Dest0.x
 mul     Dest0, Source0, Dest0.x
 _______________________________________________________________________________

  nrm   r0, v0
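The three-instruction expansion can be followed step by step in C++. This is a minimal sketch assuming the four-element dot product shown in the expansion above, and `Vec4`, `dp4`, and `nrm` are hypothetical names. It also makes the restriction on Dest0 visible: the first step stores into the destination while the last step still needs the untouched source.

```cpp
#include <cassert>
#include <cmath>

struct Vec4 { float x, y, z, w; };

float dp4(const Vec4& a, const Vec4& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

Vec4 nrm(const Vec4& s) {
    float t = dp4(s, s);             // dp4 Dest0.x, Source0, Source0
    assert(t > 0.0f);                // a zero vector cannot be normalized
    t = 1.0f / std::sqrt(t);         // rsq Dest0.x, Dest0.x
    return { s.x * t, s.y * t,       // mul Dest0, Source0, Dest0.x
             s.z * t, s.w * t };
}
```

For the input (3, 0, 4, 0) the length is 5, so the result is approximately (0.6, 0, 0.8, 0).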

pow(macro)

vs 2.0

Computes the power function for a scalar value.

Three slots

 _______________________________________________________________________________ pow   Dest0, Source0, Source1 _______________________________________________________________________________

Only the w element of each source register is used, and only the absolute value of Source0 is taken. Dest0 is filled with abs(Source0.w) raised to the Source1.w power. The result is replicated in all elements of the destination.

This macro

 _______________________________________________________________________________ pow   Dest0, Source0, Source1 _______________________________________________________________________________

is equivalent to the following:

 _______________________________________________________________________________
 log   Dest0.w, Source0 // takes absolute value
 mul   Dest0.w, Dest0.w, Source1.w
 exp   Dest0, Dest0.w
 _______________________________________________________________________________

 pow r0, r3, c6 // assume r3.w and c6.w are set
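The expansion relies on the identity |x|^y = 2^(y * log2|x|). The following C++ sketch walks through the same three steps for a single scalar. `powMacro` is a hypothetical name, and the standard library calls stand in for the log and exp instructions, whose precision on real hardware may be lower.

```cpp
#include <cassert>
#include <cmath>

// log t, |x|  then  mul t, t, y  then  exp dest, t  (all base 2)
float powMacro(float x, float y) {
    float t = std::log2(std::fabs(x));  // log takes the absolute value
    t *= y;                             // scale the exponent
    return std::exp2(t);                // back out of log space
}
```

Because the absolute value is taken first, a negative base behaves like its positive counterpart: powMacro(-3, 2) gives approximately 9.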

rcp

vs 1.0, 1.1, 2.0

Computes the reciprocal of an element of the source register and stores it in the destination register.

One slot

 _______________________________________________________________________________ rcp   Dest0, Source0 _______________________________________________________________________________

Computes the reciprocal of a single element of the source register and stores it in all elements of the destination register. Only one element of the source is used. If no element is specified, then Source0.w is used. A value of exactly 1 on input returns 1 on output (no round-off error), whereas a value of 0 on input returns positive infinity.

Setup Source0 contains the element of which to take the reciprocal. If unspecified, Source0.w is used.

Results Dest0 contains the reciprocal of the specified element copied in all elements.

Note

This is one of the few instructions that will take more than one clock to execute. Use it sparingly, and when you do use it, try to arrange your code so that you don't need the results immediately.

 rcp   r0, r1

   SetSourceRegisters();
   // Simulate the rcp instruction
   if ( 0.0f == Source0.w ) // if 0
      {
      TempReg.w = PLUS_INFINITY;
      }
   else if ( 1.0f == Source0.w ) // if 1
      {
      TempReg.w = 1.0f;
      }
   else
      {
      TempReg.w = 1.0f/Source0.w;
      }
   TempReg.x = TempReg.y = TempReg.z = TempReg.w;
   WriteDestinationRegisters();

rep

vs 2.0

Repeat. Indicates the start of a rep-endrep block.

One slot

 _______________________________________________________________________________ rep  IntSource0 _______________________________________________________________________________

IntSource0 must be an integer register. Only the .x element is used. The maximum initial value is 255. The block is executed IntSource0.x times, as long as that number is positive. Compare this to the loop instruction, which additionally maintains the loop counter register.

Setup IntSource0 must be an integer register with the .x element initialized to the number of times to iterate through the block.

Results The instructions in the rep-endrep block are executed IntSource0.x times.

 defi i0, 10, 0, 0, 0 // i0.x is set to the count
 rep i0
   // the instructions here will get executed i0.x times
 endrep

    // Simulate the rep instruction
    int LoopCounter = IntSource0.x;
    TopLoop:
    if (LoopCounter <= 0) goto EndLoop;
    // the instructions following the rep
    // instruction would go here
    // Simulate endrep instruction
    LoopCounter--;
    goto TopLoop;
    EndLoop:

ret

vs 2.0

Indicates the end of a subroutine.

One slot

 _______________________________________________________________________________ ret _______________________________________________________________________________

This instruction will return to the calling instruction (a call or callnz instruction) or return from the main function.

Setup Returns to the address following the most recent call or callnz instruction, or returns from the main function.

Results The path of execution is changed to the next instruction on the instruction stack.

ret

rsq

vs 1.0, 1.1, 2.0

Computes the reciprocal square root of the specified element of the source register and stores it in all elements of the destination register.

One slot

 _______________________________________________________________________________ rsq    Dest0, Source0 _______________________________________________________________________________

Computes the reciprocal square root of the specified element of the source register and stores it in all elements of the destination register. If no element is specified, then Source0.w is used. The absolute value of the input is used. A value of exactly 1 on input returns 1 on output (no round-off), whereas a value of 0 on input returns positive infinity.

Setup Source0 contains the element of which to take the reciprocal square root. If unspecified, Source0.w is used.

Results Dest0 contains the reciprocal square root of the absolute value of the specified element copied in all elements.

Note

This is one of the few instructions that will take more than one clock to execute. Use it sparingly, and when you do use it, try to arrange your code so that you don't need the results immediately.

 rsq    r0, r1

   SetSourceRegisters();
   // Simulate the rsq instruction
   float v = abs(Source0.w);
   if ( 0.0f == v ) // if 0
      {
      TempReg.w = PLUS_INFINITY;
      }
   else if ( 1.0f == v ) // if 1
      {
      TempReg.w = 1.0f;
      }
   else
      {
      TempReg.w = 1.0f/sqrt(v);
      }
   TempReg.x = TempReg.y = TempReg.z = TempReg.w;
   WriteDestinationRegisters();

sge

vs 1.0, 1.1, 2.0

Set Greater-Than or Equal-To. Stores 1 in the destination register if the first source register is greater than or equal to the second source register. If not, it stores 0 in the destination register. Does an element-by-element comparison and assignment.

One slot

 _______________________________________________________________________________ sge  Dest0, Source0, Source1 _______________________________________________________________________________

Compares the two source registers element by element. If the first source register's element is greater than or equal to the second source register's element, the value 1 is placed in the destination register's element. If not, it stores 0 in the destination register's element. The resulting register can be a mix of 0s and 1s.

Setup Source0 and Source1 are the registers to be compared.

Results The element Dest0.n contains 1.0 if the Source0.n is greater than or equal to Source1.n; otherwise, it contains 0.0. This is done for all elements of Dest0.

 sge         r0, r1, r2

    SetSourceRegisters();
    // Simulate the sge instruction
    TempReg.x = Source0.x >= Source1.x ?
       1.0f : 0.0f;
    TempReg.y = Source0.y >= Source1.y ?
       1.0f : 0.0f;
    TempReg.z = Source0.z >= Source1.z ?
       1.0f : 0.0f;
    TempReg.w = Source0.w >= Source1.w ?
       1.0f : 0.0f;
    WriteDestinationRegisters();
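Because the result is a register of 0s and 1s, it can serve as a per-element mask in later arithmetic. The C++ sketch below shows one illustrative use, a branch-free select of the form b + mask * (a - b). The names `Vec4`, `sge`, and `selectByMask` are hypothetical, and the select idiom is an example of usage, not part of the instruction itself.

```cpp
#include <cassert>

struct Vec4 { float x, y, z, w; };

// sge dest, a, b: 1.0 where a >= b, otherwise 0.0, per element.
Vec4 sge(const Vec4& a, const Vec4& b) {
    return { a.x >= b.x ? 1.0f : 0.0f, a.y >= b.y ? 1.0f : 0.0f,
             a.z >= b.z ? 1.0f : 0.0f, a.w >= b.w ? 1.0f : 0.0f };
}

// Where mask is 1 take a, where mask is 0 take b.
Vec4 selectByMask(const Vec4& mask, const Vec4& a, const Vec4& b) {
    return { b.x + mask.x * (a.x - b.x), b.y + mask.y * (a.y - b.y),
             b.z + mask.z * (a.z - b.z), b.w + mask.w * (a.w - b.w) };
}
```

Comparing (1, 2, 3, 4) with (4, 3, 3, 1) gives the mask (0, 0, 1, 1), and selecting with that mask picks the larger element in each position.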

sgn (macro)

vs 2.0

Computes the sign of each element in a register.

Three slots

 _______________________________________________________________________________ sgn  Dest0, Source0, Source1, Source2 _______________________________________________________________________________

Computes the sign of the elements of Source0, using two temporary scratch registers. All elements of the source registers are compared. The comparison is done element by element. Source1 and Source2 should be temporary registers and should not be the same. If an element in Source0 was > 0, then the corresponding element in Dest0 will be 1. If it was < 0, then the result will be −1. If it was 0, the result will be 0.

This macro

 _______________________________________________________________________________ sgn  Dest0, Source0, Source1, Source2 _______________________________________________________________________________

is equivalent to the following:

 _______________________________________________________________________________
 slt Source1, Source0, -Source0
 slt Source2, -Source0, Source0
 add Dest0, Source2, -Source1
 _______________________________________________________________________________

Note

Source1 and Source2 will be modified after this macro!

 sgn   r3, r0, r1, r2 // r1 and r2 are scratch registers
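The expansion works because each slt isolates one sign: the first is 1 only for negative inputs, the second only for positive inputs, and their difference is the sign. A scalar C++ sketch of the same three steps follows, with `slt` and `sgn` as hypothetical helper names.

```cpp
#include <cassert>

// slt: 1.0 when a < b, otherwise 0.0.
float slt(float a, float b) {
    return a < b ? 1.0f : 0.0f;
}

float sgn(float s) {
    float neg = slt(s, -s);   // slt Source1, Source0, -Source0 : 1 when s < 0
    float pos = slt(-s, s);   // slt Source2, -Source0, Source0 : 1 when s > 0
    return pos - neg;         // add Dest0, Source2, -Source1
}
```

For an input of zero both comparisons fail, so the result is 0 without any special case.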

sincos (macro)

vs 2.0

Computes the sine and cosine values for a scalar argument.

Eight slots

 _______________________________________________________________________________ sincos   Dest0, Source0, Source1, Source2 _______________________________________________________________________________

Estimates the sine and cosine values inside a shader with a maximum error of 0.002 through the use of a Taylor series expansion. Source0 must have a replicate swizzle to indicate which element to use. This should be a value in radians in the range [−π, π]. Dest0 should be a temporary register. The destination must have .x, .y, or .xy as a write mask.

Setup One element of Source0 has to have the value in radians. Source1 and Source2 have to be set up with the following values to perform the expansion.

Results The resulting cosine value is written to Dest0.x and the sine value to Dest0.y.

   // setup values
   def c1, 1.0f/(7!*128), 1.0f/(6!*64),
           1.0f/(4!*16), 1.0f/(5!*16)
   def c2, 1.0f/(3!*8), 1.0f/(2!*8), 1.0f, 0.5f
   // assume value to take sin/cos of is in r0.x
   sincos r0.xy, r0.x, c1, c2
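The powers of two paired with the factorials in the setup constants suggest that the series is evaluated on the half angle and then expanded with the double-angle identities. The C++ sketch below works under that assumption. It is not the actual hardware expansion, and `SinCos`, `sincosSketch`, and `worstError` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

struct SinCos { double sine; double cosine; };

// Assumed approach: Taylor series for sin(x/2) and cos(x/2),
// truncated at the 7th and 6th powers, then double-angle formulas.
SinCos sincosSketch(double x) {
    const double h = 0.5 * x;
    const double h2 = h * h;
    // h - h^3/3! + h^5/5! - h^7/7!
    const double s = h * (1.0 - h2 / 6.0 * (1.0 - h2 / 20.0 * (1.0 - h2 / 42.0)));
    // 1 - h^2/2! + h^4/4! - h^6/6!
    const double c = 1.0 - h2 / 2.0 * (1.0 - h2 / 12.0 * (1.0 - h2 / 30.0));
    return { 2.0 * s * c, c * c - s * s };
}

// Largest deviation from the library sin/cos over [-pi, pi].
double worstError(int samples) {
    assert(samples > 0);
    const double pi = 3.14159265358979323846;
    double worst = 0.0;
    for (int i = 0; i <= samples; ++i) {
        const double x = -pi + 2.0 * pi * i / samples;
        const SinCos r = sincosSketch(x);
        worst = std::fmax(worst, std::fabs(r.sine - std::sin(x)));
        worst = std::fmax(worst, std::fabs(r.cosine - std::cos(x)));
    }
    return worst;
}
```

In this sketch the error grows toward the ends of the interval and stays under 0.002 there, which agrees with the error bound quoted for the macro and explains why the input is expected to lie within [−π, π].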

slt

vs 1.0, 1.1, 2.0

Set Less-Than. Stores 1 in the destination register if the first source register is less than the second source register. If not, it stores 0 in the destination register. Does an element-by-element comparison and assignment.

One slot

 _______________________________________________________________________________ slt  Dest0, Source0, Source1 _______________________________________________________________________________

Compares the two source registers element by element. If the first source register's element is less than the second source register's element, the value 1 is placed in the destination register's element. If not, it stores 0 in the destination register's element. The resulting register may consist of a mix of 0s and 1s.

Setup Source0 and Source1 are the registers to be compared.

Results The element Dest0.n contains 1.0 if Source0.n is less than Source1.n; otherwise, it contains 0.0. This is done for all elements of Dest0.

 slt      r0, r1, r2

   SetSourceRegisters();
   // Simulate the slt instruction: compare Source0 with Source1, element by element
   TempReg.x = Source0.x < Source1.x ? 1.0f : 0.0f;
   TempReg.y = Source0.y < Source1.y ? 1.0f : 0.0f;
   TempReg.z = Source0.z < Source1.z ? 1.0f : 0.0f;
   TempReg.w = Source0.w < Source1.w ? 1.0f : 0.0f;
   WriteDestinationRegisters();

sub

vs 1.0, 1.1, 2.0

Subtracts one register from another and places the result into the destination register.

One slot

 _______________________________________________________________________________ sub   Dest0, Source0, Source1 _______________________________________________________________________________

Subtracts Source1 from Source0 and places the result in the Dest0 register.

Setup Two source registers, Source0 and Source1.

Results Each element of Dest0 is filled with the corresponding element of Source0 minus the corresponding element of Source1.

 sub  r0, r0, c2

   SetSourceRegisters();
   // Simulate the sub instruction: element-by-element Source0 - Source1
   TempReg.x = Source0.x - Source1.x;
   TempReg.y = Source0.y - Source1.y;
   TempReg.z = Source0.z - Source1.z;
   TempReg.w = Source0.w - Source1.w;
   WriteDestinationRegisters();

vs

vs 1.0, 1.1, 2.0

Defines the version of the vertex shader code you are using.

No slots

 _______________________________________________________________________________
 vs.integer1.integer2 // DirectX 8
 vs_integer1_integer2 // DirectX 9
 _______________________________________________________________________________

The argument takes the form vs.x.y for DirectX 8 shaders and vs_x_y for DirectX 9 shaders, where x is the main version number and y is the minor version number. Both values are integers.

Setup Two integers that instruct the assembler on the major and minor version numbers of the shader version you want to use. This must be the first instruction in your shader.

Results Tells the assembler what features to allow in the shader instructions that follow.

   // DirectX 8
   vs.1.0 // not using the address register in this one
   vs.1.1 // uses address register
   // DirectX 9
   vs_2_0

Register Masks, Swizzles, Negates, and Indexing

By default, a register reference in a shader expands to a reference to all the elements of the register, in x, y, z, w order. This means that whenever you write R, where R is the name of some register, it automatically gets expanded (at least conceptually) to R.xyzw. What I want to get across is that the specific element-by-element reference to the register is made without your having to do anything. This is merely semantic convenience and efficiency.

What this means is that it costs nothing extra to reorder or even ignore specific elements of a register if it makes sense to. I'll say it again: it costs nothing to use these masks and swizzles in your shaders. They are there so you can take advantage of the single-instruction multiple-data nature of the shader language to possibly merge similar computations or save on register usage. If, in fact, you don't need a value stored in every element of a register, then by all means use a destination mask and have the value written only to the elements you need.

destination mask/write mask

Note the word destination. Masks can be used only to select which elements of a register are written. (Hence they are often referred to as write masks.) Even if an instruction usually writes to all four elements of the destination, the mask can be used to select which element(s) of the destination are written. If you leave an element out of a mask, it doesn't get written. The elements in a mask must appear in order: x comes before y, which comes before z, which comes before w.

   mov      r1, c1 // use all
   mov      r1.xyzw, c1 // use all explicitly (default)
   mov      r1.xw, c1 // just move c1.x and c1.w
   mov      r1.wx, c1 // Error! - invalid order
   mov      r1.wzyx, c1 // Error! - invalid order
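The two write-mask rules (letters must appear in x, y, z, w order, and unmasked elements are left alone) can be modeled on the CPU. This is a minimal sketch in C; the function names parse_write_mask and write_masked and the bit encoding are assumptions made for illustration, not part of any shader API.

```c
/* Parse a write mask such as "xw" into bits (x=1, y=2, z=4, w=8).
   Returns -1 if a letter is unknown, repeated, or out of x,y,z,w order. */
int parse_write_mask(const char *mask) {
    const char order[] = "xyzw";
    int bits = 0;
    int next = 0;                 /* lowest element index still allowed */
    for (; *mask; ++mask) {
        int i = next;
        while (i < 4 && order[i] != *mask)
            ++i;
        if (i == 4)
            return -1;            /* not found at or after 'next' */
        bits |= 1 << i;
        next = i + 1;
    }
    return bits;
}

/* Copy only the masked elements of value into dest; others stay untouched. */
void write_masked(float dest[4], const float value[4], int bits) {
    for (int i = 0; i < 4; ++i)
        if (bits & (1 << i))
            dest[i] = value[i];
}
```

With this model, "xw" parses to a mask that updates only the first and last elements, while "wx" and "wzyx" are rejected, matching the error cases in the listing.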

Source swizzle

Note the word source. Swizzles can be used only to select the order and the elements of a register to use as a source. A swizzle consists of up to four letters drawn from x, y, z, and w; there is no restriction on the order, and a letter may be repeated. If fewer than four elements are specified, the last element specification is replicated to fill the rest.

   mov      r1,  c1 // use all - same as below
   mov      r1,  c1.xyzw // use all in order
   mov      r1,  c1.wzyx // reverse the order
   mov      r1,  c1.wwww // just use the w element
   mov      r1,  c1.xyzy // replace w with y
   mov      r1.w,  c1.zxyx // move c1.x into r1.w
   mov      r1.z,  c1.xzwy // move c1.w into r1.z
   mov      r1, c1.x // same as c1.xxxx
   mov      r1, c1.xy // same as c1.xyyy
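The replication rule is easy to get wrong by hand, so here is a minimal sketch in C that expands a swizzle string and applies it to a four-element source. The function name apply_swizzle and its error convention are assumptions made for illustration.

```c
/* Apply a source swizzle such as "wzyx", "x", or "xy" to src.
   With fewer than four letters, the last letter is replicated.
   out must not alias src. Returns 0 on success, -1 if the swizzle
   is empty, longer than four letters, or contains an unknown letter. */
int apply_swizzle(float out[4], const float src[4], const char *swz) {
    int idx[4];
    int n = 0;
    for (; *swz; ++swz) {
        int i;
        switch (*swz) {
            case 'x': i = 0; break;
            case 'y': i = 1; break;
            case 'z': i = 2; break;
            case 'w': i = 3; break;
            default:  return -1;
        }
        if (n == 4)
            return -1;
        idx[n++] = i;
    }
    if (n == 0)
        return -1;
    for (int k = n; k < 4; ++k)
        idx[k] = idx[n - 1];      /* replicate the last selector */
    for (int k = 0; k < 4; ++k)
        out[k] = src[idx[k]];
    return 0;
}
```

For a source holding (1, 2, 3, 4), the swizzle "xy" yields (1, 2, 2, 2) and "wzyx" yields (4, 3, 2, 1).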

source absolute value

Available only in vs 2.0. The absolute modifier takes the absolute value of the source register. If the _abs is used with the negate modifier, the _abs is done first.

   vs_2_0 // vertex shader 2.0 or better
   mov  r5,   r5_abs // absolute value of r5 in r5
   mov  r5,  -r5_abs // all values in r5 will be negative or zero
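The ordering of the two modifiers matters: taking the absolute value first and negating second always gives a non-positive result. A minimal sketch in C, with the function name apply_source_modifiers and its flag parameters assumed for illustration:

```c
/* Apply source modifiers in the order described: abs first, then negate. */
float apply_source_modifiers(float v, int use_abs, int use_negate) {
    if (use_abs && v < 0.0f)
        v = -v;                   /* _abs */
    if (use_negate)
        v = -v;                   /* leading minus sign */
    return v;
}
```

With both flags set, an input of 3 and an input of −3 each come out as −3.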

saturation instruction modifier

Available only in vs 2.0. The saturation instruction modifier clamps the results to the [0,1] range.

   vs_2_0 // vertex shader 2.0 or better
   add_sat r5, r5, r5 // r5 + r5, clamped to the [0,1] range
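The clamp is applied to the result of the instruction, element by element. A minimal sketch in C of an add followed by saturation; the names saturate1 and add_sat4 are hypothetical, and registers are modeled as plain float arrays.

```c
/* Clamp one value to the [0, 1] range. */
float saturate1(float v) {
    if (v < 0.0f) return 0.0f;
    if (v > 1.0f) return 1.0f;
    return v;
}

/* Hypothetical model of add_sat: element-wise add, then clamp each result. */
void add_sat4(float dest[4], const float a[4], const float b[4]) {
    for (int i = 0; i < 4; ++i)
        dest[i] = saturate1(a[i] + b[i]);
}
```

Adding (0.25, 0.75, −1, 0.5) to itself this way gives (0.5, 1, 0, 1) instead of (0.5, 1.5, −2, 1).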

source negation

Negation can be used to negate an entire source register. It can be combined with a swizzle.

   mov     r1, c1         // move c1 into r1
   mov     r1, -c1        // move negative c1 into r1
   mov     r1.w, -r1      // negate just r1's w and store it
   mov     r1.w, -c1.zzzy // move -c1.y into r1.w
   mov    -r1, c1         // Error! Can't negate destination

address registers

Not available in shader version 1.0. Only register a0.x can be used in version 1.1. The address register is used as a signed offset into the constant register file, and it can be written only by the mov or mova instruction. When the address register is used, the resulting index must fall within the legal range for the constant registers (i.e., 0 to 95 for most 1.1 vertex shaders).

   vs.1.1
   mov     a0.x, c5.w    // load a0.x
   mov     r1, c1        // regular move, format 1
   mov     r1, c[1]      // same thing, alternative format
   mov     r1, c[1+a0.x] // relative move
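Relative addressing amounts to adding a signed offset to a base index and reading the constant at the result. The following is a minimal sketch in C that also checks the legal range. The name read_constant, the 96-entry file size (taken from the 0 to 95 range quoted above), and the choice to return NULL for an out-of-range index are assumptions of the sketch; what real hardware does with an illegal index is not covered here.

```c
#include <stddef.h>

#define NUM_CONSTANTS 96   /* assumed: c0..c95 */

/* Model of c[base + a0.x]: returns the addressed constant register,
   or NULL when the computed index falls outside the constant file. */
const float *read_constant(float constants[][4], int base, int a0x) {
    int index = base + a0x;       /* a0.x acts as a signed offset */
    if (index < 0 || index >= NUM_CONSTANTS)
        return NULL;
    return constants[index];
}
```

With a0.x holding 3, the reference c[1+a0.x] reads c4; with a0.x holding −2 the index would be −1, which the sketch reports as invalid.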