Upcoming Events
Unite 2010
11/10 - 11/12 @ Montréal, Canada

GDC China
12/5 - 12/7 @ Shanghai, China

Asia Game Show 2010
12/24 - 12/27  

GDC 2011
2/28 - 3/4 @ San Francisco, CA

More events...
Quick Stats
64 people currently visiting GDNet.
2406 articles in the reference section.

Help us fight cancer!
Join SETI Team GDNet!
Link to us Events 4 Gamers
Intel sponsors gamedev.net search:

High Level View on Vertex Shader Programming

Only one vertex shader can be active at a time. It is a good idea to write vertex shaders on a per-task basis. The overhead of switching between different vertex shaders is smaller than for example a texture change. So if an object needs a special form of transformation or lighting it will get the proper shader for this task. Let's build an abstract example:

You are shipwrecked on a foreign planet. Dressed in your regular armor, armed only with a jigsaw, you move through the candle lit cellars. A monster appears and you crouch behind one of those crates one normally find on other planets. While thinking about your destiny as a hero who saves worlds with jigsaws, you start counting the number of vertex shaders for this scene.

There is one for the monster to animate it, light it and perhaps to reflect its environment. Other vertex shaders will be used for the floor, the walls, the crate, the camera, the candlelight and your jigsaw. Perhaps the floor, the walls, the jigsaw and the crate use the same shader, but the candlelight and the camera might each use one of their own. It depends on your design and the power of the underlying graphic hardware.

You might also use vertex shaders on a per-object or per-mesh basis. If for example a *.md3 model consists of, let's say, 10 meshes, you can use 10 different vertex shaders, but that might harm your game performance.

Every vertex shader-driven program must run through the following steps:

  • Check for vertex shader support by checking the D3DCAPS8::VertexShaderVersion field
  • Declaration of the vertex shader with the D3DVSD_* macros, to map vertex buffer streams to input registers
  • Setting the vertex shader constant registers with SetVertexShaderConstant()
  • Compiling previously written vertex shader with D3DXAssembleShader*() (Alternatives: could be pre-compiled using a Shader Assembler)
  • Creating a vertex shader handle with CreateVertexShader()
  • Setting a vertex shader with SetVertexShader() for a specific object
  • Delete a vertex shader with DeleteVertexShader()

Check for Vertex Shader Support

It is important to check the installed vertex shader software or hardware implementation of the end-user hardware. If there is a lack of support for specific features, then the application can fallback to a default behavior or give the user a hint, as to what he might do to enable the required features. The following statement checks for support of vertex shader version 1.1:

if( pCaps->VertexShaderVersion < D3DVS_VERSION(1,1) )
  return E_FAIL;

The following statement checks for support of vertex shader version 1.0:

if( pCaps->VertexShaderVersion < D3DVS_VERSION(1,0) )
  return E_FAIL;

The D3DCAPS8 structure caps must be filled in the startup phase of the application with a call to GetDeviceCaps(). If you use the Common Files Framework provided with the DirectX 8.1 SDK, this is done by the framework. If your graphics hardware does not support your requested vertex shader version, you must switch to software vertex shaders by using the D3DCREATE_SOFTWARE_VERTEXPROCESSING flag in the CreateDevice() call. The previously mentioned optimized software implementations made by Intel and AMD for their respective CPU's will then process the vertex shaders.

Supported vertex shader versions are:

Version: Functionality:
0.0 DirectX 7
1.0 DirectX 8 without address register A0
1.1 DirectX 8 and DirectX 8.1 with one address register A0
2.0 DirectX 9

The only difference between the levels 1.0 and 1.1 is the support of the a0 register. The DirectX 8.0 and DirectX 8.1 reference rasterizer and the software emulation delivered by Microsoft and written by Intel and AMD for their respective CPUs support version 1.1. At the time of this writing, only GeForce3/4TI and RADEON 8500-driven boards support version 1.1 in hardware. No known graphics card supports vs.1.0-only at the time of writing, so this is a legacy version.

Vertex Shader Declaration

You must declare a vertex shader before using it. This declaration can be called a static external interface. An example might look like this:

float c[4] = {0.0f,0.5f,1.0f,2.0f};
DWORD dwDecl0[] = {
  D3DVSD_STREAM(0),
  D3DVSD_REG(0, D3DVSDT_FLOAT3 ),    // input register v0
  D3DVSD_REG(5, D3DVSDT_D3DCOLOR ),  // input Register v5
                                     // set a few constants
  D3DVSD_CONST(0,1),*(DWORD*)&c[0],*(DWORD*)&c[1],*(DWORD*)&c[2],*(DWORD*)&c[3],
  D3DVSD_END()
};

This vertex shader declaration sets data stream 0 with D3DVSD_STREAM(0). Later, SetStreamSource() binds a vertex buffer to a device data stream by using this declaration. You are able to feed different data streams to the Direct3D rendering engine this way.

For example, one data stream could hold positions and normals, while a second held color values and texture coordinates. This also makes switching between single texture rendering and multi texture rendering trivial: just don't enable the stream with the second set of texture coordinates.

You must declare, which input vertex properties or incoming vertex data has to be mapped to which input register. D3DVSD_REG binds a single vertex register to a vertex element/property from the vertex stream. In our example a D3DVSDT_FLOAT3 value should be placed into the first input register and a D3DVSDT_D3DCOLOR color value should be placed in the sixth input register. For example the position data could be processed by the input register 0 (v0) with D3DVSD_REG(0, D3DVSDT_FLOAT3 ) and the normal data could be processed by input register 3 (v3) with D3DVSD_REG(3, D3DVSDT_FLOAT3 ).

How a developer maps each input vertex property to a specific input register is only important, if one want to use N-Patches, because the N-Patch Tessellator needs the position data in v0 and the normal data in v3. Otherwise the developer is free to define the mapping as they see fit. For example the position data could be processed by the input register 0 (v0) with D3DVSD_REG(0, D3DVSDT_FLOAT3) and the normal data could be processed by input register 3 (v3) with D3DVSD_REG(3, D3DVSDT_FLOAT3).

In contrast the mapping of the vertex data input to specific registers is fixed for the fixed-function pipeline. d3d8types.h holds a list of #defines that predefine the vertex input for the fixed-function pipeline. Specific vertex elements such as position or normal must be placed in specified registers located in the vertex input memory. For example the vertex position is bound by D3DVSDE_POSITION to Register 0, the diffuse color is bound by D3DVSDE_DIFFUSE to Register 5 etc.. Here's the whole list from d3d8types.h:
#define  D3DVSDE_POSITION      0
#define  D3DVSDE_BLENDWEIGHT   1
#define  D3DVSDE_BLENDINDICES  2
#define  D3DVSDE_NORMAL        3
#define  D3DVSDE_PSIZE         4
#define  D3DVSDE_DIFFUSE       5
#define  D3DVSDE_SPECULAR      6
#define  D3DVSDE_TEXCOORD0     7
#define  D3DVSDE_TEXCOORD1     8
#define  D3DVSDE_TEXCOORD2     9
#define  D3DVSDE_TEXCOORD3     10
#define  D3DVSDE_TEXCOORD4     11
#define  D3DVSDE_TEXCOORD5     12
#define  D3DVSDE_TEXCOORD6     13
#define  D3DVSDE_TEXCOORD7     14
#define  D3DVSDE_POSITION2     15
#define  D3DVSDE_NORMAL2       16

The second parameter of D3DVSD_REG specifies the dimensionality and arithmetic data type. The following values are defined in d3d8types.h:

// bit declarations for _Type fields
#define D3DVSDT_FLOAT1 0x00 // 1D float expanded to (value, 0., 0., 1.)
#define D3DVSDT_FLOAT2 0x01 // 2D float expanded to (value, value, 0., 1.)
#define D3DVSDT_FLOAT3 0x02 // 3D float expanded to (value, value, value, 1.)
#define D3DVSDT_FLOAT4 0x03 // 4D float

// 4D packed unsigned bytes mapped to 0. to 1. range
// Input is in D3DCOLOR format (ARGB) expanded to (R, G, B, A)
#define D3DVSDT_D3DCOLOR 0x04

#define D3DVSDT_UBYTE4 0x05 // 4D unsigned byte // 2D signed short expanded to (value, value, 0., 1.) #define D3DVSDT_SHORT2 0x06 #define D3DVSDT_SHORT4 0x07 // 4D signed short
Note. GeForce3/4TI doesn't support D3DVSDT_UBYTE4, as indicated by the D3DVTXPCAPS_NO_VSDT_UBYTE4 caps bit.

D3DVSD_CONST loads the constant values into the vertex shader constant memory. The first parameter is the start address of the constant array to begin filling data. Possible values range from 0 to 95 or in case of the RADEON 8500 from 0 - 191. We start at address 0. The second number is the number of constant vectors (quad-float) to load. One vector is 128 bit long, so we load four 32-bit FLOATs at once. If you want to load a 4x4 matrix, you would use the following statement to load four 128-bit quad-floats into the constant registers c0 - c3:

float c[16] = (0.0f, 0.5f, 1.0f, 2.0f,
               0.0f, 0.5f, 1.0f, 2.0f,
               0.0f, 0.5f, 1.0f, 2.0f,
               0.0f, 0.5f, 1.0f, 2.0f);
D3DVSD_CONST(0, 4), *(DWORD*)&c[0],*(DWORD*)&c[1],*(DWORD*)&c[2],*(DWORD*)&c[3],
                    *(DWORD*)&c[4],*(DWORD*)&c[5],*(DWORD*)&c[6],*(DWORD*)&c[7],
                    *(DWORD*)&c[8],*(DWORD*)&c[9],*(DWORD*)&c[10],*(DWORD*)&c[11],
                    *(DWORD*)&c[12],*(DWORD*)&c[13],*(DWORD*)&c[14],*(DWORD*)&c[15],

D3DVSD_END generates an END token to mark the end of the vertex shader declaration.

Another example can be:

float	c[4] = {0.0f,0.5f,1.0f,2.0f};
DWORD dwDecl[] = {
  D3DVSD_STREAM(0),
  D3DVSD_REG(0, D3DVSDT_FLOAT3 ), //input register v0
  D3DVSD_REG(3, D3DVSDT_FLOAT3 ), // input register v3
  D3DVSD_REG(5, D3DVSDT_D3DCOLOR ), // input register v5
  D3DVSD_REG(7, D3DVSDT_FLOAT2 ), // input register v7
  D3DVSD_CONST(0,1),*(DWORD*)&c[0],*(DWORD*)&c[1],*(DWORD*)&c[2],*(DWORD*)&c[3],
  D3DVSD_END()
};

Data stream 0 is set with D3DVSD_STREAM(0). The position values (value, value, value, 1.0) might be bound to v0, the normal values might be bound to v3, the diffuse color might be bound to v5 and one texture coordinate (value, value, 0.0, 1.0) might be bound to v7. The constant register c0 get one 128-bit value.

Setting the Vertex Shader Constant Registers

You will fill the vertex shader constant registers with SetVertexShaderConstant() and get the values from this registers with GetVertexShaderConstant():

// Set the vertex shader constants
m_pd3dDevice->SetVertexShaderConstant( 0, &vZero, 1 );
m_pd3dDevice->SetVertexShaderConstant( 1, &vOne, 1 );
m_pd3dDevice->SetVertexShaderConstant( 2, &vWeight, 1 );
m_pd3dDevice->SetVertexShaderConstant( 4, &matTranspose, 4 );
m_pd3dDevice->SetVertexShaderConstant( 8, &matCameraTranspose, 4 );
m_pd3dDevice->SetVertexShaderConstant( 12, &matViewTranspose, 4 );
m_pd3dDevice->SetVertexShaderConstant( 20, &fLight, 1 );
m_pd3dDevice->SetVertexShaderConstant( 21, &fDiffuse, 1 );
m_pd3dDevice->SetVertexShaderConstant( 22, &fAmbient, 1 );
m_pd3dDevice->SetVertexShaderConstant( 23, &fFog, 1 );
m_pd3dDevice->SetVertexShaderConstant( 24, &fCaustics, 1 );
m_pd3dDevice->SetVertexShaderConstant( 28, &matProjTranspose, 4 );

SetVertexShaderConstant() is declared as

HRESULT SetVertexShaderConstant(
  DWORD Register,
  CONST void* pConstantData,
  DWORD ConstantCount);

As stated earlier, there are at least 96 constant registers (RADEON 8500 has 192), that can be filled with four floating-point values before the vertex shader is executed. The first parameter holds the register address at which to start loading data into the vertex constant array. The last parameter holds the number of constants (4 x 32-bit values) to load into the vertex constant array. So in the first row above, vZero will be loaded into register 0. matTranspose will be loaded into register 4, 5, 6, and 7. matViewTranspose will be loaded into 12, 13, 14, 15. The registers 16, 17, 18, 19 are not used. fLight is loaded into register 20. The registers 25, 26, 27 are not used.

So what's the difference between D3DVSD_CONST used in the vertex shader declaration and SetVertexShaderConstant() ? D3DVSD_CONST can be used only once. SetVertexShaderConstant() can be used before every DrawPrimitive*() call.

Ok ... now we have learned how to check the supported version number of the vertex shader hardware, how to declare a vertex shader and how to set the constants in the constant registers of a vertex shader unit. Next we shall learn, how to write and compile a vertex shader program.

Writing and Compiling a Vertex Shader

Before we are able to compile a vertex shader, we must write one ... (old wisdom :-) ). I would like to give you a high-level overview of the instruction set first and then give further details of vertex shader programming in the next chapter named "Programming Vertex Shaders".

The syntax for every instruction is

OpName dest, [-]s1 [,[-]s2 [,[-]s3]] ;comment
e.g.
mov r1, r2
mad r1, r2, -r3, r4 ; contents of r3 are negated

There are 17 different instructions:

Instruction Parameters Action
add dest, src1, src2 add src1 to src2 (and the optional negation creates substraction)
dp3 dest, src1, src2 three-component dot product
dest.x = dest.y = dest.z = dest.w =
(src1.x * src2.x) + (src1.y * src2.y) + (src1.z * src2.z)
dp4 dest, src1, src2

four-component dot product
dest.w = (src1.x * src2.x) + (src1.y * src2.y) + (src1.z * src2.z) + (src1.w * src2.w);
dest.x = dest.y = dest.z = unused

What is the difference between dp4 and mul ? dp4 produces a scalar product and mul is a component by component vector product.

dst dest, src1, src2

The dst instruction works like this: The first source operand (src1) is assumed to be the vector (ignored, d*d, d*d, ignored) and the second source operand (src2) is assumed to be the vector (ignored, 1/d, ignored, 1/d).

Calculate distance vector:
dest.x = 1;
dest.y = src1.y * src2.y
dest.z = src1.z
dest.w = src2.w

dst is useful to calculate standard attenuation. Here is a code snippet that might calculate the attenuation for a point light:

; r7.w = distance * distance = (x*x) + (y*y) + (z*z)
dp3 r7.w, VECTOR_VERTEXTOLIGHT, VECTOR_VERTEXTOLIGHT

; VECTOR_VERTEXTOLIGHT.w = 1/sqrt(r7.w)
; = 1/||V|| = 1/distance
rsq VECTOR_VERTEXTOLIGHT.w, r7.w
...
; Get the attenuation
; d = distance
; Parameters for dst:
; src1 = (ignored, d * d, d * d, ignored)
; src2 = (ignored, 1/d, ignored, 1/d)
;
; r7.w = d * d
; VECTOR_VERTEXTOLIGHT.w = 1/d
dst r7, r7.wwww, VECTOR_VERTEXTOLIGHT.wwww
; dest.x = 1
; dest.y = src0.y * src1.y
; dest.z = src0.z
; dest.w = src1.w
; r7(1, d * d * 1 / d, d * d, 1/d)

; c[LIGHT_ATTENUATION].x = a0
; c[LIGHT_ATTENUATION].y = a1
; c[LIGHT_ATTENUATION].z = a2
; (a0 + a1*d + a2* (d * d))
dp3 r7.w, r7, c[LIGHT_ATTENUATION]
; 1 / (a0 + a1*d + a2* (d * d))
rcp ATTENUATION.w, r7.w
...
; Scale the light factors by the attenuation
mul r6, r5, ATTENUATION.w

expp dest, src.w Exponential 10-bit precision
------------------------------------------
float w = src.w;
float v = (float)floor(src.w);

dest.x = (float)pow(2, v);
dest.y = w - v;

// Reduced precision exponent
float tmp = (float)pow(2, w);
DWORD tmpd = *(DWORD*)&tmp & 0xffffff00;

dest.z = *(float*)&tmpd;
dest.w = 1;
--------------------------------------------
Shortcut:

dest.x = 2 **(int) src.w
dest.y = mantissa(src.w)
dest.z = expp(src.w)
dest.w = 1.0
lit dest, src

Calculates lighting coefficients from two dot products and a power.
---------------------------------------------
To calculate the lighting coefficients, set up the registers as shown:

src.x = N*L ; The dot product between normal and direction to light
src.y = N*H ; The dot product between normal and half vector
src.z = ignored ; This value is ignored
src.w = specular power ; The value must be between -128.0 and 128.0
----------------------------------------------
usage:

dp3 r0.x, rn, c[LIGHT_POSITION]
dp3 r0.y, rn, c[LIGHT_HALF_ANGLE]
mov r0.w, c[SPECULAR_POWER]
lit r0, r0
------------------------------------------------
dest.x = 1.0;
dest.y = max (src.x, 0.0, 0.0);
dest.z= 0.0;
if (src.x > 0.0 && src.w == 0.0)
  dest.z = 1.0;
else if (src.x > 0.0 && src.y > 0.0)
  dest.z = (src.y)src.w
dest.w = 1.0;

logp dest, src.w Logarithm 10-bit precision
---------------------------------------------------
float v = ABSF(src.w);
if (v != 0)
{
  int p = (int)(*(DWORD*)&v >> 23) - 127;
  dest.x = (float)p;  // exponent

  p = (*(DWORD*)&v & 0x7FFFFF) | 0x3f800000;
  dest.y = *(float*)&p; // mantissa;

  float tmp = (float)(log(v)/log(2));
  DWORD tmpd = *(DWORD*)&tmp & 0xffffff00;
  dest.z = *(float*)&tmpd;

  dest.w = 1;
}
else
{
  dest.x = MINUS_MAX();
  dest.y = 1.0f;
  dest.z = MINUS_MAX();
  dest.w = 1.0f;
}
-----------------------------------------------------
Sortcut:
dest.x = exponent((int)src.w)
dest.y = mantissa(src.w)
dest.z = log2(src.w)
dest.w = 1.0
mad dest, src1, src2, src3 dest = (src1 * src2) + src3
max dest, src1, src2 dest = (src1 >= src2)?src1:src2
min dest, src1, src2 dest = (src1 < src2)?src1:src2
mov dest, src

move
Optimization tip: question every use of mov (try to rap that !), because there might be methods that perform the desired operation directly from the source register or accept the required output register as the destination.

mul dest, src1, src2 set dest to the component by component product of src1 and src2

; To calculate the Cross Product (r5 = r7 X r8),
; r0 used as a temp
mul r0,-r7.zxyw,r8.yzxw
mad r5,-r7.yzxw,r8.zxyw,-r0

nop   do nothing
rcp dest, src.w
if(src.w == 1.0f)
{
  dest.x = dest.y = dest.z = dest.w = 1.0f;
}
else if(src.w == 0)
{
  dest.x = dest.y = dest.z = dest.w = PLUS_INFINITY();
}
else
{
  dest.x = dest.y = dest.z = m_dest.w = 1.0f/src.w;
}

Division:
; scalar r0.x = r1.x/r2.x
RCP r0.x, r2.x
MUL r0.x, r1.x, r0.x
rsq dest, src

reciprocal square root of src
(much more useful than straight 'square root'):

float v = ABSF(src.w);
if(v == 1.0f)
{
  dest.x = dest.y = dest.z = dest.w = 1.0f;
}
else if(v == 0)
{
  dest.x = dest.y = dest.z = dest.w = PLUS_INFINITY();
}
else
{
  v = (float)(1.0f / sqrt(v));
  dest.x = dest.y = dest.z = dest.w = v;
}

Square root:
; scalar r0.x = sqrt(r1.x)
RSQ r0.x, r1.x
MUL r0.x, r0.x, r1.x
sge dest, src1, src2

dest = (src1 >=src2) ? 1 : 0

useful to mimic conditional statements:
; compute r0 = (r1 >= r2) ? r3 : r4
; one if (r1 >= r2) holds, zero otherwise
SGE r0, r1, r2
ADD r1, r3, -r4
; r0 = r0*(r3-r4) + r4 = r0*r3 + (1-r0)*r4
; effectively, LERP between extremes of r3 and r4
MAD r0, r0, r1, r4

slt dest, src1, src2 dest = (src1 < src2) ? 1 : 0

You can download this list as a word file from www.shaderx.com. Check out the SDK for additional information.

The Vertex Shader ALU is a multi-threaded vector processor that operates on quad-float data. It consists of two functional units. The SIMD Vector Unit is responsible for the mov, mul, add, mad, dp3, dp4, dst, min, max, slt and sge instructions. The Special Function Unit is responsible for the rcp, rsq, log, exp and lit instructions. Most of these instructions take one cycle to execute, rcp and rsq take more than one cycle under specific circumstances. They take only one slot in the vertex shader, but they actually take longer then one cycle to execute, when the result is used immediately, because that leads to a register stall.

Application Hints

rsq is, for example, used in normalizing vectors to be used in lighting equations. The exponential instruction expp can be used for fog effects, procedural noise generation (see NVIDIA Perlin Noise example), behavior of particles in a particle system (see NVIDIA Particle System example) or to implement a system how objects in a game are damaged. You will use it in any case when a fast changing function is necessary. This is contrary of the use of logarithm functions with logp, that are useful if an extremely slow growing is necessary (also they grow at the beginning pretty fast). A log function can be the inverse of a exponential function, means it undoes the operation of the exponential function.

The lit instruction deals by default with directional lights. It calculates the diffuse & specular factors with clamping based on N * L and N * H and the specular power. There is no attenuation involved, but you can use an attenuation level separately with the result of lit by using the dst instruction. This is useful for constructing attenuation factors for point and spot lights.

The min and max instructions allow for clamping and absolute value computation.

Complex Instructions in the Vertex Shader

There are also complex instructions, that are supported by the vertex shader. The term "macro" should not be used to refer to these instructions, because they are not simply substituted like a C-preprocessor macro. You should think carefully before using these instructions. If you use them, you might lose control over your 128-instruction limit and possible optimization path(s). On the other hand, the software emulation mode provided by Intel or by AMD for their processors is able to optimize a m4x4 complex instruction (and perhaps others now or in the future). It is also possible that, in the future some graphics hardware may use gate count to optimize the m4x4. So, if you need, for example four dp4 calls in your vertex shader assembly source, it might be a good idea to replace them by m4x4. If you have decided to use for example a m4x4 instruction in your shader, you should not use a dp4 call on the same data later, because there are slightly different transformation results. If, for example, both instructions are used for position calculation, z-fighting could result:

Macro Parameters Action Clocks
expp dest, src1 provides exponential with full precision to at least 1/220 12
frc dest, src1 returns fractional portion of each input component 3
log dest, src1 provides log2(x) with full float precision of at least 1/220 12
m3x2 dest, src1, src2 computes the product of the input vector and a 3x2 matrix 2
m3x3 dest, src1, src2 computes the product of the input vector and a 3x3 matrix 3
m3x4 dest, src1, src2 computes the product of the input vector and a 3x4 matrix 4
m4x3 dest, src1, src2 computes the product of the input vector and a 4x3 matrix 3
m4x4 dest, src1, src2 computes the product of the input vector and a 4x4 matrix 4

You are able to perform all transform and lighting operations with these instructions. If it seems to you that some instructions are missing, rest assured that you can achieve them through the existing instructions for example, the division of two numbers can be realized with a reciprocal and a multiply. You can even implement the whole fixed-function pipeline by using these instructions in a vertex shader. This is shown in the NVLink example of NVIDIA.

Putting it All Together

Now let's see how these registers and instructions are typically used in the vertex shader ALU.

In vs.1.1 there are 16 input registers, 96 constant registers, 12 temporary registers, 1 address register and up to 13 output registers per rasterizer. Each register can handle 4x32-bit values. Each 32-bit value is accessible via an x, y, z and w subscript. That is, a 128-bit value consists of a x, y, z and w value. To access these register components, you must add .x, .y, .z and .w at the end of the register name. Let's start with the input registers:

Using the Input Registers

The 16 input registers can be accessed by using their names v0 to v15. Typical values provided to the input vertex registers are:

  • Position(x,y,z,w)
  • Diffuse color (r,g,b,a) -> 0.0 to +1.0
  • Specular color (r,g,b,a) -> 0.0 to +1.0
  • Up to 8 Texture coordinates (each as s, t, r, q or u, v , w, q) but normally 4 or 6, dependent on hardware support
  • Fog (f,*,*,*) -> value used in fog equation
  • Point size (p,*,*,*)

You can access the x-component of the position with v0.x, the y-component with v0.y and so on. If you need to know the green component of the RGBA diffuse color, you check v1.y. You may set the fog value for example into v7.x. The other three 32-bit components, v7.y, v7.z and v7.w would not be used. The input registers are read-only. Each instruction may access only one vertex input register. Unspecified components of the input register default to 0.0 for the x, y and z components and to 1.0 for the w component. In the following example the four-component dot product between each of c0 - c3 and v0 is stored in oPos:

dp4 oPos.x , v0 , c0
dp4 oPos.y , v0 , c1
dp4 oPos.z , v0 , c2
dp4 oPos.w , v0 , c3

Such a code fragment is usually used to map from projection space, with the help of the already concatenated world-, view- and projection matrices, to clip space. The four component dot product performs the following calculation:

oPos.x = (v0.x * c0.x) + (v0.y * c0.y) + (v0.z * c0.z) + (v0.w * c0.w)

Given that we use unit length (normalized) vectors, it is known that the dot product of two vectors will always range between [-1, 1]. Therefore oPos will always get values in that range. Alternatively, you could use:

m4x4 oPos, v0 , c0

Don't forget to use those complex instructions consistently throughtout your vertex shader, because as described above, there might be slight differences between dp4 and m4x4 results. You are restricted to using only one input register in each instruction.

All data in an input register remains persistent throughout the vertex shader execution and even longer. That means they retain their data longer than the life-time of a vertex shader. So it is possible to re-use the data of the input registers in the next vertex shader.

Using the Constant Registers

Typical uses for the constant registers include:

  • Matrix data: quad-floats are typically one row of a 4x4 matrix
  • Light characteristics, (position, attenuation etc)
  • Current time
  • Vertex interpolation data
  • Procedural data

There are 96 quad-floats (or in the case of the RADEON 8500, 192 quad-floats) for storing constant data. This reasonably large set of matrices can be used for example, for indexed vertex blending, more commonly known as "matrix palette skinning".

The constant registers are read-only from the perspective of the vertex shader, whereas the application can read and write into the constant registers. The constant registers retain their data longer than the life-time of a vertex shader so it is possible to re-use this data in the next vertex shader. This allows an app to avoid making redundant SetVertexShaderConstant() calls. Reads from out-of-range constant registers return (0.0, 0.0, 0.0, 0.0).

You can use only one constant register per instruction, but you can use it several times. For example:

; the following instruction is legal
mul r5, c11, c11 ; The product of c11 and c11 is stored in r5

; but this is illegal
add v0, c4, c3

A more complicated-looking, but legal, example is:

; dest = (src1 * src2) + src3
mad r0, r0, c20, c20 ; multiplies r0 with c20 and adds c20 to the result

Using the Address Register

You access the address registers with a0 to an (more than one address register should be available in vertex shader versions higher than 1.1). The only use of a0 in vs.1.1 is as an indirect addressing operator to offset constant memory.

c[a0.x + n] ; supported only in version 1.1 and higher
            ; n is the base address and a0.x is the address offset

Here is an example using the address register:

//Set 1
mov a0.x,r1.x
m4x3 r4,v0,c[a0.x + 9];
m3x3 r5,v3,c[a0.x + 9];

Depending on the value that is stored in temporary register r1.x, different constant registers are used in the m4x3 and m3x3 instructions. Please not that register a0 only stores whole numbers and no fractions (integers only) and that a0.x is the only valid component of a0. Further, a vertex shader may write to a0.x only via the mov instruction.

Beware of a0.x if there is only a software emulation mode: performance can be significantly reduced [Pallister].

Using the Temporary Registers

You can access the 12 temporary registers using r0 to r11. Here are a few examples:

dp3 r2, r1, -c4 ; A three-component dot product: dest.x = dest.y = dest.z =
                ; dest.w = (r1.x * -c4.x) + (r1.y * -c4.y) + (r1.z * -c4.z)
...
mov r0.x, v0.x
mov r0.y, c4.w
mov r0.z, v0.y
mov r0.w, c4.w

Each temporary register has single write and triple read access. Therefore an instruction could have the same temporary register as a source three times. Vertex shaders can not read a value from a temporary register before writing to it. If you try to read a temporary register that was not filled with a value, the API will give you an error message while creating the vertex shader (== CreateVertexShader()).

Using the Output Registers

There are up to 13 write-only output registers that can be be accessed using the following register names. They are defined as the inputs to the rasterizer and the name of each registers is preceded by a lower case 'o'. The output registers are named to suggest their use by pixel shaders.

Name Value Description
oDn 2 quad-floats Output color data directly to the pixel shader. Required for diffuse color (oD0) and specular color (oD1).
oPos 1 quad-float Output position in homogenous clipping space. Must be written by the vertex shader.
oTn up to 8 quad-floats
Geforce 3: 4
RADEON 8500: 6
Output texture coordinates. Required for maximum number of textures simultaneously bound to the texture blending stage.
oPts.x 1 scalar float Output point-size registers. Only the scalar x-component of the point size is functional
oFog.x 1 scalar float the fog factor to be interpolated and then routed to the fog table. Only the scalar x-component is functional.

Here is a typical example, that shows how to use the oPos, oD0 and oT0 registers:

dp4 oPos.x , v0 , c4 ; emit projected x position
dp4 oPos.y , v0 , c5 ; emit projected y position
dp4 oPos.z , v0 , c6 ; emit projected z position
dp4 oPos.w , v0 , c7 ; emit projected w position
mov oD0, v5          ; set the diffuse color
mov oT0, v2 ; outputs the texture coordinates to oT0 from input register v2

Using the four dp4 instructions to map from projection to clip space with the already concatenated world-, view- and projection matrices was already shown above. The first mov instruction moves the content of the v5 input register into the color output register and the second mov instruction moves the values of the v2 register into the first output texture register.

Using the oFog.x output register is shown in the following example:

; Scale by fog parameters :
; c5.x = fog start
; c5.y = fog end
; c5.z = 1/range
; c5.w = fog max
dp4 r2, v0, c2 ; r2 = distance to camera
sge r3, c0, c0 ; r3 = 1
add r2, r2, -c5.x           ; camera space depth (z) - fog start
mad r3.x, -r2.x, c5.z, r3.x ; 1.0 - (z - fog start) * 1/range
                            ; because fog=1.0 means no fog, and
                            ; fog=0.0 means full fog
max oFog.x, c5.w, r3.x      ; clamp the fog with our custom max value

Having a fog distance value permits more general fog effects, than using the position's z or w values. The fog distance value is interpolated before use as a distance in the standard fog equations used later in the pipeline.

Every vertex shader must write at least to one component of oPos or you will get an error message by the assembler.

When using vertex shaders the D3DTSS_TCI_* flags of D3DTSS_TEXCOORDINDEX are ignored. All texture coordinates are mapped in numerical order.

Optimization tip: emit to oPos as early as possible to trigger parallelism in the pixel shader. Try to reorder the vertex shader instructions to make this happen.

All iterated values transferred out of the vertex shader are clamped to [0..1]. If you need signed values in the pixel shader, you must bias them in the vertex shader, and then re-expand them in the pixel shader by using _bx2.

Swizzling and Masking

If you use the input, constant and temporary registers as source registers, you can swizzle the .x, .y, .z and .w values independently of each other. If you use the output and temporary registers as destination registers you can use the .x, .y, .z and .w values as write-masks. Here are the details:

Swizzling (only source registers: vn, cn, rn)

Swizzling is very useful for efficiently, where the source registers need to be rotated - like cross products. Another use is converting constants such as (0.5, 0.0, 1.0, 0.6) into other forms such as (0.0, 0.0, 1.0, 0.0) or (0.6, 1.0, -0.5, 0.6).

All registers, that are used in instructions as source registers can be swizzled. For example

mov R1, R2.wxyz;


Figure 15 - Swizzling

The destination register is R1, where R could be a write-enabled register like the output (o*) or any of the temporary registers (r). The source register is R2, where R could be a input (v), constant (c) or temporary register (source registers are located on the right side of the destination register in the instruction syntax).

The following instruction copies the negation of R2.x into R1.x, the negation of R2.y into R1.y and R1.z and the negation of R2.z into R1.w. As shown, all source registers can be negated and swizzled at the same time:

mov R1, -R2.xyyz


Figure 16 - Swizzling #2

Masking (only destination registers: on, rn)

A destination register can mask which components are written to it. If you use R1 as the destination register (acutally any write-enabled registers : o*, r), all the components are written from R2 to R1. If you choose for example

mov  R1.x, R2

only the x component is written to R1, whereas

mov  R1.xw, R2

writes only the x and w components of R2 to R1. No swizzling or negation is supported on the destination registers.

Here is the source for a 3-vector cross-product:

; r0 = r1 x r2 (3-vector cross-product)
mul  r0, r1.yzxw, r2.zxyw
mad  r0, -r2.yzxw, r1.zxyw, r0

This is explained in detail in [LeGrand].

The following table summarizes swizzling and masking:

Component Modifier Description
R.[x][y][z][w] Destination mask
R.xwzy (for example) Source swizzle
-R Source negation

Since any source can be negated, there is no need for a subtract instruction.

Guidelines for Writing Vertex Shaders

The most important restrictions you should remember when writing vertex shaders are the following:

  • They must write to at least one component of the output register oPos
  • There is a 128 instruction limit
  • Every instruction may source no more than one constant register, e.g. add r0, c4, c3 will fail
  • Every instruction may source no more than one input register, e.g. add r0, v1, v2 will fail
  • There are no C-like conditional statements, but you can mimic an instruction of the form r0 = (r1 >= r2) ? r3 : r4 with the sge instruction
  • All iterated values transferred out of the vertex shader are clamped to [0..1]

There are several ways to optimize vertex shaders. Here are a few rules of thumb:

  • Read the paper from Kim Pallister on optimizing software vertex shaders [Pallister]
  • When setting vertex shader constant data, try to set all data in one SetVertexShaderConstant() call
  • Pause and think about using a mov instruction; you may be able to avoid it
  • Choose instructions that perform multiple operations over instructions that perform single operations
    mad   r4,r3,c9,r4
    mov   oD0,r4
    ==
    mad   oD0,r3,c9,r4
    
  • Collapse (remove complex instructions like m4x4 or m3x3 instructions) vertex shaders before thinking about optimizations
  • A rule of thumb for load-balancing between the CPU/GPU: Many calculations in shaders can be pulled outside and reformulated per-object instead of per-vertex and put into constant registers. If you are doing some calculation which is per object rather than per vertex, then do it on the CPU and upload it on the vertex shader as a constant, rather than doing it on the GPU

    One of the most interesting methods to optimize your applications bandwidth usage, is the usage of compressed vertex data [Calver].

Now that you have an abstract overview, of how to write vertex shaders, I would like to mention at least three different ways to compile one.

Compiling a Vertex Shader

Direct3D uses byte-codes, whereas OpenGL implementations parses a string. Therefore the Direct3D developer needs to assemble the vertex shader source with an assembler. This might help you find bugs earlier in your development cycle and it also reduces load-time.

I see three different ways to compile a vertex shader:

  • write the vertex shader source into a separate ASCII file for example test.vsh and compile it with a vertex shader assembler into a binary file, for example test.vso. This file will be opened and read at game start up. This way, not every person will be able to read and modify your vertex shader source. 
  • Don't forget that NVLink can link together already compiled shader fragments at run-time.
  • write the vertex shader source into a separate ASCII file or as a char string into your *.cpp file and compile it "on the fly" while the app starts up with the D3DXAssembleShader*() functions.
  • write the vertex shader source in an effects file and open this effect file when the app starts up. The vertex shader can be compiled by reading the effect files with D3DXCreateEffectFromFile(). It is also possible to pre-compile an effects file. This way, most of the handling of vertex shaders is simplified and handled by the effect file functions.

    Another way is to use the opcodes shown in d3dtypes.h and build your own vertex assembler/disassembler.

Let's review, what we have examined so far. After we ...

  • checked the vertex shader support with the D3DCAPS8::VertexShaderVersion field
  • we declared a vertex shader with the D3DVSD_* macros
  • then we set the constant registers with SetVertexShaderConstant() and
  • wrote and compiled the vertex shader

Now we need to get a handle to call it.

Creating a Vertex Shader

The CreateVertexShader() function is used to create and validate a vertex shader:

HRESULT CreateVertexShader(
    CONST DWORD* pDeclaration,
    CONST DWORD* pFunction,
    DWORD* pHandle,
    DWORD Usage);

This function takes the vertex shader declaration (which maps vertex buffer streams to different vertex input registers) in pDeclaration as a pointer and returns the shader handle in pHandle. The second parameter pFunction gets the vertex shader instructions compiled by D3DXAssembleShader() / D3DXAssembleShaderFromFile() or the binary code pre-compiled by a vertex shader assembler. With the fourth parameter you can force software vertex processing with D3DUSAGE_SOFTWAREPROCESSING. It must be used, when D3DRS_SOFTWAREVERTEXPROCESSING is set to TRUE. By setting the software processing path explicitly, vertex shades are simulated by the CPU by using the software vertex shader implementation of the CPU vendors. If a vertex shader-capable GPU is available, using hardware vertex processing should be faster. You must use this flag or the reference rasterizer for debugging with the NVIDIA Shader Debugger.

Setting a Vertex Shader

You set a vertex shader for a specific object by using SetVertexShader() before the DrawPrimitive*() call of this object. This function dynamically loads the vertex shader between the primitive calls.

// set the vertex shader
m_pd3dDevice->SetVertexShader( m_dwVertexShader );

The only parameter you must provide is the handle of the vertex shader created by CreateVertexShader(). The overhead of this call is lower than a SetTexture() call, so you are able to use it often.

Vertex Shaders are executed with SetVertexShader() as many times as there are vertices. For example if you try to visualize a rotating quad with four vertices implemented as an indexed triangle list, you will see in the NVIDIA Shader Debugger, that the vertex shader runs four times, before the DrawPrimitive*() function is called.

Free Vertex Shader Resources

When the game shuts down or when the device is changed, the resources taken by the vertex shader must be released. This must be done by calling DeleteVertexShader() with the vertex shader handle:

// delete the vertex shader
if (m_pd3dDevice->m_dwVertexShader != 0xffffffff)
{
  m_pd3dDevice->DeleteVertexShader( m_dwVertexShader );
  m_pd3dDevice->m_dwVertexShader = 0xffffffff;
}




Conclusion

Contents
  Introduction
  Vertex Shaders in the Pipeline
  Vertex Shader Tools
  Vertex Shader Architecture
  High Level View of Vertex Shader Programming
  Conclusion

  Printable version
  Discuss this article

The Series
  Fundamentals of Vertex Shaders
  Programming Vertex Shaders
  Fundamentals of Pixel Shaders
  Programming Pixel Shaders
  Diffuse & Specular Lighting with Pixel Shaders