Introduction to Shader Programming Part III
Fundamentals of Pixel Shaders
by Wolfgang F. Engel
(Last modification: April 1st, 2002)

The final output of any 3D graphics hardware consists of pixels. Depending on the resolution, in excess of 2 million pixels may need to be rendered, lit, shaded and colored. Prior to DirectX 8.0, Direct3D used a fixed-function multitexture cascade for pixel processing. The effects possible with this approach were very limited on the implementation of the graphics card device driver and the specific underlying hardware. A programmer was restricted on the graphic algorithms implemented by these.

With the introduction of shaders in DirectX 8.0 and the improvements of pixel shaders in DirectX 8.1, a whole new programming universe can be explored by game/demo coders. Pixel shaders are small programs that are executed on individual pixels. This is an unprecedented level of hardware control for their users.

The third part in this "Introduction to Shader Programming", named "Fundamentals of Pixel Shaders", shows you

the place and the tasks of pixel shaders in the Direct3D pipeline
the architecture of the pixel shader unit
tools, that help to program pixel shaders
the ingredients for a pixel shader-driven application

The fourth and last part will present many pixel shader concepts and algorithms with example code.

Why use Pixel Shaders?

Pixel shaders are at the time of this writing supported by GeForce 3/4TI and RADEON 8500-based cards. Unlike vertex shaders, however, there is no feasible way of emulating pixel shaders using software.

The best argument for using pixel shaders is to take a look at a few demos that uses them :-) ... or just one word: per-pixel non-standardized lighting. The gain in visual experience is enormous. You can use membrane shaders, for balloons, skins, kubelka-munk shaders for translucency effects, fur/hair shaders and new lighting formulas that lead to a totally new lighting experience. Think of a foreign planet with three moons moving faster than the two suns or with a planet surface consisting of different crystalline substances, reflecting light in different ways. The following list should give you a glimpse on the many types of effects that are possible by using pixel shaders now:

Single pass, per-pixel lighting (see next part)
True phong shading [Beaudoin/Guardado]
Anisotropic lighting [Isidoro/Brennan]
Non-Photorealistic-Rendering: cartoon shading, hatching, gooch lighting, image space techniques [Card/Mitchell]
Per-pixel fresnel term [Brennan]
Volumetric effects [Kraus][Hart]
Advanced bump mapping (self-shadowing bump maps (also known as Horizon Mapping)
Procedural textures [Zecha] and texture perturbation [Isidoro/Riguer]
Bidirectional reflectance distribution functions [Moravánsky]

Not to mention the effects that are not discovered until now or that are discovered but used only in scientific journals. These visual effects are waiting until they get implemented by you :-).

One of the biggest catches of pixel shaders is that they often have to be "driven" by the vertex shader. For example to calculate per-pixel lighting the pixel shader needs the orientation of the triangle, the orientation of the light vector and in some cases the orientation of the view vector. The graphics pipeline shows the relationship between vertex and pixel shaders.

Pixel Shaders in the Pipeline

Figure 1 - Direct3D Pipeline

The following diagram shows the DX6/7 multitexturing unit and the new pixel shader unit. On pixel shader-capable hardware, the pixel shader heightens the number of transistors on the graphic chip, because it is added to the already existing DX 6/7 multitexturing hardware. Therefore it is also functionally an independent unit, that the developer can choose instead of the DX6/7 multitexturing unit.

But what happens in the 3D pipeline before the pixel shader? A vertex leaves the Vertex Shader as a transformed and colored vertex. The so-called Backface Culling removes all triangles, that are facing away from the viewer or camera. These are by default the vertices that are grouped counter-clockwise. On average, half of your game world triangles will be facing away from the camera at any given time, so this helps reducing rendering time. A critical point is the usage of translucent or transparent front triangles. Depending on what is going to be rendered, Backface culling can be switched off with the D3DCULL_NONE flag.

Backface Culling uses the crossproduct of two sides of a triangle to calculate a vector that is perpendicular to the plane that is formed by these two sides. This vector is the face normal. The direction of this normal determines whether the triangle is front- or backfacing. Because the Direct3D API always uses the same vertex order to calculate the crossproduct, it is known whether a triangle's vertices are "wound" clockwise or counter-clockwise.

User Clip Planes can be set by the developer to clip triangles with the help of the graphics card, that are outside of these planes and therefore to reduce the number of calculations. How to set User Clip Planes is shown in the Direct3D 8.1 example named "Clip Mirror" and in the ATI RADEON 8500 Nature demo. The RADEON and RADEON 8500 support six independant user clip planes in hardware. The clip planes on the GeForce 1/2/3/4TI chips are implemented using texture stages. That means that two User Clip Planes you use, eat up one texture stage.

It looks like NVIDIA no longer exposes the cap bits for these.The DirectX Caps Viewer reports MaxUserClipPlanes == 0 since the release of the 12.41 drivers.

One alternative to clip planes is shown in the "TexKillClipPlanes" example delivered in the NVIDIA Effects Browser, where the pixel shader instruction texkill is used to get a similar functionality like clip planes.

Another alternative to clip planes is guard band clipping as supported by GeForce hardware [Dietrich01]. The basic idea of guard band clipping is that hardware with a guard band region can accept triangles that are partially or totally off-screen, thus avoiding expensive clipping work. There are four cap fields for Guard Band support in the D3DCAPS8 structure: GuardBandLeft, GuardBandRight, GuardBandTop, and GuardBandBottom. These represent the boundaries of the guard band. These values have to be used in the clipping code. Guard band clipping is handled automatically by switching on clipping in Direct3D.

Frustum Clipping is performed on the viewing frustum. Primitives that lie partially or totally off-screen must be clipped to the screen or viewport boundary, which is represented by a 3D viewing frustum. The viewing frustum can be visualized as a pyramid, with the camera positioned at the tip. The view volume defines what the camera will see and won't see. The entire scene must fit between the new and far clipping planes and also be bounded by the sides, the bottom and top of the frustum. A given object in relation to the viewing frustum can fall into one of three categories [Watt92]:

The object lies completely outside the viewing in which case it is discarded.
The object lies completely inside the viewing frustum, in which case it is passed on to the Homogenous Divide.
The object intersects the viewing frustum, in which case it is clipped against it and then passed on to the Homogenous Divide.

Translating the definition of the viewing frustrum above into homogenous coordinates gives us the clipping limits:

-w <= x <= w
-w <= y <= w
0 <= z <= w

This would be a slow process, if it has to be done by the CPU, because each edge of each triangle that crosses the viewport boundary must have an intersection point calculated, and each parameter of the vertex ( x,y,z diffuse r,g,b, specular r,g,b, alpha, fog, u and v ) must be interpolated accordingly. Therefore frustrum clipping is done by modern graphics hardware on the GPU essentially for free with clip planes and guard band clipping.

After Frustrum clipping, the Homogenous or perspective Divide happens. This means that the x-, y- and z-coordinates of each vertex of the homogenous coordinates are divided by w. The perspective divide makes nearer objects larger, and further objects smaller as you would expect when viewing a scene in reality. After this division, the coordinates are in a normalized space:

-1 <= x/w <= 1
-1 <= y/w <= 1
0 <= z/w <= 1

Why do we need this divide through w? By definition every point gets a fourth component w that measures distance along an imaginary fourth-dimensional axis called w. For example, (4, 2, 8, 2) represents the same point as (2, 1, 4, 1). Such a w-value is useful for example to translate an object, because a 3x3 matrix is not able to translate an object without changing its orientation. The fourth coordinate, w, can be thought of as carrying the perspective information necessary to represent a perspective transformation.

Clip coordinates are also referred to as normalized device coordinates (NDC). These coordinates are now mapped to the screen by transforming into screen space via the so-called Viewport Mapping. To map these NDCs to screen space, the following formula is used for a resolution of 1024x768:

ScreenX(max) = NDC(X) * 512 + 512
ScreenY(max) = NDC(Y) * 384 + 384

The minimum and maximum value of X and Y is (-1, -1) to (1, 1). When the NDC's are (-1, -1), the screen coordinates are

ScreenX = -1 * 512 + 512
ScreenY = -1 * 384 + 384

This point lies in the upper-left corner. For example the lower-right corner will be reached with NDCs of (1, 1). Although z and w values are retained for depth buffering tests, screen space is essentially a 2D coordinate system, so only x and y values need to be mapped to screen resolution.

Now comes the Triangle Setup, where the life of the vertices end and the life of the pixel begins. It computes triangle for triangle the parameters required for the rasterization of the triangles, amongst other things it defines the pixel-coordinates of the triangle outline. This means that it defines the first and the last pixel of the triangle scan line by scan line:

Figure 2 - Triangle Setup

Then the Rasterizer interpolates color and depth values for each pixel from the color and depth values of the vertices. These values are interpolated using a weighted average of the color and depth values of the edge's vertex values, where the color and depth data of edge pixels closer to a given vertex more closely approximate values for that vertex. Then the rasterizer fills in pixels for each line.

Figure 3 - Filled Triangle

In addition, the texture coordinates are interpolated for use during the Multitexturing/Pixel Shader stage in a similar way.

The rasterizer is also responsible for Direct3D Multisampling. Because it is done in the Rasterizer stage, Multisampling only affects triangles and group of triangles; not lines. It increases the resolution of polygon edges and therefore as well as depth and stencil tests. The RADEON 7x00/GeForce 3 supports this kind of spatial anti-aliasing by setting the D3DRASTERCAPS_STRECHBLTMULTISAMPLE flag, but both ignore the D3DRS_MULTISAMPLEMASK render state to control rendering into the sub-pixel samples, whereas the RADEON 8500 is able to mask the sub-samples with different bit-pattern by using this render state. By affecting only specific sub-samples, effects like motion blur or depth of field and others can be realized.

Alternatively, motion blur and depth of field effects are possible with the help of vertex and pixel shaders; see the NVIDIA examples in the Effects Browser.
GeForce 3/4TI, RADEON 7x00 and RADEON 8500 chips have another stage in between the triangle setup and the pixel shader. Both chips has a visibility subsystem that is responsible for z occlusion culling. In case of the ATI cards it is called Hierarchical-Z. It determines if a pixel will be hidden behind an earlier rendered pixel or not. If the z occlusion culling should determine that the pixel is hidden it will directly be discarded and it won't enter the pixel shader, thus saving the initial z-buffer read of the pixel-rendering pipeline. During scan line conversion, the screen is divided into 8x8 pixel blocks (RADEON 7x00) or 4x4 pixel blocks (RADEON 8500), and each block is assigned a bit that determines whether that block is "not cleared", that means visible or "cleared", that means occluded. A quick z-check is performed on each of these blocks to determine whether any pixels in them are visible. If a pixel in a block is visible, the status bit is flipped to visible and the new z-value for that pixel is written to the z-buffer. If the block is not visible, its status bit remains set to occluded and the z-buffer isn't touched. The whole block will not be send to the pixel shader.
Sorting objects or object groups from front to back when rendering helps z-culling to kick out pixels from the gaphics pipeline very early, saving not only a write into the z-buffer (as with texkill), but also a read of the z-buffer.

The Pixel Shader is not involved on the sub-pixel level. It gets the already multisampled pixels along with z, color values and texture information. The already Gouraud shaded or flat shaded pixel might be combined in the pixel shader with the specular color and the texture values fetched from the texture map. For this task the pixel shader provides instructions that affect the texture addressing and instructions to combine the texture values in different ways with each other.

There are five different pixel shader standards supported by Direct3D 8.1. Currently, no pixel shader capable graphics hardware is restricted on the support of ps.1.0. Therefore all available hardware supports at least ps.1.1. So I will not mentioned the legacy ps.1.0 in this text anymore. ps.1.1 is the pixel shader version that is supported by GeForce 3. GeForce 4TI supports additionally ps.1.2 and ps.1.3. The RADEON 8x00 supports all of these pixel shader versions plus ps.1.4.

Whereas the ps.1.1, ps.1.2 and ps.1.3 are, from a syntactical point of view, build on each other, ps.1.4 is a new and more flexible approach. ps.1.1 - ps.1.3 differentiate between texture address operation instructions and texture blending instructions. As the name implies, the first kind of instructions is specialized and only usable for address calculations and the second kind of instructions is specialized and only usable for color shading operations. ps.1.4 simplifies the pixel shader language by allowing the texture blending (color shader) instruction set to be used for texture address (address shader) operations as well. It differentiate between arithmetic instructions, that modify color data and texture address instructions, that process texture coordinate data and in most cases sample a texture.

Sampling means looking up a color value in a texture at the specified up to four coordinates (u, v, w, q) while taking into account the texture stage state attributes.

One might say that the usage of the instructions in ps.1.4 happens in a more RISC-like manner, whereas the ps.1.1 - ps.1.3 instruction sets are only useable in a CISC-like manner.

This RISC-like approach will be used in ps.2.0, that will appear in DirectX 9. Syntactically ps.1.4 is compared to ps.1.1-ps.1.3 an evolutionary step into the direction of ps.2.0.

What are the other benefits of using ps.1.4?

Unlimited texture address instructions: Direct3D 8.0 with ps.1.1 is restricted to a small number of meaningful ways in which textures can be addressed. Direct3D 8.1 with ps.1.4 allows textures to be addressed and sampled in a virtually unlimited number of ways, because the arithmetic operation instructions are also usable as texture address operation instructions.
Flexible dependent texture reads.
RADEON 8x00 performs ps.1.4 faster than ps.1.1 - ps.1.3, because the latter is emulated.
All of the upcoming graphics hardware that will support ps.2.0, will support ps.1.4 for backward compatibility.

The next stage in the Direct3D pipeline is Fog. A fog factor is computed and applied to the pixel using a blending operation to combine the fog amount (color) and the already shaded pixel color, depending on how far away an object is. The distance to an object is determined by its z- or w-value or by using a separate attenuation value that measure the distance between the camera and the object in a vertex shader. If fog is computed by-vertex, it is interpolated across each triangle using Gouraud shading. The "MFC Fog" example in the DirectX 8.1 SDK shows linear and exponential fog calculated per-vertex and per-pixel. A layered range-based fog is shown in the "Height Fog" example in the NVIDIA Effects Browser. The Volume Fog example in the DirectX 8.1 SDK shows volumetric fog produced with a vertex and a pixel shader and the alpha blending pipeline. As shown in these examples fog effects can be driven by the vertex and/or pixel shader.

The Alpha Test stage kicks out pixels with a specific alpha value, because these shouldn't be visible. This is for example one way to map decals with an alpha mapped texture. The alpha test is switched on with D3DRS_ALPHATESTENABLE. The incoming alpha value is compared with the reference alpha value, using the comparison function provided by the D3DRS_ALPHAFUNC render state. If the pixel passes the test, it will be processed by the subsequent pixel operation, otherwise it will be discarded.

Alpha test does not incur any extra overhead, even if all the pixels pass the test. By tuning appropriately the reference value of Alpha Test to reject the transparent or almost transparent pixels, one can improve the application performance significantly, if the application is memory bandwidth fill-bound. Basically, varying the reference value acts like a threshold setting up how many pixels are going to be evicted. The more pixels are being discarded, the faster the application will run.

There is a trick to drive the alpha test with a pixel shader, that will be shown later in the section on swizzling. The image-space outlining technique in [Card/Mitchell] uses alpha testing to reject pixels that are not outlines and consequently improve performance.

The Stencil Test masks the pixel in the render target with the contents of the stencil buffer. This is useful for dissolves, decaling, outlining or to build shadow volumes [Brennan]. A nice example for this is "RadeonShadowShader", that could be found on the ATI web-site.

The Depth Test determines, whether a pixel is visible by comparing its depth value to the stored depth value. An application can set up its z-buffer z-min and z-max, with positive z going away from the view camera in a left-handed coordinate system. The Depth test is a pixel-by-pixel logical test that asks "Is this pixel in back of another pixel at this location?". If the answer returned is yes, that pixel gets discarded, if the answer is no, that pixel will travel further through the pipeline and the z-buffer is updated. There is also a more-precise and bandwidth saving depth buffer form called w-buffer. It saves bandwidth by having only to send x/y/w coordinates over the bus, while the z-buffer in certain circumstances has to send all that plus z.

It is the only depth buffering method used by the Savage 2000-based boards. These cards emulate the z-buffer with the help of the hardware w-buffer.
When using vertex shaders with w-buffering on a GeForce 3/4TI card, make sure the projection matrix is set in the traditional way (using SetTransform()), otherwise w-buffering won’t work correctly (Read more in the GeForce 3 FAQ, which can be found at www.nvidia.com/developer).

The pixel shader is able to "drive" the depth test with the texdepth (ps.1.4 only) or the texm3x2depth (ps.1.3 only) instructions. These instructions can calculate the depth value used in the depth buffer comparison test on a per-pixel basis.

The Alpha Blending stage blends the pixel's data with the pixel data already in the Render Target. The blending happens with the following formula:

FinalColor = SourcePixelColor * SourceBlendFactor +
             DestPixelColor * DestBlendFactor

There are different flags, that can be set with the D3DRS_SRCBLEND (SourceBlendFactor) and D3DRS_DESTBLEND (DestBlendFactor) parameters in a SetRenderState() call. Alpha Blending was formerly used to blend different texture values together, but nowadays it is more efficient to do that with the multitexturing or pixel shader unit, depending on hardware support. Alpha Blending is used on newer hardware to simulate different levels of transparency.

Dithering tries to fool the eye into seeing more colors than are actually present by placing different colored pixels next to one another to create a composite color that the eye will see. For example using a blue next to a yellow pixel would lead to a green appearance. That was a common technique in the days of the 8-bit and 4-bit color systems. You switch on the dithering stage globally with D3DRS_DITHERENABLE.

The Render Target is normally the backbuffer in which you render the finished scene, but it could be also the surface of a texture. This is useful for the creation of procedural textures or for the re-usage of results of former pixel shader executions. The render target is read by the stencil- , depth test and the alpha blending stage.

To summarize the tasks of the pixel shader in the Direct3D pipeline:

The pixel shader is fed with rasterized pixels, which means in most cases z-occlussion culled, multisampled and already gouraud-shaded pixels
The pixel shader receives an interpolated texture coordinate, interpolated diffuse and specular colors
There are mainly four flavours of pixel shaders, the ps.1.1 - ps.1.3 standards are from a syntactical point of view a more CISC-like approach, whereas the ps.1.4 standard uses a RISC-like approach, that will be used in the upcoming ps.2.0
The pixel shader delivers the shaded pixel to the fog stage
The pixel shader is able to support or drive the consecutive Direct3D stages

Before getting our feets wet with a first look at the pixel shader architecture, let's take a look at the currently available tools:

Pixel Shader Tools

I already introduced Shader Studio, Shader City, DLL Detective, 3D Studio Max 4.x/gmax, NVASM, the Effectsbrowser, the Shader Debugger and the Photoshop plug-ins from NVIDIA in the first part. There is one thing to remember specific for pixel shaders: because the GeForce 4TI supports ps.1.1 - ps.1.3, it is possible, that a few of the NVIDIA tools won't support ps.1.4. Additonally there are the following pixel shader-only tools:

Microsoft Pixel Shader Assembler

The pixel shader assembler is provided with the DirectX 8.1 SDK. Like with it's pendant the vertex shader assembler it does not come with any documentation. Its output look like this:

Figure 4 - Pixel Shader Assembler Output

The pixel shader assembler is used by the Direct3DX functions that compile pixel shaders and can also be used to pre-compile pixel shaders.

MFC Pixel Shader

The MFC Pixel Shader example provided with the DirectX 8.1 SDK comes with source. It is very useful for trying out pixel shader effects in a minute and debugging them. Just type in the pixel shader syntax you want to test and it will compile it at once. Debugging information is provided in the window at the bottom. If your graphics card doesn't support a particular pixel shader version, you can always choose the reference rasterizer and test all desired pixel shader versions. In the following picture the reference rasterizer was chosen on a GeForce 3 to simulate ps.1.3:

Figure 5 - MFCPixelShader

ATI ShadeLab

The ATI Shadelab helps designing pixel shaders. After writing the pixel shader source into the big entry field in the middle, the compilation process starts immediately. To be able to load the pixel shader later, it has to be saved with the <Save> button and loaded with the <Load> button.

Figure 6 - ATI ShadeLab

You may set up to six textures with specific texture coordinates and the eight constant registers. The main advantage over the MFC Pixel Shader tool is the possibility to load constant registers and the textures on your own. This tool is provided on the Book DVD in the directory <Tools>.

With that overview on the available tools in mind, we can go one step further by examining a diagram with the pixel shader workflow.

Pixel Shader Architecture

The following diagram shows the logical pixel shader data workflow. All the grey fields mark functionality specific for ps.1.1 - ps.1.3. The blue field marks functionality that is specific to ps.1.4.

Figure 7 - Pixel Shader logical Data Flow

On the right half of the diagram the pixel shader arithmetic logic unit (ALU) is surrounded by four kinds of registers. The Color Registers stream iterated vertex color data from a vertex shader or a fixed-function vertex pipeline to a pixel shader. The Constant Registers provide constants to the shader, that are loaded by using the SetPixelShaderConstant() function or in the pixel shader with the def instruction. The Temporary Registers rn are able to store temporay data. The r0 register also serves as the Output register of the pixel shader.

The Texture Coordinates can be supplied as part of the vertex format or can be read from certain kind of texture maps. Texture coordinates are full precision and range as well as perspective correct, when used in a pixel shader. There are D3DTSS_* Texture Operations that are not replaced by the pixel shader functionality, they can be used on the up to four (ps.1.1 - ps.1.3) or six textures (ps.1.4). The Texture Stages are holding a reference to the texture data that might be a one-dimensional (for example in a cartoon shader), two-dimensional or three-dimensional texture (volume textures or cube map). Each value in a texture is called a texel. These texels are most commonly used to store color values, but they can contain any kind of data desired including normal vectors for bump maps, shadow values, or general look-up tables.

Sampling occurs, when a texture coordinate is used to address the texel data at a particular location with the help of the Texture Registers. The usage of the texture registers tn differ between the ps.1.1 - ps.1.3 (t0 - t3) and the ps.1.4 (t0 - t5) implementations.

In case of ps.1.1 - ps.1.3 the association between the texture coordinate set and the texture data is a one-to-one mapping, which is not changeable in the pixel shader. Instead this association can be changed by using the oTn registers in a vertex shader or by using the texture stage state flag TSS_TEXCOORDINDEX together with SetTextureStageState(), in case the fixed function pipeline is used.

In ps.1.4, the texture data and the texture coordinate set can be used independent of each other in a pixel shader. The texture stage from which to sample the texture data is determined by the register number rn and the texture coordinate set, that should be used for the texture data is determined by the number of the tn register specified.

Let's take a closer look at the different registers shown in the upper diagram:

Type	Name	ps.1.1	ps.1.2	ps.1.3	ps.1.4	Read/Write Caps
Constant Registers	c_n	8	8	8	8	RO
Texture Registers	t_n	4	4	4	6	RW / ps.1.4: RO
Temporary Registers	r_n	2	2	2	6	RW
Color Registers	v_n	2	2	2	2 in Phase 2	RO

Constant Registers (c0 - c7)

There are eight constant registers in every pixel shader specification. Every constant register contains four floating point values or channels. They are read-only from the perspective of the pixel shader, so they could be used as a source register, but never as destination registers in the pixel shader. The application can write and read constant registers with calls to SetPixelShaderContant() and GetPixelShaderConstant(). A def instruction used in the pixel shader to load a constant register, is effectively translated into a SetPixelShaderConstant() call by executing SetPixelShader().

The range of the constant registers goes from -1 to +1. If you pass anything outside of this range, it just gets clamped. Constant registers are not usable by ps.1.1 - ps.1.3 texture address instructions except for the texm3x3spec, which uses a constant register to get an eye-ray vector.

Output and Temporary Registers (ps.1.1 - ps.1.3: r0 + r1; ps.1.4: r0 - r5)

The temporary registers r0 - rn are used to store intermediate results. The output register r0 is the destination argument for the pixel shader instruction. So r0 is able to serve as a temporary and output register. In ps.1.4 r0 - r5 are also used to sample texture data from the texture stages 0 - 5 in conjunction with the texture registers. In ps.1.1 - ps.1.3, the temporary registers are not usable by texture address instructions.

CreatePixelShader() will fail in shader pre-processing if a shader attempts to read from a temporary register that has not been written to by a previous instruction. All shaders have to write to r0.rgba the final result or the shader will not assemble or validate.

Texture Registers (ps.1.1 - ps.1.3: t0 - t3; ps.1.4: t0 - t5)

The texture registers are used in different ways in ps.1.1 - ps.1.3 and in ps.1.4. In ps.1.1 - ps.1.3 the usage of one of the t0 - t3 texture registers determine the usage of a specific pair of texture data and texture coordinates. You can't change this one-to-one mapping in the pixel shader:

ps.1.1     // version instruction
tex t0     // samples the texture at stage 0
           // using texture coordinates from stage 0
mov r0, t0 // copies the color in t0 to output register r0

tex samples the texture data from the texture stage 0 and uses the texture coordinates set, that is set in the vertex shader with the oTn registers. In ps.1.4, having texture coordinates in their own registers means that the texture coordinate set and the texture data are independant from each other. The texture stage number with the texture data from which to sample is determined by the destination register number (r0 - r5) and the texture coordinate set is determined by the source register (t0 - t5) specified in phase 1.

ps.1.4          // version instruction
texld r4, t5
mov r0, r4

The texld instruction samples the map set via SetTexture (4, lpTexture) using the sixth set of texture coordinates (set in the vertex shader with oT5) and puts the result into the fifth temporary register r4.

Texture registers that doesn't hold any values will be sampled to opaque black (0.0, 0.0, 0.0, 1.0). They can be used as temporary registers in ps.1.1 - ps.1.3. The texture coordinate registers in ps.1.4 are read-only and therefore not usable as temporary registers.

The maximum number of textures is the same as the maximum number of simultaneous textures supported (MaxSimultaneousTextures flag in D3D8CAPS).

Color Registers (ps.1.1 - ps.1.4: v0 + v1)

The color registers can contain per-vertex color values in the range 0 through 1 (saturated). It is common to load v0 with the vertex diffuse color and v1 with the specular color.

Using a constant color (flat shading) is more efficient than using an per-pixel Gouraud shaded vertex color. If the shade mode is set to D3DSHADE_FLAT, the application iteration of both vertex colors (diffuse and specular) is disabled. But regardless of the shade mode, fog will still be iterated later in the pipeline.

Pixel shaders have read-only access to color registers. In ps.1.4 color registers are only available during the second phase, which is the default phase. All of the other registers are available in every of the two phases of ps.1.4.

Range

One reason for using pixel shaders is compared to the multitexturing unit, its higher precision that is used by the pixel shader arithmetic logic unit.

Register Name	Range	Versions
cn	-1 to +1	all versions
rn	-D3DCAPS8.MaxPixelShaderValue to D3DCAPS8.MaxPixelShaderValue	all versions
tn	-D3DCAPS8.MaxPixelShaderValue to D3DCAPS8.MaxPixelShaderValue	ps.1.1 - ps.1.3
tn	-D3DCAPS8.MaxTextureRepeat to D3DCAPS8.MaxTextureRepeat	ps.1.4
vn	0 to +1	all versions

The color register vn are 8bit precision per channel, ie 8bit red, 8bit green etc.. For ps.1.1 to ps.1.3, D3DCAPS8.MaxPixelShaderValue is a minimum of one, whereas in ps.1.4 D3DCAPS8.MaxPixelShaderValue is a minimum of eight. The texture coordinate registers provided by ps.1.4 use high precision signed interpolators. The DirectX caps viewer reports a MaxTextureRepeat value of 2048 for the RADEON 8500. This value will be clamped to MaxPixelShaderValue, when used with texcrd, because of the usage of a rn register as the destination register. In this case it is safest to stick with source coordinates within the MaxPixelShaderValue range. However, if tn registers are used for straight texture lookups (i.e. texld r0, t3), then the MaxTextureRepeat range should be expected to be honored by hardware.

Using textures to store color values leads to a much higher color precision with ps.1.4.

High Level View on Pixel Shader Programming

Pixel Shading takes place on a per-pixel, per-object basis during a rendering pass.

Let's start by focusing on the steps required to build a pixel shader-driven application. The following list ordered in the sequence of execution shows the necessary steps to build up a pixel shader driven application:

Check for Pixel Shader Support
Set Texture Operation Flags (with D3DTSS_* flags)
Set Texture (with SetTexture())
Define Constants (with SetPixelShaderConstant()/def)
Pixel Shader Instructions
- Texture Address Instructions
- Arithmetic Instructions
Assemble Pixel Shader
Create Pixel Shader
Set Pixel Shader
Free Pixel Shader Resources

The following text will work through this list step-by-step:

Check for Pixel Shader Support

It is important to check for the proper pixel shader support, because there is no feasible way to emulate pixel shaders. So in case there is no pixel shader support or the required pixel shader version is not supported, there have to be fallback methods to a default behaviour (ie the multitexturing unit or ps.1.1). The following statement checks the supported pixel shader version:

if( pCaps->PixelShaderVersion < D3DPS_VERSION(1,1) )
  return E_FAIL;

This example checks the support of the pixel shader version 1.1. The support of at least ps.1.4 in hardware might be checked with D3DPS_VERSION(1,4). The D3DCAPS structure has to be filled in the startup phase of the application with a call to GetDeviceCaps(). In case the Common Files Framework which is provided with the DirectX 8.1 SDK is used, this is done by the framework. If you graphics card does not support the requested pixel shader version and there is no fallback mechanism that switches to the multitexturing unit, the reference rasterizer will jump in. This is the default behaviour of the Common Files Framework, but it is not useful in a game, because the REF is too slow.

Set Texture Operation Flags (D3DTSS_* flags)

The pixel shader functionality replaces the D3DTSS_COLOROP and D3DTSS_ALPHAOP operations and their associated arguments and modifiers that were used with the fixed-function pipeline. For example the following four SetTextureStageState() calls could be handled now by the pixel shader:

m_pd3dDevice->SetTextureStageState( 0, D3DTSS_COLORARG1, D3DTA_TEXTURE );
m_pd3dDevice->SetTextureStageState( 0, D3DTSS_COLORARG2, D3DTA_DIFFUSE); 
m_pd3dDevice->SetTextureStageState( 0, D3DTSS_COLOROP, D3DTOP_MODULATE); 
m_pd3dDevice->SetTexture( 0, m_pWallTexture);

But the following texture stage states are still observed.

D3DTSS_ADDRESSU 
D3DTSS_ADDRESSV
D3DTSS_ADDRESSW
D3DTSS_BUMPENVMAT00
D3DTSS_BUMPENVMAT01
D3DTSS_BUMPENVMAT10
D3DTSS_BUMPENVMAT11
D3DTSS_BORDERCOLOR
D3DTSS_MAGFILTER
D3DTSS_MINFILTER
D3DTSS_MIPFILTER
D3DTSS_MIPMAPLODBIAS
D3DTSS_MAXMIPLEVEL
D3DTSS_MAXANISOTROPY
D3DTSS_BUMPENVLSCALE
D3DTSS_BUMPENVLOFFSET
D3DTSS_TEXCOORDINDEX
D3DTSS_TEXTURETRANSFORMFLAGS

The D3DTSS_BUMP* states are used with the bem, texbem and texbeml instructions.

In ps.1.1 - ps.1.3 all D3DTSS_TEXTURETRANSFORMFLAGS flags are available and have to be properly set for a projective divide, whereas in ps.1.4 the texture transform flag D3DTTFF_PROJECTED is ignored. It is accomplished by using source register modifiers with the texld and texcrd registers.

The D3DTSS_TEXCOORDINDEX flag is valid only for fixed-function vertex processing. when rendering with vertex shaders, each stages's texture coordinate index must be set to its default value. The default index for each stage is equal to the stage index.

ps.1.4 gives you the ability to change the association of the texture coordinates and the textures in the pixel shader.

The texture wrapping, filtering, color border and mip mapping flags are fully functional in conjunction with pixel shaders.

A change of these texture stage states doesn't require the regeneration of the currently bound shader, because they are not available to shader compile time and the driver can therefore make no assumption about them.

Set Texture (with SetTexture()

After checking the pixel shader support and setting the proper texture operation flags, all textures have to be set by SetTexture(), as with the DX6/7 multitexturing unit. The prototype of this call is:

HRESULT SetTexture(DWORD Stage, IDirect3DBaseTexture8* pTexture);

The texture stage that should be used by the texture is provided in the first parameter and the pointer to the texture interface is provided in the second parameter. A typical call might look like:

m_pd3dDevice->SetTexture( 0, m_pWallTexture);

This call sets the already loaded and created wall texture to texture stage 0.

Define Constants (with SetPixelShaderConstant() / def)

The constant registers can be filled with SetPixelShaderConstant() or the def instruction in the pixel shader. Similar to the SetVertexShaderConstant() call, the prototype of the pixel shader equivalent looks like this

HRESULT SetPixelShaderConstant(
    DWORD Register,
    CONST void* pConstantData,
    DWORD ConstantCount
);

First the constant register must be specified in Register. The data to transfer into the constant register is provided in the second argument as a pointer. The number of constant registers, that have to be filled is provided in the last parameter. For example to fill c0 - c4, you might provide c0 as the Register and 4 as the ConstantCount.

The def instruction is an alternative to SetPixelShaderConstant(). When SetPixelShader() is called, it is effectively translated into a SetPixelShaderConstant() call. Using the def instruction makes the pixel shader easier to read. A def instruction in the pixel shader might look like this:

def c0, 0.30, 0.59, 0.11, 1.0

Each value of the constant source registers has to be in the range [-1.0..1.0].

Pixel Shader Instructions

Using vertex shaders the programmer is free to choose the order of the used instructions in any way that makes sense, whereas pixel shaders need a specific arrangement of the used instructions. This specific instruction flow differs between ps.1.1 - ps.1.3 and ps.1.4.

ps.1.1 - ps.1.3 allow four types of instructions, that must appear in the order shown below:

Figure 8 - ps.1.1 - ps.1.3 Pixel Shader Instruction Flow
(Specular Lighting with Lookup table)

This example shows a per-pixel specular lighting model, that evaluates the specular power with a lookup table. Every pixel shader starts with the version instruction. It is used by the assembler to validate the instructions which follow. Below the version instruction a constant definition could be placed with def. Such a def instruction is translated into a SetPixelShaderConstant() call, when SetPixelShader() is executed.

The next group of instructions are the so-called texture address instructions. They are used to load data from the tn registers and additionally in ps.1.1 - ps.1.3 to modify texture coordinates. Up to four texture address instructions could be used in a ps.1.1 - ps.1.3 pixel shader.

In this example the tex instruction is used to sample the normal map, that holds the normal data. texm* instructions are always used at least as pairs:

texm3x2pad t1, t0_bx2
texm3x2tex t2, t0_bx2

Both instructions calculate the proper u/v texture coordinate pair with the help of the normal map in t0 and sample the texture data from t2 with it. This texture register holds the light map data with the specular power values. The last texture addressing instruction samples the color map into t3.

The next type of instructions are the arithmetic instructions. There could be up to eight arithmetic instructions in a pixel shader.

mad adds t2 and c0, the ambient light, and multiplies the result with t3 and stores it into the output register r0.

Instructions in a ps.1.4 pixel shader must appear in the order shown below:

Figure 9 - ps.1.4 Pixel Shader Instruction Flow
(Simple transfer function for sepia or heat signature effects)

This is a simple transfer function, which could be useful for sepia or heat signature effects. It is explained in detail in [Mitchell]. The ps.1.4 pixel shader instruction flow has to start with the version instruction ps.1.4. After that, as much def instructions as needed might be placed into the pixel shader code. This example sets a Luminance constant value with one def.

There could be up to six texture addressing instructions used after the constants. The texld instruction loads a texture from texture stage 0, with the help of the texture coordinate pair 0, which is chosen by using t0. In the following up to eight arithmetic instructions, color, texture or vector data might be modified. This shader uses only the arithmetic instruction to convert the texture map values to luminance values.

So far a ps.1.4 pixel shader has the same instruction flow like a ps.1.1 - ps.1.3 pixel shader, but the phase instruction allows it to double the number of texture addressing and arithmetic instructions. It divides the pixel shader in two phases: phase 1 and phase 2. That means as of ps.1.4 a second pass through the pixel shader hardware can be done.

Another way to re-use the result of a former pixel shader pass is to render into a texture and use this texture in the next pixel shader pass. This is accomplished by rendering into a seperate render target.

The additional six texture addressing instruction slots after the phase instruction are only used by the texld r5, r0 instruction. This instruction uses the color in r0, which was converted to Luminance values before as a texture coordinate to sample a 1D texture (sepia or heat signature map), which is referenced by r5. The result is moved with a mov instruction into the output register r0.

Adding the number of arithmetic and addressing instructions shown in the pixel shader instruction flow above, leads to 28 instructions. If no phase marker is specified, the default phase 2 allows up to 14 addressing and arithmetic instructions.

Both of the preceding examples show the ability to use dependant reads. A dependant read is a read from a texture map using a texture coordinate which was calculated earlier in the pixel shader. More details on dependant reads will be presented in the next section.

Texture Address Instructions

Texture address instructions are used on texture coordinates. The texture coordinate address is used to sample data from a texture map. Controlling the u/v pair, u/v/w triplet or a u/v/w/q quadruplet of texture coordinates with address operations, gives the ability to choose different areas of a texture map. Texture coordinate "data storage space" can also be used for other reasons than sampling texture data. The registers that reference texture coordinate data, are useful to "transport" any kind of data from the vertex shader to the pixel shader via the oTn registers of a vertex shader. For example the light or half-angle vector or a 3x2, 3x3 or a 4x4 matrix might be provided to the pixel shader this way.

ps.1.1 - ps.1.3 texture addressing

The following diagram shows the ways that texture address instructions work in ps.1.1 - ps.1.3 for texture addressing:

Figure 10 - Texture Addressing in ps.1.1 - ps.1.3

All of the texture addressing happens "encapsulated" in the texture address instructions masked with a grey field. That means results of texture coordinate calculations are not accessible in the pixel shader. The texture address instruction uses these results internally to sample the texture. The only way to get access to texture coordinates in the pixel shader is the texcoord instruction. This instruction converts texture coordinate data to color values, so that they can be manipulated by texture addressing or arithmetic instructions. These color values could be used as texture coordinates to sample a texture with the help of the texreg2* instructions.

The following instructions are texture address instructions in ps.1.1 - ps.1.3. The d and s in the column named Para are the destination and source parameters of the instruction. The usage of texture coordinates is shown by two brackets around the texture register, for example (t0).

Instruction	1.1	1.2	1.3	Para	Action
tex	x	x	x	d	Loads tn with color data (RGBA) sampled from the texture
texbem	x	x	x	d, s	Transforms red and green components as du, dv signed values of the source register using a 2-D bump mapping matrix, to modify the texture address of the destination register. Can be used for a variety of techniques based on address perturbation such as fake per-pixel environment mapping, diffuse lighting (bump mapping), environment matting etc.. There is a difference in the usage of the texture stages between environment mapping with a pixel shader (means texbem, texbeml or bem) and environment mapping with the DX 6/7 multitexturing unit. texbem (texbeml or bem) needs the matrix data connected to the texture stage that is sampled. This is the environment map. Environment mapping with the DX 6/7 multitexturing unit needs the matrix data connected to the texture stage used by the bump map (see also the example program Earth Bump on the DVD): ------------------ // texture stages for environment mapping // with a pixel shader: SetRenderState(D3DRS_WRAP0,D3DWRAP_U\|D3DWRAP_V); // color map SetTexture( 0, m_pEarthTexture ); SetTextureStageState(0, D3DTSS_TEXCOORDINDEX, 1); // bump map SetTexture(1, m_psBumpMap); SetTextureStageState(1, D3DTSS_TEXCOORDINDEX, 1); // enviroment map SetTexture(2, m_pEnvMapTexture); SetTextureStageState(2, D3DTSS_TEXCOORDINDEX, 0); SetTextureStageState(2, D3DTSS_BUMPENVMAT00, F2DW(0.5f)); SetTextureStageState(2, D3DTSS_BUMPENVMAT01, F2DW(0.0f)); SetTextureStageState(2, D3DTSS_BUMPENVMAT10, F2DW(0.0f)); SetTextureStageState(2, D3DTSS_BUMPENVMAT11, F2DW(0.5f)); texbem performs the following operations: u += 2x2 matrix(du) v += 2x2 matrix(dv) Then sample (u, v) Read more in the section "Bump Mapping Formulas" in the DirectX 8.1 SDK documentation. Some rules for texbem/texbeml: The s register can not be read by any arithmetic instruction, until it is overwritten again by another instruction: ... texbem/l t1, t0 mov r0, t0 ; not legal ... texbem/l t1, t0 mov t0, c0 mov r0, t0 ; legal ... The s register of texbem/l can not be read by any texture instruction except for the texbem/l instruction: ... texbem/l t1, t0 texreg2ar t2, t0 ; not legal ... texbem/l t1, t0 texbem/l t2, t0 ; legal ... --------------------------------- ; t2 environment map ; (t2) texture coordinates environment map ; t1 du/dv perturbation data ps.1.1 tex t0 ; color map tex t1 ; bump map texbem t2, t1 ; Perturb and then sample the ; environment map. add r0, t2, t0 See the bem instruction for the ps.1.4 equivalent. See also the chapter on particle flow from Daniel Weiskopf and Matthias Hopf [Weiskopf] for an interesting use of texbem.
texbeml	x	x	x	d, s	Same as above, applys additionally luminance. Now three components of the source register are used red, green and blue as du, dv and l for luminance. u += 2x2 matrix(du) v += 2x2 matrix(dv) Then sample (u, v) & apply Luminance. See the texbem/l rules in the texbem section. -------------------------------- ; t1 holds the color map ; bump matrix set with the ; D3DTSS_BUMPENVMAT* flags ps.1.1 tex t0 ; bump map with du, dv, l data texbeml t1, t0 ; compute u, v ; sample t1 using u, v ; apply luminance correction mov r0, t1 ; output result
texcoord	x	x	x	d	Clamps the texture coordinate to the range [0..1.0] and outputs it as color. If the texture coordinate set contains fewer than three components, the missing components are set to 0. The fourth component is always 1. Provides a way to pass vertex data interpolated at high precision directly into the pixel shader. -------------------------------- ps.1.1 texcoord t0 ; convert texture coordinates ; to color mov r0, t0 ; move color into output ; register r0
texdp3		x	x	d, s	Performs a three-component dot product between the texture coordinate set corresponding to the d register number and the texture data in s and replicate clamped values to all four color channels of d. -------------------------------- ; t0 holds color map ; (t1) hold texture coordinates ps.1.2 tex t0 ; color map texdp3 t1, t0 ; t1 = (t1) dot t0 mov r0, t1 ; output result
texdp3tex		x	x	d, s	Performs a three-component dot product between the texture coordinate set corresponding to the d register number and the texture data in s. Uses the result to do a 1D texture lookup in d and places the result of the lookup into d. A common application is to lookup into a function table stored in a 1-D texture for procedural specular lighting terms. --------------------------------- ; t1 holds 1D color map ps.1.2 tex t0 ; vector data (x, y, z, w) texdp3tex t1, t0 ; u = (t1) dot t0 ; lookup data in t1 ; result in t1 mov r0, t1 ; output result
texkill	x	x	x	s	Cancels rendering of the current pixel if any of the first three components of the texture coordinates in s is less than zero. When using vertex shaders, the application is responsible for applying the perspective transform. If the arbitrary clipping planes contain anisomorphic scale factors, you have to apply the perspective transform to these clip planes as well. --------------------------------- ps.1.1 texkill t0 ; mask out pixel using ; uvw texture coordinates < 0.0 mov r0, v0
texm3x2depth			x	d, s	Calculates together with a texm3x2pad instruction the depth value to be used in depth testing for this pixel. Performs a three component dot product between the texture coordinate set corresponding to the d register number and the second row of a 3x2 matrix in the s register and stores the resulting w into d. After execution the d register is no longer available for use in the shader. The benefit of a higher resolution of the depth buffer resulting from multisampling is eliminated, because texm3x2depth (same with ps.1.4: texdepth) will output the single depth value to each of the sub-pixel depth comparison tests. Needs clamped to [0..1] w and z values or the result stored in the depth buffer will be undefined. --------------------------------- ; (t1) holds row #1 of the 3x2 matrix ; (t2) holds row #2 of the 3x2 matrix ; t0 holds normal map ps.1.3 tex t0 ; normal map texm3x2pad t1, t0 ; calculates z from row #1 ; calculates w from row #2 ; stores a result in t2 depending on ; if (w == 0) ; t2 = 1.0; ; else ; t2 = z/w; texm3x2depth t2, t0
texm3x2pad	x	x	x	d, s	This instruction cannot be used by itself. It must be combined with either texm3x2tex or texm3x2depth. It performs a three component dot product between the texture coordinate set corresponding to the d register number and the data of the s register and stores the result in d. See example shown for the texm3x2depth instruction or the example shown for the texm3x2tex instruction.
texm3x2tex	x	x	x	d, s	It calculates the second row of a 3x2 matrix by performing a three component dot product between the texture coordinate set corresponding to the d register number and the data of the s register to get a v value, which is used to sample a 2D texture. This instruction is used in conjunction with texm3x2pad, that calculates the u value. -------------------------------- ; Dot3 specular lighting with a lookup table ps.1.1 ; t0 holds normal map ; (t1) holds row #1 of the 3x2 matrix (light vector) ; (t2) holds row #2 of the 3x2 matrix (half vector) ; t2 holds a 2D texture (lookup table) ; t3 holds color map tex t0 ; sample normal texm3x2pad t1, t0_bx2 ; calculates u from first row texm3x2tex t2, t0_bx2 ; calculates v from second row ; samples texel with u,v ; from t2 (lookup table) tex t3 ; sample base color mul r0,t2,t3 ; blend terms ------- ; A ps.1.4 equivalent to the ; texm3x2pad/texm3x2tex pair could be ; specular power from a lookup table ps.1.4 ; r0 holds normal map ; t1 holds light vector ; t2 holds half vector ; r2 holds 2D texture (lookup table) ; r3 holds color map texld r0, t0 texcrd r1.rgb, t1 texcrd r4.rgb, t2 dp3 r1.r, r1, r0_bx2 ; calculates u dp3 r1.g, r4, r0_bx2 ; calculates v phase texld r3, t3 texld r2, r1 ; samples texel with u,v ; from r2 (lookup table) mul r0, r2, r3
texm3x3		x	x	d, s	Performs a 3x3 matrix multiply similar to texm3x3tex, except that it does not automatically perform a lookup into a texture. The returned result vector is placed in d with no dependent read. The .a value in d is set to 1.0. The 3x3 matrix is comprised of the texture coordinates of the third texture stage, and by the two preceding texture stages. Any texture assigned to any of the three texture stages is ignored. This instruction must be used with two texm3x3pad instructions. ----------------------------------- ; (t1) holds row #1 of the 3x3 matrix ; (t2) holds row #2 of the 3x3 matrix ; (t3) holds row #3 of the 3x3 matrix ps.1.2 tex t0 ; normal map texm3x3pad t1, t0 ; calculates u from row #1 texm3x3pad t2, t0 ; calculates v from row #2 texm3x3 t3, t0 ; calculates w from row #3 ; store u, v , w in t3 mov r0, t3 ; ps.1.4 equivalent ; r1 holds row #1 of the 3x3 matrix ; r2 holds row #2 of the 3x3 matrix ; r3 holds row #3 of the 3x3 matrix ps.1.4 def c0, 1.0, 1.0, 1.0, 1.0 texld r0, t0 ; r0 normal map texcrd r1.rgb, t1 ; calculates u from row #1 texcrd r2.rgb, t2 ; calculates v from row #2 texcrd r3.rgb, t3 ; calculates w from row #3 dp3 r4.r, r1, r0 dp3 r4.g, r2, r0 dp3 r4.b, r3, r0 ; store u, w, w in r4.rgb mov r0.a, c0 +mov r0.rgb, r4
texm3x3pad	x	x	x	d, s	Performs the first or second row of a 3x3 matrix multiply. The instruction can not be used by itself and must be used with texm3x3, texm3x3spec, texm3x3vspec or texm3x3tex.
texm3x3spec	x	x	x	d,s1, s2	Performs together with two texm3x3pad instructions a 3x3 matrix multiply. The resulting vector is used as a normal vector to reflect the eye-ray vector from a constant register in s2 and then uses the reflected vector as a texture address for a texture lookup in d. The 3x3 matrix is typically useful for orienting a normal vector of the correct tangent space for the surface being rendered. The 3x3 matrix is comprised of the texture coordinates of the third texture stage and the results in the two preceding texture stages. Any texture assigned to the two preceding texture stages is ignored. This can be used for specular reflection and environment mapping. --------------------------------------- ; (t1) holds row #1 of the 3x3 matrix ; (t2) holds row #2 of the 3x3 matrix ; (t3) holds row #3 of the 3x3 matrix ; t3 is assigned a cube or volume map with ; color data (RGBA) ; t0 holds a normal map ; c0 holds the eye-ray vector E ps.1.1 tex t0 texm3x3pad t1, t0 ; calculates u from row #1 texm3x3pad t2, t0 ; calculates v from row #2 ; calculates w from row #3 ; reflect u, v and w by the ; eye-ray vector in c0 ; use reflected vector to lookup texture in t3 texm3x3spec t3, t0, c0 mov r0, t3 ; output result ; A similar effect is possible with the following ; ps.1.4 pixel shader. ; The eye vector is stored as a normalized ; vector in a cube map ps.1.4 texld r0, t0 ; Look up normal map. texld r1, t4 ; Eye vector through normalizer cube map texcrd r4.rgb, t1 ; 1st row of environment matrix texcrd r2.rgb, t2 ; 2st row of environment matrix texcrd r3.rgb, t3 ; 3rd row of environment matrix dp3 r4.r, r4, r0_bx2 ; 1st row of matrix multiply dp3 r4.g, r2, r0_bx2 ; 2nd row of matrix multiply dp3 r4.b, r3, r0_bx2 ; 3rd row of matrix multiply dp3 r5, r4, r1_bx2 ; (N.Eye) mov r0, r5
texm3x3tex	x	x	x	d, s	Performs together with two texm3x3pad instructions a 3x3 matrix multiply and uses the result to lookup the texture in d. The 3x3 matrix is typically useful for orienting a normal vector to the correct tangent space for the surface being rendered. The 3x3 matrix is comprised of the texture coordinates of the third texture stage and the two preceding texture stages. The resulting u, v and w is used to sample the texture in stage 3. Any textures assigned to the preceding textures is ignored. ------------------------------------ ; (t1) holds row #1 of the 3x3 matrix ; (t2) holds row #2 of the 3x3 matrix ; (t3) holds row #3 of the 3x3 matrix ; t3 is assigned a cube or volume map with ; color data (RGBA) ps.1.1 tex t0 ; normal map texm3x3pad t1, t0 ; calculates u from row #1 texm3x3pad t2, t0 ; calculates v from row #2 ; calculates w from row #3 ; uses u, v and w to sample t3 texm3x3tex t3, t0 mov r0, t3 ; output result ; ps.1.4 equivalent ; r1 holds row #1 of the 3x3 matrix ; r2 holds row #2 of the 3x3 matrix ; r3 holds row #3 of the 3x3 matrix ; r3 is assigned a cube or volume map with ; color data (RGBA) ps.1.4 texld r0, t0 texcrd r1.rgb, t1 texcrd r2.rgb, t2 texcrd r3.rgb, t3 dp3 r4.r, r1, r0 ; calculates u from row #1 dp3 r4.g, r2, r0 ; calculates v from row #2 dp3 r4.b, r3, r0 ; calculates w from row #3 phase texld r3, r4 mov r0, r3
texm3x3vspec	x	x	x	d, s	Performs together with two texm3x3pad instructions a 3x3 matrix multiply. The resulting vector is used as a normal vector to reflect the eye-ray vector and then uses the reflected vector as a texture address for a texture lookup. It works just like texm3x3spec, except that the eye-vector is taken from the q coordinates of the three sets of 4D textures. The 3x3 matrix is typically useful for orienting a normal vector of the correct tangent space for the surface being rendered. The 3x3 matrix is comprised of the texture coordinates of the third texture stage and the results in the two preceding texture stages. Any texture assigned to the two preceding texture stages is ignored. This can be used for specular reflection and environment mapping, where the eye-vector is not constant. ----------------------------------------- ; (t1) holds row #1 of the 3x3 matrix ; (t2) holds row #2 of the 3x3 matrix ; (t3) holds row #3 of the 3x3 matrix ; t3 is assigned a cube or volume map with ; color data (RGBA) ; t0 holds a normal map ; used for Cubic bump mapping ; the per-vertex eye vector is derived using ; the camera position in the vertex shader ps.1.1 tex t0 texm3x3pad t1, t0 ; calculates u from row #1 texm3x3pad t2, t0 ; calculates v from row #2 ; calculates w from row #3 ; calculates eye-ray vector from the q texture ; coodinate values of t1 - t3 ; reflect u, v and w by the eye-ray vector ; use reflected vector to lookup texture in t3 texm3x3vspec t3, t0 mov r0, t3 ; output result
texreg2ar	x	x	x	d, s	General dependant texture read operation that takes the alpha and red color component of s as texture address data (u, v) consisting of unsigned values, to sample a texture at d. --------------------------------------- ps.1.1 tex t0 ; color map texreg2ar t1, t0 mov r0, t1
texreg2gb	x	x	x	d, s	General dependant texture read operation that takes the green and blue color component of s as texture address data (u, v) consisting of unsigned values, to sample a texture at d. --------------------------------------- ps.1.1 tex t0 ; color map texreg2gb t1, t0 mov r0, t1
texreg2rgb		x	x	d, s	General dependant texture read operation that takes the red, green and blue color component of s as texture address data (u, v, w) consisting of unsigned values, to sample a texture at d. This is useful for color-space remapping operations. --------------------------------------- ps.1.2 tex t0 ; color map texreg2rgb t1, t0 mov r0, t1

All of these texture address instructions use only the tn registers, with the exception of tex3x3spec, that uses a constant register for the eye-ray vector. In a ps.1.1 - ps.1.3 pixel shader, the destination register numbers for texture addressing instructions had to be in increasing order.

In ps.1.1 - ps.1.3, the ability to re-use a texture coordinate after modifying it in the pixel shader is available through specific texture address instructions, that are able to modify the texture coordinates and sample a texture with these afterwards. The following diagram shows this reliance:

Figure 11 - Dependant Read in ps.1.1 - ps.1.3

The texture address operations that sample a texture after modifying the texture coordinates are:

texbem/texbeml
texdp3tex
texm3x2tex
texm3x3spec
texm3x3tex
texm3x3vspec

The following instructions sample a texture with the help of color values as texture coordinates. If one of these color values are manipulated before, the sampling happens to be a dependant read.

texreg2ar
texreg2gb
texreg2rgb

Therefore these instructions are called general dependant texture read instructions.

As already stated above, each ps.1.1 - ps.1.3 pixel shader has a maximum of 8 arithmetic instructions and 4 texture address instructions. All texture address instructions uses one slot of the supplied slots, with the exception of texbeml, that uses one texture address slot plus one arithmetic slot.

ps.1.4 Texture Addressing

To use texels or texture coordinates in ps.1.4, you always have to load them first with texld or texcrd. These instructions are the only way to get access to texels or texture coordinates. Texture coordinates can be modified after a conversion to color data via texcrd, with all available arithmetic instructions. As a result, texture addressing is more straightforward with ps.1.4.

The following instructions are texture address instructions in ps.1.4:

Instruction	Para	Action
texcrd	d, s	Copies the texture coordinate set corresponding to s into d as color data (RGBA). No texture is sampled. Clamps the texture coordinates in tn with a range of [-MaxTextureRepeat, MaxTextureRepeat] (RADEON 8500: 2048) to the range of rn [-8, 8] (MaxPixelShaderValue). This clamp might behave differently on different hardware. To be safe, provide data in the range of [-8, 8]. A .rgb or .rg modifier should be provided to d. The fourth channel of d is unset/undefined in all cases. The third channel is unset/undefined for a projective divide with _dw.xyz (D3DTFF_PROJECTED is ignord in ps.1.4). The allowed syntax taking into account all valid source modifier/selector and destination write mask combinations, is shown below: texcrd rn.rgb, tn.xyz texcrd rn.rgb, tn texcrd rn.rgb, tn.xyw texcrd rn.rg, tn_dw.xyw
texdepth	d	Calculates the depth value used in the depth buffer comparison test for the pixel by using the r5 register. The r5 register is then unavailable for any further use in the pixel shader. texdepth updates the depth buffer with the value of r5.r and r5.g. The .r channel is treated as the z-value and the .g channel is treated as the w-value. The value in the depth buffer is replaced by the result of the .r channel divided by the .g channel == z/w. If the value in .g channel is zero then the depth buffer is updated with 1.0. texdepth is only available in phase 2. Using this instruction eleminates the benefit of a higher resolution of the depth buffer resulting from multisampling, because texdepth (same with texm3x2depth) will output the single depth value to each of the sub-pixel depth comparison tests. ---------------------------------------- ps.1.4 texld r0, t0 ; samples from texture stage 0 with ; texture coordinates set t0 texcrd r1.rgb, t1 ; load texture coordinates from ; t1 into r1 as color values add 5.rg, r0, r1 ; add both values phase texdepth r5 ; calculate pixel depth as r5.r/r5.g // do other color calculation here and output it to r0
texkill	s	Cancels rendering of the current pixel if any of the first three components of the texture data (ps.1.1 - ps.1.3: texture coordinates) in s is less than zero. You can use this instruction to implement arbitrary clipping planes in the rasterizer. When using vertex shaders, the application is responsible for applying the perspective transform. If the arbitrary clipping planes contain anisomorphic scale factors, you have to apply the perspective transform to the clip planes as well. texkill is only available in phase 2 and sources rn or tn registers. --------------------------------- ps.1.4 ... ; include other shader instructions here phase texkill t0 ; mask out pixel using ; uvw texture coordinates < 0.0 mov r0, v0 ; move diffuse into r0
texld	d, s	Loads d with the color data (RGBA) sampled using the texture coordinates from s. The texture stage number with the texture data from which to sample is determined by the number of d (r0 - r5) and the texture coordinate set is determined by the number of src (t0 - t5) in phase 1. texld is able to use additionally rn as s in phase 2. The allowed syntax is: texld rn, tn texld rn, tn.xyz ; same as previous texld rn, tn.xyw texld rn, tn_dw.xyw texld rn, rn texld rn, rn.xyz ; same as previous texld rn, rn_dz ; only valid on rn ; no more than two times per shader texld rn, rn_dz.xyz ; same as previous ---------------------------------- ; Simple transfer function for sepia or ; heat signature effects [Mitchell] ; c0 holds the luminance value ; t0 holds the texture coordinates ; r0 holds the original image ; r5 holds the 1D sepia or heat signature map ps.1.4 def c0, 0.30, 0.59, 0.11, 1.0 texld r0, t0 dp3 r0, r0, c0 phase texld r5, r0 ; dependent read mov r0, r5 ; ps.1.2 equivalent ; t0 holds the original image ; t1 holds the 1D sepia or heat signature map ; (t1) holds 0.30, 0.59, 0.11, 1.0 ps.1.2 tex t0 ; color map texdp3tex t1, t0 ; u = (t1) dot t0 ; lookup data in t1 ; result in t1 mov r0, t1 ; output result

In ps.1.4, there are only four texture address instructions but, as mentioned before, all the arithmetic instructions can be used to manipulate texture address information. So there are plenty of instruments to solve texture addressing tasks.

Valid source registers for first phase texture address instructions are tn. Valid source registers for second phase texture address instructions are tn and also rn. Each rn register may be specified as a destination to a texture instruction only once per phase. Aside from this, destination register numbers for texture instructions do not have to be in any particular order (as opposed to previous pixel shader versions in which destination register numbers for texture instructions had to be in increasing order).

No dependencies are allowed in a block of tex* instructions. The destination register of a texture address instruction cannot be used as a source in a subsequent texture address instruction in the same block of texture address instruction (same phase).

Dependent reads with ps.1.4 are not difficult to locate in the source. Pseudo code of the two possible dependent read scenarios in ps.1.4 might look like:

; transfer function
texld ; load first texture
modify color data here
phase
texld ; sample second texture with changed color data as address

texcrd ; load texture coordinates
modify texture coordinates here
phase<
texld ; sample texture with changed address

Another way to think of it is that if the second argument to a texld after the phase marker is rn (not tn) then it's a dependent read, because the texture coordinates are in a temp register so they must have been computed:

.....
phase
texld rn, rn

Set first three channels of a rn register, which is used as a source register, has to be set before it is used as a source parameter. Otherwise the shader will fail.

To manipulate texture coordinates with arithmetic instructions, they have to be loaded into texture data registers (ps.1.1 - ps.1.3: tn; ps.1.4: rn) via texcoord or texcrd. There is one important difference between these two instructions. texcoord clamps to [0..1] and texcrd does not clamp at all.

If you compare texture addressing used in ps.1.1 - ps.1.3 and texture addressing used in ps.1.4, it is obvious that the more CISC-like approach uses much more powerful instructions to address textures compared to the more RISC-like ps.1.4 approach. On the other hand, ps.1.4 offers a greater flexibility in implementing different texture addressing algorithms by using all of the arithmetic instructions compared to ps.1.1 - ps.1.3.

Arithmetic Instructions

The arithmetic instructions are used by ps.1.1 - ps.1.3 and ps.1.4 in a similar way, to manipulate texture or color data. Here is an overview of all available instructions in these implementations:

Instruction	Arguments	Registers				Version
Instruction	Arguments	v_n	c_n	t_n	r_n	Version
add	dest
				x	x	ps.1.1 - ps.1.3
					x	ps.1.4
	src0, src1	x	x	x	x	ps.1.1 - ps.1.3
			x		x	ps.1.4 phase 1
		x	x		x	ps.1.4 phase 2
Performs a component-wise addition of register src0 and src1: dest.r = src0.r + src1.r dest.g = src0.g + src1.g dest.b = src0.b + src1.b dest.a = src0.a + src1.a -------------------------------------- ; glow mapping ps.1.1 tex t0 ; color map tex t1 ; glow map add r0, t0, t1 ; add the color values ; increase brightness lead to a glow effect ; glow mapping ps.1.4 texld r0, t0 ; color map texld r1, t1 ; glow map add r0, r0, r1 ; add the color values ; increase brightness lead to a glow effect --------- ; detail mapping ps.1.1 tex t0 ; color map tex t1 ; detail map add r0, t0, t1_bias ; detail map is add-signed to the color map ; watch out for the used texture coordinates of ; the detail map ; detail mapping ps.1.4 texld r0, t0 ; color map texld r1, t0 ; sample detail map with the texture coords of the color map add r0, r0, r1_bias ; detail map is add-signed to the color map You may increase the detail map effect by using _bx2 as a modifier in the add instruction.

Instruction	Arguments	Registers				Version
Instruction	Arguments	v_n	c_n	t_n	r_n	Version
bem	dest
	dest				x	ps.1.4 phase 1
	src0
	src0		x		x	ps.1.4 phase 1
	src1				x	ps.1.4 phase 1
Apply a fake bump environment transform. There is a difference in the usage of the texture stages between environment mapping with a pixel shader (means texbem, texbeml or bem) and environment mapping with the DX 6/7 multitexturing unit. bem (texbeml or texbem) needs the matrix data connected to the texture stage that is sampled. This is the environment map. Environment mapping with the DX 6/7 multitexturing unit needs the matrix data connected to the texture stage used by the bump map (see the example code for the texbem instruction). bem has a lot of restrictions when used in a pixel shader: bem must appear in the first phase of a shader (that is, before a phase marker) bem consumes two arithmetic instruction slots Only one use of this instruction is allowed per shader Destination writemask must be .rg /.xy This instruction cannot be co-issued Aside from the restriction that destination write mask be .rg, modifiers on source src0, src1, and instruction modifiers are unconstrained bem performs the following calculation: (Given n == dest register #) dest.r = src0.r + D3DTSS_BUMENVMAT00(stage n) * src1.r + D3DTSS_BUMPENVMAT10(stage n) * src1.g dest.g = src0.g + D3DTSS_BUMENVMAT01(stage n) * src1.r + D3DTSS_BUMPENVMAT11(stage n) * src1.g ------------------------------------ ps.1.4 ; r2 environment map texture coordinates ; r1 du/dv perturbation data texld r1, t1 ; bump map texcrd r2.rgb, t2 bem r2.rg, r2, r1 ; perturb ; r2.rg = tex coordinates to sample environment map phase texld r0, t0 ; color map texld r2, r2 ; environment map add r0, r0, r2 See the example program BumpEarth on the ShaderX DVD. See also the articles on improved environment mapping techniques as Cube Mapping [Hurley][Brennan2][Brennan3] and Per-Fresnel Term [Brennan].

Instruction	Arguments	Registers				Version
Instruction	Arguments	v_n	c_n	t_n	r_n	Version
cmp	dest
				x	x	ps.1.2, ps.1.3
					x	ps.1.4
	src0, src1, src2	x	x	x	x	ps.1.2, ps.1.3
			x		x	ps.1.4 phase 1
		x	x		x	ps.1.4 phase 2
Conditionally chooses between src1 and src2 based on a per-channel comparison src0 >= 0. For ps.1.2 and ps.1.3 cmp uses two arithmetic instruction slots. CreatePixelShader() erroneously assumes, that this instruction consumes only one instruction slot. So you have to check the instruction count of a pixel shader, which uses this instruction, manually. Another validation problem is, that in ps.1.2 and ps.1.3 the destination register of cmp cannot be the same as any of the source registers. ------------------------------------------ // Compares all four components. ps.1.2 ... fill t1, t2 and t3 // t1 holds -0.6, 0.6, 0, 0.6 // t2 holds 0, 0, 0, 0 // t3 holds 1, 1, 1, 1 cmp r0, t1, t2, t3 // r0 is assigned 1,0,0,0 based on the following: // r0.x = t3.x because t1.x < 0 // r0.y = t2.y because t1.y >= 0 // r0.z = t2.z because t1.z >= 0 // r0.w = t2.w because t1.w >= 0 ---------- // Compares all four components. ps.1.4 texcrd r1, t1 texcrd r2, t2 texcrd r3, t3 // r1 holds -0.6, 0.6, 0, 0.6 // r2 holds 0, 0, 0, 0 // r3 holds 1, 1, 1, 1 cmp r0, r1, r2, r3 // r0 is assigned 1,0,0,0 based on the following: // r0.x = r3.x because r1.x < 0 // r0.y = r2.y because r1.y >= 0 // r0.z = r2.z because r1.z >= 0 // r0.w = r2.w because r1.w >= 0 ---------- ; Cartoon pixel shader ; explained in detail in [Card/Mitchell] ; c0 holds falloff 1 ; c1 holds falloff 2 ; c2 holds dark ; c3 holds average ; c4 holds bright ; t0 holds normal information ; t1 holds the light vector ps.1.4 def c0, 0.1f, 0.1f, 0.1f, 0.1f def c1, 0.8f, 0.8f, 0.8f, 0.8f def c2, 0.2f, 0.2f, 0.2f, 1.0f def c3, 0.6f, 0.6f, 0.6f, 1.0f def c4, 0.9f, 0.9f, 1.0f, 1.0f texcrd r0.xyz, t0 ; place normal vector in r0 texcrd r1.xyz, t1 ; place light vector in r1 dp3 r3, r0, r1 ; n.l sub r4, r3, c0 ; subtract falloff #1 from n.l cmp_sat r0, r4, c3, c2 ; check if n.l is > zero ; if yes use average color ; otherwise darker color sub r4, r3, c1 ; subtract falloff #2 from n.l cmp_sat r0, r4 c4, r0 ; check if n.l is > zero ; if yes use bright color ; otherwise use what is there ; ps.1.2 equivalent with less precision ps.1.2 def c0, 0.1f, 0.1f, 0.1f, 0.1f def c1, 0.8f, 0.8f, 0.8f, 0.8f def c2, 0.2f, 0.2f, 0.2f, 1.0f def c3, 0.6f, 0.6f, 0.6f, 1.0f def c4, 0.9f, 0.9f, 1.0f, 1.0f texcoord t0 ; place normal vector in t0 texcoord t1 ; place light vector in t1 dp3 r1, t0, t1 ; n.l sub t3, r1, c0 ; subtract falloff #1 from n.l cmp_sat r0, t3, c3, c2 ; check if n.l is > zero ; if yes use average color ; otherwise darker color sub t3, r1, c1 ; subtract falloff #2 from n.l cmp_sat r0, t3, c4, r0 ; check if n.l is > zero ; if yes use bright color ; otherwise use what is there The ps.1.2 version is not able to provide the same precision as the ps.1.4 version, because of texcoord, which clamps to [0..1]. texcrd do not clamp at all. It is able to handle values in the range of its source registers rn [-8..+8].

Instruction	Arguments	Registers				Version
Instruction	Arguments	v_n	c_n	t_n	r_n	Version
cnd	dest
				x	x	ps.1.1 - ps.1.3
					x	ps.1.4
	src0				r0.a	ps.1.1 - ps.1.3
	src1, src2	x	x	x	x	ps.1.1 - ps.1.3
	src0, src1, src2		x		x	ps.1.4 phase 1
	src0, src1, src2	x	x		x	ps.1.4 phase 2
Conditionally chooses between src1 and src2 based on the comparison r0.a > 0.5, whereas ps.1.4 conditionally chooses on between src1 and src2 based on the comparison src0 > 0.5 by comparing all channels of src0. // Version 1.1 to 1.3 if (r0.a > 0.5) dest = src1 else dest = src2 // Version 1.4 compares each channel separately. for each channel in src0 { if (src0.channel > 0.5) dest.channel = src1.channel else dest.channel = src2.channel } ------------------------------------------ // Compares r0.a > 0.5 ps.1.1 def c0, -0.5, 0.5, 0, 0.6 def c1, 0, 0, 0, 0 def c2, 1, 1, 1, 1 mov r0, c0 cnd r1, r0.a, c1, c2 // r1 is assigned 0,0,0,0 based on the following: // r0.a > 0.5, therefore r1.xyzw = c1.xyzw ----------- // Compares all four components. ps.1.4 // r1 holds -0.6, 0.5, 0, 0.6 // r2 holds 0, 0, 0, 0 // r3 holds 1, 1, 1, 1 texcrd r1, t1 texcrd r2, t2 texcrd r3, t3 cnd r0, r1, r2, r3 // r0 is assigned 1,1,1,0 based on the following: // r0.x = r3.x because r1.x < 0.5 // r0.y = r3.y because r1.y = 0.5 // r0.z = r3.z because r1.z < 0.5 // r0.w = r2.w because r1.w > 0.5 See the chapter of Steffen Bendel [Bendel] for an intersting usage of cnd to smooth fonts.

Instruction	Arguments	Registers				Version
Instruction	Arguments	v_n	c_n	t_n	r_n	Version
dp3	dest
				x	x	ps.1.1 - ps.1.3
					x	ps.1.4
	src0, src1	x	x	x	x	ps.1.1 - ps.1.3
			x		x	ps.1.4 phase 1
		x	x		x	ps.1.4 phase 2
Calculates a three-component dot product. The scalar result is replicated to all four channels: dest.w = (src0.x * src1.x) + (src0.y * src1.y) + (src0.z * src1.z); dest.x = dest.y = dest.z = dest.w = the scalar result of dp3 It does not automatically clamp the result to [0..1]. This instruction executes in the vector pipeline. So it can be paired or co-issued with an instruction that executes in the alpha pipeline (More on co-issuing below). dp3 r0.rgb, t0, v0 +mov r2.a, t0 In ps.1.1 - ps.1.3 dp3 always writes out to .rgb. In ps.1.4 you are free to choose three channels of the four rgba channels in any combination by masking the destination register. ---------------------------------------------- ; Dot3 per-pixel specular lighting ; specular power comes from a lookup table ps.1.4 ; r0 holds normal map ; t1 holds light vector ; t2 holds half vector ; r2 holds a 2D texture (lookup table) ; r3 holds color map texld r0, t0 ; normal map texcrd r1.rgb, t1 texcrd r4.rgb, t2 dp3 r1.r, r1, r0_bx2 ; calculates u dp3 r1.g, r4, r0_bx2 ; calculates v phase texld r3, t3 texld r2, r1 ; samples texel with u,v from t2 (lookup table) mul r0, r2, r3 You will find the ps.1.1 equivalent as an example for the texm3x2tex instruction. See the example RacorX8 and RacorX9 in part 3 of this introduction.

Instruction	Arguments	Registers				Version
Instruction	Arguments	v_n	c_n	t_n	r_n	Version
dp4	dest
					x	ps.1.2, ps.1.3
					x	ps.1.4
	src0, src1	x	x	x	x	ps.1.2, ps.1.3
			x		x	ps.1.4 phase 1
		x	x		x	ps.1.4 phase 2
Calculates a four-component dot product. It does not automatically clamp the result to [0..1]. This instruction executes in the vector and alpha pipeline. So it can not be co-issued. Unfortunately, CreatePixelShader() assumes that this instruction consumes only one instruction slot, whereas it really consumes two. So the instruction count in a pixel shader that uses dp4 must be checked manually. Additionally in ps.1.2 and ps.1.3, the destination register for dp4 cannot be the same as any of the source registers. CreatePixelShader() will not catch a wrong usage. A maximum of 4 dp4 commands are allowed in a single pixel shader. dp4 is useful to handle 4x4 matrices or quaternions in a pixel shader. dp4 does not seem to be useful in conjunction with most of the texture address instructions of ps.1.2 and ps.1.3, because these instructions support only matrices with three columns.

Instruction	Arguments	Registers				Version
Instruction	Arguments	v_n	c_n	t_n	r_n	Version
lrp	dest
				x	x	ps.1.1 - ps.1.3
					x	ps.1.4
	src0, src1, src2	x	x	x	x	ps.1.1 - ps.1.3
			x		x	ps.1.4 phase 1
		x	x		x	ps.1.4 phase 2
Performs a linear interpolation based on the following formula: dest = src2 + src0 * (src1 - src2) src0 determines the amount for the blend of src1 and src2. ------------------------------------------- ps.1.1 def c0, 0.4, 0.2, 0.5, 1.0 tex t0 tex t3 lrp r0, c0, t3, t0 ; the texture values of t3 and t0 are ; interpolated depending on c0 ----------------------- ps.1.4 def c0, 0.4, 0.2, 0.5, 1.0 texld r0, t0 texld r3, t3 lrp r0, c0, r3, r0 ; the texture values of t3 and t0 are ; interpolated depending on c0 [Vlachos] shows how to programmatically interpolate with lrp between two textures based on their normal.

Instruction	Arguments	Registers				Version
Instruction	Arguments	v_n	c_n	t_n	r_n	Version
mad	dest
				x	x	ps.1.1 - ps.1.3
					x	ps.1.4
	src0, src1, src2	x	x	x	x	ps.1.1 - ps.1.3
			x		x	ps.1.4 phase 1
		x	x		x	ps.1.4 phase 2
Performs a multiply accumulate operation based on the following forumula dest = src0 * src1 + src2 This might be useful for example for dark mapping with diffuse color. ------------------------------------------- ; The following examples show a modulation of a light map with a color map. ; This technique is often used to darken parts of a color map. In this case ; it is called Dark Mapping. Additionally a diffuse color is added. ps.1.1 tex t0 ; color map tex t3 ; light map mad r0, t3, t0, v0 ; Dark Mapping + diffuse color ps.1.4 texld r0, t0 ; color map texld r3, t3 ; light map mad r0, r3, r0, v0

ps.1.1 - ps.1.3

Instruction	Arguments	Registers				Version
Instruction	Arguments	v_n	c_n	t_n	r_n	Version
mov	dest
				x	x	ps.1.1 - ps.1.3
					x	ps.1.4
	src	x	x	x	x
			x		x	ps.1.4 phase 1
		x	x		x	ps.1.4 phase 2
Copies the content of the source to the destination register. Question every use of move, because there might be better suitable instructions.

Instruction	Arguments	Registers				Version
Instruction	Arguments	v_n	c_n	t_n	r_n	Version
mul	dest
				x	x	ps.1.1 - ps.1.3
					x	ps.1.4
	src0, src1	x	x	x	x	ps.1.1 - ps.1.3
			x		x	ps.1.4 phase 1
		x	x		x	ps.1.4 phase 2
Performs the following operation: dest = src0 * src1 --------------------------------------- ; The following examples show a modulation of a light map with a color map. ; This technique is often used to darken parts of a color map. In this case ; it is called Dark Mapping ps.1.1 tex t0 ; color map tex t3 ; light map mul r0, t0, t3 ; Dark Mapping ps.1.4 texld r0, t0 ; color map texld r3, t3 ; light map mul r0, r0, r3

Instruction	Arguments	Registers				Version
Instruction	Arguments	v_n	c_n	t_n	r_n	Version
nop
Performs no operation in ps.1.1 - ps.1.4.

Instruction	Arguments	Registers				Version
Instruction	Arguments	v_n	c_n	t_n	r_n	Version
sub	dest
				x	x	ps.1.1 - ps.1.3
					x	ps.1.4
	src0, src1	x	x	x	x	ps.1.1 - ps.1.3
			x		x	ps.1.4 phase 1
		x	x		x	ps.1.4 phase 2
Performs the following operation: dest = src0 - src1 --------------------------------- ps.1.1 tex t0 ; color map #1 tex t1 ; color map #2 sub r0, t1, t0 ; subtract t0 from t1

All arithmetic instructions can use the temporary registers rn. The rn registers are initially unset, and cannot be used as source operands until they are written. This requirement is enforced independently per each channel of each rn register. In ps.1.4 the tn registers can not be used with any arithmetic instruction, so they are restricted on texture addressing instructions (exception: texdepth).

Valid source registers for first phase arithmetic instructions are rn and cn. Valid source registers for second phase arithmetic instructions are rn, vn, and cn.

The comparison of ps.1.1 - ps.1.3 to ps.1.4 shows only a few differences. The ps.1.4-only instruction is bem. It substitutes the texbem and texbeml capabilities in an arithmetic operation in ps.1.4. Furthermore the cmd and cnd instructions are more powerful in ps.1.4. The scope of the arithmetic instructions is much bigger in ps.1.4, than in ps.1.1 - ps.1.3, because they are used for all texture addressing and blending tasks in ps.1.4.

As with the vertex shader, the pixel shader arithmetic instructions provide no if-statement, but this functionality can be emulated with cmp or cnd.

All of the rn.a channels are marked as unset at the end of the first phase, and thus cannot be used as a source operand until written. As a result, the fourth channel of color data will be lost during the phase transition. This problem can be partly solved by re-ordering the instructions. For example the following code snippet will loose the alpha value in r3.

ps.1.4
...
texld r3, t3
phase
...
mul r0, r2, r3

The next code snippet will not lose the alpha value:

ps.1.4
...
phase
texld r3, t3
...
mul r0, r2, r3

If no phase marker is present, then the entire shader is validated as being in the second phase.

All four channels of the shader result r0 must be written.

ps.1.1 - ps.1.3 and ps.1.4 are limited in different ways regarding the maximum number of source registers of the same type, that can be read.

Read Port Limit

The read port limit gives you the maximum number of registers of the same register type, that can be used as a source register in a single instruction.

Register Name	Version
Register Name	ps.1.1	ps.1.2	ps.1.3	ps.1.4
cn	2	2	2	2
rn	2	2	2	3
tn	2	3	3	1
vn	2	2	2	2

The color registers have a read port limit of two in all versions. In the following code snippet, mad uses v0 and v1 as a source register:

ps.1.1 // Version instruction
tex t0 // Declare texture
mad r0, v0, t0, v1

This is example exposes a readport limit of 2. The following example exposes a readport limit of 1, because v0 is used twice:

ps.1.1 // Version instruction
tex t0 // Declare texture
mad r0, v0, t0, v0

The following pixel shader fails in ps.1.1:

ps.1.1
tex t0
tex t1
tex t2
mad r0, t0, t1, t2

It exceeds the readport limit of 2 for the texture registers. This shader won't fail with ps.1.2 and ps.1.3, because these versions have a readport limit of 3 for the tn registers. The functional equivalent in ps.1.4 won't fail too:

ps.1.4
texld r0, t0
texld r1, t1
texld r2, t2
mad r0, r0, r1, r2

Another example for the usage of three temporary registers in ps.1.4 in the same instruciton is shown in the examples for the cmp and cnd instructions. In ps.1.4 the tn registers cannot be used with arithmetic instructions and none of the texture address instructions can use more than one tn register as a source, therefore it is not possible to cross the readport limit of the tn registers in ps.1.4.

There is no write port limit, because every instruction has only one destination register.

Instruction Modifiers

How instructions can be modified is best shown with the following diagram in mind:

Figure 12 - Pixel Shader ALU

This diagram shows the parallel pipeline structure of the pixel shader ALU. The vector or color pipeline handles the color values and the scalar or alpha pipeline handles the alpha value of a 32-bit value. There are "extensions" for instructions that enable the programmer to change the way data is read and/or written by the instruction. They are called Swizzling, Source Register Modifiers, Instruction Modifiers and Destination Register Modifiers.

We will work through all instruction "extensions" shown by Figure 12 in the following paragraphs from top to bottom.

Swizzling/Using Selectors

In contrast to the more powerful swizzles that can be used in vertex shaders, the swizzling supported in pixel shader is only able to replicate a single channel of a source-register to all channels. This is done by so called source register selectors.

Source Register Selectors	Syntax	ps.1.1	ps.1.2	ps.1.3	ps.1.4
Red replicate	source.r				x
Green replicate	source.g				x
Blue replicate	source.b	x	x	x	x
Alpha replicate	source.a	x	x	x	x

The .r and .g selectors are only available in ps.1.4. The following instruction replicates the red channel to all channels of the source register.

r1.r ; r1.rgba = r1.r

As shown in Figure 12, selectors are applied first in the pixel shader ALU. They are only valid on source registers of arithmetic instructions.

The .b replicate functionality is available in ps.1.1 - ps.1.3 since the release of DirectX 8.1, but this swizzle is only valid together with an alpha write mask in the destination register of an instruction like this:

mul r0.a, r0, r1.b

ps.1.1 does not support the .b replicate in DirectX 8.0.

This means that the .b source swizzle cannot be used with dp3 in ps.1.1 - ps.1.3, because the only valid write destination masks for dp3 are .rgb or .rgba (write masks will be presented later):

dp3 r0.a, r0, c0.b ; fails

The ability to replicate the blue channel to alpha opens the door to a bandwidth optimization method, described in an NVIDIA OpenGL paper named "Alpha Test Tricks" [Dominé01]. A pixel shader allow a dot product operation at the pixel level between two RGB vectors. Therefore, one can set one of the vectors to be (1.0, 1.0, 1.0), turning the dot product into a summation of the other's vectors components:

(R, G, B) dot  (1.0, 1.0, 1.0) = R + G + B

In the following code snippet the pixel shader instruction dp3 calculates the sum of the RGB color by replicating the scalar result into all four channels of r1.

ps.1.1
def c0, 1.0, 1.0, 1.0, 1.0
tex t0
dp3 r1, t0, c0
mov r0.a, r1.b
+mov r0.rgb, t0

An appropriate blending function might look like this:

dev->SetRenderState(D3DRS_ALPHAREF, (DWORD)0x00000001);
dev->SetRenderState(D3DRS_ALPHATESTENABLE, TRUE);
dev->SetRenderState(D3DRS_ALPHAFUNC, D3DCMP_GREATEREQUAL);

If the color data being rasterized is more opaque than the color at a given pixel (D3DPCMPCAPS_GREATEREQUAL), then the pixel is written. Otherwise, the rasterizer ignores the pixel altogether, saving the processing required for example to blend the two colors.

An even more clever method to utilize the alpha test for fillrate optimization was shown by ShaderX author Dean Calver on the public Microsoft DirectX discussion forum [Calver]. He uses an alpha map to lerp three textures this way:

; Dean Calver 
; 2 lerps from a combined alpha texture
ps.1.1
tex t0 ; combined alpha map
tex t1 ; texture0
tex t2 ; texture1
tex t3 ; texture2
mov r1.a, t0.b ; copy blue to r1.a
lrp r0, r1.a, t2, t1 ; lerp between t1 and t2
lrp r0, t0.a, t3, r0 ; lerp between result and t3

The .a replicate is analogous to the D3DTA_ALPHAREPLICATE flag in the DX6/7 multitexturing unit.

To move any channel to any channel, use dp3 to replicate the channel across, and then mask it out with the help of def instructions. The following pixel shader move the blue channel to the alpha channel:
; move the red to blue and output combined ps.1.1 def c0, 1.f, 0.f, 0.f, 0.f ; select red channel def c1, 0.f, 0.f, 1.f, 0.f ; mask for blue channel def c2, 1.f, 1.f, 0.f, 1.f ; mask for all channels but blue tex t0 dp3 r0, t0, c0 ; copy red to all channels mul r0, r0, c1 ; mask so only blue is left mad r0, t0, c2, r0 ; remove blue from original texture and ; add red shifted into blue

This trick was shown by ShaderX author Dean Calver on the Microsoft DirectX discussion forum [Calver].

In ps.1.4, there are specific source register selectors for texld and texcrd:

Source Register Selectors	Description	Syntax
.xyz/.rgb	maps x, y, z to x, y, z ,z	source.xyz/source.rgb
.xyw/.rga	maps x, y, w to x, y, w, w	source.xyw/source.rga

texld and texcrd are only able to use three channels in their source registers, so these selectors provide the option of taking the third component from either the third or the fourth component of the source register. Here are two examples on how to use these selectors:

texld r0, t1.rgb
...
texcrd r1.rgb, t1.rga
texcrd r4.rgb, t2.rgb

An overview on all possible source register selectors, modifiers and destination write masks is provided with the description of the texcrd and texld instructions above.

Source Register Modifiers

Source register modifiers are useful to adjust the range of register data in preparation for the instruction or to scale the value.

Modifier	Syntax	ps.1.1	ps.1.2	ps.1.3	ps.1.4
Bias	r0_bias	x	x	x	x
Invert	1-r0	x	x	x	x
Negate	-r0	x	x	x	x
Scale x2	r0_x2				x
Signed Scaling	r0_bx2	x	x	x	x

All modifiers can be used on arithmetic instructions. In ps.1.1 you can use the signed scale modifier _bx2 on the source register of any texm3x2* and texm3x3* instruction. In ps.1.2 and ps.1.3 it can be used on the source register of any texture address instruction.

bias subtracts 0.5 from all components. It allows the same operation as D3DTOP_ADDSIGNED in the DX6/7 multitexturing unit.

It is used to change the range of data from 0 to 1 to -0.5 to 0.5. Applying bias to data outside this range may produce undefined results. Data outside this range might be saturated with the _sat instruction modifier to the range [0..1], before used with a biased source register (more on instruction modifiers in the next section).

A typical example for this modifier is detail mapping, shown in the add example.

Invert inverts (1 - value) the value for each channel of the specified source register. The following code snippet uses inversion to complement the source register r1:

mul r0, r0, 1-r1 ; multiply by (1.0 - r1)

Negate negates all source register components by using a subtract sign before a register. This modifier is mutually exclusive with the invert modifier so it can not be applied to the same register.

Scale with the _x2 modifier is only available in ps.1.4. It multiplies a value by two before using it in the instruction and it is mutually exclusive to the invert modifier.

Signed scaling with the _bx2 modifier is a combination of bias and scale so it subtracts 0.5 from each channel and scales the result by 2. It remaps input data from unsigned to signed values. As with bias, using data outside of the range 0 to 1 may produce undefined results. This modifier is typically used in dp3 instructions. An example for this is presented with the description of the dp3 instruction above. Signed scaling is mutually exclusive with the invert modifier.

None of these modifiers change the content of the source register they are applied to. These modifiers are applied only to the data read from the register, so the value stored in the source register is not changed.

Modifiers and selectors may be combined freely. In the following example r1 uses the negative, bias and signed scaling modifiers as well as a red selector

-r1_bx2.r

With the help of the source register modifiers, per-pixel normalization is possible in the pixel shader. Per-pixel normalization can be used instead of a cubemap:

; Assuming v0 contains the unnormalized biased & scaled vector ( just
; like a normal map), r0 will end up with a very close to normalized 
; result.
; This trick is also useful to do 'detail normal maps' by adding 
; a vector to a normal map, and then renormalizing it.
dp3 r0, v0_bx2, v0_bx2 ; r0 = N . N
mad r0, v0_bias, 1-r0, v0_bx2 ; (v0_bias * (1-r0)) + v0_bx2
; ((N - 0.5) * (1 - N.N)) + (N - 0.5) * 2

Normalization requires calculating 1/sqrt(N). This code snippet normalizes the normal vector with a standard Newton-Raphson iteration for approximating a square root. This trick was shown by Sim Dietrich in the Microsoft DirectX forum [Dietrich-DXDev].

In the first line, the normal vector N is biased and scaled by _bx2 and then multiplied with itself via a dot product. The result is stored in r0. In the next line, the normal is again biased and scaled and added to the inverse of r0. The result of this addition is multiplied with the biased normal.

There are additional modifiers specific to the texld and texcrd instructions in the ps.1.4 implementation. These modifiers provide projective divide functionality by dividing the x and y values by either the z or w values and therefore projective dependant reads are possible in ps.1.4.

Source Register Modifiers	Description	Syntax
_dz/_db	divide x, y components by z (pseudo-code): if (z==0) x' = 1.0; else x' = x/z; if (z==0) y' = 1.0; else y' = y/z; // z' and w' are undefined	source_dz/source_db
_dw/_da	divide x, y components by w (pseudo-code): if (w==0) x' = 1.0; else x' = x/w; if (w==0) y' = 1.0; else y' = y/w; // z and w are undefined	source_dw/source_da

These modifiers provide a functional replacement of the D3DTFF_PROJECTED flag of the D3DTSS_TEXTURETRANSFORMFLAGS texture stage state flag in the pixel shader. A typical instruction would look like this:

texcrd r2.rg, t1_dw.xyw ; third channel unset

The modifier copies x/w and y/w from t1 into the first two channels of r2. The 3rd and 4th channels of r2 are uninitialized. Any previous data written to these channels will be lost. The per-pixel perspective divide is useful for example for projective textures.

The restriction for the two texture addressing registers are:

texld	texcrd
_dz/_db only valid with rn (second phase)	does not support_dz/_db
_dz/_db may be used no more that two times per shader	The 4th channel result of texcrd is unset/undefined in all cases.
The 3rd channel is unset/undefined for the .xyw case.
The same .xyz or .xyw modifier must be applied to every read of an individual tn register within a shader. If .xyw is being used on tn register read(s), this can be mixed with other read(s) of the same tn register using .xyw.

Instruction Modifiers

After the swizzling of the source register channels and the modification of the values read out from a source register with source register modifiers, the instruction starts executing. As shown in figure 12, now the instruction modifiers are applied. These are indicated as an appendix to the instruction connected via an underscore. Instruction modifiers are used to change the output of an instruction. They can multiply or divide a result or clamp a result to [0..1]:

Source Register Modifiers	Description	Syntax	ps.1.1 - ps.1.3	ps.1.4
_x2	multiply by 2	instruction_x2	x	x
_x4	multiply by 4	instruction_x4	x	x
_x8	multiply by 8	instruction_x8		x
_d2	divide by 2	instruction_d2	x	x
_d4	divide by 4	instruction_d4		x
_d8	divide by 8	instruction_d8		x
_sat	clamp to [0..1]	instruction_sat	x	x

Instruction modifiers can be used only on arithmetic instructions. The _x8, _d4, and _d8 modifiers are new to the 1.4 shading model. _sat may be used alone or combined with one of the other modifiers. i.e. mad_d8_sat.

Multiplier modifiers are useful to scale a value. Note that any such scaling reduces accuracy of results. The following examples scale the results by using _x2 or _x4:

ps.1.1
tex t0
tex t1
mul_x2 r0, t1, t0 ; (t1 * t0) * 2
... 
mul_x2 r0, 1-t1, t0 ; t0 * inverse(t1) * 2
... 
mad_x2 r0, t1, v0, t0 ; (t1 + ( v0 * t0)) * 2
... 
mul_x4 r0, v0, t0 ; (v0 * t0) * 4
mul_x4 r1, v1, t1 ; (v1 * t1) * 4
add r0, r0, r1 ; (v0*t0 * 4)+(v1*t1 * 4)

The _x2 modifer does the same as a shift left in C/C++.

The _d2 modifer does the same as a right shift in C/C++. Here is a more complex example:

; Here is an example for per-pixel area lighting
ps.1.1
def c1, 1.0, 1.0, 1.0, 1.0 ; sky color
def c2, 0.15, 0.15, 0.15, 1.0 ; ground color
def c5, 0.5, 0.5, 0.5, 1.0

tex t0 ; normal map
tex t1 ; base texture
dp3_d2 r0, v0_bx2, t0_bx2 ; v0.rgb is hemi axis in tangent space
; dot normal with hemi axis
add r0, r0, c5 ; map into range
lrp r0, r0, c1, c2
mul r0, r0, t1 ; modulate base texture

This pixel shader biases the hemisphere axis in v0 and scales it by 2. The same is done to the values of the normal map. dp3_bx2 divides the result through 2. The add instruction adds 0.5 to the vector of r0. lrp uses r0 as the proportion to linearly interpolate between sky color in c1 and ground color in c2.

The saturation modifer _sat clamps each component or channel of the result to the range [0..1]. It is most often used to clamp dot products like in the code snippet:

dp3_sat r0, t1_bx2, r0 ; N.H
dp3_sat r1.rgb, t1_bx2, r1 ; N.L

The result of the dot product operation of the normal vector with the half angle vector and the result of the dot product operation of the normal and the light vector are saturated. That means the values in r0 and r1.rgb are clamped to [0..1].

Destination Register Modifiers/Masking

A destination register modifer or write mask controls which channel in a register is updated. So it only alters the value of the channel it is applied to.

Write masks are supported for arithmetic instructions only. The following destination write masks are available for all arithmetic instructions:

Write Mask	Syntax	ps.1.1 - ps.1.3	ps.1.4
color	destination register.rgb	x	x
alpha	destination register.a	x	x
red	destination register.r		x
green	destination register.g		x
blue	destination register.b		x
arbitrary			x

In ps.1.1 - ps.1.3 a pixel shader can only use the .rgb or .a write masks. The arbitrary write mask in ps.1.4 allows any set of channels in the order r, g, b, a to be combined. It is possible to choose for example:

mov r0.ra, r1

If no destination write mask is specified, the destination write mask defaults to the .rgba case, which updates all channels in the destination register. An alternate syntax for the r, g, b, a channels is x, y, z, w.

As with the source register selectors and source register modifiers, the texld and texcrd instructions have additional write masks and write mask rules. texcrd can write only to the .rgb channels. It supports additionally a write mask that masks the first two channels with .rg or .xy. texld uses all four channels of the destination register. There is no alternative write mask available.

The usage of write masks is shown in the following ps.1.4 pixel shader that handles diffuse bump mapping with two spotlights (taken from the file 14_bumpspot.sha of the ATI Treasure Chest example program):

ps.1.4

def c0, 1.0f, 1.0f, 1.0f, 1.0f ; Light 1 Color
def c1, 1.0f, -0.72f, 1.0f, 1.0f ; Light 1 Angle scale(x) and bias(Y)
def c2, 1.0f, 1.0f, 1.0f, 1.0f ; Light 2 Color
def c3, 0.25f, 0.03f, 1.0f, 1.0f ; Light 2 Angle scale(x) and bias(Y)

texcrd r0.rgb, t2 ; Spot light 1 direction
texcrd r1.rgb, t4 ; Spot light 2 direction
texld r2, t1 ; Light 1 to Point vector
texld r3, t3 ; Light 2 to Point vector
texcrd r4.rgb, t1 ; Light 1 space position for attenuation
texcrd r5.rgb, t3 ; Light 2 space position for attenuation

dp3_sat r4.x, r4, r4 ; Light 1 Distance^2
dp3_sat r5.x, r5, r5 ; Light 2 Distance^2
dp3_sat r4.y, r0, r2_bx2 ; Light 1 Angle from center of spotlight
dp3_sat r5.y, r1, r3_bx2 ; Light 2 Angle from center of spotlight

mad_x4 r4.y, r4.y, c1.x, c1.y ; Light 1 scale and bias for angle
mad_x4 r5.y, r5.y, c3.x, c3.y ; Light 2 scale and bias for angle

phase

texld r0, t0 ; Base Map
texld r1, t0 ; Normal Map
texld r4, r4 ; Distance/Angle lookup map
texld r5, r5 ; Distance/Angle lookup map

dp3_sat r2.rgb, r1_bx2, r2_bx2 ; *= (N.L1)
mul_x2 r2.rgb, r2, r4.r ; Attenuation from distance and angle
mad r2.rgb, r2, c0, c7 ; * Light Color + Ambient

dp3_sat r3.rgb, r1_bx2, r3_bx2 ; *= (N.L2)
mul_x2 r3.rgb, r3, r5.r ; Attenuation from distance and angle
mad r3.rgb, r3, c2, r2 ; * Light 2 Color + Light 1 Color + Ambient

mul r0.rgb, r3, r0 ; Modulate by base map
+mov r0.a, c0

There are four different write masks used throughout this shader. These are the .rgb, .x, .y and the .a write masks. The first write mask used for the texcrd instructions are imperative. texld can't handle write masks other than .rgba, which is the same as applying no explicit write mask. The first four dp3 and the next two mad instructions write to a the x respectively y values of the r4 and r5 registers. These write masks are not supported by ps.1.1 - ps.1.3. The usage of the .rgb write mask in the second phase of this shader is supported by all implementations. The last two lines of this shader show the pairing of two instructions using co-issue. We will discuss instruction paring or "co-issuing" in the next section.

Instruction Pairing

As shown above in Figure 12, there are two pipelines one for the color data and one for the alpha data. Because of the parallel nature of these pipelines, the instructions that write color data and instructions that write only alpha data can be paired. This helps reducing the fill-rate. Only arithmetic instructions can be co-issued, with the exception of dp4. Pairing, or co-issuing, is indicated by a plus sign (+) preceding the second instruction of the pair. The following shader fragment shows three pairs of co-issued instructions:

dp3_sat r1.rgb, t1_bx2, r1
+mul r1.a, r0, r0
mul r1.rgb, r1, v0
+mul r1.a, r1, r1
add r1.rgb, r1, c7
+mul_d2 r1.a, r1, r1

First a dp3 instruction is paired with a mul, than a mul instruction with a mul and as the last an add instruction with a mul. Pairing happens in ps.1.1 - ps.1.3 always with the help of a pair of .rgb and .a write masks. In ps.1.4, a pairing of the .r, .g. or .b write masks together with an .a masked destination register is possible. The line

mul r1.a, r0, r0

only writes the alpha value of the result of the multiplication of r0 with itself into r1.a.

Co-issued instructions are considered a single entity, the result from the first instruction is not available until both instructions are finished and vice versa. The following shader will fail shader validation:

ps.1.1
def c0, 1, 1, 1, 1
mov r0, v0
dp3 r1.rgb, r0, c0
+mov r0.a, r1.b

mov tries to read r1.b, but dp3 did not write to r1.b at that time. The shader will fail, because r1.b was not initialized before.

This could be troublesome, when r1.b is initialized before by any instruction. Then the validator will not catch the bug and the results will not look as expected.

Another restriction to pay attention is the maximum number of three different register types, that can be used across two co-issued instructions.

Geforce3/4TI has a problem with co-issuing instructions in the 8th arithemtic instruction slot. It stops showing the results, when a co-issue happens in the 8th arithmetic instruction, whereas the REF works as expected. The following meaningless pixel shader doesn't show something with the driver version 28.32:
ps.1.1 tex t0 ; color map tex t1 ; normal map dp3 r0,t1_bx2,v1_bx2; ; dot(normal,half) mul r1,r0,r0; ; raise it to 32nd power mul r0,r1,r1; mul r1,r0,r0; mul r0,r1,r1; mul r1, r0, r0 mul r0, r1, r1 ; assemble final color mul r0.rgb,t0,r0 +mov r0.a, r1

If you use the REF or remove the last line, the results are as expected.

Assemble Pixel Shader

After checking for pixel shader support, setting the proper textures with SetTexture() and after writing a pixel shader and setting the needed constant values, the pixel shader has to be assembled. This is needed, because Direct3D uses pixel shaders as byte-code.

Assembling the shader is helpful in finding bugs earlier in the development cycle.

At the time of this writing there are three different ways to compile a pixel shader:

Pre-Compiled Shaders

Use the pixel shader in a separate ASCII file for example test.psh and compile it with a pixel shader assembler (Microsoft Pixel Shader Assembler or NVASM) to produce a byte-code file which could be named test.pso. This way, not every person will be able to read and modify your source.

On the Fly Compiled Shaders

Write the pixel shader in a separate ASCII file or as a char string into your *.cpp file and compile it "on the fly" while the application starts up with the D3DXAssembleShader*() functions.

Shaders in Effect Files

Write the pixel shader source in an effect file and open this effect file when the application starts up. The pixel shader will be compiled by reading the effects file with D3DXCreateEffectFromFile(). It is also possible to pre-compile an effects file. This way most of the handling of pixel shaders is simplified and handled by the effects file functions.

The pre-compiled shader should be the preferred way of compiling shaders, since compilation happens during development of the code i.e. at the same time that the *.cpp files are compiled.

Creating a Pixel Shader

The CreatePixelShader() function is used to create and validate a pixel shader.

HRESULT CreatePixelShader(
    CONST DWORD* pFunction,
    DWORD* pHandle
);

This function takes the pointer to the pixel shader byte-code in pFunction and returns a handle to the pixel shader in pHandle. A typical piece of source might look like this:

TCHAR Shad[255];
LPD3DXBUFFER pCode = NULL;

DXUtil_FindMediaFile(Shad,_T("environment.psh"));
if(FAILED(D3DXAssembleShaderFromFile(Shad,0,NULL, &pCode,NULL) ) )
  return E_FAIL;
if( FAILED(m_pd3dDevice->CreatePixelShader((DWORD*)pCode->GetBufferPointer(), 
  &m_dwPixShader) ) )
return E_FAIL;

DXUtil_FindMediaFile() helps you finding the ASCII file. D3DXAssembleShaderFromFile() compiles it before CreatePixelShader() returns the handle in m_dwPixShader.

The pointer pCode to the ID3DXBuffer interface is used to store the object code and to return a pointer to this object code with GetBufferPointer().

Set Pixel Shader

You set a pixel shader for a specific amount of vertices by using the SetPixelShader() function before the DrawPrimitive*() call for these vertices:

m_pd3dDevice->SetPixelShader(m_dwPixShader);

The only parameter that has to be provided is the handle of the pixel shader created by CreatePixelShader(). The pixel shader is executed for every pixel that is covered by the vertices in the DrawPrimitve*() call.

Free Pixel Shader resources

While the game shuts down or before a device change, the resources taken by the pixel shader has to be freed. This must be done by calling DeletePixelShader() with the pixel shader handle like this:

if(m_dwPixShader)
  m_pd3dDevice->DeletePixelShader(m_dwPixShader);

Summarize

We have walked step by step through a vertex shader creation process. Let's summarize what was shown so far:

First, the pixel shader support of end-user hardware has to be checked with the caps bit PixelShaderVersion.
All textures have to be set with SetTexture(), like with the multitexturing unit.
The constant values for a pixel shader have to be set afterwards in the application code with the SetPixelShaderConstant() or in the pixel shader code with def.
There are texture address and arithmetic instructions. The scope of the texture address instruction in ps.1.1 - ps.1.3 enfolds loading texture data and changing of texture data. The scope of the texture address instructions in ps.1.4 enfolds only the loading of texture data. Changing of texture data is done in ps.1.4 and in ps.1.1 - ps.1.3 with the arithmetic instructions.
After a pixel shader is written into a source file, it has to be compiled.
To get a handle to a pixel shader it has to be created with CreatePixelShader().
To use a pixel shader it has to be set with a call to SetPixelShader().
At the end of a pixel shader driven application, the resources occupied by the pixel shader must be freed with a call to DeletePixelShader().

What happens next?

In the next part of this introduction named "Programming Pixel Shaders", we will start with a first basic pixel shader program and discuss a few basic algorithms and the way how to implement them with pixel shaders.

References

[Bendel] Steffen Bendel, "Hallo World - Font Smoothing with Pixel Shaders", ShaderX, Wordware Inc., pp. ?? - ??, 2002, ISBN 1-55622-041-3

[Beaudoin/Guardado] Philippe Beaudoin , Juan Guardado, "Non-integer Power Function on the Pixel Shader", ShaderX, Wordware Inc., pp. ?? - ??, 2002, ISBN 1-55622-041-3

[Brennan] Chris Brennan, "Per-Pixel Fresnel Term", ShaderX, Wordware Inc., pp ?? - ??, 2002, ISBN 1-55622-041-3

[Brennan2] Chris Brennan, "Diffuse Cube Mapping", ShaderX, Wordware Inc., pp ?? - ??, 2002, ISBN 1-55622-041-3

[Brennan3] Chris Brennan, "Accurate Environment Mapped Reflections and Refractions by Adjusting for Object Distance", ShaderX, Wordware Inc., pp ?? - ??, 2002, ISBN 1-55622-041-3

[Card/Mitchell] Drew Card, Jason L. Mitchell, "Non-Photorealistic Rendering with Pixel and Vertex Shaders", ShaderX, Wordware Inc., pp ?? - ??, 2002, ISBN 1-55622-041-3

[Calver] Dean Calver, Microsoft DirectX discussion forum, mail from Fri, 21 Sep 2001 10:00:55, http://discuss.microsoft.com/SCRIPTS/WA-MSD.EXE?A2=ind0109C&L=DIRECTXDEV&P=R25479

[Dietrich01] Sim Dietrich, "Guard Band Clipping in Direct3D", NVIDIA web-site

[Dietrich-DXDev] Sim Dietrich, Microsoft DirectX discussion forum, mail from Tue, 14 Aug 2001 20:36:02, http://discuss.microsoft.com/SCRIPTS/WA-MSD.EXE?A2=ind0108B&L=DIRECTXDEV&P=R13431

[Dominé01] Sébastien Dominé, "Alpha Test Tricks", NVIDIA web-site

[Hart] Evan Hart, "3D Textures and Pixel Shaders", ShaderX, Wordware Inc., pp ?? - ??, 2002, ISBN 1-55622-041-3

[Isidoro/Brennan] John Isidoro and Chris Brennan, "Per-Pixel Strand Based Anisotropic Lighting", ShaderX, Wordware Inc., pp ?? - ??, 2002, ISBN 1-55622-041-3

[Isidoro/Riguer] John Isidoro and Guennadi Riguer, "Texture Perturbation Effects", ShaderX, Wordware Inc., pp ?? - ??, 2002, ISBN 1-55622-041-3

[Kraus] Martin Kraus, "TVolumetric Effects", ShaderX, Wordware Inc., pp ?? - ??, 2002, ISBN 1-55622-041-3

[Mitchell] Jason L. Mitchell, "Image Processing with 1.4 Pixel Shaders in Direct3D", ShaderX, Wordware Inc. pp ?? - ??, 2002, ISBN 1-55622-041-3

[Moravánsky] Ádám Moravánszky, "Bump Mapped BRDF Rendering", ShaderX, Wordware Inc., pp ?? - ??, 2002, ISBN 1-55622-041-3

[Vlachos] Alex Vlachos, "Blending Textures for Terrain", ShaderX, Wordware Inc. pp ?? - ??, 2002, ISBN 1-55622-041-3

[Watt92] Alan Watt, Mark Watt, "Advanced Animation and Rendering Techniques", Addison Wesley, 1992, ISBN 1-55622-041-3

[Weiskopf] Daniel Weiskopf and Matthias Hopf, "Real-Time Simulation and Rendering of Particle Flows", ShaderX, Wordware Inc. pp ?? - ??, 2002, ISBN 1-55622-041-3

[Zecha] Oliver Zecha, "Procedural Textures", ShaderX, Wordware Inc. pp ?? - ??, 2002, ISBN 1-55622-041-3

Additional Resources

The best resource to accompany this article is the pixel shader assembler reference in the Direct3D 8.1 documentation at

DirectX Graphics->Reference->Pixel Shader Assembler Reference

A lot of valuable information on pixel shaders can be found at the web-sites of NVIDIA (developer.nvidia.com) and ATI (http://www.ati.com/developer/). I would like to name a few:

Author	Article	Published at
Philip Taylor	Per-Pixel Lighting	http://msdn.microsoft.com/directx
Miscancellous	Meltdown Power Point Slids	http://www.microsoft.com/mscorp/ corpevents/meltdown2001/presentations.asp
Sim Dietrich	Intro To Pixel Shading in DX8	NVIDIA web-site (Courseware)
Sim Dietrich	DX8 Pixel Shader Details	NVIDIA web-site (Courseware)
Sim Dietrich	DX8 Pixel Shaders	NVIDIA web-site (Courseware)
Sim Dietrich	AGDC Per-Pixel Shading	NVIDIA web-site (Courseware)
Jason L. Mitchell	1.4 Pixel Shaders	ATI web-site
Jason L. Mitchell	Advanced Vertex and Pixel Shader Techniques	ATI web-site
Alex Vlachos	Preparing Sushi - How Hardware Guys Write a 3D Graphics Engine	ATI web-site
Rich	Direct3D 8.0 pipeline	http://www.xmission.com/~legalize/book/
Dave Salvator	3D Pipeline Part I - III	http://www.extremetech.com and there <3D Graphics, Gaming & Audio> -> <Analysis & Tutorials>

Acknowledgements

I'd like to recognize a couple of individuals that were involved in proof-reading and improving this paper (in alphabetical order):

Jason L. Mitchell (ATI)
Ádám Moravánszky (Swiss Federal Institute of Technology)
Matthias Wloka (NVIDIA)

Discuss this article in the forums

Date this article was posted to GameDev.net: 5/15/2002
(Note that this date does not necessarily correspond to the date the article was written)