Introduction to Shader Programming Part III
The final output of any 3D graphics hardware consists of pixels. Depending on the resolution, in excess of 2 million pixels may need to be rendered, lit, shaded and colored. Prior to DirectX 8.0, Direct3D used a fixed-function multitexture cascade for pixel processing. The effects possible with this approach were limited by the implementation of the graphics card device driver and the specific underlying hardware; a programmer was restricted to the graphics algorithms they implemented.
With the introduction of shaders in DirectX 8.0 and the improvements of pixel shaders in DirectX 8.1, a whole new programming universe can be explored by game/demo coders. Pixel shaders are small programs that are executed on individual pixels. They give their users an unprecedented level of hardware control.
The third part of this "Introduction to Shader Programming", named "Fundamentals of Pixel Shaders", shows you the fundamentals of pixel shader programming.
The fourth and last part will present many pixel shader concepts and algorithms with example code.
Why use Pixel Shaders?
Pixel shaders are, at the time of this writing, supported by GeForce 3/4TI- and RADEON 8500-based cards. Unlike vertex shaders, there is no feasible way of emulating pixel shaders in software.
The best argument for using pixel shaders is to take a look at a few demos that use them :-) ... or just one word: per-pixel non-standardized lighting. The gain in visual experience is enormous. You can use membrane shaders for balloons and skins, Kubelka-Munk shaders for translucency effects, fur/hair shaders and new lighting formulas that lead to a totally new lighting experience. Think of a foreign planet with three moons moving faster than the two suns, or with a planet surface consisting of different crystalline substances, reflecting light in different ways. The following list should give you a glimpse of the many types of effects that are possible by using pixel shaders now:
Not to mention the effects that have not been discovered yet, or that have been described so far only in scientific journals. These visual effects are waiting to be implemented by you :-).
One of the biggest catches of pixel shaders is that they often have to be "driven" by the vertex shader. For example to calculate per-pixel lighting the pixel shader needs the orientation of the triangle, the orientation of the light vector and in some cases the orientation of the view vector. The graphics pipeline shows the relationship between vertex and pixel shaders.
Pixel Shaders in the Pipeline
The following diagram shows the DX6/7 multitexturing unit and the new pixel shader unit. On pixel shader-capable hardware, the pixel shader increases the transistor count of the graphics chip, because it is added alongside the already existing DX6/7 multitexturing hardware. It is therefore also a functionally independent unit, which the developer can choose instead of the DX6/7 multitexturing unit.
But what happens in the 3D pipeline before the pixel shader? A vertex leaves the Vertex Shader as a transformed and colored vertex. The so-called Backface Culling removes all triangles that are facing away from the viewer or camera. By default, these are the triangles whose vertices are wound counter-clockwise. On average, half of your game world triangles will be facing away from the camera at any given time, so this helps reduce rendering time. A critical point is the use of translucent or transparent front triangles: depending on what is going to be rendered, backface culling can be switched off with the D3DCULL_NONE flag.
Backface Culling uses the cross product of two sides of a triangle to calculate a vector that is perpendicular to the plane formed by these two sides. This vector is the face normal. The direction of this normal determines whether the triangle is front- or back-facing. Because the Direct3D API always uses the same vertex order to calculate the cross product, it is known whether a triangle's vertices are "wound" clockwise or counter-clockwise.
User Clip Planes can be set by the developer to let the graphics card clip triangles that lie outside of these planes, and therefore to reduce the number of calculations. How to set user clip planes is shown in the Direct3D 8.1 example named "Clip Mirror" and in the ATI RADEON 8500 Nature demo. The RADEON and RADEON 8500 support six independent user clip planes in hardware. The clip planes on the GeForce 1/2/3/4TI chips are implemented using texture stages, which means that every two user clip planes consume one texture stage.
It looks like NVIDIA no longer exposes the cap bits for these. The DirectX Caps Viewer reports MaxUserClipPlanes == 0 since the release of the 12.41 drivers.
One alternative to clip planes is shown in the "TexKillClipPlanes" example delivered in the NVIDIA Effects Browser, where the pixel shader instruction texkill is used to get a similar functionality like clip planes.
Another alternative to clip planes is guard band clipping as supported by GeForce hardware [Dietrich01]. The basic idea of guard band clipping is that hardware with a guard band region can accept triangles that are partially or totally off-screen, thus avoiding expensive clipping work. There are four cap fields for Guard Band support in the D3DCAPS8 structure: GuardBandLeft, GuardBandRight, GuardBandTop, and GuardBandBottom. These represent the boundaries of the guard band. These values have to be used in the clipping code. Guard band clipping is handled automatically by switching on clipping in Direct3D.
Frustum Clipping is performed on the viewing frustum. Primitives that lie partially or totally off-screen must be clipped to the screen or viewport boundary, which is represented by a 3D viewing frustum. The viewing frustum can be visualized as a pyramid, with the camera positioned at the tip. The view volume defines what the camera will and won't see. The entire scene must fit between the near and far clipping planes and also be bounded by the sides, the bottom and the top of the frustum. A given object in relation to the viewing frustum can fall into one of three categories [Watt92]:
Translating the definition of the viewing frustum above into homogeneous coordinates gives us the clipping limits:

-w <= x <= w
-w <= y <= w
 0 <= z <= w
This would be a slow process if it had to be done by the CPU, because each edge of each triangle that crosses the viewport boundary must have an intersection point calculated, and each parameter of the vertex (x, y, z, diffuse r/g/b, specular r/g/b, alpha, fog, u and v) must be interpolated accordingly. Therefore frustum clipping is done by modern graphics hardware on the GPU, essentially for free, with clip planes and guard band clipping.
After frustum clipping, the homogeneous or perspective divide happens. This means that the x-, y- and z-coordinates of each vertex of the homogeneous coordinates are divided by w. The perspective divide makes nearer objects larger and farther objects smaller, as you would expect when viewing a scene in reality. After this division, the coordinates are in a normalized space:

-1 <= x/w <= 1
-1 <= y/w <= 1
 0 <= z/w <= 1
Why do we need this divide by w? By definition, every point gets a fourth component w that measures distance along an imaginary fourth-dimensional axis called w. For example, (4, 2, 8, 2) represents the same point as (2, 1, 4, 1). Such a w-value is useful, for example, to translate an object, because a 3x3 matrix is not able to translate an object without changing its orientation. The fourth coordinate, w, can be thought of as carrying the perspective information necessary to represent a perspective transformation.
The coordinates produced by the perspective divide are referred to as normalized device coordinates (NDC). These coordinates are now mapped to the screen by transforming into screen space via the so-called Viewport Mapping. Because screen y grows downward while NDC y grows upward, the y term is negated. To map NDCs to screen space, the following formula is used for a resolution of 1024x768:

ScreenX = NDC(X) * 512 + 512
ScreenY = -NDC(Y) * 384 + 384

The NDC range of X and Y is (-1, -1) to (1, 1). When the NDCs are (-1, 1), the screen coordinates are

ScreenX = -1 * 512 + 512 = 0
ScreenY = -(1) * 384 + 384 = 0

This point lies in the upper-left corner; the lower-right corner is reached with NDCs of (1, -1). Although z and w values are retained for depth buffering tests, screen space is essentially a 2D coordinate system, so only x and y values need to be mapped to the screen resolution.
Now comes the Triangle Setup, where the life of the vertices ends and the life of the pixels begins. It computes, triangle by triangle, the parameters required for the rasterization of the triangles; among other things, it defines the pixel coordinates of the triangle outline. This means that it defines the first and the last pixel of the triangle, scan line by scan line:
Then the Rasterizer interpolates color and depth values for each pixel from the color and depth values of the vertices. These values are interpolated using a weighted average of the vertex values along an edge, where pixels closer to a given vertex receive values that more closely approximate that vertex's values. Then the rasterizer fills in the pixels of each scan line.
In addition, the texture coordinates are interpolated for use during the Multitexturing/Pixel Shader stage in a similar way.
The rasterizer is also responsible for Direct3D Multisampling. Because it is done in the rasterizer stage, multisampling only affects triangles and groups of triangles, not lines. It increases the effective resolution of polygon edges and therefore of the depth and stencil tests as well. The RADEON 7x00/GeForce 3 support this kind of spatial anti-aliasing, indicated by the D3DPRASTERCAPS_STRETCHBLTMULTISAMPLE flag, but both ignore the D3DRS_MULTISAMPLEMASK render state to control rendering into the sub-pixel samples, whereas the RADEON 8500 is able to mask the sub-samples with different bit-patterns by using this render state. By affecting only specific sub-samples, effects like motion blur or depth of field and others can be realized.
Alternatively, motion blur and depth of field effects are possible with the help of vertex and pixel shaders; see the NVIDIA examples in the Effects Browser.
The Pixel Shader is not involved on the sub-pixel level. It gets the already multisampled pixels along with z, color values and texture information. The already Gouraud shaded or flat shaded pixel might be combined in the pixel shader with the specular color and the texture values fetched from the texture map. For this task the pixel shader provides instructions that affect the texture addressing and instructions to combine the texture values in different ways with each other.
There are five different pixel shader standards supported by Direct3D 8.1. Currently, no pixel shader-capable graphics hardware is restricted to ps.1.0; all available hardware supports at least ps.1.1, so the legacy ps.1.0 will not be mentioned in this text anymore. ps.1.1 is the pixel shader version supported by GeForce 3. GeForce 4TI additionally supports ps.1.2 and ps.1.3. The RADEON 8x00 supports all of these pixel shader versions plus ps.1.4.
Whereas ps.1.1, ps.1.2 and ps.1.3 are, from a syntactical point of view, built on each other, ps.1.4 is a new and more flexible approach. ps.1.1 - ps.1.3 differentiate between texture address operation instructions and texture blending instructions. As the names imply, the first kind of instruction is specialized for and only usable in address calculations, and the second kind is specialized for and only usable in color shading operations. ps.1.4 simplifies the pixel shader language by allowing the texture blending (color shader) instruction set to be used for texture address (address shader) operations as well. It differentiates between arithmetic instructions, which modify color data, and texture address instructions, which process texture coordinate data and in most cases sample a texture.
Sampling means looking up a color value in a texture at up to four specified coordinates (u, v, w, q) while taking into account the texture stage state attributes.
One might say that the usage of the instructions in ps.1.4 happens in a more RISC-like manner, whereas the ps.1.1 - ps.1.3 instruction sets are only useable in a CISC-like manner.
This RISC-like approach will also be used in ps.2.0, which will appear in DirectX 9. Syntactically, ps.1.4 is, compared to ps.1.1 - ps.1.3, an evolutionary step in the direction of ps.2.0.
What are the other benefits of using ps.1.4?
The next stage in the Direct3D pipeline is Fog. A fog factor is computed and applied to the pixel using a blending operation to combine the fog amount (color) and the already shaded pixel color, depending on how far away an object is. The distance to an object is determined by its z- or w-value, or by a separate attenuation value that measures the distance between the camera and the object, computed in a vertex shader. If fog is computed per-vertex, it is interpolated across each triangle using Gouraud shading. The "MFC Fog" example in the DirectX 8.1 SDK shows linear and exponential fog calculated per-vertex and per-pixel. A layered range-based fog is shown in the "Height Fog" example in the NVIDIA Effects Browser. The Volume Fog example in the DirectX 8.1 SDK shows volumetric fog produced with a vertex and a pixel shader and the alpha blending pipeline. As shown in these examples, fog effects can be driven by the vertex and/or pixel shader.
The Alpha Test stage kicks out pixels with a specific alpha value, because these shouldn't be visible. This is for example one way to map decals with an alpha mapped texture. The alpha test is switched on with D3DRS_ALPHATESTENABLE. The incoming alpha value is compared with the reference alpha value, using the comparison function provided by the D3DRS_ALPHAFUNC render state. If the pixel passes the test, it will be processed by the subsequent pixel operation, otherwise it will be discarded.
The alpha test does not incur any extra overhead, even if all the pixels pass the test. By tuning the reference value appropriately to reject transparent or almost transparent pixels, one can improve application performance significantly if the application is memory bandwidth or fill-rate bound. Basically, the reference value acts like a threshold that determines how many pixels are going to be evicted: the more pixels are discarded, the faster the application will run.
There is a trick to drive the alpha test with a pixel shader, that will be shown later in the section on swizzling. The image-space outlining technique in [Card/Mitchell] uses alpha testing to reject pixels that are not outlines and consequently improve performance.
The Stencil Test masks the pixel in the render target with the contents of the stencil buffer. This is useful for dissolves, decaling, outlining or for building shadow volumes [Brennan]. A nice example of this is "RadeonShadowShader", which can be found on the ATI web site.
The Depth Test determines whether a pixel is visible by comparing its depth value to the stored depth value. An application can set up its z-buffer z-min and z-max, with positive z going away from the view camera in a left-handed coordinate system. The depth test is a pixel-by-pixel logical test that asks "Is this pixel behind another pixel at this location?". If the answer is yes, the pixel gets discarded; if no, the pixel travels further through the pipeline and the z-buffer is updated. There is also a more precise and bandwidth-saving depth buffer form called the w-buffer. It saves bandwidth by only having to send x/y/w coordinates over the bus, while the z-buffer in certain circumstances has to send all that plus z.
It is the only depth buffering method used by the Savage 2000-based boards. These cards emulate the z-buffer with the help of the hardware w-buffer.
The pixel shader is able to "drive" the depth test with the texdepth (ps.1.4 only) or the texm3x2depth (ps.1.3 only) instructions. These instructions can calculate the depth value used in the depth buffer comparison test on a per-pixel basis.
The Alpha Blending stage blends the pixel's data with the pixel data already in the Render Target. The blending happens with the following formula:
FinalColor = SourcePixelColor * SourceBlendFactor + DestPixelColor * DestBlendFactor
There are different flags that can be set with the D3DRS_SRCBLEND (SourceBlendFactor) and D3DRS_DESTBLEND (DestBlendFactor) render states in a SetRenderState() call. Alpha blending was formerly used to blend different texture values together, but nowadays it is more efficient to do that with the multitexturing or pixel shader unit, depending on hardware support. On newer hardware, alpha blending is mainly used to simulate different levels of transparency.
Dithering tries to fool the eye into seeing more colors than are actually present by placing differently colored pixels next to one another to create a composite color. For example, a blue pixel next to a yellow pixel would lead to a green appearance. This was a common technique in the days of 8-bit and 4-bit color systems. The dithering stage is switched on globally with D3DRS_DITHERENABLE.
The Render Target is normally the back buffer into which you render the finished scene, but it can also be the surface of a texture. This is useful for the creation of procedural textures or for re-using the results of former pixel shader executions. The render target is read by the stencil test, depth test and alpha blending stages.
To summarize the tasks of the pixel shader in the Direct3D pipeline:
Before getting our feet wet with a first look at the pixel shader architecture, let's take a look at the currently available tools:
Pixel Shader Tools
I already introduced Shader Studio, Shader City, DLL Detective, 3D Studio Max 4.x/gmax, NVASM, the Effects Browser, the Shader Debugger and the Photoshop plug-ins from NVIDIA in the first part. There is one thing to remember specific to pixel shaders: because the GeForce 4TI supports ps.1.1 - ps.1.3, it is possible that a few of the NVIDIA tools won't support ps.1.4. Additionally, there are the following pixel shader-only tools:
Microsoft Pixel Shader Assembler
The pixel shader assembler is provided with the DirectX 8.1 SDK. Like its counterpart, the vertex shader assembler, it does not come with any documentation. Its output looks like this:
The pixel shader assembler is used by the Direct3DX functions that compile pixel shaders and can also be used to pre-compile pixel shaders.
MFC Pixel Shader
The MFC Pixel Shader example provided with the DirectX 8.1 SDK comes with source. It is very useful for trying out pixel shader effects in a minute and debugging them. Just type in the pixel shader syntax you want to test and it will be compiled at once. Debugging information is provided in the window at the bottom. If your graphics card doesn't support a particular pixel shader version, you can always choose the reference rasterizer and test all desired pixel shader versions. In the following picture the reference rasterizer was chosen on a GeForce 3 to simulate ps.1.3:
ATI ShadeLab
The ATI ShadeLab helps with designing pixel shaders. After writing the pixel shader source into the big entry field in the middle, the compilation process starts immediately. To be able to load the pixel shader later, it has to be saved with the <Save> button and loaded with the <Load> button.
You may set up to six textures with specific texture coordinates, as well as the eight constant registers. The main advantage over the MFC Pixel Shader tool is the ability to load the constant registers and the textures on your own. This tool is provided on the Book DVD in the directory <Tools>.
With that overview on the available tools in mind, we can go one step further by examining a diagram with the pixel shader workflow.
Pixel Shader Architecture
The following diagram shows the logical pixel shader data workflow. All the grey fields mark functionality specific for ps.1.1 - ps.1.3. The blue field marks functionality that is specific to ps.1.4.
On the right half of the diagram the pixel shader arithmetic logic unit (ALU) is surrounded by four kinds of registers. The Color Registers stream iterated vertex color data from a vertex shader or the fixed-function vertex pipeline to the pixel shader. The Constant Registers provide constants to the shader, which are loaded by using the SetPixelShaderConstant() function or in the pixel shader with the def instruction. The Temporary Registers rn are able to store temporary data. The r0 register also serves as the output register of the pixel shader.
The Texture Coordinates can be supplied as part of the vertex format or can be read from certain kinds of texture maps. Texture coordinates are full precision and range, as well as perspective correct, when used in a pixel shader. There are D3DTSS_* texture operations that are not replaced by the pixel shader functionality; they can be used on up to four (ps.1.1 - ps.1.3) or six textures (ps.1.4). The Texture Stages hold a reference to the texture data, which might be a one-dimensional (for example in a cartoon shader), two-dimensional or three-dimensional texture (volume texture or cube map). Each value in a texture is called a texel. These texels are most commonly used to store color values, but they can contain any kind of data desired, including normal vectors for bump maps, shadow values, or general look-up tables.
Sampling occurs when a texture coordinate is used to address the texel data at a particular location with the help of the Texture Registers. The usage of the texture registers tn differs between the ps.1.1 - ps.1.3 (t0 - t3) and ps.1.4 (t0 - t5) implementations.
In the case of ps.1.1 - ps.1.3, the association between the texture coordinate set and the texture data is a one-to-one mapping, which is not changeable in the pixel shader. Instead, this association can be changed by using the oTn registers in a vertex shader, or by using the texture stage state flag D3DTSS_TEXCOORDINDEX together with SetTextureStageState() in case the fixed-function pipeline is used.
In ps.1.4, the texture data and the texture coordinate set can be used independently of each other in a pixel shader. The texture stage from which to sample the texture data is determined by the register number of rn, and the texture coordinate set that should be used for the texture data is determined by the number of the tn register specified.
Let's take a closer look at the different registers shown in the upper diagram:
Constant Registers (c0 - c7)
There are eight constant registers in every pixel shader specification. Every constant register contains four floating-point values or channels. They are read-only from the perspective of the pixel shader, so they can be used as source registers, but never as destination registers, in the pixel shader. The application can write and read constant registers with calls to SetPixelShaderConstant() and GetPixelShaderConstant(). A def instruction used in the pixel shader to load a constant register is effectively translated into a SetPixelShaderConstant() call by executing SetPixelShader().
The range of the constant registers goes from -1 to +1. If you pass anything outside of this range, it just gets clamped. Constant registers are not usable by ps.1.1 - ps.1.3 texture address instructions except for the texm3x3spec, which uses a constant register to get an eye-ray vector.
Output and Temporary Registers (ps.1.1 - ps.1.3: r0 + r1; ps.1.4: r0 - r5)
The temporary registers r0 - rn are used to store intermediate results. The output register r0 is the destination argument of the final pixel shader instruction, so r0 serves as both a temporary and the output register. In ps.1.4, r0 - r5 are also used to sample texture data from texture stages 0 - 5 in conjunction with the texture registers. In ps.1.1 - ps.1.3, the temporary registers are not usable by texture address instructions.
CreatePixelShader() will fail in shader pre-processing if a shader attempts to read from a temporary register that has not been written to by a previous instruction. All shaders have to write the final result to r0.rgba, or the shader will not assemble or validate.
Texture Registers (ps.1.1 - ps.1.3: t0 - t3; ps.1.4: t0 - t5)
The texture registers are used in different ways in ps.1.1 - ps.1.3 and in ps.1.4. In ps.1.1 - ps.1.3, the usage of one of the t0 - t3 texture registers determines the usage of a specific pair of texture data and texture coordinates. You can't change this one-to-one mapping in the pixel shader:
ps.1.1          // version instruction
tex t0          // samples the texture at stage 0
                // using texture coordinates from stage 0
mov r0, t0      // copies the color in t0 to output register r0
tex samples the texture data from texture stage 0 using the texture coordinate set that was set in the vertex shader with the oTn registers. In ps.1.4, having texture coordinates in their own registers means that the texture coordinate set and the texture data are independent of each other. The texture stage with the texture data from which to sample is determined by the destination register number (r0 - r5), and the texture coordinate set is determined by the source register (t0 - t5) specified in phase 1.
ps.1.4          // version instruction
texld r4, t5
mov r0, r4
The texld instruction samples the map set via SetTexture (4, lpTexture) using the sixth set of texture coordinates (set in the vertex shader with oT5) and puts the result into the fifth temporary register r4.
Texture registers that don't hold any values are sampled as opaque black (0.0, 0.0, 0.0, 1.0). In ps.1.1 - ps.1.3 they can be used as temporary registers. The texture coordinate registers in ps.1.4 are read-only and therefore not usable as temporary registers.
The maximum number of textures is the same as the maximum number of simultaneous textures supported (MaxSimultaneousTextures flag in D3DCAPS8).
Color Registers (ps.1.1 - ps.1.4: v0 + v1)
The color registers can contain per-vertex color values in the range 0 through 1 (saturated). It is common to load v0 with the vertex diffuse color and v1 with the specular color.
Using a constant color (flat shading) is more efficient than using a per-pixel Gouraud-shaded vertex color. If the shade mode is set to D3DSHADE_FLAT, the iteration of both vertex colors (diffuse and specular) is disabled. But regardless of the shade mode, fog will still be iterated later in the pipeline.
Pixel shaders have read-only access to color registers. In ps.1.4, color registers are only available during the second phase, which is the default phase. All of the other registers are available in both phases of ps.1.4.
One reason for using pixel shaders, compared to the multitexturing unit, is the higher precision of the pixel shader arithmetic logic unit.
The color registers vn have 8-bit precision per channel, i.e. 8-bit red, 8-bit green, etc. For ps.1.1 to ps.1.3, D3DCAPS8.MaxPixelShaderValue is a minimum of one, whereas in ps.1.4 D3DCAPS8.MaxPixelShaderValue is a minimum of eight. The texture coordinate registers provided by ps.1.4 use high-precision signed interpolators. The DirectX Caps Viewer reports a MaxTextureRepeat value of 2048 for the RADEON 8500. This value will be clamped to MaxPixelShaderValue when used with texcrd, because of the usage of an rn register as the destination register. In this case it is safest to stick with source coordinates within the MaxPixelShaderValue range. However, if tn registers are used for straight texture lookups (i.e. texld r0, t3), then the MaxTextureRepeat range should be expected to be honored by the hardware.
Using textures to store color values leads to a much higher color precision with ps.1.4.
High Level View on Pixel Shader Programming
Pixel Shading takes place on a per-pixel, per-object basis during a rendering pass.
Let's start by focusing on the steps required to build a pixel shader-driven application. The following list, ordered in the sequence of execution, shows the necessary steps:
The following text will work through this list step-by-step:
Check for Pixel Shader Support
It is important to check for the proper pixel shader support, because there is no feasible way to emulate pixel shaders. So in case there is no pixel shader support, or the required pixel shader version is not supported, there have to be fallback methods to a default behaviour (i.e. the multitexturing unit or ps.1.1). The following statement checks the supported pixel shader version:
if( pCaps->PixelShaderVersion < D3DPS_VERSION(1,1) ) return E_FAIL;
This example checks for support of pixel shader version 1.1. Support of at least ps.1.4 in hardware can be checked with D3DPS_VERSION(1,4). The D3DCAPS8 structure has to be filled in the startup phase of the application with a call to GetDeviceCaps(). In case the Common Files Framework provided with the DirectX 8.1 SDK is used, this is done by the framework. If your graphics card does not support the requested pixel shader version and there is no fallback mechanism that switches to the multitexturing unit, the reference rasterizer will jump in. This is the default behaviour of the Common Files Framework, but it is not useful in a game, because the REF is too slow.
Set Texture Operation Flags (D3DTSS_* flags)
The pixel shader functionality replaces the D3DTSS_COLOROP and D3DTSS_ALPHAOP operations and their associated arguments and modifiers that were used with the fixed-function pipeline. For example the following four SetTextureStageState() calls could be handled now by the pixel shader:
m_pd3dDevice->SetTextureStageState( 0, D3DTSS_COLORARG1, D3DTA_TEXTURE );
m_pd3dDevice->SetTextureStageState( 0, D3DTSS_COLORARG2, D3DTA_DIFFUSE );
m_pd3dDevice->SetTextureStageState( 0, D3DTSS_COLOROP, D3DTOP_MODULATE );
m_pd3dDevice->SetTexture( 0, m_pWallTexture );
But the following texture stage states are still observed:

D3DTSS_ADDRESSU
D3DTSS_ADDRESSV
D3DTSS_ADDRESSW
D3DTSS_BUMPENVMAT00
D3DTSS_BUMPENVMAT01
D3DTSS_BUMPENVMAT10
D3DTSS_BUMPENVMAT11
D3DTSS_BORDERCOLOR
D3DTSS_MAGFILTER
D3DTSS_MINFILTER
D3DTSS_MIPFILTER
D3DTSS_MIPMAPLODBIAS
D3DTSS_MAXMIPLEVEL
D3DTSS_MAXANISOTROPY
D3DTSS_BUMPENVLSCALE
D3DTSS_BUMPENVLOFFSET
D3DTSS_TEXCOORDINDEX
D3DTSS_TEXTURETRANSFORMFLAGS
The D3DTSS_BUMP* states are used with the bem, texbem and texbeml instructions.
In ps.1.1 - ps.1.3, all D3DTSS_TEXTURETRANSFORMFLAGS flags are available and have to be properly set for a projective divide, whereas in ps.1.4 the texture transform flag D3DTTFF_PROJECTED is ignored; there, the projective divide is accomplished by using source register modifiers with the texld and texcrd instructions.
The D3DTSS_TEXCOORDINDEX flag is valid only for fixed-function vertex processing. When rendering with vertex shaders, each stage's texture coordinate index must be set to its default value; the default index for each stage is equal to the stage index.
ps.1.4 gives you the ability to change the association of the texture coordinates and the textures in the pixel shader.
The texture wrapping, filtering, color border and mip mapping flags are fully functional in conjunction with pixel shaders.
A change of these texture stage states doesn't require the regeneration of the currently bound shader, because they are not available at shader compile time, and the driver can therefore make no assumptions about them.
Set Texture (with SetTexture())
After checking the pixel shader support and setting the proper texture operation flags, all textures have to be set by SetTexture(), as with the DX6/7 multitexturing unit. The prototype of this call is:
HRESULT SetTexture(DWORD Stage, IDirect3DBaseTexture8* pTexture);
The texture stage that should be used by the texture is provided in the first parameter and the pointer to the texture interface is provided in the second parameter. A typical call might look like:
m_pd3dDevice->SetTexture( 0, m_pWallTexture);
This call sets the already loaded and created wall texture to texture stage 0.
Define Constants (with SetPixelShaderConstant() / def)
The constant registers can be filled with SetPixelShaderConstant() or the def instruction in the pixel shader. Similar to the SetVertexShaderConstant() call, the prototype of the pixel shader equivalent looks like this:
HRESULT SetPixelShaderConstant( DWORD Register, CONST void* pConstantData, DWORD ConstantCount );
First the constant register must be specified in Register. The data to transfer into the constant registers is provided as a pointer in the second argument. The number of constant registers that have to be filled is provided in the last parameter. For example, to fill c0 - c3, you provide 0 as the Register and 4 as the ConstantCount.
The def instruction is an alternative to SetPixelShaderConstant(). When SetPixelShader() is called, it is effectively translated into a SetPixelShaderConstant() call. Using the def instruction makes the pixel shader easier to read. A def instruction in the pixel shader might look like this:
def c0, 0.30, 0.59, 0.11, 1.0
Each value of the constant source registers has to be in the range [-1.0..1.0].
Pixel Shader Instructions
With vertex shaders the programmer is free to order the instructions in any way that makes sense, whereas pixel shaders require a specific arrangement of the instructions used. This instruction flow differs between ps.1.1 - ps.1.3 and ps.1.4.
ps.1.1 - ps.1.3 allow four types of instructions, which must appear in the order shown below:
This example shows a per-pixel specular lighting model that evaluates the specular power with a lookup table. Every pixel shader starts with the version instruction. It is used by the assembler to validate the instructions that follow. Below the version instruction a constant definition can be placed with def. Such a def instruction is translated into a SetPixelShaderConstant() call when SetPixelShader() is executed.
The next group of instructions are the so-called texture address instructions. They are used to load data into the tn registers and additionally, in ps.1.1 - ps.1.3, to modify texture coordinates. Up to four texture address instructions can be used in a ps.1.1 - ps.1.3 pixel shader.
In this example the tex instruction is used to sample the normal map, which holds the normal data. texm* instructions are always used at least in pairs:
texm3x2pad t1, t0_bx2
texm3x2tex t2, t0_bx2
Both instructions calculate the proper u/v texture coordinate pair with the help of the normal map in t0 and sample the texture at stage 2 with it. The t2 register then holds the light map data with the specular power values. The last texture addressing instruction samples the color map into t3.
The next type of instructions are the arithmetic instructions. Up to eight arithmetic instructions can be used in a pixel shader.
mad adds t2 and c0, the ambient light, and multiplies the result with t3 and stores it into the output register r0.
Instructions in a ps.1.4 pixel shader must appear in the order shown below:
This is a simple transfer function, which can be useful for sepia or heat signature effects. It is explained in detail in [Mitchell]. The ps.1.4 pixel shader instruction flow has to start with the version instruction ps.1.4. After that, as many def instructions as needed may be placed into the pixel shader code. This example sets a Luminance constant value with one def.
Up to six texture addressing instructions can be used after the constants. The texld instruction loads a texture from texture stage 0 with the help of texture coordinate pair 0, which is chosen by using t0. In the up to eight arithmetic instructions that follow, color, texture or vector data can be modified. This shader uses only one arithmetic instruction to convert the texture map values to luminance values.
So far a ps.1.4 pixel shader has the same instruction flow as a ps.1.1 - ps.1.3 pixel shader, but the phase instruction allows it to double the number of texture addressing and arithmetic instructions. It divides the pixel shader into two phases: phase 1 and phase 2. That means that as of ps.1.4 a second pass through the pixel shader hardware can be done.
Another way to re-use the result of a former pixel shader pass is to render into a texture and use this texture in the next pixel shader pass. This is accomplished by rendering into a separate render target.
Of the additional six texture addressing instruction slots after the phase instruction, only one is used here, by the texld r5, r0 instruction. This instruction uses the color in r0, which was converted to luminance values before, as a texture coordinate to sample a 1D texture (the sepia or heat signature map), which is referenced by r5. The result is moved with a mov instruction into the output register r0.
Adding the number of arithmetic and addressing instructions shown in the pixel shader instruction flow above leads to 28 instructions. If no phase marker is specified, the default phase 2 allows up to 14 addressing and arithmetic instructions.
Both of the preceding examples show the ability to use dependent reads. A dependent read is a read from a texture map using a texture coordinate that was calculated earlier in the pixel shader. More details on dependent reads will be presented in the next section.
Texture Address Instructions
Texture address instructions operate on texture coordinates. The texture coordinate address is used to sample data from a texture map. Controlling the u/v pair, u/v/w triplet or u/v/w/q quadruplet of texture coordinates with address operations gives the ability to choose different areas of a texture map. Texture coordinate "data storage space" can also be used for purposes other than sampling texture data. The registers that reference texture coordinate data are useful to "transport" any kind of data from the vertex shader to the pixel shader via the oTn registers of a vertex shader. For example, the light or half-angle vector or a 3x2, 3x3 or 4x4 matrix can be provided to the pixel shader this way.
ps.1.1 - ps.1.3 texture addressing
The following diagram shows the ways that texture address instructions work in ps.1.1 - ps.1.3 for texture addressing:
All of the texture addressing happens "encapsulated" in the texture address instructions, marked with a grey field. That means results of texture coordinate calculations are not accessible in the pixel shader. The texture address instruction uses these results internally to sample the texture. The only way to get access to texture coordinates in the pixel shader is the texcoord instruction. This instruction converts texture coordinate data to color values, so that they can be manipulated by texture addressing or arithmetic instructions. These color values can then be used as texture coordinates to sample a texture with the help of the texreg2* instructions.
The following instructions are texture address instructions in ps.1.1 - ps.1.3. The d and s in the column named Para are the destination and source parameters of the instruction. The usage of texture coordinates is shown by two brackets around the texture register, for example (t0).
All of these texture address instructions use only the tn registers, with the exception of texm3x3spec, which uses a constant register for the eye-ray vector. In a ps.1.1 - ps.1.3 pixel shader, the destination register numbers of texture addressing instructions have to be in increasing order.
In ps.1.1 - ps.1.3, the ability to re-use a texture coordinate after modifying it in the pixel shader is available through specific texture address instructions that are able to modify the texture coordinates and sample a texture with them afterwards. The following diagram shows this dependency:
The texture address operations that sample a texture after modifying the texture coordinates are:
The following instructions sample a texture with the help of color values as texture coordinates. If one of these color values was manipulated before, the sampling becomes a dependent read.
Therefore these instructions are called general dependent texture read instructions.
As already stated above, each ps.1.1 - ps.1.3 pixel shader has a maximum of 8 arithmetic instructions and 4 texture address instructions. Each texture address instruction uses one of the supplied slots, with the exception of texbeml, which uses one texture address slot plus one arithmetic slot.
ps.1.4 Texture Addressing
To use texels or texture coordinates in ps.1.4, you always have to load them first with texld or texcrd. These instructions are the only way to get access to texels or texture coordinates. Texture coordinates can be modified after a conversion to color data via texcrd, with all available arithmetic instructions. As a result, texture addressing is more straightforward with ps.1.4.
The following instructions are texture address instructions in ps.1.4:
In ps.1.4, there are only four texture address instructions but, as mentioned before, all the arithmetic instructions can be used to manipulate texture address information. So there are plenty of tools to solve texture addressing tasks.
Valid source registers for first phase texture address instructions are tn. Valid source registers for second phase texture address instructions are tn and also rn. Each rn register may be specified as a destination to a texture instruction only once per phase. Aside from this, destination register numbers for texture instructions do not have to be in any particular order (as opposed to previous pixel shader versions in which destination register numbers for texture instructions had to be in increasing order).
No dependencies are allowed in a block of tex* instructions. The destination register of a texture address instruction cannot be used as a source in a subsequent texture address instruction in the same block of texture address instructions (same phase).
Dependent reads with ps.1.4 are not difficult to locate in the source. Pseudo code of the two possible dependent read scenarios in ps.1.4 might look like:
; transfer function
texld   ; load first texture
; modify color data here
phase
texld   ; sample second texture with changed color data as address
texcrd  ; load texture coordinates
; modify texture coordinates here
phase
texld   ; sample texture with changed address
Another way to think of it is that if the second argument to a texld after the phase marker is rn (not tn) then it's a dependent read, because the texture coordinates are in a temp register so they must have been computed:
.....
phase
texld rn, rn
The first three channels of an rn register have to be set before it is used as a source parameter. Otherwise the shader will fail.
To manipulate texture coordinates with arithmetic instructions, they have to be loaded into texture data registers (ps.1.1 - ps.1.3: tn; ps.1.4: rn) via texcoord or texcrd. There is one important difference between these two instructions. texcoord clamps to [0..1] and texcrd does not clamp at all.
If you compare texture addressing used in ps.1.1 - ps.1.3 and texture addressing used in ps.1.4, it is obvious that the more CISC-like approach uses much more powerful instructions to address textures compared to the more RISC-like ps.1.4 approach. On the other hand, ps.1.4 offers a greater flexibility in implementing different texture addressing algorithms by using all of the arithmetic instructions compared to ps.1.1 - ps.1.3.
The arithmetic instructions are used by ps.1.1 - ps.1.3 and ps.1.4 in a similar way, to manipulate texture or color data. Here is an overview of all available instructions in these implementations:
All arithmetic instructions can use the temporary registers rn. The rn registers are initially unset and cannot be used as source operands until they are written. This requirement is enforced independently for each channel of each rn register. In ps.1.4 the tn registers cannot be used with any arithmetic instruction, so they are restricted to texture addressing instructions (exception: texdepth).
Valid source registers for first phase arithmetic instructions are rn and cn. Valid source registers for second phase arithmetic instructions are rn, vn, and cn.
The comparison of ps.1.1 - ps.1.3 to ps.1.4 shows only a few differences. The ps.1.4-only instruction is bem. It substitutes the texbem and texbeml capabilities with an arithmetic operation in ps.1.4. Furthermore the cmp and cnd instructions are more powerful in ps.1.4. The scope of the arithmetic instructions is much bigger in ps.1.4 than in ps.1.1 - ps.1.3, because they are used for all texture addressing and blending tasks in ps.1.4.
As with the vertex shader, the pixel shader arithmetic instructions provide no if-statement, but this functionality can be emulated with cmp or cnd.
All of the rn.a channels are marked as unset at the end of the first phase and thus cannot be used as a source operand until written. As a result, the fourth channel of color data will be lost during the phase transition. This problem can be partly solved by re-ordering the instructions. For example, the following code snippet will lose the alpha value in r3.
ps.1.4
...
texld r3, t3
phase
...
mul r0, r2, r3
The next code snippet will not lose the alpha value:
ps.1.4
...
phase
texld r3, t3
...
mul r0, r2, r3
If no phase marker is present, then the entire shader is validated as being in the second phase.
All four channels of the shader result r0 must be written.
ps.1.1 - ps.1.3 and ps.1.4 are limited in different ways regarding the maximum number of source registers of the same type that can be read.
Read Port Limit
The read port limit gives you the maximum number of registers of the same register type that can be used as source registers in a single instruction.
The color registers have a read port limit of two in all versions. In the following code snippet, mad uses v0 and v1 as a source register:
ps.1.1             // Version instruction
tex t0             // Declare texture
mad r0, v0, t0, v1
This example reaches a read port limit of 2. The following example only reaches a read port limit of 1, because v0 is used twice:
ps.1.1             // Version instruction
tex t0             // Declare texture
mad r0, v0, t0, v0
The following pixel shader fails in ps.1.1:
ps.1.1
tex t0
tex t1
tex t2
mad r0, t0, t1, t2
It exceeds the read port limit of 2 for the texture registers. This shader won't fail with ps.1.2 and ps.1.3, because these versions have a read port limit of 3 for the tn registers. The functional equivalent in ps.1.4 won't fail either:
ps.1.4
texld r0, t0
texld r1, t1
texld r2, t2
mad r0, r0, r1, r2
Another example of the usage of three temporary registers in ps.1.4 in the same instruction is shown in the examples for the cmp and cnd instructions. In ps.1.4 the tn registers cannot be used with arithmetic instructions and none of the texture address instructions can use more than one tn register as a source, therefore it is not possible to cross the read port limit of the tn registers in ps.1.4.
There is no write port limit, because every instruction has only one destination register.
How instructions can be modified is best shown with the following diagram in mind:
This diagram shows the parallel pipeline structure of the pixel shader ALU. The vector or color pipeline handles the color values and the scalar or alpha pipeline handles the alpha value of a 32-bit value. There are "extensions" for instructions that enable the programmer to change the way data is read and/or written by the instruction. They are called Swizzling, Source Register Modifiers, Instruction Modifiers and Destination Register Modifiers.
We will work through all instruction "extensions" shown by Figure 12 in the following paragraphs from top to bottom.
In contrast to the more powerful swizzles that can be used in vertex shaders, the swizzling supported in pixel shaders is only able to replicate a single channel of a source register to all channels. This is done by so-called source register selectors.
The .r and .g selectors are only available in ps.1.4. The following instruction replicates the red channel to all channels of the source register.
r1.r ; r1.rgba = r1.r
As shown in Figure 12, selectors are applied first in the pixel shader ALU. They are only valid on source registers of arithmetic instructions.
The .b replicate functionality is available in ps.1.1 - ps.1.3 since the release of DirectX 8.1, but this swizzle is only valid together with an alpha write mask in the destination register of an instruction like this:
mul r0.a, r0, r1.b
ps.1.1 does not support the .b replicate in DirectX 8.0.
This means that the .b source swizzle cannot be used with dp3 in ps.1.1 - ps.1.3, because the only valid write destination masks for dp3 are .rgb or .rgba (write masks will be presented later):
dp3 r0.a, r0, c0.b ; fails
The ability to replicate the blue channel to alpha opens the door to a bandwidth optimization method, described in an NVIDIA OpenGL paper named "Alpha Test Tricks" [Dominé01]. A pixel shader allows a dot product operation at the pixel level between two RGB vectors. Therefore, one can set one of the vectors to (1.0, 1.0, 1.0), turning the dot product into a summation of the other vector's components:
(R, G, B) dot (1.0, 1.0, 1.0) = R + G + B
In the following code snippet the pixel shader instruction dp3 calculates the sum of the RGB color by replicating the scalar result into all four channels of r1.
ps.1.1
def c0, 1.0, 1.0, 1.0, 1.0
tex t0
dp3 r1, t0, c0
mov r0.a, r1.b
+mov r0.rgb, t0
An appropriate alpha test setup might look like this:
dev->SetRenderState(D3DRS_ALPHAREF, (DWORD)0x00000001);
dev->SetRenderState(D3DRS_ALPHATESTENABLE, TRUE);
dev->SetRenderState(D3DRS_ALPHAFUNC, D3DCMP_GREATEREQUAL);
If the color data being rasterized is more opaque than the reference value (D3DCMP_GREATEREQUAL), the pixel is written. Otherwise, the rasterizer ignores the pixel altogether, saving the processing required, for example, to blend the two colors.
An even more clever method to utilize the alpha test for fillrate optimization was shown by ShaderX author Dean Calver on the public Microsoft DirectX discussion forum [Calver]. He uses an alpha map to lerp three textures this way:
; Dean Calver
; 2 lerps from a combined alpha texture
ps.1.1
tex t0               ; combined alpha map
tex t1               ; texture0
tex t2               ; texture1
tex t3               ; texture2
mov r1.a, t0.b       ; copy blue to r1.a
lrp r0, r1.a, t2, t1 ; lerp between t1 and t2
lrp r0, t0.a, t3, r0 ; lerp between result and t3
The .a replicate is analogous to the D3DTA_ALPHAREPLICATE flag in the DX6/7 multitexturing unit.
To move any channel to any channel, use dp3 to replicate the channel across all channels, and then mask it out with constants set via def instructions. The following pixel shader moves the red channel to the blue channel:

; move the red to blue and output combined
ps.1.1
def c0, 1.f, 0.f, 0.f, 0.f ; select red channel
def c1, 0.f, 0.f, 1.f, 0.f ; mask for blue channel
def c2, 1.f, 1.f, 0.f, 1.f ; mask for all channels but blue
tex t0
dp3 r0, t0, c0     ; copy red to all channels
mul r0, r0, c1     ; mask so only blue is left
mad r0, t0, c2, r0 ; remove blue from original texture and
                   ; add red shifted into blue
In ps.1.4, there are specific source register selectors for texld and texcrd:
texld and texcrd are only able to use three channels of their source registers, so these selectors provide the option of taking the third component from either the third or the fourth component of the source register. Here are a few examples of how to use these selectors:
texld r0, t1.rgb
...
texcrd r1.rgb, t1.rga
texcrd r4.rgb, t2.rgb
An overview on all possible source register selectors, modifiers and destination write masks is provided with the description of the texcrd and texld instructions above.
Source Register Modifiers
Source register modifiers are useful to adjust the range of register data in preparation for the instruction or to scale the value.
All modifiers can be used on arithmetic instructions. In ps.1.1 you can use the signed scale modifier _bx2 on the source register of any texm3x2* and texm3x3* instruction. In ps.1.2 and ps.1.3 it can be used on the source register of any texture address instruction.
bias subtracts 0.5 from all components. It allows the same operation as D3DTOP_ADDSIGNED in the DX6/7 multitexturing unit.
It is used to change the range of data from [0..1] to [-0.5..0.5]. Applying bias to data outside this range may produce undefined results. Data outside this range can be saturated to the range [0..1] with the _sat instruction modifier before being used with a biased source register (more on instruction modifiers in the next section).
A typical example for this modifier is detail mapping, shown in the add example.
Invert complements (1 - value) each channel of the specified source register. The following code snippet uses inversion to complement the source register r1:
mul r0, r0, 1-r1 ; multiply by (1.0 - r1)
Negate negates all source register components by using a minus sign before the register. This modifier is mutually exclusive with the invert modifier, so it cannot be applied to the same register.
Scale with the _x2 modifier is only available in ps.1.4. It multiplies a value by two before using it in the instruction, and it is mutually exclusive with the invert modifier.
Signed scaling with the _bx2 modifier is a combination of bias and scale, so it subtracts 0.5 from each channel and scales the result by 2. It remaps input data from unsigned [0..1] to signed [-1..1] values. As with bias, using data outside the range [0..1] may produce undefined results. This modifier is typically used in dp3 instructions. An example of this is presented with the description of the dp3 instruction above. Signed scaling is mutually exclusive with the invert modifier.
None of these modifiers change the content of the source register they are applied to; they affect only the data read from the register, so the value stored in the source register stays unchanged.
Modifiers and selectors may be combined freely. In the following example r1 uses the negate and signed scaling modifiers as well as a red selector: -r1_bx2.r
With the help of the source register modifiers, per-pixel normalization is possible in the pixel shader. Per-pixel normalization can be used instead of a cubemap:
; Assuming v0 contains the unnormalized biased & scaled vector (just
; like a normal map), r0 will end up with a very close to normalized
; result.
; This trick is also useful to do 'detail normal maps' by adding
; a vector to a normal map, and then renormalizing it.
dp3 r0, v0_bx2, v0_bx2        ; r0 = N . N
mad r0, v0_bias, 1-r0, v0_bx2 ; (v0_bias * (1-r0)) + v0_bx2
                              ; ((N - 0.5) * (1 - N.N)) + (N - 0.5) * 2
Normalization requires calculating 1/sqrt(N.N). This code snippet normalizes the normal vector with a standard Newton-Raphson iteration for approximating a reciprocal square root. This trick was shown by Sim Dietrich in the Microsoft DirectX forum [Dietrich-DXDev].
In the first line, the normal vector N is biased and scaled by _bx2 and then multiplied with itself via a dot product. The result is stored in r0. In the next line, the biased normal is multiplied with the inverse of r0, and the biased and scaled normal is added to the product.
There are additional modifiers specific to the texld and texcrd instructions in the ps.1.4 implementation. These modifiers provide projective divide functionality by dividing the x and y values by either the z or w value; therefore projective dependent reads are possible in ps.1.4.
These modifiers provide a functional replacement for the D3DTTFF_PROJECTED flag of the D3DTSS_TEXTURETRANSFORMFLAGS texture stage state in the pixel shader. A typical instruction would look like this:
texcrd r2.rg, t1_dw.xyw ; third channel unset
The modifier copies x/w and y/w from t1 into the first two channels of r2. The third and fourth channels of r2 are uninitialized; any previous data written to these channels is lost. The per-pixel perspective divide is useful, for example, for projective textures.
The restrictions for the two texture addressing instructions are:
After the swizzling of the source register channels and the modification of the values read from a source register with source register modifiers, the instruction starts executing. As shown in Figure 12, the instruction modifiers are now applied. These are indicated as an appendix to the instruction, connected via an underscore. Instruction modifiers are used to change the output of an instruction. They can multiply or divide the result, or clamp it to [0..1]:
Instruction modifiers can be used only on arithmetic instructions. The _x8, _d4 and _d8 modifiers are new in ps.1.4. _sat may be used alone or combined with one of the other modifiers, e.g. mad_d8_sat.
Multiplier modifiers are useful to scale a value. Note that any such scaling reduces accuracy of results. The following examples scale the results by using _x2 or _x4:
ps.1.1
tex t0
tex t1
mul_x2 r0, t1, t0     ; (t1 * t0) * 2
...
mul_x2 r0, 1-t1, t0   ; t0 * inverse(t1) * 2
...
mad_x2 r0, t1, v0, t0 ; ((t1 * v0) + t0) * 2
...
mul_x4 r0, v0, t0     ; (v0 * t0) * 4
mul_x4 r1, v1, t1     ; (v1 * t1) * 4
add r0, r0, r1        ; (v0*t0 * 4) + (v1*t1 * 4)
The _x2 modifier does the same as a shift left in C/C++.
The _d2 modifier does the same as a shift right in C/C++. Here is a more complex example:
; Here is an example for per-pixel area lighting
ps.1.1
def c1, 1.0, 1.0, 1.0, 1.0    ; sky color
def c2, 0.15, 0.15, 0.15, 1.0 ; ground color
def c5, 0.5, 0.5, 0.5, 1.0
tex t0 ; normal map
tex t1 ; base texture
dp3_d2 r0, v0_bx2, t0_bx2 ; v0.rgb is hemi axis in tangent space
                          ; dot normal with hemi axis
add r0, r0, c5            ; map into range
lrp r0, r0, c1, c2
mul r0, r0, t1            ; modulate base texture
This pixel shader biases the hemisphere axis in v0 and scales it by 2. The same is done to the values of the normal map. The _d2 modifier of dp3_d2 divides the result by 2. The add instruction adds 0.5 to the vector in r0. lrp uses r0 as the proportion to linearly interpolate between the sky color in c1 and the ground color in c2.
The saturation modifier _sat clamps each component or channel of the result to the range [0..1]. It is most often used to clamp dot products, as in the following code snippet:
dp3_sat r0, t1_bx2, r0     ; N.H
dp3_sat r1.rgb, t1_bx2, r1 ; N.L
The result of the dot product operation of the normal vector with the half angle vector and the result of the dot product operation of the normal and the light vector are saturated. That means the values in r0 and r1.rgb are clamped to [0..1].
Destination Register Modifiers/Masking
A destination register modifier or write mask controls which channels of a register are updated, so it only alters the value of the channels it is applied to.
Write masks are supported for arithmetic instructions only. The following destination write masks are available for all arithmetic instructions:
In ps.1.1 - ps.1.3 a pixel shader can only use the .rgb or .a write masks. The arbitrary write mask in ps.1.4 allows any set of channels in the order r, g, b, a to be combined. It is possible to choose for example:
mov r0.ra, r1
If no destination write mask is specified, the destination write mask defaults to the .rgba case, which updates all channels in the destination register. An alternate syntax for the r, g, b, a channels is x, y, z, w.
As with the source register selectors and source register modifiers, the texld and texcrd instructions have additional write masks and write mask rules. texcrd can write only to the .rgb channels; it additionally supports a write mask that covers the first two channels, .rg or .xy. texld always writes all four channels of the destination register; no alternative write mask is available.
The usage of write masks is shown in the following ps.1.4 pixel shader that handles diffuse bump mapping with two spotlights (taken from the file 14_bumpspot.sha of the ATI Treasure Chest example program):
ps.1.4
def c0, 1.0f, 1.0f, 1.0f, 1.0f   ; Light 1 Color
def c1, 1.0f, -0.72f, 1.0f, 1.0f ; Light 1 Angle scale(x) and bias(Y)
def c2, 1.0f, 1.0f, 1.0f, 1.0f   ; Light 2 Color
def c3, 0.25f, 0.03f, 1.0f, 1.0f ; Light 2 Angle scale(x) and bias(Y)
texcrd r0.rgb, t2                ; Spot light 1 direction
texcrd r1.rgb, t4                ; Spot light 2 direction
texld r2, t1                     ; Light 1 to Point vector
texld r3, t3                     ; Light 2 to Point vector
texcrd r4.rgb, t1                ; Light 1 space position for attenuation
texcrd r5.rgb, t3                ; Light 2 space position for attenuation
dp3_sat r4.x, r4, r4             ; Light 1 Distance^2
dp3_sat r5.x, r5, r5             ; Light 2 Distance^2
dp3_sat r4.y, r0, r2_bx2         ; Light 1 Angle from center of spotlight
dp3_sat r5.y, r1, r3_bx2         ; Light 2 Angle from center of spotlight
mad_x4 r4.y, r4.y, c1.x, c1.y    ; Light 1 scale and bias for angle
mad_x4 r5.y, r5.y, c3.x, c3.y    ; Light 2 scale and bias for angle
phase
texld r0, t0                     ; Base Map
texld r1, t0                     ; Normal Map
texld r4, r4                     ; Distance/Angle lookup map
texld r5, r5                     ; Distance/Angle lookup map
dp3_sat r2.rgb, r1_bx2, r2_bx2   ; *= (N.L1)
mul_x2 r2.rgb, r2, r4.r          ; Attenuation from distance and angle
mad r2.rgb, r2, c0, c7           ; * Light Color + Ambient
dp3_sat r3.rgb, r1_bx2, r3_bx2   ; *= (N.L2)
mul_x2 r3.rgb, r3, r5.r          ; Attenuation from distance and angle
mad r3.rgb, r3, c2, r2           ; * Light 2 Color + Light 1 Color + Ambient
mul r0.rgb, r3, r0               ; Modulate by base map
+mov r0.a, c0
There are four different write masks used throughout this shader: the .rgb, .x, .y and .a write masks. The .rgb write masks on the texcrd instructions are imperative, and texld can't handle write masks other than .rgba, which is the same as applying no explicit write mask. The first four dp3 instructions write to the x and y channels of the r4 and r5 registers, and the following two mad instructions write to their y channels. These write masks are not supported by ps.1.1 - ps.1.3. The usage of the .rgb write mask in the second phase of this shader is supported by all implementations. The last two lines of this shader show the pairing of two instructions using co-issue. We will discuss instruction pairing or "co-issuing" in the next section.
As shown above in Figure 12, there are two pipelines: one for the color data and one for the alpha data. Because of the parallel nature of these pipelines, instructions that write color data and instructions that write only alpha data can be paired. This helps to make better use of the hardware's fill-rate. Only arithmetic instructions can be co-issued, with the exception of dp4. Pairing, or co-issuing, is indicated by a plus sign (+) preceding the second instruction of the pair. The following shader fragment shows three pairs of co-issued instructions:
dp3_sat r1.rgb, t1_bx2, r1
+mul r1.a, r0, r0
mul r1.rgb, r1, v0
+mul r1.a, r1, r1
add r1.rgb, r1, c7
+mul_d2 r1.a, r1, r1
First a dp3 instruction is paired with a mul, then a mul instruction with a mul, and last an add instruction with a mul. In ps.1.1 - ps.1.3, pairing always happens with the help of a pair of .rgb and .a write masks. In ps.1.4, pairing of the .r, .g or .b write masks together with an .a-masked destination register is possible. The line
mul r1.a, r0, r0
only writes the alpha value of the result of the multiplication of r0 with itself into r1.a.
Co-issued instructions are considered a single entity, the result from the first instruction is not available until both instructions are finished and vice versa. The following shader will fail shader validation:
ps.1.1
def c0, 1, 1, 1, 1
mov r0, v0
dp3 r1.rgb, r0, c0
+mov r0.a, r1.b
mov tries to read r1.b, but dp3 has not written to r1.b at that point. The shader fails, because r1.b was not initialized before.
This can be troublesome when r1.b was initialized earlier by another instruction: in that case the validator will not catch the bug and the results will not look as expected.
Another restriction to pay attention to is the maximum of three different register types that can be used across two co-issued instructions.
GeForce 3/4TI has a problem with co-issuing instructions in the 8th arithmetic instruction slot: it stops showing the results when a co-issue happens in the 8th arithmetic instruction, whereas the reference rasterizer (REF) works as expected. The following meaningless pixel shader doesn't show anything with driver version 28.32:

ps.1.1
tex t0                   ; color map
tex t1                   ; normal map
dp3 r0, t1_bx2, v1_bx2   ; dot(normal, half)
mul r1, r0, r0           ; raise it to the 32nd power
mul r0, r1, r1
mul r1, r0, r0
mul r0, r1, r1
mul r1, r0, r0
mul r0, r1, r1
; assemble final color
mul r0.rgb, t0, r0
+mov r0.a, r1
Assemble Pixel Shader
After checking for pixel shader support, setting the proper textures with SetTexture(), writing the pixel shader and setting the needed constant values, the pixel shader has to be assembled. This is necessary because Direct3D uses pixel shaders as byte-code.
Assembling the shader is helpful in finding bugs earlier in the development cycle.
At the time of this writing there are three different ways to compile a pixel shader:
Pre-compiled Shaders

Write the pixel shader in a separate ASCII file, for example test.psh, and compile it with a pixel shader assembler (the Microsoft Pixel Shader Assembler or NVASM) to produce a byte-code file, which could be named test.pso. This way, not every person will be able to read and modify your source.
On the Fly Compiled Shaders
Write the pixel shader in a separate ASCII file or as a char string in your *.cpp file and compile it "on the fly" while the application starts up, using the D3DXAssembleShader*() functions.
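A minimal sketch of the on-the-fly route, using the D3DX8 function D3DXAssembleShader(); the trivial ps.1.1 program and the m_pd3dDevice/m_dwPixShader names are placeholders, and the constant table and error buffer parameters are ignored here for brevity:

```cpp
// pixel shader source as a char string in the *.cpp file
const char szShader[] =
    "ps.1.1\n"
    "tex t0\n"       // sample the color map
    "mov r0, t0\n";  // output it unchanged

LPD3DXBUFFER pCode = NULL;

// assemble the string into byte-code at application start-up
if (SUCCEEDED(D3DXAssembleShader(szShader, sizeof(szShader) - 1,
                                 0, NULL, &pCode, NULL)))
{
    m_pd3dDevice->CreatePixelShader(
        (DWORD*)pCode->GetBufferPointer(), &m_dwPixShader);
    pCode->Release();
}
```

Passing a non-NULL last parameter would return a buffer with the assembler's error messages, which is useful during development.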
Shaders in Effect Files
Write the pixel shader source into an effect file and open this effect file when the application starts up. The pixel shader is compiled while the effect file is read with D3DXCreateEffectFromFile(). It is also possible to pre-compile an effect file. This way, most of the handling of pixel shaders is simplified and handled by the effect file functions.
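A rough sketch of this route, assuming the D3DX8 signature of D3DXCreateEffectFromFile() and a hypothetical effect file named shader.fx; error handling is omitted for brevity:

```cpp
LPD3DXEFFECT pEffect = NULL;

// load the effect file; its pixel shader is compiled while reading.
// a non-NULL last parameter would receive compilation errors
if (SUCCEEDED(D3DXCreateEffectFromFile(m_pd3dDevice, _T("shader.fx"),
                                       &pEffect, NULL)))
{
    // the effect framework now owns the compiled pixel shader and
    // sets it when the corresponding technique/pass is activated
}
```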
Pre-compiled shaders should be the preferred way of handling shaders, since compilation happens during development of the code, i.e. at the same time the *.cpp files are compiled.
Creating a Pixel Shader
The CreatePixelShader() function is used to create and validate a pixel shader.
HRESULT CreatePixelShader( CONST DWORD* pFunction, DWORD* pHandle );
This function takes the pointer to the pixel shader byte-code in pFunction and returns a handle to the pixel shader in pHandle. A typical piece of source might look like this:
TCHAR        Shad[MAX_PATH];
LPD3DXBUFFER pCode = NULL;

DXUtil_FindMediaFile(Shad, _T("environment.psh"));
if (FAILED(D3DXAssembleShaderFromFile(Shad, 0, NULL, &pCode, NULL)))
    return E_FAIL;
if (FAILED(m_pd3dDevice->CreatePixelShader(
        (DWORD*)pCode->GetBufferPointer(), &m_dwPixShader)))
    return E_FAIL;
DXUtil_FindMediaFile() helps you find the ASCII file, D3DXAssembleShaderFromFile() assembles it, and CreatePixelShader() returns the handle in m_dwPixShader.
The pointer pCode to the ID3DXBuffer interface is used to store the object code; GetBufferPointer() returns a pointer to this object code.
Set Pixel Shader
You set a pixel shader for a specific amount of vertices by using the SetPixelShader() function before the DrawPrimitive*() call for these vertices:
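A typical call, using the handle m_dwPixShader from the creation code above (the DrawPrimitive() parameters are placeholders), might look like this:

```cpp
// set the pixel shader; it stays active for all subsequent
// draw calls until another pixel shader (or 0) is set
m_pd3dDevice->SetPixelShader(m_dwPixShader);

// draw the geometry that should be shaded by it
m_pd3dDevice->DrawPrimitive(D3DPT_TRIANGLELIST, 0, dwNumTriangles);
```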
The only parameter that has to be provided is the handle of the pixel shader created by CreatePixelShader(). The pixel shader is then executed for every pixel that is covered by the primitives of the DrawPrimitive*() call.
Free Pixel Shader resources
When the game shuts down, or before a device change, the resources held by the pixel shader have to be freed. This is done by calling DeletePixelShader() with the pixel shader handle like this:
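A minimal cleanup sketch, assuming the handle m_dwPixShader used in the creation code above:

```cpp
// delete the pixel shader and invalidate the handle, so the
// cleanup code can safely be called more than once
if (m_dwPixShader)
{
    m_pd3dDevice->DeletePixelShader(m_dwPixShader);
    m_dwPixShader = 0;
}
```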
We have walked step by step through the pixel shader creation process. Let's summarize what was shown so far:
What happens next?
In the next part of this introduction, named "Programming Pixel Shaders", we will start with a first basic pixel shader program and discuss a few basic algorithms and how to implement them with pixel shaders.
The best resource to accompany this article is the pixel shader assembler reference in the Direct3D 8.1 documentation at
DirectX Graphics->Reference->Pixel Shader Assembler Reference
I'd like to thank the individuals who were involved in proof-reading and improving this article (in alphabetical order):
© 2000 - 2002 Wolfgang Engel, Frankenthal, Germany