The final output of any 3D graphics hardware consists of pixels. Depending on the resolution, in excess of 2 million pixels may need to be rendered, lit, shaded and colored. Prior to DirectX 8.0, Direct3D used a fixed-function multitexture cascade for pixel processing. The effects possible with this approach were very limited; they depended on the implementation of the graphics card device driver and on the specific underlying hardware. A programmer was restricted to the graphics algorithms implemented by these.
With the introduction of shaders in DirectX 8.0 and the improvements of pixel shaders in DirectX 8.1, a whole new programming universe can be explored by game/demo coders. Pixel shaders are small programs that are executed on individual pixels, giving their users an unprecedented level of hardware control.
The third part of this "Introduction to Shader Programming", named "Fundamentals of Pixel Shaders", shows you the basics you need to start programming pixel shaders.
The fourth and last part will present many pixel shader concepts and algorithms with example code.
Why use Pixel Shaders?
At the time of this writing, pixel shaders are supported by GeForce 3/4TI- and RADEON 8500-based cards. Unlike vertex shaders, however, pixel shaders cannot feasibly be emulated in software.
The best argument for using pixel shaders is to take a look at a few demos that use them :-) ... or just one phrase: per-pixel non-standardized lighting. The gain in visual experience is enormous. You can use membrane shaders for balloons and skins, Kubelka-Munk shaders for translucency effects, fur/hair shaders and new lighting formulas that lead to a totally new lighting experience. Think of a foreign planet with three moons moving faster than the two suns, or of a planet surface consisting of different crystalline substances, reflecting light in different ways. The following list should give you a glimpse of the many types of effects that are possible by using pixel shaders now:
Not to mention the effects that have not been discovered yet, or that have so far appeared only in scientific journals. These visual effects are waiting to be implemented by you :-).
One of the biggest catches of pixel shaders is that they often have to be "driven" by the vertex shader. For example, to calculate per-pixel lighting the pixel shader needs the orientation of the triangle, the orientation of the light vector and, in some cases, the orientation of the view vector. A look at the graphics pipeline shows the relationship between vertex and pixel shaders.
Pixel Shaders in the Pipeline
The following diagram shows the DX6/7 multitexturing unit and the new pixel shader unit. On pixel shader-capable hardware, the pixel shader increases the number of transistors on the graphics chip, because it is added to the already existing DX6/7 multitexturing hardware. It is therefore also a functionally independent unit that the developer can choose instead of the DX6/7 multitexturing unit.
But what happens in the 3D pipeline before the pixel shader? A vertex leaves the vertex shader as a transformed and colored vertex. The so-called Backface Culling removes all triangles that are facing away from the viewer or camera; by default these are the triangles whose vertices are wound counter-clockwise. On average, half of your game world's triangles will be facing away from the camera at any given time, so this helps reduce rendering time. A critical point is the use of translucent or transparent front triangles, where the back faces may need to remain visible. Depending on what is going to be rendered, backface culling can be switched off with the D3DCULL_NONE flag.
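In code, culling is controlled through a render state. The following minimal sketch (assuming dev is a valid IDirect3DDevice8 pointer) switches it off for translucent geometry:

    // draw both sides of each triangle, e.g. for transparent objects
    dev->SetRenderState(D3DRS_CULLMODE, D3DCULL_NONE);
    // D3DCULL_CCW (the default) culls counter-clockwise wound triangles,
    // D3DCULL_CW culls clockwise wound ones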
Backface Culling uses the cross product of two sides of a triangle to calculate a vector that is perpendicular to the plane formed by these two sides. This vector is the face normal, and its direction determines whether the triangle is front- or backfacing. Because the Direct3D API always uses the same vertex order to calculate the cross product, it is known whether a triangle's vertices are "wound" clockwise or counter-clockwise.
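The following C++ sketch shows the idea behind this test (illustrative only; the driver and hardware do this internally):

    struct Vec3 { float x, y, z; };

    // cross product of two triangle edges yields the face normal
    Vec3 Cross(const Vec3& a, const Vec3& b)
    {
        Vec3 n = { a.y * b.z - a.z * b.y,
                   a.z * b.x - a.x * b.z,
                   a.x * b.y - a.y * b.x };
        return n;
    }

For a triangle v0, v1, v2 in screen space, the sign of the z component of Cross(v1 - v0, v2 - v0) reveals whether the vertices are wound clockwise or counter-clockwise, and therefore whether the triangle faces the camera.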
User clip planes can be set by the developer so that the graphics card clips triangles that lie outside of these planes, thereby reducing the number of calculations. How to set user clip planes is shown in the Direct3D 8.1 example named "Clip Mirror" and in the ATI RADEON 8500 Nature demo. The RADEON and RADEON 8500 support six independent user clip planes in hardware. The clip planes on the GeForce 1/2/3/4TI chips are implemented using texture stages, which means that every two user clip planes you use eat up one texture stage.
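As a minimal sketch (again assuming a valid device pointer dev; the plane values are an arbitrary example), a user clip plane is set like this:

    float plane[4] = { 0.0f, 1.0f, 0.0f, 0.0f }; // plane equation (a, b, c, d):
                                                 // keep everything above y = 0
    dev->SetClipPlane(0, plane);                 // set plane index 0
    dev->SetRenderState(D3DRS_CLIPPLANEENABLE, D3DCLIPPLANE0);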
It looks like NVIDIA no longer exposes the cap bits for these: the DirectX Caps Viewer has reported MaxUserClipPlanes == 0 since the release of the 12.41 drivers.
One alternative to clip planes is shown in the "TexKillClipPlanes" example delivered in the NVIDIA Effects Browser, where the pixel shader instruction texkill is used to achieve functionality similar to clip planes.
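A sketch of this approach (register usage is illustrative; the vertex shader is assumed to write the signed distance to the clip plane into t0):

    ps.1.1
    texkill t0          // discard the pixel if t0.x, t0.y or t0.z < 0
    tex t1              // sample the base texture bound to stage 1
    mov r0, t1          // output the texture color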
Another alternative to clip planes is guard band clipping, as supported by GeForce hardware [Dietrich01]. The basic idea of guard band clipping is that hardware with a guard band region can accept triangles that are partially or totally off-screen, thus avoiding expensive clipping work. There are four cap fields for guard band support in the D3DCAPS8 structure: GuardBandLeft, GuardBandRight, GuardBandTop, and GuardBandBottom. These represent the boundaries of the guard band region and have to be taken into account by the clipping code. Guard band clipping is handled automatically by switching on clipping in Direct3D.
Frustum Clipping is performed on the viewing frustum. Primitives that lie partially or totally off-screen must be clipped to the screen or viewport boundary, which is represented by a 3D viewing frustum. The viewing frustum can be visualized as a pyramid, with the camera positioned at the tip. The view volume defines what the camera will and won't see. The entire scene must fit between the near and far clipping planes and also be bounded by the sides, the bottom and the top of the frustum. A given object in relation to the viewing frustum can fall into one of three categories [Watt92]: it lies completely inside the frustum, completely outside it, or it intersects the frustum boundary and must be clipped.
Translating the definition of the viewing frustum above into homogeneous coordinates gives us the clipping limits:
-w <= x <= w
-w <= y <= w
0 <= z <= w
This would be a slow process if it had to be done by the CPU, because for each edge of each triangle that crosses the viewport boundary an intersection point must be calculated, and each parameter of the vertex (x, y, z, diffuse r, g, b, specular r, g, b, alpha, fog, u and v) must be interpolated accordingly. Therefore frustum clipping is done by modern graphics hardware on the GPU essentially for free, with clip planes and guard band clipping.
After frustum clipping, the homogeneous (or perspective) divide happens. This means that the x-, y- and z-coordinates of each vertex in homogeneous coordinates are divided by w. The perspective divide makes nearer objects larger and farther objects smaller, as you would expect when viewing a scene in reality. After this division, the coordinates are in a normalized space:
-1 <= x/w <= 1
-1 <= y/w <= 1
0 <= z/w <= 1
Why do we need this division by w? By definition, every point gets a fourth component w that measures distance along an imaginary fourth-dimensional axis called w. For example, (4, 2, 8, 2) represents the same point as (2, 1, 4, 1). Such a w-value is useful, for example, to translate an object, because a 3x3 matrix is not able to translate an object without changing its orientation. The fourth coordinate, w, can be thought of as carrying the perspective information necessary to represent a perspective transformation.
The coordinates that result from the perspective divide are referred to as normalized device coordinates (NDCs). These coordinates are now mapped to the screen by transforming into screen space via the so-called Viewport Mapping. To map the NDCs to screen space, the following formula is used for a resolution of 1024x768:
ScreenX = NDC(X) * 512 + 512
ScreenY = NDC(Y) * 384 + 384
The minimum and maximum values of (X, Y) are (-1, -1) and (1, 1). When the NDCs are (-1, -1), the screen coordinates are
ScreenX = -1 * 512 + 512 = 0
ScreenY = -1 * 384 + 384 = 0
This point (0, 0) lies in the upper-left corner. The lower-right corner, for example, is reached with NDCs of (1, 1). Although z and w values are retained for depth buffering tests, screen space is essentially a 2D coordinate system, so only x and y values need to be mapped to the screen resolution.
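Expressed as a small helper function (a sketch following the simplified formula above, with the half-resolution values hard-coded for 1024x768):

    // map normalized device coordinates to 1024x768 screen coordinates
    inline void NdcToScreen(float ndcX, float ndcY,
                            float& screenX, float& screenY)
    {
        screenX = ndcX * 512.0f + 512.0f;   // 512 = 1024 / 2
        screenY = ndcY * 384.0f + 384.0f;   // 384 = 768 / 2
    }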
Now comes the Triangle Setup, where the life of the vertices ends and the life of the pixels begins. Triangle by triangle, it computes the parameters required for rasterization; among other things, it defines the pixel coordinates of the triangle outline. This means it defines the first and the last pixel of the triangle, scan line by scan line.
Then the Rasterizer interpolates color and depth values for each pixel from the color and depth values of the vertices. These values are interpolated using a weighted average of the values at the triangle's vertices: the color and depth data of pixels closer to a given vertex more closely approximate the values of that vertex. Then the rasterizer fills in the pixels of each scan line.
In addition, the texture coordinates are interpolated for use during the Multitexturing/Pixel Shader stage in a similar way.
The rasterizer is also responsible for Direct3D multisampling. Because it is done in the rasterizer stage, multisampling only affects triangles and groups of triangles, not lines. It increases the effective resolution of polygon edges, and therefore of the depth and stencil tests as well. The RADEON 7x00/GeForce 3 indicate support for this kind of spatial anti-aliasing with the D3DPRASTERCAPS_STRETCHBLTMULTISAMPLE cap flag, but both ignore the D3DRS_MULTISAMPLEMASK render state that controls rendering into the sub-pixel samples, whereas the RADEON 8500 is able to mask the sub-samples with different bit patterns by using this render state. By affecting only specific sub-samples, effects like motion blur, depth of field and others can be realized.
Alternatively, motion blur and depth of field effects are possible with the help of vertex and pixel shaders; see the NVIDIA examples in the Effects Browser.
The pixel shader is not involved at the sub-pixel level. It gets the already multisampled pixels along with z values, color values and texture information. The already Gouraud-shaded or flat-shaded pixel might be combined in the pixel shader with the specular color and the texture values fetched from the texture map. For this task the pixel shader provides instructions that affect texture addressing, and instructions to combine the texture values with each other in different ways.
There are five different pixel shader standards supported by Direct3D 8.1. Currently, no pixel shader-capable graphics hardware is restricted to the support of ps.1.0; all available hardware supports at least ps.1.1, so the legacy ps.1.0 will not be mentioned in this text any further. ps.1.1 is the pixel shader version supported by the GeForce 3. The GeForce 4TI additionally supports ps.1.2 and ps.1.3. The RADEON 8x00 supports all of these pixel shader versions plus ps.1.4.
Whereas ps.1.1, ps.1.2 and ps.1.3 are, from a syntactical point of view, built on each other, ps.1.4 is a new and more flexible approach. ps.1.1 - ps.1.3 differentiate between texture address operation instructions and texture blending instructions. As the names imply, the first kind of instruction is specialized and only usable for address calculations, and the second kind is specialized and only usable for color shading operations. ps.1.4 simplifies the pixel shader language by allowing the texture blending (color shader) instruction set to be used for texture address (address shader) operations as well. It differentiates between arithmetic instructions, which modify color data, and texture address instructions, which process texture coordinate data and in most cases sample a texture.
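As a sketch of this difference, the following two fragments do the same thing, modulating a texture sample with the interpolated diffuse color, first in ps.1.1 and then in ps.1.4 (register usage is illustrative):

    ps.1.1
    tex t0              // texture address instruction: sample stage 0
    mul r0, t0, v0      // texture blending instruction: modulate with diffuse

    ps.1.4
    texld r0, t0        // texture address instruction: sample the texture
                        // bound to stage 0 with coordinate set t0
    mul r0, r0, v0      // arithmetic instruction: modulate with diffuse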
Sampling means looking up a color value in a texture at the specified coordinates (up to four: u, v, w, q), while taking into account the texture stage state attributes.
One might say that the usage of the instructions in ps.1.4 happens in a more RISC-like manner, whereas the ps.1.1 - ps.1.3 instruction sets are only usable in a CISC-like manner.
This RISC-like approach will also be used in ps.2.0, which will appear in DirectX 9. Syntactically, ps.1.4 is, compared to ps.1.1 - ps.1.3, an evolutionary step in the direction of ps.2.0.
What are the other benefits of using ps.1.4?
The next stage in the Direct3D pipeline is Fog. A fog factor is computed and applied to the pixel using a blending operation that combines the fog amount (color) with the already shaded pixel color, depending on how far away an object is. The distance to an object is determined by its z- or w-value, or by using a separate attenuation value that measures the distance between the camera and the object in a vertex shader. If fog is computed per-vertex, it is interpolated across each triangle using Gouraud shading. The "MFC Fog" example in the DirectX 8.1 SDK shows linear and exponential fog calculated per-vertex and per-pixel. A layered range-based fog is shown in the "Height Fog" example in the NVIDIA Effects Browser. The "Volume Fog" example in the DirectX 8.1 SDK shows volumetric fog produced with a vertex shader, a pixel shader and the alpha blending pipeline. As these examples show, fog effects can be driven by the vertex and/or pixel shader.
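For the fixed-function path, a minimal sketch of enabling linear table (per-pixel) fog looks like this (dev is assumed to be a valid IDirect3DDevice8 pointer; the color and range values are arbitrary examples):

    float start = 50.0f, end = 200.0f;                      // fog range in view space
    dev->SetRenderState(D3DRS_FOGENABLE, TRUE);
    dev->SetRenderState(D3DRS_FOGCOLOR, 0x00B5B5B5);        // grey fog
    dev->SetRenderState(D3DRS_FOGTABLEMODE, D3DFOG_LINEAR); // per-pixel fog
    dev->SetRenderState(D3DRS_FOGSTART, *(DWORD*)&start);
    dev->SetRenderState(D3DRS_FOGEND,   *(DWORD*)&end);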
The Alpha Test stage kicks out pixels with specific alpha values, because these shouldn't be visible. This is, for example, one way to map decals with an alpha-mapped texture. The alpha test is switched on with D3DRS_ALPHATESTENABLE. The incoming alpha value is compared with the reference alpha value using the comparison function provided by the D3DRS_ALPHAFUNC render state. If the pixel passes the test, it will be processed by the subsequent pixel operations; otherwise it will be discarded.
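A minimal sketch of such an alpha test setup (the reference value 0x08 is an arbitrary example):

    dev->SetRenderState(D3DRS_ALPHATESTENABLE, TRUE);
    dev->SetRenderState(D3DRS_ALPHAREF, 0x08);                 // reference value
    dev->SetRenderState(D3DRS_ALPHAFUNC, D3DCMP_GREATEREQUAL); // pass if alpha >= ref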
The alpha test does not incur any extra overhead, even if all the pixels pass the test. By tuning the alpha test's reference value appropriately to reject transparent or almost transparent pixels, one can improve application performance significantly if the application is memory bandwidth fill-bound. Basically, varying the reference value acts like a threshold that sets how many pixels are going to be discarded: the more pixels are discarded, the faster the application will run.
There is a trick to drive the alpha test from a pixel shader, which will be shown later in the section on swizzling. The image-space outlining technique in [Card/Mitchell] uses alpha testing to reject pixels that are not outlines and consequently improve performance.
The Stencil Test masks the pixels in the render target with the contents of the stencil buffer. This is useful for dissolves, decaling, outlining, or for building shadow volumes [Brennan]. A nice example for this is "RadeonShadowShader", which can be found on the ATI web site.
The Depth Test determines whether a pixel is visible by comparing its depth value to the stored depth value. An application can set up its z-buffer with z-min and z-max, with positive z going away from the view camera in a left-handed coordinate system. The depth test is a pixel-by-pixel logical test that asks: "Is this pixel behind another pixel at this location?" If the answer is yes, the pixel gets discarded; if the answer is no, the pixel travels further through the pipeline and the z-buffer is updated. There is also a more precise and bandwidth-saving depth buffer form called the w-buffer. It saves bandwidth by only having to send the x/y/w coordinates over the bus, while the z-buffer in certain circumstances has to send all that plus z.
The w-buffer is the only depth buffering method used by Savage 2000-based boards; these cards emulate the z-buffer with the help of the hardware w-buffer.
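Selecting between the two depth buffer forms is a matter of one render state (a sketch; w-buffer support should be checked via the D3DPRASTERCAPS_WBUFFER cap first):

    dev->SetRenderState(D3DRS_ZENABLE, D3DZB_TRUE);   // classic z-buffer
    // or, on hardware that supports it:
    dev->SetRenderState(D3DRS_ZENABLE, D3DZB_USEW);   // w-buffer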
The pixel shader is able to "drive" the depth test with the texdepth (ps.1.4 only) or the texm3x2depth (ps.1.3 only) instructions. These instructions can calculate the depth value used in the depth buffer comparison test on a per-pixel basis.
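A heavily simplified sketch of the ps.1.4 variant (it assumes the vertex shader passes z in t0.x and w in t0.y; register choices are illustrative):

    ps.1.4
    texcrd r5.rgb, t0   // phase 1: move the texture coordinates into r5
    phase
    texdepth r5         // phase 2: depth test value becomes r5.r / r5.g
    texld r0, t1        // sample the base texture for the color output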
The Alpha Blending stage blends the pixel's data with the pixel data already in the Render Target. The blending happens with the following formula:
FinalColor = SourcePixelColor * SourceBlendFactor + DestPixelColor * DestBlendFactor
There are different flags that can be set with the D3DRS_SRCBLEND (SourceBlendFactor) and D3DRS_DESTBLEND (DestBlendFactor) parameters in a SetRenderState() call. Alpha blending was formerly used to blend different texture values together, but nowadays it is more efficient to do that with the multitexturing or pixel shader unit, depending on hardware support. On newer hardware, alpha blending is used to simulate different levels of transparency.
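A minimal sketch of a standard transparency setup with these render states:

    dev->SetRenderState(D3DRS_ALPHABLENDENABLE, TRUE);
    dev->SetRenderState(D3DRS_SRCBLEND,  D3DBLEND_SRCALPHA);    // SourceBlendFactor
    dev->SetRenderState(D3DRS_DESTBLEND, D3DBLEND_INVSRCALPHA); // DestBlendFactor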
Dithering tries to fool the eye into seeing more colors than are actually present by placing differently colored pixels next to one another to create a composite color that the eye will see. For example, a blue pixel next to a yellow pixel would lead to a green appearance. This was a common technique in the days of 8-bit and 4-bit color systems. The dithering stage is switched on globally with D3DRS_DITHERENABLE.
The Render Target is normally the backbuffer into which you render the finished scene, but it can also be the surface of a texture. This is useful for the creation of procedural textures or for reusing the results of former pixel shader executions. The render target is read by the stencil test, the depth test and the alpha blending stage.
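A minimal sketch of redirecting rendering into a texture (names and sizes are arbitrary examples; error checking omitted):

    IDirect3DTexture8* pTex = NULL;
    dev->CreateTexture(256, 256, 1, D3DUSAGE_RENDERTARGET, D3DFMT_A8R8G8B8,
                       D3DPOOL_DEFAULT, &pTex);
    IDirect3DSurface8* pSurf = NULL;
    pTex->GetSurfaceLevel(0, &pSurf);
    dev->SetRenderTarget(pSurf, NULL);  // render without a depth-stencil buffer
    // ... draw the scene; afterwards pTex can be set like any other texture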
To summarize the tasks of the pixel shader in the Direct3D pipeline:
Before getting our feet wet with a first look at the pixel shader architecture, let's take a look at the currently available tools: