How instructions can be modified is best shown with the following diagram in mind:
This diagram shows the parallel pipeline structure of the pixel shader ALU. The vector or color pipeline handles the color values and the scalar or alpha pipeline handles the alpha value of a 32-bit value. There are "extensions" for instructions that enable the programmer to change the way data is read and/or written by the instruction. They are called Swizzling, Source Register Modifiers, Instruction Modifiers and Destination Register Modifiers.
We will work through all instruction "extensions" shown by Figure 12 in the following paragraphs from top to bottom.
In contrast to the more powerful swizzles that can be used in vertex shaders, the swizzling supported in pixel shaders is only able to replicate a single channel of a source register to all channels. This is done by so-called source register selectors.
The .r and .g selectors are only available in ps.1.4. The following instruction replicates the red channel to all channels of the source register.
r1.r ; r1.rgba = r1.r
As shown in Figure 12, selectors are applied first in the pixel shader ALU. They are only valid on source registers of arithmetic instructions.
The .b replicate functionality is available in ps.1.1 - ps.1.3 since the release of DirectX 8.1, but this swizzle is only valid together with an alpha write mask in the destination register of an instruction like this:
mul r0.a, r0, r1.b
ps.1.1 does not support the .b replicate in DirectX 8.0.
This means that the .b source swizzle cannot be used with dp3 in ps.1.1 - ps.1.3, because the only valid write destination masks for dp3 are .rgb or .rgba (write masks will be presented later):
dp3 r0.a, r0, c0.b ; fails
The ability to replicate the blue channel to alpha opens the door to a bandwidth optimization method, described in an NVIDIA OpenGL paper named "Alpha Test Tricks" [Dominé01]. A pixel shader allows a dot product operation at the pixel level between two RGB vectors. Therefore, one can set one of the vectors to (1.0, 1.0, 1.0), turning the dot product into a summation of the other vector's components:
(R, G, B) dot (1.0, 1.0, 1.0) = R + G + B
In the following code snippet, the pixel shader instruction dp3 calculates the sum of the RGB color and replicates the scalar result into all four channels of r1.
ps.1.1
def c0, 1.0, 1.0, 1.0, 1.0
tex t0
dp3 r1, t0, c0
mov r0.a, r1.b
+mov r0.rgb, t0
An appropriate alpha test setup might look like this:
dev->SetRenderState(D3DRS_ALPHAREF, (DWORD)0x00000001);
dev->SetRenderState(D3DRS_ALPHATESTENABLE, TRUE);
dev->SetRenderState(D3DRS_ALPHAFUNC, D3DCMP_GREATEREQUAL);
If the alpha value of the pixel being rasterized is greater than or equal to the reference value (D3DCMP_GREATEREQUAL), the pixel is written. Otherwise, the rasterizer ignores the pixel altogether, saving the processing required, for example, to blend the two colors.
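To make the effect of this setup more tangible, the following C++ sketch emulates on the CPU what the dp3 summation and the alpha test do for a single texel. The Texel struct and both function names are illustrative and not part of Direct3D. The clamp of the sum to [0..1] and the rounding of alpha to 8 bits are assumptions of this sketch about a typical fixed-point path.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical CPU-side stand-in for one RGB texel with channels in [0..1].
struct Texel { float r, g, b; };

// dp3 r1, t0, c0 with c0 = (1, 1, 1): the dot product collapses to R + G + B.
// The clamp models the [0..1] output range (an assumption of this sketch).
float SumAsAlpha(const Texel& t) {
    return std::min(1.0f, std::max(0.0f, t.r + t.g + t.b));
}

// D3DCMP_GREATEREQUAL against an 8-bit reference value such as 0x01.
bool PassesAlphaTest(float alpha, unsigned alphaRef) {
    assert(alphaRef <= 255u);  // the reference value is an 8-bit quantity
    unsigned alpha8 = static_cast<unsigned>(std::lround(alpha * 255.0f));
    return alpha8 >= alphaRef;
}
```

With a reference value of 0x01, only texels whose three color channels are all zero are rejected, so completely black pixels never reach the blending stage.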
An even more clever method to utilize the alpha test for fillrate optimization was shown by ShaderX author Dean Calver on the public Microsoft DirectX discussion forum [Calver]. He uses an alpha map to lerp three textures this way:
; Dean Calver
; 2 lerps from a combined alpha texture
ps.1.1
tex t0 ; combined alpha map
tex t1 ; texture0
tex t2 ; texture1
tex t3 ; texture2
mov r1.a, t0.b ; copy blue to r1.a
lrp r0, r1.a, t2, t1 ; lerp between t1 and t2
lrp r0, t0.a, t3, r0 ; lerp between result and t3
The .a replicate is analogous to the D3DTA_ALPHAREPLICATE flag in the DX6/7 multitexturing unit.
To move any channel to any channel, use dp3 to replicate the channel across, and then mask it out with the help of def instructions. The following pixel shader moves the red channel to the blue channel:
; move the red to blue and output combined
ps.1.1
def c0, 1.f, 0.f, 0.f, 0.f ; select red channel
def c1, 0.f, 0.f, 1.f, 0.f ; mask for blue channel
def c2, 1.f, 1.f, 0.f, 1.f ; mask for all channels but blue
tex t0
dp3 r0, t0, c0 ; copy red to all channels
mul r0, r0, c1 ; mask so only blue is left
mad r0, t0, c2, r0 ; remove blue from original texture and
; add red shifted into blue
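The select-and-mask idea can be checked on the CPU. The following C++ sketch mirrors the three arithmetic instructions of the shader above with plain floats. The Reg struct and the helper names are hypothetical and exist only for this illustration; register range clamping is ignored.

```cpp
#include <cassert>

// Hypothetical four-channel register; not a Direct3D type.
struct Reg { float r, g, b, a; };

// dp3 replicates its scalar result into all four channels.
Reg Dp3(const Reg& x, const Reg& y) {
    float d = x.r * y.r + x.g * y.g + x.b * y.b;
    return {d, d, d, d};
}

// mul works per channel.
Reg Mul(const Reg& x, const Reg& y) {
    return {x.r * y.r, x.g * y.g, x.b * y.b, x.a * y.a};
}

// mad computes x * y + z per channel.
Reg Mad(const Reg& x, const Reg& y, const Reg& z) {
    return {x.r * y.r + z.r, x.g * y.g + z.g, x.b * y.b + z.b, x.a * y.a + z.a};
}

Reg MoveRedToBlue(const Reg& t0) {
    assert(t0.r >= 0.0f && t0.r <= 1.0f);     // texture data is in [0..1]
    const Reg c0{1.0f, 0.0f, 0.0f, 0.0f};     // select red channel
    const Reg c1{0.0f, 0.0f, 1.0f, 0.0f};     // mask for blue channel
    const Reg c2{1.0f, 1.0f, 0.0f, 1.0f};     // mask for all channels but blue
    Reg r0 = Dp3(t0, c0);                     // copy red to all channels
    r0 = Mul(r0, c1);                         // only blue is left
    return Mad(t0, c2, r0);                   // original without blue, plus red in blue
}
```

For an input of (0.25, 0.5, 0.75, 1.0) the result is (0.25, 0.5, 0.25, 1.0): red and green are untouched, and blue now holds the red value.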
In ps.1.4, there are specific source register selectors for texld and texcrd:
texld and texcrd are only able to use three channels in their source registers, so these selectors provide the option of taking the third component from either the third or the fourth component of the source register. Here are two examples of how to use these selectors:
texld r0, t1.rgb
...
texcrd r1.rgb, t1.rga
texcrd r4.rgb, t2.rgb
An overview of all possible source register selectors, modifiers and destination write masks is provided with the description of the texcrd and texld instructions above.
Source Register Modifiers
Source register modifiers are useful to adjust the range of register data in preparation for the instruction or to scale the value.
All modifiers can be used on arithmetic instructions. In ps.1.1 you can use the signed scale modifier _bx2 on the source register of any texm3x2* and texm3x3* instruction. In ps.1.2 and ps.1.3 it can be used on the source register of any texture address instruction.
bias subtracts 0.5 from all components. It allows the same operation as D3DTOP_ADDSIGNED in the DX6/7 multitexturing unit.
It is used to change the range of data from [0..1] to [-0.5..0.5]. Applying bias to data outside this range may produce undefined results. Data outside this range can be saturated to the range [0..1] with the _sat instruction modifier before being used with a biased source register (more on instruction modifiers in the next section).
A typical example for this modifier is detail mapping, shown in the add example.
Invert computes (1 - value) for each channel of the specified source register. The following code snippet uses inversion to complement the source register r1:
mul r0, r0, 1-r1 ; multiply by (1.0 - r1)
Negate negates all source register components by using a minus sign before a register. This modifier is mutually exclusive with the invert modifier, so the two cannot be applied to the same register.
Scale with the _x2 modifier is only available in ps.1.4. It multiplies a value by two before using it in the instruction, and it is mutually exclusive with the invert modifier.
Signed scaling with the _bx2 modifier is a combination of bias and scale so it subtracts 0.5 from each channel and scales the result by 2. It remaps input data from unsigned to signed values. As with bias, using data outside of the range 0 to 1 may produce undefined results. This modifier is typically used in dp3 instructions. An example for this is presented with the description of the dp3 instruction above. Signed scaling is mutually exclusive with the invert modifier.
None of these modifiers changes the content of the source register it is applied to. A modifier affects only the data read from the register, so the value stored in the source register stays as it was.
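The arithmetic behind the source register modifiers can be summarized per channel in a few lines of C++. This is only a CPU-side sketch with hypothetical function names; it ignores the range restrictions and the undefined results mentioned above.

```cpp
#include <cassert>

float Bias(float x)        { return x - 0.5f; }           // r_bias
float Invert(float x)      { return 1.0f - x; }           // 1-r
float Negate(float x)      { return -x; }                 // -r
float Scale2(float x)      { return x * 2.0f; }           // r_x2 (ps.1.4 only)
float SignedScale(float x) { return (x - 0.5f) * 2.0f; }  // r_bx2

// Usage: a normal map channel stored as 0.75 expands to 0.5 in the signed
// range, while the stored value itself is left untouched.
void ExampleExpandNormalChannel() {
    float stored = 0.75f;
    float expanded = SignedScale(stored);
    assert(expanded == 0.5f);
    assert(stored == 0.75f);
}
```

Note that SignedScale maps the end points 0 and 1 to -1 and 1, which is why _bx2 is the usual way to unpack vectors stored in color channels.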
Modifiers and selectors may be combined freely. In the following example r1 uses the negative, bias and signed scaling modifiers as well as a red selector
With the help of the source register modifiers, per-pixel normalization is possible in the pixel shader. Per-pixel normalization can be used instead of a cubemap:
; Assuming v0 contains the unnormalized biased & scaled vector (just
; like a normal map), r0 will end up with a very close to normalized
; result.
; This trick is also useful to do 'detail normal maps' by adding
; a vector to a normal map, and then renormalizing it.
dp3 r0, v0_bx2, v0_bx2 ; r0 = N . N
mad r0, v0_bias, 1-r0, v0_bx2 ; (v0_bias * (1-r0)) + v0_bx2
; ((N - 0.5) * (1 - N.N)) + (N - 0.5) * 2
Normalization requires calculating 1/sqrt(N.N). This code snippet normalizes the normal vector with a single step of the standard Newton-Raphson iteration for approximating a reciprocal square root. This trick was shown by Sim Dietrich in the Microsoft DirectX forum [Dietrich-DXDev].
In the first line, the normal vector N is biased and scaled by _bx2 and then multiplied with itself via a dot product. The result is stored in r0. In the next line, the biased normal (v0_bias, which is half of the biased and scaled normal) is multiplied by the inverse of r0 (1 - r0), and the biased and scaled normal is added to the product.
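Why these two instructions approximate a normalization becomes clearer when the arithmetic is written out. One Newton-Raphson step for 1/sqrt(x), started at the guess 1, gives (3 - x) / 2 = 1 + (1 - x) / 2, so the normalized vector is approximately N + (N / 2) * (1 - N.N). The following C++ sketch evaluates exactly this expression. Vec3 and the function names are hypothetical, and the sketch uses unclamped floats, which is an assumption that differs from the limited register range of real hardware.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical three-component vector; not a Direct3D type.
struct Vec3 { float x, y, z; };

float Length(const Vec3& v) {
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

// n is the already expanded (signed) vector, i.e. what v0_bx2 yields.
Vec3 ApproxNormalize(const Vec3& n) {
    float nn = n.x * n.x + n.y * n.y + n.z * n.z;  // dp3 r0, v0_bx2, v0_bx2
    assert(nn > 0.0f);                             // a zero vector has no direction
    float k = 1.0f - nn;                           // 1-r0
    // v0_bias equals half of v0_bx2, so the mad is (n / 2) * (1 - n.n) + n.
    return {0.5f * n.x * k + n.x,
            0.5f * n.y * k + n.y,
            0.5f * n.z * k + n.z};
}
```

A vector of length 0.9 comes out with a length of roughly 0.9855, and a vector that is already normalized is returned unchanged. The approximation is only good for inputs that are close to unit length, which is the typical case for interpolated normals.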
There are additional modifiers specific to the texld and texcrd instructions in the ps.1.4 implementation. These modifiers provide projective divide functionality by dividing the x and y values by either the z or w values. Therefore, projective dependent reads are possible in ps.1.4.
These modifiers provide a functional replacement in the pixel shader for the D3DTTFF_PROJECTED flag of the D3DTSS_TEXTURETRANSFORMFLAGS texture stage state. A typical instruction would look like this:
texcrd r2.rg, t1_dw.xyw ; third channel unset
The modifier copies x/w and y/w from t1 into the first two channels of r2. The 3rd and 4th channels of r2 are uninitialized. Any previous data written to these channels will be lost. The per-pixel perspective divide is useful for example for projective textures.
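As a rough illustration of what the projective divide does to a coordinate, the following C++ sketch performs the same division on the CPU. The struct and function names are hypothetical. The guard against a zero divisor is an addition of this sketch and says nothing about how hardware treats that case.

```cpp
#include <cassert>

// Hypothetical coordinate types; not Direct3D types.
struct Coord4 { float x, y, z, w; };
struct Coord2 { float u, v; };

// Corresponds to a _dw read such as: texcrd r2.rg, t1_dw.xyw
Coord2 ProjectByW(const Coord4& t) {
    assert(t.w != 0.0f);
    return {t.x / t.w, t.y / t.w};
}

// Corresponds to dividing x and y by the z value instead.
Coord2 ProjectByZ(const Coord4& t) {
    assert(t.z != 0.0f);
    return {t.x / t.z, t.y / t.z};
}
```

For the coordinate (2, 4, 0, 2), ProjectByW yields (1, 2), which is the pair that would end up in the first two channels of the destination register.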
The restrictions for the two texture addressing registers are:
After the swizzling of the source register channels and the modification of the values read out from a source register with source register modifiers, the instruction starts executing. As shown in Figure 12, the instruction modifiers are applied next. They are written as a suffix to the instruction, connected via an underscore. Instruction modifiers are used to change the output of an instruction. They can multiply or divide a result or clamp a result to [0..1]:
Instruction modifiers can be used only on arithmetic instructions. The _x8, _d4, and _d8 modifiers are new to the 1.4 shading model. _sat may be used alone or combined with one of the other modifiers, e.g. mad_d8_sat.
Multiplier modifiers are useful to scale a value. Note that any such scaling reduces accuracy of results. The following examples scale the results by using _x2 or _x4:
ps.1.1
tex t0
tex t1
mul_x2 r0, t1, t0 ; (t1 * t0) * 2
...
mul_x2 r0, 1-t1, t0 ; t0 * inverse(t1) * 2
...
mad_x2 r0, t1, v0, t0 ; ((t1 * v0) + t0) * 2
...
mul_x4 r0, v0, t0 ; (v0 * t0) * 4
mul_x4 r1, v1, t1 ; (v1 * t1) * 4
add r0, r0, r1 ; (v0*t0 * 4)+(v1*t1 * 4)
The _x2 modifier does the same as a left shift by one bit in C/C++.
The _d2 modifier does the same as a right shift by one bit in C/C++. Here is a more complex example:
; Here is an example for per-pixel area lighting
ps.1.1
def c1, 1.0, 1.0, 1.0, 1.0 ; sky color
def c2, 0.15, 0.15, 0.15, 1.0 ; ground color
def c5, 0.5, 0.5, 0.5, 1.0
tex t0 ; normal map
tex t1 ; base texture
dp3_d2 r0, v0_bx2, t0_bx2 ; v0.rgb is hemi axis in tangent space
; dot normal with hemi axis
add r0, r0, c5 ; map into range
lrp r0, r0, c1, c2
mul r0, r0, t1 ; modulate base texture
This pixel shader biases the hemisphere axis in v0 and scales it by 2. The same is done to the values of the normal map. dp3_d2 divides the result of the dot product by 2. The add instruction adds 0.5 to the vector in r0, which maps the result into the range [0..1]. lrp uses r0 as the proportion to linearly interpolate between the sky color in c1 and the ground color in c2.
The saturation modifier _sat clamps each component or channel of the result to the range [0..1]. It is most often used to clamp dot products, as in this code snippet:
dp3_sat r0, t1_bx2, r0 ; N.H
dp3_sat r1.rgb, t1_bx2, r1 ; N.L
The result of the dot product operation of the normal vector with the half angle vector and the result of the dot product operation of the normal and the light vector are saturated. That means the values in r0 and r1.rgb are clamped to [0..1].
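The instruction modifiers can be thought of as a small post-processing step on the result of an instruction. The following C++ sketch models this per channel. The function names are hypothetical, and applying the scale before the saturation is an assumption of this sketch about how a combination such as mad_d8_sat is evaluated.

```cpp
#include <algorithm>
#include <cassert>

// _sat: clamp a result to [0..1].
float Saturate(float x) {
    return std::min(1.0f, std::max(0.0f, x));
}

// scale is 2, 4 or 8 for _x2, _x4, _x8 and 0.5, 0.25 or 0.125 for
// _d2, _d4, _d8; use 1 when no multiplier modifier is present.
float ApplyInstructionModifiers(float result, float scale, bool sat) {
    assert(scale > 0.0f);
    float scaled = result * scale;        // multiplier or divider first
    return sat ? Saturate(scaled) : scaled;
}
```

A raw result of 0.75 becomes 1.5 with _x2 and 1.0 with _x2_sat, while a negative dot product is clamped to 0 by _sat regardless of the scale.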
Destination Register Modifiers/Masking
A destination register modifier or write mask controls which channels in a register are updated, so an instruction only alters the values of the channels named in the mask.
Write masks are supported for arithmetic instructions only. The following destination write masks are available for all arithmetic instructions:
In ps.1.1 - ps.1.3 a pixel shader can only use the .rgb or .a write masks. The arbitrary write mask in ps.1.4 allows any set of channels in the order r, g, b, a to be combined. It is possible to choose for example:
mov r0.ra, r1
If no destination write mask is specified, the destination write mask defaults to the .rgba case, which updates all channels in the destination register. An alternate syntax for the r, g, b, a channels is x, y, z, w.
As with the source register selectors and source register modifiers, the texld and texcrd instructions have additional write masks and write mask rules. texcrd can write only to the .rgb channels. It additionally supports a write mask that selects only the first two channels with .rg or .xy. texld uses all four channels of the destination register. There is no alternative write mask available.
The usage of write masks is shown in the following ps.1.4 pixel shader that handles diffuse bump mapping with two spotlights (taken from the file 14_bumpspot.sha of the ATI Treasure Chest example program):
ps.1.4
def c0, 1.0f, 1.0f, 1.0f, 1.0f ; Light 1 Color
def c1, 1.0f, -0.72f, 1.0f, 1.0f ; Light 1 Angle scale(x) and bias(Y)
def c2, 1.0f, 1.0f, 1.0f, 1.0f ; Light 2 Color
def c3, 0.25f, 0.03f, 1.0f, 1.0f ; Light 2 Angle scale(x) and bias(Y)
texcrd r0.rgb, t2 ; Spot light 1 direction
texcrd r1.rgb, t4 ; Spot light 2 direction
texld r2, t1 ; Light 1 to Point vector
texld r3, t3 ; Light 2 to Point vector
texcrd r4.rgb, t1 ; Light 1 space position for attenuation
texcrd r5.rgb, t3 ; Light 2 space position for attenuation
dp3_sat r4.x, r4, r4 ; Light 1 Distance^2
dp3_sat r5.x, r5, r5 ; Light 2 Distance^2
dp3_sat r4.y, r0, r2_bx2 ; Light 1 Angle from center of spotlight
dp3_sat r5.y, r1, r3_bx2 ; Light 2 Angle from center of spotlight
mad_x4 r4.y, r4.y, c1.x, c1.y ; Light 1 scale and bias for angle
mad_x4 r5.y, r5.y, c3.x, c3.y ; Light 2 scale and bias for angle
phase
texld r0, t0 ; Base Map
texld r1, t0 ; Normal Map
texld r4, r4 ; Distance/Angle lookup map
texld r5, r5 ; Distance/Angle lookup map
dp3_sat r2.rgb, r1_bx2, r2_bx2 ; *= (N.L1)
mul_x2 r2.rgb, r2, r4.r ; Attenuation from distance and angle
mad r2.rgb, r2, c0, c7 ; * Light Color + Ambient
dp3_sat r3.rgb, r1_bx2, r3_bx2 ; *= (N.L2)
mul_x2 r3.rgb, r3, r5.r ; Attenuation from distance and angle
mad r3.rgb, r3, c2, r2 ; * Light 2 Color + Light 1 Color + Ambient
mul r0.rgb, r3, r0 ; Modulate by base map
+mov r0.a, c0
There are four different write masks used throughout this shader. These are the .rgb, .x, .y and the .a write masks. The .rgb write mask used with the texcrd instructions is mandatory. texld can't handle write masks other than .rgba, which is the same as applying no explicit write mask. The first four dp3 instructions and the following two mad instructions write to the x or y channel of the r4 and r5 registers. These write masks are not supported by ps.1.1 - ps.1.3. The usage of the .rgb write mask in the second phase of this shader is supported by all implementations. The last two lines of this shader show the pairing of two instructions using co-issue. We will discuss instruction pairing or "co-issuing" in the next section.
As shown above in Figure 12, there are two pipelines: one for the color data and one for the alpha data. Because of the parallel nature of these pipelines, instructions that write color data and instructions that write only alpha data can be paired. This helps to reduce the fill-rate cost of a shader. Only arithmetic instructions can be co-issued, with the exception of dp4. Pairing, or co-issuing, is indicated by a plus sign (+) preceding the second instruction of the pair. The following shader fragment shows three pairs of co-issued instructions:
dp3_sat r1.rgb, t1_bx2, r1
+mul r1.a, r0, r0
mul r1.rgb, r1, v0
+mul r1.a, r1, r1
add r1.rgb, r1, c7
+mul_d2 r1.a, r1, r1
First a dp3 instruction is paired with a mul, then a mul instruction with a mul, and finally an add instruction with a mul. In ps.1.1 - ps.1.3, pairing always happens with the help of a pair of .rgb and .a write masks. In ps.1.4, a pairing of the .r, .g, or .b write masks together with an .a masked destination register is possible. The line
mul r1.a, r0, r0
only writes the alpha value of the result of the multiplication of r0 with itself into r1.a.
Co-issued instructions are considered a single entity: the result from the first instruction is not available until both instructions are finished, and vice versa. The following shader will fail shader validation:
ps.1.1
def c0, 1, 1, 1, 1
mov r0, v0
dp3 r1.rgb, r0, c0
+mov r0.a, r1.b
mov tries to read r1.b, but dp3 did not write to r1.b at that time. The shader will fail, because r1.b was not initialized before.
This can be troublesome when r1.b has already been initialized by an earlier instruction. In that case the validator will not catch the bug, the mov reads the old value of r1.b, and the results will not look as expected.
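The pitfall can be reproduced with a small CPU model in which both halves of a co-issued pair read the register state from before the pair executes. The following C++ sketch does this for the dp3/mov pair from the failing shader, under the assumption that r1 was initialized earlier. All type and function names are hypothetical.

```cpp
#include <cassert>

// Hypothetical register types; not Direct3D types.
struct Reg  { float r, g, b, a; };
struct Regs { Reg r0, r1; };

// Emulates the pair:  dp3 r1.rgb, r0, c0
//                    +mov r0.a, r1.b
Regs CoIssuePair(const Regs& in, const Reg& c0) {
    Regs out = in;  // both pipelines start from the old register state
    float d = in.r0.r * c0.r + in.r0.g * c0.g + in.r0.b * c0.b;
    out.r1.r = out.r1.g = out.r1.b = d;  // color pipeline: dp3 with .rgb mask
    out.r0.a = in.r1.b;                  // alpha pipeline: reads the OLD r1.b
    assert(out.r0.a == in.r1.b);         // the new dp3 result was not visible
    return out;
}
```

With r0 = (0.25, 0.25, 0.25, 1) and an old r1.b of 0.5, the pair leaves 0.5 in r0.a, not the freshly computed sum of 0.75 that the author of such a shader probably expected.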
Another restriction to pay attention to is that at most three different register types can be used across two co-issued instructions.
The GeForce3/4 Ti has a problem with co-issuing instructions in the 8th arithmetic instruction slot. It stops showing the results when a co-issue happens in the 8th arithmetic instruction, whereas the reference rasterizer (REF) works as expected. The following meaningless pixel shader doesn't show anything with driver version 28.32:
ps.1.1
tex t0 ; color map
tex t1 ; normal map
dp3 r0, t1_bx2, v1_bx2 ; dot(normal,half)
mul r1, r0, r0 ; raise it to 32nd power
mul r0, r1, r1
mul r1, r0, r0
mul r0, r1, r1
mul r1, r0, r0
mul r0, r1, r1
; assemble final color
mul r0.rgb, t0, r0
+mov r0.a, r1