16. Loop Optimation =================== When analyzing a program you often find that 99% of the time consumption lies in the innermost loop. The way to improve the speed is to carefully optimize the most time-consuming loop using ASM language. The rest of the program may be left in high-level language. A loop generally contains a counter controlling how many times to iterate, and often array access reading or writing one array element for each iteration. I have chosen as example a procedure which reads integers from an array, changes the sign of each integer, and stores the results in another array. A C language code for this procedure would be: void ChangeSign (int * A, int * B, int N) { int i; for (i=0; i<N; i++) B[i] = -A[i];} Translating to assembler, we might write the procedure like this: Example 1: _ChangeSign PROCEDURE NEAR PUSH ESI PUSH EDI A EQU DWORD PTR [ESP+12] B EQU DWORD PTR [ESP+16] N EQU DWORD PTR [ESP+20] MOV ECX, [N] JECXZ L2 MOV ESI, [A] MOV EDI, [B] CLD L1: LODSD NEG EAX STOSD LOOP L1 L2: POP EDI POP ESI RET _ChangeSign ENDP This looks like a nice solution, but it is not optimal because it uses slow non-pairable instructions. It takes 11 clock cycles per iteration if all data are in the level one cache. Using pairable instructions only -------------------------------- Example 2: MOV ECX, [N] MOV ESI, [A] TEST ECX, ECX JZ SHORT L2 MOV EDI, [B] L1: MOV EAX, [ESI] ; u XOR EBX, EBX ; v (pairs) ADD ESI, 4 ; u SUB EBX, EAX ; v (pairs) MOV [EDI], EBX ; u ADD EDI, 4 ; v (pairs) DEC ECX ; u JNZ L1 ; v (pairs) L2: Here we have used pairable instructions only, and scheduled the instruc- tions so that everything pairs. It now takes only 4 clock cycles per iteration. We could have obtained the same speed without splitting the NEG instruction, but the other unpairable instructions should be split up. Using the same register for counter and index --------------------------------------------- Example 3: MOV ESI, [A] MOV EDI, [B] MOV ECX, [N] XOR EDX, EDX TEST ECX, ECX JZ SHORT L2 L1: MOV EAX, [ESI+4*EDX] ; u NEG EAX ; u MOV [EDI+4*EDX], EAX ; u INC EDX ; v (pairs) CMP EDX, ECX ; u JB L1 ; v (pairs) L2: Using the same register for counter and index gives us fewer instructions in the body of the loop, but it still takes 4 clocks because we have two unpaired instructions. Letting the counter end at zero ------------------------------- We want to get rid of the CMP instruction in example 3 by letting the counter end at zero and use the zero flag for detecting when we are finished as we did in example 2. One way to do this would be to execute the loop backwards taking the last array elements first. However, data caches are optimized for accessing data forwards, not backwards, so if cache misses are likely, then you should rather start the counter at -N and count through negative values up to zero. The base registers should then point to the end of the arrays rather than the beginning: Example 4: MOV ESI, [A] MOV EAX, [N] MOV EDI, [B] XOR ECX, ECX LEA ESI, [ESI+4*EAX] ; point to end of array A SUB ECX, EAX ; -N LEA EDI, [EDI+4*EAX] ; point to end of array B JZ SHORT L2 L1: MOV EAX, [ESI+4*ECX] ; u NEG EAX ; u MOV [EDI+4*ECX], EAX ; u INC ECX ; v (pairs) JNZ L1 ; u L2: We are now down at five instructions in the loop body but it still takes 4 clocks because of poor pairing. (If the addresses and sizes of the arrays are constants we may save two registers by substituting A+SIZE A for ESI and B+SIZE B for EDI). Now let's see how we can improve pairing. Pairing calculations with loop overhead --------------------------------------- We may want to improve pairing by intermingling calculations with the loop control instructions. If we want to put something in between INC ECX and JNZ L1, it has to be something that doesn't affect the zero flag. The MOV [EDI+4*ECX],EBX instruction after INC ECX would generate an AGI delay, so we have to be more ingenious: Example 5: MOV EAX, [N] XOR ECX, ECX SHL EAX, 2 ; 4 * N JZ SHORT L3 MOV ESI, [A] MOV EDI, [B] SUB ECX, EAX ; - 4 * N ADD ESI, EAX ; point to end of array A ADD EDI, EAX ; point to end of array B JMP SHORT L2 L1: MOV [EDI+ECX-4], EAX ; u L2: MOV EAX, [ESI+ECX] ; v (pairs) XOR EAX, -1 ; u ADD ECX, 4 ; v (pairs) INC EAX ; u JNC L1 ; v (pairs) MOV [EDI+ECX-4], EAX L3: I have used a different way to calculate the negative of EAX here: inverting all bits and adding one. The reason why I am using this method is that I can use a dirty trick with the INC instruction: INC doesn't change the carry flag, whereas ADD does. I am using ADD rather than INC to increment my loop counter and testing the carry flag rather than the zero flag. It is then possible to put the INC EAX in between without affecting the carry flag. You may think that we could have used LEA EAX,[EAX+1] here in stead of INC EAX, at least that doesn't change any flags, but the LEA instruction would have an AGI stall so that's not the best solution. I have obtained perfect pairing here and the loop now takes only 3 clock cycles. Whether you want to increment the loop counter by 1 (as in example 4) or by 4 (as in example 5) is a matter of taste, it makes no difference in loop timing. Overlapping the end of one operation with the beginning of the next ------------------------------------------------------------------- The method used in example 5 is not very generally applicable so we may look for other methods of improving pairing opportunities. One way is to reorganize the loop so that the end of one operation overlaps with the beginning of the next. Actually, example 5 did pair the last MOV of one iteration with the first MOV of the next, but we want to explore this method further: Example 6: MOV ESI, [A] MOV EAX, [N] MOV EDI, [B] XOR ECX, ECX LEA ESI, [ESI+4*EAX] ; point to end of array A SUB ECX, EAX ; -N LEA EDI, [EDI+4*EAX] ; point to end of array B JZ SHORT L3 XOR EBX, EBX MOV EAX, [ESI+4*ECX] INC ECX JZ SHORT L2 L1: SUB EBX, EAX ; u MOV EAX, [ESI+4*ECX] ; v (pairs) MOV [EDI+4*ECX-4], EBX ; u INC ECX ; v (pairs) MOV EBX, 0 ; u JNZ L1 ; v (pairs) L2: SUB EBX, EAX MOV [EDI+4*ECX-4], EBX L3: Here we begin reading the second value before we have stored the first, and this of course improves pairing opportunities. The MOV EBX,0 instruction has been put in between INC ECX and JNZ L1 not to improve pairing but to avoid AGI stall. Rolling out a loop ------------------ The most generally applicable way to improve pairing opportunities is to do two operations for each run and do half as many runs. This is called rolling out a loop: Example 7: MOV ESI, [A] MOV EAX, [N] MOV EDI, [B] XOR ECX, ECX LEA ESI, [ESI+4*EAX] ; point to end of array A SUB ECX, EAX ; -N LEA EDI, [EDI+4*EAX] ; point to end of array B JZ SHORT L2 TEST AL,1 ; test if N is odd JZ SHORT L1 MOV EAX, [ESI+4*ECX] ; N is odd. do the odd one NEG EAX MOV [EDI+4*ECX], EAX INC ECX ; make counter even JZ SHORT L2 ; N = 1 L1: MOV EAX, [ESI+4*ECX] ; u MOV EBX, [ESI+4*ECX+4] ; v (pairs) NEG EAX ; u NEG EBX ; u MOV [EDI+4*ECX], EAX ; u MOV [EDI+4*ECX+4], EBX ; v (pairs) ADD ECX, 2 ; u JNZ L1 ; v (pairs) L2: Now we are doing two operations in parallel which gives the best pairing opportunities. We have to test if N is odd and if so do one operation outside the loop because the loop can only do an even number of operations. The loop has an AGI stall at the first MOV instruction because ECX has been incremented in the preceding clock cycle. The loop therefore takes 6 clock cycles for two operations. Reorganizing a loop to remove AGI stall --------------------------------------- Example 8: MOV ESI, [A] MOV EAX, [N] MOV EDI, [B] XOR ECX, ECX LEA ESI, [ESI+4*EAX] ; point to end of array A SUB ECX, EAX ; -N LEA EDI, [EDI+4*EAX] ; point to end of array B JZ SHORT L3 TEST AL,1 ; test if N is odd JZ L2 MOV EAX, [ESI+4*ECX] ; N is odd. do the odd one NEG EAX ; no pairing opportunity MOV [EDI+4*ECX-4], EAX INC ECX ; make counter even JNZ L2 NOP ; add NOP's if JNZ L2 not predictable NOP JMP L3 ; N = 1 L1: NEG EAX ; u NEG EBX ; u MOV [EDI+4*ECX-8], EAX ; u MOV [EDI+4*ECX-4], EBX ; v (pairs) L2: MOV EAX, [ESI+4*ECX] ; u MOV EBX, [ESI+4*ECX+4] ; v (pairs) ADD ECX, 2 ; u JNZ L1 ; v (pairs) NEG EAX NEG EBX MOV [EDI+4*ECX-8], EAX MOV [EDI+4*ECX-4], EBX L3: The trick is to find a pair of instructions that do not use the loop counter as index and reorganize the loop so that the counter is incremented in the preceding clock cycle. We are now down at 5 clock cycles for two operations which is close to the best possible. If data caching is critical, then you may improve the speed further by combining the A and B arrays into one structured array so that each B[i] comes immediately after the corresponding A[i]. If the structured array is aligned by at least 8 then B[i] will always be in the same cache line as A[i], so you will never have a cache miss when writing B[i]. This may of course have a tradeoff in other parts of the program so you have to weigh the costs against the benefits. Rolling out by more than 2 -------------------------- You may think of doing more than two operations per iteration in order to reduce the loop overhead per operation. But since the loop overhead in most cases can be reduced to only one clock cycle per iteration, then rolling out the loop by 4 rather than by 2 would only save 1/4 clock cycle per operation, which is hardly worth the effort. The drawbacks of excessive loop unrolling are: 1. You need to calculate N MODULO R, where R is the unrolling factor, and do N MODULO R operations before or after the main loop in order to make the remaining number of operations divisible by R. This takes a lot of extra code and poorly predictable branches. And the loop body of course also becomes bigger. 2. A Piece of code usually takes much more time the first time it executes, and the penalty of first time execution is bigger the more code you have, especially if N is small. 3. Excessive code size makes the utilization of the code cache less effective. Handling multiple 8 or 16 bit operands simultaneously in 32 bit registers ------------------------------------------------------------------------- If you need to manipulate arrays of 8 or 16 bit operands, then there is a problem with unrolled loops because you may not be able to pair two memory access operations. For example MOV AL,[ESI] / MOV BL,[ESI+1] will not pair if the two operands are within the same dword of memory. But there may be a much smarter method, namely to handle four bytes at a time in the same 32 bit register. The following example adds 2 to all elements of an array of bytes: Example 9: MOV ESI, [A] ; address of byte array MOV ECX, [N] ; number of elements in byte array TEST ECX, ECX ; test if N is 0 JZ SHORT L2 MOV EAX, [ESI] ; read first four bytes L1: MOV EBX, EAX ; copy into EBX AND EAX, 7F7F7F7FH ; get lower 7 bits of each byte in EAX XOR EBX, EAX ; get the highest bit of each byte in EBX ADD EAX, 02020202H ; add desired value to all four bytes XOR EBX, EAX ; combine bits again MOV EAX, [ESI+4] ; read next four bytes MOV [ESI], EBX ; store result ADD ESI, 4 ; increment pointer SUB ECX, 4 ; decrement loop counter JA L1 ; loop L2: This loop takes 5 clock cycles for every 4 bytes. The array should of course be aligned by 4. If the number of elements in the array is not divisible by four, then you may pad it in the end with a few extra bytes to make the length divisible by four. This loop will always read past the end of the array, so you should make sure the array is not placed at the end of a segment to avoid a general protection error. Note that I have masked out the highest bit of each byte to avoid a possible carry from each byte into the next when adding. I am using XOR rather than ADD when putting in the high bit again to avoid carry. The ADD ESI,4 instruction could have been avoided by using the loop counter as index as in example 4. However, this would give an odd number of instructions in the loop body, so there would be one unpaired instruction and the loop would still take 5 clocks. Making the branch instruction unpaired would save one clock after the last operation when the branch is mispredicted, but we would have to spend an extra clock cycle in the prolog code to setup a pointer to the end of the array and calculate -N, so the two methods will be exactly equally fast. The method presented here is the simplest and shortest. The next example finds the length of a zero-terminated string by searching for the first byte of zero. It is faster than using REP SCASB: Example 10: STRLEN PROC NEAR MOV EAX,[ESP+4] ; get pointer MOV EDX,7 ADD EDX,EAX ; pointer+7 used in the end PUSH EBX MOV EBX,[EAX] ; read first 4 bytes ADD EAX,4 ; increment pointer L1: LEA ECX,[EBX-01010101H] ; subtract 1 from each byte XOR EBX,-1 ; invert all bytes AND ECX,EBX ; and these two MOV EBX,[EAX] ; read next 4 bytes ADD EAX,4 ; increment pointer AND ECX,80808080H ; test all sign bits JZ L1 ; no zero bytes, continue loop TEST ECX,00008080H ; test first two bytes JNZ SHORT L2 SHR ECX,16 ; not in the first 2 bytes ADD EAX,2 L2: SHL CL,1 ; use carry flag to avoid a branch POP EBX SBB EAX,EDX ; compute length RET ; (or RET 4 if pascal) STRLEN ENDP Again we have used the method of overlapping the end of one operation with the beginning of the next to improve pairing. I have not unrolled the loop because it is likely to repeat relatively few times. The string should of course be aligned by 4. The code will always read past the end of the string, so the string should not be placed at the end of a segment. The loop body has an odd number of instructions so there is one unpaired. Making the branch instruction unpaired rather than one of the other instructions has the advantage that it saves 1 clock cycle when the branch is mispredicted. The TEST ECX,00008080H instruction is non-pairable. You could use the pairable instruction OR CH,CL here in stead, but then you would have to put in a NOP or something for the following reason: The loop branch (JZ L1) is usually mispredicted the last time, when the loop is finished. Executing a branch (JNZ L2) in the first clock cycle after a mispredicted branch will give you a penalty of 5-10 clock cycles. Another problem with OR CH,CL is that it would cause a partial register stall on a 486 or PentiumPro processor. So I have chosen to keep the unpairable TEST instruction. Handling 4 bytes simultaneously can be quite difficult. The code uses an algorithm which generates a nonzero value for a byte if, and only if, the byte is zero. This makes it possible to test all four bytes in one operation. This algorithm involves the subtraction of 1 from all bytes (in the LEA instruction). I have not masked out the highest bit of each byte before subtracting, as I did in the previous example, so the subtraction may generate a borrow to the next byte, but only if it is zero, and this is exactly the situation where we don't care what the next byte is, because we are searching forwards for the first zero. If we were searching backwards then we would have to re-read the dword after detecting a zero, and then test all four bytes to find the last zero, or use BSWAP to reverse the order of the bytes. If you want to search for a byte value other than zero, then you may XOR all four bytes with the value you are searching for, and then use the method above to search for zero. Handling multiple operands in the same register is easier on the MMX processors because they have special instructions and special 64 bit registers for this purpose. Using the MMX instructions has a high penalty if you are using floating point instructions shortly afterwards, so there may still be situations where you want to use 32 bit registers as in the examples above. Loops with floating point operations ------------------------------------ The methods of optimizing floating point loops are basically the same as for integer loops, although the floating point instructions are overlapping rather than pairing. Consider the C language code: int i, n; double * X; double * Y; double DA; for (i=0; i<n; i++) Y[i] = Y[i] - DA * X[i]; This piece of code has been studied extensively because it is the key to solving linear equations. Example 11: MOV EAX, [N] ; number of elements MOV ESI, [X] ; pointer to X MOV EDI, [Y] ; pointer to Y XOR ECX, ECX LEA ESI, [ESI+8*EAX] ; point to end of X SUB ECX, EAX ; -N LEA EDI, [EDI+8*EAX] ; point to end of Y JZ SHORT L3 ; test for N = 0 FLD QWORD PTR [DA] FMUL QWORD PTR [ESI+8*ECX] ; DA * X[0] JMP SHORT L2 ; jump into loop L1: FLD QWORD PTR [DA] FMUL QWORD PTR [ESI+8*ECX] ; DA * X[i] FXCH ; get old result FSTP QWORD PTR [EDI+8*ECX-8] ; store Y[i] L2: FSUBR QWORD PTR [EDI+8*ECX] ; subtract from Y[i] INC ECX ; increment index JNZ L1 ; loop FSTP QWORD PTR [EDI+8*ECX-8] ; store last result L3: Here we are using the same methods as in example 6: Using the loop counter as index register and counting through negative values up to zero. The end of one operation overlaps with the beginning of the next. The interleaving of floating point operations work perfectly here: The 2 clock stall between FMUL and FSUBR is filled with the FSTP of the previous result. The 3 clock stall between FSUBR and FSTP is filled with the loop overhead and the first two instructions of the next operation. An AGI stall has been avoided by reading the only parameter that doesn't depend on the index in the first clock cycle after the index has been incremented. This solution takes 6 clock cycles per operation, which is better than the unrolled solution published by Intel! Unrolling floating point loops ------------------------------ As explained in section 15, you often have to interleave four threads of floating point instructions in order to remove all stalls. For the same reason you often have to unroll floating point loops by 4. If you experiment with the task of the preceding example, you will find that it is impossible to obtain a stall-free solution when unrolling by 2 or 3. Unrolling the loop by four: Example 12: MOV EAX, [N] ; number of elements MOV ESI, [X] ; pointer to X MOV EDI, [Y] ; pointer to Y XOR ECX, ECX LEA ESI, [ESI+8*EAX] ; point to end of X SUB ECX, EAX ; -N LEA EDI, [EDI+8*EAX] ; point to end of Y TEST AL,1 ; test if N is odd JZ SHORT L1 FLD QWORD PTR [DA] ; do the odd operation FMUL QWORD PTR [ESI+8*ECX] FSUBR QWORD PTR [EDI+8*ECX] INC ECX ; adjust counter FSTP QWORD PTR [EDI+8*ECX-8] L1: TEST AL,2 ; test for possibly 2 more operations JZ L2 FLD QWORD PTR [DA] ; N MOD 4 = 2 or 3. Do two more FMUL QWORD PTR [ESI+8*ECX] FLD QWORD PTR [DA] FMUL QWORD PTR [ESI+8*ECX+8] FXCH FSUBR QWORD PTR [EDI+8*ECX] FXCH FSUBR QWORD PTR [EDI+8*ECX+8] FXCH FSTP QWORD PTR [EDI+8*ECX] FSTP QWORD PTR [EDI+8*ECX+8] ADD ECX, 2 ; counter is now divisible by 4 L2: TEST ECX, ECX JZ L4 ; no more operations L3: ; main loop: FLD QWORD PTR [DA] FLD QWORD PTR [ESI+8*ECX] FMUL ST,ST(1) FLD QWORD PTR [ESI+8*ECX+8] FMUL ST,ST(2) FLD QWORD PTR [ESI+8*ECX+16] FMUL ST,ST(3) FXCH ST(2) FSUBR QWORD PTR [EDI+8*ECX] FXCH ST(3) FMUL QWORD PTR [ESI+8*ECX+24] FXCH FSUBR QWORD PTR [EDI+8*ECX+8] FXCH ST(2) FSUBR QWORD PTR [EDI+8*ECX+16] FXCH FSUBR QWORD PTR [EDI+8*ECX+24] FXCH ST(3) FSTP QWORD PTR [EDI+8*ECX] FSTP QWORD PTR [EDI+8*ECX+16] FSTP QWORD PTR [EDI+8*ECX+8] FSTP QWORD PTR [EDI+8*ECX+24] ADD ECX, 4 ; increment index by 4 JNZ L3 ; loop L4: This loop takes 21 clock cycles for four operations. If N is not divisible by four then you will get up to three extra operations outside the loop. These extra operations are slower due to incomplete overlap and possibly mispredicted branches. A general rule for floating point loops is therefore that an unrolled solution is only optimal if N is very high or if the overlapping method (as in example 11) does not yield a stall-free solution. Discuss this article in the forums
See Also: © 1999-2011 Gamedev.net. All rights reserved. Terms of Use Privacy Policy
|