80x86 Optimizations

 Journal:   Dr. Dobb's Journal  March 1991 v16 n3 p16(8)
 -----------------------------------------------------------------------------
 Title:     80x86 optimization: aim down the middle and pray. (80x86 family of
            microprocessors) (tutorial)
 Author:    Abrash, Michael.
 AttFile:    Program:  80X86.ASC  Source code listing.

 Summary:   Optimizing code for 8088, 80286, 80386 and 80486 microprocessors
            is difficult because the chips use significantly different memory
            architectures and instruction execution times.  Code cannot be
            optimized for the 80x86 family; rather, code must be designed to
            produce good performance on a range of systems or optimized for
            particular combinations of processors and memory.  Programmers
            must avoid the unusual instructions supported by the 8088, which
            have lost their performance edge in subsequent chips.  String
            instructions should be used but not relied upon.  Registers should
            be used rather than memory operations.  Branching is also slow for
            all four processors.  Memory accesses should be aligned to improve
            performance.  Generally, optimizing an 80486 requires exactly the
            opposite steps as optimizing an 8088.
 -----------------------------------------------------------------------------
 Descriptors..
 Company:   Intel Corp. (Products).
 Ticker:    INTC.
 Product:   Intel 80286 (Microprocessor) (Programming)
            Intel 80386 (Microprocessor) (Programming)
            Intel 80486 (Microprocessor) (Programming)
            Intel 8088 (Microprocessor) (Programming).
 Topic:     Microprocessors
            Optimization
            Programming
            Tutorial
            Assembly Language
            Guidelines
            Type-In Programs
            Microcode
            Processor Architecture.
 Feature:   illustration
            graph.
 Caption:   Official and actual cycles per binary-to-hex ASCII conversion.
            (graph)
            Actual performance in microseconds of two solutions to a problem.
            (graph)
            Actual performance of three clearing approaches across the 80x86
            family. (graph)

 -----------------------------------------------------------------------------
 Full Text:

 Optimization

 Picture this: You're an archer aiming at a target 100 feet away.  A strong
 wind comes up and pushes each arrow to the left as it flies.  Naturally, you
 compensate by aiming farther to the right.  That's what it's like optimizing
 for the 8088; once you learn to compensate for the strong but steady effects
 of the prefetch queue and the 8-bit bus, you can continue merrily on your
 programming way.

 Now the wind starts gusting unpredictably.  There's no way to compensate, so
 you just aim for the bull's-eye and hope for the best.  That's what it's like
 writing code for good performance across the entire 80x86 family, or even for
 the 286/386SX/386 heart of today's market.  You just aim down the middle and
 pray.

 The New World of the 80x86

 In the beginning, the 8088 was king, and that was good.  The optimization
 rules weren't obvious, but once you learned them, you could count on them
 serving you well on every computer out there.

 Not so these days.  There are four major processor types--8088, 80286, 80386,
 and 80486--with a bewildering array of memory architectures: cached (in
 several forms), page mode, static-column RAM, interleaved, and, of course,
 the 386SX, with its half-pint memory interface.  The processors offer wildly
 differing instruction execution times, and memory architectures warp those
 times further by affecting the speed of instruction fetching and access to
 memory operands.  Because actual performance is a complex interaction of
 instruction characteristics, instruction execution times, and memory access
 speed, the myriad processor-memory combinations out there make "exact
 performance" a meaningless term.  A specific instruction sequence may run at
 a certain speed on a certain processor in a certain system, but that often
 says little about the performance of the same instructions on a different
 processor, or even on the same processor with a different memory system.  The
 result: Precise optimization for the general PC market is a thing of the
 past.  (We're talking about optimizing for speed here; optimizing for size is
 the same for all processors so long as you stick to 8088-compatible code.)

 So there is no way to optimize performance ideally across the 80x86 family.
 An optimization that suits one processor beautifully is often a dog on
 another.  Any 8088 programmer would instinctively replace:

 DEC  CX
 JNZ  LOOPTOP

 with:

 LOOP  LOOPTOP

 because LOOP is significantly faster on the 8088.  LOOP is also faster on the
 286.  On the 386, however, LOOP is actually two cycles slower than DEC/JNZ.
 The pendulum swings still further on the 486, where LOOP is about twice as
 slow as DEC/JNZ--and, mind you, we're talking about what was originally
 perhaps the most obvious optimization in the entire 80x86 instruction set.

 In short, there is no such thing as code that's truly optimized for the
 80x86.  Instead, code is either optimized for specific processor-memory
 combinations, or aimed down the middle, designed to produce good performance
 across a range of systems.  Optimizing for the 80x86 family by aiming down
 the middle is quite different from optimizing for the 8088, but many PC
 programmers are inappropriately still applying the optimization lore they've
 learned over the years on the PC (or AT).  The world has changed, and many of
 those old assumptions and tricks don't hold true anymore.

 You will not love the new world of 80x86 optimization, which is less precise
 and offers fewer clever tricks than optimizing for the 8088 alone.  Still,
 isn't it better to understand the forces affecting your code's performance
 out in the real world than to optimize for a single processor and hope for
 the best?

 Better, yes.  As much fun, no.  Optimizing for the 8088 was just about as
 good as it gets.  So it goes.

 Optimization Rules for a New World

 So, how do you go about writing fast code nowadays?  One way is to write
 different versions of critical code for various processors and memory access
 speeds, selecting the best version at runtime.  That's a great solution, but
 it requires an awful lot of knowledge and work.

 An alternative is to optimize for one particular processor and settle for
 whatever performance you get on the others.  This might make sense when the
 8088 is the target processor because it certainly needs the optimization more
 than any other processor.  However, 8088 optimization works poorly at the
 upper end of the 80x86 family.

 Nowadays, though, most of us want to optimize for the 286 and 386 systems
 that dominate the market, or across all 80x86 processors, and that's a tough
 nut to crack.  The 286 and 386 come in many configurations, and you can be
 sure, for example, that a 386SX, an interleaved 386, and a cached 386 have
 markedly different performance characteristics.  There are, alas, no hard and
 fast optimization rules that apply across all these environments.

 My own approach to 80x86 optimization has been to develop a set of general
 rules that serve reasonably well throughout the 80x86 line, especially the
 286 and 386, and to select a specific processor (in my case a cached 386, for
 which cycle times tend to be accurate) to serve as the tiebreaker when
 optimization details vary from one processor to another.  (Naturally, it's
 only worth bothering with these optimizations in critical code.)  The rules
 I've developed are:

 * Avoid accessing memory operands; use the registers to the max.

 * Don't branch.

 * Use string instructions, but don't go much out of your way to do so.

 * Keep memory accesses to a minimum by avoiding memory operands and keeping
 instructions short.

 * Align memory accesses.

 * Forget about many of those clever 8088 optimizations, using oddball
 instructions such as DAA and XLAT, that you spent years learning.

 Next I'll discuss each of these rules in turn in the context of
 8088-compatible real mode, which is still the focus of the 80x86 world.
 Later, I'll touch on protected mode.

 Let's start by looking at the last--and most surprising--rule.

 Kiss Those Tricks Goodbye

 To skilled assembly language programmers, the 8088 is perhaps the most
 wonderful processor ever created, largely because the instruction set is
 packed with odd instructions that are worthless to compilers but can work
 miracles in the hands of clever assembly programmers.  Unfortunately, each
 new generation of the 80x86 has rendered those odd instructions and marvelous
 tricks less desirable.  As the execution time for the commonly used
 instruction ADD BX, 4 has gone down from four cycles (8088) to three cycles
 (286) to two cycles (386) to one cycle (486), the time for the less
 frequently used instruction CBW has gone from two cycles (8088 and 286) up to
 three cycles (386 and 486)!

 Consider this ancient optimization for converting a binary digit to hex
 ASCII:

 ADD  AL,90H
 DAA
 ADC  AL,40H
 DAA

 Now consider the standard alternative:

 ADD  AL,'0'
 CMP  AL,'9'
 JBE  HaveAscii
 ADD  AL,'A'-('9'+1)
 HaveAscii:

 As Figure 1 indicates, the standard code should be slower on an 8088 or 286,
 but faster on a 386 or a 486--and real-world tests confirm those results, as
 shown in Figure 2.  (All "actual performance" timings in this article were
 performed with the Zen timer from Zen of Assembly Language; see "References"
 for details.  The systems used for the tests were: 8088, standard 4.77 MHz PC
 XT; 80286, standard one-wait-state, 8 MHz PC AT; 386SX, 16 MHz noncached;
 80386, 20 MHz externally cached with all instructions and data in external
 cache for all tests except Listings One and Two; 80486, 25 MHz internally
 cached, with all instructions and data in internal cache for all tests except
 Listings One and Two.)

 In other words, this nifty, time-tested optimization is an anti-optimization
 on the 386 and 486.

 Why is this?  On the 386, DAA--a rarely used instruction--takes four cycles,
 and on the 486 it takes two cycles, in both cases twice as long as the more
 common instructions CMP and ADD; in contrast, on the 8088 all three
 instructions are equally fast at four cycles.  Also, the instruction-fetching
 advantage that the 1-byte DAA provides on the 8088 means nothing on a cached
 386.
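
 To see how the standard sequence is typically put to work, here's a quick
 sketch (mine, not one of the listings that accompany this article; the labels
 and the output pointer in DI are assumptions) that converts the whole byte in
 AL to two hex ASCII characters at [DI]:

         MOV  AH,AL           ;save a copy of the byte
         SHR  AL,1            ;isolate the high nibble (the 8088 has no
         SHR  AL,1            ; SHR AL,4 form, so shift one bit at a time)
         SHR  AL,1
         SHR  AL,1
         ADD  AL,'0'          ;digits 0-9 convert directly
         CMP  AL,'9'
         JBE  HighDigitDone
         ADD  AL,'A'-('9'+1)  ;adjust for A-F
 HighDigitDone:
         MOV  [DI],AL         ;store the high hex digit
         MOV  AL,AH           ;back to the original byte
         AND  AL,0FH          ;isolate the low nibble
         ADD  AL,'0'
         CMP  AL,'9'
         JBE  LowDigitDone
         ADD  AL,'A'-('9'+1)
 LowDigitDone:
         MOV  [DI+1],AL       ;store the low hex digit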

 Nor is this an isolated example.  Most oddball instructions, from AAA to
 XCHG, have failed to keep pace with the core instructions--ADC, ADD, AND,
 CALL, CMP, DEC, INC, Jcc, JMP, LEA, MOV, OR, POP, PUSH, RET, SBB, SUB, TEST,
 and XOR--during the evolution from 8088 to 486.  As we saw earlier, even LOOP
 lags behind on the 386 and 486.  Check your favorite tricks for yourself;
 they might or might not hold up on the 386, but will most likely be
 liabilities on the 486.  Sorry, but I just report the news, and the news is:
 Kiss most of those tricks goodbye as the 386 and 486 come to dominate the
 market.  (This means that hand-optimization in assembly language yields less
 of a performance boost nowadays than it did when the 8088 was king; the
 improvement is certainly significant, but rarely in the 200-500 percent range
 anymore.  Sic transit gloria mundi.)  Most startling of all, string
 instructions lose much of their allure as we move away from the 8088, hitting
 bottom on the 486.

 The 486: All the Rules Change

 The 486 represents a fundamental break with 8088-style optimization.
 Virtually all the old rules fail on the 486, where, incredibly, a move to or
 from memory often takes just one cycle, but exchanging two registers takes
 three cycles.  The nonbranching core instructions mentioned earlier take only
 one cycle on the 486 when operating on registers; MOV can, under most
 conditions, access memory in one cycle; and CALL and JMP take only three
 cycles, given a cache hit.  However, noncore instructions take considerably
 longer.  XLAT takes four cycles; even STC and CLC take two cycles each.  The
 486's highly asymmetric execution times heavily favor core instructions and
 defeat most pre-486 optimizations.

 Core instructions do have a weakness on the 486.  While 486 MOVs involving
 memory are remarkably fast, accessing memory for an operand to OR, ADD, or
 the like costs cycles.  Even with the 8K internal cache, memory is not as
 fast as registers, except when MOV is used (and sometimes not even then), so
 registers are still preferred operands.  (AND [BX],1 is fast, at only three
 cycles, but AND BX,1 takes only one cycle--three times as fast.)
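
 The practical upshot: when a memory value needs more than one operation, load
 it into a register, do the work there, and write it back once.  A minimal
 sketch (the bit masks here are purely illustrative):

         MOV  AX,[BX]         ;read the memory operand once
         AND  AX,0FFFEH       ;clear bit 0 in the register (1 cycle on the 486)
         OR   AX,8000H        ;set bit 15 in the register (1 cycle on the 486)
         MOV  [BX],AX         ;write the result back once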

 OUT should be avoided whenever possible on the 486, and likewise for IN.  OUT
 takes anywhere from 10 to 31 cycles, depending on processor mode and
 privileges, more than an order of magnitude slower than MOV.  The lousy
 performance of OUT -- true on the 386 as well -- has important implications
 for graphics applications.

 String instructions are so slow on the 486 that you should check cycle times
 before using any string instruction other than the always superior REP MOVS.
 For example, LODSB takes five cycles on the 486, but MOV AL,[SI]/INC SI takes
 only two cycles; likewise for STOSB and MOV [DI],AL/INC DI.  Listing One
 (page 73) uses LODSB/STOSB to copy a string, converting lowercase to
 uppercase while copying; Listing Two (page 73) uses MOV/INC instead.  Figure
 3 summarizes the performance of the two routines on a variety of processors;
 note the diminishing effectiveness of string instructions on the newer
 processors.  Think long and hard before using string instructions other than
 REP MOVS on the 486.
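
 To give the flavor of the comparison, here's a rough sketch along the lines
 of Listing Two (the listing itself isn't reproduced in this text, so the
 register assignments and labels are illustrative assumptions).  It copies CX
 bytes from DS:SI to ES:DI, uppercasing as it goes, with no string
 instructions:

 CopyLoop:
         MOV  AL,[SI]         ;in place of LODSB
         INC  SI
         CMP  AL,'a'          ;lowercase letter?
         JB   NotLower
         CMP  AL,'z'
         JA   NotLower
         SUB  AL,'a'-'A'      ;convert to uppercase
 NotLower:
         MOV  ES:[DI],AL      ;in place of STOSB
         INC  DI
         DEC  CX
         JNZ  CopyLoop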

 Optimization for the 486 is really a whole new ball game.  When optimizing
 across the 80x86 family, the 486 will generally be the least of your worries
 because it is so much faster than the rest of the family; anything that runs
 adequately on any other processor will look terrific on the 486.  Still, the
 future surely holds millions of 486s, so it wouldn't hurt to keep one eye on
 the 486 as you optimize.

 String Instructions: Fading Stars

 On the 8088, string instructions are so far superior to other instructions
 that it's worth going to great lengths to use them, but they lose much of
 that status on newer processors.  One of the best things about string
 instructions on the 8088 is that they require little instruction fetching,
 because they're 1-byte instructions and because of the REP prefix; however,
 instruction fetching is less of a bottleneck on newer processors.  String
 instructions also have superior cycle times on the 8088, but that advantage
 fades on the 286 and 386 as well.

 On the 286, string instructions (when they do exactly what you need) are
 still clearly better than the alternatives.  On the 386, however, some string
 instructions are, even under ideal circumstances, the best choice only by a
 whisker, if at all.  For example, since Day One, clearing a buffer has been
 done with REP STOS.  That's certainly faster than the looping MOV/ADD
 approach shown in Listing Three (page 73), but on the 386 and 486 it's no
 faster than the unrolled loop MOV/ADD approach of Listing Four (page 73), as
 shown in Figure 4.  (Actually, in my tests REP STOS was a fraction of a cycle
 slower on the 386, and fractionally faster on the 486.)  REP STOS is much
 easier to code and more compact, so it's still the approach of choice for
 buffer clearing--but it's not necessarily fastest on a 486 or fast-memory
 386.  This again demonstrates just how unreliable the old optimization rules
 are on the newer processors.
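
 For the curious, an unrolled clearing loop along the lines of Listing Four
 might look something like the following sketch (the unrolling factor, the
 register usage, and the assumption that the word count in CX is a multiple of
 four are illustrative choices, not necessarily the listing's).  It clears CX
 words at ES:DI without string instructions:

         SUB  AX,AX           ;the value to store (zero)
         SHR  CX,1            ;CX = number of four-word groups
         SHR  CX,1
 ClearLoop:
         MOV  ES:[DI],AX
         MOV  ES:[DI+2],AX
         MOV  ES:[DI+4],AX
         MOV  ES:[DI+6],AX
         ADD  DI,8
         DEC  CX
         JNZ  ClearLoop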

 The point is not that you shouldn't use string instructions on the 386.  REP
 MOVS is the best way to move data, and the other string instructions are
 compact and usually faster, especially on uncached systems.  However, on the
 386 it's no longer worth going to the trouble of juggling registers and
 reorganizing data structures to use string instructions.  Furthermore, when
 you truly need maximum performance on the 386, check out nonstring
 instructions in unrolled loops.  It goes against every lesson learned in a
 decade of 8088 programming, but avoiding string instructions sometimes pays
 on the 386.

 The Siren Song of Memory Accesses

 Finally, here's a rule that's constant from the 8088 to the 486: Use the
 registers.  Avoid memory.

 Don't be fooled by the much faster memory access times of the 286 and 386.
 The effective address calculation time of the 8088 is mostly gone, so MOV
 AX,[BX] takes only five cycles on the 286, and ADD [SI],DX takes only seven
 on the 386.  That's so much faster than the 17 and 29 cycles, respectively,
 that they take on the 8088 that you might start thinking that memory is
 pretty much interchangeable with registers.

 Think again.  MOV AX,BX is still more than twice as fast as MOV AX,[BX] on
 the 286, and ADD SI,DX is more than three times as fast as ADD [SI],DX on the
 386.  Memory operands can also reduce performance by slowing instruction
 fetching.  Memory is fast on the 286 and 386.  Registers are faster.  Use
 them as heavily as possible.
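
 For example, a loop that updates a memory variable on every pass is better
 written to keep the value in a register and store it once at the end.  A
 minimal sketch (Total is a hypothetical word variable declared elsewhere):

         MOV  AX,[Total]      ;load the running total once
 SumLoop:
         ADD  AX,[SI]         ;accumulate in a register inside the loop
         ADD  SI,2
         DEC  CX
         JNZ  SumLoop
         MOV  [Total],AX      ;store the result once, after the loop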

 Don't Branch

 Here's another rule that stays the same across the 80x86 family: Don't
 branch.  Branching suffers on the 8088 from lengthy cycle counts and emptying
 the prefetch queue.  Emptying the prefetch queue is a lesser but nonetheless
 real problem in the post-8088 world, and the cycle counts of branches are
 still killers.  As Figure 4 indicates, it pays to eliminate branches by
 unrolling loops or using repeated string instructions.

 Modern-Day Instruction Fetching

 Instruction fetching is the bugbear of 8088 performance; the 8088 simply
 can't fetch instruction bytes as quickly as it can execute them, thanks to
 its undersized bus.  Minimizing all memory accesses, including instruction
 fetches, is paramount on the 8088.

 Instruction fetching is less of a problem nowadays.  Figure 5 shows the
 maximum rates at which various processors can fetch instruction bytes;
 clearly, matters have improved considerably since the 8088, although
 instructions also execute in fewer cycles on the newer processors.  Fetching
 problems can occur on any 80x86 processor, even the 486, but the only
 processors other than the 8088 that face major instruction fetching problems
 are the one-wait-state 286 and the 386SX, although uncached 386s may also
 outrun memory.  However, the problems here are different from and less
 serious than with the 8088.

 Consider: An 8088 executes a register ADD in three cycles, but requires eight
 cycles to fetch that instruction, a fetch/execute ratio of 2.67.  A
 one-wait-state 286 requires three cycles to fetch a register ADD and executes
 it in two cycles, a ratio of 1.5.  A 386SX can fetch a register ADD in two
 cycles, matching the execution time nicely, and a cached 386 can fetch two
 register ADDs in the two cycles it takes to execute just one.  For
 register-only code--the sort of code critical loops should contain--the 386
 generally runs flat out, and the 286 and 386SX usually (not always, but
 usually) outrun memory by only a little at worst.  Greater fetching problems
 can arise when working with large instructions or instruction sequences that
 access memory nonstop, but those are uncommon in critical code.  This is a
 welcome change from the 8088, where small, register-only instructions tend to
 suffer most from inadequate instruction fetching.

 Also, uncached 386 systems often use memory architectures that provide
 zero-wait-state performance when memory is accessed sequentially.  In
 register-only code, instruction fetches are the only memory accesses, so
 fetching proceeds at full speed when the registers are used heavily.

 So, is instruction fetching a problem in the post-8088 world?  Should
 instructions be kept short?

 Yes.  Smaller instructions can help considerably on the one-wait-state 286
 and on the 386SX.  Not as much as on the 8088, but it's still worth the
 trouble.  Even a cached 386 can suffer from fetching problems, although
 that's fairly uncommon.  For example, when several MOV WORD PTR [MemVar],0
 instructions are executed in a row, as might happen when initializing memory
 variables, performance tends to fall far below rated speed, as shown in
 Figure 6.  The particular problem with MOV WORD PTR [MemVar],0 is that it
 executes in just two (386) or three (286) cycles, yet has both an addressing
 displacement field and a constant field.  This eats up memory bandwidth by
 requiring more instruction fetching.  It also accesses memory, eating up
 still more bandwidth.  We'll see this again, and worse, when we discuss
 protected mode.
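
 One way to ease the pressure, sketched below (the variable names are
 hypothetical), is to put the constant in a register once, so each store
 carries no immediate field and there's that much less to fetch:

         SUB  AX,AX           ;the constant (zero) goes in a register once
         MOV  [MemVar1],AX    ;no 2-byte constant field per store
         MOV  [MemVar2],AX
         MOV  [MemVar3],AX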

 Generally, though, post-8088 processors with fast memory systems and
 full-width buses run most instructions at pretty near their official cycle
 times; for these systems, optimization consists mostly of counting cycles.
 Slower memory or constricted buses (as in the 386SX) require that memory
 accesses (both instruction fetches and operand accesses) be minimized as
 well.  Fortunately, the same sort of code--register only--meets both
 requirements.

 Use the registers.  Avoid constants.  Avoid displacements.  Don't branch.
 That's the big picture.  Don't sweat the details.

 Alignment: The Easy Optimization

 The 286, 386SX, and 386 take twice as long to access memory words at odd
 addresses as at even addresses.  The 386 takes twice as long to access memory
 dwords at addresses that aren't multiples of four as those that are.  You
 should use ALIGN 2 to word align all word-sized data, and ALIGN 4 to dword
 align all data that's accessed as a dword operand, as in:

 ALIGN  4
 MemVar  dd  ?
 :
 MOV EAX,[MemVar]

 Alignment also applies to code; you may want to word or dword align the
 starts of procedures, labels that can only be reached by branching, and the
 tops of loops.  (Code alignment matters only at branch targets, because only
 the first instruction fetch after a branch can suffer from nonalignment.)
 Dword alignment of code is optimal, and will help on the 386 even in real
 mode, but word alignment will produce nearly as much improvement as dword
 alignment without wasting nearly as many bytes.
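
 A loop top might be aligned like this (a brief sketch; the loop body is just
 an example):

         ALIGN  2             ;word align the branch target so the first
 LoopTop:                     ; instruction fetch after the branch is aligned
         ADD  AX,[SI]
         ADD  SI,2
         DEC  CX
         JNZ  LoopTop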

 Alignment improves performance on many 80x86 systems without hindering it on
 any.  Recommended.

 Protected Mode

 There are two sorts of protected mode, 16-bit and 32-bit.  The primary
 optimization characteristic of 16-bit protected mode (OS/2 1.X, Rational DOS
 Extender) is that it takes an ungodly long time to load a segment register
 (for example, MOV ES,AX takes 17 cycles on a 286) so load segment registers
 as infrequently as possible in 16-bit protected mode.
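
 In particular, hoist segment loads out of loops.  A sketch (BufferSeg is a
 hypothetical variable holding the destination selector):

         MOV  AX,[BufferSeg]  ;get the selector of the target buffer
         MOV  ES,AX           ;load ES once, outside the loop
 XferLoop:
         MOV  AL,[SI]
         MOV  ES:[DI],AL
         INC  SI
         INC  DI
         DEC  CX
         JNZ  XferLoop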

 Optimizing for 32-bit protected mode (OS/2 2.0, SCO Unix, Phar Lap DOS
 Extender) is another matter entirely.  Typically, no segment loads are needed
 because of the flat address space.  However, 32-bit protected mode code can
 be bulky, and that can slow instruction fetching.  Constants and addressing
 displacements can be as large as 4 bytes each, and an extra byte, the SIB
 byte, is required whenever two 32-bit registers are used to address an
 operand or scaled addressing is used.  So, for example, MOV DWORD PTR
 [MemVar],0 is a 10-byte instruction in 32-bit protected mode (one opcode
 byte, one mod-reg-rm byte, a 4-byte displacement, and a 4-byte constant).  The
 instruction is supposed to execute in two cycles, but even a 386 needs four
 to six cycles to fetch it, plus another two cycles to access memory; a few
 such instructions in a row can empty the prefetch queue and slow performance
 considerably.  The slowdown occurs more quickly and is more acute on a 386SX,
 which needs 14 cycles to perform the memory accesses for this nominally
 2-cycle instruction.

 Code can get even larger when 32-bit instructions are executed in 16-bit
 segments, adding prefix bytes.  (Avoid prefix bytes if you can; they increase
 instruction size and can cost cycles.)  Figure 7 shows actual versus nominal
 cycle times of multiple MOV DWORD PTR [EBX*4+MemVar],0 instructions running
 in a 16-bit segment.  Although cache type (write-back, write-through) and
 main-memory write time also affect the performance of stores to memory, there
 is clearly a significant penalty for using several large (in this case,
 13-byte) instructions in a row.

 Fortunately, this is a worst case, easily avoided by keeping constants and
 displacements out of critical loops.  For example, you should replace:

 ADDLOOP:
 MOV  DWORD PTR BaseTable[EDX+EBX],0
 ADD  EBX,4
 DEC  ECX
 JNZ  ADDLOOP

 with:

 LEA  EBX,BaseTable[EDX+EBX]
 SUB  EAX,EAX
 ADDLOOP:
 MOV  [EBX],EAX
 ADD  EBX,4
 DEC  ECX
 JNZ  ADDLOOP

 Better yet, use REP STOSD or unroll the loop!
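
 The REP STOSD version reduces to something like this sketch (it assumes
 ES:EDI already points to the start of the table, ECX holds the dword count,
 and the direction flag is clear):

         SUB  EAX,EAX         ;the value to store (zero)
         REP  STOSD           ;clear ECX dwords at ES:[EDI]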

 Happily, register-only instructions are no larger in 32-bit protected mode
 than otherwise and run at or near their rated speed in 32-bit protected mode
 on all processors.  All in all, in protected mode it's more important than
 ever to avoid large constants and displacements and to use the registers as
 much as possible.

 Conclusion

 Optimization across the 80x86 family isn't as precise as 8088 optimization,
 and it's a lot less fun, with fewer nifty tricks and less spectacular
 speed-ups.  Still, familiarity with the basic 80x86 optimization rules can
 give you a decided advantage over programmers still laboring under the
 delusion that the 286, 386, and 486 are merely faster 8088s.

 References

 Abrash, Michael.  Zen of Assembly Language.  Glenview, Ill.: Scott, Foresman,
 1990.

 Barrenechea, Mark.  "Peak Performance: On to the 486."  Programmer's Journal,
 (November-December 1990).

 Paterson, Tim.  "Assembly Language Tricks of the Trade."  Dr. Dobb's Journal
 (March 1990).

 Turbo Assembler Quick Reference Guide.  Borland International, 1990.

 i486 Microprocessor Programmer's Reference Manual.  Intel Corporation, 1989.

 80386 Programmer's Reference Manual.  Intel Corporation, 1986.

 Microsystems Components Handbook: Microprocessors Volume I.  Intel
 Corporation, 1985.
