Read Prefetching

Because modern processors can reorder instructions, you would not expect this to be a problem - the processor should just continue executing other instructions that don't depend on this value, and then come back when the value has been read in. Unfortunately, that level of reordering just isn't possible yet, so you have to reorder the code yourself: issue the read early, do some other processing, and then process the value once it has arrived. This works - it is called "software pipelining" - but it requires a lot of registers, which makes it of limited use on the PC, where x86 gives you very few to play with. It is also quite difficult to work out in advance what will need to be read in.
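To make the idea concrete, here is a minimal sketch of software pipelining in plain C (my own illustration, not VectorC code - the function name is made up): the read of the next element is issued before the current element is tested, so the processor has useful work to do while the data arrives.

int find_five_pipelined (int *a)
    {
    int i;
    int next = a [0];              /* issue the first read early */
    for (i=0; i<SIZE; i++)
        {
        int current = next;        /* the value we asked for last time round */
        if (i+1 < SIZE)
            next = a [i+1];        /* start the next read now... */
        if (current == 5)          /* ...and only then do the real work */
            return i;
        }
    return -1;
    }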

On modern PC microprocessors (K6-2, Athlon, PIII) there is a "prefetch" instruction that reads data into the cache before you need it. Then, when you issue a read to the prefetched address, it is already available. Let's have a look at a simple example.


int test (int *a)
    {
    int i;
    for (i=0; i<SIZE; i++)         /* SIZE is the length of the array */
        {
        /* fetch the cache line 16 ints (64 bytes) ahead of the
           element we are about to test */
        __hint__((prefetch (&a [i+16])));
        if (a [i] == 5)
            return i;
        }
    return -1;
    }

Loop:
        prefetch 64[edi+eax*4]     ; fetch the line 64 bytes ahead (edi = a, eax = i)
        cmp [edi+eax*4],5          ; does a[i] == 5?
        je Return
        inc eax
        cmp eax,SIZE
        jl Loop
        ret

The first listing is C source with a VectorC 'hint' to force a prefetch. The second is assembly language that uses the 'prefetch' instruction to do the same thing. On my Athlon, the code with the prefetch runs nearly twice as fast as the code without - that's an amazing performance boost! I wish it were always so easy to double the speed of my code!

You can see that we are processing an array and that the prefetch instruction tells the processor to read in the value 64 bytes ahead in memory. We just pass an address - we have no control over how much data is prefetched - the amount is fixed by the size of the processor's cache lines.

But, you might be thinking, I have spotted a problem! If the loop doesn't find a '5' in the array 'a', then the prefetch will run past the end of the array and possibly into an illegal memory address. That could cause a crash in weird and unexpected situations. Well, this is the advantage of a prefetch: if an illegal address is given, nothing happens - no exception and no read. So you can safely use a prefetch instruction even when you are not sure the memory address is valid.
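If you are not using VectorC, most other compilers offer something similar. As a minimal sketch (an assumption on my part - the article itself only covers the VectorC hint), GCC's __builtin_prefetch can express the same loop:

int test (int *a)
    {
    int i;
    for (i=0; i<SIZE; i++)
        {
        /* ask for the line 16 ints (64 bytes) ahead; like the raw
           instruction, the builtin is only a hint and cannot fault */
        __builtin_prefetch (&a [i+16]);
        if (a [i] == 5)
            return i;
        }
    return -1;
    }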

Using VectorC's Automatic Prefetching

The C example above uses an explicit prefetch - which makes it obvious, but might be a lot of work to do in a large program. VectorC provides an alternative - add a hint to a variable or pointer to say that all accesses to it should be prefetched when appropriate.

int test (int __hint__((prefetch)) *a)
    {
    int i;
    for (i=0; i<SIZE; i++)
        {
        if (a [i] == 5)
            return i;
        }
    return -1;
    }

Problems with Prefetching

This looks easy, so why not prefetch everything all the time? Well, you first need a compiler that supports prefetching - VectorC can do that - or you need to write in assembly language. You then have the problem that there is more than one prefetch instruction, and different processors support different ones. Under AMD's 3DNow!, there are two prefetch instructions, "prefetch" and "prefetchw"; these are also available on the Athlon. Intel's Pentium III and above (including the latest Celerons) have some alternative prefetch instructions: "prefetchnta", "prefetcht0", "prefetcht1" and "prefetcht2" (the most common). These are also available on the Athlon.
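If you are writing the prefetches yourself rather than letting VectorC choose, you have to pick the right instruction for each target. The sketch below is my own illustration (using GCC-style inline assembly and made-up TARGET_* macros, nothing from VectorC) of how the variants might be wrapped behind a single pair of macros:

/* Hypothetical wrappers - the TARGET_* macros are assumptions you would
   define yourself when building for a particular processor. */
#if defined(TARGET_3DNOW)                   /* K6-2 and Athlon */
#define PREFETCH_READ(p)  __asm__ __volatile__ ("prefetch (%0)"   : : "r" (p))
#define PREFETCH_WRITE(p) __asm__ __volatile__ ("prefetchw (%0)"  : : "r" (p))
#elif defined(TARGET_SSE)                   /* Pentium III and above, Athlon */
#define PREFETCH_READ(p)  __asm__ __volatile__ ("prefetcht0 (%0)" : : "r" (p))
#define PREFETCH_WRITE(p) PREFETCH_READ(p)  /* Intel has no write prefetch */
#else
#define PREFETCH_READ(p)  ((void) 0)        /* older processors: do nothing */
#define PREFETCH_WRITE(p) ((void) 0)
#endif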

The final and most annoying problem is that prefetching sometimes slows your code down. On the AMD K6-2 this is almost always the case (although write prefetching, discussed below, can often give an improvement), so avoid read prefetching on the K6-2. Beyond that, test each section of code with and without prefetching, and try to use it only when reading large amounts of data.
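The easiest way to follow that advice is to time the routine both ways and compare. Here is a minimal sketch using the standard C clock() function (my own example - the article does not describe a particular timing method):

#include <time.h>

/* Run a routine many times and return the elapsed time in seconds.
   Call it once on a version built with prefetching and once on a
   version built without, using the same data, and compare. */
double time_routine (int (*fn)(int *), int *data, int repeats)
    {
    clock_t start, end;
    int r;
    start = clock ();
    for (r = 0; r < repeats; r++)
        fn (data);
    end = clock ();
    return (double) (end - start) / CLOCKS_PER_SEC;
    }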

VectorC will just ignore prefetch operations when compiling for a processor that doesn't support them.

Write Prefetching

Writing to memory is much faster than reading, because the data is simply sent on its way - the processor doesn't have to wait for it to reach memory. Unfortunately, it isn't quite that simple. In most situations, if you write to an area of memory there is a good chance that you will read from the same area soon, so the cache hardware loads the cache line you wrote to into the cache from main memory. This can slow things down, so the K6-2 has a "prefetchw" instruction that reads data into the cache and marks the cache line as "dirty", ready for writing. This is the one case where prefetching often does give a performance improvement on the K6-2. It is not applicable to any other processor.

Unfortunately, VectorC 1.0 doesn't support write prefetching. The new version (1.1 - which should be out soon) has support for this and is also more intelligent about read prefetching - i.e. not doing it! Under VectorC 1.1, you can use "__hint__ ((prefetchw (address)))". Write prefetching is not supported on Intel processors and has no useful effect on an Athlon.
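As a sketch of how the new hint might look in practice (my own example, based only on the syntax quoted above - I have not compiled it against 1.1), here is a loop that clears an array and prefetches each cache line for writing before the writes reach it:

void clear (int *a)
    {
    int i;
    for (i=0; i<SIZE; i++)
        {
        /* fetch the line 64 bytes ahead and mark it dirty, ready for
           the writes that are coming (syntax as quoted above) */
        __hint__((prefetchw (&a [i+16])));
        a [i] = 0;
        }
    }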

Write prefetching only gives a small speed improvement on K6-2s, so you might not consider it worthwhile.




Next: Non-Temporal Stores