Speeding up Memory Reads and Writes with VectorC
by Andrew Richards

As technology improves, the speed and complexity of chips increase. With microprocessors, both increased complexity and speed improvements lead to greater performance. But with memory chips, the result is a smaller speed improvement and increased storage. This means that memory gets progressively slower compared to microprocessors. The most common solution is to have one or more caches, which store the most recently accessed memory locations. This is fine if you only process small amounts of data, but if you are processing large images, sound effects or 3D models, then caches aren't so helpful. In fact, there can be a speed reduction, because unused data is being read in or written out.

Small variables that are accessed often will be in the cache, so you don't need to worry about them. The first problem I will deal with is when you want to read and process a large array - an image, for example. Because the image won't be in the cache (unless it is small and has recently been processed) the read instructions will be slow. If the processor has to stop and wait for the values to be read in before continuing then performance will be bad.

Read Prefetching

Because modern processors can reorder instructions, you would not expect this to be a problem - the processor should just continue executing other instructions that don't depend on this value - and then come back when the value has been read in. Unfortunately, that level of reordering just isn't possible, yet. So, you have to reorder the code yourself. Issue the read early, do some other processing, and then process the value when it has been read in. This would work - and is called "software pipelining" - except that it requires a lot of registers. Not much use on the PC, then. It is also quite difficult to work out in advance what needs to be read in for the future.

On modern PC microprocessors (K6-2, Athlon, PIII) there is a "prefetch" instruction that reads data into the cache before you need it. Then, when you issue a read to the prefetched address, it is already available. Let's have a look at a simple example.


int test (int *a)
    {
    int i;
    for (i=0; i<SIZE; i++)
        {
        __hint__((prefetch (&a [i+16])));
        if (a [i] == 5)
            return i;
        }
    return -1;
    }


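        xor eax,eax                     ; i = 0 (edi is assumed to hold the pointer a)
Loop:
        prefetch 64[edi+eax*4]          ; prefetch a[i+16], 64 bytes ahead
        cmp dword ptr [edi+eax*4],5     ; a[i] == 5 ?
        je Return                       ; found: eax already holds i
        inc eax
        cmp eax,SIZE
        jl Loop
        mov eax,-1                      ; not found
Return:
        ret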

The first listing is C source with a VectorC 'hint' to force a prefetch. The second listing is some assembly language that uses the 'prefetch' instruction to do the same thing. On my Athlon, the code with the prefetch runs nearly twice as fast as the code without - that's an amazing performance boost! I wish it was always so easy to double the speed of my code!

You can see that we are processing an array and that the prefetch instruction tells the processor to read in values 64 bytes (16 ints) ahead in memory. We just pass an address - we have no control over how much data is prefetched - the size is fixed by the size of the processor's cache lines.

But, you might be thinking, I have spotted a problem! If the loop doesn't find a '5' in the array 'a', then the prefetch will go past the end of the array and possibly into an illegal memory address. This could cause a crash in weird and unexpected situations. Well, that is the advantage of a prefetch - if an illegal address is used, nothing happens - no exception and no read. This way you can be sure that using a prefetch instruction is OK even if you are concerned that the memory address may be invalid.
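If you don't have VectorC to hand, the explicit prefetch above can be sketched with a different compiler. The following is only a minimal sketch, assuming GCC or Clang and their `__builtin_prefetch` builtin (not part of VectorC); the names `find_five` and `PREFETCH_AHEAD` are made up for illustration. The prefetch address is computed as an integer, so the code never forms an out-of-range pointer, and the prefetch itself is only a hint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning constant: 16 ints = 64 bytes ahead, as in the listings above. */
#define PREFETCH_AHEAD 16

/* Returns the index of the first 5 in a[0..n-1], or -1 if there is none. */
int find_five(const int *a, int n)
{
    int i;

    assert(a != NULL && n >= 0);
    for (i = 0; i < n; i++) {
#if defined(__GNUC__)
        /* Assumption: GCC/Clang builtin. The address is built with integer
           arithmetic; near the end of the array it points past the data,
           which is harmless because a prefetch is only a hint. */
        __builtin_prefetch((const void *)((uintptr_t)a +
                           ((size_t)i + PREFETCH_AHEAD) * sizeof *a));
#endif
        if (a[i] == 5)
            return i;
    }
    return -1;
}
```

On other compilers the `#if` drops the prefetch and the function still gives the same answer, which is a handy way of comparing speeds.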

Using VectorC's Automatic Prefetching

The C example above uses an explicit prefetch - which makes it obvious, but might be a lot of work to do in a large program. VectorC provides an alternative - add a hint to a variable or pointer to say that all accesses to it should be prefetched when appropriate.


int test (int __hint__((prefetch)) *a)
    {
    int i;
    for (i=0; i<SIZE; i++)
        {
        if (a [i] == 5)
            return i;
        }
    return -1;
    }

Problems with Prefetching

This looks easy, so why not prefetch everything all the time? Well, you first need a compiler that supports prefetching - VectorC can do that, or you need to write in assembly language. You then have the problem that there is more than one prefetch instruction, and different processors support different ones. Under AMD's 3DNow, there are 2 prefetch instructions: "prefetch" and "prefetchw". These instructions are also available on the Athlon. With Intel's Pentium III and above (including the latest Celerons) there are some alternative prefetch instructions: "prefetchnta", "prefetcht0", "prefetcht1" and "prefetcht2" (the most common). These instructions are also available on the Athlon.

The final and most annoying problem is that sometimes prefetching slows down your code. This is almost always true on the AMD K6-2 (however, write prefetching, discussed below, can often give improvements). So, avoid read prefetching on the K6-2. Also, test different sections of code with and without prefetching - try to use it only when reading large amounts of data.
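Testing with and without prefetching is easy to set up. Here is a rough sketch of a timing harness, again assuming GCC or Clang's `__builtin_prefetch` rather than VectorC; `sum_plain`, `sum_prefetch`, `time_sum` and `PREFETCH_AHEAD` are hypothetical names. It uses the standard `clock()` function, which is coarse, so use a large array and plenty of repetitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define PREFETCH_AHEAD 16 /* hypothetical distance, in ints */

/* Sums the array with ordinary reads. */
long sum_plain(const int *a, size_t n)
{
    long s = 0;
    size_t i;

    for (i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Same sum, but hints each read PREFETCH_AHEAD elements in advance. */
long sum_prefetch(const int *a, size_t n)
{
    long s = 0;
    size_t i;

    for (i = 0; i < n; i++) {
#if defined(__GNUC__)
        /* Assumption: GCC/Clang builtin; the address is computed as an integer. */
        __builtin_prefetch((const void *)((uintptr_t)a +
                           (i + PREFETCH_AHEAD) * sizeof *a));
#endif
        s += a[i];
    }
    return s;
}

/* Calls fn `reps` times and returns the elapsed CPU time in seconds.
   The last result is stored in *out so the two versions can be compared. */
double time_sum(long (*fn)(const int *, size_t),
                const int *a, size_t n, int reps, long *out)
{
    clock_t start = clock();
    long s = 0;
    int r;

    assert(fn != NULL && out != NULL);
    for (r = 0; r < reps; r++)
        s = fn(a, n);
    *out = s;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Call `time_sum` once with each function on the same data, check that both sums agree, and keep the prefetching version only if it is measurably faster on the processor you care about.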

VectorC will just ignore prefetch operations when compiling for a processor that doesn't support them.

Write Prefetching

Writing to memory is much faster because the data just gets sent out - the processor doesn't have to wait for the data to get to memory. Unfortunately, it isn't that simple. In most situations if you write to an area of memory, there is a good chance that you are going to read from that same area, soon. So, the cache hardware loads the cache line that you wrote to into the cache from main memory. This can slow things down, so the K6-2 has a "prefetchw" instruction that reads data into the cache and sets the cache line to "dirty". This is the only case of prefetching that often seems to give a performance improvement on the K6-2. This is not applicable to any other processor.

Unfortunately, VectorC 1.0 doesn't support write prefetching. The new version (1.1 - which should be out soon) has support for this and is also more intelligent about read prefetching - i.e. not doing it! Under VectorC 1.1, you can use "__hint__ ((prefetchw (address)))". Write prefetching is not supported on Intel processors and has no useful effect on an Athlon.
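For comparison, here is a hedged sketch of the same idea outside VectorC. It assumes GCC or Clang, where passing 1 as the second argument of `__builtin_prefetch` says that the address is about to be written. Whether that turns into "prefetchw" or an ordinary prefetch depends on the compiler and the target processor, so treat it as a hint only. The names `add_to_all` and `PREFETCH_AHEAD` are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_AHEAD 16 /* hypothetical distance, in ints */

/* Adds delta to every element of a[0..n-1]. */
void add_to_all(int *a, size_t n, int delta)
{
    size_t i;

    assert(a != NULL || n == 0);
    for (i = 0; i < n; i++) {
#if defined(__GNUC__)
        /* Assumption: GCC/Clang builtin. Second argument 1 = "expect a write".
           The address is computed as an integer; past the end it is harmless. */
        __builtin_prefetch((const void *)((uintptr_t)a +
                           (i + PREFETCH_AHEAD) * sizeof *a), 1);
#endif
        a[i] += delta;
    }
}
```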

Write prefetching only gives a small speed improvement on K6-2s, so you might not consider it worthwhile.

Non-Temporal Stores

If you are going to overwrite the whole cache line, then reading the old values in first is a massive waste of time. When writing out large amounts of data, you do want to overwrite a lot of whole cache lines, so you need a way of writing out to memory directly, without affecting the cache. This also has the useful side-effect of not removing useful data from the cache. This operation is called a "non-temporal store" and is supported on the Athlon and Pentium III.

Unfortunately, only a limited number of non-temporal stores are available, so it can usually only be used after you have "vectorized" your code and are writing out 8 or 16 bytes at a time from MMX, 3DNow or Streaming SIMD registers.

Here is a simple example of using non-temporal stores with VectorC and assembly language.


void writetest (int __hint__ ((nontemporal)) *a, int v)
    {
    int i;

    for (i=0; i<SIZE; i++)
        a [i] = v;
    }


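Loop:
        movntq [eax],mm0        ; non-temporal store of 8 bytes (two ints) from mm0
        add eax,8               ; advance the destination pointer by 8 bytes
        dec ecx                 ; ecx counts the 8-byte blocks still to write
        jne Loop
        ret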

This example writes out the same value to a large array using non-temporal stores. In the VectorC version, I have added "__hint__((nontemporal))" to the definition of the pointer to tell VectorC that I want non-temporal stores. I have also written a loop that I know VectorC can vectorize with MMX.

In the assembly language, I have used an MMX register to write out data. Unfortunately, I couldn't have used a general-purpose register like eax, because there is no non-temporal store using general-purpose registers.
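The same fill can be sketched in C without VectorC by using compiler intrinsics. Note that this swaps in a different technique from the listings above: it assumes a compiler that provides `<emmintrin.h>` and uses the SSE2 intrinsic `_mm_stream_si128` (16 bytes at a time) rather than the MMX `movntq` instruction. The function name `fill_nontemporal` is made up. Without SSE2 it falls back to ordinary stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Writes v to a[0..n-1], using non-temporal stores where possible. */
void fill_nontemporal(int *a, size_t n, int v)
{
    size_t i = 0;

    assert(a != NULL || n == 0);
#if defined(__SSE2__)
    /* _mm_stream_si128 needs a 16-byte aligned address, so use ordinary
       stores until the destination is aligned. */
    while (i < n && ((uintptr_t)(a + i) & 15u) != 0)
        a[i++] = v;

    __m128i vv = _mm_set1_epi32(v);        /* four copies of v */
    for (; i + 4 <= n; i += 4)
        _mm_stream_si128((__m128i *)(a + i), vv);

    /* Make the streamed data visible before any later stores. */
    _mm_sfence();
#endif
    /* Leftover elements (or everything, if SSE2 is not available). */
    for (; i < n; i++)
        a[i] = v;
}
```

The head and tail loops use normal stores only for the few elements that don't fill a whole 16-byte block, so the bulk of a large array still goes out without touching the cache.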

Problems with Non-Temporal Stores

When using non-temporal stores, there are a few potential problems. If you mix non-temporal stores with normal (cached) stores, then you get a massive speed reduction. So be careful - this can be a problem when compiling with VectorC, because you don't have complete control over what type of stores are used. It is worth checking the speed of routines compiled with this hint or checking the assembly language produced. This will become a bit easier with the "Interactive Optimizer" which CodePlay will be releasing very soon.

Recommendations

I strongly recommend that you experiment with some of these techniques. You can get large performance improvements for relatively little work. Prefetching is supported by a lot of processors and is quite easy to use. It also isn't a disaster if you get it wrong. Non-temporal stores, however, are much harder to get right. If you get these wrong, then you can get disastrous results. But when you get it right, the results are amazing (remember that you are stopping the caches from reading in all the memory that you are overwriting - a stupid and time-consuming thing to do).

Both techniques are supported well on only the latest processors - but these are now perfectly affordable. The Duron and newer Celerons support both prefetching and non-temporal stores.

Compile your code with VectorC for different processors and run the appropriate version for your user's computer. Cache sizes, available instructions and the situations in which prefetching is beneficial are different on each processor. Aaaaaarrrrrrgghhh! But then that's PCs for you.
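Picking the right version at run time can be as simple as a function pointer. The sketch below is only an illustration under some assumptions: it uses GCC or Clang's `__builtin_cpu_supports` on x86 to detect the processor (VectorC is not involved), and `fill_plain`, `fill_tuned`, `cpu_has_sse` and `choose_fill` are hypothetical names. Here `fill_tuned` is just a stand-in; in a real program it would be the same routine compiled for the newer processor.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*fill_fn)(int *a, size_t n, int v);

/* Baseline version: ordinary cached stores, runs on anything. */
void fill_plain(int *a, size_t n, int v)
{
    size_t i;

    assert(a != NULL || n == 0);
    for (i = 0; i < n; i++)
        a[i] = v;
}

/* Stand-in for a version built for newer processors (for example with
   prefetching or non-temporal stores enabled). */
void fill_tuned(int *a, size_t n, int v)
{
    fill_plain(a, n, v);
}

/* Returns nonzero if the running processor reports SSE support.
   Assumption: GCC/Clang builtins on x86; elsewhere it reports 0. */
int cpu_has_sse(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse") != 0;
#else
    return 0;
#endif
}

/* Chooses the version to use on this machine. */
fill_fn choose_fill(void)
{
    return cpu_has_sse() ? fill_tuned : fill_plain;
}
```

Call `choose_fill` once at start-up and keep the returned pointer, so the check isn't repeated inside your inner loops.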

Naming

The names are not my fault. I suppose prefetching is a reasonably sensible name, but what about "non-temporal store"? Intel seems to be going in for longer and longer names. MMX was nice and simple, but "Internet Streaming SIMD Extensions" is ridiculous. "Non-temporal" I suppose means that there is a long time between writing and reading the same bit of memory. But I think that "uncached write" would be a much more self-explanatory name. Maybe I should start a campaign to change the name. Write some petitions. Get signatures. Lobby my member of parliament. Or maybe not.

Date this article was posted to GameDev.net: 8/16/2000
(Note that this date does not necessarily correspond to the date the article was written)