Non-Temporal Stores

If you are going to overwrite a whole cache line, reading the old values in first is a waste of time. When writing out large amounts of data you overwrite many whole cache lines, so you need a way of writing to memory directly, bypassing the cache. This also has the useful side effect of not evicting useful data from the cache. This operation is called a "non-temporal store" and is supported on the Athlon and Pentium III. Unfortunately, only a limited number of non-temporal store instructions are available, so they can usually only be used after you have "vectorized" your code and are writing out 8 or 16 bytes at a time from MMX, 3DNow or Streaming SIMD registers. Here is a simple example of using non-temporal stores with VectorC and assembly language.
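The same kind of fill can also be sketched in plain C with compiler intrinsics instead of VectorC hints or hand-written MMX assembly. The version below is a minimal sketch under some assumptions that are not in the text above: it uses the SSE2 intrinsic `_mm_stream_si128` (16 bytes per store from a Streaming SIMD register), the function name `fill_nontemporal` is made up for illustration, the destination is assumed to be 16-byte aligned, and the element count is assumed to be a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_sfence */
#include <emmintrin.h>  /* SSE2: _mm_set1_epi32, _mm_stream_si128 */

/* Hypothetical helper: fill `count` 32-bit words with `value` using
 * non-temporal stores. Assumes dst is 16-byte aligned and count is a
 * multiple of 4, so every store covers one full 16-byte chunk. */
void fill_nontemporal(uint32_t *dst, size_t count, uint32_t value)
{
    assert(((uintptr_t)dst & 15u) == 0);  /* streaming store needs alignment */
    assert(count % 4 == 0);               /* no partial 16-byte chunk */

    /* Broadcast the value into all four 32-bit lanes of an XMM register. */
    const __m128i v = _mm_set1_epi32((int)value);

    for (size_t i = 0; i < count; i += 4) {
        /* Write 16 bytes with a non-temporal hint, bypassing the cache. */
        _mm_stream_si128((__m128i *)(dst + i), v);
    }

    /* Non-temporal stores are weakly ordered; fence before later code
     * (or another thread) relies on the data being visible. */
    _mm_sfence();
}
```

Every store in the loop is non-temporal, so the routine does not mix store types on the destination buffer. On 32-bit x86 targets the compiler typically needs SSE2 enabled (for example `-msse2` with GCC or Clang); on x86-64 it is available by default.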
This example writes the same value to a large array using non-temporal stores. In the VectorC version, I have added "__hint__((nontemporal))" to the definition of the pointer to tell VectorC that I want non-temporal stores. I have also written a loop that I know VectorC can vectorize with MMX. In the assembly language version, I have used an MMX register to write out the data. I could not have used a general-purpose register like eax, because there is no non-temporal store from general-purpose registers.

Problems with Non-Temporal Stores

When using non-temporal stores, there are a few potential problems. If you mix non-temporal stores with normal (cached) stores, you get a massive speed reduction. So be careful: this can be a problem when compiling with VectorC, because you don't have complete control over which type of store is used. It is worth checking the speed of routines compiled with this hint, or checking the assembly language produced. This will become a bit easier with the "Interactive Optimizer", which CodePlay will be releasing very soon.
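One rough way to check the speed of a routine is to time the cached and non-temporal variants against the same buffer. The harness below is only a sketch under stated assumptions: the names `fill_cached`, `fill_streamed` and `time_fill` are hypothetical, it uses SSE2 intrinsics rather than VectorC output, and it measures CPU time with the standard `clock()` function, which is coarse. Results depend heavily on buffer size, compiler options and the processor, so treat any numbers as indicative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#include <xmmintrin.h>  /* _mm_sfence */
#include <emmintrin.h>  /* SSE2: _mm_set1_epi32, _mm_stream_si128 */

typedef void (*fill_fn)(uint32_t *dst, size_t count, uint32_t value);

/* Ordinary (cached) stores; the compiler may or may not vectorize this. */
void fill_cached(uint32_t *dst, size_t count, uint32_t value)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = value;
}

/* Non-temporal stores only. Assumes dst is 16-byte aligned and
 * count is a multiple of 4. */
void fill_streamed(uint32_t *dst, size_t count, uint32_t value)
{
    assert(((uintptr_t)dst & 15u) == 0);
    assert(count % 4 == 0);
    const __m128i v = _mm_set1_epi32((int)value);
    for (size_t i = 0; i < count; i += 4)
        _mm_stream_si128((__m128i *)(dst + i), v);
    _mm_sfence();  /* make the streamed data visible before returning */
}

/* Returns the CPU seconds spent calling `fill` `repeats` times.
 * Each pass writes a different value (the pass index). */
double time_fill(fill_fn fill, uint32_t *dst, size_t count, int repeats)
{
    clock_t start = clock();
    for (int r = 0; r < repeats; r++)
        fill(dst, count, (uint32_t)r);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

To compare the two, call `time_fill(fill_cached, buf, n, reps)` and `time_fill(fill_streamed, buf, n, reps)` on a buffer much larger than the cache. It is still worth inspecting the generated assembly to confirm which kind of store the compiler actually emitted.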