Cache Manipulation

Several mechanisms have been put into place to squeeze optimal throughput from the processor. One method of cache manipulation, discussed in Chapter 10, "Branching," is Intel's branch hint, which tells the processor the expected logic flow through a branch when it runs counter to the static prediction logic. Another mechanism is a hint about cache behavior, which gives the processor insight into how a particular piece of code utilizes memory. Here is a brief review of some terms that have already been discussed:

  • Temporal data: Memory that will be accessed multiple times and therefore needs to be loaded into a cache for better throughput.

  • Non-temporal hint: An indicator to the processor that memory only requires a single access (one shot), such as copying a block of memory or storing the result of a calculation that will not be needed again for a while, so there is no point in writing it into the cache. Since the memory access does not have to read and load a cache line, the code can be faster (see the sketch following this list).
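
For example, here is a minimal sketch (assuming SSE2 support and hypothetical labels: esi = source, edi = destination, dwordCount = number of dwords) of a block copy whose destination is written with a non-temporal store so the copied data does not displace anything already in the cache:

 mov   ecx,dwordCount       ; hypothetical count of dwords to copy
CopyLoop:
 mov   eax,[esi]            ; normal (cached) read of the source
 movnti [edi],eax           ; non-temporal write bypasses the cache
 add   esi,4
 add   edi,4
 dec   ecx
 jnz   CopyLoop
 sfence                     ; make the non-temporal stores globally visible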

For speed and efficiency, when memory is accessed for read or write, a cache line containing that data (whose length is dependent upon manufacturer and model) is copied from system memory into high-speed cache memory, and the processor performs its read/write operations on that cache memory. When a modified cache line is evicted or invalidated, the line must be copied back to system memory; this second stage is called a "write back." In a multiprocessor system this occurs frequently, because the processors do not share their internal caches.
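
As a minimal sketch of forcing a write back explicitly (assuming SSE2-class support for the clflush and mfence instructions, and a hypothetical pointer variable bufPtr):

 mov   eax,bufPtr           ; hypothetical pointer to a buffer
 mov   dword ptr [eax],1234 ; the store marks the cache line dirty
 clflush byte ptr [eax]     ; write the dirty line back to system memory and invalidate it
 mfence                     ; ensure the flush completes before later memory accesses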

Cache Sizes

Different processors have different cache sizes for data and for code. These are dependent upon processor model, manufacturer, etc., as shown below:

CPU          L1 Cache (Data/Code)    L2 Cache
Celeron      16 KB / 16 KB           256 KB
Pentium 4    8 KB / 12K µops         512 KB
Athlon XP    64 KB / 64 KB           256 KB
Duron        64 KB / 64 KB           64 KB
Pentium M    32 KB / 32 KB           1024 KB
Xeon                                 512 KB

Depending on your code and level of optimization, the size of the cache may be of importance. For the purposes of this book, however, it is being ignored, as that topic is more suitable for a book specifically targeting heavy-duty optimization. This book is instead interested in the cache line size, as that falls under the lightweight optimization that has been touched on from time to time. It should be noted that AMD uses a minimum cache line size of 32 bytes.

Cache Line Sizes

The (code/data) cache line size determines how many instruction/data bytes can be preloaded.

Intel        Cache Line Size (bytes)
PIII         32
Pentium M    64
P4           64
Xeon         64

AMD          Cache Line Size (bytes)
Athlon       64
Opteron      64

The cache line size can be obtained by using the CPUID instruction with EAX set to 1. Bits 15:8 of EBX report the line size as a count of quadwords (8-byte units), so the value is masked and then shifted right by 8 and left by 3 (a net shift right of 5) to convert it into bytes. The following calculation will give you the actual cache line size.

 mov   eax,1
 cpuid
 and   ebx,00000FF00h       ; isolate bits 15:8 (line size in quadwords)
 shr   ebx,8-3              ; ebx = size of cache line in bytes
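
Note that CPUID clobbers EBX, and the field above is only defined when the processor reports the CLFLUSH feature (bit 19 of EDX from the same CPUID call). Here is a minimal sketch that wraps the calculation in a procedure (the name GetCacheLineSize is hypothetical), preserves EBX, and returns the line size in EAX, or zero when it is not reported:

GetCacheLineSize PROC
 push  ebx                  ; cpuid clobbers ebx
 mov   eax,1
 cpuid
 xor   eax,eax              ; default return value: 0 (not reported)
 test  edx,00080000h        ; CLFLUSH feature bit (bit 19) set?
 jz    Done
 and   ebx,00000FF00h       ; bits 15:8 = line size in quadwords
 shr   ebx,8-3              ; convert quadwords to bytes
 mov   eax,ebx
Done:
 pop   ebx
 ret
GetCacheLineSize ENDP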

PREFETCHx Prefetch Data into Caches

Mnemonic    P    PII    K6    3D!    3Mx+    SSE    SSE2    A64    SSE3    E64T
PREFETCH
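
As a brief illustration of how a prefetch hint is typically used (a sketch assuming SSE support, a 64-byte cache line, and hypothetical labels arrayPtr and dwordCount), the following loop sums an array of dwords while requesting data a few cache lines ahead with a non-temporal hint so the streamed data does not displace other cached data:

 mov   esi,arrayPtr         ; hypothetical pointer to the array
 mov   ecx,dwordCount       ; hypothetical count of dwords
 xor   eax,eax              ; running sum
SumLoop:
 prefetchnta [esi+256]      ; request data four cache lines ahead
 add   eax,[esi]
 add   esi,4
 dec   ecx
 jnz   SumLoop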


