3.2 Caching

A major concern in our microprocessor optimization strategy is that pipeline stalls cause serious performance problems. If every memory request has to be satisfied from main memory, the pipeline will stall very frequently. Modern microprocessors universally implement caches to try to solve this problem. In some processors, the cache is organized as a single unit, called a unified cache architecture. More commonly, however, the cache is divided into two areas, one for data and one for instructions; this is called a Harvard cache architecture. Harvard caches tend to be significantly more efficient in the real world, especially for small caches.

3.2.1 The Cache Hierarchy

Memory speed has increased at a much slower rate than processor speed: processor performance doubles roughly every eighteen months, but memory performance doubles roughly every seven years. The important thing to realize is that as processor speed has outpaced memory speed, even static RAM caches located off the processor die are not fast enough. [7] The total time required to move data to or from memory is more than just the memory's physical access time: there is overhead in moving data between the chips, as well as in keeping the caches synchronized with main memory. The most common mechanism for minimizing access time is to create a hierarchy of caches of steadily increasing size and steadily decreasing performance:

[7] We'll talk about the various types of memory in greater detail in Chapter 4.

  • The processor has separate caches for instructions and data; they are referred to as the i-cache and the d-cache, respectively. These are often about 16 KB each, and are the fastest caches in the system. They are called level 1 or L1 caches.

  • There is a larger, combined data/instruction second-level cache, which is generally between 256 KB and 4 MB in size. A miss in the L1 cache can still be a hit in this cache; while it takes a few cycles longer, it is still very efficient. This is the level 2 or L2 cache.

  • Sometimes a third-level cache is configured. It is usually 4 to 32 MB, and while it is still slower than the L2 cache, it is roughly an order of magnitude faster than main memory.

  • If cache misses occur at all these levels, main memory must be referenced. If the memory system is heavily loaded, there can be a long delay. We want to avoid that.

Most extremely high clock rate microprocessors have adopted a three-level cache structure. An example of cache levels and their access times for a typical 1.0 GHz processor can be found in Table 3-2.

Table 3-2. Typical memory access speeds

Cache location and type    Access time (in ns)    Size          "Good" hit rate [8]
L1 on-chip                 1-3                    8-64 KB       >90%
L2 on-chip                 6-18                   1-8 MB        >50%
L3 off-chip                30-40                  8-32 MB       >30%
Main memory                220                    Very large    -

[8] This is best described as "the percentage of requests to the cache that can be fulfilled with data already in the cache."

Every time data is loaded from a lower cache level, it's loaded into the caches above it as well. When a reference to memory is located in a cache, we say we have a cache hit; when we have to "fall through" the cache and look in the next level of the hierarchy, we have incurred a cache miss. Every cache miss incurs a miss cost. Predicting the miss cost is often complicated: newer processor designs like Sun's UltraSPARC have variable cache miss costs. In general, the miss cost is proportional to the difference in speed between the processor clock rate, the cache clock rate, and the main memory speed. From a performance viewpoint, it is very good to hit the cache as often as possible. Let's look at the average memory-access time of our prototype system under two distinct scenarios:

  • We hit the L1 cache 65% of the time, the L2 cache 25% of the time, and main memory 10% of the time.

    0.65(4 ns) + 0.25(5 ns) + 0.10(220 ns) = 25.85 ns

  • We hit the L1 cache 90% of the time, the L2 cache 8% of the time, and main memory 2% of the time.

    0.90(4 ns) + 0.08(5 ns) + 0.02(220 ns) = 8.4 ns

While 8.4 nanoseconds sounds awfully good, it's important to keep in mind that the clock cycle time of this system is on the order of a nanosecond. Nonetheless, as should be apparent, it is important for cache hits to happen as often as possible, and as high up in the hierarchy as possible (Figure 3-4).
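This blended average is easy to compute for any set of hit rates. The following C fragment is a minimal sketch using the 4 ns, 5 ns, and 220 ns latencies assumed in the scenarios above; the function name is just for illustration.

    #include <stdio.h>

    /* Average access time: the sum, over all levels, of (fraction of
     * references satisfied at that level) x (that level's access time). */
    static double avg_access_ns(double l1_rate, double l2_rate, double mem_rate)
    {
        const double l1_ns = 4.0, l2_ns = 5.0, mem_ns = 220.0;
        return l1_rate * l1_ns + l2_rate * l2_ns + mem_rate * mem_ns;
    }

    int main(void)
    {
        printf("%.2f ns\n", avg_access_ns(0.65, 0.25, 0.10));   /* 25.85 ns */
        printf("%.2f ns\n", avg_access_ns(0.90, 0.08, 0.02));   /* 8.40 ns */
        return 0;
    }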

Figure 3-4. Performance as a function of L1 cache hit rate

3.2.2 Cache Organization and Operation

Caches are organized into equal-sized chunks called lines. A line typically contains four to sixteen sequential memory locations: although the data in a line is physically contiguous in main memory, two distinct cache lines can contain data that is quite separated in memory.

When data is requested from the cache, the system checks to see if the data is in one of the cache lines. If it is, then it is returned quickly; if not, the request falls through to the next cache level (or to main memory). When the data is retrieved from the lower cache level, it is pulled into a line in the caches, discarding an existing line if necessary.

It is important to note that the data in a cache is not the "master copy" of the data. So, when data is modified in a cache, the cache designer has three choices (sketched in code after this list):

  • Update the copy in the cache only, which is very fast. However, now the master copy and the cached copy differ. This is write-back behavior.

  • Update both copies immediately, which may be slow. This is efficient if the cost of writing to the cache is low. This is called write-through.

  • Throw away the cached copy and just update the master. This is only a good idea if the cost of writing to the cache is high.
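The policy is implemented in the cache hardware rather than in software, but a rough C sketch may help fix the distinction between the three choices. The helper functions (cache_store, memory_store, mark_dirty, cache_invalidate) are hypothetical stand-ins for what the hardware does, not real APIs:

    typedef enum { WRITE_BACK, WRITE_THROUGH, WRITE_AROUND } write_policy_t;

    /* Hypothetical stand-ins for the cache hardware's internal operations. */
    static void cache_store(unsigned long addr, long val)  { (void)addr; (void)val; }
    static void memory_store(unsigned long addr, long val) { (void)addr; (void)val; }
    static void mark_dirty(unsigned long addr)             { (void)addr; }
    static void cache_invalidate(unsigned long addr)       { (void)addr; }

    void cache_write(write_policy_t policy, unsigned long addr, long val)
    {
        switch (policy) {
        case WRITE_BACK:        /* update only the cached copy; memory is now stale */
            cache_store(addr, val);
            mark_dirty(addr);   /* write back later, when the line is evicted */
            break;
        case WRITE_THROUGH:     /* update both copies immediately */
            cache_store(addr, val);
            memory_store(addr, val);
            break;
        case WRITE_AROUND:      /* discard the cached copy, update only the master */
            cache_invalidate(addr);
            memory_store(addr, val);
            break;
        }
    }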

Not all of main memory is cacheable: typically regions that involve I/O devices are uncacheable. This is determined by the address range, not by the data or instructions stored in the memory location.

3.2.3 Associativity

Every cache line is paired with the canonical data stored in main memory in a process called mapping. The cache maintains a record for each line, called a tag, of which memory addresses the cache line is associated with; in some designs, the last access time is stored as well. This mapping can be constructed in three different ways:

Direct mapping

This is the most straightforward way of mapping main memory to cache lines. In this scheme, cache line zero stores memory addresses that are integer multiples of the cache size. For a 32 KB cache, the memory addresses that are 0 modulo 32 KB (0, 32 KB, 64 KB, etc.) would be mapped to the first cache line. This is a cheap and simple method to implement, but it can cause decreased performance (called thrashing) when working with two memory areas that are an exact integer multiple of the cache size apart. For example, when copying the contents of an array A into an array B, and A and B are exactly 32 KB apart, A will be read, then B has to be pulled into the same cache line to be written; this incurs a lot of cache misses very quickly (see the sketch after this list).

Fully associative mapping

Fully associative mapping allows any cache line to be mapped to any memory address. Unlike a direct-mapped implementation, each cache line in a fully associative cache knows something about the data it contains. When the processor requests data, each of the cache lines is queried as to whether it contains the data. If none of them have it, a cache miss occurs. The cache tracks when each line was last accessed, so that when a line needs to be overwritten, the oldest data can be discarded. Fully associative caches tend to have high utilization percentages, but are very expensive.

Set associative caching

A set associative cache is a compromise between the direct-mapped and fully associative methods. It consists of a grouping of two or four banks of direct-mapped cache, so every time a cache line needs to be replaced, there is more than one choice of where it can go. This prevents the cache thrashing problems that are observed in direct-mapped caching.
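To make the mappings concrete, the following sketch computes where an address would land in a direct-mapped cache and in a four-way set associative cache. The 32 KB cache size and 32-byte line size are illustrative values only; note that two addresses exactly one cache size apart collide on the same direct-mapped line (the thrashing case above), while the four-way cache can hold both at once.

    #include <stdio.h>

    #define CACHE_SIZE (32 * 1024)              /* 32 KB cache (illustrative)   */
    #define LINE_SIZE  32                       /* 32-byte lines (illustrative) */
    #define NUM_LINES  (CACHE_SIZE / LINE_SIZE) /* 1024 lines                   */
    #define WAYS       4                        /* 4-way set associative        */
    #define NUM_SETS   (NUM_LINES / WAYS)       /* 256 sets                     */

    /* Direct mapped: the line number is the address, in line-sized units,
     * modulo the number of lines.  There is only one place the data can go. */
    static unsigned long direct_line(unsigned long addr)
    {
        return (addr / LINE_SIZE) % NUM_LINES;
    }

    /* Set associative: the set number is computed the same way, but the
     * line can go into any of the WAYS banks within that set. */
    static unsigned long set_number(unsigned long addr)
    {
        return (addr / LINE_SIZE) % NUM_SETS;
    }

    int main(void)
    {
        unsigned long a = 0x10000;          /* some address                 */
        unsigned long b = a + CACHE_SIZE;   /* exactly one cache size later */

        printf("direct-mapped line: %lu vs %lu\n", direct_line(a), direct_line(b));
        printf("4-way set number:   %lu vs %lu\n", set_number(a), set_number(b));
        return 0;
    }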

3.2.4 Locality and "Cache-Busters"

Caches are so successful at what they do because programs exhibit characteristic memory access patterns, called temporal locality and spatial locality. This means that, after an instruction or data access, the next instruction executed or the next piece of data fetched tends to be close in either time or space: temporal locality means that something referenced once tends to be referenced again soon; spatial locality means that items located near each other in memory tend to be used at about the same time.

We call a piece of code that causes the cache functionality to fail a cache-buster. Typically, a cache-buster breaks the temporal/spatial locality that the cache relies on for efficient operation. Three of the most common are non-unit stride, linked lists, and cache-aligned block copies.

3.2.4.1 Unit stride

Caches work best when a program is working through memory sequentially. Let's say that we are using short (16-bit) integers, and our cache lines are 128 bits in length. When we reference the first integer, the next seven will be loaded into the cache line as well, and our next seven references will be satisfied very quickly. This is called unit stride, because the address of each successive data reference is one element greater than the previous reference. The following code exhibits unit stride:

    for (i = 0; i <= 100000; i++) {
        j = j + Array[i];
    }

If a program exhibits non-unit stride, its performance suffers because each memory reference requires a trip to main memory. This is an example of a non-unit stride loop:

    for (i = 0; i <= 100000; i += 16) {
        j = j + Array[i];
    }

Even though this code performs one-sixteenth of the work of the unit stride loop, the two loops perform almost identically: noncached memory accesses dominate the performance of the non-unit stride loop. For more examples of loop optimization, see Section 8.3 in Chapter 8.

3.2.4.2 Linked lists

Let's think about a linked list with a few thousand entries. Each entry consists of some data and a link to the next element. We search our list by traversal: because the code for searching the list can be made very compact, it fits well in the instruction cache. However, every data request is likely to be in a different cache line, so every data access incurs a cache miss. If the size of the linked list exceeds our cache size and we are forced to start flushing cache lines to make room for the next element in the list, our next search attempt will not find the start of the list in the cache! The key point is that list elements are rarely contiguous in memory; therefore, iterating over them breaks the spatial locality principle that our caches depend on.

The only solution to this problem is to devise a new algorithm. As processors get faster, the problem will only get worse, since the disparity between the processor/cache clock rates and the memory access time will only grow larger.
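A minimal sketch of the pattern follows (the structure and function names are illustrative): the loop is tiny and fits comfortably in the i-cache, but because each node was allocated separately, successive next pointers are rarely adjacent in memory, so almost every dereference can land in a different cache line.

    #include <stddef.h>

    struct node {
        int          data;
        struct node *next;
    };

    /* Traversal: compact code, but poor spatial locality in the data. */
    int list_sum(struct node *head)
    {
        struct node *p;
        int sum = 0;

        for (p = head; p != NULL; p = p->next)
            sum += p->data;     /* each node may be anywhere in memory */
        return sum;
    }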

3.2.4.3 Cache-aligned block copy problems

The kernel spends much of its time copying and zeroing blocks of data, particularly in page-sized (typically 4 or 8 KB) chunks. [9] This data does not need to be cached, since it probably won't be used again immediately; in fact, it will simply serve to remove useful data from the cache. The most efficient way to copy a page-sized piece of data with write-back caching is to read a cache line, then write it as a block, as shown in the following steps:

[9] I discuss pages in Chapter 4.

  1. Read the first piece of the source, incurring a cache miss.

  2. Read the entire cache line of the source.

  3. Write the first piece to the target, incurring a cache miss.

  4. Write the entire cache line to the target.

  5. Repeat until finished. At some later stage, the target cache line will be flushed for more useful data.

If the target and source are separated by an exact multiple of the cache size, and the system uses a direct-mapped cache, both the source and destination will use the same cache line, which will cause a read and write miss for each entry in the line. This is horribly inefficient. Here's an example:

    #define CACHESIZE 0x40000
    char source[CACHESIZE], target[CACHESIZE];

    for (i = 0; i < CACHESIZE; i++) {
        target[i] = source[i];
    }

Support exists in processors to accelerate block copying. Sun's VIS instruction set, implemented in the UltraSPARC-series processors, has a block-copy instruction that bypasses the caches while still maintaining consistency between the caches and main memory. The bcopy(3C) and memcpy(3C) function calls both use this mechanism automatically on VIS-enabled processors.
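In practice, then, the copy loop shown above is better written as a single library call and left to the system to optimize; on VIS-enabled UltraSPARC processors the library routine can use the cache-bypassing block copy automatically. The wrapper function name below is just for illustration:

    #include <string.h>

    #define CACHESIZE 0x40000

    char source[CACHESIZE], target[CACHESIZE];

    void copy_block(void)
    {
        /* memcpy(3C) chooses the best available mechanism, including the
         * cache-bypassing block-copy instructions on VIS-enabled processors. */
        memcpy(target, source, CACHESIZE);
    }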

3.2.5 The Cache Size Anomaly

Based on what we have said so far, the general rule of "increased cache means increased performance" seems to hold. However, if a piece of code operates in a cache-busting fashion, it may actually perform much faster without a processor cache at all! As a result, sometimes very strange performance results are observed, like a SPARCstation 5 Model 110 outperforming a SPARCstation 20 Model 85 in a large simulation. Why is this the case?

The SPARCstation 5 uses a microSPARC processor that was designed for cheap, efficient operation. It has a small cache and a high cache miss rate, but a very low cache miss cost. The SuperSPARC-II processor used in the SPARCstation 20 has a much larger cache, a much lower cache miss rate, and a much higher cache miss cost. In some applications (such as database processing), the large cache works efficiently and the SPARCstation 20 is much faster; but in a simulation where the cache is not large enough to help performance at all, the low cache miss cost of the SPARCstation 5 makes it an extremely fast performer. The UltraSPARC design incorporated many of these lessons, and has a very low latency to main memory.

While I've given an example that uses older hardware, the potential for this sort of problem comes up quite frequently. For example, Celeron processors sometimes outperform their "professional" Pentium-II counterparts, because of a reduced cache miss cost. Some in-depth testing may be necessary to determine exactly what the cache access patterns of your applications are.

Another influence on cache effectiveness is something called page coloring, which refers to the algorithm used by the operating system kernel to decide where pages should be placed in the processor caches. For more information on this topic, see Section 4.2.6 in Chapter 4.


