Cache Memory


As Tables 2.1–2.7 demonstrate, processor core speeds have increased dramatically in the past decade. Although memory speeds have increased as well during the same time period, memory speeds have not kept up with processor performance. How could you run a processor faster than the memory from which you feed it without having performance suffer terribly? The answer was cache. In its simplest terms, cache memory is a high-speed memory buffer that temporarily stores data the processor needs, allowing the processor to retrieve that data faster than if it came from main memory. But there is one additional feature of a cache over a simple buffer: intelligence. A cache is a buffer with a brain.

A buffer holds random data, usually on a first-in, first-out (FIFO) or first-in, last-out (FILO) basis. A cache, on the other hand, holds the data the processor is most likely to need in advance of it actually being needed. This enables the processor to continue working at either full speed or close to it without having to wait for the data to be retrieved from slower main memory. Cache memory is usually made up of static RAM (SRAM) memory integrated into the processor die, although older systems with cache also used chips installed on the motherboard.

Processors use various types of cache algorithms to determine what data to store in cache and what data to remove from cache to make room for new data. Common methods include the following:

  • Least recently used (LRU). This method discards from the cache the data that has gone the longest without being accessed.

  • Least frequently used (LFU). This method discards from the cache the data that has been accessed the fewest times (both policies are sketched in code after this list).
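
To make the two policies concrete, the following minimal sketch in C shows how a controller might pick a victim line under each policy. The structure and field names (last_used, use_count) are made up for illustration; this is the idea behind the policies, not the logic of any real cache controller.

    #include <stddef.h>

    /* Illustrative bookkeeping for one cache line (field names are hypothetical). */
    struct cache_line {
        unsigned long tag;        /* which main memory block is stored here        */
        unsigned long last_used;  /* timestamp of the most recent access (for LRU) */
        unsigned long use_count;  /* number of accesses so far (for LFU)           */
    };

    /* LRU: evict the line that has gone the longest without being accessed. */
    size_t pick_victim_lru(const struct cache_line *set, size_t ways)
    {
        size_t victim = 0;
        for (size_t i = 1; i < ways; i++)
            if (set[i].last_used < set[victim].last_used)
                victim = i;
        return victim;
    }

    /* LFU: evict the line that has been accessed the fewest times. */
    size_t pick_victim_lfu(const struct cache_line *set, size_t ways)
    {
        size_t victim = 0;
        for (size_t i = 1; i < ways; i++)
            if (set[i].use_count < set[victim].use_count)
                victim = i;
        return victim;
    }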

See "SRAM: Cache Memory," p. 348.


At least two levels of processor/memory cache are used in a modern server: Level 1 (L1) and Level 2 (L2). Some server processors, such as the Itanium series and the latest Xeon models from Intel, as well as some RISC-based processors, also use Level 3 (L3) cache. These caches and how they function are described in the following sections.

Internal Level 1 Cache

All server processors include an integrated L1 cache and controller. This feature was first introduced to x86-compatible processors starting with the 486 family. Tables 2.1–2.7 show that, historically, server processors have featured L1 cache sizes from as little as 8KB to as much as 3MB. However, most recent server processors have L1 cache sizes ranging from 32KB to 128KB.

Note

Dual-core processors have separate L1 caches for each processor core. Most also include separate L2 caches for each core.


To understand the importance of cache, you need to know the relative speeds of processors and memory. The problem with this is that processor speed is usually expressed in MHz or GHz (millions or billions of cycles per second), whereas memory speeds are often expressed in nanoseconds (billionths of a second per cycle). Most newer types of memory express the speed in either MHz or in megabytes per second (MBps) of bandwidth (throughput). Both are really time- or frequency-based measurements, and Table 5.3 in Chapter 5, "Memory," is a chart that compares them. For years, servers have had huge disparities between the speed of the processor core, the speed of the processor bus connection to memory, and the speed of the memory itself. For example, consider a server with a 3.0GHz Xeon processor with an FSB of 800MHz (200MHz x 4 transfers per cycle). Typically, such a server would use PC2700 DDR memory, which has an effective clock speed of 333MHz (166MHz x 2). In this example, the processor's clock speed is 3.75 times faster than the FSB, and the FSB is 2.4 times faster than main memory. Similar comparisons can be made with virtually any server processor, whether it's x86, AMD64/EM64T, EPIC, or RISC architecture. If the processor had to get information from main memory every time it needed to process information, it would spend a lot of time waiting for memory.

Because L1 cache is always built in to the processor die, it runs at the full-core speed of the processor internally. Full-core speed means that the cache runs at the same speed as the internal processor core rather than the slower external motherboard speed. This cache is basically an area of very fast memory that is built in to the processor and used to hold some of the current working set of code and data. Cache memory can be accessed with no wait states because it is running at the same speed as the processor core.

Using cache memory reduces traditional system bottlenecks because system RAM is almost always much slower than the CPU; the performance difference between memory and CPU speed has become especially large in recent systems. Using cache memory prevents the processor from having to wait for code and data from much slower main memory, therefore improving performance. Without the L1 cache, a processor would frequently be forced to wait until system memory caught up.

Cache is even more important in modern processors because it is often the only memory in the entire system that can truly keep up with the chip. Most modern processors are clock multiplied, which means they run at a speed that is a multiple of the speed of the motherboard into which they are plugged. The Pentium 4 2.8GHz, for example, runs at 5.25 times the motherboard bus speed of 533MHz. The main memory is half this speed (266MHz) because the Pentium 4 uses a quad-pumped memory bus. Because the main memory is plugged in to the motherboard, it can run at only 266MHz maximum. The only 2.8GHz memory in such a system is the L1 and L2 caches that are built in to the processor core. In this example, the Pentium 4 2.8GHz processor has an integrated L1 cache (an 8KB data cache plus an execution trace cache holding 12K decoded micro-ops) and 512KB of L2, all running at the full speed of the processor core.

See "Memory Module Speed," p. 388.


For cache sizes of server processors, see Tables 2.1–2.7, pp. 24–31.


If the data the processor wants is already in the internal cache, the CPU does not have to wait. If the data is not in the cache, the CPU must fetch it from the Level 2 cache or (in less sophisticated system designs) from the system bus, meaning directly from main memory.
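
The lookup order can be pictured as a simple read path: check the fastest level first and fall back only on a miss. The following toy C model, with made-up cache sizes and contents, illustrates the flow; in a real processor this is done in hardware, and the levels are often searched in parallel.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Toy model: each level is a small table of block addresses it currently
     * holds. Sizes and contents are invented purely for illustration.        */
    #define L1_LINES 4
    #define L2_LINES 16

    static uint64_t l1_tags[L1_LINES] = { 0x100, 0x104, 0x108, 0x10c };
    static uint64_t l2_tags[L2_LINES] = { 0x100, 0x104, 0x108, 0x10c,
                                          0x200, 0x204, 0x208, 0x20c };

    static bool in_cache(const uint64_t *tags, int n, uint64_t block)
    {
        for (int i = 0; i < n; i++)
            if (tags[i] == block)
                return true;
        return false;
    }

    /* Read path: try the fastest level first, fall back on a miss. */
    static const char *where_is(uint64_t block)
    {
        if (in_cache(l1_tags, L1_LINES, block)) return "L1 (no wait)";
        if (in_cache(l2_tags, L2_LINES, block)) return "L2 (small penalty)";
        return "main memory (full wait)";
    }

    int main(void)
    {
        printf("block 0x104 served from %s\n", where_is(0x104));
        printf("block 0x208 served from %s\n", where_is(0x208));
        printf("block 0x300 served from %s\n", where_is(0x300));
        return 0;
    }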

How Cache Works

To learn how the L1 cache works, consider an analogy. In this story, you are a diner at a restaurant, playing the part of the processor requesting and operating on data from memory (a 3GHz Xeon, in this case). The kitchen where the food is prepared represents main memory (DIMM RAM). The cache controller is represented by the waiter, and the L1 cache is represented by the table at which you are seated.

Say you start to eat at this particular restaurant every day at the same time. You come in, sit down, and order a hot dog. You can eat a bite every half second (3GHz = 0.33ns cycling). When you first arrive, you sit down, order a hot dog, and wait 3 seconds (333MHz memory speed) for the food to be produced before you can begin eating. After the waiter brings the food, you start eating at your normal rate. Pretty quickly, you finish the hot dog, so you call the waiter over and order a hamburger. Again, you wait 3 seconds while the hamburger is being produced. When it arrives, you again begin eating at full speed. After you finish the hamburger, you order a plate of fries. Again you wait, and after it is delivered 3 seconds later, you eat it at full speed. Finally, you decide to finish the meal and order cheesecake for dessert. After another 3-second wait, you can eat cheesecake at full speed. Your overall eating experience consists of mostly a lot of waiting, followed by short bursts of actual eating at full speed.

After coming into the restaurant for two consecutive nights at exactly 6 p.m. and ordering the same items in the same order each time, on the third night the waiter begins to think: "I know this guy is going to be here at 6 p.m., order a hot dog, a hamburger, fries, and then cheesecake. Why don't I have these items prepared in advance and surprise him? Maybe I'll get a big tip." So you enter the restaurant and order a hot dog, and the waiter immediately puts it on your plate, with no waiting. You then proceed to finish the hot dog, and right as you are about to request the hamburger, the waiter deposits one on your plate. The rest of the meal continues in the same fashion, and you eat the entire meal, taking two bites every second, and you never have to wait for the kitchen to prepare the food. Your overall eating experience this time consists of all eating and no waiting for the food to be prepared, due primarily to the intelligence and thoughtfulness of your waiter.

This analogy describes the function of the L1 cache in the processor exactly. The L1 cache itself is the table, which can hold one or more plates of food. Without a waiter, the space on the table is a simple food buffer: when it is stocked, you can eat until the buffer is empty, but nobody intelligently refills it. The waiter is the cache controller, who takes action and adds the intelligence to decide which dishes should be placed on the table in advance of your needing them. Like the real cache controller, he uses his skills to guess which food you will require next, and if and when he guesses right, you never have to wait.

Let's now say that on the fourth night, you arrive exactly on time and start off with the usual hot dog. The waiter, by now really feeling confident, has the hot dog already prepared when you arrive, so there is no waiting.

Just as you finish the hot dog, and right as he is placing a hamburger on your plate, you say "Gee, I'd really like a bratwurst now; I didn't actually order this hamburger." The waiter guessed wrong, and the consequence is that this time you have to wait the full 3 seconds as the kitchen prepares your brat. This is known as a cache miss, in which the cache controller did not correctly fill the cache with the data the processor actually needed next. If the system has only L1 cache, the processor must fetch data from main memory, effectively slowing down to the memory speed, when a cache miss takes place.

According to Intel, the L1 cache in most of its processors has approximately a 90% hit ratio. (Some processors, such as the Xeon, are slightly higher.) This means that the cache has the correct data 90% of the time, and consequently the processor runs at full speed (3.0GHz, in this example) 90% of the time. However, 10% of the time, the cache controller guesses wrong, and the data has to be retrieved out of the significantly slower main memory, which means the processor has to wait. This essentially throttles the system back to RAM speed, which in this example is 333MHz (3.0ns). Comparable L1 performance is reported by most other server processors.

In this analogy, the processor is 9 times faster than the main memory. Memory speeds have increased from 16MHz (60ns) to 400MHz (2.5ns) or faster in the latest systems, but processor speeds have also risen, up to 3.8GHz, so even in the latest systems, memory is still 7.5 or more times slower than the processor. Cache makes up the difference.

The main feature of L1 cache is that it has always been integrated into the processor core, where it runs at the same speed as the core. This, combined with the hit ratio of 90% or greater, makes L1 cache very important for system performance.

Level 2 Cache

To mitigate the dramatic slowdown every time an L1 cache miss occurs, a secondary cache, L2, is employed. At one time, L2 cache was located outside the processor die. Depending on the processor, it might have been located on the motherboard or in a bulky processor cartridge (as with the Pentium II, Pentium II Xeon, and some versions of the Pentium III and Pentium III Xeon processors). In such cases, there was a slowdown when an L1 cache miss took place, but the system had the desired information in L2 cache.

However, in virtually all recent server processors (refer to Tables 2.1–2.7), L2 cache is also incorporated into the processor die, where it runs at full speed. To revisit the restaurant analogy, on-die L2 cache is like having a larger table at the restaurant where the waiter can pre-position your favorite foods. Because L2 cache is often four or more times larger than L1 cache, there's plenty of room for data (or, in our example, the hot dog, hamburger, fries, and cheesecake). As before, you take a bite every half second, and the food arrives just as quickly whether it comes from L1 cache or from L2 cache. With the L1 and L2 hit ratios combined, you would run at 3GHz 99% of the time; in other words, the food you want is already on the table 99% of the time. You slow down to RAM speed (333MHz, or 3ns) only 1% of the time. If only restaurant performance would increase at the same rate as processor performance!
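
Using the example figures from the text (a roughly 0.33ns core cycle at 3GHz and roughly 3ns per main memory access at 333MHz), a quick back-of-the-envelope calculation shows what these hit ratios are worth. This is the standard average-access-time estimate, not a measurement of any particular processor.

    #include <stdio.h>

    int main(void)
    {
        const double core_ns = 1.0 / 3.0;  /* ~0.33ns per cycle at 3GHz      */
        const double mem_ns  = 3.0;        /* ~3ns per access at 333MHz       */

        /* L1 only: 90% of accesses hit in cache, 10% fall through to memory. */
        double l1_only = 0.90 * core_ns + 0.10 * mem_ns;

        /* L1 plus on-die L2: 99% of accesses are satisfied at core speed.    */
        double l1_l2   = 0.99 * core_ns + 0.01 * mem_ns;

        printf("average access, L1 only : %.2f ns\n", l1_only); /* ~0.60 ns */
        printf("average access, L1 + L2 : %.2f ns\n", l1_l2);   /* ~0.36 ns */
        return 0;
    }

In this example, adding the on-die L2 cache cuts the average access time from about 0.6ns to about 0.36ns.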

Level 3 Cache

L3 cache is the third level of cache, and it is present in only a few very-high-performance server and high-performance workstation processors at this time. These include the Intel Pentium 4 Extreme Edition, Xeon MP, Intel Itanium family, and a few RISC-based processors.

L3 cache is checked after the processor checks L1 and then L2 cache for the necessary information. As with L2 cache, a large L3 cache improves performance by storing a larger amount of the contents of main memory for quick access by the processor. In our restaurant analogy, L3 cache could be compared to a food cart stationed between your table and the kitchen, holding additional items the waiter expects you to want.

The location of the L3 cache affects its speed. If the L3 cache is off-die, it runs at a slower speed than the processor core. Consequently, the system's performance is reduced when L1 and L2 cache misses take place, but L3 cache contains the desired information. However, even in such cases, accessing L3 cache is generally faster than accessing main memory.

Processors with on-die L3 cache access it at the same speed as L1 and L2 cache. In such cases, there is little practical distinction between the operations of L2 and L3 cache.

Cache Organization

A cache stores copies of data from various main memory addresses. Because the cache cannot hold copies of the data from all the addresses in main memory simultaneously, there has to be a way to know which addresses are currently copied into the cache so that, if we need data from those addresses, it can be read from the cache rather than from the main memory. This function is performed by tag RAM, which is additional memory in the cache that holds an index of the addresses that are copied into the cache. On older systems that used external L2 cache, tag RAM was implemented as a separate SRAM chip. However, on modern servers with integrated cache, tag RAM is built in to the L1 and L2 caches in the processor die.

Each line of cache memory has a corresponding address tag that stores the main memory address of the data currently copied into that particular cache line. If data from a particular main memory address is needed, the cache controller can quickly search the address tags to see whether the requested address is currently being stored in the cache (a hit) or not (a miss). If the data is there, it can be read from the faster cache; if it isn't, it has to be read from the much slower main memory.

Various ways of organizing or mapping the tags affect how the cache works. A cache can be mapped in three different ways:

  • Fully associative. In a fully associative mapped cache, when a request is made for data from a specific main memory address, the address is compared against all the address tag entries in the cache tag RAM. If the requested main memory address is found in the tag RAM (a hit), the data is returned from the corresponding location in the cache. If the requested address is not found in the address tag entries, a miss occurs, and the data must be retrieved from main memory instead of the cache. See Figure 2.3.

    Figure 2.3. In a fully associative cache, any block can be stored in any block frame.

  • Direct mapped. In a direct-mapped cache, specific main memory addresses are preassigned to specific line locations in the cache where they will be stored. Therefore, the tag RAM can use fewer bits because when you know which main memory address you want, only one address tag needs to be checked, and each tag needs to store only the possible addresses a given line can contain. This also results in faster operation because only one tag address needs to be checked for a given memory address. See Figure 2.4.

    Figure 2.4. In a direct-mapped cache, specific addresses are assigned to specific cache locations.

  • Set associative. A set-associative cache is a modified direct-mapped cache. A direct-mapped cache has only one set of memory associations, meaning that a given memory address can be mapped into (or associated with) only one specific cache line location. A two-way set-associative cache has two sets, so a given memory location can be stored in either of two locations. A four-way set-associative cache can store a given memory address in any of four different cache line locations (or sets). Increasing the set associativity increases the chance of finding a value in the cache; however, lookups take a little longer because more tag addresses must be checked when you're searching for a specific location in the cache. In essence, each set in an n-way set-associative cache is a subcache that has associations with each main memory address. As the number of subcaches or sets increases, the cache eventually becomes fully associative, a situation in which any memory address can be stored in any cache line location. In effect, an n-way set-associative cache is a compromise between a fully associative cache and a direct-mapped cache. See Figure 2.5.

Figure 2.5. In a set-associative cache, the L1 cache is divided up into multiple memory associations.


In general, a direct-mapped cache is the fastest at locating and retrieving data from the cache because it has to look at only one specific tag address for a given memory address. However, it also results in more misses overall than the other designs. A fully associative cache offers the highest hit ratio but is the slowest at locating and retrieving the data because it has many more address tags to check through. An n-way set-associative cache is a compromise between optimizing cache speed and hit ratio, but the more associativity there is, the more hardware (tag bits, comparator circuits, and so on) is required, making the cache more expensive. Obviously, cache design is a series of tradeoffs, and what works best in one instance might not work best in another. Multitasking environments such as Windows are good examples of environments in which the processor needs to operate on different areas of memory simultaneously and in which an n-way cache can improve performance.
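
To show how the tag, set index, and byte offset are derived from an address, here is a short C sketch of an n-way set-associative lookup. The geometry (16-byte lines, 128 sets, 4 ways, which works out to 8KB) is an example chosen to match the organization described next; it is not the layout of any specific product.

    #include <stdint.h>
    #include <stdio.h>

    /* Example geometry: 16-byte lines, 128 sets, 4 ways = 8KB of cache.     */
    #define LINE_BYTES 16u   /* 4 offset bits */
    #define NUM_SETS   128u  /* 7 index bits  */
    #define NUM_WAYS   4u

    struct line { uint64_t tag; int valid; };
    static struct line cache[NUM_SETS][NUM_WAYS];

    /* Split a memory address into byte offset, set index, and tag. */
    static void split(uint64_t addr, uint64_t *off, uint64_t *set, uint64_t *tag)
    {
        *off = addr % LINE_BYTES;
        *set = (addr / LINE_BYTES) % NUM_SETS;
        *tag = addr / (LINE_BYTES * NUM_SETS);
    }

    /* Only the NUM_WAYS tags in one set need to be compared on a lookup.    */
    static int lookup(uint64_t addr)
    {
        uint64_t off, set, tag;
        split(addr, &off, &set, &tag);
        for (unsigned w = 0; w < NUM_WAYS; w++)
            if (cache[set][w].valid && cache[set][w].tag == tag)
                return 1;               /* hit  */
        return 0;                       /* miss */
    }

    int main(void)
    {
        uint64_t off, set, tag;
        split(0x1234u, &off, &set, &tag);
        printf("addr 0x1234 -> tag 0x%llx, set %llu, offset %llu, %s\n",
               (unsigned long long)tag, (unsigned long long)set,
               (unsigned long long)off, lookup(0x1234u) ? "hit" : "miss");
        return 0;
    }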

The organization of the cache memory in the Pentium Pro and later Pentium-class processors is called a four-way set-associative cache, which means that the cache memory is split into four blocks (ways). Each block is organized as 128 or 256 lines of 16 bytes each, so four blocks of 128 or 256 16-byte lines work out to 8KB or 16KB of cache, respectively.

The contents of the cache must always be in sync with the contents of main memory to ensure that the processor is working with current data. Server processors use an internal write-back cache, which means that both reads and writes are cached, further improving performance over the write-through cache used by older processors.
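
The difference between the two write policies can be sketched as follows; the function names and the dirty flag are illustrative, not taken from any real controller.

    #include <stdint.h>
    #include <stdbool.h>

    /* Stand-in for a slow write to main memory (illustrative only). */
    void memory_write(uint64_t addr, uint32_t value) { (void)addr; (void)value; }

    /* Minimal per-line state for contrasting the two write policies. */
    struct cache_line {
        uint32_t data;
        bool     dirty;   /* meaningful only for write-back */
    };

    /* Write-through: every write updates the cache AND goes straight to
     * memory, so the processor pays the memory latency on each write.     */
    void write_through(struct cache_line *l, uint64_t addr, uint32_t value)
    {
        l->data = value;
        memory_write(addr, value);
    }

    /* Write-back: the write stays in the cache and the line is marked
     * dirty; main memory is updated only when the line is later evicted.  */
    void write_back(struct cache_line *l, uint32_t value)
    {
        l->data  = value;
        l->dirty = true;
    }

    void evict(struct cache_line *l, uint64_t addr)
    {
        if (l->dirty)
            memory_write(addr, l->data);  /* lazy update on eviction */
        l->dirty = false;
    }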

Another feature of improved cache designs is that they are nonblocking. This is a technique for reducing or hiding memory delays by exploiting the overlap of processor operations with data accesses. A nonblocking cache enables program execution to proceed concurrently with cache misses, as long as certain dependency constraints are observed. In other words, the cache can handle a cache miss much better and enable the processor to continue doing something that is not dependent on the missing data.

The cache controller built in to the processor is also responsible for watching the memory bus when other devices, known as bus masters, are in control of the system. This process of watching the bus is referred to as bus snooping. If a bus master device writes to an area of memory that is also currently stored in the processor cache, the cache contents and memory no longer agree. The cache controller then marks this data as invalid and reloads the cache during the next memory access, preserving the integrity of the system.

All processor designs that support cache memory include a feature known as a translation lookaside buffer (TLB) to improve recovery from cache misses. The TLB is a table inside the processor that stores information about the location of recently accessed memory addresses. The TLB speeds up the translation of virtual addresses to physical memory addresses. To improve TLB performance, several recent processors have increased the number of entries in the TLB. Pentium 4 processors that support Hyper-Threading Technology (HT Technology) have a separate instruction TLB (iTLB) for each virtual processor thread. Some older RISC processors used the operating system to manage the TLB, but generally, especially in recent and current processors, the TLB is managed by the processor itself.
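
Conceptually, a TLB is a small table that maps virtual page numbers to physical frame numbers. The following C sketch, assuming 4KB pages and a made-up 16-entry fully associative table, shows only the lookup step; on a miss, a real processor walks the page tables (or, on the older RISC designs mentioned above, traps to the operating system).

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_SIZE   4096u   /* assumed 4KB pages   */
    #define TLB_ENTRIES 16u     /* made-up TLB size     */

    struct tlb_entry {
        uint64_t vpn;    /* virtual page number   */
        uint64_t pfn;    /* physical frame number */
        bool     valid;
    };

    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Translate a virtual address; returns true on a TLB hit.
     * On a miss, a real processor performs a page-table walk (not shown). */
    bool tlb_translate(uint64_t vaddr, uint64_t *paddr)
    {
        uint64_t vpn    = vaddr / PAGE_SIZE;
        uint64_t offset = vaddr % PAGE_SIZE;

        for (unsigned i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *paddr = tlb[i].pfn * PAGE_SIZE + offset;
                return true;          /* hit: translation done at full speed */
            }
        }
        return false;                 /* miss: page-table walk required      */
    }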

See "Hyper-Threading Technology," p. 60.


Cache Considerations for Multiple-Processor Systems

Systems with two or more processors or with dual-core processors use cache in the same ways that single-processor systems do, but keeping cache memory up to date is more difficult in a multiple-processor or dual-core system. Because each of the processors in a multiple-processor system, or each processor core in a dual-core processor, might be working with memory locations that are also used by other processors, there must be a method for each processor to alert the others of changes to the contents of cache memory. The process of keeping all caches consistent with each other and with main memory is known as cache coherency.

Cache coherency protocols vary widely from processor to processor. Table 2.11 lists the most widely supported cache coherency states in current server processors.

Table 2.11. Typical Cache Coherency States

  • Modified (also known as exclusive modified): One processor's cache memory is up to date, but main memory has not yet been updated.

  • Exclusive (also known as exclusive clean): One processor's cache memory is up to date, and main memory is up to date as well.

  • Shared (also known as shared clean): Multiple processors' cache memory is up to date, as is main memory.

  • Invalid: No processor's cache memory is up to date.

  • Owned (also known as shared modified): Combines the Shared and Modified states.


Table 2.12 lists server processors and the cache coherency states they support.

Table 2.12. Selected Server Processors and Their Cache Coherency Protocols

  • AMD Athlon MP, Opteron: Modified, Owned, Exclusive, Shared, and Invalid (MOESI)

  • Hewlett-Packard Alpha (all versions): Modified, Owned, Exclusive, Shared, and Invalid (MOESI)

  • IBM PowerPC 603: Modified, Exclusive, Invalid (MEI)

  • IBM PowerPC 740, 750: Modified, Exclusive, Shared, Invalid (MESI)

  • IBM Power4: Enhanced MESI[1]

  • Intel Pentium Pro, Pentium II/Xeon, Pentium III/Xeon, Xeon, Itanium, Itanium 2: Modified, Exclusive, Shared, Invalid (MESI)

  • Sun UltraSPARC: Exclusive Modified, Shared Modified, Exclusive Clean, Shared Clean, and Invalid (MOESI)

[1] Includes additional states in the L2 and L3 caches.

In the MESI cache coherency protocol used by some vendors, each processor snoops the memory bus to determine whether any or all processors' cache memory contains valid information or invalid information and marks the cache accordingly. The MESI protocol can lead to a performance penalty when modified data must be written back to main memory; other processors must wait until this has occurred before they can access memory.
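
As a rough illustration, the four MESI states and two of the snoop-driven transitions can be sketched as follows. These are the textbook transitions, not the exact behavior of any processor listed in Table 2.12.

    /* The four MESI states for one cache line. */
    enum mesi { MODIFIED, EXCLUSIVE, SHARED, INVALID };

    /* Another processor (or bus master) is observed READING a line we hold. */
    enum mesi snoop_read(enum mesi s)
    {
        switch (s) {
        case MODIFIED:   /* write our dirty copy back to memory, then share it */
        case EXCLUSIVE:  /* someone else now has a copy too                    */
            return SHARED;
        default:
            return s;    /* SHARED stays SHARED, INVALID stays INVALID         */
        }
    }

    /* Another processor is observed WRITING the line: our copy becomes stale.
     * A MODIFIED copy would be written back before invalidating (not shown). */
    enum mesi snoop_write(enum mesi s)
    {
        (void)s;
        return INVALID;
    }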

The MOESI protocol used by Alpha, Athlon MP, Opteron, and Sun UltraSPARC reduces memory latency and improves memory performance compared to MESI. Keep in mind that the Athlon MP descends from the original AMD Athlon, which uses the EV-6 bus design originally developed for the Alpha and subsequently licensed by AMD.

To help improve cache coherency, some vendors add cache coherency filters or other cache accelerators to Intel-based systems with more than two processors. For example, the Hewlett-Packard ProLiant DL740 (4 Xeon MP processors) uses a 2MB cache accelerator feature on the motherboard. These features are not needed on AMD-based systems using up to eight-way designs because of the superior performance of MOESI caching.



