4.7. Improving Cache Performance


Engineers are always looking for ways to increase the efficiency of every clock cycle. Over time, several enhancements have been made to cache and memory design to speed up the time it takes the system to read and write data.

4.7.1 Single-Processor Operation

In single-processor systems, the L2 cache size is not as critical as it is in multiprocessor systems; 256KB of L2 cache is usually sufficient. A memory access from the processor competes for the system bus only with bus-master devices such as PCI cards. Doubling the cache size from 256KB to 512KB results in a performance gain of only 3% to 5%. Larger caches have better hit rates, but in single-processor systems the incremental gain is offset by the added cost. Figure 4-17 illustrates this concept.

Figure 4-17. Performance gains with different levels of cache.


4.7.2 Multiprocessor Operation

In multiprocessor systems, up to four processors per processor bus must compete for memory and I/O access rights. The system bus therefore becomes a severe bottleneck, and the increase in traffic results in a decrease in performance. Additional processors will not improve performance proportionally.

With four processors on a single bus, it is very likely that a processor requiring data from main memory will have to wait before getting access to the system bus. If the processor has a large cache, however, it has quick and easy access to much of the data it will need. Performance will therefore increase substantially.

4.7.3 Cache Architectures

The speed at which cache can fill a request from the processor also depends on how the cache is positioned between the processor and the system bus. The two main cache architectures are look-aside cache and look-through cache.

4.7.3.1 LOOK-ASIDE CACHE

Early processors that did not use cache sent memory read and write requests directly to the system bus. Although the requests were intended only for memory, every device connected to the bus received the request and had to determine whether the request was meant for it. Because only one device could use the system bus at a time, these memory requests slowed system performance.

When cache was implemented, the processor still sent the request to the system bus where both cache and memory received the request. If a cache hit occurred, the cache controller would terminate the request to the other devices on the system bus. If there was a cache miss, the request continued as normal. This type of architecture is known as look-aside cache, as illustrated in Figure 4-18.

Figure 4-18. Look-aside cache architecture.


The advantage of look-aside cache is that if there is a cache miss, memory has already been notified and can complete the request quickly. If there is a cache hit, however, memory has wasted time precharging the rows and columns, and the other devices on the system bus had to wait to use the bus.

4.7.3.2 LOOK-THROUGH CACHE

Look-through cache reduces the number of requests that the system bus receives. When a processor sends a read request in this type of system, the request goes to cache first. If there is a cache hit, no request is passed on to the system bus, thereby keeping the bus free for other devices to use. If there is a cache miss, the cache controller transfers the request to the system bus. Figure 4-19 illustrates this concept.

Figure 4-19. Look-through cache architecture.


The advantage of look-through cache is the overall reduction in the number of memory requests on the system bus. The disadvantage of look-through cache is that main memory does not begin looking for data until after the cache has determined there is a cache miss. This delay is commonly called the lookup penalty.
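
The difference between the two read paths can be sketched in a few lines of Python. The classes and addresses below are hypothetical and exist only to count bus cycles; a real cache controller implements this logic in hardware.

    class Bus:
        """Stands in for the shared system bus; counts how often it is used."""
        def __init__(self):
            self.requests = 0

        def broadcast(self, address):
            self.requests += 1              # every broadcast occupies the bus

    def look_aside_read(address, cache, memory, bus):
        bus.broadcast(address)              # cache and memory both see the request
        if address in cache:
            return cache[address]           # hit: the controller terminates the memory cycle
        return memory[address]              # miss: memory has already started working

    def look_through_read(address, cache, memory, bus):
        if address in cache:
            return cache[address]           # hit: the system bus never sees the request
        bus.broadcast(address)              # miss: forwarded to the bus (the lookup penalty)
        return memory[address]

    memory = {0x10: "A", 0x20: "B"}
    cache = {0x10: "A"}
    aside_bus, through_bus = Bus(), Bus()
    for addr in (0x10, 0x20, 0x10):
        look_aside_read(addr, cache, memory, aside_bus)
        look_through_read(addr, cache, memory, through_bus)
    print(aside_bus.requests, through_bus.requests)   # 3 bus cycles versus 1

In this toy run the look-aside path occupies the bus on every access, while the look-through path occupies it only on the single miss.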

Look-through cache has been used in all Intel processors introduced after the Pentium processor.

4.7.4 Writing to Cache

The method that a system uses to write to cache and to memory can affect the speed of the system. The method used to write information to the cache depends on whether the cache is write-through cache or write-back cache.

4.7.4.1 WRITE-THROUGH CACHE

When the processor writes a piece of data, it updates the L1 cache, which updates the L2 cache, which updates the L3 cache (if there is one), which updates main memory. This update process can take many clock cycles to complete. If the processor needs the data again soon, it has to wait until the data has been updated in main memory.

This type of cache architecture is known as write-through cache because the system must write the data through all the memory levels before it can be used again.
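
As a minimal sketch of write-through behavior, the dictionaries below stand in for the cache levels and main memory (all names are hypothetical):

    def write_through(address, value, cache_levels, memory):
        for level in cache_levels:          # e.g. [l1, l2] dictionaries standing in for caches
            level[address] = value          # update each cache level in turn
        memory[address] = value             # the write completes only after memory is updated

    l1, l2, memory = {}, {}, {}
    write_through(0x40, 99, [l1, l2], memory)
    print(l1[0x40], l2[0x40], memory[0x40])  # 99 99 99 -- every level holds the new value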

4.7.4.2 WRITE-BACK CACHE

To minimize the time the processor has to wait to retrieve data it has just written, engineers designed write-back cache.

In write-back cache, the processor updates the L1 cache, but main memory is not updated immediately. Instead, a special status bit attached to the cache line, sometimes called the dirty bit, is flagged to indicate that the data has not been written to memory yet. This keeps the memory bus clear for other requests.

If the processor requests the data again, the cache controller checks the status bit to see whether the data in the cache line is the same as the data in memory. If not, data is written to memory before it is returned to the processor.

If the processor wants to write a piece of data, the cache controller looks at the status bit to see whether the data in the least recently used cache line has been written to memory yet. If not, the cache controller writes that line to main memory before overwriting it with the processor's write request.
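
The following hypothetical Python sketch models a greatly simplified write-back cache: each line carries a dirty bit, and memory is updated only when a dirty line must be evicted to make room.

    from collections import OrderedDict

    class WriteBackCache:
        def __init__(self, num_lines, memory):
            self.num_lines = num_lines
            self.memory = memory
            self.lines = OrderedDict()      # address -> (value, dirty_bit)

        def _fill(self, address, value, dirty):
            if address not in self.lines and len(self.lines) >= self.num_lines:
                victim, (old_value, was_dirty) = self.lines.popitem(last=False)
                if was_dirty:                         # least recently used line is dirty:
                    self.memory[victim] = old_value   # write it back before reusing the line
            self.lines[address] = (value, dirty)
            self.lines.move_to_end(address)

        def write(self, address, value):
            self._fill(address, value, dirty=True)    # line marked dirty; memory NOT updated

        def read(self, address):
            if address in self.lines:
                self.lines.move_to_end(address)
                return self.lines[address][0]
            value = self.memory[address]              # miss: fetch from memory, line starts clean
            self._fill(address, value, dirty=False)
            return value

    memory = {0x00: 1, 0x20: 2, 0x40: 3}
    cache = WriteBackCache(num_lines=2, memory=memory)
    cache.write(0x00, 10)       # dirty line; memory still holds the old value 1
    cache.read(0x20)
    cache.read(0x40)            # evicts the dirty line, forcing the write-back
    print(memory[0x00])         # 10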

4.7.5 Bus Snooping and Bus Snarfing

A single processor is not the only device in a server that might try to read or write to memory. In multiprocessing systems, other processors might try to access memory. Many newer servers contain devices known as bus masters. Bus masters can access memory without going through the processor.

In write-back systems, one of these devices could request data from memory, but the data in memory might be stale. The cache might contain dirty data that has not been written to memory yet. In this case, the device making the memory request could get bad data.

To prevent this from happening, cache controllers are designed to "listen in" to the system bus traffic for any memory requests made by bus masters. This is called bus snooping, and is illustrated in Figure 4-20.

Figure 4-20. Bus snooping.


When the bus master is trying to read from memory, the cache controller checks to see whether newer data is in cache. If it is, the cache controller puts the bus master's request on hold until it updates memory with the most current data.

When the bus master is trying to write to memory, the cache controller captures the data being written and writes it to cache. This method of capturing data is known as bus snarfing.
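
Both behaviors can be sketched with a hypothetical controller object: a snooped read flushes dirty data to memory before the bus master proceeds, and a snooped write snarfs the new value into the cache.

    class SnoopingController:
        def __init__(self, cache, memory):
            self.cache = cache              # address -> [value, dirty_bit]
            self.memory = memory

        def snoop_read(self, address):
            """A bus master reads from memory; flush dirty data first."""
            if address in self.cache and self.cache[address][1]:
                self.memory[address] = self.cache[address][0]   # update memory before the read
                self.cache[address][1] = False
            return self.memory[address]

        def snoop_write(self, address, value):
            """A bus master writes to memory; snarf the data into the cache."""
            self.memory[address] = value
            if address in self.cache:
                self.cache[address] = [value, False]            # cache stays consistent

    memory = {0x80: 5}
    cache = {0x80: [7, True]}               # dirty: newer than the copy in memory
    ctrl = SnoopingController(cache, memory)
    print(ctrl.snoop_read(0x80))            # 7 -- memory was updated before the read completed
    ctrl.snoop_write(0x80, 9)
    print(cache[0x80])                      # [9, False] -- the written value was snarfed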

4.7.6 Reading from Cache

The speed at which cache can fill a request from the processor partially depends on where data from main memory is stored in the cache.

4.7.6.1 FULLY ASSOCIATIVE CACHE

The cache might be designed so that data from main memory can be stored in any cache line. This is known as fully associative cache.

In fully associative cache, moving data from memory to cache is quick: the incoming data simply replaces the cache line holding the oldest data. Finding that data later, however, can take a long time. In a 512KB cache divided into 32-byte lines, the cache controller might have to look through all 16,384 cache lines before it finds the data that it needs.
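
In the worst case, the lookup amounts to a linear scan of every line's tag. The sketch below is hypothetical and sized to match the 512KB, 32-byte-line example:

    def fully_associative_lookup(address, lines):
        """lines: list of (tag, data) pairs; any line may hold any address."""
        for tag, data in lines:             # worst case: compare every tag
            if tag == address:
                return data                 # cache hit
        return None                         # cache miss

    NUM_LINES = 512 * 1024 // 32            # 512KB cache, 32-byte lines = 16,384 lines
    lines = [(None, None)] * NUM_LINES
    print(fully_associative_lookup(0x1000, lines))   # None -- a miss after 16,384 comparisons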

4.7.6.2 DIRECT MAPPED CACHE

Another cache architecture, known as direct mapped cache, assigns a group of addresses to each cache line. This makes it quicker for the cache controller to determine whether a piece of data is in the cache.

The controller has to look at only one entry in the tag RAM: the one to which the requested address is assigned. If the requested address matches the address stored there, it is a cache hit. If not, it is a cache miss; the cache controller does not have to check any other tag RAM entries and can move on to the next step.

Direct mapped cache can cause cache thrashing, a situation in which repeated cache misses force frequent memory accesses and frequent cache-line replacements.

Example

The processor requests a piece of data from one address (D1). Main memory sends D1 to the processor and to cache. D1 is assigned to cache line 1 (CL1).

The processor immediately requests a different piece of data (D2). D2 also happens to be assigned to CL1. The cache controller looks for D2 in CL1, but gets a cache miss because the chunk containing D1 resides in CL1.

The cache controller sends the request to the next levels of memory until there is a cache hit. At this point, D2 replaces D1 in CL1.


The processor is running a loop and asks for D1 again. Again there is a cache miss because D2 is now occupying CL1. This process could continue until the loop is finally finished. During that time, the cache hit rate would be 0%.
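
The thrashing scenario can be reproduced with a small, hypothetical simulation in which D1 and D2 are deliberately chosen to map to the same cache line:

    NUM_LINES = 16384
    LINE_SIZE = 32

    def line_index(address):
        return (address // LINE_SIZE) % NUM_LINES    # each address maps to exactly one line

    cache_tags = [None] * NUM_LINES
    hits = misses = 0

    d1 = 0x00000000
    d2 = d1 + NUM_LINES * LINE_SIZE          # D2 is assigned to the same line as D1

    for _ in range(1000):                    # the processor's loop alternates between D1 and D2
        for addr in (d1, d2):
            idx = line_index(addr)
            if cache_tags[idx] == addr:
                hits += 1
            else:
                misses += 1
                cache_tags[idx] = addr       # the new data evicts the previous occupant

    print(hits, misses)                      # 0 2000 -- a 0% hit rate, as in the example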

4.7.6.3 SET-ASSOCIATIVE CACHE

Most cache architecture today is a compromise between direct mapped and fully associative cache. It is called set-associative cache.

Set-associative cache assigns a group of memory addresses to a specific group of cache lines known as a set. Depending on the cache, each set can have two, four, or eight lines. In two-way set-associative cache, data from a group of memory addresses can fill either of two cache lines. Using the previous example, D1 could fill CL1, and D2 could fill CL2.

Compared to direct mapped cache, it might take the cache controller slightly longer to find the data because it has to check two entries in the tag RAM. The increased number of cache hits, however, makes up for this. In general, the more cache lines in each set, the more likely a cache hit will occur, but also the more tag RAM entries the cache controller has to check for a match.
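
Rerunning the two addresses from the thrashing example against a hypothetical two-way set-associative cache shows the benefit: both pieces of data remain resident in the same set, so the loop hits after the initial fills.

    NUM_SETS = 8192                          # 16,384 lines organized as 8,192 two-way sets
    LINE_SIZE = 32
    WAYS = 2

    sets = [[] for _ in range(NUM_SETS)]     # each set holds up to WAYS (tag, data) pairs

    def access(address):
        index = (address // LINE_SIZE) % NUM_SETS
        ways = sets[index]
        for i, (tag, data) in enumerate(ways):
            if tag == address:
                ways.append(ways.pop(i))     # hit: mark this line as most recently used
                return "hit"
        if len(ways) >= WAYS:
            ways.pop(0)                      # evict the least recently used line in the set
        ways.append((address, "data"))
        return "miss"

    d1 = 0x00000000
    d2 = d1 + NUM_SETS * LINE_SIZE           # maps to the same set as D1
    print([access(a) for a in (d1, d2, d1, d2)])   # ['miss', 'miss', 'hit', 'hit']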
