A.3 Memory Hierarchy


Ideally, we would store all the data we are currently using in registers, the quickest form of memory storage in a computer. Unfortunately, cost, volatility, space, power requirements, and heat dissipation make it unlikely that a processor will ever exist that can accommodate all the data we are currently using. Consequently, we need some other form of storage to keep our data. The closer that storage is physically located to the processor, the better the performance, but at ever-increasing cost. Figure A-5 shows this memory, or storage, hierarchy.

Figure A-5. Memory hierarchy.

graphics/ap01fig05.gif


Our memory hierarchy tries to alleviate the problem of not having enough storage capacity on the processor itself. The best solution available to us at present is to use high-speed memory, also known as cache memory. The crème de la crème of cache memory is a cache that we can access in one clock cycle. Such a low latency will probably require that the cache is physically located on the processor die itself. In turn, this will limit the size of this Level 1 (L1) cache, possibly to tens of kilobytes, although in some cases an L1 cache in the megabyte range has been seen.

Cache memory is made up of SRAM (Static Random Access Memory) chips. These differ in a number of ways from the chips used for main memory, which is made of a type of memory chip known as DRAM (Dynamic Random Access Memory). The Static in SRAM relates to the state of a memory cell while power is applied to the device: Memory cells in an SRAM chip will maintain their current state, i.e., hold their current data item, with no further intervention. The Dynamic in DRAM means that the memory cells must be acted upon in order for them to retain their current state. DRAM chips are composed of MOS (metal oxide semiconductor) devices that have high densities and are relatively cheap to fabricate. The drawback is that they use a charge stored in a capacitance to indicate their state: on or off. Unfortunately, this capacitance drains away if left unchecked, so a DRAM cell would lose its charge and therefore lose the data stored in it. DRAM chips need to be refreshed (read, then rewritten), usually every millisecond or two, to ensure that the data held in a memory cell is maintained. While this refresh is being performed, the data in a DRAM cell cannot be accessed.

SRAM chips are similar to processor registers: They are effectively an extension of the electronic bistable flip-flop. One benefit of these electronic latches is that the charge is static and does not need refreshing. Another benefit, due to their similarity to registers, is that we can access them in one clock cycle if they are located sufficiently close to other processor elements. SRAM chips are faster than DRAM but unfortunately considerably more expensive, usually by a factor of 10 to 20. Few computers have all their memory made of SRAM chips; CRAY supercomputers are the main exception. Seymour Cray famously summed up the dilemma of using SRAM chips in preference to DRAM chips, irrespective of cost: "You can't fake what you don't have."

Before looking at cache architecture, let's look at some figures relating memory access times to processor speeds. Take a fictitious processor operating at 500MHz:

clock period = 1 / clock frequency = 1 / 500,000,000 cycles per second = 2 nanoseconds

A clock tick occurs every 2 nanoseconds. The ideal is to execute an instruction every clock tick. Typical DRAM access times are on the order of 60 nanoseconds. Some chip manufacturers are trying to bring that figure down with technologies such as SDRAM (synchronous DRAM), Rambus, and EDO (Extended Data Out) memory. These are still essentially DRAM chips, but with additional technology to bring access times closer to those of SRAM. In essence, in the time taken by one memory access we could execute 30 machine instructions. This is the major reason we need high-speed memory (SRAM): to alleviate some of that imbalance. In fact, studies have shown that somewhere around 30 percent of instructions make reference to memory locations (in RISC, that would mean 30 percent of instructions are LOAD/STORE instructions). Operating systems commonly use Virtual Addresses to access memory, meaning that every memory reference is now a reference to a virtual address, be that a reference to the next instruction or to a data element. This is an additional task for the processor to perform. Many modern processors use a special Functional Unit known as the TLB (Translation Lookaside Buffer). For our simple example, we won't use a hardware TLB; rather, we'll require the processor to perform the translation itself. The result is that every memory reference equates to:

1 reference for an instruction + (1 reference for data x 30%) = 1.3 virtual address references

Take into account that every memory reference occurs in two stages: (1) perform the Virtual to Physical translation, and then (2) fetch the datum from the Physical memory location. This means that with no assistance from a TLB, every memory reference is in fact making 2.6 real memory references. With our fictitious processor running 30 times faster than our DRAM chips, the processor is consistently 2.6 x 30 = 78 times ahead of memory as far as access times are concerned. The natural response to this is that we need memory to operate at a faster speed. In fact, with a superscalar and super-pipelined processor, it can be argued that memory needs to operate faster than the processor. This would mean inventing an electronic device that can react more quickly than a register. If someone invented such a device, I can guarantee that the first architects to use it would be the processor architects themselves, and we would get into a circular argument regarding processor and memory speeds. Our best effort is to use devices that can react as quickly as a register. Today, that means SRAM.
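
The back-of-the-envelope arithmetic above is easy to reproduce. The following minimal C sketch simply recomputes the figures used in the text; the 500MHz clock, 60-nanosecond DRAM access time, and 30 percent load/store rate are the assumptions stated above, not measured values.

    /* Recompute the processor-versus-memory figures from the text.
       All constants are assumptions from the discussion above.          */
    #include <stdio.h>

    int main(void)
    {
        double clock_hz        = 500e6;  /* fictitious 500MHz processor       */
        double dram_ns         = 60.0;   /* typical DRAM access time          */
        double load_store_rate = 0.30;   /* ~30% of instructions touch memory */

        double tick_ns          = 1e9 / clock_hz;        /* 2 ns per clock tick */
        double instr_per_access = dram_ns / tick_ns;     /* 30 instructions     */
        double va_refs          = 1.0 + load_store_rate; /* 1.3 virtual refs    */
        double real_refs        = 2.0 * va_refs;         /* 2.6 without a TLB   */

        printf("clock tick             : %.1f ns\n", tick_ns);
        printf("instructions per access: %.0f\n", instr_per_access);
        printf("virtual refs per instr : %.1f\n", va_refs);
        printf("real refs per instr    : %.1f\n", real_refs);
        printf("processor/memory gap   : %.0f times\n", real_refs * instr_per_access);
        return 0;
    }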

Let's first look at SRAM cache chips. Due to their cost, SRAM memory chips are smaller in capacity than their main memory counterparts, and being located close to the processor also limits their size. These size limitations don't seem to bode well for the performance of cache memory. As we will see, however, the way programs behave means that having a small amount of cache memory doesn't hinder performance as much as we might first think. It is seldom that we have a program that is purely sequential in flow, i.e., one that starts with the first instruction, moves on to the next, and simply steps through instruction after instruction until it gets to the end. A typical program will have some form of repetition built into it, some form of looping construct whereby we work on a given set of instructions, and possibly a similar set of data, in a repetitive manner. This highlights a concept known as the principle of locality. Basically, there are two types of locality:

Spatial locality: Spatial locality deals with the notion that the data we are about to access is probably close to the data we have just accessed, i.e., the next instruction in our loop is close to the last instruction. Logically, this seems to make sense.

Temporal locality: Temporal locality deals with the concept that once we have accessed a piece of data, there is a strong probability that we will access the same piece of data in the not-too-distant future. Again, our notion of looping in a program means that we come around to the same instructions in the not-too-distant future. Again, logically this seems to make sense.

Considering both Spatial and Temporal locality, having a relatively small but very fast cache is less of a handicap than we might first have thought: The chances are good that data and instructions will be in the cache the next time we need them. If we remember back to the von Neumann principle, where data and instructions are both considered simply as a stream of data, we can see an immediate bottleneck in the design of our cache. We can be accessing the cache only to retrieve either a data element or an instruction, but not both simultaneously, i.e., the von Neumann bottleneck. Then we consider the principle of a superscalar architecture, where we can activate separate processor components as long as they are involved in separate tasks. The resulting solution is simple and elegant: Implement a separate cache for instructions and a separate cache for data. This means utilizing a separate instruction and data cache as well as a separate bus into the processor for each cache, effectively doubling the overall data bandwidth into the processor. An architecture that utilizes separate data and instruction memories is known as a Harvard architecture. Today, the term is used to identify architectures with separate instruction and data caches, even though instructions and data are stored together in main memory, i.e., main memory is monolithic. Most high-performance processors have a separate instruction cache (I-cache) and data cache (D-cache).

Next, we look at how a cache is organized. The cache employs circuitry designed to decode a memory address to one of a number of cache lines. A cache line is a number of SRAM cells whose size is the unit of transfer between cache and main memory. This is to say that if the cache line size is 64 bytes, a LOAD will transfer 64 bytes' worth of data/instructions into the cache, even if we only asked for 2 bytes. This buffering can be a benefit to performance: If we think back to the notion of Spatial locality, that concept dictates that the next piece of data we require will be located close to the current data element. Because we have just transferred more data than the original request needed, there is a high probability that the data we now require is already in cache. The same idea applies to the notion of Temporal locality, and to both instructions and data.

A difficult question to answer is the ideal size for a cache line. If it is too large, we will need to perform large transfers to and from main memory, which imposes the same transfer-time problems that we documented between processors and main memory. The size of the cache line is a decision for the processor architects. The design of the instruction set will suggest an ideal size for instructions; for data elements, the architects may use historical data to establish a "typical" data flow for such a processor and try to match the cache line size to it as closely as possible. In reality, a general-purpose computer will not have a typical data flow. Ultimately, it is up to programmers to establish what cache line size the designer has chosen. With this and other information, the programmer needs to model his or her application data accordingly. In particular, the programmer will be trying to avoid a situation known as cache thrashing. Cache thrashing occurs when a program continuously references a particular line in the cache for different data, simply as a consequence of the design of the data model used in the application.
Consequently, we need to flush or purge the cache line every time a new datum needs to be loaded. Because this requires writes to and reads from main memory, programmers should try to avoid it as much as possible. The main reason for cache thrashing is an unexpected conflict between the cache organization and the data model used in the application. Programmers should be aware of the principle of locality as well as how the cache is organized on the systems that will run their applications, and should construct their data model with both of these in mind. It is not only the cache line size that the programmer needs to consider when constructing a data model, but also the way that memory addresses map onto cache lines. This is known as the mapping function used by the cache logic circuitry.
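
To make the idea of cache thrashing concrete, here is a hedged C sketch of a data model that collides with a cache organization. It assumes the toy direct-mapped cache used in the mapping examples below (eight lines of four bytes, 32 bytes in total) and a 4-byte int; the struct and function names are hypothetical.

    /* Hypothetical data model: a[0] and b[0] sit exactly 32 bytes apart,
       so on a 32-byte direct-mapped cache (8 lines x 4 bytes) they decode
       to the same cache line.                                             */
    #include <stdio.h>

    struct colliding {
        int a[8];            /* 32 bytes, one 4-byte int per cache line    */
        int b[8];            /* b[i] shares its cache line with a[i]       */
    };

    int ping_pong(const struct colliding *p)
    {
        int sum = 0;
        /* Only two values are touched, yet every access can miss: each
           reference to b[0] evicts a[0] from their shared line and vice
           versa, forcing a constant stream of main memory traffic.        */
        for (int pass = 0; pass < 1000; pass++)
            sum += p->a[0] + p->b[0];
        return sum;
    }

    int main(void)
    {
        struct colliding c = { {1}, {2} };
        printf("%d\n", ping_pong(&c));
        return 0;
    }

Interleaving the two arrays into a single array of pairs would place a[0] and b[0] on different cache lines and remove the conflict.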

A.3.1 Cache memory mapping functions

There are three ways we can organize cache mapping:

Direct Mapping: This is where only one cache line maps to a given block of memory addresses.

Fully Associative: This is where any cache line can map to any block of memory addresses.

Set Associative: This is a mapping where there are a number of cache lines that map to a given block of memory addresses.

Before we look at each mapping, we look at a basic processor where some basic design criteria have already been established:

  • The cache has only eight cache lines.

  • Each cache line holds four bytes.

This means that we need three bits to identify a specific cache line (2^3 = 8 cache lines) and two bits to identify an individual byte within a cache line. This is known as the offset (2^2 = 4 bytes addressed by the offset). The rest of the address will be used as an identifier or tag field to ensure that the cache line we are interested in currently holds our data element. With 3 bits being used to identify the cache line and two bits to represent the offset, our tag field will be three bits in size.
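
To make the split concrete, here is a minimal C sketch, assuming the 8-bit address layout just described: a 2-bit offset, a 3-bit cache line number, and a 3-bit tag. The address value used is the one that appears in the Direct Mapping example below.

    /* Split a toy 8-bit address into tag | line | offset fields:
       offset = bits 0-1 (4 bytes per line), line = bits 2-4 (8 lines),
       tag = bits 5-7 (the remaining 3 bits).                            */
    #include <stdio.h>

    int main(void)
    {
        unsigned char addr = 0x48;              /* 0100 1000 in binary  */

        unsigned offset =  addr       & 0x3;    /* low 2 bits           */
        unsigned line   = (addr >> 2) & 0x7;    /* next 3 bits          */
        unsigned tag    = (addr >> 5) & 0x7;    /* top 3 bits           */

        printf("address 0x%02x -> tag %u, line %u, offset %u\n",
               addr, tag, line, offset);
        return 0;
    }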

Let's look at some examples of each of these cache mappings in order to fully understand how each works, as well as the costs and benefits.

A.3.1.1 DIRECT MAPPING

With a direct mapped cache, a particular memory address will always map to the same cache line. The cache logic needed to implement such a design is the simplest of the mapping functions and hence the cheapest. Its drawback is that a number of memory addresses may decode to the same cache line. This can lead to the situation known as cache thrashing, which basically means that we have to get rid of old data to make way for new data. If both data elements are in constant use, we could have a situation where two pieces of data in effect consume the entire bandwidth of the cache. In such a situation, we get what are known as cache hot spots. Let's look at how direct mapping works.

NOTE: It is common to represent binary information in hexadecimal: Writing a 64-bit address in binary can be somewhat cumbersome. Using hex means that we can represent an 8-bit byte with just two hex digits. I use both forms in the accompanying diagrams where appropriate.

Figure A-6 represents the direct mapping of our 8-bit address to a cache line.

Figure A-6. Direct Mapped cache.
graphics/ap01fig06.gif

The problem we have here is that if the next datum we access happens to be at address 01001000 (notice that only bits 5 and 6 have changed), it decodes to the same cache line. We will have to write the current cache line back to memory before moving our new datum into the cache, and that is a lengthy process. It transpires that, in our architecture, addresses that are 32 bytes apart decode to the same cache line; any access pattern that strides through memory in multiples of 32 bytes will hit this cache hot spot. A programmer would have to be aware of how the cache was organized, as well as the number and size of cache lines, before designing a data model. As you can imagine, the overall performance of an application will be seriously affected if the data model happens to land on a cache hot spot.
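
The decoding above can be sketched in a few lines of C. This assumes the toy cache of eight 4-byte lines; the first address (0x28) is a hypothetical earlier access chosen so that only bits 5 and 6 differ from 0x48 (binary 01001000).

    /* Direct mapping on the toy cache: line = (addr >> 2) & 0x7.
       Any two addresses 32 bytes apart produce the same line number,
       which is exactly the hot spot problem described above.          */
    #include <stdio.h>

    static unsigned line_of(unsigned char addr) { return (addr >> 2) & 0x7; }
    static unsigned tag_of (unsigned char addr) { return (addr >> 5) & 0x7; }

    int main(void)
    {
        unsigned char first  = 0x28;   /* 0010 1000                        */
        unsigned char second = 0x48;   /* 0100 1000: only bits 5-6 changed */

        printf("0x%02x -> line %u, tag %u\n", first,  line_of(first),  tag_of(first));
        printf("0x%02x -> line %u, tag %u\n", second, line_of(second), tag_of(second));
        /* Both addresses decode to line 2 with different tags, so the
           second access evicts the first even though the rest of the
           cache may be empty.                                            */
        return 0;
    }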

A.3.1.2 FULLY ASSOCIATIVE MAPPING

This mapping scheme is at the other end of the spectrum of cache organization in comparison to a Direct Mapped cache. In this design, any memory reference can be stored in any cache line. A number of issues arise out of this design.

  1. Do we need to use some of the address bits to encode a cache line number?

    - The answer to this question is no. Because an address can decode to any line, no particular cache line number corresponds to a given memory address.

  2. The issue of Temporal locality needs to be addressed in that we would prefer not to remove a data element that is likely to be reused in the near future. We discuss replacement strategies after talking about Set Associative mapping. It is enough to say that it is desirable to record the frequency with which a cache line has been used.

A Fully Associative mapping uses more bits to identify whether a cache line contains our datum. In our example, we no longer need bits 2 through 4 to encode a cache line, so our tag field now uses bits 2 through 7 as an identifier. The drawback of this mapping scheme is that we need to ensure that a cache lookup remains very rapid: Every memory reference could require us to examine every cache line and compare bits 2 through 7 of the memory reference with each line's tag field. This makes Fully Associative mappings more expensive to employ. Figure A-7 shows an example mapping.

Figure A-7. Fully Associative Mapping.
graphics/ap01fig07.gif
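
The following minimal C sketch models a fully associative lookup on the toy cache; the struct and function names are hypothetical, and where real cache hardware compares all the tags in parallel with dedicated comparators, this software model has to loop over the lines.

    /* Fully associative lookup: the tag (bits 2-7 of the 8-bit address)
       must be compared against every one of the eight lines, because
       the datum may live in any of them.                                */
    #include <stdbool.h>
    #include <stdio.h>

    #define NLINES 8

    struct line { bool valid; unsigned tag; unsigned char data[4]; };

    static struct line cache[NLINES];

    /* Returns the matching line index, or -1 on a cache miss. */
    static int fa_lookup(unsigned char addr)
    {
        unsigned tag = addr >> 2;                       /* bits 2-7      */
        for (int i = 0; i < NLINES; i++)
            if (cache[i].valid && cache[i].tag == tag)
                return i;
        return -1;
    }

    int main(void)
    {
        cache[3].valid = true;                 /* warm one line by hand  */
        cache[3].tag   = 0x48 >> 2;
        printf("0x48 -> %s\n", fa_lookup(0x48) >= 0 ? "hit" : "miss");
        printf("0x28 -> %s\n", fa_lookup(0x28) >= 0 ? "hit" : "miss");
        return 0;
    }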

A.3.1.3 SET ASSOCIATIVE MAPPING

On one hand, we have a simple, cheap, but effective mapping scheme with Direct Mapping. On the other hand, we have the sophisticated, expensive solution of Fully Associative Mapping. Set Associative Mapping sits in the middle: It tries to avoid the cache hot spot problem of Direct Mapping, but its cache logic is not as complicated and, hence, it is cheaper to employ than Fully Associative Mapping. The idea with Set Associative Mapping is that we construct a Set of lines that can accommodate our memory reference: Any one of the lines in the Set can hold the datum for a given memory address. This avoids the cache hot spot problem of Direct Mapping. What we need to do is change how the decoding of a memory address works. Instead of decoding to a single cache line, the memory address will decode to a set of two, four, or possibly eight lines. In our simple example, we have eight cache lines. If you look at the binary representation of the cache line numbers, you might notice a form of repetition in the binary notation, for instance 010 and 110: The last two bits of the cache line number are the same, i.e., 10. We can use this as our encoding scheme. We will use part of the memory address to encode to "line 10", which in our case decodes to two possible line numbers. We will use the remaining 4 bits as our Tag Field, as before, to verify that the data in the cache line is actually our data. This is an example of a two-way Set Associative Mapping: two cache lines in the Set in which our data can reside. Depending on the design of our cache, we can have an n-way Set Associative Mapping where each Set contains n lines. When we come to place our datum in the cache, both lines in the Set may already be occupied. The cache logic will have to decide which line to replace, and it may be prudent to use a Usage Field, as we did in the Fully Associative Mapping, in order to maximize the benefits of the principles of locality.
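
Here is a minimal C sketch of this two-way lookup on the toy cache, again with hypothetical names, assuming bits 2-3 of the address select one of four sets and bits 4-7 form the tag:

    /* Two-way set associative lookup: bits 2-3 select one of four sets;
       each set holds two lines, and bits 4-7 form the tag.              */
    #include <stdbool.h>
    #include <stdio.h>

    #define NSETS 4
    #define NWAYS 2

    struct line { bool valid; unsigned tag; unsigned char data[4]; };

    static struct line cache[NSETS][NWAYS];

    /* Returns the way that hits within the selected set, or -1 on a miss. */
    static int sa_lookup(unsigned char addr)
    {
        unsigned set = (addr >> 2) & 0x3;               /* set number    */
        unsigned tag =  addr >> 4;                      /* tag field     */
        for (int way = 0; way < NWAYS; way++)
            if (cache[set][way].valid && cache[set][way].tag == tag)
                return way;
        return -1;                                      /* miss: pick a victim */
    }

    int main(void)
    {
        unsigned char addr = 0x48;                      /* 0100 1000     */
        cache[(addr >> 2) & 0x3][1].valid = true;       /* warm way 1    */
        cache[(addr >> 2) & 0x3][1].tag   = addr >> 4;
        printf("0x%02x -> way %d\n", addr, sa_lookup(addr));
        return 0;
    }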

Figure A-8 shows an example of our two-way Set Associative Mapping.

Figure A-8. Two-way Set Associative Mapping.
graphics/ap01fig08.gif

The choice of cache mapping is a design decision for the processor architect: flexibility but complexity with Fully Associative Mapping, simplicity but a proneness to cache hot spots with Direct Mapping, or the compromise of Set Associative Mapping. If either Fully Associative Mapping or Set Associative Mapping is used, the architect will also have to decide how the Usage Field should be maintained and interpreted. Let's discuss replacement strategies in the context of cache lines where we are using an Associative Mapping.

A.3.1.4 REPLACEMENT STRATEGIES

When we need to replace a cache line with new data, we need some policy for choosing which cache line to select. It is possible to adopt a random policy; statistically, we would choose the best cache line a fair proportion of the time, but, equally, we could choose the worst line a fair proportion of the time as well. Instead of leaving it to chance, we can apply some intelligence to the decision of which cache line to replace. When we consider our principles of locality, it is advantageous to choose a cache line that is least likely to be needed in the near future. In this way, we avoid the laborious process of writing data back to main memory only for it to be read back in again shortly afterward. The concept of a Usage Field is commonly used to choose the most appropriate line: The Usage Field tracks how recently or how often a cache line has been accessed. There are essentially three choices of replacement strategy.

  1. Last In First Out (LIFO)

    We will not spend much time considering this replacement policy. LIFO means that the newest datum brought into the cache is the first one to be replaced. When we consider Temporal locality, it is likely that we will need that newest datum again in the near future, so immediately getting rid of it flies in the face of the notion of Temporal locality. We are encouraged in this assertion by the fact that no production cache uses this replacement strategy.

  2. First In First Out (FIFO)

    The idea here is that the oldest member of the cache is the first line to be replaced. The logic dictates that each datum should be allowed a fair amount of time in cache without a single datum dominating the cache entirely. The contradiction is that if a line is still in use, there is little reason to replace it. In fact, in many ways you could argue that if a cache line has been resident for a long time, then getting rid of it may mean that you are getting rid of the most heavily utilized datum. Both of these arguments are missing one crucial ingredient: They do not take into account when the cache line was last used.

  3. Least Recently Used (LRU)

    Least Recently Used attempts to overcome the problem of FIFO by taking into account when a cache line was last used. It requires the cache logic to record, in some way, aspects of the usage of each cache line. This in itself can be a complicated process, but the complexity is mitigated by the fact that we are building significant intelligence into the replacement strategy: We choose the line that has been used least recently. The principle of Temporal locality reinforces this: If we are using a datum, we will probably need it again soon, and the fact that we haven't needed a datum recently probably means that we will not need it in the near future either. Regardless of its complexity and inherent cost, LRU has proven to be the most popular choice among processor architects; a minimal sketch of LRU victim selection follows this list.
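
The sketch below is a software model of LRU victim selection for one set of a four-way cache. The Usage Field is represented as a simple counter holding the time of the last access; real hardware often uses cheaper approximations such as pseudo-LRU bits, and all of the names here are hypothetical.

    /* LRU victim selection for one set of an n-way cache: each line
       carries a usage field holding the "time" of its last access, and
       the line with the oldest timestamp is the one to replace.          */
    #include <stdbool.h>
    #include <stdio.h>

    #define NWAYS 4

    struct line { bool valid; unsigned tag; unsigned long last_used; };

    /* Pick the way to evict: an unused line first, otherwise the
       least recently used one.                                           */
    static int lru_victim(struct line set[NWAYS])
    {
        int victim = 0;
        for (int way = 0; way < NWAYS; way++) {
            if (!set[way].valid)
                return way;                          /* free line: use it */
            if (set[way].last_used < set[victim].last_used)
                victim = way;                        /* older access wins */
        }
        return victim;
    }

    int main(void)
    {
        struct line set[NWAYS] = {
            { true, 0x1, 40 }, { true, 0x2, 10 },
            { true, 0x3, 30 }, { true, 0x4, 20 },
        };
        /* Way 1 was touched longest ago (time 10), so it is evicted.
           On every hit, the cache logic would refresh the usage field:
           set[way].last_used = current_access_counter;                   */
        printf("victim way = %d\n", lru_victim(set));
        return 0;
    }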

There are a couple of other items I want to deal with before leaving the topic of cache.

  • Multiple levels of cache

  • When we write from cache to memory

A.3.1.5 MULTIPLE LEVELS OF CACHE

As we demonstrated earlier, the speed differential between a processor and memory is alarmingly high. We have tried to alleviate this problem by employing a cache. The dilemma we now have is how big to make that cache. In an ideal world, it would be big enough to satisfy a program's locality of reference. On a real computer system running hundreds of programs, the size of this locale can be vast. The result is that our L1 cache is likely to be relatively small due to speed requirements; it needs to be physically close to the processor. In order to further alleviate the problem of having to access main memory, we can employ a second-level (L2) cache. This is often off-chip but relatively close to each processor module, with access times somewhere between those of the L1 cache and main memory. Being off-chip and slightly slower than the L1 cache means that its cost is reduced, so it can be made considerably larger.

A.3.1.6 WHEN WE WRITE FROM CACHE TO MEMORY

We need to mention one last topic in relation to cache behavior: writing from cache back to memory. When a cache line is modified, when does that change get stored back to main memory? If we were to write every change to memory immediately, we would incur the penalty of a memory transfer on every write. The positive aspect of this policy is that we know our cache and main memory are always in sync. This policy is known as write-through.

The alternative is a write-back policy. This policy writes the datum back to memory only when the cache line needs to be used by another datum. This has the advantage that if a datum is being modified frequently, we eliminate the costly additional writes to memory that a write-through policy would require. One cost of this design is that we need to monitor when a cache line has been modified. This can take the form of a dirty bit being set for a cache line, indicating that a change has been made (subsequent changes don't need to update the dirty bit, because once it is set to true we know that at least one change has been made). Another cost of a write-back policy arises when an instruction needs to update a datum that is not currently in cache. The normal mode of operation with a write-back policy is to first fetch the datum into cache and then modify it, the reason being the principle of Temporal locality. With a write-through policy, Temporal locality still applies, but because every write to cache initiates a subsequent write to memory, the fetch of the datum into cache is often not carried out; we simply schedule the update of the datum at the prescribed main memory address some time later.
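
The dirty-bit bookkeeping can be sketched in a few lines of C. This is a software model rather than real cache hardware: the memory_read/memory_write helpers are hypothetical stand-ins for the memory interface, and for simplicity the tag is treated as the full line number within a tiny 256-byte main memory, as in the Fully Associative example.

    /* Write-back behavior with a dirty bit: a store only marks the line
       dirty; the write to main memory is deferred until eviction.        */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define LINE_SIZE 4

    struct line {
        bool valid, dirty;
        unsigned tag;
        unsigned char data[LINE_SIZE];
    };

    static unsigned char main_memory[256];   /* hypothetical main memory  */

    static void memory_write(unsigned tag, const unsigned char *data)
    {
        memcpy(&main_memory[tag * LINE_SIZE], data, LINE_SIZE);
    }

    static void memory_read(unsigned tag, unsigned char *data)
    {
        memcpy(data, &main_memory[tag * LINE_SIZE], LINE_SIZE);
    }

    /* A write-through cache would also call memory_write() here. */
    static void cache_store(struct line *l, unsigned offset, unsigned char byte)
    {
        l->data[offset] = byte;
        l->dirty = true;                     /* defer the memory write    */
    }

    /* On eviction, flush the line only if it was modified, then fetch
       the new line before it is used.                                    */
    static void cache_evict(struct line *l, unsigned new_tag)
    {
        if (l->valid && l->dirty)
            memory_write(l->tag, l->data);
        memory_read(new_tag, l->data);
        l->tag = new_tag;
        l->valid = true;
        l->dirty = false;
    }

    int main(void)
    {
        struct line l = { false, false, 0, {0} };
        cache_evict(&l, 0x12);     /* bring line 0x12 into the cache       */
        cache_store(&l, 1, 0xAB);  /* modify it: the line becomes dirty    */
        cache_evict(&l, 0x07);     /* eviction flushes 0xAB back to memory */
        printf("%#x\n", main_memory[0x12 * LINE_SIZE + 1]);   /* 0xab      */
        return 0;
    }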


