11.3 Relative Performance of Memory Subsystems

If you take another look at Figure 11-1, you'll notice that the speed of the various memory hierarchy levels increases as you go up. Exactly how much faster is each successive level? The short answer is that the speed gradient is not uniform: the difference between any two adjacent levels ranges from almost nothing to roughly four orders of magnitude.

Registers are, unquestionably, the best place to store data you need to access quickly. Accessing a register never requires any extra time, and most machine instructions that access data can access register data. Furthermore, instructions that access memory often require extra bytes (displacement bytes) as part of the instruction encoding. This makes instructions longer and, often, slower.

If you read Intel's instruction timing tables for the 80x86, you'll see that an instruction like mov(someVar,ecx); is supposed to run as fast as an instruction of the form mov(ebx,ecx);. However, if you read the fine print, you'll find that this claim rests on an important assumption: that someVar's value is present in the L1 cache. If it is not, the cache controller needs to look in the L2 cache, in main memory, or, worse, on disk in the virtual memory subsystem. All of a sudden, an instruction that should execute in one nanosecond on a 1-GHz processor (that is, in one clock cycle) requires several milliseconds to execute. That's a difference of over six orders of magnitude. It is true that subsequent accesses of this variable will take place in just one clock cycle, because it will then be stored in the L1 cache. But even if you access someVar's value one million times while it is still in the cache, the average time of each access will be about two cycles, because of the large amount of time needed to access someVar the very first time.

Granted, the likelihood that a given variable will be located on disk in the virtual memory subsystem is quite low. But there is still a performance difference of roughly three orders of magnitude between the L1 cache subsystem and the main memory subsystem. If the program has to fetch the data from main memory (a cost on the order of 1,000 cycles), then even 999 memory accesses later you're still paying an average cost of about two clock cycles per access for data that Intel's documentation claims should be accessible in one cycle.
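
The arithmetic behind these "about two cycles" averages is easy to verify. The following C sketch is purely illustrative; the miss costs are round-number assumptions (roughly 1,000 cycles for a main-memory miss, and roughly 1,000,000 cycles, or about 1 ms at 1 GHz, for a miss serviced from disk), not figures from Intel's documentation:

    #include <stdio.h>

    /* Amortized cost of one cold miss followed by (accesses - 1)
       one-cycle L1 hits. All cycle counts here are illustrative
       assumptions, not measured values. */
    static double avg_cycles(double miss_cost, long accesses)
    {
        return (miss_cost + (accesses - 1) * 1.0) / accesses;
    }

    int main(void)
    {
        /* ~1,000-cycle main-memory miss, then 999 L1 hits */
        printf("main memory: %.2f cycles/access\n",
               avg_cycles(1000.0, 1000));

        /* ~1,000,000-cycle disk miss, then 999,999 L1 hits */
        printf("disk:        %.2f cycles/access\n",
               avg_cycles(1000000.0, 1000000));
        return 0;
    }

Both cases print an average of about 2.00 cycles per access: the single expensive miss doubles the average cost, no matter how cheap the subsequent hits are.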

The speed difference between the L1 and L2 caches is not nearly so dramatic, unless the secondary cache is packaged separately from the CPU. On a 1-GHz processor the L1 cache must respond within one nanosecond if the cache operates with zero wait states (some processors actually introduce wait states in L1 cache accesses, but CPU designers try to avoid them).

Accessing data in the L2 cache is always slower than accessing data in the L1 cache; an L2 access always incurs the equivalent of at least one wait state, and perhaps as many as eight.

There are several reasons why L2 cache accesses are slower than L1 accesses. First, it takes the CPU time to determine that the data it is seeking is not in the L1 cache; by the time it does, the memory-access cycle is nearly complete, and there is no time left in that cycle to access the L2 cache. Second, the circuitry of the L2 cache may be slower than that of the L1 cache, in order to make the L2 cache less expensive. Third, L2 caches are usually 16 to 64 times larger than L1 caches, and larger memory subsystems tend to be slower than smaller ones. All this adds up to additional wait states when accessing data in the L2 cache. As noted earlier, the L2 cache can be as much as one order of magnitude slower than the L1 cache.

The L1 and L2 caches also differ in the amount of data the system fetches when there is a cache miss (see Chapter 6). When the CPU fetches data from or writes data to the L1 cache, it generally fetches or writes only the data requested. If you execute a mov(al,memory); instruction, the CPU writes only a single byte to the cache. Likewise, if you execute the mov(mem32,eax); instruction, the CPU reads exactly 32 bits from the L1 cache. However, access to the memory subsystems below the L1 cache does not work in such small chunks. Usually, memory subsystems move blocks of data, or cache lines, whenever accessing lower levels of the memory hierarchy. For example, if you execute the mov(mem32,eax); instruction and mem32's value is not in the L1 cache, the cache controller doesn't simply read mem32's 32 bits from the L2 cache, assuming it's present there. Instead, the cache controller reads a whole block of bytes (generally 16, 32, or 64 bytes, depending on the particular processor) from the L2 cache. The hope is that the program exhibits spatial locality, so that reading a block of bytes will speed up future accesses to adjacent objects in memory. The bad news, however, is that the mov(mem32,eax); instruction doesn't complete until the L1 cache reads the entire cache line from the L2 cache. This extra time is known as latency. If the program does not go on to access memory objects adjacent to mem32, this latency is wasted time.
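
You can observe the effect of cache-line fills and spatial locality even from a high-level language. The following C sketch is a hypothetical experiment: it assumes a 64-byte cache line and an array much larger than the caches, and actual timings will vary by processor and compiler. It sums the same array twice, once sequentially and once in a stride of 16 ints, so that nearly every access in the second pass touches a new cache line:

    #include <stdio.h>
    #include <time.h>

    #define N (1 << 24)   /* 16M ints (64MB): far larger than any cache */
    static int data[N];

    /* Sum every element in order: consecutive accesses fall within
       the same cache line, so most of them are L1 hits. */
    static long sum_sequential(void)
    {
        long sum = 0;
        for (int i = 0; i < N; i++)
            sum += data[i];
        return sum;
    }

    /* Same additions, different order: jumping 16 ints (64 bytes,
       a typical cache-line size) between accesses means nearly
       every access begins a new cache-line fill. */
    static long sum_strided(void)
    {
        long sum = 0;
        for (int start = 0; start < 16; start++)
            for (int i = start; i < N; i += 16)
                sum += data[i];
        return sum;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)   /* touch every page up front */
            data[i] = 1;

        clock_t t0 = clock();
        long s1 = sum_sequential();
        clock_t t1 = clock();
        long s2 = sum_strided();
        clock_t t2 = clock();

        printf("sequential: sum=%ld  %.3f s\n", s1,
               (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("strided:    sum=%ld  %.3f s\n", s2,
               (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }

Both functions perform exactly the same additions; only the access order differs. On most machines the strided version runs measurably slower, because it defeats the spatial locality that cache-line fills are designed to exploit.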

A similar performance gulf separates the L2 cache and main memory, which is typically one order of magnitude slower than the L2 cache. Again, to speed up access to adjacent memory objects, the L2 cache reads data from main memory in blocks (cache lines).

Standard DRAM is three to four orders of magnitude faster than disk storage. To compensate for this difference, there is usually a difference of two to three orders of magnitude in size between the L2 cache and main memory, so that the ratio of disk speed to main memory speed roughly matches the ratio of main memory speed to L2 cache speed. (Balanced performance characteristics between adjacent levels are an attribute to strive for in the memory hierarchy, because they let a program use the different memory levels effectively.)
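
One way to see why balance matters is to compute the effective access time of a two-level pairing with the standard formula: average time = hit rate × fast time + miss rate × slow time. The figures in this C sketch are illustrative assumptions only (10 ns for L2, 100 ns for DRAM, 1 ms for disk), not specifications of any particular system:

    #include <stdio.h>

    /* Effective access time across a two-level pairing:
       t = hit_rate * t_fast + (1 - hit_rate) * t_slow.
       All times (in ns) and hit rates below are illustrative
       assumptions, not measurements. */
    static double effective_ns(double hit_rate, double t_fast, double t_slow)
    {
        return hit_rate * t_fast + (1.0 - hit_rate) * t_slow;
    }

    int main(void)
    {
        /* L2 (10 ns) backed by DRAM (100 ns): one order of magnitude */
        printf("L2/DRAM,   99%%    hits: %8.2f ns\n",
               effective_ns(0.99, 10.0, 100.0));

        /* DRAM (100 ns) backed by disk (1 ms): four orders of magnitude.
           Even a 99.99 percent hit rate doubles the average access time. */
        printf("DRAM/disk, 99.99%% hits: %8.2f ns\n",
               effective_ns(0.9999, 100.0, 1000000.0));
        return 0;
    }

The slower the lower level, the higher the hit rate (and therefore the larger the upper level) must be to keep the average access time in check, which is why main memory is so much larger than the L2 cache.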

We will not consider the performance of the other memory-hierarchy subsystems in this chapter, as they are more or less under programmer control. Because their access is not automatic, very little can be said about how frequently a program will access them. However, in Chapter 12 we'll take another look at issues regarding these storage devices.



