The CPU/Memory Interface

The most critical interface in any computing system is the connection between memory and the CPU. If this interface doesn't function properly, the CPU cannot function because it cannot retrieve instructions. If the processor can't reliably retrieve instructions, it really doesn't matter whether anything else on the board works; you won't be using it anyway.

Understanding the CPU/memory interface is important to more than just the data and instruction stream. In most systems, peripherals share the data and address buses with memory. Thus, understanding the protocol for these buses is important to understanding much of the hardware.

This section explains, in general terms, how the CPU uses the address and data buses to communicate with other parts of the system. To make the discussion more concrete, I describe the operation in terms of the hypothetical machine detailed in the simplified schematics in Figure 1.5 and Figure 1.6. Don't worry, you won't need an electrical engineering degree to interpret these drawings. I've omitted everything except the relevant connections. The result is only slightly more detailed than the typical functional block diagram, but it is also representative of the portion of a real schematic that you would need to understand to work with most embedded processors. If you can identify the control, data, and address lines in your system, you probably know all you need to know about how to read a schematic.

Figure 1.5: Schematic CPU.

In this schematic, the signals have been grouped to show how they relate to the various system buses. Notice that nearly all of the CPU pins are dedicated to creating these buses.

Figure 1.6: Schematic Flash and RAM.

The memory devices connect directly to the system buses. Because each device is only 32K, each uses only 15 address lines. Notice how each device is activated by a separate chip select.

The CPU

In the CPU diagram (Figure 1.5), there is only one big part: the CPU. There are also two other blocks of components on this page: the clock and the reset circuit. The clock provides the CPU with the ability to step through processing states, which can vary from one cycle per instruction (RISC) to sometimes over a dozen cycles per instruction (CISC). The clock can be a crystal, or it can be a complete clock circuit, depending on the needs of the CPU. The reset/WDT circuit provides the processor with a logic level on the reset pin that forces the CPU back to its starting point. This particular circuit uses separate logic to ensure that the processor's reset pin is held low for the required amount of time after a power-up or when the Switch_1 switch is pressed. Notice that a PIO line out of the CPU feeds into this circuit to provide the WDT with a periodic sanity pulse.

The CPU in this design uses 16-bit addresses but transfers data eight bits at a time. Thus, it has a 16-bit address bus and an 8-bit data bus. Using these 16 bits, the processor can address a 64K memory space. In this design, half of that space is occupied by 32K of flash memory and the other half by 32K of SRAM (Figure 1.6).

Each CPU pin belongs to one of four groups: address, data, control, or I/O.

In this simple design, the majority of the CPU pins are committed to creating the address and data buses. Since each memory component houses only 32K of address space, the memory chips have only 15 address lines. In this design, the low-order 15 address bits are connected directly to these 15 lines on the memory components. Two additional CPU control signals, ChipSelect_0 and ChipSelect_1, activate the appropriate memory device. The most significant bit, Addr_15, is unused. (If the CPU did not provide conveniently decoded chip select lines, we could have used the high-order bit of the address bus and some additional logic, called address decode logic, to activate the appropriate memory device.)

Whenever the CPU wants to read or write a particular byte of memory, it places the address of that byte on the address lines. If the address were 0x0000, the CPU would drive all address lines to a low-voltage, logic 0 state. If the address were 0xFFFE, the CPU would drive all except the least significant address line to a high-voltage, logic 1 state.

When a device is not selected, it is in a high-impedance (electrically disconnected) state. Two more control lines on the CPU, read and write, control how a selected device connects to the data bus. If the read line is active, the selected memory chip's output circuits are enabled, allowing it to drive a data value onto the data bus. If the write line is active, the CPU's output circuits are enabled, and the selected memory chip connects only its input circuits to the data bus.

Collectively the read, write, and chip select pins are called the control/status lines.

Hexadecimal and Bus Signals

It's common to refer to address and data bus values in ASCII-coded hex, for example, "Put 0xBE at 0x26A4." In this usage, "put 0xBE" means place 0xBE on the data bus, while "at 0x26A4" means place the address 0x26A4 on the address bus. To find out what happens to individual address or data lines, you need to expand the hex to its binary equivalent. Each hex digit represents four bits, as in the following chart:

Hexadecimal   Binary        Hexadecimal   Binary
0x0           0000          0x8           1000
0x1           0001          0x9           1001
0x2           0010          0xA           1010
0x3           0011          0xB           1011
0x4           0100          0xC           1100
0x5           0101          0xD           1101
0x6           0110          0xE           1110
0x7           0111          0xF           1111

Thus, 0xBE represents the eight bits 1011 1110.

Note that the number of hexadecimal digits implies the size of the bus. 0xBE contains only two hexadecimal digits, implying that the data bus is only eight bits wide. The four hexadecimal digits in 0x26A4, on the other hand, suggest that the address bus is 16 bits wide. Because of this implicit relationship between the hex representation and the bus size, it is accepted convention to pad addresses and data values with zeros so that all bits in the bus are specified. For example, when referencing address 1 in a machine with 32-bit addresses, one would write 0x00000001, not 0x1.


Figure 1.7 summarizes the connections between the CPU and the flash device. If you compare this diagram to the schematic, you can see that the only additional connections on the flash device are for power and ground.

Figure 1.7: Connection Between CPU and Boot Flash Device.

The CPU uses the read and write signals to control the output drivers on the various memory and peripheral devices, and thus, controls the direction of the data bus.

The CPU-to-flash device interaction can be summarized with the following steps:

  1. CPU places the desired address on the address pins.

  2. CPU brings the read line active.

  3. CPU brings the appropriate chip select line active.

  4. The flash device places the data located at the specified address onto the data bus.

  5. CPU reads in the data that has been placed on the bus by the flash device.

  6. CPU releases the chip select line and processes the data.

This sequence of steps allows the CPU to retrieve the bytes from memory that ultimately become instructions. It's commonly referred to as instruction fetching. If you understand this interface, it's easy to connect other devices, because they are all fundamentally the same. The SRAM interface is identical except that a different chip select line is activated. The chip select lines are configured so that each is active for a 32K address space (ChipSelect_0 is used for the 0–32K range, and ChipSelect_1 is used for the 32K–64K range). A write access is essentially the same except that the write line is used and the data flows from CPU to memory, not memory to CPU.

So that's essentially it for address, data, and control. You have a certain number of address bits (dependent on the actual size of the device), 8 data bits (16 or 32 if you were using a different device), and a few control lines (read, write, and chip select). All that remains in the schematic is the serial port. Since most modern processors have a UART built in, there's only a driver to attach, and that does not involve any CPU interaction. In other words, you now understand the fundamentals of a simple microprocessor-based hardware design!

The Power (and Pitfalls) of Cache

For standard programming environments, cache is a blessing. It provides a real speed increase for code that is written to use it properly (refer to Figure 1.8). Caching takes advantage of a phenomenon known as locality of reference. [1] Locality of reference states that at any given point in a program's execution, the CPU is accessing some small block of memory repeatedly. The ability to pull that small block of memory into a faster memory area is a very effective way to speed up what would otherwise be a relatively slow rate of memory access.

Figure 1.8: Cache Between CPU and External Devices.

There are different levels of cache; the fastest (usually called level 1) is located on the CPU chip itself.

Cache is a fast chunk of memory placed between the memory access portion of the CPU and the real memory system to reduce the effective access time between the CPU and external memory. There are several different types of cache implementations; a discussion of these is beyond the scope of this text. Instead, this section focuses on the two major types of caches used in today's embedded systems: instruction cache (I-cache) and data cache (D-cache).

As the names imply, the two different caches are used for the two different types of CPU memory access: accesses to instruction space and accesses to data space, respectively. Caches are divided into these two major types because of the difference in the way the CPU accesses these two areas of memory. The CPU reads and writes to data space but tends to read from instruction space much more often than it writes to it.

The main limitation that cache imposes on typical high-level system programmers is that modifying instruction space, often called self-modifying code, can be dangerous. Any modification to memory space is done through a data access, so the D-cache is involved in the transaction; hence, the instruction destined for physical memory can reside in the D-cache for some undetermined amount of time after the code that made the access has completed. This behavior presents a double chance of error: the data written to instruction space might not yet be in physical memory (because it is still in the D-cache), or the contents of the instruction's address might already be in the I-cache, which means the fetch for the instruction does not actually access the physical memory that contains the new instruction.

Figure 1.9 shows how the cache gets between the data write and the instruction read. Step A shows the CPU writing to memory through the D-cache. Step B shows the transfer of the contents of that D-cache location to physical memory. Step C represents the transfer of the physical memory to the I-cache, and step D shows the CPU's memory access unit retrieving the instruction from the I-cache. If the sequence of events were guaranteed to be A-B-C-D, everything would work fine. However, this sequence cannot be guaranteed, because that would eliminate the efficiency gained by using the cache. The whole point behind cache is to attempt to eliminate the B and C steps whenever possible. The ultimate result is that the instruction fetch may be corrupted as a result of skipping step B, step C, or both.

Figure 1.9: Data-Cache Instruction-Cache Inconsistency.

Cache increases performance by allowing the CPU to fetch frequently used values from fast, internal cache instead of from slower external memory. However, because the cache control mechanism makes different assumptions about how data and instruction spaces are manipulated, self-modifying code can create problems. Cache can also create problems if it masks changes in peripheral registers.

For embedded systems, the problem just gets worse. Understanding the above problem makes the secondary problems fairly clear. Notice in Figure 1.8 that there is a flash device, DRAM, and a UART. Two additional complexities become apparent:

  1. The UART is on the same address/data bus as the memory, which means that accesses to a UART or any other physical device outside the CPU must deal with the fact that cache can get in the way. Hardware must be designed with this consideration in mind (or the firmware must configure the hardware) so that certain devices external to the CPU can easily be accessed without the data cache being involved.

  2. The UART may be configured to use DMA to place incoming characters into memory space. In many systems, DMA and cache are independent of each other. The data cache is likely to be unaware of memory changes due to DMA transfers, which means that if the data cache sits between the CPU and this memory space, more inconsistencies must be dealt with.

The complexity of the hardware and the need for speed make these issues tricky but not insurmountable. Most of the prior problems are solved through good hardware and firmware design. The initial issue of I-cache and D-cache inconsistency can be resolved by invoking a flush of the data cache and an invalidation of the instruction cache. A flush forces the content of the data cache out to the real memory. An invalidation empties the instruction cache so that future CPU requests retrieve a fresh copy of corresponding memory.

Also, some characteristics of cache can help resolve these problems. For example, a write-through data cache ensures that data written to memory is placed in cache but also written out to physical memory. This guarantees that data writes pass through the cache to real memory; hence it really only provides a speed improvement for data reads. Also, a facility in some CPUs called bus snooping helps with the memory inconsistency issues related to DMA and cache. Bus snooping hardware detects when DMA is used to access memory space that is cached and automatically invalidates the corresponding cache locations. Bus snooping hardware isn't available on all systems, however. Additionally, to avoid the problem of cache sitting between the CPU and the UART, devices can usually be mapped to a memory location that doesn't involve the cache at all. It is very common for a CPU to restrict caching to certain specific cacheable regions of memory, rather than designating its entire memory space cacheable. The bottom line is that the firmware developer must be aware of these hardware and firmware capabilities and limitations in order to deal with these complexities efficiently.

[1] M. Morris Mano, Computer System Architecture, Second Ed. (Englewood Cliffs, NJ: Prentice Hall, 1982), p. 501.



Embedded Systems Firmware Demystified (With CD-ROM)
ISBN: 1578200997
Year: 2002
Pages: 118
Author: Ed Sutter