4.5 Data Access Instructions
We now discuss data access using the Itanium load and store instructions,
For some architectures, the presence and nature of cache structures are left entirely to implementation. Cache is then not a concern of architecture, since there is no way to interact with cache structures through the instruction set. One can only observe the effect of the cache by comparing the execution times of a benchmark program on systems that have or that lack particular cache structures.
For other architectures,
4.5.1 Itanium Cache Structures
The Itanium architecture specifies that cache structures be explicitly visible to the assembly language programmer and the compiler writer. Table 4-3 gives a simplified view of the memory hierarchy using the quantitative details for the Itanium 2 processor. The line size for a cache specifies the granularity with which data move between a cache level and higher-numbered levels or main memory. For example, the L3 cache always reads 128 bytes from main memory or copies back 128 bytes to main memory. The line size does not preclude moving smaller
Table 4-3. Characteristics of Itanium 2 Memory Hierarchy
Cache structures improve access to data at the expense of greater cost of implementation. For the Itanium 2 processor, the L1, L2, and L3 cache structures are all on the main processor chip. Chapter 13 details the differences in cache structure for the original Itanium processor. (Refer to hardware organization books for
Figure 4-4 shows the relationships among the
Figure 4-4. Itanium 2 system cache relationships
The history of computers has given us two
Most contemporary systems are designed using von Neumann's principles, but may have Harvard-style structures at the innermost (L1) cache. With RISC and EPIC designs having many registers, sequences of instructions should be able to work with data already retrieved by the CPU. A separate read-only cache for instructions (L1-I) can have a different design from a read/write cache for data (L1-D), and their separate connections to the control and data paths of the CPU can be kept short.
Table 4-3 shows timing differences because the Itanium 2 processor connects floating-point data only to the L2 cache structure. Hence only integer and logical operations, which are more
Compared to caches in other contemporary designs, the innermost (L1, L2) Itanium cache structures are not especially large. The Itanium ISA provides ways to optimize utilization of the cache structures by
The integer store and load instructions, discussed
4.5.2 Integer Store Instructions
There are two kinds of integer store instructions for Itanium architecture, the normal form and the spill form, having the following assembler syntax:
    stsz.sttype.sthint  [r3] = r2          // mem[r3] <- r2
    stsz.sttype.sthint  [r3] = r2, imm9    // mem[r3] <- r2
                                           // r3 <- r3 + sext(imm9)
    st8.spill.sthint    [r3] = r2          // spill data and NaT bit
    st8.spill.sthint    [r3] = r2, imm9    // spill data and NaT bit
                                           // r3 <- r3 + sext(imm9)
where sz is the size of the information unit in memory into which the
Store operations can be susceptible to hardware-
There are two values for sttype (the store type completer): none at all and rel . None at all corresponds to an ordinary store operation, and rel corresponds to an ordered store performed with release semantics that we shall not discuss further.
There are two values for sthint (the store hint completer): none at all and nta. None at all corresponds to an ordinary store operation; the processor hardware then assumes that the program
The Itanium store instructions provide for postmodification of the value in the pointer register r3 by a signed adjustment
The spill form always copies 8 bytes, along with a validity bit (the NaT bit) associated with register r2. The spill form is used to save register contents when an operating system switches context from one process to another or when an application uses a preserved register.
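To make the normal store form with postmodification concrete, the following C function is a minimal sketch of its effect, not an Itanium API. The name store_post_inc, the byte array standing in for memory, and the little-endian byte order are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a sized store with an imm9 post-increment.
   mem stands in for memory; *r3 is a byte index into it. */
void store_post_inc(uint8_t *mem, uint64_t *r3, uint64_t r2,
                    unsigned sz, int imm9)
{
    assert(sz == 1 || sz == 2 || sz == 4 || sz == 8);
    assert(imm9 >= -256 && imm9 <= 255);

    /* Copy the low-order sz bytes of r2, least significant byte first
       (little-endian order is an assumption of this sketch). */
    for (unsigned i = 0; i < sz; i++)
        mem[*r3 + i] = (uint8_t)(r2 >> (8 * i));

    /* Post-modify the pointer only after the old value has been used. */
    *r3 += (uint64_t)(int64_t)imm9;
}
```

With a pointer value of 4, a 2-byte store with an adjustment of 8 writes two bytes at indexes 4 and 5 and leaves the pointer at 12, ready for the next access.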
4.5.3 Integer Load Instructions
There are two kinds of integer load instructions for Itanium architecture, the normal form and the fill form, having the following assembler syntax:
    ldsz.ldtype.ldhint  r1 = [r3]          // r1 <- mem[r3]
    ldsz.ldtype.ldhint  r1 = [r3], r2      // r1 <- mem[r3]
                                           // r3 <- r3 + r2
    ldsz.ldtype.ldhint  r1 = [r3], imm9    // r1 <- mem[r3]
                                           // r3 <- r3 + sext(imm9)
    ld8.fill.ldhint     r1 = [r3]          // fill data and NaT bit
    ld8.fill.ldhint     r1 = [r3], r2      // fill data and NaT bit
                                           // r3 <- r3 + r2
    ld8.fill.ldhint     r1 = [r3], imm9    // fill data and NaT bit
                                           // r3 <- r3 + sext(imm9)
where sz is the size of the information unit in memory at the location specified by register r3 from which 1, 2, 4, or 8 bytes are to be copied into the lowest-order 1, 2, 4, or 8 bytes of register r1 by the normal form. The upper-order 7, 6, 4, or 0 bytes of register r1 are zeroed—put another way, the loaded data are zero-extended to a full 64 bits in width and then placed in register r1 . The simplest load instruction uses register indirect addressing for the source operand and register direct addressing for the destination operand.
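The zero-extension rule can be illustrated with a short C model. This is only a sketch under stated assumptions: load_zero_extend is a hypothetical name, memory is modeled as a byte array indexed by r3, and little-endian byte order is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a sized normal load: the sz bytes at index r3
   land in the low-order bytes of the result, and the rest stay zero. */
uint64_t load_zero_extend(const uint8_t *mem, uint64_t r3, unsigned sz)
{
    assert(sz == 1 || sz == 2 || sz == 4 || sz == 8);

    uint64_t r1 = 0;                       /* upper bytes start as zero */
    for (unsigned i = 0; i < sz; i++)
        r1 |= (uint64_t)mem[r3 + i] << (8 * i);
    return r1;
}
```

A byte holding 0xFF therefore arrives in the register as 255, not as -1. When the data are meant to be signed, the program needs a separate sign-extension step; the architecture provides the sxt1, sxt2, and sxt4 instructions for that purpose.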
Load operations can be susceptible to
There are several values for ldtype (the load type completer). None at all corresponds to an ordinary load operation. Other values
There are three values for ldhint (the load hint completer): none at all, nt1, and nta. None at all corresponds to an ordinary load operation; the processor hardware is given the hint that the program associates temporal locality in the L1 cache with the value loaded. At the other extreme, nta provides the hint that the program considers the value loaded to have nontemporal locality at all levels of cache and memory, while nt1 provides the hint that the program considers the value loaded to have nontemporal locality in just L1 cache. The use of nta may thus avoid knocking out of the caches some other data that may be reused. The use of nt1 may avoid knocking out of L1 cache some other data that may be reused, but suggests defensively that the data loaded may need to be
Like the Itanium store instructions, the load instructions provide for postmodification of the pointer value in register r3 by a signed amount, which can range from -256 to +255 address units (immediate constant) or can be a full 64-bit signed amount (in register r2 ).
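The range of -256 to +255 follows directly from sign-extending a 9-bit field. The C sketch below shows one common way to perform that extension; the names sext9 and post_increment are hypothetical helpers for illustration, not part of any Itanium toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend a 9-bit field: bit 8 is the sign bit. */
int64_t sext9(uint32_t field)
{
    assert(field <= 0x1FF);                /* only 9 bits are meaningful */
    /* Flip the sign bit, then subtract its weight (256). */
    return (int64_t)(field ^ 0x100u) - 0x100;
}

/* Apply a signed adjustment to a 64-bit pointer value. */
uint64_t post_increment(uint64_t r3, int64_t amount)
{
    return r3 + (uint64_t)amount;          /* wraps modulo 2^64 */
}
```

The field value 0x0FF gives the largest adjustment, +255, and 0x100 gives the smallest, -256. Larger strides require the register form of postmodification.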
The fill form always copies 8 bytes along with a validity bit (the NaT bit) associated with register r1. The fill form is used to restore register contents when an operating system switches context from one process to another or when an application uses a preserved register.
4.5.4 Move Long Immediate Instruction
Address pointers are essential for instructions that perform store and load operations. In RISC and EPIC architectures, the instructions that use such pointers expect a pointer value to be already loaded in a register. We have seen how the mov pseudo-op can place an integer into a register, but the width of that integer is limited to the 22 bits of immediate data that can be encoded into an underlying addl instruction.
The SQUARES and HEXNUM programs have already shown one way around that limitation. In those programs, we used capabilities of the assembler and linker to develop addl instructions containing a signed 22-bit offset from the value in a global pointer register. That method works so long as the total amount of data that we need to reach does not exceed the relative addressing capability of 22 bits (4 MiB). Otherwise, another pointer register may be needed to span the address range for the data.
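The reach of a signed 22-bit offset can be checked with a small C helper. This is a sketch only: fits_signed is a hypothetical function, and the sample offsets used with it are illustrative rather than taken from SQUARES or HEXNUM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Return true when value is representable as a signed field of the
   given width, that is, when -2^(bits-1) <= value < 2^(bits-1). */
bool fits_signed(int64_t value, unsigned bits)
{
    assert(bits >= 1 && bits <= 63);
    int64_t limit = (int64_t)1 << (bits - 1);
    return value >= -limit && value < limit;
}
```

For 22 bits the limit is 2,097,152, so an offset from the global pointer may lie between -2 MiB and 2 MiB - 1, a total span of 4 MiB. A data item placed outside that window cannot be reached with a single gp-relative addl.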
What could be a more general approach? One answer would be an instruction that would load a full-width address value as immediate data from the instruction stream into a general register.
Itanium architecture provides a special instruction called movl that can accommodate a full 64-bit immediate value:
    movl  r1 = imm64    // r1 <- imm64
    movl  r1 = label    // r1 <- 64-bit address for label
where the 64-bit immediate value, or the full 64-bit address finally determined by the linker for a label, is copied into the general register r1. Unlike the addl instruction, which restricts source register r3 to Gr0-Gr3, the movl instruction can put the value into any general register, Gr1-Gr127.
This instruction provides a complete solution to the challenge of full-range addressing. The 64-bit immediate value fits into unused space in one instruction slot and the next entire instruction slot; thus, movl occupies two slots in a 128-bit instruction bundle and must be decoded for execution as a special case.
Since Itanium assemblers permit the 64-bit immediate value to be specified symbolically, we now have another method for setting up pointers for store and load operations. For instance, in HEXNUM, we could replace the lines:
    addl  r14 = @gprel(H1),gp;;    // Point to storage
    ld8   r21 = [r14];;            // for 1st hex digit
with the lines:
    movl  r14 = H1;;               // Point to storage
    ld8   r21 = [r14];;            // for 1st hex digit
but not with the lines:
    mov   r14 = H1;;
    ld8   r21 = [r14];;
which would not assemble and link successfully because the mov pseudo-op would only generate an adds or addl instruction (see Section 4.2.4), and the true address determined by the linker for H1 would be wider than 14 or 22 bits.
4.5.5 Accessing Simple Record Structures
Application data can be
The same fields of information would occur for each of the 40 kinds of widgets, and it makes sense to store those fields in the same order for each kind of widget. This can be done by defining address offsets as symbolic displacements:
    MODEL   = 0
    COST    = 8
    PRICE   = 16
    FREIGHT = 24
    STOCK   = 32
    RECSIZE = 40
In a program, one register can be stepped along by the total record size (here, 40 bytes), and the offsets will then specify each field. For example, the total cost of manufacture of all the widgets in the inventory can be computed schematically as
        movl  r14 = TABLE-RECSIZE;;    // Pre-index for data
        mov   r20 = 0;;                // Value accumulator
loop:   adds  r14 = RECSIZE,r14;;      // Point to next record
        adds  r15 = COST,r14;;         // Set pointer
        ld8   r21 = [r15];;            // to get cost
        adds  r15 = STOCK,r14;;        // Set pointer
        ld8   r22 = [r15];;            // to get quantity
        <now need to multiply: r23 = r21 * r22>
        add   r20 = r23,r20;;          // Running value
        <goto loop if not at end of table>
done:   <print report using r20 as value of inventory>
where TABLE is an address label for the start of the data. The address pointer ( r14 ) is incremented at the top of the loop, even for the first traversal, and thus must be preindexed to the address value given by the symbolic expression TABLE - RECSIZE .
The addressing technique here is limited to a RECSIZE that does not exceed the capacity of an immediate value for the adds instruction. We have not attempted to provide details for the actual multiplication, loop control, or printing.
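For readers who want to see the whole traversal in executable form, here is a C sketch of the same addressing pattern. It is a model under stated assumptions rather than a translation of any sample program: memory is a byte array, addresses are byte indexes, little-endian order is assumed, and the record count is passed in explicitly because the schematic loop leaves loop control open. The variable names mirror the registers used in the schematic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field offsets within one record, as defined in the text. */
enum { MODEL = 0, COST = 8, PRICE = 16, FREIGHT = 24, STOCK = 32, RECSIZE = 40 };

/* Read 8 bytes at byte index addr, least significant byte first (assumed). */
static uint64_t load8(const uint8_t *mem, uint64_t addr)
{
    uint64_t v = 0;
    for (unsigned i = 0; i < 8; i++)
        v |= (uint64_t)mem[addr + i] << (8 * i);
    return v;
}

/* Sum cost * stock over count records that start at index table. */
uint64_t inventory_value(const uint8_t *mem, uint64_t table, size_t count)
{
    assert(mem != NULL);

    uint64_t r14 = table - RECSIZE;    /* pre-index; unsigned math wraps */
    uint64_t r20 = 0;                  /* value accumulator */
    for (size_t i = 0; i < count; i++) {
        r14 += RECSIZE;                            /* point to next record */
        uint64_t r21 = load8(mem, r14 + COST);     /* cost */
        uint64_t r22 = load8(mem, r14 + STOCK);    /* quantity */
        r20 += r21 * r22;                          /* running value */
    }
    return r20;
}
```

Two records with costs 3 and 5 and stock counts 4 and 6 yield an inventory value of 42. Because the pointer is advanced at the top of the loop, the first pass lands exactly on the first record, which is why the starting value is one record size below the table.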
The capability of the assembler to handle symbolic offsets and address subtraction has allowed us to write a readable and
4.5.6 Access to Specialized CPU Registers
In keeping with EPIC principles, the Itanium architecture gives the programmer access to many internal resources of the processor, because exploiting the full power of EPIC features requires more compile-time and runtime decisions than CISC or RISC architectures do.
In addition to large
    mov  r1 = reg      // r1 <- contents of reg
    mov  reg = r2      // reg <- contents of r2
    mov  reg = imm8    // reg <- sext(imm8)
                       // (immediate form only for ar.xx)
These instructions must execute in either M- or I-units, depending on the particular specialized resource, as tabulated in Appendix C. The instruction pointer can be read, but not modified, using a mov instruction. The form with an 8-bit immediate value as the source is only available when the destination is one of the application registers, Ar0-Ar127.
We show examples of these mov instructions when needed in our sample programs; the next chapter introduces the application register called ar.lc , the loop count register.