4.6 Other ALU Instructions | Itanium® Architecture for Programmers: Understanding 64-Bit Processors and EPIC Principles

The principal arithmetic and logical instructions of the Itanium ISA operate on full 64-bit quantities. We have seen, however, that load and store instructions can access smaller information units.

When data are stored, only the lowest-order 1, 2, 4, or 8 bytes in the source register are copied to the destination in memory; the highest 7, 6, 4, or 0 bytes are discarded. For example, the number -2 would be truncated from 0xfffffffffffffffe (64-bit two's complement) to 0xfe (8-bit two's complement) by an st1 instruction. If overflow has already occurred with respect to the destination size, the result stored into memory will be misleading. For example, the number -257₁₀ is truncated from 0xfffffffffffffeff to 0xff by an st1 instruction; 0xff reads as -1₁₀ in 8-bit two's complement, or as +255₁₀ if it is later loaded with zero-extension.
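The truncation performed by a store can be modeled in C to check the two examples above. This is only a sketch: the function names stored_bits and fits_signed are hypothetical helpers introduced for illustration, not Itanium instructions, and size is assumed to be 1, 2, 4, or 8.

```c
#include <stdint.h>

/* Hypothetical model of st1/st2/st4/st8: return the bit pattern that
 * reaches memory when the low `size` bytes of `reg` are stored.
 * Assumes size is 1, 2, 4, or 8. */
uint64_t stored_bits(uint64_t reg, unsigned size)
{
    if (size >= 8)
        return reg;                                   /* nothing discarded */
    return reg & ((UINT64_C(1) << (8 * size)) - 1);   /* keep low bytes only */
}

/* Hypothetical check: does a signed 64-bit value fit in `size` bytes of
 * two's complement? If not, the stored pattern will be misleading. */
int fits_signed(int64_t value, unsigned size)
{
    if (size >= 8)
        return 1;
    int64_t max = (INT64_C(1) << (8 * size - 1)) - 1; /* e.g. +127 for 1 byte */
    int64_t min = -max - 1;                           /* e.g. -128 for 1 byte */
    return value >= min && value <= max;
}
```

With this model, -2 stored as one byte gives 0xfe and passes fits_signed, while -257 stored as one byte gives 0xff and fails it.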

When data are loaded, only the lowest-order 1, 2, 4, or 8 bytes in the destination register are filled with data copied from the information unit of that same size in memory. Itanium integer load instructions put zeros into the remaining higher-order bytes in the destination register (Section 4.5.3).
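As a rough illustration of this zero-filling behavior, the following C sketch models an integer load of 1, 2, 4, or 8 bytes. The name load_zero_extended is hypothetical, and the sketch assumes little-endian byte order in memory; it is not a description of how any implementation performs the access.

```c
#include <stdint.h>

/* Hypothetical model of ld1/ld2/ld4/ld8: fill the low `size` bytes of a
 * 64-bit register from memory and leave every higher byte zero.
 * Assumes little-endian byte order and size of 1, 2, 4, or 8. */
uint64_t load_zero_extended(const unsigned char *mem, unsigned size)
{
    uint64_t reg = 0;                          /* higher-order bytes stay zero */
    for (unsigned i = 0; i < size && i < 8; i++)
        reg |= (uint64_t)mem[i] << (8 * i);    /* byte i lands in bits 8i+7..8i */
    return reg;
}
```

Note that a byte holding 0x8a arrives in the register as 0x000000000000008a regardless of whether the program regards it as signed.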

Some other architectures provide load instructions in pairs, one form that zero-extends and loads the data, and another that sign-extends and loads the data. Since sign-extension can prolong the execution time of load instructions, the Itanium architecture has separate instructions to sign-extend and zero-extend data in a register to a full 64 bits. This has the further advantage of making those operations available for other uses.

4.6.1 Sign-Extend Instruction

Suppose a load instruction has put a zero-extended quantity of byte, word, or double word size into a register. Further suppose that the quantity was meant to be a signed value when it resided in memory as a byte, word, or double word.

We need some method for sign-extension so that the full 64-bit pattern in the destination register will behave as a proper 64-bit two's complement signed quantity. The Itanium architecture requires a follow-up instruction to perform sign-extension when that outcome is wanted:

 sxtxsz   r1=r3         // r1 <- sext(r3)

where xsz is 1, 2, or 4 to select the bit position (7, 15, or 31) in source register r3 from which the sign bit is to be propagated all the way to bit <63> in the destination register r1. The contents of register r3 remain unchanged, although it is permissible to specify the same general register Gr_n for both r1 and r3.

As an example, a byte that had been loaded from memory into a register as 0x000000000000008a will become 0xffffffffffffff8a after application of an sxt1 instruction, because the sign bit of the byte is 1 (the leading hexadecimal digit 0x8 expands to binary 1000).
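The effect of the sign-extend instruction can be sketched in C as follows. The function sxt is a hypothetical model written for illustration; its xsz parameter is assumed to be 1, 2, or 4, matching the three instruction forms.

```c
#include <stdint.h>

/* Hypothetical model of sxt1/sxt2/sxt4: propagate bit 7, 15, or 31 of r3
 * through bit 63 of the result. Assumes xsz is 1, 2, or 4. */
uint64_t sxt(uint64_t r3, unsigned xsz)
{
    unsigned bits     = 8 * xsz;                     /* 8, 16, or 32 */
    uint64_t low_mask = (UINT64_C(1) << bits) - 1;   /* selects the low field */
    uint64_t sign_bit = UINT64_C(1) << (bits - 1);   /* bit 7, 15, or 31 */
    uint64_t low      = r3 & low_mask;               /* ignore the upper bits */

    /* Flipping the sign bit and then subtracting it leaves non-negative
     * values unchanged and fills the upper bits with ones for negative
     * values; unsigned arithmetic wraps, so this is well defined in C. */
    return (low ^ sign_bit) - sign_bit;
}
```

For the example above, sxt(0x8a, 1) yields 0xffffffffffffff8a, while a byte with a clear sign bit, such as 0x7f, is returned unchanged.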

4.6.2 Zero-Extend Instruction

Suppose some instruction sequence has placed a value into a register, but will then use only the lowest-order quantity of byte, word, or double word size from that register. To accomplish this, we can use masking to force to zero some portion of the full bit pattern.

Most architectures use a logical AND to accomplish masking with complete versatility, as discussed in the next chapter. The Itanium architecture also provides a special instruction that performs zero-extension:

 zxtxsz   r1=r3         // r1 <- zext(r3)

where xsz is 1, 2, or 4 to select the range of bits (<63:8>, <63:16>, or <63:32>) that will be set to zero in the destination register r1; bits <7:0>, <15:0>, or <31:0> will be copied from the source register r3. The contents of register r3 remain unchanged, although it is permissible to specify the same general register Gr_n for both r1 and r3. As an example, the value 0x1234567890fedc8a will become 0x000000000000008a after execution of a zxt1 instruction.
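Zero-extension is equivalent to a logical AND with a mask of 8, 16, or 32 low-order one bits, which the following C sketch makes explicit. The function zxt is a hypothetical model for illustration, with xsz assumed to be 1, 2, or 4.

```c
#include <stdint.h>

/* Hypothetical model of zxt1/zxt2/zxt4: copy bits <7:0>, <15:0>, or
 * <31:0> of r3 and force every higher bit to zero.
 * Assumes xsz is 1, 2, or 4. */
uint64_t zxt(uint64_t r3, unsigned xsz)
{
    uint64_t mask = (UINT64_C(1) << (8 * xsz)) - 1;  /* 0xff, 0xffff, 0xffffffff */
    return r3 & mask;                                /* same as an AND mask */
}
```

Applied to 0x1234567890fedc8a, the three forms give 0x8a, 0xdc8a, and 0x90fedc8a respectively.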

The combination of a full-width load instruction followed by zero-extension could be preferable to using a byte load instruction if byte access from memory is more time-consuming. This typifies the choices that compilers can make in order to produce a program that will run better on a particular system, since an ISA often presents more than one valid instruction sequence for a task. Either sequence will produce correct results on any implementation, but the relative performance merits may vary from one implementation to another.
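The two alternatives can be compared with a small C model. This is a sketch under stated assumptions: the helper names are hypothetical, memory is taken to be little-endian so that the wanted byte occupies bits <7:0> after the full-width load, and all eight bytes at the address are assumed to be accessible.

```c
#include <stdint.h>

/* Hypothetical little-endian load of `size` bytes, zero-filled above. */
uint64_t load_le(const unsigned char *mem, unsigned size)
{
    uint64_t reg = 0;
    for (unsigned i = 0; i < size && i < 8; i++)
        reg |= (uint64_t)mem[i] << (8 * i);
    return reg;
}

/* Sequence A: a single byte load (models ld1). */
uint64_t byte_via_ld1(const unsigned char *mem)
{
    return load_le(mem, 1);
}

/* Sequence B: a full-width load followed by zero-extension
 * (models ld8 then zxt1). Assumes 8 readable bytes at mem. */
uint64_t byte_via_ld8_zxt1(const unsigned char *mem)
{
    uint64_t full = load_le(mem, 8);
    return full & UINT64_C(0xff);
}
```

Both functions return the same value for the same address; which sequence runs faster is a property of the implementation, not of this model.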

4.6.3 Instructions for Quantities Less Than 64 Bits in Width

While we primarily illustrate the integer instructions for full 64-bit operands, the Itanium ISA has instructions that perform arithmetic operations on other data widths:

addp4 and shladdp4 are designed to yield full 64-bit addresses from 32-bit pointers and are used to migrate 32-bit code; and
Multimedia instructions, such as pmpy2 (Section 4.2.5), are designed to exploit the 64-bit datapath by working in parallel on integer values of byte, word, and double word width.
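To give a feel for how a 64-bit address can be formed from a 32-bit pointer, here is a C sketch of the register form of addp4. It follows our reading of the architecture manual, in which the upper 32 bits of the sum are cleared and bits <31:30> of the second source are copied into bits <62:61> of the result; treat that detail as an assumption to verify against the manual. The function name and parameters are hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of "addp4 r1 = r2, r3" (assumed semantics):
 *   1. add the two sources,
 *   2. keep only bits <31:0> of the sum,
 *   3. copy bits <31:30> of r3 into bits <62:61> of the result. */
uint64_t addp4(uint64_t r2, uint64_t r3)
{
    uint64_t sum32  = (r2 + r3) & UINT64_C(0xffffffff); /* 32-bit wrapped sum */
    uint64_t region = (r3 >> 30) & UINT64_C(0x3);       /* top 2 pointer bits */
    return sum32 | (region << 61);                      /* place in <62:61> */
}
```

Under these assumptions, adding an offset of 0x10 to the 32-bit pointer 0x40000000 gives 0x2000000040000010.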

When full numeric precision and large address spaces are not needed, data cache effectiveness (Section 4.5.1) can actually improve if 32-bit address pointers and data widths less than 64 bits are used.

As the focus of this book, and the industry, is on 64-bit architectures, programs in this book demonstrate the use of 64-bit values and addresses.