12.2 Integer Parallel Operations

When we first introduced floating-point instructions, we correlated them with their integer counterparts. Similarly, we now introduce the Itanium integer parallel operations by analogy to operations on full-width data.

Table 12-2 lists the Itanium instructions that perform operations on multiple bytes, words, or double words packed in the 64-bit integer registers. Because of their heritage in other architectures, these integer parallel instructions are also commonly called multimedia instructions. Where appropriate, the table draws analogies to instructions that operate on full-width 64-bit data.

The final character in the opcode for most of these instructions (1, 2, or 4) gives the element size in bytes, signifying that the instruction works with 8 bytes, 4 words, or 2 double words, respectively. Population count (popcnt) carries no such suffix.
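The relationship between the opcode suffix and the number of packed elements can be sketched in Python. This is only an illustrative model, not Itanium code; the helper names `split_lanes` and `join_lanes` are hypothetical, and lanes are listed least significant first by assumption.

```python
MASK64 = (1 << 64) - 1  # a general register holds 64 bits


def split_lanes(reg, elem_bytes):
    """Split a 64-bit value into lanes of elem_bytes each, least significant first."""
    if elem_bytes not in (1, 2, 4):
        raise ValueError("element size must be 1, 2, or 4 bytes")
    bits = 8 * elem_bytes
    lane_mask = (1 << bits) - 1
    return [(reg >> (i * bits)) & lane_mask for i in range(64 // bits)]


def join_lanes(lanes, elem_bytes):
    """Pack lanes (least significant first) back into one 64-bit value."""
    bits = 8 * elem_bytes
    lane_mask = (1 << bits) - 1
    reg = 0
    for i, lane in enumerate(lanes):
        reg |= (lane & lane_mask) << (i * bits)
    return reg & MASK64
```

With a suffix of 1 the register splits into 8 lanes, with 2 into 4 lanes, and with 4 into 2 lanes, matching the counts given above.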

The analogies in Table 12-2 are only analogies, because significant differences distinguish these parallel instructions from their nearest counterparts for full-width data. For instance, the parallel compare instructions write their results into an integer register as bytes, words, or double words of all 1 bits or all 0 bits, rather than writing just 1 or 0 into a pair of single-bit predicate registers.
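A small Python model may make the difference from cmp concrete. It sketches only an equality comparison, under the assumption that each lane of the result becomes all 1 bits when the corresponding lanes match and all 0 bits otherwise; the function name `pcmp_eq` is hypothetical.

```python
def pcmp_eq(a, b, elem_bytes):
    """Model a lane-wise equality compare that produces a mask per lane."""
    if elem_bytes not in (1, 2, 4):
        raise ValueError("element size must be 1, 2, or 4 bytes")
    bits = 8 * elem_bytes
    lane_mask = (1 << bits) - 1
    result = 0
    for i in range(64 // bits):
        shift = i * bits
        lane_a = (a >> shift) & lane_mask
        lane_b = (b >> shift) & lane_mask
        if lane_a == lane_b:
            result |= lane_mask << shift  # all 1 bits in this lane
        # otherwise the lane stays all 0 bits
    return result


# Byte lanes alternate between equal and unequal, from the high end down.
mask = pcmp_eq(0x1122334455667788, 0x1100330055007700, 1)
print(hex(mask))  # 0xff00ff00ff00ff00
```

Because the result is a mask rather than a predicate, it can feed ordinary logical instructions; for example, `(mask & x) | (~mask & y)` selects lanes from `x` or `y` without branching.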

Table 12-2. Itanium Integer Parallel Instructions

Instruction Name                      Opcode(s)                  Analogy

Compute Zero Index                    czx1, czx2
Mix                                   mix1, mix2, mix4
Mux                                   mux1, mux2
Pack                                  pack2, pack4
Parallel Add                          padd1, padd2, padd4        add
Parallel Average                      pavg1, pavg2
Parallel Average Subtract             pavgsub1, pavgsub2
Parallel Compare                      pcmp1, pcmp2, pcmp4        cmp, cmp4
Parallel Maximum                      pmax1, pmax2
Parallel Minimum                      pmin1, pmin2
Parallel Multiply                     pmpy2                      xmpy
Parallel Multiply and Shift Right     pmpyshr2
Parallel Shift Left                   pshl2, pshl4               shl
Parallel Shift Left and Add           pshladd2                   shladd
Parallel Shift Right                  pshr2, pshr4               shr, shr.u
Parallel Shift Right and Add          pshradd2
Parallel Subtract                     psub1, psub2, psub4        sub
Parallel Sum of Absolute Differences  psad1
Population Count                      popcnt
Unpack                                unpack1, unpack2, unpack4

These instructions gain remarkable versatility because of the many permutations that can be specified through completers or counts in a register or as immediate data. The compute zero index (czx), mix, and parallel multiply (pmpy2) instructions use l and r completers to indicate left-right directionality of action, and the unpack instructions use h and l completers to indicate high-low positioning. The parallel average (pavg) instruction can round in the usual way or can round away from zero. The parallel shift right (pshr) and the parallel shift right and add (pshradd2) instructions can operate in either a sign-preserving (arithmetic) or an unsigned (logical) way.
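The distinction between sign-preserving and unsigned parallel shifts can be illustrated with a Python model of a shift on four 16-bit lanes. This is a sketch under stated assumptions: the function name `pshr2` simply mirrors the opcode, the boolean `unsigned` parameter stands in for the completer, and the count is restricted to 0 through 15 for simplicity.

```python
def pshr2(reg, count, unsigned=False):
    """Shift each 16-bit lane right by count, arithmetically unless unsigned."""
    if not 0 <= count <= 15:
        raise ValueError("this sketch only models counts from 0 to 15")
    result = 0
    for i in range(4):
        lane = (reg >> (16 * i)) & 0xFFFF
        if not unsigned and lane & 0x8000:
            lane -= 0x10000  # reinterpret the lane as a negative number
        lane >>= count  # Python's >> on a negative int replicates the sign
        result |= (lane & 0xFFFF) << (16 * i)
    return result


value = 0x80000004FFFF0100
print(hex(pshr2(value, 4)))                 # 0xf8000000ffff0010
print(hex(pshr2(value, 4, unsigned=True)))  # 0x8000fff0010
```

Each lane is shifted independently, so bits never cross from one lane into its neighbor, which is the key difference from applying shr or shr.u to the whole register.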

These Itanium instructions for parallel integer operations and their counterparts in other computer architectures find uses in fast algorithms for multimedia applications, such as compression and decompression.

In the absence of a completer, the parallel add (padd) and subtract (psub) instructions perform a modulo wrap in two's complement representation, like ordinary arithmetic instructions, possibly resulting in integer overflow (Section 4.2.2). That outcome would distort colors or sounds represented numerically in multimedia applications. Hence the parallel add, parallel subtract, and pack instructions can use completers to select a saturating mode, signed or unsigned, that clips each result to the representable range. With signed saturation, for instance, parallel addition will saturate each individual result at the largest positive or negative value that the element can represent rather than produce arithmetic overflow.
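The following Python sketch contrasts modulo wrap with signed and unsigned saturation on eight byte lanes. It is an illustrative model only: the function name `padd1` mirrors the opcode, but the mode strings "wrap", "signed", and "unsigned" are hypothetical labels chosen for this example and are not the assembler's completer spellings.

```python
def padd1(a, b, mode="wrap"):
    """Add eight byte lanes independently, wrapping or saturating per lane."""
    result = 0
    for i in range(8):
        x = (a >> (8 * i)) & 0xFF
        y = (b >> (8 * i)) & 0xFF
        if mode == "wrap":
            s = (x + y) & 0xFF            # modulo 256, carry is discarded
        elif mode == "unsigned":
            s = min(x + y, 0xFF)          # clip at the largest unsigned byte
        elif mode == "signed":
            sx = x - 0x100 if x & 0x80 else x
            sy = y - 0x100 if y & 0x80 else y
            s = max(-128, min(127, sx + sy)) & 0xFF  # clip to [-128, 127]
        else:
            raise ValueError("mode must be 'wrap', 'signed', or 'unsigned'")
        result |= s << (8 * i)
    return result


a, b = 0x807FC8, 0xFF0164
print(hex(padd1(a, b)))              # 0x7f802c
print(hex(padd1(a, b, "unsigned")))  # 0xff80ff
print(hex(padd1(a, b, "signed")))    # 0x807f2c
```

In the lowest lane, 200 + 100 wraps to 44 but clips to 255 under unsigned saturation, the kind of difference that keeps a bright pixel from suddenly turning dark.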

Were we to discuss these versatile instructions in fullest detail, we would stray too far from the intended scope for this book. Therefore, we refer you to the Itanium "Instruction Set Reference" for complete descriptions and to the Intel Itanium 2 Processor Reference Manual for Software Development and Optimization for further information.

The integer parallel instructions have latencies greater than 1 and subtle interdependencies with other instructions, both parallel and nonparallel. Since many of them require execution unit I0 specifically on early Itanium implementations, they may be resource-limited in the processor. Consequently, logically equivalent sequences of nonparallel instructions may sometimes exhibit similar or even better throughput.



Itanium® Architecture for Programmers: Understanding 64-Bit Processors and EPIC Principles
Year: 2003
Pages: 223
