12.2 Integer Parallel Operations

When we first introduced floating-point instructions, we correlated them with their integer counterparts. Similarly, we now introduce the Itanium integer parallel operations by analogy to operations on full-width data.

Table 12-2 lists the Itanium instructions that perform operations on multiple bytes, words, or double words packed in the 64-bit integer registers. Because similar instructions in other architectures were introduced for audio and video processing, these integer parallel instructions are also commonly called multimedia instructions. Where appropriate, the table draws analogies to instructions that operate on full-width 64-bit data.

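The final character in the opcode for these instructions (1, 2, or 4) gives the element size in bytes, signifying that the instruction works with 8 bytes, 4 words (16 bits each), or 2 double words (32 bits each), respectively. The population count instruction (popcnt), which carries no such suffix, is the exception.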

The analogies in Table 12-2 are only analogies, since some significant differences distinguish these parallel instructions from their nearest counterparts for full-width data. For instance, the parallel compare instructions write their results into an integer register as bytes, words, or double words consisting of all 1 bits or all 0 bits, rather than writing just 1 or 0 into a pair of single-bit predicate registers.
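The mask-style result of a parallel compare can be pictured with a small C model. This is only a sketch of the behavior described above for a byte-wise equality comparison; the function name `pcmp1_eq_model` is hypothetical and is not part of any Itanium toolchain.

```c
#include <stdint.h>

/* Hypothetical model of a byte-wise parallel compare for equality.
   Each of the 8 result bytes is 0xFF where the corresponding source
   bytes are equal and 0x00 where they differ. */
uint64_t pcmp1_eq_model(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        uint8_t x = (uint8_t)(a >> (8 * lane));   /* byte from first source  */
        uint8_t y = (uint8_t)(b >> (8 * lane));   /* byte from second source */
        if (x == y)
            result |= (uint64_t)0xFF << (8 * lane);
    }
    return result;
}
```

Because the result is a per-element mask in an integer register, it can be combined with ordinary logical instructions to select elements, which a pair of predicate bits could not express for eight independent comparisons.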

Table 12-2. Itanium Integer Parallel Instructions
Instruction Name	Opcode(s)	Analogy
Compute Zero Index	`czx1, czx2`
Mix	`mix1, mix2, mix4`
Mux	`mux1, mux2`
Pack	`pack2, pack4`
Parallel Add	`padd1, padd2, padd4`	`add`
Parallel Average	`pavg1, pavg2`
Parallel Average Subtract	`pavgsub1, pavgsub2`
Parallel Compare	`pcmp1, pcmp2, pcmp4`	`cmp, cmp4`
Parallel Maximum	`pmax1, pmax2`
Parallel Minimum	`pmin1, pmin2`
Parallel Multiply	`pmpy2`	`xmpy`
Parallel Multiply and Shift Right	`pmpyshr2`
Parallel Shift Left	`pshl2, pshl4`	`shl`
Parallel Shift Left and Add	`pshladd2`	`shladd`
Parallel Shift Right	`pshr2, pshr4`	`shr, shr.u`
Parallel Shift Right and Add	`pshradd2`
Parallel Subtract	`psub1, psub2, psub4`	`sub`
Parallel Sum of Absolute Differences	`psad1`
Population Count	`popcnt`
Unpack	`unpack1, unpack2, unpack4`

These instructions gain remarkable versatility because of the many permutations that can be specified through completers or counts in a register or as immediate data. The compute zero index (czx), mix, and parallel multiply (pmpy2) instructions use l and r completers to indicate left-right directionality of action, and the unpack instructions use h and l completers to indicate high-low positioning. The parallel average (pavg) instruction can round in the usual way or can round away from zero. The parallel shift right (pshr) and the parallel shift right and add (pshradd2) instructions can operate in either a sign-preserving (arithmetic) or an unsigned (logical) way.
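To make the difference between the sign-preserving and unsigned forms of a parallel shift right concrete, here is a minimal C model for four 16-bit elements. It is a sketch under stated assumptions: the name `pshr2_model` and the `is_signed` flag are hypothetical, and the model only covers shift counts from 0 through 15.

```c
#include <stdint.h>

/* Hypothetical model of a parallel shift right on four 16-bit lanes.
   is_signed != 0 selects the arithmetic (sign-preserving) form;
   is_signed == 0 selects the logical (unsigned) form.
   Assumption: count is in the range 0..15. */
uint64_t pshr2_model(uint64_t a, unsigned count, int is_signed)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t x = (uint16_t)(a >> (16 * lane));
        uint16_t r = (uint16_t)(x >> count);          /* logical shift */
        if (is_signed && (x & 0x8000u) && count > 0) {
            /* Replicate the sign bit into the vacated high-order bits. */
            r |= (uint16_t)(0xFFFFu << (16 - count));
        }
        result |= (uint64_t)r << (16 * lane);
    }
    return result;
}
```

For the input 0x80007FFFFFF00010 and a count of 4, the logical form yields 0x080007FF0FFF0001, while the arithmetic form yields 0xF80007FFFFFF0001: only the elements whose high bit is set differ between the two.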

These Itanium instructions for parallel integer operations and their counterparts in other computer architectures find uses in fast algorithms for multimedia applications, such as compression and decompression.

In the absence of a completer, the parallel add (padd) and subtract (psub) instructions perform a modulo wrap in two's complement representation, like ordinary arithmetic instructions, possibly resulting in integer overflow (Section 4.2.2). In multimedia applications, where the numbers represent colors or sounds, such wraparound would produce visible or audible distortion. Hence the parallel add, parallel subtract, and pack instructions can use completers to specify an alternative mode, signed or unsigned saturation, in which out-of-range results are clipped. With signed saturation, for instance, parallel addition will saturate each individual result at the largest positive or negative value that can be represented in 1, 2, or 4 bytes rather than produce arithmetic overflow.
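The contrast between modulo wrap and signed saturation can be sketched in C for eight byte-sized elements. This is an illustrative model only, not the instruction definition; the function names `padd1_wrap_model` and `padd1_sss_model` are hypothetical.

```c
#include <stdint.h>

/* Hypothetical model: byte-wise addition with modulo (wraparound) results. */
uint64_t padd1_wrap_model(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        unsigned x = (unsigned)(a >> (8 * lane)) & 0xFFu;
        unsigned y = (unsigned)(b >> (8 * lane)) & 0xFFu;
        result |= (uint64_t)((x + y) & 0xFFu) << (8 * lane);
    }
    return result;
}

/* Hypothetical model: byte-wise addition with signed saturation.
   Each lane is treated as a two's complement value in -128..127,
   and the sum is clipped to that range instead of wrapping. */
uint64_t padd1_sss_model(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        unsigned ux = (unsigned)(a >> (8 * lane)) & 0xFFu;
        unsigned uy = (unsigned)(b >> (8 * lane)) & 0xFFu;
        int x = (int)(ux ^ 0x80u) - 128;   /* sign-extend the byte */
        int y = (int)(uy ^ 0x80u) - 128;
        int sum = x + y;
        if (sum > 127)  sum = 127;         /* clip at largest positive value */
        if (sum < -128) sum = -128;        /* clip at largest negative value */
        result |= (uint64_t)((unsigned)sum & 0xFFu) << (8 * lane);
    }
    return result;
}
```

In the wrapping model, adding 0x01 to an element holding 0x7F (127) gives 0x80, which reads as -128 when interpreted as signed; in the saturating model the same element stays at 0x7F.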

Were we to discuss these versatile instructions in fullest detail, we would stray too far from the intended scope for this book. Therefore, we refer you to the Itanium "Instruction Set Reference" for complete descriptions and to the Intel Itanium 2 Processor Reference Manual for Software Development and Optimization for further information.

The integer parallel instructions have latencies greater than 1 and subtle interdependencies with other instructions, both parallel and nonparallel. Since many of them require execution unit I0 specifically on early Itanium implementations, they may be resource-limited in the processor. Consequently, logically equivalent sequences of nonparallel instructions may sometimes exhibit similar or even better throughput.
