6.1 Logical Functions

At the heart of the central processing unit in every computer is the actual calculating component, called the arithmetic and logic unit (ALU). This unit performs both arithmetic and logical operations as required for each opcode encountered. We have split our discussion of the numerous Itanium instructions that operate on integer data across several chapters. We began in Chapter 4 with purely arithmetic instructions and continued in Chapter 5 with instructions that involve logical comparisons based on the numeric values of operands. Computers can also store logical data in their information units. In this section, we discuss a group of instructions that implement logical functions that work directly with such data.

6.1.1 Boolean Functions of Two Variables

In the nineteenth century, George Boole (1815 1864) initiated a new branch of mathematics, now known as Boolean algebra, in which variables take on only two values, representing truth and falsity. Boolean algebra is a complete algebraic system with axioms, constants, variables, operators, theorems, and so forth. Rather than stray too far into the details of this algebra, we will assume of our readers an intuitive understanding of such logical concepts as AND, OR, and NOT. In computer storage, the bit value 1 (on) is commonly assigned to truth and the bit value 0 (off) to falsity, while in the C programming language any nonzero value is considered to represent a Boolean true.

Suppose we have two independent Boolean (logical) variables, A and B. Since each of these can independently take on the values 0 and 1, there are four ways to find A and B: 00, 01, 10, and 11. Now envision a function of these two variables, F(A,B). What will be the properties of this function? The function can be described by stating its value (0 or 1) for each of the four ways to find its input variables, A and B.

The possible Boolean functions, F_n, of two independent input variables, A and B, are 16 in number. These 16 possibilities represent the enumerations of four 0 and 1 values, as depicted in Table 6-1 (adapted from Langholz). Two of these functions (F₀, F₁₅) always have constant values irrespective of the values of variables A and B. Four of these functions (F₃, F₅, F₁₀, F₁₂) reflect the unary operations: two that simply "transfer," or copy, A or B (F₃, F₅), and two that "complement," or reverse, the logical state of B or A (F₁₀, F₁₂).

The remaining ten functions reflect various binary operations. Four of these ten, as detailed by the comments in Table 6-1, are implemented as binary-level logical instructions in the Itanium ISA. These instructions perform logical functions on 64 bits in parallel (e.g., A AND B consists in detail of A₀ AND B₀, A₁ AND B₁, …, A₆₃ AND B₆₃), where each resultant bit is 1 if and only if both source bits are 1 as well.

Table 6-1. Boolean Functions of Two Variables
A	B	F₀	F₁	F₂	F₃	F₄	F₅	F₆	F₇	F₈	F₉	F₁₀	F₁₁	F₁₂	F₁₃	F₁₄	F₁₅
0	0	0	0	0	0	0	0	0	0	1	1	1	1	1	1	1	1
0	1	0	0	0	0	1	1	1	1	0	0	0	0	1	1	1	1
1	0	0	0	1	1	0	0	1	1	0	0	1	1	0	0	1	1
1	1	0	1	0	1	0	1	0	1	0	1	0	1	0	1	0	1
F₀	Constant 0
F₁	AND			`and` instruction of Itanium architecture, A AND B
F₂				`andcm` instruction of Itanium architecture, A AND (NOT B)
F₃	Transfer A
F₄
F₅	Transfer B
F₆	XOR			`xor` instruction of Itanium architecture, A XOR B
F₇	OR			`or` instruction of Itanium architecture, A OR B
F₈	NOR
F₉	NXOR
F₁₀	NOT B
F₁₁
F₁₂	NOT A
F₁₃
F₁₄	NAND
F₁₅	Constant 1

6.1.2 Logical Instructions

The Itanium logical functions, like those for addition and subtraction, operate with 64-bit data in two source registers and one destination register; an 8-bit immediate value can be substituted for the left-hand source:

 and    r1=r2,r3        // r1 <  r2 & r3 and    r1=imm8,r3      // r1 <  sext(imm8) & r3 andcm  r1=r2,r3        // r1 <  r2 & ~r3 andcm  r1=imm8,r3      // r1 <  sext(imm8) & ~r3 or     r1=r2,r3        // r1 <  r2 | r3 or     r1=imm8,r3      // r1 <  sext(imm8) | r3 xor    r1=r2,r3        // r1 <  r2 ^ r3 xor    r1=imm8,r3      // r1 <  sext(imm8) ^ r3

where the C-like notations & (AND), ~ (NOT, or one's complement), | (OR), and ^ (XOR) have been used.

Note carefully that imm8 (the immediate operand) is sign-extended before its use in all of these logical instructions. Sometimes that may not give the most intuitive result for bits <63:8> of the full result.

While the bitwise instructions operate according to their corresponding logical functions in Table 6-1, some examples may be helpful. Suppose that the 8-bit immediate value is 0xa5 and that register r3 contains a value 0x26…19:

 <-imm8: 1111 1111...1010 0101   <-imm8:1111 1111...1010 0101    r3:  0010 0110...0001 1001       r3:0010 0110...0001 1001   and:  0010 0110...0000 0001       or:1111 1111...1011 1101 <-imm8: 1111 1111...1010 0101   <-imm8:1111 1111...1010 0101   ~r3:  1101 1001...1110 0110       r3:0010 0110...0001 1001 andcm:  1101 1001...1010 0100      xor:1101 1001...1011 1100

where we have used the notation <-imm8 to indicate the sign extension of the immediate value and where we have shown only two bytes of the full 64-bit operations, bits <63:56> and bits <7:0>.

Three of the Itanium logical instructions correspond to operations in Boolean algebra that obey a commutative law. That is, we can say

F(A,B) = (A operation B) gives the same result as F(B,A) = (B operation A)

for the and, or, and xor instructions, as seen from the illustration above. More formally, we can look back at Table 6-1 and observe that certain of the 16 logical functions, including F₁ (AND), F₆ (OR), and F₇ (XOR), are invariant to an interchange of the A and B argument values over all of the rows in the domain.

Contrast those three functions with F₂ (AND NOT), for which an interchange of the A and B argument values over all rows in the domain yields F₄ instead of F₂. Thus the Itanium andcm instruction is not a commutative operation.

6.1.3 Applications of Logical Functions

Most architectures have machine instructions corresponding to the Boolean functions AND, OR, and XOR. The AND function is sometimes called the logical product, since it sets its result to zero if either source is zero, just as zero multiplied by anything yields zero in ordinary algebra. The OR function is sometimes called the logical sum, since it seems to accumulate 1 bits from either source into the result. The XOR function is sometimes called the logical difference, since it indicates a 1 in the result at bit positions where the two sources show different values (01 or 10) but a 0 wherever the two sources have identical values (00 or 11).

From another viewpoint, the particular logical functions that are implemented as Itanium instructions have uses in setting, clearing, toggling, and testing bits. The metaphor of a mask frequently accompanies descriptions of such uses, where the pattern of 0 and 1 bits in one of the source operands can be thought of as the holes and surrounding frame in a stencil.

Setting bits

The Itanium or instruction performs this task, where either source may serve as the mask. Keep in mind that an immediate value, if used, will be sign-extended. To ensure that certain bit positions in the result have 1 values, first copy the other source into the result and then imagine a stencil with cutout 1 symbols through which spray paint will deposit 1 symbols onto the tentative result, covering whatever 0 or 1 symbols were there previously.

Clearing bits

The Itanium andcm instruction performs this task, where the value in register r3 serves as the mask. To ensure that certain bit positions in the result have 0 values, first copy the other source into the result and then imagine a stencil with cutout 0 symbols through which spray paint will deposit 0 symbols onto the tentative result, obliterating whatever 0 or 1 symbols were there previously.

Toggling bits

The Itanium xor instruction performs this task, where a mask of all 1 bits in one source will toggle all of the bit positions in the other source as the result. At any bit position where the mask has 0 bits, the original bit value (0 or 1) in the other source will just be copied. In this case, the imaginary application of spray paint through the 1 holes in the stencil might involve multiple individual stencils in a larger frame, placing the opposite value in each position.

Testing bits

The Itanium and instruction performs the task of testing for 1 bits; either source may serve as the mask. Again, keep in mind the sign-extension of the immediate value. To test the nonmask source, first copy it into the result and then imagine a stencil with cutout 0 symbols at positions that are not of interest. Spraying paint through all of the 0 cutouts covers whatever 0 or 1 symbols were there. In the positions that are of interest, the original 0 or 1 values remain as the result.

The AND test results in a 64-bit pattern in which several bit positions may or may not be 1 values. To find these, we need to perform further testing. Another and instruction or the use of compare instructions with anticipated bit patterns might narrow our focus, or we could use shift instructions (discussed later in this chapter) to slide the bit pattern around, possibly moving one bit into the "sign" position for a test based on negative or positive. We can also use the Itanium tbit instruction to test for a 0 or 1 value at a single bit position (Section 6.1.4).

Changing the case of ASCII characters

One very common application of or and and instructions is to convert between uppercase and lowercase ASCII alphabetic characters, as their codes differ consistently by just one bit: 00100000₂ = 20₁₆ = 32₁₀ (see Table 2-3). Assume that the code for an alphabetic character is in register r3:

 or      r1=0x20,r3     // Force character to lowercase and     r1=~0x020,r3   // Force character to uppercase

where ~ is the unary complementation operator (Table 3-5). What would the xor instruction do to the code for an alphabetic ASCII character?

6.1.4 The Single-Bit Test Instruction

Many instruction sets offer the capability to determine the value of a single bit in an N-wide register, though it may only be the least significant bit. The method of using the and instruction previously discussed certainly works just as well for single and multiple bits, but lacks the hallmark of Itanium architecture: predication.

For predicate setting based on testing any single bit in a 64-bit register, the Itanium ISA offers the tbit instruction, which operates like compare instructions to set a predicate pair:

 tbit.trel.ctype  pt,pf = r3,pos6

where pos6 is an unsigned field in the instruction that encodes which bit position <63:0> in the value in register r3 is to be tested. That single bit is complemented or not, depending on trel (the test relationship completer), and the result is written to the two predicate destination registers.

Two values for trel (the test relationship completer), nz (nonzero) and z (zero), are recognized for the sense of the test. Predicate pt is set to 1 and predicate pf is set to 0 if the tested bit is actually 1 and trel was designated as nz, or if it is 0 and trel was designated as z; otherwise, the predicates are set the other way around.

All of the values noted previously (Section 5.2.1) for ctype (the compare type completer) can be used with this instruction.

6.1.5 Parallel (Logical) Conditions

A building's fire alarm can be activated by any one or simultaneous combination of pull boxes in the system, which are arranged in a "wired-or" configuration. Digital "wired-or" circuits that activate a single result have applications such as connecting multiple external devices to a single interrupt input of a CPU chip.

Recall that normal conditional instructions with a ctype (conditional type completer) of none at all produce a predicate pt for a true outcome of crel (the conditional relationship completer) and a predicate pf for the opposite outcome:

 cmp.crel.ctype  pt,pf = operand1,operand2 tbit.trel.ctype pt,pf = register,pos6

Some additional values of ctype indicate the ability to execute parallel logical or "wired" compares or bit tests.

Table 6-2 gives the effect on destination predicate registers for each of the parallel compare types. Blank entries in the table indicate that the previous states of both predicate registers are carried forward. Notice that several of these compare types set both predicates to the same value.

Table 6-2. Effect of Parallel Comparison Types
	true outcome		false outcome
ctype	pt	pf	pt	pf
`or`	1	1
`and`			0	0
`or.andcm`	1	0
`orcm`			1	1
`andcm`	0	0
`and.orcm`			0	1

These parallel compare types do not offer the full versatility of normal compare instructions. When crel is eq or ne, the source operands must be either two registers or one register and one signed 8-bit immediate value. When crel is one of the signed inequalities (lt, le, gt, ge), both operands must be general registers, one always being Gr₀ (the constant 0). Moreover, these parallel operations are not available for use with unsigned inequalities. Several parallel compares in the same instruction group are allowed to write the same value to a predicate register, while ordinary compares cannot.

These instruction forms can offer a high degree of parallelism in favorable circumstances. Since the Itanium 2 processor has four M-type and four I-type execution units, we might anticipate that up to six simultaneous type A instructions could execute in parallel in a single clock cycle:

 <set pt=0 somewhere earlier> if a1=b1 or a2=b2 or a3=b3 ... then <block predicated on at least one equality being true>

which could be written as:

                             // Load registers earlier cmp.ne    pt,p0 = r0,r0;;   // Do this earlier cmp.eq.or pt,p0 = ra1,rb1   // Compare 1 cmp.eq.or pt,p0 = ra2,rb2   // Compare 2 cmp.eq.or pt,p0 = ra3,rb3;; // Compare 3 (pt)      ...               // Predicated THEN block

where one could use or.andcm instead of just the or instruction to set another predicate pf if an ELSE block was present. Since we are using OR comparison types, if any one of the three instructions sets pt to 1, the predicated THEN block will execute, as none of them can set pt to 0.

6.1.6 The Logical Basis of Addition

If people wanted only to perform parallel logical operations on independent bit patterns representing Boolean values, computers could be extraordinarily fast. Such logical operations can be implemented directly as digital circuitry. Actually, of course, people also want to perform arithmetic operations on interdependent bit patterns representing numeric values. Interdependence means not only the static assignment of weighting coefficients (Section 1.8.1), but also considerations of carries among bit positions (Section 1.8.3) and potential concern about arithmetic overflow (Section 4.2.2).

Let us show how computer arithmetic, specifically unsigned integer addition, can be built from logical operations already discussed. Rows in Table 6-3 enumerate the four possible results of adding two single-bit integers, A and B.

Table 6-3. Addition of Two Single-Bit Integers
A	B	Carry	Sum
0	0	0	0
0	1	0	1
1	0	0	1
1	1	1	0

We recognize Table 6-3 as a subset of Table 6-1: the sum is function F₆(A XOR B), and the carry is function F₁(A AND B). These results correspond to all possible results of addition at the lowest-order bit of unsigned integers.

At the hardware level, single-bit addition with carry simply requires digital logic elements that perform XOR and AND on the same pair of inputs, A and B. This is known as a half adder circuit.

Multibit addition then moves to the left for the next set of inputs, A_i and B_i. Since we might have a bit C_i₁ carried over from the first operation, that extra bit from the right needs to be added into the current sum S_i using an additional half adder circuit. This creates a full adder, which operates on three inputs (A_i, B_i, C_i₁) and produces two outputs (S_i, C_i). Using Itanium instructions, a method for obtaining the sum at position i can be expressed as:

 xor      rs=ra,rb;;    // partial sum, A(i)+B(i) xor      rs=rs,rc      // complete sum including C(i-1)

The full sum rs requires two XOR operations to account for the carry from the earlier operation.

The outgoing carry C_i is 1 if the sum of the three inputs is greater than unity i.e., if two or more inputs are 1. Using Itanium instructions, the value of the carry rc can be computed as:

 cmp.eq     pt,pf=rc,0;;  // If no incoming carry (pt) and   rc=ra,rb      //  then C(i) if both A(i), B(i) (pf) or    rc=ra,rb      //  else C(i) if either/both

The carry is obtained in two steps using predicated AND and OR operations in an if…then…else construct. More instructions, perhaps including branches, might be needed in non-EPIC architectures.

Other arithmetic operations, such as subtraction, multiplication, and division, can be implemented with simple logical operations. In all cases, multibit arithmetic requires propagation of carries, which introduces a sequential delay in each bitwise operation. Such delays can be reduced, but never eliminated, by choosing digital circuit elements having inherent parallelism, though at greater economic cost.