2.5 Itanium Information Units and Data Types

The basic information unit of the Itanium architecture is the 8-bit byte. Individual bytes have 64-bit addresses, but groups of adjacent bytes also have addresses, as shown in Figure 2-5. These multibyte units include the 16-bit word (2 bytes), the 32-bit double word (4 bytes), and the 64-bit quad word (8 bytes). On little-endian systems, such units are addressed by the low-order byte of the group; the higher-order bytes within a larger information unit occupy successive addresses beyond that of the lowest-order byte.

Figure 2-5. Itanium information units

graphics/02fig05.gif

In the convention that has been used by Intel and Digital Equipment Corporation, the individual bits within any information unit are numbered from the least significant bit on the right, bit 0. The most significant bit is then bit 7 for a byte, bit 15 for a word, bit 31 for a double word, and bit 63 for a quad word. Some other machine designers, including Hewlett-Packard Company, have adopted the opposite numbering convention of naming the most significant bit on the left as bit 0. Note that the convention for the Itanium architecture does have the convenience of corresponding directly to the positional weighting scheme for evaluating binary values presented in Chapter 1. That is, the weight of bit i is 2^i.
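This numbering convention can be checked with a minimal C sketch (ours, not from the book) that extracts each bit of a byte and prints its positional weight 2^i:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint8_t byte = 0xA5;                    /* 1010 0101 */
        for (int i = 7; i >= 0; i--) {
            unsigned bit = (byte >> i) & 1u;    /* bit i, numbered from the right */
            unsigned weight = 1u << i;          /* its positional weight, 2^i     */
            printf("bit %d = %u (weight %3u)\n", i, bit, weight);
        }
        return 0;
    }

Summing bit times weight over all eight positions recovers the value 0xA5 = 165.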

The corresponding convention for ordering the bytes within words, double words, and quad words is to store the lowest order byte of the group at the lowest address. This is the little-endian convention. The opposite convention, where the highest order byte of a group is stored at the lowest address, is called big-endian, which has historically been followed by Hewlett-Packard and Motorola. When character string data are transmitted between systems, the bytes travel in the same order as letters in words and words in sentences in Western languages. But when little-endian and big-endian systems attempt to break up, say, a 32-bit binary number into four 8-bit binary bytes for sequential transmission, what one system views as WXYZ will be perceived by the other as ZYXW when reassembled. This problem affects only the byte ordering; all systems agree on the ordering (but not the numbering!) of the bits within bytes.

Let us consider, as a specific example of little-endian data storage, that the quad word quantity 0x0F0E0D0C0B0A0908 is stored at address Q. Location Q is then also the address of the double word whose value is 0x0B0A0908, the word whose value is 0x0908, and the byte whose value is 0x08. In similar fashion, location Q+1 is the address of the byte whose value is 0x09, the unaligned word whose value is 0x0A09, and so forth. The Itanium instruction set provides separate load opcodes (ld1, ld2, ld4, ld8) and store opcodes (st1, st2, st4, st8) to specify which information unit (byte, word, double word, quad word) is transferred between memory and a register.
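This addressing scheme can be observed directly in C on a little-endian host. In the following sketch (an illustration, not from the book), the array mem stands in for address Q:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
        uint64_t quad = 0x0F0E0D0C0B0A0908ULL;
        unsigned char mem[8];               /* mem[0] plays the role of address Q */
        memcpy(mem, &quad, sizeof quad);    /* store the quad word at Q           */

        uint32_t dw;
        uint16_t w;
        memcpy(&dw, &mem[0], sizeof dw);    /* the double word at Q */
        memcpy(&w,  &mem[0], sizeof w);     /* the word at Q        */

        /* On a little-endian host this prints 0B0A0908, 0908, 08, 09. */
        printf("double word at Q: %08X\n", (unsigned)dw);
        printf("word at Q:        %04X\n", (unsigned)w);
        printf("byte at Q:        %02X\n", mem[0]);
        printf("byte at Q+1:      %02X\n", mem[1]);
        return 0;
    }

On a big-endian host the same program would print 0F0E0D0C, 0F0E, 0F, and 0E, which is precisely the WXYZ-versus-ZYXW hazard described above.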

Itanium integer registers are 64 bits wide and can thus accommodate any of these four information units. The hardware design specifies precisely how to widen or narrow information of other sizes when it is placed into or retrieved from a 64-bit register.
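The precise widening rules are deferred to later chapters, but C's integer conversions give a feel for the two possibilities. A small sketch of ours, using C's rules as a stand-in rather than the hardware's:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint8_t b = 0x98;                      /* a byte whose top bit is set */

        /* Zero extension: the upper 56 bits are filled with 0. */
        uint64_t zext = (uint64_t)b;           /* 0x0000000000000098 */

        /* Sign extension: the upper 56 bits copy the sign bit. */
        int64_t sext = (int64_t)(int8_t)b;     /* 0xFFFFFFFFFFFFFF98 */

        printf("zero-extended: %016llX\n", (unsigned long long)zext);
        printf("sign-extended: %016llX\n", (unsigned long long)sext);
        return 0;
    }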

What interpretations can be made from the bit patterns stored in these information units? The fundamental data types supported by the instruction set of the Itanium architecture are integers and floating-point numbers. In some contexts, an integer will represent an address instead of a data value. In addition, a compiler or an assembly language programmer can impart further purposes to integers, for example to represent characters or Boolean variables.

2.5.1 Integers

We reviewed the concepts of binary representation of integers in Chapter 1. A span of N bits can be used in one of two ways: to represent a range of unsigned integers, 0 to 2^N − 1, or to represent a range of signed integers, −2^(N−1) through 0 to +2^(N−1) − 1. Table 2-1 shows the numeric ranges for the various integer sizes that are pertinent to Itanium contexts.
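The ranges of Table 2-1 can be confirmed with the limit macros of C's <stdint.h>; a brief sketch:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* The signed and unsigned ranges of Table 2-1. */
        printf("byte:        %d to %d, 0 to %u\n",
               INT8_MIN,  INT8_MAX,  (unsigned)UINT8_MAX);
        printf("word:        %d to %d, 0 to %u\n",
               INT16_MIN, INT16_MAX, (unsigned)UINT16_MAX);
        printf("double word: %d to %d, 0 to %u\n",
               INT32_MIN, INT32_MAX, (unsigned)UINT32_MAX);
        printf("quad word:   %lld to %lld, 0 to %llu\n",
               (long long)INT64_MIN, (long long)INT64_MAX,
               (unsigned long long)UINT64_MAX);
        return 0;
    }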

The Itanium architecture has integer arithmetic instructions only for data of quad word width, even though the detailed operation of Itanium load and store instructions facilitates packing and unpacking information units of smaller widths. Itanium logical instructions work only with quad word data, but these instructions provide some capability for access to data packed at the bit or group-of-bits level.

2.5.2 Floating-Point Numbers

Since integers may lack the dynamic range necessary for certain scientific applications, most computer architectures provide for floating-point numbers, which correspond to scientific notation. Whereas hand-held calculators display numbers as a decimal number that is multiplied by some positive or negative power of 10, computers typically represent non-integer data as a significand that is multiplied by some power of 2. The exponent and sign of a number can be bit-packed with the significand into an information unit in several ways.

Table 2-1. Integer Data Types

Type          Bits   Bytes   Signed Range (decimal)                                       Unsigned Range (decimal)
Byte             8       1   −128 to +127                                                 0 to 255
Word            16       2   −32,768 to +32,767                                           0 to 65,535
Double word     32       4   −2,147,483,648 to +2,147,483,647                             0 to 4,294,967,295
Quad word       64       8   −9,223,372,036,854,775,808 to +9,223,372,036,854,775,807    0 to 18,446,744,073,709,551,615

In the past, various computer manufacturers represented floating-point data in ways that were not fully compatible across architectures. Accordingly, concerns arose about inaccuracies that might compound in repeated mathematical operations. Some of these difficulties fell away as the computer industry began to consolidate, but a satisfactory solution came about only through industry-wide adoption of the agreed-upon standard ANSI/IEEE 754, IEEE Standard for Binary Floating-Point Arithmetic.

Two fundamental formats have been supported by nearly every new architecture introduced since the standard emerged: single and double, requiring respectively 32 bits (4 bytes) and 64 bits (8 bytes) for storage. Two additional IEEE formats, extended single and extended double, provided some leeway within which certain older formats could be retained; for example, an Intel format requiring 80 bits (10 bytes) for storage. In this book, we shall discuss only the widely supported IEEE single and double formats for floating-point data, whose characteristics are summarized in Table 2-2.

The IEEE representations not only facilitate direct interchange of data between computer systems with different architectures, but also provide for special values that could not be represented in some of the older proprietary formats. For example, special bit patterns are assigned to represent positive infinity and negative infinity. These obey standard algebraic rules, ensuring, for example, that positive infinity plus a valid finite number yields positive infinity as the sum. Other special bit patterns are called NaN, not a number. These can be used when a computed result is algebraically indeterminate, such as infinity minus infinity.

Double precision

An IEEE double-precision datum occupies 8 adjacent bytes in memory. In order to minimize the time required to load and store the datum, it should start on an address boundary that is evenly divisible by 8; that is, the datum should be naturally aligned, which is along quad word boundaries for the Itanium architecture. In a little-endian representation the bits are labeled from right to left, 0 through 63, as follows, where D denotes the lowest byte address of the information units storing the datum:

graphics/02fig05a.gif

Bit 63 is the sign bit, bits <62:52> represent the exponent of 2 biased by addition of 1023 to the true value, and bits <51:0> represent a 52-bit fraction. If all the bits in the representation are zero, the number represented is zero by convention.

Table 2-2. IEEE Floating-Point Numbers

                                     Single             Double
Size of representation in memory:
  Sign                               1 bit              1 bit
  Exponent                           8 bits             11 bits
  Fraction[*]                        23 bits            52 bits
Bias for exponent                    127                1023
Minimum magnitude                    1.175 × 10^−38     2.225 × 10^−308
Maximum magnitude                    3.403 × 10^+38     1.798 × 10^+308
Precision:
  binary                             24 bits            53 bits
  decimal                            6 decimal digits   15 decimal digits

[*] The significand consists of an implicit "hidden bit" followed by the fraction.

The significand is adjusted so that it consists of a leading bit of 1 to the left of an implied binary point; that is, it is scaled into the range from 1 up to (but not including) 2. For storage in the information units of memory, this logically known bit to the left of the implied binary point is not represented physically. The precision of the significand is thus one part in 2^53 even though only 52 bits store the fraction physically. Except for special cases, the value of the number is

(1 − 2 × S) × 1.F × 2^(E − B)

where S is the sign of the number (0 for positive, 1 for negative), F is the binary fraction, 1.F is the significand, E is the true exponent, and B is the bias (equal to 1023 for double precision).
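As an illustration (ours, not the book's), the following C sketch pulls a double apart into S, E, and F with shifts and masks and then reassembles the value using the formula above; it assumes the host stores double in the IEEE format:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <math.h>

    int main(void)
    {
        double d = -6.5;
        uint64_t bits;
        memcpy(&bits, &d, sizeof bits);           /* the 64 stored bits      */

        unsigned S = (unsigned)(bits >> 63);      /* bit 63:      sign       */
        unsigned E = (bits >> 52) & 0x7FF;        /* bits <62:52>: exponent  */
        uint64_t F = bits & 0xFFFFFFFFFFFFFULL;   /* bits <51:0>:  fraction  */

        /* (1 - 2*S) * 1.F * 2^(E - 1023), special cases excluded */
        double value = (S ? -1.0 : 1.0)
                     * ldexp(1.0 + (double)F / 4503599627370496.0,  /* F / 2^52 */
                             (int)E - 1023);

        printf("S=%u E=%u F=%013llX -> %g\n",
               S, E, (unsigned long long)F, value);   /* prints -6.5 */
        return 0;
    }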

In order to facilitate certain IEEE constraints on accuracy when rounding computed results, as well as to accommodate the 80-bit (10-byte) extended double-precision format brought forward from Intel's IA-32 architecture, the datapath for floating-point manipulations in an Itanium processor (including 128 floating-point registers) has a total width of 82 bits.

When the various bit regions of a double-precision datum are retrieved from memory into an Itanium floating-point register, their arrangement is as follows:

graphics/02fig05b.gif

The "hidden bit" that is suppressed for economy of storage in memory is thus made explicit in the representation of a floating-point number in a processor register. We defer discussion of the expansion of space for the exponent to a later chapter.

Single precision

An IEEE single-precision datum occupies 4 adjacent bytes in memory. In order to minimize the time required for loading and storing the datum, it should start on an address boundary that is evenly divisible by 4; that is, the datum should be naturally aligned (i.e., double word aligned). In a little-endian representation the bits are labeled from right to left, 0 through 31, as follows, where S denotes the lowest byte address of the information units storing the datum:

graphics/02fig05c.gif

Bit 31 is the sign bit, bits <30:23> represent the exponent of 2 biased by addition of 127 to the true value, and bits <22:0> represent a 23-bit fraction. If all the bits in the representation are zero, the number represented is zero by convention.

The significand is adjusted so that it consists of a leading bit of 1 to the left of an implied binary point; that is, it is scaled into the range from 1 up to (but not including) 2. For storage in the information units of memory, this logically known bit to the left of the implied binary point is not represented physically. The precision of the significand is thus one part in 2^24 even though only 23 bits store the fraction physically. Except for special cases, the value of the number is

(1 − 2 × S) × 1.F × 2^(E − B)

where S is the sign of the number (0 for positive, 1 for negative), F is the binary fraction, 1.F is the significand, E is the true exponent, and B is the bias (equal to 127 for single precision).

When the various bit regions of a single-precision datum are retrieved from memory into an Itanium floating-point register, their arrangement is as follows:

graphics/02fig05d.gif

Again, the "hidden bit" that is suppressed for economy of storage in memory is thus made explicit in the representation of a floating-point number in a processor register. We defer discussion of the expansion of space for the exponent to a later chapter.

The Itanium processor hardware can thus work with the same 82-bit register representation of floating-point quantities, while the memory representation takes different amounts of storage space depending on the precision required for an application.

As one example, the IEEE single-precision representation for the decimal number 4.25 as it would be stored in memory can be constructed using the following steps:

4.25₁₀  =  2^2 + 2^−2  =  100.01₂          (convert from base 10 to base 2)
        =  1.0001 × 2^2                    (shift into normalized form)
After the "hidden bit" is suppressed, the binary fraction is F = 00010000000000000000000 (23 bits in all). The true exponent is 2, but with the bias of 12710 this becomes 12910 or E = 100000012. The sign is S = 0, since the number is positive. Putting all those pieces together in the order S-E-F, we have

0    10000001    00010000000000000000000
S    E           F

By regrouping the bits 4 at a time, we can see how this pattern would be printed as an unsigned hexadecimal number:

0100 0000 1000 1000 0000 0000 0000 0000₂ = 40880000₁₆

Note that only a real number whose fractional part can be represented exactly as a sum of negative powers of 2 can be stored exactly. Common decimal fractions like 0.1 or 0.7 cannot be stored exactly.
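Both points, the assembled bit pattern and the inexactness of decimal fractions, can be checked with a short C program of ours (assuming the host uses IEEE single precision for float):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
        uint32_t pattern = 0x40880000;   /* the S-E-F pattern assembled above  */
        float f;
        memcpy(&f, &pattern, sizeof f);  /* reinterpret the 32 bits as a float */
        printf("0x%08X -> %g\n", pattern, f);      /* prints 4.25 */

        /* 0.1 has no finite binary expansion, so it is stored approximately. */
        printf("0.1f is stored as %.9g\n", 0.1f);  /* about 0.100000001 */
        return 0;
    }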

As another example, the 32-bit pattern 4126000016 for a single-precision number stored in memory can be interpreted by reversing the steps just illustrated:

41260000₁₆  =  0100 0001 0010 0110 0000 0000 0000 0000₂
            =  S = 0,  E = 10000010₂ = 130₁₀,  F = 01001100000000000000000₂   (split into fields)
            =  +1.010011₂ × 2^(130 − 127)  =  1.010011₂ × 2^3  =  1010.011₂
            =  +(8 + 2 + 0.25 + 0.125)₁₀  =  10.375₁₀

Conversions for double-precision numbers would proceed in a similar fashion.
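These steps can be mechanized; this C sketch of ours decodes the single-precision fields directly (again assuming an IEEE-format host):

    #include <stdio.h>
    #include <stdint.h>
    #include <math.h>

    int main(void)
    {
        uint32_t bits = 0x41260000;

        unsigned S = bits >> 31;            /* sign bit             */
        unsigned E = (bits >> 23) & 0xFF;   /* biased exponent: 130 */
        unsigned F = bits & 0x7FFFFF;       /* 23-bit fraction      */

        /* (1 - 2*S) * 1.F * 2^(E - 127) */
        double value = (S ? -1.0 : 1.0)
                     * ldexp(1.0 + (double)F / 8388608.0,   /* F / 2^23 */
                             (int)E - 127);

        printf("S=%u E=%u F=%06X -> %g\n", S, E, F, value); /* prints 10.375 */
        return 0;
    }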

2.5.3 Alphanumeric Characters

Binary numbers can encode any information, including alphanumeric characters (letters and numerals) and punctuation marks. The development of coded character sets is an old and continuing story. Morse code for telegraphy in the nineteenth century, the use of punch cards for tabulating the US census in the early twentieth century, the later spread of computer applications into business and commerce, and present requirements for encoding the character sets of all the world's written human languages all require compact and consistent encoding schemes.

Providing enough codes while facilitating efficient storage and convenient sorting algorithms has led to many different systems, incompatibilities, and compromises. As a consensus, Unicode® provides methods for accommodating about a million different historical and currently used character symbols, requiring 21 bits for encoding. Several Unicode transformation formats (UTF) have been defined:

  • The UTF-32 convention uses code values running from 0 to 10FFFF₁₆, treated as 32-bit data elements.

  • The UTF-16 variable-length convention represents 1,112,064 codes: the first 65,536 as 2 bytes and the rest as 4 bytes.

  • The UTF-8 variable-length convention represents 128 characters (ASCII, see below) as 1 byte, 1,920 characters as 2 bytes (European, Hebrew, and Arabic elements), 63,488 characters as 3 bytes (Chinese, Japanese, Korean elements), and 2,147,418,112 additional characters using up to 6 bytes. A sketch of this byte-level scheme appears after this list.

To ensure unambiguous transmission, both big-endian (default) and little-endian variants are defined for UTF-16 and UTF-32.
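Here is a minimal C sketch of the UTF-8 byte-level scheme, covering the modern 1- to 4-byte range; the utf8_encode helper is our own illustration, not a library function, and the original design extended the same prefix pattern to 6 bytes:

    #include <stdio.h>
    #include <stdint.h>

    /* Encode one code value into buf; returns the byte count (1 to 4). */
    static int utf8_encode(uint32_t cp, unsigned char buf[4])
    {
        if (cp < 0x80) {                     /* ASCII: 0xxxxxxx                 */
            buf[0] = (unsigned char)cp;
            return 1;
        } else if (cp < 0x800) {             /* 110xxxxx 10xxxxxx               */
            buf[0] = (unsigned char)(0xC0 | (cp >> 6));
            buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        } else if (cp < 0x10000) {           /* 1110xxxx 10xxxxxx 10xxxxxx      */
            buf[0] = (unsigned char)(0xE0 | (cp >> 12));
            buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        } else {                             /* 11110xxx 10xxxxxx ... to 10FFFF */
            buf[0] = (unsigned char)(0xF0 | (cp >> 18));
            buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
            return 4;
        }
    }

    int main(void)
    {
        /* 'A', the cent sign, a CJK character, a supplementary-plane character */
        uint32_t samples[] = { 0x41, 0xA2, 0x4E2D, 0x10400 };
        for (int i = 0; i < 4; i++) {
            unsigned char buf[4];
            int n = utf8_encode(samples[i], buf);
            printf("U+%05X ->", (unsigned)samples[i]);
            for (int j = 0; j < n; j++)
                printf(" %02X", buf[j]);
            printf("\n");
        }
        return 0;
    }

Note how the 7-bit ASCII characters pass through unchanged as single bytes, which is the efficiency advantage mentioned above.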

Linux® and other contemporary programming environments support UTF-8, which can handle the full generality of Unicode definitions, while still taking advantage of the efficiency of a single-byte coding scheme when possible. The programming language Java includes Unicode support.

The American Standard Code for Information Interchange (ASCII) character set includes both uppercase and lowercase alphabetic characters (A through Z, and a through z), the decimal digits (0 through 9), punctuation marks, and special control characters. The ASCII code was accepted by the American National Standards Institute (ANSI) to standardize the exchange of textual information between computers and peripherals of different manufacturers. This code exists in 7-bit and 8-bit forms; for simplicity we show the 7-bit chart that is compatible with UTF-8 as Table 2-3.

Anyone even slightly familiar with world languages and cultures will perceive at a glance the inadequacy of 7-bit ASCII. There are no diacritically marked letters as used in most Western languages. The symbol $ is not universally used for currency, and the symbol ¢ is not included. Some of these needs for Western languages can be accommodated with extensions of ASCII character representations to 8 bits, but a truly global solution obviously requires Unicode.

Table 2-3. ASCII Character Encoding

Dec Hex Char   Dec Hex Char   Dec Hex Char   Dec Hex Char
000 00  NUL    032 20  SP     064 40  @      096 60  `
001 01  SOH    033 21  !      065 41  A      097 61  a
002 02  STX    034 22  "      066 42  B      098 62  b
003 03  ETX    035 23  #      067 43  C      099 63  c
004 04  EOT    036 24  $      068 44  D      100 64  d
005 05  ENQ    037 25  %      069 45  E      101 65  e
006 06  ACK    038 26  &      070 46  F      102 66  f
007 07  BEL    039 27  '      071 47  G      103 67  g
008 08  BS     040 28  (      072 48  H      104 68  h
009 09  HT     041 29  )      073 49  I      105 69  i
010 0A  LF     042 2A  *      074 4A  J      106 6A  j
011 0B  VT     043 2B  +      075 4B  K      107 6B  k
012 0C  FF     044 2C  ,      076 4C  L      108 6C  l
013 0D  CR     045 2D  -      077 4D  M      109 6D  m
014 0E  SO     046 2E  .      078 4E  N      110 6E  n
015 0F  SI     047 2F  /      079 4F  O      111 6F  o
016 10  DLE    048 30  0      080 50  P      112 70  p
017 11  DC1    049 31  1      081 51  Q      113 71  q
018 12  DC2    050 32  2      082 52  R      114 72  r
019 13  DC3    051 33  3      083 53  S      115 73  s
020 14  DC4    052 34  4      084 54  T      116 74  t
021 15  NAK    053 35  5      085 55  U      117 75  u
022 16  SYN    054 36  6      086 56  V      118 76  v
023 17  ETB    055 37  7      087 57  W      119 77  w
024 18  CAN    056 38  8      088 58  X      120 78  x
025 19  EM     057 39  9      089 59  Y      121 79  y
026 1A  SUB    058 3A  :      090 5A  Z      122 7A  z
027 1B  ESC    059 3B  ;      091 5B  [      123 7B  {
028 1C  FS     060 3C  <      092 5C  \      124 7C  |
029 1D  GS     061 3D  =      093 5D  ]      125 7D  }
030 1E  RS     062 3E  >      094 5E  ^      126 7E  ~
031 1F  US     063 3F  ?      095 5F  _      127 7F  DEL

Any ASCII character-oriented peripheral device, such as a printer, will output an A when the ASCII code for A (0x41) is sent to it. Similarly, such devices should provide a horizontal space in response to the SP nonprinting character (0x20). The ASCII encoding of the string My Itanium would be as follows:

STRING:  4D 79 20 49 74 61 6E 69 75 6D
         M  y     I  t  a  n  i  u  m

Note that each character uses one byte of storage, shown here as two hex digits. The entire string can be referenced by the address of its first byte, the one containing the representation of the character M, symbolized here by STRING.
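A C program can reproduce this encoding by printing each byte of the string (an illustration of ours; the variable name string is arbitrary):

    #include <stdio.h>

    int main(void)
    {
        const char *string = "My Itanium";   /* string holds the address of the M */
        for (const char *p = string; *p != '\0'; p++)
            printf("%02X ", (unsigned char)*p);
        printf("\n");   /* prints 4D 79 20 49 74 61 6E 69 75 6D */
        return 0;
    }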

The arrangement of Table 2-3 makes evident the convenient feature of ASCII coding that corresponding uppercase and lowercase letters differ by only a single bit. Uppercase A is 0x41 (0100 0001), and lowercase a is 0x61 (0110 0001). This relationship simplifies case conversion or collapsing of the two cases to facilitate certain alphabetic sorting operations.
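The single-bit relationship can be exploited directly. A small sketch, valid only for the ASCII letters A through Z and a through z:

    #include <stdio.h>

    int main(void)
    {
        char upper = 'A';                   /* 0x41 = 0100 0001        */
        char lower = upper | 0x20;          /* set bit 5:   0x61 = 'a' */
        char again = lower & ~0x20;         /* clear bit 5: 0x41 = 'A' */
        printf("%c %c %c\n", upper, lower, again);
        return 0;
    }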

About one-fourth of the 7-bit ASCII codes designate control characters intended for device control. The presence of these extra codes has given the ASCII code its versatility in such areas as the control of laboratory instrumentation through relatively simple interfaces attached to the serial communication ports of inexpensive microcomputers.

Viewed as a data structure, any string has two attributes: an address and a length in bytes (or number of characters). The VAX architecture and a few others have machine instructions intended specifically for manipulating strings as a special data type. The Itanium architecture, like most others, instead provides only the machine instructions that handle small information units (e.g., byte, word, and double word), out of which string manipulations must be built. The programmer or compiler must therefore take responsibility for managing strings as data structures.
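The two attributes can be carried together in a small structure; this struct string_desc is a hypothetical illustration of how a program might manage a string as a data structure:

    #include <stdio.h>
    #include <string.h>

    struct string_desc {
        const char *addr;   /* address of the first byte */
        size_t      len;    /* length in bytes           */
    };

    int main(void)
    {
        const char *text = "My Itanium";
        struct string_desc s = { text, strlen(text) };
        printf("%zu bytes at %p: %.*s\n",
               s.len, (void *)s.addr, (int)s.len, s.addr);
        return 0;
    }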


