3.5 The Functions of a Symbolic Assembler

Symbolic assembler programs translate assembly language source files line by line into corresponding binary machine language. During translation, the assembler performs the following functions:

It builds one or more internal symbol tables that contain the values of all user-defined labels and other symbols.
It maintains location counters to determine where the next instruction or data item will be placed in memory.
It translates the symbolic instruction opcodes and operand specifiers into binary machine code, producing an object file.
It may produce a listing file showing the instructions and data and how these were translated and assigned to unique memory locations.

We look next at the elements and mechanisms an assembler utilizes during the translation process. These features include the specification of constants, the definition and use of symbols, the allocation of storage, the location counters, the evaluation of expressions, control statements, and sometimes the generation of a listing file.

3.5.1 Constants

Itanium assemblers interpret all constants appearing in the specifier field as non-negative integers by default. Negative numbers can be produced using the unary minus operator, which we will describe later. Integer constants are not followed by a decimal point. The specification of floating-point constants, where a decimal point may occur, is taken up in Chapter 8.

Table 3-2 summarizes how to control the assembler's interpretation of the radix for a constant. Assemblers usually follow the same convention as C compilers: an octal number begins with a zero, a decimal number begins with a non-zero digit, and a hexadecimal number begins with the prefix 0x. Here are some examples in base 2, 8, 10, and 16 for the decimal value twenty-nine:

`035   29   0x1d`	(HP-UX)
`0b11101   035   29   0x1d`	(Linux)

Case does not matter in the HP-UX environment when specifying the digits of hexadecimal numbers (0x or 0X; a-f or A-F) to the assembler.
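The prefix rules above can be modeled in a few lines of Python. This is only a sketch of the convention described in the text, not the code of any actual assembler; the names `parse_constant` and `allow_binary` are our own, and we assume that case is ignored for every prefix and hexadecimal digit.

```python
_DIGITS = "0123456789abcdef"

def _to_int(digits: str, radix: int) -> int:
    """Convert a digit string, rejecting any digit not valid in the radix."""
    if not digits or any(d not in _DIGITS[:radix] for d in digits):
        raise ValueError(f"bad digits for radix {radix}: {digits!r}")
    return int(digits, radix)

def parse_constant(text: str, allow_binary: bool = True) -> int:
    """Interpret a non-negative integer constant according to its prefix."""
    lowered = text.lower()                  # assumption: case is ignored throughout
    if lowered.startswith("0x"):
        return _to_int(lowered[2:], 16)     # hexadecimal: 0x prefix
    if lowered.startswith("0b"):
        if not allow_binary:                # models an assembler without a binary form
            raise ValueError(f"binary constants not accepted: {text!r}")
        return _to_int(lowered[2:], 2)      # binary: 0b prefix
    if len(lowered) > 1 and lowered.startswith("0"):
        return _to_int(lowered[1:], 8)      # octal: leading zero
    return _to_int(lowered, 10)             # decimal: no prefix

# All of these spellings denote the decimal value twenty-nine.
examples = [parse_constant(s) for s in ("0b11101", "035", "29", "0x1d", "0X1D")]
```

Passing `allow_binary=False` imitates an environment for which Table 3-2 lists no binary prefix.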

Several mov statements in SQUARES (Figure 1-3) contain numeric constants. Sometimes it is convenient to use a symbolic representation instead of the actual numeric representation. For example, the direct assignment statements

`sixteen_ones = 0xffff`	(HP-UX or Linux)
`sixteen_ones = 0b1111111111111111`	(Linux)

both define (or redefine) a symbol to have the decimal value 65,535. Actually, in most contexts the symbol sixteen_ones will behave as the zero-extended 64-bit hexadecimal value 0x000000000000ffff.

3.5.2 Symbols or Identifiers

A symbol is a string of characters. Some documentation may also use the word identifier as a synonym for symbol. Symbols may include the letters a-z and A-Z (case is typically significant for HP-UX) and the numerals 0-9. The first character of a symbol cannot be a numeral; furthermore, symbols are restricted from conflicting not only with any register name but also with certain built-in directives beginning with a dot. The dot (.), dollar sign ($), and underscore (_) characters may appear in symbols, but should be avoided in normal programming since languages like C and FORTRAN give them special significance.
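As a rough illustration of these rules, the following Python sketch checks a candidate symbol. The function name `is_valid_symbol` is hypothetical, and the sets of reserved register names and directives are deliberately partial; a real assembler reserves many more names.

```python
import re

# Letters, digits, dot, dollar sign, and underscore; no leading numeral.
_SYMBOL_RE = re.compile(r"[A-Za-z.$_][A-Za-z0-9.$_]*")

# Partial, illustrative list of register names (assumption: exact-case match).
REGISTERS = (
    {f"r{i}" for i in range(128)}      # general registers
    | {f"f{i}" for i in range(128)}    # floating-point registers
    | {f"p{i}" for i in range(64)}     # predicate registers
    | {f"b{i}" for i in range(8)}      # branch registers
    | {"gp"}
)

# A few of the dot directives that appear in this chapter.
DIRECTIVES = {".skip", ".quad", ".data", ".text", ".align"}

def is_valid_symbol(name: str) -> bool:
    """Return True if name is well formed and not reserved."""
    if not _SYMBOL_RE.fullmatch(name):
        return False
    return name not in REGISTERS and name not in DIRECTIVES
```

For example, `sixteen_ones` and `sq1` pass, whereas `1st` (leading numeral), `r14` (register name), and `.skip` (directive) are rejected.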

Table 3-2. Specifying the Radix of Constants with Itanium Assemblers
Radix	Valid Characters	HP-UX	Linux
Binary	0 and 1		`0b…`
Octal	0 to 7	`0…`	`0…`
Decimal	0^[*] to 9	`…`	`…`
Hexadecimal	0 to 9 and a to f	`0x…`	`0x…`

^[*] Decimal constants do not begin with the 0 numeral, in order to be distinguishable from octal constants.

Symbolic constants are a second type of symbol. These assignments are allowed for convenience and documentation only. Declaration of symbolic constants near the head of a routine has a parallel in the compile-time symbolic declarations supported by some high-level languages. Such declarations do not generate machine instructions. The assembler or compiler simply substitutes the equivalent numeric or string value whenever the symbol occurs in any lines below the point of definition. For some assemblers, defining a symbolic constant with a double equal sign (==) is persistent, whereas defining a symbolic constant with a single equal sign (=) would allow subsequent redefining to a different value.
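The difference between a redefinable and a persistent symbolic constant can be sketched with a small table class. This is an assumed model of the behavior described above for "some assemblers"; the class `SymbolTable` and its method names are invented for illustration.

```python
class SymbolTable:
    """Toy model of direct assignment: '=' is redefinable, '==' is persistent."""

    def __init__(self):
        self._values = {}
        self._persistent = set()

    def assign(self, name: str, value: int, persistent: bool = False) -> None:
        """Model `name = value` (default) or `name == value` (persistent=True)."""
        if name in self._persistent:
            raise ValueError(f"{name} was defined with == and cannot be redefined")
        self._values[name] = value
        if persistent:
            self._persistent.add(name)

    def lookup(self, name: str) -> int:
        """Return the value substituted wherever the symbol occurs."""
        return self._values[name]

table = SymbolTable()
table.assign("sixteen_ones", 0xFFFF)                # sixteen_ones = 0xffff
table.assign("sixteen_ones", 0b1111111111111111)    # redefinition is allowed
table.assign("rows", 4, persistent=True)            # rows == 4
```

No machine instructions correspond to any of these assignments; only the table changes.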

3.5.3 Storage Allocation

Assemblers provide several types of directives or declarative statements to facilitate the storage of data and parameters for a program module. Some of these directives for Itanium assemblers are listed in Table 3-3. For numeric data, there are groups of directives that reserve memory cells and store particular values into those cells, and there are other directives that simply reserve storage but do not initialize it in any particular way. The former are appropriate where the data are "givens" for the program to work with, while the latter are appropriate to allocate space for scratch usage or for computed results.

Table 3-3 conveys the awkward fact that different assemblers do not recognize all directives, although most assemblers recognize common subsets. This is another impediment to widespread use of assembly language across environments. Partially effective ways to get around such difficulties include using the preprocessor capabilities of the assembler or C compiler or using the "include" directive to bring in source text lines appropriate to the current programming environment. Insofar as possible, we prepare sample programs for this book in a format acceptable to most Itanium assemblers.

Historically, many assemblers have begun the names of their storage directives with a dot character (.), but Intel's assembler has departed from that tradition for storage directives such as data1 and stringz. For convenience, the Itanium implementations of the GNU and HP-UX assemblers also recognize these alternates to the traditional forms.

Table 3-3. Assembler Directives for Storage Allocation
Directive	Meaning
Recognized by GNU, Intel, and HP-UX
`.lcomm label, size, alignment`	Reserve number of bytes specified by `size`, with `label` corresponding to the first address. This storage will be in data section `.bss` with the specified alignment.
`label: .skip size`	Reserve number of bytes specified by `size`, with `label` corresponding to the first address.
`label: data1 value_list`	Store specified values in successive bytes in memory from symbolic address `label` onwards.
`label: data2 value_list`	Store successive words.
`label: data4 value_list`	Store successive double words.
`label: data8 value_list`	Store successive quad words.
`label: real4 value_list`	Store successive single-precision floating-point data.
`label: real8 value_list`	Store successive double-precision floating-point data.
`label: string "string"`	Store ASCII representation of `string`.
`label: stringz "string"`	Store ASCII representation of `string` followed by a zero byte (the ASCII NUL character).
Recognized by GNU
`label: .byte value_list`	Store specified values in successive bytes in memory from symbolic address `label` onwards.
`label: .word value_list`	Store successive words.
`label: .long value_list`	Store successive double words.
`label: .quad value_list`	Store successive quad words.
`label: .single value_list`	Store successive single-precision floating-point data.
`label: .double value_list`	Store successive double-precision floating-point data.
`label: .ascii "string"`	Store ASCII representation of `string`.
`label: .asciz "string"`	Store ASCII representation of `string` followed by a zero byte (the ASCII NUL character).

When we want to allocate memory units and initialize them with particular values, we use statements of the form:

 list:    data8  123,987,42
 last:    data8  -1

The result in memory will look like this:

Address	Contents (quad word)
`list`	123
`list+8`	987
`list+16`	42
`last`	-1

where the address last will be 24 bytes beyond the address list. On the other hand, if we only want to reserve memory units for a four-element data structure, but not give them initial values, we could simply use the construct:

 list:  .skip  3*8
 last:  .skip  8

Again, the first memory unit has a byte address list, the second list+8, etc.
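To make the address arithmetic concrete, here is a small Python sketch that builds the bytes the two `data8` statements would produce and computes the offsets of the labels. We assume little-endian byte order (as in the Linux environment) purely for illustration; the variable names are our own.

```python
import struct

list_values = [123, 987, 42]     # list:  data8  123,987,42
last_values = [-1]               # last:  data8  -1

QUAD = 8                         # a data8 item occupies eight bytes

# '<q' packs a signed 64-bit integer, little-endian (an assumption here).
image = b"".join(struct.pack("<q", v) for v in list_values + last_values)

offset_list = 0
offset_last = offset_list + QUAD * len(list_values)   # 3 quad words later

# The .skip form reserves the same space without initializing it.
reserved = 3 * 8 + 8             # list: .skip 3*8   last: .skip 8
```

The label `last` lands 24 bytes beyond `list` in either form, and both forms occupy 32 bytes in total.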

Value lists for the floating-point directives support a standard form of scientific notation, e.g., 6.02E+23. Be sure to specify enough significant digits for quantities such as π or fundamental physical constants like the speed of light.

In all instances, you should appreciate that assembly language is not at all "typed" like some high-level languages. The storage directives merely allocate memory units and optionally associate a symbolic label with the address of the first byte. Subsequently, you can freely access those memory units in other ways. For example, a string of eight ASCII spaces can be viewed as the quad word integer 0x2020202020202020. There is no runtime information to tell the hardware implementation how this information was defined or how it should be treated. The hardware will execute whatever instructions the programmer specifies.
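The example of the eight spaces is easy to check. The Python sketch below reinterprets the same eight bytes first as one quad word and then as four 2-byte words; the helper name `view_as_int` is our own.

```python
def view_as_int(data: bytes, byteorder: str = "little") -> int:
    """Reinterpret raw bytes as an unsigned integer of the same width."""
    return int.from_bytes(data, byteorder)

spaces = b" " * 8                        # eight ASCII spaces, each 0x20

quad = view_as_int(spaces)               # the same bytes viewed as one quad word
words = [view_as_int(spaces[i:i + 2]) for i in range(0, 8, 2)]   # as four words

print(hex(quad))
```

Because every byte is identical, the result is the same for either byte order. Nothing stored with the bytes records that they were defined as a string.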

3.5.4 The Location Counter

Programmers normally write blocks of statements with an implicitly sequential flow of control. Thus most types of instructions do not need to contain an address field specifying which instruction should be executed next. If a computer is built to execute instructions in sequence, the instructions must be stored that way in memory. Automatically incrementing the instruction pointer while one instruction is being executed (Figure 2-4) sets the instruction pointer for fetching the next instruction.

The assembler maintains a location counter, analogous to the instruction pointer (program counter) of the CPU. This location counter keeps track of where to store the next instruction as the executable program is being constructed line by line. Assemblers employ additional location counters to construct the data regions for a program. The reason for multiple location counters is that sophisticated multi-user operating systems can manage physical memory in such a way as to assure that program code is read-only (i.e., cannot corrupt itself), while data regions can be set up to be either read-only or read-write as appropriate to their intended purposes. In addition, system libraries such as mathematical functions may be loaded into memory regions that can be shared among processes.

The assembly process usually begins with the initialization of every location counter to zero. Since there cannot be more than one information unit with an address of zero, the linker utility program must arrange the executable code region and all the data regions into some suitable overall order, thus in effect adding constants to most or all of those zeros. Hence the location counter values really specify offsets relative to the starting address of each region.
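The linker's role can be pictured as adding a base address to each section-relative offset. In this sketch the section base addresses are invented for illustration, and `relocate` is a hypothetical helper, not part of any real linker.

```python
def relocate(symbols, section_bases):
    """Turn (section, offset) pairs into absolute addresses."""
    return {
        name: section_bases[section] + offset
        for name, (section, offset) in symbols.items()
    }

# Offsets as the assembler sees them: every section starts at zero.
symbols = {
    "sq1": (".data", 0x00),
    "sq2": (".data", 0x08),
    "sq3": (".data", 0x10),
    "main": (".text", 0x00),
}

# Hypothetical bases chosen by the linker for this example.
bases = {".text": 0x4000, ".data": 0x6000}

addresses = relocate(symbols, bases)
```

Both `sq1` and `main` have offset zero before linking, but they receive distinct addresses afterward, while the distance between `sq1` and `sq2` is unchanged.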

As the assembler reads instructions or data from the source program, it translates and outputs the equivalent binary patterns to the object file, appropriately incrementing the location counter. For every bundle of three Itanium instructions, the location counter for executable code advances by 16 bytes, as each bundle is 128 bits wide. The .skip directive defines a label as the current location counter value, and then advances that data location counter by a specified number of bytes, or address values. That is how sq2 in SQUARES (Figure 1-3) acquires an address, before and after the linking process, which is sq1+8.
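A toy model of separate location counters may help here. In the following Python sketch, each statement is a `(label, operation, argument)` tuple; the pseudo-operation `"bundle"` is our own shorthand for one assembled bundle of three instructions, and alignment directives are ignored for simplicity.

```python
BUNDLE_BYTES = 16    # one 128-bit bundle holds three instructions

def run_location_counters(statements):
    """Return (labels, counters) after processing the statements in order."""
    counters = {}        # section name -> current location counter
    labels = {}          # label -> (section, offset)
    section = ".text"
    for label, op, arg in statements:
        if op in (".text", ".data"):
            section = op                         # switch location counters
        counters.setdefault(section, 0)          # every counter starts at zero
        if label is not None:
            labels[label] = (section, counters[section])
        if op == ".skip":
            counters[section] += arg             # reserve arg bytes
        elif op == "bundle":
            counters[section] += BUNDLE_BYTES
    return labels, counters

squares = [
    (None, ".data", None),
    ("sq1", ".skip", 8),
    ("sq2", ".skip", 8),
    ("sq3", ".skip", 8),
    (None, ".text", None),
    ("main", None, None),
] + [(None, "bundle", None)] * 7

labels, counters = run_location_counters(squares)
```

With these inputs `sq2` receives offset 8 in the data section, which is `sq1+8`, and seven bundles advance the code counter to 0x70.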

With many assemblers, the dot character (.) symbolizes the current value of a location counter. Suppose we want to establish a 10-element data structure whose first element is self-referential, i.e., the value stored at location block is the address corresponding to the symbol block. We can accomplish this in the following way using the GNU assembler:

 block:   .quad    .      // refers to location 'block'
          .skip    9*8

Note that a second use of the dot character will, in general, refer to a different address:

 block:   .quad    .      // refers to location 'block'
          .quad    .      // refers to location 'block+8'

because the location counter has advanced by eight units in providing for the first quad word at symbolic address block.
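The way the dot tracks the location counter can be imitated as follows. This sketch uses a hypothetical function `assemble_quads`, and it shows section-relative values only; the linker would later adjust them.

```python
def assemble_quads(operands, base=0):
    """Evaluate a run of .quad operands, where '.' means the current location."""
    location = base
    stored = []
    for operand in operands:
        value = location if operand == "." else operand
        stored.append(value)
        location += 8            # each quad word advances the counter by eight
    return stored

# block:  .quad .   followed by   .quad .
two_dots = assemble_quads([".", "."])

# The same statements when 'block' sits at an assumed offset of 0x100.
shifted = assemble_quads([".", "."], base=0x100)
```

The second dot always evaluates to eight more than the first, wherever the block is placed.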

3.5.5 Expressions

Some of the power of a modern assembler program stems from its support of expressions. An expression is a combination of terms joined by binary operators. The most useful binary operators used by Itanium assemblers are defined in Table 3-4.

Table 3-4. Assembler Arithmetic and Logical Binary Operators
Binary Operator	Example	Meaning
`+`	`A + B`	Integer addition of `B` to `A`
`-`	`A - B`	Integer subtraction of `B` from `A`
`*`	`A * B`	Integer multiplication of `A` by `B`
`/`	`A / B`	Integer division of `A` by `B`
`&`	`A & B`	Logical AND of `A` and `B`
`|`	`A | B`	Logical OR of `A` and `B`
`^`	`A ^ B`	Logical EXCLUSIVE OR of `A` and `B`

A term can be a number, a symbol that has been given a value, or an expression. Any term may be preceded by one of the arithmetic and logical unary operators in Table 3-5.

Table 3-5. Assembler Arithmetic and Logical Unary Operators
Unary Operator	Example	Meaning
`+`	`+A`	Results in the value of `A`
`-`	`-A`	Results in the negation (two's complement) of `A`
`~`	`~A`	Results in the binary complement (one's complement) of `A`

An assembler typically considers the unary operators (Table 3-5) to have highest precedence, the binary plus and minus operators to have lowest precedence, and all other binary operators to have intermediate precedence. If you do not want A+B*C to be interpreted as A+(B*C), you must explicitly specify (A+B)*C instead. Parentheses can be nested to clarify the order of evaluation for expressions that are substantially more complicated, such as (A+(B-C)*(D+E))/F. These precedence rules are different from those of C.
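The following Python sketch evaluates expressions under exactly the three precedence levels just described. It is our own illustration, not the algorithm of any particular assembler. We assume left-to-right evaluation within a level, division that truncates toward zero, and only decimal, octal, and hexadecimal constants; `evaluate` and its helpers are hypothetical names.

```python
import re

_TOKEN_RE = re.compile(r"\s*(0[xX][0-9a-fA-F]+|\d+|[A-Za-z_.$][\w.$]*|[-+*/&|^~()])")

def _tokenize(text):
    text, pos, tokens = text.rstrip(), 0, []
    while pos < len(text):
        match = _TOKEN_RE.match(text, pos)
        if not match:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def _divide(a, b):
    quotient = abs(a) // abs(b)                  # assumption: truncate toward zero
    return quotient if (a < 0) == (b < 0) else -quotient

_MID = {"*": lambda a, b: a * b, "/": _divide, "&": lambda a, b: a & b,
        "|": lambda a, b: a | b, "^": lambda a, b: a ^ b}

class _Parser:
    def __init__(self, tokens, symbols):
        self.tokens, self.pos, self.symbols = tokens, 0, symbols

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        if token is None:
            raise ValueError("unexpected end of expression")
        self.pos += 1
        return token

    def low(self):                               # binary + and -: lowest precedence
        value = self.mid()
        while self.peek() in ("+", "-"):
            op, rhs = self.take(), self.mid()
            value = value + rhs if op == "+" else value - rhs
        return value

    def mid(self):                               # * / & | ^: intermediate precedence
        value = self.unary()
        while self.peek() in _MID:
            op = self.take()
            value = _MID[op](value, self.unary())
        return value

    def unary(self):                             # unary + - ~: highest precedence
        token = self.take()
        if token == "+":
            return self.unary()
        if token == "-":
            return -self.unary()
        if token == "~":
            return ~self.unary()
        if token == "(":
            value = self.low()
            if self.take() != ")":
                raise ValueError("expected ')'")
            return value
        if token[0].isdigit():
            if token[:2].lower() == "0x":
                return int(token, 16)
            return int(token, 8) if len(token) > 1 and token[0] == "0" else int(token)
        if token in self.symbols:
            return self.symbols[token]           # symbol must already be defined
        raise ValueError(f"undefined symbol or misplaced token: {token!r}")

def evaluate(text, symbols=None):
    parser = _Parser(_tokenize(text), symbols or {})
    value = parser.low()
    if parser.peek() is not None:
        raise ValueError(f"unexpected token: {parser.peek()!r}")
    return value
```

Under these rules `evaluate("4 + 2 & 3")` groups as 4+(2&3) and yields 6, whereas C, where & binds more loosely than +, would compute (4+2)&3 = 2.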

Expressions can be used anywhere within a program where values would be legal, as in these assembly language examples:

 somenumber:  data8   35*(7+6)
 quadarray:   .skip   arraysize*8
 length       =        rows*columns

All symbols within an expression must be either constants or other symbols that have already been defined. The evaluation of all terms and expressions by Itanium assemblers occurs using at least 64-bit precision.

3.5.6 Control Statements

We use the term control statements to cover both system-supplied routines that affect the assembly process and assembler directives that affect the behavior of the assembler itself. The latter category includes the GNU assembler .title and .sbttl directives, which have the forms:

 .title     title phrase
 .sbttl     subtitle phrase

These directives provide annotations for the assembly listing file. Some assemblers provide the .eject directive, which forces a new page in the listing file.

Many assemblers recognize the .include "FILE" directive that results in insertion of the named source file at that line. The assembler processes lines of the outer file up to this directive, then processes all the lines of the inner file, and finally processes the remaining lines of the outer file.
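The order of processing can be sketched with a short recursive function. Here an in-memory dictionary stands in for the file system, the file names are invented, and `expand_includes` is a hypothetical name. The guard against a file including itself is our own addition.

```python
def expand_includes(name, files, _active=()):
    """Return the lines of file `name` with every .include replaced by its text."""
    if name in _active:
        raise ValueError(f"recursive include of {name!r}")
    expanded = []
    for line in files[name]:
        stripped = line.strip()
        if stripped.startswith(".include"):
            inner = stripped[len(".include"):].strip().strip('"')
            # Process all lines of the inner file before resuming the outer one.
            expanded.extend(expand_includes(inner, files, _active + (name,)))
        else:
            expanded.append(line)
    return expanded

files = {
    "outer.s": ["first outer line", '.include "inner.s"', "last outer line"],
    "inner.s": ["inner line 1", "inner line 2"],
}

merged = expand_includes("outer.s", files)
```

The result holds the first outer line, then both inner lines, then the last outer line.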

3.5.7 Elements of a Listing File

Long ago, when assembly language predominated for program development and source code had to be prepared using card punches or very primitive text editors, a programmer would take away an assembly listing file to study. The listing file produced by an assembler would typically reproduce the source file line by line, and would show in columns at the left such features as a line number for reference, the address at which the data or instruction was to be placed (i.e., the location counter value), and the numeric representation for the data or instruction. Errors in assembly language syntax, symbols that were multiply defined or undefined, and illegal characters would be flagged for the programmer's attention.

Some assemblers would also append a symbol table, giving the numeric value for each symbol in the assembly language program. These might be marked with designations to show which ones were absolute (defined as a constant value), relocatable (subject to an additive constant at link time), or global (accessible to other routines and shown in a link map).

For the development of big programs, a cross-reference table could be especially helpful. This table would serve as an alphabetized index of symbols, giving page and line number of every occurrence.
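A cross-reference table of this kind is straightforward to build. The sketch below records line numbers only (no page numbers) and ignores text after the `//` comment marker; the function name `cross_reference` and the sample lines are our own.

```python
import re
from collections import defaultdict

_WORD_RE = re.compile(r"[A-Za-z_.$][\w.$]*")

def cross_reference(lines, symbols):
    """Map each symbol, in alphabetical order, to the lines on which it occurs."""
    table = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        code = line.split("//", 1)[0]            # drop any trailing comment
        for word in _WORD_RE.findall(code):
            if word in symbols and number not in table[word]:
                table[word].append(number)
    return dict(sorted(table.items()))

source = [
    "sq1:    .skip   8",
    "sq2:    .skip   8",
    "        addl    r14 = @gprel(sq1),gp;;",
    "        addl    r14 = @gprel(sq2),gp;;   // follows sq1",
]

xref = cross_reference(source, {"sq1", "sq2"})
```

Here `sq1` is indexed at lines 1 and 3 and `sq2` at lines 2 and 4; the mention of `sq1` inside the comment on line 4 is not counted.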

The GNU assembler listing file for SQUARES shown in Figure 3-2 was obtained by including -Wa,-a in the gcc command line. (Some spaces and tabs have been removed to fit the page size of this book.) Note that the location counter (LC) and generated code are shown in hexadecimal.

Figure 3-2 SQUARES listing file (GNU)

 Line   LC   Code         Source Program
GAS LISTING squares.s                   page 1
   1                    // SQUARES      Table of Squares
   2                            .data                    // Declare storage
   3                            .align  8                // Desired alignment
   4 0000 00000000      sq1:    .skip   8                // To store 1 squared
   4      00000000
   5 0008 00000000      sq2:    .skip   8                // To store 2 squared
   5      00000000
   6 0010 00000000      sq3:    .skip   8                // To store 3 squared
   6      00000000
   7                                                     // etc.
   8                            .text                    // Section for code
   9                            .align  32               // Desired alignment
  10                            .global main             // These three lines
  11                            .proc   main             //  mark the mandatory
  12                    main:                            //   'main' program entry
  13                            .body                    // Now we really begin...
  14                    first:  mov     r21 = 1;;        // Gr21 = first difference
  15                            mov     r22 = 2;;        // Gr22 = 2nd difference
  16                            mov     r20 = 1;;        // Gr20 = first square
  17 0000 0BA80400              addl    r14 = @gprel(sq1),gp;; // Point to storage
  17      00216011
  17      00004200
  17      00000400
  18                            st8     [r14] = r20;;    //         for sq1
  19 0010 0BA00400              add     r21 = r22,r21;;  // Adjust first difference
  19      0021E000
  19      04004800
  19      00000400
  20                            add     r20 = r21,r20;;  // Gr20 = second square
  21 0020 0B00501C              addl    r14 = @gprel(sq2),gp;; // Point to storage
  21      981150B1
  21      54004000
  21      00000400
  22                            st8     [r14] = r20;;    //         for sq2
  23 0030 0BA05428              add     r21 = r22,r21;;  // Adjust first difference
  23      0020E000
  23      04004800
  23      00000400
  24                            add     r20 = r21,r20;;  // Gr20 = third square
  25 0040 0B00501C              addl    r14 = @gprel(sq3),gp;; // Point to storage
  25      981150B1
  25      54004000
  25      00000400
  26                            st8     [r14] = r20;;    //         for sq3
  27                                                     // etc.
  28 0050 0BA05428      done:   mov     r8 = 0;;         // Signal all is normal
  28      0020E000
  28      04004800
  28      00000400
  29                            br.ret.sptk.many b0;;   // Back to command line
  30 0060 1D00501C              .endp   main            // Mark end of procedure
  30      98110000
  30      00020000
  30      00000020
  30      0A400000
GAS LISTING squares.s                  page 2
DEFINED SYMBOLS
          squares.s:4       .data:0000000000000000 sq1
          squares.s:5       .data:0000000000000008 sq2
          squares.s:6       .data:0000000000000010 sq3
          squares.s:17      .text:0000000000000000 main
          squares.s:17      .text:0000000000000000 first
          squares.s:30      .text:0000000000000070 done
NO UNDEFINED SYMBOLS

In the listing file for SQUARES, we can see the consequences of bundling Itanium instructions. The location counter (LC) advances by 0x10 (16 bytes) at a time because each bundle is 128 bits wide. The four 4-byte hexadecimal numbers associated with each location counter value thus contain the numeric representation of three instructions. Those numbers do not line up exactly with the programmer's lines of assembly language code because the assembler must insert nop (no-operation) instructions in order to satisfy certain constraints that are discussed in later chapters of this book. We examine one of the instruction bundles shown in Figure 3-2 for SQUARES in greater detail later.

When assemblers do not produce listing files, partial workarounds may be available and fairly satisfactory. For instance, HP-UX and Linux programming environments offer the nm command, which can print a table of symbols from an executable file. Symbol values shown by nm will reflect any adjustments made by the linker.
