Symbolic assembler programs translate assembly language source files line by line into corresponding binary machine language. During translation, the assembler performs the following functions:
We look next at the elements and mechanisms an assembler utilizes during the translation process. These features include the specification of constants, the definition and use of symbols, the allocation of storage, the location counters, the evaluation of expressions, control statements, and sometimes the generation of a listing file.

3.5.1 Constants

Itanium assemblers interpret all constants appearing in the specifier field as non-negative integers by default. Negative numbers can be produced using the unary minus operator, which we will describe later. Integer constants are not followed by a decimal point. The specification of floating-point constants, where a decimal point may occur, is taken up in Chapter 8.

Table 3-2 summarizes how to control the assembler's interpretation of the radix for a constant. Assemblers usually follow the same convention as C compilers: an octal number begins with a zero, a decimal number begins with a non-zero digit, and a hexadecimal number begins with the prefix 0x. Here are some examples in base 2, 8, 10, and 16 for the decimal value twenty-nine:
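As a rough sketch of that radix convention, the following Python function classifies a constant by its prefix and confirms that several spellings of twenty-nine agree. The function name parse_constant is hypothetical, and the 0b prefix for binary is an assumption borrowed from the GNU assembler, since Table 3-2 is not reproduced here.

```python
def parse_constant(text):
    """Return the integer value of an assembler-style constant.

    Assumed convention: 0x/0X -> hexadecimal, 0b/0B -> binary (assumption),
    other leading zero -> octal, anything else -> decimal.
    """
    lowered = text.lower()          # case of prefix and hex digits is ignored
    if lowered.startswith("0x"):
        return int(lowered[2:], 16)
    if lowered.startswith("0b"):
        return int(lowered[2:], 2)
    if lowered.startswith("0") and len(lowered) > 1:
        return int(lowered[1:], 8)
    return int(lowered, 10)


# Four radixes, one value; each line prints 29 on the right-hand side.
for spelling in ("0b11101", "035", "29", "0x1d", "0X1D"):
    print(f"{spelling:>8} -> {parse_constant(spelling)}")
```

A real assembler would also reject malformed digits with its own diagnostics; here Python's int() simply raises ValueError.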
Case does not matter in the HP-UX environment when specifying the digits of hexadecimal numbers (0x or 0X; a-f or A-F) to the assembler. Several mov statements in SQUARES (Figure 1-3) contain numeric constants. Sometimes it is convenient to use a symbolic representation instead of the actual numeric representation. For example, the direct assignment statements
both define (or redefine) a symbol to have the decimal value 65,535. Actually, in most contexts the symbol sixteen_ones will behave as the zero-extended 64-bit hexadecimal value 0x000000000000ffff.

3.5.2 Symbols or Identifiers

A symbol is a string of characters. Some documentation may also use the word identifier as a synonym for symbol. Symbols may include the letters a-z and A-Z (case is typically significant for HP-UX) and the numerals 0-9. The first character of a symbol cannot be a numeral; furthermore, a symbol must not conflict with any register name or with certain built-in directives that begin with a dot. The dot (.), dollar sign ($), and underscore (_) characters may appear in symbols, but should be avoided in normal programming since languages like C and FORTRAN give them special significance.
Symbolic constants are a second type of symbol. These assignments are allowed for convenience and documentation only. Declaration of symbolic constants near the head of a routine has a parallel in the compile-time symbolic declarations supported by some high-level languages. Such declarations do not generate machine instructions. The assembler or compiler simply substitutes the equivalent numeric or string value whenever the symbol occurs in any lines below the point of definition. For some assemblers, defining a symbolic constant with a double equal sign (==) is persistent, whereas defining a symbolic constant with a single equal sign (=) would allow subsequent redefining to a different value.

3.5.3 Storage Allocation

Assemblers provide several types of directives or declarative statements to facilitate the storage of data and parameters for a program module. Some of these directives for Itanium assemblers are listed in Table 3-3. For numeric data, there are groups of directives that reserve memory cells and store particular values into those cells, and there are other directives that simply reserve storage but do not initialize it in any particular way. The former are appropriate where the data are "givens" for the program to work with, while the latter are appropriate to allocate space for scratch usage or for computed results.

Table 3-3 conveys the awkward fact that different assemblers do not recognize all directives, although most assemblers recognize common subsets. This is another impediment to widespread use of assembly language across environments. Partially effective ways to get around such difficulties include using the preprocessor capabilities of the assembler or C compiler or using the "include" directive to bring in source text lines appropriate to the current programming environment. Insofar as possible, we prepare sample programs for this book in a format acceptable to most Itanium assemblers.
Historically, many assemblers have begun the names of their storage directives with a dot character (.), but Intel's assembler has departed from that tradition for storage directives such as data1 and stringz. For convenience, the Itanium implementations of the GNU and HP-UX assemblers also recognize these alternates to the traditional forms.
When we want to allocate memory units and initialize them with particular values, we use statements of the form:

    list:   data8   123,987,42
    last:   data8   -1

The result in memory will look like this:

    list:       123
    list+8:     987
    list+16:    42
    last:       -1

where the address last will be 24 bytes beyond the address list. On the other hand, if we only want to reserve memory units for a four-element data structure, but not give them initial values, we could simply use the construct:

    list:   .skip   3*8
    last:   .skip   8

Again, the first memory unit has a byte address list, the second list+8, etc.

Value lists for the floating-point directives support a standard form of scientific notation, e.g., 6.02E+23. Be sure to specify enough significant digits for quantities such as π or fundamental physical constants like the speed of light.

In all instances, you should appreciate that assembly language is not at all "typed" like some high-level languages. The storage directives merely allocate memory units and optionally associate a symbolic label with the address of the first byte. Subsequently, you can freely access those memory units in other ways. For example, a string of eight ASCII spaces can be viewed as the quad word integer 0x2020202020202020. There is no runtime information to tell the hardware implementation how this information was defined or how it should be treated. The hardware will execute whatever instructions the programmer specifies.

3.5.4 The Location Counter

Programmers normally write blocks of statements with an implicitly sequential flow of control. Thus most types of instructions do not need to contain an address field specifying which instruction should be executed next. If a computer is built to execute instructions in sequence, the instructions must be stored that way in memory. Automatically incrementing the instruction pointer while one instruction is being executed (Figure 2-4) sets the instruction pointer for fetching the next instruction.
The assembler maintains a location counter, analogous to the instruction pointer (program counter) of the CPU. This location counter keeps track of where to store the next instruction as the executable program is being constructed line by line.

Assemblers employ additional location counters to construct the data regions for a program. The reason for multiple location counters is that sophisticated multi-user operating systems can manage physical memory in such a way as to assure that program code is read-only (i.e., cannot corrupt itself), while data regions can be set up to be either read-only or read-write as appropriate to their intended purposes. In addition, system libraries such as mathematical functions may be loaded into memory regions that can be shared among processes.

The assembly process usually begins with the initialization of every location counter to zero. Since there cannot be more than one information unit with an address of zero, the linker utility program must arrange the executable code region and all the data regions into some suitable overall order, thus in effect adding constants to most or all of those zeros. Hence the location counter values really specify offsets relative to the starting address of each region.

As the assembler reads instructions or data from the source program, it translates and outputs the equivalent binary patterns to the object file, appropriately incrementing the location counter. For every bundle of three Itanium instructions, the location counter for executable code advances by 16 bytes, as each bundle is 128 bits wide.

A label on a .skip directive takes the current location counter value as its address, and the directive then advances that data location counter by a specified number of bytes, or address values. That is how sq2 in SQUARES (Figure 1-3) acquires an address, before and after the linking process, which is sq1+8.

With many assemblers, the dot character (.) symbolizes the current value of a location counter.
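The bookkeeping described above can be modeled in a few lines of Python. This is only a sketch: the class LocationCounter, the function relocate, and the base address 0x6000 are hypothetical names and values chosen for illustration, and placing sq1 through sq3 in the same region as list and last is likewise just for the demonstration.

```python
class LocationCounter:
    """Tracks the next free offset within one region (code or data)."""

    def __init__(self):
        self.offset = 0      # every location counter starts at zero
        self.symbols = {}    # label -> offset from the start of the region

    def label(self, name):
        self.symbols[name] = self.offset   # a label records the current value

    def skip(self, nbytes):
        self.offset += nbytes              # reserve storage, uninitialized

    def data8(self, *values):
        self.offset += 8 * len(values)     # one 8-byte cell per value

    def bundle(self):
        self.offset += 16                  # one 128-bit instruction bundle


def relocate(counter, base):
    """Model the linker adding a region's starting address to each offset."""
    return {name: base + off for name, off in counter.symbols.items()}


data = LocationCounter()
for name in ("sq1", "sq2", "sq3"):
    data.label(name)
    data.skip(8)
data.label("list")
data.data8(123, 987, 42)
data.label("last")
data.data8(-1)

text = LocationCounter()
text.label("main")
for _ in range(3):
    text.bundle()

linked = relocate(data, 0x6000)            # hypothetical base address
print(data.symbols)    # {'sq1': 0, 'sq2': 8, 'sq3': 16, 'list': 24, 'last': 48}
print(text.offset)     # 48
print(linked["sq2"] - linked["sq1"])       # 8, the same before and after linking
```

Because relocation adds the same constant to every symbol in a region, differences such as last minus list (24 bytes) are unaffected by where the linker finally places that region.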
Suppose we want to establish a 10-element data structure whose first element is self-referential, i.e., the value stored at location block is the address corresponding to the symbol block. We can accomplish this in the following way using the GNU assembler:

    block:  .quad   .       // refers to location 'block'
            .skip   9*8

Note that a second use of the dot character will, in general, refer to a different address:

    block:  .quad   .       // refers to location 'block'
            .quad   .       // refers to location 'block+8'

because the location counter has advanced by eight units in providing for the first quad word at symbolic address block.

3.5.5 Expressions

Some of the power of a modern assembler program stems from its support of expressions. An expression is a combination of terms joined by binary operators. The most useful binary operators used by Itanium assemblers are defined in Table 3-4.
A term can be a number, a symbol that has been given a value, or an expression. Any term may be preceded by one of the arithmetic and logical unary operators in Table 3-5.
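To make the structure of terms and operators concrete, here is a small recursive evaluator written as a Python sketch. It is not how any particular assembler is implemented. The operator set is an assumption (Tables 3-4 and 3-5 are not reproduced here), the function names are hypothetical, and the three precedence levels are the ones this section states: unary operators highest, binary plus and minus lowest, all other binary operators in between.

```python
import re

# Assumed operator set; the real tables may list more or fewer operators.
TOKEN = re.compile(
    r"\s*(0[xX][0-9a-fA-F]+|\d+|[A-Za-z_.$][\w.$]*|<<|>>|[-+*/%&|^~()])")

INTERMEDIATE = {
    "*": lambda a, b: a * b,
    "/": lambda a, b: a // b,    # Python floor division (non-negative operands)
    "%": lambda a, b: a % b,
    "<<": lambda a, b: a << b,
    ">>": lambda a, b: a >> b,
    "&": lambda a, b: a & b,
    "|": lambda a, b: a | b,
    "^": lambda a, b: a ^ b,
}


def tokenize(text):
    tokens, pos, text = [], 0, text.strip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"bad character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def to_int(token):
    if token[:2].lower() == "0x":
        return int(token, 16)
    if len(token) > 1 and token[0] == "0":
        return int(token, 8)     # leading zero means octal
    return int(token, 10)


def evaluate(text, symbols=None):
    symbols = symbols or {}
    tokens, pos = tokenize(text), 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        pos += 1
        return tokens[pos - 1]

    def term():                  # number, defined symbol, or (expression)
        tok = take()
        if tok == "(":
            value = low()
            if take() != ")":
                raise ValueError("expected ')'")
            return value
        if tok[0].isdigit():
            return to_int(tok)
        if tok in symbols:
            return symbols[tok]
        raise ValueError(f"undefined symbol or misplaced token: {tok}")

    def unary():                 # highest precedence
        if peek() == "-":
            take()
            return -unary()
        if peek() == "~":
            take()
            return ~unary()
        return term()

    def mid():                   # all binary operators except + and -
        value = unary()
        while peek() in INTERMEDIATE:
            value = INTERMEDIATE[take()](value, unary())
        return value

    def low():                   # binary + and -, lowest precedence
        value = mid()
        while peek() in ("+", "-"):
            op, rhs = take(), mid()
            value = value + rhs if op == "+" else value - rhs
        return value

    result = low()
    if peek() is not None:
        raise ValueError(f"unexpected token: {peek()}")
    return result


print(evaluate("35*(7+6)"))                            # 455
print(evaluate("A+B*C", {"A": 2, "B": 3, "C": 4}))     # 14
print(evaluate("(A+B)*C", {"A": 2, "B": 3, "C": 4}))   # 20
print(evaluate("1+2<<3"))                              # 17
```

The last line shows one consequence of the three-level rule: with the shift at the intermediate level, 1+2<<3 groups as 1+(2<<3), whereas C would evaluate (1+2)<<3 and give 24.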
An assembler typically considers the unary operators (Table 3-5) to have highest precedence, the binary plus and minus operators to have lowest precedence, and all other binary operators to have intermediate precedence. If you do not want A+B*C to be interpreted as A+(B*C), you must explicitly specify (A+B)*C instead. Parentheses can be nested to clarify the order of evaluation for expressions that are substantially more complicated, such as (A+(B-C)*(D+E))/F. These precedence rules are different from those of C.

Expressions can be used anywhere within a program where values would be legal, as in these assembly language examples:

    somenumber: data8   35*(7+6)
    quadarray:  .skip   arraysize*8
    length = rows*columns

All symbols within an expression must be either constants or other symbols that have already been defined. The evaluation of all terms and expressions by Itanium assemblers occurs using at least 64-bit precision.

3.5.6 Control Statements

We are using the concept of control statements to include both system-supplied routines that affect the assembly process and certain assembler directives that affect the behavior of the assembler itself. This latter category includes the GNU assembler .title and .sbttl directives, which have the forms:

    .title  title phrase
    .sbttl  subtitle phrase

These directives provide annotations for the assembly listing file. Some assemblers provide the .eject directive, which forces a new page in the listing file.

Many assemblers recognize the .include "FILE" directive that results in insertion of the named source file at that line. The assembler processes lines of the outer file up to this directive, then processes all the lines of the inner file, and finally processes the remaining lines of the outer file.
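The outer-then-inner-then-outer ordering of .include can be sketched in Python as a recursive expansion. The function name expand, the in-memory files dictionary, and the file names outer.s and inner.s are all hypothetical; a real assembler reads from disk and searches its own include paths.

```python
import re

# Matches a line of the form:  .include "FILE"
INCLUDE = re.compile(r'^\s*\.include\s+"([^"]+)"')


def expand(name, files, active=()):
    """Return the lines of `name` with every .include replaced by its contents."""
    if name in active:
        raise ValueError(f"recursive include of {name}")
    lines = []
    for line in files[name].splitlines():
        match = INCLUDE.match(line)
        if match:
            # Process the whole inner file before resuming the outer one.
            lines.extend(expand(match.group(1), files, active + (name,)))
        else:
            lines.append(line)
    return lines


files = {
    "outer.s": 'first outer line\n.include "inner.s"\nlast outer line',
    "inner.s": "inner line 1\ninner line 2",
}
for line in expand("outer.s", files):
    print(line)
```

The guard against a file including itself is an addition for safety in this sketch; the text does not say how any given assembler handles that case.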
3.5.7 Elements of a Listing File

Long ago, when assembly language predominated for program development and source code had to be prepared using card punches or very primitive text editors, a programmer would take away an assembly listing file to study. The listing file produced by an assembler would typically reproduce the source file line by line, and would show in columns at the left such features as a line number for reference, the address at which the data or instruction was to be placed (i.e., the location counter value), and the numeric representation for the data or instruction. Errors in assembly language syntax, symbols that were multiply defined or undefined, and illegal characters would be flagged for the programmer's attention.

Some assemblers would also append a symbol table, giving the numeric value for each symbol in the assembly language program. These might be marked with designations to show which ones were absolute (defined as a constant value), relocatable (subject to an additive constant at link time), or global (accessible to other routines and shown in a link map). For the development of big programs, a cross-reference table could be especially helpful. This table would serve as an alphabetized index of symbols, giving page and line number of every occurrence.

The GNU assembler listing file for SQUARES shown in Figure 3-2 was obtained by including -Wa,-a in the gcc command line. (Some spaces and tabs have been removed to fit the page size of this book.) Note that the location counter (LC) and generated code are shown in hexadecimal.

Figure 3-2 SQUARES listing file (GNU)

    Line LC   Code     Source Program

    GAS LISTING squares.s                    page 1

       1               // SQUARES  Table of Squares
       2                       .data                        // Declare storage
       3                       .align 8                     // Desired alignment
       4 0000 00000000 sq1:    .skip 8                      // To store 1 squared
       4      00000000
       5 0008 00000000 sq2:    .skip 8                      // To store 2 squared
       5      00000000
       6 0010 00000000 sq3:    .skip 8                      // To store 3 squared
       6      00000000
       7                                                    // etc.
       8                       .text                        // Section for code
       9                       .align 32                    // Desired alignment
      10                       .global main                 // These three lines
      11                       .proc main                   // mark the mandatory
      12               main:                                // 'main' program entry
      13                       .body                        // Now we really begin...
      14               first:  mov r21 = 1;;                // Gr21 = first difference
      15                       mov r22 = 2;;                // Gr22 = 2nd difference
      16                       mov r20 = 1;;                // Gr20 = first square
      17 0000 0BA80400         addl r14 = @gprel(sq1),gp;;  // Point to storage
      17      00216011
      17      00004200
      17      00000400
      18                       st8 [r14] = r20;;            // for sq1
      19 0010 0BA00400         add r21 = r22,r21;;          // Adjust first difference
      19      0021E000
      19      04004800
      19      00000400
      20                       add r20 = r21,r20;;          // Gr20 = second square
      21 0020 0B00501C         addl r14 = @gprel(sq2),gp;;  // Point to storage
      21      981150B1
      21      54004000
      21      00000400
      22                       st8 [r14] = r20;;            // for sq2
      23 0030 0BA05428         add r21 = r22,r21;;          // Adjust first difference
      23      0020E000
      23      04004800
      23      00000400
      24                       add r20 = r21,r20;;          // Gr20 = third square
      25 0040 0B00501C         addl r14 = @gprel(sq3),gp;;  // Point to storage
      25      981150B1
      25      54004000
      25      00000400
      26                       st8 [r14] = r20;;            // for sq3
      27                                                    // etc.
      28 0050 0BA05428 done:   mov r8 = 0;;                 // Signal all is normal
      28      0020E000
      28      04004800
      28      00000400
      29                       br.ret.sptk.many b0;;        // Back to command line
      30 0060 1D00501C         .endp main                   // Mark end of procedure
      30      98110000
      30      00020000
      30      00000020
      30      0A400000

    GAS LISTING squares.s                    page 2

    DEFINED SYMBOLS
        squares.s:4    .data:0000000000000000 sq1
        squares.s:5    .data:0000000000000008 sq2
        squares.s:6    .data:0000000000000010 sq3
        squares.s:17   .text:0000000000000000 main
        squares.s:17   .text:0000000000000000 first
        squares.s:30   .text:0000000000000070 done

    NO UNDEFINED SYMBOLS

In the listing file for SQUARES, we can see the consequences of bundling Itanium instructions. The location counter (LC) is shown changing by 0x10 (16 bytes) because of the 128-bit width of each bundle. The four 4-byte hexadecimal numbers associated with each location counter value thus contain the numeric representation for three instructions.
The positioning of those numbers does not precisely track with the programmer's lines of assembly language code because the assembler must insert nop (no-operation) instructions in order to satisfy certain constraints that are discussed in later chapters of this book. We examine one of the instruction bundles shown in Figure 3-2 for SQUARES in greater detail later.

When assemblers do not produce listing files, partial workarounds may be available and fairly satisfactory. For instance, HP-UX and Linux programming environments offer the nm command, which can print a table of symbols from an executable file. Symbol values shown by nm will reflect any adjustments made by the linker.
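Returning to the bundles in Figure 3-2, the way four 4-byte numbers hold three instructions can be sketched in Python. An Itanium bundle consists of a 5-bit template in the low-order bits followed by three 41-bit instruction slots. The sketch assumes that the listing prints the 16 bytes in memory (little-endian) order, which the leading template byte suggests but which the text does not state; the function name decode_bundle is hypothetical.

```python
# The four code words shown at LC 0000 in the .text section of Figure 3-2.
WORDS = ("0BA80400", "00216011", "00004200", "00000400")


def decode_bundle(words):
    """Split one 128-bit bundle into its template and three 41-bit slots."""
    raw = bytes.fromhex("".join(words))     # assumed to be in memory byte order
    if len(raw) != 16:
        raise ValueError("a bundle is exactly 16 bytes")
    value = int.from_bytes(raw, "little")   # byte 0 holds the lowest bits
    template = value & 0x1F                 # bits 0-4
    slot_mask = (1 << 41) - 1
    slots = [(value >> shift) & slot_mask for shift in (5, 46, 87)]
    return template, slots


template, slots = decode_bundle(WORDS)
print(f"template = 0x{template:02X}")
for index, slot in enumerate(slots):
    print(f"slot {index}   = 0x{slot:011X}")
```

The template value selects which execution-unit types the three slots use and where instruction-group stops fall; interpreting the individual slot encodings is left to the later chapters.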