F.1 HP-UX C Compilers

The C compilers for HP-UX support inline assembly for Itanium program development with intrinsic functions of the form _Asm_opcode(...), where opcode stands for the name of the instruction. This extension to the C language has the syntax:

 result = _Asm_opcode( completers, operands, constraints );

where result and operands are expressed as unsigned 64-bit integers in the places where the corresponding Itanium assembly language statement would use general registers. A header file (<machine/sys/inline.h> in Figure F-1) provides the definitions necessary to use inline assembly in a C program.

Markstein illustrates the usefulness of inline assembly extensions for producing and maintaining optimized floating-point libraries. Inline assembly allows retention of the full precision of the Itanium register format for intermediate results throughout a sequence of floating-point operations, the last of which produces a final result in IEEE (memory) format. This is made possible by defining intermediate results as a new C data type __fpreg.

Not all Itanium opcodes are supported (see Inline Assembly for Itanium-based HP-UX). Priority for incorporating specific intrinsics into the C compilers has been guided by needs in system software development, such as access to application and control registers and exploitation of full floating-point precision. Providing accessibility to the Itanium parallel operations in the high-level language is also clearly desirable.

On the other hand, little need has been seen for implementing straightforward arithmetic (e.g., add) or logical (e.g., and) instructions as _Asm functions, because C compilers already use an add instruction for the + operation and an and instruction for the & operation. Inline assembly also gives no visibility of or access to the Itanium branch instructions or predication, two of the most powerful capabilities of the Itanium ISA.

Consider the Itanium application register ar.itc (interval time counter), which increments at a fixed relationship to the processor clock. The difference of two readings of ar.itc should thus reflect the relative cost in cycles of the intervening code.
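The bracketing idea can be sketched in portable form. The following is a minimal sketch, not the book's code: the helper names read_ticks, elapsed_ticks, and sample_work are hypothetical, the use of the predefined macros __hpux and __ia64 to detect an HP-UX Itanium compiler is an assumption, and on any other system the sketch falls back to the ISO C clock() function, which is far coarser than ar.itc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#if defined(__hpux) && defined(__ia64)   /* assumed predefined macros */
#include <machine/sys/inline.h>
#endif

/* Hypothetical helper: return a tick count that does not decrease. */
static uint64_t read_ticks(void)
{
#if defined(__hpux) && defined(__ia64)
    /* Same intrinsic and argument as in Figure F-1. */
    return _Asm_mov_from_ar(_AREG_ITC);
#else
    /* Fallback for other systems: processor time from ISO C clock(). */
    return (uint64_t)clock();
#endif
}

/* Hypothetical stand-in for the code being measured. */
static void sample_work(void)
{
    volatile long long sum = 0;
    for (int i = 0; i < 1000; i++)
        sum = sum + i;
}

/* Bracket a piece of work with two readings and return the difference.
   Unsigned subtraction gives the right interval even if the counter
   wraps once between the two readings. */
static uint64_t elapsed_ticks(void (*work)(void))
{
    assert(work != NULL);
    uint64_t t1 = read_ticks();
    work();
    uint64_t t2 = read_ticks();
    return t2 - t1;
}
```

A caller would write something like elapsed_ticks(sample_work) and compare the results from several runs, since a single reading includes start-up effects.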

Figure F-1 illustrates the _Asm_mov_from_ar function, with the argument _AREG_ITC, inserted into the C version of the SQUARES program (Section 1.7.1). We compiled this program using the +O0 option to inhibit optimization. When we ran the program on a first-generation Itanium system, it printed a value of over 1300 for the difference t2-t1, but on successive executions it produced lower values, which we attribute to the effects of cache.

When we permitted the compiler to optimize this program at the +O1 level, the value printed for the difference t2-t1 was about 200. Checking the actions of the compiler with the -S command-line option revealed that the compiler now used an all-register approach, never performing any load or store operations, in contrast to the +O0 level. The performance improvement by a factor of about 6.5 (roughly 1300 versus 200 counts) can be attributed primarily to elimination of load instructions and secondarily to fewer instructions overall.

Attempts to show further improvement at the +O2 level of optimization, where the compiler evaluates the successive squares at compile time, were rather illusory. The compiler moved the two _Asm_mov_from_ar operations close to one another, no longer bracketing the body of instructions comprising the heart of the program. This side effect of normal compiler heuristics thus reduces the usefulness of this method as a way to explore relative timing for various levels of optimization of an algorithm.
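One general C technique for keeping such arithmetic from being evaluated at compile time is to route its inputs and its result through volatile objects. This technique is not proposed in the text above, and the names seed_diff1, seed_diff2, sink, and third_square_unfolded are hypothetical. The sketch only forces the additions to happen at run time. By itself it does not guarantee that a compiler leaves two counter reads on either side of them, so inspecting the generated code remains necessary.

```c
/* Hypothetical illustration: volatile inputs cannot be assumed constant,
   so the compiler must load them and perform the additions at run time. */
static volatile long long seed_diff1 = 1;
static volatile long long seed_diff2 = 2;

/* Volatile destination, so the result counts as observable. */
static volatile long long sink;

static long long third_square_unfolded(void)
{
    long long diff1 = seed_diff1;   /* volatile read */
    long long diff2 = seed_diff2;   /* volatile read */
    long long temp = 1;             /* 1 squared */

    diff1 = diff2 + diff1;          /* 3 */
    temp = diff1 + temp;            /* 4 = 2 squared */
    diff1 = diff2 + diff1;          /* 5 */
    temp = diff1 + temp;            /* 9 = 3 squared */

    sink = temp;                    /* volatile write */
    return temp;
}
```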

Figure F-1 SQUARES program containing the _Asm function
    #include <stdio.h>
    #include <machine/sys/inline.h>

    int main()
    {
        unsigned long long t1, t2;
        long long sq1, sq2, sq3;
        long long temp, diff1, diff2;

        t1 = _Asm_mov_from_ar( _AREG_ITC );
        diff1 = 1;
        diff2 = 2;
        temp = 1;
        sq1 = temp;
        diff1 = diff2 + diff1;
        temp = diff1 + temp;
        sq2 = temp;
        diff1 = diff2 + diff1;
        temp = diff1 + temp;
        sq3 = temp;
        t2 = _Asm_mov_from_ar( _AREG_ITC );

        printf("%lld\t%lld\t%lld\n", sq1, sq2, sq3);
        /* t1 and t2 are unsigned, so they are printed with %llu */
        printf("%llu\t%llu\t%llu\n", t1, t2, t2-t1);
        return 0;
    }
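The arithmetic timed in Figure F-1 builds successive squares by addition alone: consecutive squares differ by 3, 5, 7, and so on, and those differences themselves grow by a constant 2. As a minimal sketch of the same scheme written as a loop, the function below keeps the variable names of the figure, while the function name squares_by_differences and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fill out[0..count-1] with 1, 4, 9, ... using additions only. */
static void squares_by_differences(long long *out, size_t count)
{
    assert(out != NULL || count == 0);

    long long diff1 = 1;         /* first difference: becomes 3, 5, 7, ... */
    const long long diff2 = 2;   /* second difference of n squared is constant */
    long long temp = 1;          /* current square, starting at 1 squared */

    for (size_t i = 0; i < count; i++) {
        out[i] = temp;
        diff1 = diff2 + diff1;   /* next odd number */
        temp = diff1 + temp;     /* next square */
    }
}
```

With count equal to 3, the array receives the same three values that Figure F-1 stores in sq1, sq2, and sq3.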

The misleading result at the +O2 level emphasizes that optimizing compilers should be used knowledgeably, lest incorrect inferences be drawn. Direct inspection of critical sections of generated machine code may be required for full understanding.

Were the set of inline assembly functions more complete, we might have suggested embedding _Asm functions within a C program as a convenient means of studying some aspects of Itanium assembly language.



Itanium® Architecture for Programmers: Understanding 64-Bit Processors and EPIC Principles
Year: 2003
Pages: 223
