Exercises | Itanium® Architecture for Programmers. Understanding 64-Bit Processors and EPIC Principles

1:	When a needed information unit cannot be found in a cache (i.e., when a cache miss occurs), the design of cache systems typically calls for some fixed number of adjacent information units to be loaded into the cache; that number of bytes (probably a power of 2) is called the cache line size. Explain why this method may lower the miss rate for an instruction cache to a greater extent than for a data cache. Decide whether a compiler should generate a modestly longer alternate instruction sequence if doing so could reduce data cache misses in a particular section of a program.
2:	Explain why truly global optimization (`+O4` in Table 11-3) would usually be more associated with the operation of a linker than a compiler.
3:	Explain why the branch hints in the instruction clusters numbered 7 in Table 11-4 are appropriate.
4:	How many extra bytes of stack space, beyond what it actually used, did `gcc` claim at the `-O0` level of optimization for the `com_c.c` program?
5:	Sketch and label how each compiler organizes the memory stack for the program in Tables 11-4, 11-5, and 11-6. Put one or more register pointers on each sketch to clarify how addressing in the program works.
6:	Critically compare `gcc` optimization levels `-O1` and `-O2` for the `com_c.c` program (Linux).
7:	Critically compare two `cc` or `aCC` optimization levels for the `com_c.c` program (HP-UX).
8:	Experiment with a compiler that uses the loop count register (`ar.lc`) at one of its optimization levels in order to see if there might be some small number of traversals at which it would instead adopt some other strategy for processing a loop (such as loop unrolling).
9:	How could you simply prevent an optimizing compiler from considering as "dead code" the entire body of the `com_c.c` program? If you have access to one of the compilers that removes dead code, analyze the optimized program that it can actually produce.
10:	Define or explain what is meant by the following: loop unrolling; inline expansion; instruction scheduling.
11:	Prepare an analytical report that compares various optimizations of one of these language variants of the MATRIX program containing a two-dimensional matrix that is involved in matrix multiplication.

          PROGRAM MATRIX_F
          DOUBLE PRECISION A(11,17), B(17), C(11)
          INTEGER*8 I, J
          DO I=1,11
            C(I)=0
            DO J=1,17
              C(I)=C(I)+A(I,J)*B(J)
            END DO
          END DO
          PRINT *, C(9)
          END

    // Program matrix_c
    main ()
    {
    double a[11][17], b[17], c[11];
    long long i,j;
    for (i=0; i<11; i++) {
      c[i] = 0;
      for (j=0; j<17; j++)
        c[i] = c[i] + a[i][j] * b[j];
      }
    }

	Comment on whether any difficulties or ambiguities arise because the elements of `A` and `B` are uninitialized.
12:	Express as a function of n the total number of machine cycles that `gcc` expects its optimized version of the `fib` function (Section 11.7) to require to compute F_n on the first Itanium processor implementation.
13:	Adapt the `fib` function (Section 11.7) to accept n as a fourth argument. Make as many variants of the source file as levels of optimization that your C compiler offers, and give each a different function name. Compile these separately to `.o` files at the different levels of optimization. Insert calls to these variant functions into the TESTFIB program (Section 10.7.4). Correlate your observations of relative performance with the nature of the machine code produced by the compiler in a written analysis that also considers `fib1` (Section 10.7.1), `fib2` (Section 10.7.2), and `fib3` (Section 10.7.3).