11.3. Architectural Issues

There are a few architectural issues to be aware of when developing or porting software on Mac OS X. In particular, vectorization, pointer size, endianness, and inline assembly code tend to be the most common issues developers run into.

11.3.1. AltiVec

The Velocity Engine, Apple's name for Motorola's 128-bit AltiVec vector processor, allows up to 16 operations in a single clock cycle and is supported on both G4 and G5 processors by the Mac OS X GCC implementation. The Velocity Engine, which executes operations concurrently with the existing integer and floating-point units, can result in significant performance gains, especially for highly parallel operations. The compiler flag -maltivec can be specified to compile code engineered to use the AltiVec instruction set. Inclusion of this command-line option to cc defines the preprocessor symbol __VEC__. (See Table 11-3 for more AltiVec-related compiler flags.)
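As a minimal sketch of how the __VEC__ symbol might be used, the following function takes an AltiVec path when the symbol is defined and falls back to plain C otherwise. The function name add4 is hypothetical, and the AltiVec path assumes that all three pointers are 16-byte aligned and that the <altivec.h> header is available to the compiler.

```c
#ifdef __VEC__
#include <altivec.h>   /* AltiVec intrinsics: vec_ld, vec_add, vec_st */
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for i = 0..3.
   On the AltiVec path, a, b, and out are assumed to be 16-byte aligned. */
void add4(const float *a, const float *b, float *out)
{
#ifdef __VEC__
    /* One vector register holds four 32-bit floats. */
    vector float va = vec_ld(0, a);
    vector float vb = vec_ld(0, b);
    vec_st(vec_add(va, vb), 0, out);
#else
    /* Scalar fallback for builds without AltiVec support. */
    int i;
    for (i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
#endif
}
```

Keeping the scalar fallback in place means the same source file still builds on a G3, or on any other platform where the symbol is not defined.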

11.3.2. 64-bit Computing

On a 32-bit system, such as Mac OS X running on the G3 or G4, C pointers are 32 bits (4 bytes). On a 64-bit system, such as Mac OS X running on the G5, they are 64 bits (8 bytes). As long as your code does not rely on any assumptions about pointer size, it should be 64-bit clean. For example, on a 32-bit system, the following program prints "4", and on a 64-bit system, it prints "8":

     #include <stdio.h>

     int main(void)
     {
       /* sizeof yields a size_t, so cast it to match the format specifier */
       printf("%lu\n", (unsigned long) sizeof(void *));
       return 0;
     }

Some 64-bit operating systems, such as Solaris 8 on Ultra hardware (sun4u) and Mac OS X Tiger on G5 hardware, have a 64-bit kernel space but support both 32- and 64-bit applications, depending on how they are compiled. On a G5 system, a pointer in a 64-bit application is 64 bits wide. Other data types are mapped onto the 64-bit data type. For example, single-precision floats, which are 32 bits wide, are converted to double precision when they are loaded into registers. Single-precision instructions then operate on these values stored as doubles, and the results are rounded back to single precision (32 bits). Apple has provided several technical documents containing information and advice on optimizing code to take advantage of the 64-bit G5 architecture:

64-Bit Transition Guide at file:///Developer/ADC Reference Library/documentation/Darwin/Conceptual/64bitPorting/index.html
Developing 64-Bit Applications at http://developer.apple.com/macosx/tiger/64bit.html
TN2086: Tuning for G5: A Practical Guide at http://developer.apple.com/technotes/tn/tn2086.html
TN2087: PowerPC G5 Performance Primer at http://developer.apple.com/technotes/tn/tn2087.html

Additional information can be found at http://developer.apple.com/hardware/ve/g5.html. These documents describe in detail the issues involved in tuning code for the G5. We note here only a few issues.

The architecture of the G5 allows for much greater performance relative to the G4. This performance potential is partly due to the fact that the G5 allows 200 instructions in flight, compared to only 30 for the G4. Moreover, the G5 has 16 pipeline stages, 2 load/store units, and 2 floating-point units, compared to 7 pipeline stages, 1 load/store unit, and 1 floating-point unit on the G4. The L1 cache line size is also 128 bytes on the G5, compared to 32 bytes on the G4. Additionally, the processor and memory bandwidth is much greater on the G5 than on the G4. The technical notes mentioned earlier in this section have additional information on hardware differences.

One important implication of the greater number of pipeline stages on the G5 relative to the G4 is that instruction latencies are greater on the G5. You can often gain significant improvements in performance by using performance tools to identify loops that account for a large percentage of computation time. Once you have identified them, you can either unroll these loops manually or use the -funroll-loops compiler flag. The compiler flag -mtune=970 can also be useful in this situation, as it schedules code more efficiently for the G5. The -fast compiler flag sets these options (among others) automatically.
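To make the idea of manual unrolling concrete, here is a sketch of a summation loop unrolled by four. The function name sum_unrolled is hypothetical. The four independent accumulators are intended to keep more of the pipeline busy, though whether this beats the compiler's own unrolling is something to measure with a profiler rather than assume.

```c
#include <stddef.h>

/* Hypothetical example: sum n doubles with the loop unrolled by four. */
double sum_unrolled(const double *x, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main unrolled loop: four independent additions per iteration. */
    for (; i + 3 < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }

    /* Remainder loop handles the last n % 4 elements. */
    for (; i < n; i++)
        s0 += x[i];

    return (s0 + s1) + (s2 + s3);
}
```

Because floating-point addition is not associative, the unrolled version can differ from a straightforward loop in the last few bits of the result.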

To better take advantage of the longer cache line size in L1 cache on the G5, algorithms should be designed for greater data locality, and use contiguous memory accesses when possible. For example, arrays in C store entries row-wise. To ensure contiguous memory accesses, design your code so that it accesses array elements row-by-row. The G5 has four hardware prefetchers, which (if accesses to memory are contiguous) are triggered automatically to help reduce cache misses. Performance tools, such as the CHUD suite (see Chapter 12), can help you optimize code by profiling computation and memory usage; some of them even make suggestions on how to improve performance.
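The row-by-row access pattern can be sketched as follows. The function name sum_row_major and the ROWS and COLS dimensions are made up for illustration. The point is only that the inner loop runs over the last index, so consecutive iterations touch consecutive addresses.

```c
#define ROWS 4
#define COLS 8

/* Hypothetical example: sum a ROWS x COLS matrix row by row. */
double sum_row_major(double m[ROWS][COLS])
{
    double total = 0.0;
    int r, c;

    for (r = 0; r < ROWS; r++)        /* outer loop over rows */
        for (c = 0; c < COLS; c++)    /* inner loop walks consecutive addresses */
            total += m[r][c];

    return total;
}
```

Swapping the two loops would compute the same sum, but each access would then jump COLS elements ahead in memory, which works against both the cache line size and the prefetchers.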

While the G5 running Mac OS X Panther provided a fine computing platform, Mac OS X Tiger, which allows applications to access a 64-bit address space, opens up a new realm of computational capabilities. Because Tiger supports 64-bit arithmetic instructions on PowerPC architectures even for code compiled in 32-bit mode, your code will not necessarily run more efficiently when compiled in 64-bit mode. Note that, even on a G3 system, 32-bit applications have a 128-bit long double data type and a 64-bit long long data type.

To compile 64-bit code using GCC, be sure to use GCC 4.0, and specify the ppc64 architecture with -arch ppc64. Using the -arch ppc compiler flag together with -arch ppc64 produces a "fat" binary, that is, one that can be run on either 32-bit or 64-bit systems. When a fat binary is run on a 64-bit system, it runs as a 64-bit executable. On the other hand, when the same fat binary is run on a 32-bit system, it runs as a 32-bit executable. Specifying the -arch ppc compiler flag alone produces a 32-bit executable; since 32-bit is the default, it is unnecessary to specify this flag on its own. Additionally, the -Wconversion compiler flag may be useful when converting 32-bit code to 64-bit code. The __LP64__ and __ppc64__ macros can be used to conditionally compile 64-bit code. At the time of this writing, only C and C++ can be compiled in 64-bit mode. Following is a list of things to bear in mind when engaging in 64-bit computing on Tiger.

Tiger follows the LP64 64-bit data model, also used by Sun and SGI: ints are 32-bit, while longs, long longs, and pointers are 64-bit.
In 64-bit code, ints cannot hold pointers.
Use of a cast between a 64-bit type and a 32-bit type can destroy data.
In Tiger, only non-GUI applications can be compiled as 64-bit. You can, however, use a 32-bit GUI to launch and control a 64-bit application.
Compiling an application as 64-bit produces a 64-bit version of the Mach-O binary format used in Mac OS X. You can determine if a program was compiled as 64-bit, 32-bit, or fat using the file command.
64-bit applications may use only 64-bit frameworks, while 32-bit applications may use only 32-bit frameworks.
Tiger ships with only two 64-bit frameworks: System and Accelerate.
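Two of the points above, conditional compilation with __LP64__ and the fact that an int cannot hold a pointer in 64-bit code, can be illustrated with a short sketch. The function names are hypothetical, and the example assumes that <stdint.h> provides uintptr_t, an integer type wide enough to hold a pointer in both 32-bit and 64-bit builds.

```c
#include <stdint.h>

/* Returns 1 when compiled under an LP64 data model, 0 otherwise. */
int compiled_as_lp64(void)
{
#ifdef __LP64__
    return 1;
#else
    return 0;
#endif
}

/* Round-trips a pointer through uintptr_t. Doing the same through
   an int would discard the upper 32 bits in a 64-bit build. */
int pointer_round_trips(void *p)
{
    uintptr_t bits = (uintptr_t) p;
    return (void *) bits == p;
}
```

Code that currently stores pointers in int or unsigned int variables is a likely candidate for this kind of change when it is ported to 64-bit.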

11.3.3. Endian-ness

CPU architectures are designed to treat the bytes of words in memory as being arranged in big- or little-endian order. Big-endian ordering places the most significant byte at the lowest address, while little-endian ordering places the most significant byte at the highest address.
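One way to see the difference is to store a known value in a word and examine its lowest-addressed byte. The following is a minimal sketch; the function name host_is_big_endian is hypothetical.

```c
/* Returns 1 on a big-endian host, 0 on a little-endian host. */
int host_is_big_endian(void)
{
    unsigned int word = 1;

    /* On a big-endian host the lowest-addressed byte is the most
       significant one, which is 0 for the value 1. */
    return *(unsigned char *) &word == 0;
}
```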

The PowerPC is bi-endian, meaning that the CPU is instructed at boot time to order memory as either big- or little-endian. In practice, bi-endian CPUs run exclusively as big- or little-endian. In general, Intel architectures are little-endian, while most, but not all, Unix/RISC machines are big-endian. Table 11-4 summarizes the endian-ness of various CPU architectures and operating systems. As shown in Table 11-4, Mac OS X is big-endian.

Table 11-4. Endian-ness of some operating systems
CPU type	Operating system	Endian-ness
DEC Alpha	Digital Unix	little-endian
DEC Alpha	VMS	little-endian
Hewlett Packard PA-RISC	HP-UX	big-endian
IBM RS/6000	AIX	big-endian
Intel x86	Windows	little-endian
Intel x86	Linux	little-endian
Intel x86	Solaris x86	little-endian
Motorola PowerPC	Mac OS X	big-endian
Motorola PowerPC	Linux	big-endian
SGI R4000 and up	IRIX	big-endian
Sun SPARC	Solaris	big-endian
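When binary data moves between machines in different rows of this table, the byte order has to be handled explicitly. The sketch below shows two common approaches: reversing the bytes of a 32-bit value, and assembling a value from a byte buffer so that the result does not depend on the host's byte order. The function names swap32 and read_be32 are hypothetical. For network data, the standard htonl and ntohl functions declared in <arpa/inet.h> serve a similar purpose.

```c
#include <stdint.h>

/* Reverses the byte order of a 32-bit value. */
uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) <<  8) |
           ((v & 0x00FF0000u) >>  8) |
           ((v & 0xFF000000u) >> 24);
}

/* Reads a 32-bit big-endian value from a byte buffer.
   The result is the same on big- and little-endian hosts. */
uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t) p[0] << 24) |
           ((uint32_t) p[1] << 16) |
           ((uint32_t) p[2] <<  8) |
            (uint32_t) p[3];
}
```

Reading byte by byte, as in read_be32, tends to be the safer choice for file formats, since it needs no knowledge of the host at all.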

11.3.4. Inline Assembly

If your code contains any inline assembly, it will have to be rewritten. Heaven help you if you have to port a whole Just-In-Time (JIT) compiler! For information on the assembler and PowerPC machine language, see the Mac OS X Assembler Guide (/Developer/ADC Reference Library/documentation/DeveloperTools/Reference/Assembler/index.html).
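One way to limit the damage is to isolate each piece of inline assembly behind an architecture check and keep a plain C fallback alongside it. The following sketch assumes GCC's extended asm syntax and a 32-bit unsigned int. The function name count_leading_zeros is hypothetical, and the PowerPC path uses the cntlzw instruction.

```c
/* Hypothetical helper: number of leading zero bits in a 32-bit value. */
unsigned int count_leading_zeros(unsigned int x)
{
#if defined(__ppc__) || defined(__ppc64__) || defined(__powerpc__)
    unsigned int n;

    /* PowerPC: cntlzw counts the leading zeros of a 32-bit word. */
    __asm__ ("cntlzw %0, %1" : "=r" (n) : "r" (x));
    return n;
#else
    /* Portable fallback for every other architecture. */
    unsigned int n = 0;

    if (x == 0)
        return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
#endif
}
```

With this arrangement, the port to a new architecture can start from the C fallback and add a tuned assembly path later, if profiling shows that one is needed.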