8.3 Architectural Issues | Mac OS X Panther for Unix Geeks

There are a few architectural issues to be aware of when developing or porting software on Mac OS X. In particular, pointer size, endianness, and inline assembly code tend to be the most common issues developers run into.

On a 32-bit system, such as Mac OS X running on the G3 or G4, C pointers are 32 bits (4 bytes). On a 64-bit system, they are 64 bits (8 bytes). As long as your code does not rely on any assumptions about pointer size, it should be 64-bit clean. For example, on a 32-bit system, the following program prints "4", and on a 64-bit system, it prints "8":

 #include <stdio.h>

 int main(void)
 {
   /* sizeof yields a size_t, so cast it to match the format specifier */
   printf("%lu\n", (unsigned long) sizeof(void *));
   return 0;
 }

Some 64-bit operating systems, such as Solaris 8 on Ultra hardware (sun4u), have a 64-bit kernel space but support both 32- and 64-bit applications, depending on how they are compiled. On a G5 system, the pointer size is 64 bits. Other data types are mapped onto the 64-bit data type. For example, single-precision floats, which are 32 bits wide, are converted to double precision when they are loaded into registers. In the registers, single-precision instructions operate on these values stored as doubles, and the results are rounded back to single precision (32 bits). Quad-precision floating-point numbers, defined by the IEEE as 128 bits wide, are not directly supported on current PowerPC hardware.

Apple has provided at least two technical notes containing information and advice on optimizing code to take advantage of the G5 architecture:

TN2086: Tuning for G5: A Practical Guide http://developer.apple.com/technotes/tn/tn2086.html
TN2087: PowerPC G5 Performance Primer http://developer.apple.com/technotes/tn/tn2087.html

Additional information can be found at http://developer.apple.com/hardware/ve/g5.html. These documents describe in detail the issues involved in tuning code for the G5. We note only a few issues here.

The architecture of the G5 allows for much greater performance relative to the G4. This performance potential is partly due to the fact that the G5 allows 200 instructions in core, compared to only 30 for the G4. Moreover, the G5 has 16 pipeline stages, 2 load/store units, and 2 floating-point units, compared to 7 pipeline stages, 1 load/store unit, and 1 floating-point unit on the G4. The L1 cacheline size is also 128 bytes on the G5, compared to 32 bytes on the G4. Additionally, processor and memory bandwidth are much greater on the G5 than on the G4. The technical notes mentioned earlier in this section have additional information on hardware differences.

One important implication of the greater number of pipeline stages on the G5 relative to the G4 is that instruction latencies are greater on the G5. You can often gain significant improvements in performance by using performance tools to identify loops that account for a large percentage of computation time. Once you have identified these loops, you can either unroll them manually or use the -funroll-loops compiler flag. The compiler flag -mtune=970 can also be useful in this situation, as it schedules code more efficiently for the G5. The -fast compiler flag sets these options (among others) automatically.

To better take advantage of the longer cacheline size in L1 cache on the G5, algorithms should be designed for greater data locality, and use contiguous memory accesses when possible. For example, arrays in C store entries row-wise. To ensure contiguous memory accesses, design your code so that it accesses array elements row-by-row. The G5 has four hardware prefetchers, which (if accesses to memory are contiguous) are triggered automatically to help reduce cache misses. Performance tools, such as the CHUD suite (see Chapter 9), can help you optimize code by profiling computation and memory usage; some of them even make suggestions on how to improve performance.

CPU architectures are designed to treat the bytes of words in memory as being arranged in big- or little-endian order. Big-endian ordering stores the most significant byte at the lowest address, while little-endian ordering stores the most significant byte at the highest address.

The PowerPC is biendian, meaning that the CPU is instructed at boot time to order memory as either big or little endian. In practice, biendian CPUs run exclusively as big or little endian. In general, Intel architectures are little endian, while most, but not all, Unix/RISC machines are big endian. Table 8-4 summarizes the endianness of various CPU architectures and operating systems. As shown in Table 8-4, Mac OS X is big endian.

Table 8-4. Endianness of some operating systems

CPU type	Operating system	Endianness
DEC Alpha	Digital Unix	little endian
DEC Alpha	VMS	little endian
Hewlett Packard PA-RISC	HP-UX	big endian
IBM RS/6000	AIX	big endian
Intel x86	Windows	little endian
Intel x86	Linux	little endian
Intel x86	Solaris x86	little endian
Motorola PowerPC	Mac OS X	big endian
Motorola PowerPC	Linux	big endian
SGI R4000 and up	IRIX	big endian
Sun SPARC	Solaris	big endian

As far as inline assembly code is concerned (if you have any), it will have to be rewritten. Heaven help you if you have to port a whole Just-In-Time (JIT) compiler! For information on the assembler and PowerPC machine language, see the Mac OS X Assembler Guide (/Developer/Documentation/DeveloperTools/Reference/Assembler/index.html).