If you want to build a high-performance application that runs on multiprocessor machines, you must be aware of CPU cache lines. When a CPU reads a byte from memory, it does not just fetch the single byte; it fetches enough bytes to fill a cache line. Cache lines consist of 32 or 64 bytes (depending on the CPU) and are always aligned on 32-byte or 64-byte boundaries. Cache lines exist to improve performance. Usually, an application manipulates a set of adjacent bytes. If these bytes are in the cache, the CPU does not have to access the memory bus, which requires much more time.
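To make the alignment rule concrete, the following minimal sketch computes which cache line an address falls in. It assumes a 64-byte line and a flat address space in which a pointer converts directly to an integer; the names `kCacheLineSize`, `CacheLineIndex`, `CacheLineBase`, and `SameCacheLine` are illustrative and do not come from any Windows header.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed line size for illustration; real CPUs may use 32, 64, or another size.
constexpr std::size_t kCacheLineSize = 64;

// Index of the cache line that contains the given address.
inline std::uintptr_t CacheLineIndex(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLineSize;
}

// Address of the first byte of the cache line that contains the given address.
inline std::uintptr_t CacheLineBase(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) &
           ~(static_cast<std::uintptr_t>(kCacheLineSize) - 1);
}

// True if both addresses would be fetched as part of the same cache line.
inline bool SameCacheLine(const void* a, const void* b) {
    return CacheLineIndex(a) == CacheLineIndex(b);
}
```

With a buffer that starts on a line boundary, bytes 0 through 63 report the same line and byte 64 reports the next one.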
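However, cache lines make memory updates more difficult in a multiprocessor environment, as you can see in this example:
1. CPU1 reads a byte, causing this byte and its adjacent bytes to be read into CPU1's cache line.
2. CPU2 reads the same byte, causing the same bytes to be read into CPU2's cache line.
3. CPU1 changes the byte, causing the new value to be written to CPU1's cache line but not yet to RAM.
4. CPU2 reads the same byte again. Because the byte is already in CPU2's cache line, CPU2 does not access memory and does not see the new value.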
This scenario would be disastrous. Of course, chip designers are well aware of this problem and design their CPUs to handle this. Specifically, when a CPU changes bytes in a cache line, the other CPUs in the machine are made aware of this and their cache lines are invalidated. So in the scenario above, CPU2's cache is invalidated when CPU1 changes the value of the byte. In step 4, CPU1 has to flush its cache to RAM and CPU2 has to access memory again to refill its cache line. As you can see, the cache lines can help performance, but they can also be a detriment on multiprocessor machines.
What all of this means is that you should group your application's data together in cache line-size chunks and on cache-line boundaries. The goal is to make sure that different CPUs access different memory addresses separated by at least a cache line boundary. Also, you should separate your read-only data (or infrequently read data) from read-write data. And you should group together pieces of data that are accessed around the same time.
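One way to picture the "separated by at least a cache line" guideline is a set of counters, each updated by a different CPU. The sketch below contrasts a packed layout with a padded one; it assumes a 64-byte line, and `PaddedCounter`, `PackedCounters`, and `SpreadCounters` are hypothetical names chosen for illustration.

```cpp
#include <cstddef>

// Assumed line size for illustration; adjust for the target CPU.
constexpr std::size_t kCacheLineSize = 64;

// A read-write counter padded so that each instance occupies a full line's worth of bytes.
struct PaddedCounter {
    long value;
    char pad[kCacheLineSize - sizeof(long)];
};
static_assert(sizeof(PaddedCounter) == kCacheLineSize,
              "each padded counter should span exactly one cache line's worth of bytes");

// Packed layout: neighbouring counters are only sizeof(long) bytes apart,
// so several of them land in the same cache line.
struct PackedCounters {
    long value[4];
};

// Spread layout: neighbouring counters are kCacheLineSize bytes apart,
// so no two 'value' members can fall in the same cache line.
struct SpreadCounters {
    PaddedCounter slot[4];
};
```

The spread layout spends more memory per counter in exchange for keeping each CPU's writes out of its neighbours' cache lines.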
Here is an example of a poorly designed data structure:
    struct CUSTINFO {
       DWORD    dwCustomerID;     // Mostly read-only
       int      nBalanceDue;      // Read-write
       char     szName[100];      // Mostly read-only
       FILETIME ftLastOrderDate;  // Read-write
    };
Here is an improved version of this structure:
    // Determine the cache line size for the host CPU.
    #ifdef _X86_
    #define CACHE_ALIGN  32
    #endif
    #ifdef _ALPHA_
    #define CACHE_ALIGN  64
    #endif
    #ifdef _IA64_
    #define CACHE_ALIGN  ??
    #endif

    #define CACHE_PAD(Name, BytesSoFar) \
       BYTE Name[CACHE_ALIGN - ((BytesSoFar) % CACHE_ALIGN)]

    struct CUSTINFO {
       DWORD    dwCustomerID;     // Mostly read-only
       char     szName[100];      // Mostly read-only

       // Force the following members to be in a different cache line.
       CACHE_PAD(bPad1, sizeof(DWORD) + 100);

       int      nBalanceDue;      // Read-write
       FILETIME ftLastOrderDate;  // Read-write

       // Force the following structure to be in a different cache line.
       CACHE_PAD(bPad2, sizeof(int) + sizeof(FILETIME));
    };
The CACHE_PAD macro defined above is good but not great. The problem is that you must manually enter each member variable's byte size in the macro. If you add, move, or remove a data member, you must also update the call to the CACHE_PAD macro. In the future, Microsoft's C/C++ compiler will support a new syntax that makes aligning data members easier. It will look something like __declspec(align(32)).
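Standard C++11 offers `alignas`, which lets the compiler compute the padding instead of the programmer. The following is a minimal sketch of how the structure might look with it, not the author's code. It assumes a 64-byte cache line, and it defines stand-ins for `DWORD` and `FILETIME` so that it compiles without the Windows headers.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed line size for illustration; adjust for the target CPU.
constexpr std::size_t kCacheLine = 64;

// Stand-ins for the Windows types so the sketch builds without <windows.h>.
using DWORD = std::uint32_t;
struct FILETIME {
    DWORD dwLowDateTime;
    DWORD dwHighDateTime;
};

struct CUSTINFO {
    // Mostly read-only members start on a cache-line boundary.
    alignas(kCacheLine) DWORD dwCustomerID;
    char szName[100];

    // Read-write members start on the next boundary;
    // the compiler inserts whatever padding is required.
    alignas(kCacheLine) int nBalanceDue;
    FILETIME ftLastOrderDate;
};

// Compile-time checks that keep working when members are added, moved, or removed.
static_assert(alignof(CUSTINFO) == kCacheLine,
              "the structure itself must start on a cache-line boundary");
static_assert(offsetof(CUSTINFO, nBalanceDue) % kCacheLine == 0,
              "read-write members must start a new cache line");
static_assert(sizeof(CUSTINFO) % kCacheLine == 0,
              "array elements must not share a cache line");
```

Objects with static or automatic storage honor this alignment. For heap allocation with `new`, over-aligned types are only guaranteed to be handled correctly from C++17 onward.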
NOTE
It is best for data to always be accessed by a single thread (function parameters and local variables are the easiest way to ensure this) or by a single CPU (using thread affinity). If you do either of these, you avoid cache line issues entirely.
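The single-thread advice can be sketched with standard C++ threads: each worker accumulates into a local variable on its own stack and touches shared memory only once, at the end. `SumRange` and `ParallelSum` are hypothetical names used for illustration, and the example uses `std::thread` rather than the Win32 thread functions so that it stays portable.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Sums data[begin, end) into a local variable and publishes the total once.
void SumRange(const std::vector<int>& data, std::size_t begin, std::size_t end,
              long long* out) {
    long long local = 0;  // lives on this thread's stack; no other thread touches it
    for (std::size_t i = begin; i < end; ++i) {
        local += data[i];
    }
    *out = local;  // the only write this thread makes to shared memory
}

// Splits the input into one chunk per thread and combines the partial sums.
long long ParallelSum(const std::vector<int>& data, unsigned threadCount) {
    if (threadCount == 0) threadCount = 1;
    std::vector<long long> partial(threadCount, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + threadCount - 1) / threadCount;
    for (unsigned t = 0; t < threadCount; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back(SumRange, std::cref(data), begin, end, &partial[t]);
    }
    for (auto& w : workers) w.join();

    long long total = 0;
    for (long long p : partial) total += p;
    return total;
}
```

The elements of `partial` still sit next to each other in memory, but each is written only once per thread, so the cost of any shared cache line is paid once rather than on every loop iteration.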
