11.5 Multiprocessor Environments and Processor Affinity


Having multiple processors in your system can certainly be an advantage; having two people to do your work for you is certainly better than one. As far as scheduling processes is concerned, though, multiple processors pose some interesting operating system design decisions. When we have two processors available, which processor should a thread execute on? Ideally, a thread should execute on the processor it was executing on previously. In an ideal world, this is exactly what happens: threads know where they executed last and would prefer to execute there next time around. Obviously, if that particular processor is busy, the operating system has to decide what to do next. Do we let the thread block and wait for the processor to become free, or do we move it to another processor? The answer is that unless instructed otherwise, the scheduler will choose the next best processor based on the overall load averages of all processors and how busy each processor is at the moment. Remember our discussion regarding scheduling threads? Every second, the kernel recalculates the priority of every active thread. At the same time, it recalculates the load averages for all processors in the system. This gives the kernel a good picture of how busy each processor is. When deciding on the best processor for a particular thread, these statistics are part of the decision-making process. The question is, "Can we influence that decision-making process by asking for certain processors to be considered special for a particular group of processes?"

11.5.1 cc-NUMA and other deviants

HP currently sells multiprocessor solutions where groups of processors are physically distant from other processors. For example, take a multi-node Scalable Computing Architecture (SCA) complex of four V-Class nodes, or even a Superdome complex consisting of 16 cells, each cell having four processors installed. Conceptually, we could have an architecture that looks something like the diagram in Figure 11-7.

Figure 11-7. Multiprocessor configuration.



This has a bearing on the decision of where to run a thread. The idea here is that we have configured our hardware in such a way that all the CPUs are part of the same partition/server configuration. If a thread was executing on a processor in node/cell(0) and the operating system decided to move that thread to node/cell(1), the operating system would have to move the data and instructions located in RAM on node/cell(0) across the high-speed bus/switch to node/cell(1) before it could start executing the thread again. The time it takes to move the data and instructions is a latency that we cannot ignore. If the high-speed bus has enough bandwidth to accommodate every node/cell conversing simultaneously, the latency involved may be minimal. There are two issues here:

  • Is there a perceptible latency moving data from one node/cell to another?

  • If there is a latency involved, can we do anything about it?

The answer to the first question depends on the underlying architecture of the high-speed bus/switch and how each node/cell interfaces with it. Theoretically, the high-speed bus/switch needs to operate at an aggregate bandwidth equal to the bandwidth of the bus internal to each and every node/cell in the configuration. This would accommodate every node communicating over the bus all the time with no chance of experiencing a delay in communicating over the bus/switch. Having a bus/switch with such a bandwidth is seldom possible. This is not necessarily a minus point; it is also seldom that every node will need to communicate over the bus/switch simultaneously all the time. There will be a proportion of the time when a node/cell is able to resolve data and instruction requests from node/cell-local memory. This raises the question of how the operating system views this hardware configuration. In other words, does the operating system have any concept of a locality domain? If so, the operating system understands the distance between individual processors and non-local memory, and it will factor this latency into its equation regarding the next best processor to execute a thread. The latency in moving data and instructions between nodes/cells makes access to non-local memory non-uniform. When choosing the next processor for a thread to execute on, the operating system will choose, in order of preference:

  • The same processor the thread ran on previously

  • A processor in the same locality domain

  • Some other processor based on the load average of all other processors in the system

This idea is at the heart of a cc-NUMA (cache-coherent Non-Uniform Memory Access) hardware architecture. Here's our situation: We have HP hardware that is physically configured similar to Figure 11-7, but the operating system does not understand the distance between nodes/cells. Currently, HP-UX 11i version 1 has no concept of the cc-NUMA architecture features of a cell-based system. Subsequent releases of HP-UX (11i version 2) have started to introduce concepts such as cell-local memory, whereby we as administrators can configure resources so that data/instructions will not move to other nodes/cells unless absolutely necessary. Having multiple locality domains has a distinct influence on the scheduling of processes/threads in a multiprocessor environment: in versions of HP-UX that support cc-NUMA features, each locality domain introduces a separate scheduling allocation domain into the algorithm for choosing the next-best processor to execute a thread. Currently, HP-UX 11i version 1 views all CPU/RAM as a simple SMP configuration where access to any memory location takes the same time regardless of where it is physically located. As such, HP-UX 11i version 1 views all processors, memory, and associated devices as one locality domain and, hence, one scheduling allocation domain.

If we are running HP-UX 11i version 1 in such a configuration, is there anything we can do about inter-cell latency? Yes, there is. We can utilize one of two features to help locate processes/threads with a certain locality:

  • Utilize the mpctl() system call to ensure that a process/thread is executed within a certain processor or locality domain.

  • Utilize the concept of Processor Sets to set aside certain processors in their own scheduling allocation domain, to be used by an application.

11.5.2 The mpctl() system call and processor affinity

What we are trying to achieve here is to ensure that all threads for a given process are executed on a particular processor or locality domain. This is known as processor or locality domain affinity. In HP-UX 11i version 1, we have only one big locality domain, so we are restricted to locating a process and its threads onto a particular processor. This can still be advantageous, because the data/instructions frequently used by the threads will already be loaded in the processor's cache. This can significantly speed up processing time when we don't have to relocate the data/instructions for threads to another processor, especially if that involves communicating over a switch/bus with its inherent latencies; however, there are drawbacks with this scenario:

  • If we are tying all of a process's threads to a particular processor, we may not achieve maximum parallelism, because we lose the benefit of having separate threads running on separate processors. The only way to alleviate this is to ensure that the application was coded in such a way that each individual thread has the intelligence to control its own locality by making calls to the mpctl() system call itself.

  • This configuration does not stop other threads from executing on this processor. The scheduler can still choose this processor for other threads on the system. If the scheduler chooses this processor for a different thread, lines in the TLB and cache may have to be flushed in order to run that thread. When the original thread is allowed to return to this processor, we will need to reload the TLB and cache before the thread can execute. It may have been more efficient to allow the thread to migrate to another processor.

  • If the application hasn't been explicitly coded to take advantage of the mpctl() system call, we will have to write our own interface to ensure that the application processes/threads execute under processor affinity controls. This may go against application support criteria or against internal support standards.

  • The comment from the man page for mpctl() is quite telling: "Much of the functionality of this capability is highly dependent on the underlying hardware. An application that uses this system call should not be expected to be portable across architectures or implementations."

  • Anyone considering using this idea in conjunction with real-time priority processes must consult the man page for this system call. Compliance with POSIX real-time priority processing is not guaranteed if you force a process onto a particular processor.

  • On systems that support locality domains, we need to understand the launch policy for processes. The launch policy determines how processes are distributed around locality domains. The man page for mpctl() discusses this at length.

I can't address the first two points; you will need to test such a configuration in your own environment, under a typical workload. I can address the third point by showing you a simple program I wrote to exploit the mpctl() system call. On this example system, I have four processors and I am going to attempt to launch a process on a particular processor. In doing so, all the threads created by the process will inherit the processor scheduling requests of the parent, i.e., processor affinity requirements. Let's have a look at my test programs in operation (the source code for these programs is available in Appendix B):

 

 # ./numCPU
 Number of locality domains = 1
 Number of processors = 4
 #
 # ioscan -fnkC processor
 Class       I  H/W Path  Driver    S/W State H/W Type  Description
 ===================================================================
 processor   0  2/10      processor CLAIMED   PROCESSOR Processor
 processor   1  2/11      processor CLAIMED   PROCESSOR Processor
 processor   2  2/12      processor CLAIMED   PROCESSOR Processor
 processor   3  2/13      processor CLAIMED   PROCESSOR Processor
 #

If I want to tie a process to a particular processor, I will need to use the processor Instance number along with the program name. In this example, I will use a compute-bound program and launch five copies of it, all on the same processor (processor 2 = 2/12).

 

 # ./setCPU 2 ./bigcpu0
 Number of locality domains = 1
 Number of processors = 4
 New processor for proc > 5233 < == 2
 # ./setCPU 2 ./bigcpu1
 Number of locality domains = 1
 Number of processors = 4
 New processor for proc > 5235 < == 2
 # ./setCPU 2 ./bigcpu2
 Number of locality domains = 1
 Number of processors = 4
 New processor for proc > 5237 < == 2
 # ./setCPU 2 ./bigcpu3
 Number of locality domains = 1
 Number of processors = 4
 New processor for proc > 5239 < == 2
 # ./setCPU 2 ./bigcpu4
 Number of locality domains = 1
 Number of processors = 4
 New processor for proc > 5242 < == 2
 #
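The real source for setCPU is listed in Appendix B. Purely to illustrate the shape of the mpctl() interface, here is a minimal sketch of what such a launcher might look like; the request names (MPC_GETNUMLDOMS, MPC_GETNUMSPUS, MPC_SETPROCESS) come from the mpctl(2) man page, but treat the code as an approximation rather than the Appendix B program:

 /* setCPU.c -- illustrative sketch of a launcher that binds a command
  * to one processor via mpctl(2).  This is NOT the Appendix B source;
  * option parsing and error handling are kept to a minimum. */
 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <sys/mpctl.h>

 int main(int argc, char *argv[])
 {
     spu_t cpu;

     if (argc < 3) {
         fprintf(stderr, "usage: %s <cpu> <command> [args ...]\n", argv[0]);
         exit(1);
     }
     cpu = (spu_t)atoi(argv[1]);

     printf("Number of locality domains = %d\n",
            mpctl(MPC_GETNUMLDOMS, 0, 0));
     printf("Number of processors = %d\n",
            mpctl(MPC_GETNUMSPUS, 0, 0));

     /* Bind this process to the requested processor.  The binding
      * survives the exec() below and is inherited by any threads
      * the command subsequently creates. */
     if (mpctl(MPC_SETPROCESS, cpu, getpid()) == -1) {
         perror("mpctl(MPC_SETPROCESS)");
         exit(1);
     }
     printf("New processor for proc > %d < == %d\n",
            (int)getpid(), (int)cpu);

     execvp(argv[2], &argv[2]);      /* run the command, bound to cpu */
     perror("execvp");
     exit(1);
 }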

I will now look at the CPU Report screen in glance (from the main screen, press "a" to get to the CPU Report).

As you can see from Figure 11-8, CPU 2 is 100 percent busy while the other processors in the system are effectively standing idle. You can also see that the bar graph is showing overall system CPU utilization at 25 percent. This makes sense because only one-fourth of the overall processor capacity is being utilized. The fact that one processor is doing all the work is a different discussion altogether. Is this a good thing? It depends. For some applications, being tied to one processor will improve performance, because individual processes/threads are constantly using the instructions/data already loaded in a particular processor's cache. For other applications, individual threads will need to load their own instructions and data anyway. In those situations, tying threads to a particular processor may yield no performance improvement; in fact, such a configuration may be detrimental to the application and to overall system performance. It all depends on the individual application. A benefit of using this solution is that you can fine-tune the use of the mpctl() calls based on your particular configuration, especially if you have more than one locality domain.

Figure 11-8. Processor affinity.

One thing to note is that only the root user is allowed to use the mpctl() system call. If we want other users to use it, we must give their group the MPCTL privilege using the setprivgrp command.
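For example, assuming a group called users (the same group used in the Processor Sets discussion later) and that the group holds no other privileges, the commands would look something like this:

 # setprivgrp users MPCTL
 # getprivgrp users
 users: MPCTL
 #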

11.5.3 Processor Sets

An alternative to coding a self-built solution using the mpctl() system call is to use a piece of software known as Processor Sets, which is free to download from http://software.hp.com. The idea with Processor Sets is to create a scheduling allocation domain from the processors currently in your system. This effectively means that you create a subset of processors that is accessible only to specific users/applications. When we launch an application, we can launch it within a specific Processor Set. The resulting threads will be allowed to execute on any processor within that Processor Set. All other threads are limited to the default Processor Set (Processor Set 0), which must contain at least one processor and which is used by all processes/threads not assigned to a specific Processor Set.

Installing the software requires a reboot. Once installed, we need to set up the appropriate processor sets for the number of applications/user groups on the system. Here is the default Processor Set that gets created (and cannot be removed) once the software is installed:

 

 # psrset
 PSET        0
 SPU_LIST    0    1    2    3
 OWNID       0
 GRPID       0
 PERM        755
 IOINTR      ALLOW
 NONEMPTY    DFLTPSET
 EMPTY       FAIL
 LASTSPU     DFLTPSET
 #

Starting from the default Processor Set, I can create subsequent Processor Sets containing specific processors. If I am trying to keep data within a particular cell/node, I will need to understand the underlying hardware architecture of how these processors are physically interconnected. I will assume that we have already established this and have decided to create two additional Processor Sets: one with two processors (processors 1 and 2) and one with a single processor (processor 3), leaving one processor (processor 0, in the default Processor Set) for any other processes/threads that are not executed within a specific Processor Set:

 

 # psrset -c 1 2
 successfully created pset 1
 successfully assigned processor 1 to pset 1
 successfully assigned processor 2 to pset 1
 #
 # psrset -c 3
 successfully created pset 2
 successfully assigned processor 3 to pset 2
 # psrset
 PSET        0
 SPU_LIST    0
 OWNID       0
 GRPID       0
 PERM        755
 IOINTR      ALLOW
 NONEMPTY    DFLTPSET
 EMPTY       FAIL
 LASTSPU     DFLTPSET
 PSET        1
 SPU_LIST    1    2
 OWNID       0
 GRPID       3
 PERM        755
 IOINTR      ALLOW
 NONEMPTY    DFLTPSET
 EMPTY       FAIL
 LASTSPU     DFLTPSET
 PSET        2
 SPU_LIST    3
 OWNID       0
 GRPID       3
 PERM        755
 IOINTR      ALLOW
 NONEMPTY    DFLTPSET
 EMPTY       FAIL
 LASTSPU     DFLTPSET
 #

As you can see, I don't assign names to Processor Sets; the psrset command will simply create a Processor Set and assign it the next available number. I can now launch applications within a Processor Set by using the psrset command:

 

 # psrset -e 1 ./bigcpu0 &
 [1]     3355
 # psrset -e 1 ./bigcpu1 &
 [2]     3357
 # psrset -e 1 ./bigcpu2 &
 [3]     3359
 # psrset -e 1 ./bigcpu3 &
 [4]     3362
 #
 # ps -zZ 1
 PSET    PID TTY       TIME COMMAND
    1   3363 pts/ta    1:26 bigcpu3
    1   3360 pts/ta    1:31 bigcpu2
    1   3356 pts/ta    1:49 bigcpu0
    1   3358 pts/ta    1:46 bigcpu1
 #

If we take a look at the CPU report in glance again (Figure 11-9), we should see that the processors in this Processor Set are somewhat busy.

Figure 11-9. Processor Sets in glance .

From Figure 11-9, we can see that, as predicted, processors 1 and 2 are now busy running my application. Any process/thread not assigned to a specific Processor Set will execute on the processor(s) in the default Processor Set. We can bind existing processes/threads to a Processor Set by using the psrset -b <pset_id> <pid> command. The man page for psrset is relatively straightforward to follow. There are a couple of points to note regarding Processor Sets:

  • Only the root user can assign processes to specific Processor Sets. We can change permissions and ownerships of Processor Sets using the psrset command to allow other users to manage a particular Processor Set. We can also assign the privilege PSET to a group of users using the setprivgrp command:

     

     # setprivgrp users PSET
     # getprivgrp users
     users: PSET
     #

  • The current Processor Set definitions do not survive a reboot, and there are no startup files to establish a permanent configuration. You will have to write your own startup script if you want to reestablish the configuration after every reboot (see the sketch after this list).

  • You can still use the mpctl() system call within Processor Sets, but you need to be careful regarding any assumptions you make about the current number of processors in a Processor Set, because processors can be added to and removed from a Processor Set online.

  • You can still use real-time priorities, but they will apply only to processors in the Processor Set of a specific process/thread.
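As an illustration of the startup-script point above, here is a minimal sketch of an /sbin/init.d-style script that would recreate the two Processor Sets used in this section; the path to psrset and the script layout are assumptions you should adapt to your own standards:

 #!/sbin/sh
 # Hypothetical startup script: recreate the Processor Sets used in
 # this section after a reboot.  Assumes psrset is installed as
 # /usr/sbin/psrset and that pset numbers are allocated in creation
 # order (1, then 2), as in the earlier examples.
 case "$1" in
 start_msg) echo "Creating Processor Sets" ;;
 start)
         /usr/sbin/psrset -c 1 2    # becomes pset 1: processors 1 and 2
         /usr/sbin/psrset -c 3      # becomes pset 2: processor 3
         ;;
 stop_msg)  echo "Removing Processor Sets" ;;
 stop)
         /usr/sbin/psrset -d 2      # destroy pset 2
         /usr/sbin/psrset -d 1      # destroy pset 1
         ;;
 *)      echo "usage: $0 {start|stop}" ;;
 esac
 exit 0

Linked from the appropriate /sbin/rc*.d sequence directory, this would run at every boot in the usual HP-UX startup fashion.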

We can also create and manage Processor Sets from within a piece of software known as Process Resource Manager, which we look at later.

11.5.4 Concurrency in multiprocessor environments

When we have more than one processor in a system, the operating system has to provide a mechanism whereby many threads and many processes can execute code concurrently on any processor while protecting global data structures used by the operating system itself. The issue here is controlling access to these global data structures: we cannot allow one processor to access a global data structure while another processor is updating it. The mechanism employed by the operating system involves locking global structures while updates to those structures are performed. The choice of locking strategy is a performance issue for the operating system developers themselves; we as administrators can do nothing about this except hope that the kernel developers have made the right choice. The way in which the operating system locks a data structure can have a major influence on overall system performance. You may have heard of some of these locking strategies. The two principal strategies are spinlocks and semaphores. Both locking strategies surround critical pieces of code that update global data structures. The main difference is in what happens to the current thread when trying to acquire a particular lock.

  • If a processor attempts to obtain a spinlock that is held by another processor, it will go into a busy wait state until the spinlock is released. This is commonly referred to as "spinning on a spinlock". While a processor holds a spinlock, interrupts are disabled and the currently executing thread is not allowed to go to sleep until the lock is released. Consequently, spinlocks are used to protect regions of code where an update to a critical data structure is not expected to take much time. The operating system uses spinlocks to protect various resources such as filesystems, memory, networking, and other data structures in the operating system.

  • Semaphores control access to global data structures by using a blocking strategy. This is a subtle difference from a spinlock, but it is easily understood: when a processor attempts to acquire a semaphore already held by another processor, it simply puts its current thread to sleep and context-switches to another process. These semaphores can be thought of as similar in concept to the semaphores used by processes to synchronize activities between themselves.

As I mentioned above, there is nothing we can really do to influence the use of either spinlocks or semaphores; it was the choice of the operating system designers. In a multiprocessor system, the decision of whether to use spinlocks or blocking semaphores is a performance trade-off between the expected time spent in the busy wait state and the overhead of a process context switch. Where we may get involved with these locking strategies is when a section of operating system code takes longer than expected to complete while holding a spinlock, for example. This may cause unexpected delays, and hence unexpected problems, in other areas of code. In order to resolve such a situation, HP will need to rewrite the offending code, which we will then install via a patch.
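Although the kernel's locks are out of our hands, the trade-off itself can be demonstrated in user space. The sketch below contrasts the two strategies using POSIX threads, where a pthread spinlock busy-waits exactly as described above and a mutex puts the caller to sleep. It is an analogy only (pthread spinlocks are an optional POSIX feature and may not exist on every HP-UX release), not kernel code:

 /* lockdemo.c -- user-space analogy of the two kernel locking
  * strategies: a spinlock busy-waits, a mutex blocks (sleeps).
  * Compile with a POSIX threads library, e.g. cc lockdemo.c -lpthread */
 #include <pthread.h>
 #include <stdio.h>

 static pthread_spinlock_t spin;                          /* busy-waits */
 static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;  /* blocks    */
 static long counter = 0;

 static void *worker(void *arg)
 {
     int i;

     (void)arg;
     for (i = 0; i < 100000; i++) {
         /* Short critical section: spinning is cheaper than the
          * context switch a blocking lock would force. */
         pthread_spin_lock(&spin);
         counter++;
         pthread_spin_unlock(&spin);

         /* Potentially long critical section: blocking lets the
          * processor run another thread instead of spinning. */
         pthread_mutex_lock(&mtx);
         counter++;
         pthread_mutex_unlock(&mtx);
     }
     return NULL;
 }

 int main(void)
 {
     pthread_t t[4];
     int i;

     pthread_spin_init(&spin, PTHREAD_PROCESS_PRIVATE);
     for (i = 0; i < 4; i++)
         pthread_create(&t[i], NULL, worker, NULL);
     for (i = 0; i < 4; i++)
         pthread_join(t[i], NULL);
     printf("counter = %ld\n", counter);
     return 0;
 }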


