11.4 ProcessThread Priorities and Run Queues | HP-UX CSE(c) Official Study Guide and Desk Reference

11.4 Process/Thread Priorities and Run Queues

Process, and hence thread, priority is a numerical value with a theoretical range of -512 to +255. The smaller or more negative the priority, the more likely the associated thread will regain control of the CPU. These numbers are reported by the ps -l command. You might have seen a process called ttisr with a priority of -32. This is a very important priority. When this process wants to run, it's going to run. This is known as a POSIX real-time priority process. Your current priority has a direct influence on how your priority decays over time. Some priorities will never decay; they will remain the same all the time. Different ranges of priority mean different things to the kernel. In Figure 11-3, I have listed the various priority ranges and a brief description.

POSIX real-time priorities : Priority range -512 to -1. These priorities are the strongest priorities on the system. A thread with a priority in this range will preempt all other threads in the system. Commonly, we say that the kernel uses the POSIX real-time scheduler to schedule threads with a priority in this range. The idea of using a different scheduler is just to distinguish between the different scheduling policies used by the kernel. We use the term scheduler frequently in our discussions, but please note that we are really referring to just a different scheduling policy. The kernel parameter rtsched_numpri limits the default range of priorities in use to 32. The priorities in this range do not decay; they stay the same for the life of the process (unless you change them). Signals can be sent to these processes.
HP-UX real-time priorities : Priority range 0 to +127. HP introduced these priorities some years ago. A thread with a priority in this range will preempt every lower priority thread in the system. The priorities in this range do not decay; they stay the same for the life of the process. Threads in this priority range are said to use the HP-UX RTPRIO scheduling policy. Threads of the same priority will context switch over a timeslice. Signals can be sent to these processes.
Timeshare priorities : Priority range +128 to +255. This entire priority range indicates that the priority will decay over the lifetime of the thread/process. This entire priority range is divided between system and user timeshare priorities:

- System timeshare priorities : Priority range +128 to +177. This priority range is used for processes that are normally asleep, i.e., processes that have performed a system call and are waiting for IO. What the process is waiting for will determine whether we can send signals to the process.

- System timeshare priorities (non-interruptible) : If the kernel has assigned a priority in the range +128 to +153, the kernel is expecting the wait-event to return soon, e.g., waiting for an IO to complete. Such a process cannot be killed. The kernel will not send signals to such a process until the wait-event has completed. You may have noticed that you can't kill a process that is reading from tape, not even with a kill -9. Such a process is blocked on an event that the kernel assumes will complete soon, e.g., an IO completing. If the tape drive is broken, there is a chance that you will never be able to kill this process without a reboot. Your only chance to get rid of it is to make it a real-time process and then try to kill it. This might work, but it's not guaranteed. Certain wait-events are given specific priorities; see /usr/include/sys/param.h for more details.

- System timeshare priorities (interruptible) : Still within the range +128 to +177, except that we are looking at priorities +154 to +177 specifically. These priorities are again for processes that are asleep most of the time. The difference here is that the kernel is not expecting the wait-event to return soon. The classic process in this category is your shell, which spends most of its time asleep at priority 154. This puts the shell into this category. However, unlike our process waiting for IO that we couldn't kill, we can always kill processes in this priority range.
User timeshare priorities : Priority range +178 to +255. When we issue a simple command from a shell prompt, the process and subsequent threads will be assigned an initial priority of 178. Over time, the priority of the threads will decay in a linear fashion and regain in an exponential fashion, indicating that the scheduler gives preference to processes that are currently not running. The nice value assigned to a process is used in recalculating the priority of a thread. More on nice values later.

Figure 11-3. Priority ranges.
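The ranges in Figure 11-3 can be turned into a small lookup that names the band a given ps -l priority falls into. This is only an illustrative sketch: the function name classify_priority and the table layout are my own, not anything supplied by HP-UX; only the numeric boundaries come from the figure.

```python
# Priority bands as listed in Figure 11-3 (values as reported by ps -l).
# The table and function names are illustrative, not part of HP-UX.
PRIORITY_BANDS = [
    (-512, -1, "POSIX real-time"),
    (0, 127, "HP-UX real-time"),
    (128, 153, "system timeshare (non-interruptible)"),
    (154, 177, "system timeshare (interruptible)"),
    (178, 255, "user timeshare"),
]


def classify_priority(ps_priority):
    """Return the name of the band that a ps -l PRI value falls into."""
    for low, high, name in PRIORITY_BANDS:
        if low <= ps_priority <= high:
            return name
    raise ValueError(f"priority {ps_priority} is outside -512..255")


if __name__ == "__main__":
    for pri in (-32, 120, 152, 154, 178):
        print(pri, classify_priority(pri))
```

Feeding it the PRI column from ps -l is a quick way to tell whether a sleeping process sits in the band where signals are held back.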

When we issue a command, it is assigned a user timeshare priority unless we instruct the shell to invoke a different scheduler. We can use a different scheduler by using the rtsched or rtprio commands.

11.4.1 Scheduling policies and run queues

A scheduling policy will affect how the kernel treats priorities within a specified range. A task such as deciding which thread to choose when two threads have the same priority is an aspect of a scheduling policy. Different processes, and hence different threads, can employ different scheduling policies, depending on their processing requirements.

A run queue is a linked list of thread structures ordered by their priority. Threads are added to a run queue when they become runnable. When searching for the next eligible thread to assign to a processor, the kernel will search a run queue in order to find the most eligible thread. Being organized in priority order makes finding the most eligible thread a trivial task; the thread at the head of the list is probably the most eligible thread to regain control of the CPU. As we will see, the kernel maintains system-wide and per-processor run queues.

POSIX real-time scheduling policy : The POSIX real-time scheduling policy involves the highest priority threads on the entire system. Threads with a POSIX real-time priority can preempt any other thread on the system (of an equal or lower priority). The highest priority processes/threads have a POSIX real-time priority. To run a command using the POSIX real-time scheduler, we use the rtsched command. Because we have such high-priority threads, there are three scheduling policies associated with POSIX real-time priorities. Each scheduling policy will influence the selection criteria when we have two threads of the same priority. The three POSIX real-time scheduling policies we have are:

- SCHED_FIFO : The run queue for threads under this scheduling policy will be ordered on priority and then on the time the thread has been in the list without executing. Generally, the thread at the top of list has been in the list the longest time, and the thread at the tail of the list has been in the list the shortest time. The thread at the top of the list will be selected as the next thread to gain control of the CPU.

- SCHED_RR : This policy is similar to the SCHED_FIFO except that if there are threads of the same SCHED_RR priority, the scheduling policy will introduce a Round-Robin Interval, which is applied to all threads while they are running. The idea here is to avoid having one of the threads monopolize the CPU.

- SCHED_RR2 : This scheduling policy is defined but is not fully implemented in HP-UX and is equivalent to SCHED_RR.

NOTE : If you are going to use a POSIX real-time scheduling policy, it's probably best to choose only one of the above scheduling policies to ensure that all POSIX real-time priority processes are scheduled in the same manner.
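The SCHED_FIFO ordering rule (strongest priority first, then the thread that has waited longest) can be modeled in a few lines. This is a toy model under my own assumptions, not the kernel's data structure: it uses a heap rather than a linked list, and the class and method names are hypothetical.

```python
import heapq
import itertools


class FifoRunQueue:
    """Toy model of SCHED_FIFO selection; not the kernel's implementation."""

    def __init__(self):
        self._heap = []
        # Monotonically increasing arrival stamp used to break priority ties.
        self._arrival = itertools.count()

    def add(self, name, ps_priority):
        # A smaller (more negative) ps priority is stronger; among equal
        # priorities the earlier arrival sorts first.
        heapq.heappush(self._heap, (ps_priority, next(self._arrival), name))

    def next_thread(self):
        """Remove and return the most eligible thread."""
        _, _, name = heapq.heappop(self._heap)
        return name


if __name__ == "__main__":
    queue = FifoRunQueue()
    queue.add("A", -2)
    queue.add("B", -32)
    queue.add("C", -2)
    print([queue.next_thread() for _ in range(3)])
```

Thread B is selected first because -32 is the strongest priority; A precedes C because it has been queued longer. Under SCHED_RR, a running thread whose round-robin interval expired would additionally go back to the tail of its priority level.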

The priority we specify with the rtsched command is not a direct representation of what appears in the output of ps. Internally, the kernel uses positive integers to represent priorities, offset from the outside world by 512. Don't ask; that's just the way it is. There's much number juggling to convert one to the other. The priority we specify to rtsched is a number in the range 0 to (rtsched_numpri - 1). rtsched_numpri is a configurable kernel parameter that defines the number of POSIX real-time priorities allowed in the system as a whole. By default, rtsched_numpri is set to 32, which means that the priority we specify with the rtsched command is a number in the range 0 to 31, with 31 being the strongest priority. Even though we have specified a positive number for the priority, ps will see this differently (due to the way the kernel sees priorities). To us as administrators using the ps command, this means that to evaluate a ps priority from the priority we pass to rtsched, we do a little math:

    ps priority = -1 - rtsched priority
    ps priority = -1 - 31
    ps priority = -32
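The same conversion, written as a small helper, makes it easy to check a few values before running rtsched for real. The function name ps_priority is hypothetical, and the range check simply mirrors the 0 to (rtsched_numpri - 1) limit described in the text, with 32 assumed as the default.

```python
def ps_priority(rtsched_priority, rtsched_numpri=32):
    """Convert a priority passed to rtsched into the value ps -l reports.

    rtsched_numpri defaults to 32, the default kernel setting quoted in the
    text; pass a different value if the parameter has been tuned.
    """
    if not 0 <= rtsched_priority < rtsched_numpri:
        raise ValueError(
            f"rtsched priority must be in 0..{rtsched_numpri - 1}"
        )
    return -1 - rtsched_priority


if __name__ == "__main__":
    for pri in (0, 1, 31):
        print(pri, "->", ps_priority(pri))
```

An rtsched priority of 31 maps to -32, and a priority of 1 maps to -2.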

We can now apply this priority to existing processes or when we launch a new process.

IMPORTANT

If you make a new or existing process a POSIX real-time process, there is the possibility that it will preempt all other threads on the system forever. If the process is a CPU-bound process, as soon as it is context switched off the CPU after a timeslice, it will immediately return to the CPU unless there is another POSIX real-time process of the same or higher priority waiting in the run queue. The reason for this is that POSIX real-time priorities never decay. If you are going to test processes under this scheduling policy, it is strongly advised you have a POSIX real-time shell available that can preempt the other POSIX real-time processes on the system. It is strongly advised not to test this on a machine running Serviceguard because the cluster monitoring daemon cmcld runs at an HP-UX real-time priority (=20). A process at that priority can never preempt a compute-bound POSIX real-time priority process on a single-processor machine. The result will be a cluster reformation and this machine performing a TOC!

Here's an example. We first make our existing shell a POSIX real-time process:

    root@hpeos003[] rtsched -s SCHED_FIFO -p 31 -P $$
    root@hpeos003[] ps -lp $$
      F S  UID   PID  PPID  C   PRI   NI     ADDR   SZ      WCHAN TTY       TIME COMD
      1 R    0  6199  6197  1   -32   20 43cf1800  114          - pts/4     0:01 sh
    root@hpeos003[]

Now we can make an existing process a POSIX real-time process. If I want to run a program starting with a POSIX real-time priority, I simply replace the -P <pid> on the above rtsched command line with my command and arguments. We will take our nfsktcpd daemon that we saw earlier. Currently, it is a timesharing process:

    root@hpeos003[] ps -lp 4456
       F S  UID   PID  PPID  C   PRI   NI     ADDR   SZ      WCHAN TTY       TIME COMD
    1003 R    0  4456     0  0   178   20 43cfb200    0          - ?         0:33 nfsktcpd
    root@hpeos003[]

We could now test this application under the POSIX real-time scheduler.

    root@hpeos003[] rtsched -s SCHED_FIFO -p 1 -P 4456
    root@hpeos003[] ps -lp 4456
       F S  UID   PID  PPID  C   PRI   NI     ADDR   SZ      WCHAN TTY       TIME COMD
    1003 R    0  4456     0  0    -2   20 43cfb200    0          - ?         0:33 nfsktcpd
    root@hpeos003[]

Let's have a quick look at gpm to see what effect this has had on the threads for this process (Figure 11-4):

Figure 11-4. View a POSIX Real-Time process in gpm.

First, notice that the priority has now been set to -2 as predicted by our calculations in working out POSIX real-time priorities. Also note that gpm is listing the scheduling policy that we are now using under the Scheduler column. The scheduling policy used by these threads is listed in the /usr/include/sys/sched.h file:

    SCHED_INVALID  = -1,
    SCHED_FIFO     =  0,     /* Strict First-In/First-Out policy */
    SCHED_RR       =  1,     /* FIFO, with a Round-Robin interval */
    SCHED_HPUX     =  2,     /* the default HP-UX scheduling policy */
    SCHED_RR2      =  5,     /* RR, with a per-priority RR interval */
    SCHED_RTPRIO   =  6,     /* HP-UX rtprio(2) realtime scheduling */
    SCHED_NOCHANGE =  7,     /* used internally */
    SCHED_NOAGE    =  8      /* HPUX without decay from usage */

As we can see, I am using SCHED_FIFO as specified on the rtsched command line. I can use rtsched to specify any of the schedulers for a given process/thread. Once I am finished testing, I could return my nfsktcpd process to be a normal timesharing process:

    root@hpeos003[] rtsched -s SCHED_HPUX -P 4456
    root@hpeos003[] ps -lp 4456
       F S  UID   PID  PPID  C   PRI   NI     ADDR   SZ  WCHAN TTY       TIME COMD
    1003 R    0  4456     0  0   152   20 43cfb200    0      - ?         0:36 nfsktcpd
    root@hpeos003[]

We could check again with gpm that the thread priorities return to a normal level. The process priority we can see with ps is 152. This is defined as a non-interruptible priority because it is in the range 128 to 153. There is a variable defined in param.h called PTIMESHARE, which is effectively 128 when viewed from a ps priority perspective. It effectively tells the kernel the priority where timesharing priorities begin. If we look in param.h we can see how priority 152 is defined:

    #define PRIUBA  (24+PTIMESHARE)
    #define PLLIO   (24+PTIMESHARE)

These definitions are also used by commands like glance and top to try to decipher what a process is blocked on, e.g., PLLIO means Process Low-Level IO, which would translate to "Blocked on IO".
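To see how these definitions arrive at 152, the arithmetic can be replayed outside the header. This is a sketch only: PTIMESHARE is set to 128 here because that is its effective value from a ps perspective according to the text, and the helper is_non_interruptible is a hypothetical name of my own.

```python
# Effective value as seen from ps, per the text; the value stored in the
# header itself may be expressed on the kernel's internal scale.
PTIMESHARE = 128

# Mirrors of the two param.h definitions quoted in the text.
PRIUBA = 24 + PTIMESHARE
PLLIO = 24 + PTIMESHARE


def is_non_interruptible(ps_priority):
    """True if the priority lies in the non-interruptible sleep range."""
    return 128 <= ps_priority <= 153


if __name__ == "__main__":
    print(PLLIO, is_non_interruptible(PLLIO))
```

Both constants evaluate to 152, which falls inside the 128 to 153 non-interruptible range.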

NOTE : It should be noted that the behavior of the nfsktcpd when we made the process a POSIX real-time priority process might not reflect the behavior of every process in your application. I will quote directly from the manual on the rtsched() system call:

"If the process pid contains more than one thread or lightweight process (that is, the process is multithreaded), this function shall only change the process scheduling policy and priority. Individual threads or lightweight processes in the target process shall not have their scheduling policies and priorities modified. Note that if the target process is multithreaded, this process scheduling policy and priority change will only affect a child process that is created later and inherits its parent's scheduling policy and priority. The priority returned is the old priority of the target process, though individual threads or lightweight processes may have a different value if some other interface is used to change an individual thread or lightweight processes priority."

To boil this down to simple English, it is up to the application to manage what happens when the process scheduling policy changes. As we saw with our example when we changed the process scheduling policy, this was reflected in all the threads. As we can see from the rtsched() man page, the operating system will not necessarily do this; it's up to the application. Inevitably, if you are going to change the scheduling policy of an existing process, you need to bear this in mind and possibly check the state of individual threads (possibly using gpm, outlined above). If we are to use the rtsched command to launch our application, this is not an issue because we can see from the man page that child processes and threads will inherit their scheduling policy from the parent process.

Run queues for POSIX real-time priorities : The system maintains a single system-wide run queue to house all threads with a POSIX real-time priority. This means that whenever a processor is freed up and is in search of a new thread to run, it will first search the global run queue for any POSIX real-time threads to run.

Access to the rtsched command : By default, only the root user is able to use the rtsched command. If we want to allow other users to use it (which could be dangerous), we must give them the RTSCHED privilege using the setprivgrp command.

HP-UX real-time priorities : HP-UX real-time priorities are not as strong as POSIX real-time priorities but are stronger than timesharing priorities. Like POSIX real-time priorities, they will not decay, staying at the same priority for the life of the process. Again, we need to be sure that we have access to a real-time shell of the same priority or greater than the processes we will test. The valid priorities for HP-UX real-time threads are 0 to +127, with 0 being the strongest priority. We can use the rtsched command or the rtprio command, which only deals with HP-UX real-time priorities. Here, I will run a program under the HP-UX real-time scheduling policy:

    root@hpeos003[] rtprio 120 /usr/local/bin/bigcpu &
    [1]     4448
    root@hpeos003[]
    root@hpeos003[] ps -lp 4448
      F S UID   PID  PPID   C   PRI   NI       ADDR   SZ    WCHAN TTY      TIME COMD
      1 R   0  4448  4361 255   120   24   43d56b80   10        - pts/0    0:15 bigcpu
    root@hpeos003[]

This is a very compute-bound process. Since the time I started it, just over 5 minutes ago, it has already accumulated 5 minutes 29 seconds of CPU time:

    root@hpeos003[] ps -fp 4448
         UID   PID  PPID   C    STIME TTY       TIME COMMAND
        root  4448  4361 255 12:48:56 pts/0     5:29 /usr/local/bin/bigcpu
    root@hpeos003[]

IMPORTANT

The only reason I am still getting a response out of this system is that I have a POSIX real-time priority process running on the system console. Be careful if running such a test under an X Windows session, because the X process commonly gets blocked on priority, freezing all your windows.

Back to bigcpu. Some people will ask me, "Is lots of CPU time a good thing or a bad thing?" The answer, obviously, is it depends. If we look at the STIME (Start TIME) for the process, we can work out how much time a process has accumulated relative to when it was started. In this case, this is lots of CPU time. Is that a good thing or a bad thing? Yep, you got it right, it depends. If this is the only process on the system, then why not give it all the CPU all the time? Giving a process an HP-UX or POSIX real-time priority is one sure-fire way of ensuring that a process gets a good opportunity at accumulating CPU time. Let's use the rtprio command to put this process back to being a timesharing process:

    root@hpeos003[] rtprio -t -4448
    root@hpeos003[] ps -lp 4448
      F S  UID   PID  PPID   C   PRI   NI     ADDR   SZ   WCHAN TTY       TIME COMD
      1 R    0  4448  4361 241   246   24 43d56b80   10       - pts/0    14:37 bigcpu
    root@hpeos003[]

Now that it is back to being a timesharing process, the nice value ( NI in the output from ps ) will be used to calculate the priority of this process. More on nice values later.

Run queues for HP-UX real-time priorities : Every processor maintains a list of 128 individual run queues, one for each of the priorities in this range (0 to 127). This means that once a processor has consulted the system-wide POSIX real-time priority queue, it will then search through its own HP-UX real-time priority queues for the most eligible threads to run. Threads can be added to the list by the kernel when a thread becomes runnable. As with POSIX real-time priorities, the kernel will try to load balance multiple HP-UX real-time priority threads across all processors in the system.

Access to the rtprio command : By default, only the root user is able to use the rtprio command. If we want to allow other users to use it (which could be dangerous), we must give them the RTPRIO privilege using the setprivgrp command.
Timeshare priorities : Timesharing priorities are designed to share time evenly among all runnable processes within this priority range. We discussed the use of priorities in this range used by the operating system when processes are asleep, affectionately known as high-priority sleepers (priority +128 to +153; non-interruptible) and low-priority sleepers (priority +154 to +177; interruptible). Commands we issue from the shell and background processes are given a priority in the range +178 to +255. A thread's initial priority is such that it will probably be a good candidate to be next on the CPU (assuming that no higher priority threads want to run). A thread's initial priority will start at +178. Over time, the priority will decay linearly as the thread is executing. When it is not executing, the priority will grow exponentially in such a way that the thread will regain priority quicker than it loses priority. In this way, a thread has a good chance of regaining the CPU when it currently doesn't have enough priority. At the same time, executing threads will be more willing to relinquish the CPU to other threads that want to run (are runnable). This is the basis of the HP-UX Timeshare scheduling policy. The assumptions for this policy to operate as described are as follows:

- We have a number of runnable threads all competing for the same CPU.

- All runnable threads have all the resources they need to continue executing, i.e., they are not going to suddenly perform some IO that will put them to sleep.

- All runnable threads are using the same nice value (obtained from the parent process).

This last point is crucial in understanding the way the HP-UX Timeshare scheduling policy works. As we have described previously, every 40 milliseconds the priority of a thread is recalculated. One of the variables in that calculation is the nice value. This mystical calculation has a number of variables including the current priority, the current accumulated CPU time, and the nice value. The resulting priority will be used to determine whether the thread is to be given the CPU next. The nice value is an integer somewhere in the range 0 to 39. The lower the nice value, the more aggressively the thread will regain priority in relation to other Timesharing threads; in other words, a low nice value equates to an important process that we want to see regain priority quicker. Conversely, a process with a high nice value will regain priority at a slower rate relative to other Timesharing threads. Figure 11-5 tries to depict the way that a more important process (with a lower nice value) will lose and regain priority in relation to a less important process (with higher nice value).

Figure 11-5. Priority changes with different nice values.


This means that we can affect a thread's priority by changing the nice value of the process. We don't set the priority of a process/thread with the nice value; we only influence it. An important point to remember is that nice values take effect only for Timesharing priorities; POSIX and HP-UX real-time priorities are not influenced by nice values at all. Processes are given a default nice value of 20 (24 for background processes). All other things being equal, all processes should receive the same amount of CPU time. We can influence this by changing the process nice value to be nicer or nastier, using either the nice command before we run a command or the renice command to change the nice value for existing processes (Figure 11-6).

Figure 11-6. Nice and renice.


The behavior of the renice command is somewhat perplexing. It depends on whether you have the UNIX95 environment variable set. In Figure 11-6, we can see the effect of the renice command when the variable is and isn't set. Let me explain. We will start by running our process in the background.

    root@hpeos003[] myprog &
    [1]     6800
    root@hpeos003[] ps -lp 6800
      F S  UID   PID  PPID   C  PRI   NI     ADDR   SZ  WCHAN TTY         TIME CMD
      1 R    0  6800  5335 255  249   24 43cfbac0   10      - pts/3      00:15 myprog
    root@hpeos003[]

We can see that background processes are given an initial nice value of 24 because they are less important than foreground processes. Our idea is to make this process a nicer process, i.e., acquire less CPU than other processes. The higher the nice value, the nicer the process. We then use the renice command to change the nice value to its intended value of 30. In order to do this we use an offset from the default value (=20). To get my nice value to 30 means my offset is 10: 20 + 10 = 30. Sounds weird, but that's the default behavior of renice; it uses an offset from the default (=20). Let's have a look at the output:

    root@hpeos003[] renice -n 10 -p 6800
    6800: old priority 4, new priority 10
    root@hpeos003[] ps -lp 6800
      F S  UID   PID  PPID   C  PRI   NI     ADDR   SZ  WCHAN TTY         TIME CMD
      1 R    0  6800  5335 255  217   30 43cfbac0   10      - pts/3      05:34 myprog
    root@hpeos003[]

We can further demonstrate this when we want this process to acquire more CPU relative to other processes. We want to set the nice value to be 8 (rather nasty). Again we need to think of our offset from the default (=20). Our intended nice value is 8; hence, my offset will be -12: 20 - 12 = 8. Only the root user can set negative nice values by default. Once again let's have a look:

    root@hpeos003[] renice -n -12 -p 6800
    6800: old priority 10, new priority -12
    root@hpeos003[] ps -lp 6800
      F S  UID   PID  PPID   C  PRI   NI     ADDR   SZ  WCHAN TTY         TIME CMD
      1 R    0  6800  5335 255  225    8 43cfbac0   10      - pts/3      14:57 myprog
    root@hpeos003[]

Then we throw the UNIX95 environment variable into the mix. With this variable set, the offset is always relative to your current nice value. If we want to make our process nicer, we need to increase its nice value. Our intended nice value is 18. Our current nice value is 8, so we need to use an offset of 10:

    root@hpeos003[] export UNIX95=1
    root@hpeos003[] renice -n 10 -p 6800
    root@hpeos003[] ps -lp 6800
      F S  UID   PID  PPID   C  PRI   NI     ADDR   SZ  WCHAN TTY         TIME CMD
      1 R    0  6800  5335 255  249   18 43cfbac0   10      - pts/3   01:02:48 myprog
    root@hpeos003[]

It's a subtle difference in behavior, but worth noting. You can usually spot the difference, because with the UNIX95 environment variable set, we don't get any output from the renice command.
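The two renice behaviors boil down to which base value the offset is added to. The sketch below replays the three examples above; the function name renice_result is hypothetical, and clamping the result to the 0 to 39 range is my assumption about what happens at the limits rather than something the examples demonstrate.

```python
DEFAULT_NICE = 20          # default nice value quoted in the text
NICE_MIN, NICE_MAX = 0, 39  # valid nice range quoted in the text


def renice_result(current_nice, offset, unix95=False):
    """Predict the nice value after 'renice -n offset'.

    Without UNIX95 the offset is applied to the default (20); with UNIX95
    set it is applied to the current nice value. Clamping to 0..39 is an
    assumption made for this sketch.
    """
    base = current_nice if unix95 else DEFAULT_NICE
    return max(NICE_MIN, min(NICE_MAX, base + offset))


if __name__ == "__main__":
    print(renice_result(24, 10))               # default behavior
    print(renice_result(30, -12))              # default behavior, negative offset
    print(renice_result(8, 10, unix95=True))   # UNIX95 behavior
```

The three calls reproduce the nice values 30, 8, and 18 seen in the NI column of the ps output.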

Exception to Timesharing priorities : There is an exception to Timesharing priorities. The exception comes into force when you use a scheduling policy of SCHED_NOAGE. The effect this has is to set a process's (and, hence, a thread's) priority at a preset value within the range +178 to +255. Even though the priority is within the Timesharing priority range, the nice value will have no effect and the priority of the process (and threads) will not age. This is because we have explicitly used the SCHED_NOAGE scheduling policy. The reason for having this scheduling policy is to fix a priority at a reasonably high priority without having to resort to a real-time priority. This avoids the potential problem of a compute-intensive real-time priority thread preempting the operating system processes and possibly causing the system to hang. Having a fixed but lower priority means that operating system threads will be able to preempt this thread. To make use of the SCHED_NOAGE scheduling policy, we use the rtsched command. In this example, I will set the priority of my bigcpu process to be +178, the highest priority a Timesharing thread can obtain.
```
root@hpeos003[] ps -lp 4448
  F S  UID   PID  PPID   C   PRI   NI      ADDR   SZ  WCHAN TTY        TIME COMD
  1 R    0  4448  4361 255   249   24  43d56b80   10      - pts/0    124:55 bigcpu
root@hpeos003[] rtsched -s SCHED_NOAGE -p 178 -P 4448
root@hpeos003[] ps -lp 4448
  F S  UID   PID  PPID   C   PRI   NI      ADDR   SZ  WCHAN TTY        TIME COMD
  1 R    0  4448  4361 229   178   24  43d56b80   10      - pts/0    125:18 bigcpu
root@hpeos003[]
```
Even if I look some time from now, the priority will stay the same. With this specific scheduling policy, the nice value has no effect; the priority remains constant.
Run Queues for Timesharing priorities : Each processor maintains its own run queues for Timesharing priorities. Unlike the real-time priority run queues, which identified each individual priority with an individual slot in the run queue, the kernel groups together four consecutive priorities into a single run queue entry. This means that for the 128 possible priorities in the range +128 to +255, the kernel searches only 32 individual slots in the run queues for any runnable threads. When the kernel is searching for an eligible runnable thread, it now has to search the system-wide POSIX real-time run queues, then the per-processor HP-UX real-time run queues, and then the per-processor Timesharing run queues before finding an eligible runnable thread.
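The four-priorities-per-slot grouping is easy to picture with integer division. The helper below is a sketch under my own assumption that slots are numbered from 0 starting at priority 128; the text gives the grouping (four consecutive priorities, 32 slots) but not the kernel's actual indexing, and the name timeshare_slot is hypothetical.

```python
PTIMESHARE = 128         # first Timesharing priority as seen from ps
PRIORITIES_PER_SLOT = 4  # four consecutive priorities share one slot


def timeshare_slot(ps_priority):
    """Return the assumed run queue slot (0..31) for a Timesharing priority."""
    if not 128 <= ps_priority <= 255:
        raise ValueError("not a Timesharing priority")
    return (ps_priority - PTIMESHARE) // PRIORITIES_PER_SLOT


if __name__ == "__main__":
    for pri in (128, 131, 132, 178, 255):
        print(pri, "-> slot", timeshare_slot(pri))
```

Priorities 128 through 131 share slot 0, the initial user priority of 178 lands in slot 12, and 255 lands in the last slot, 31.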

HISTORICAL NOTE : The reason that Timeshare priority run queues are four priorities wide goes back to the dim and distant days when DEC VAX-11 machines were used to help develop the original UNIX kernels. The DEC VAX-11 had instructions to manipulate 32-bit fields. With 128 Timesharing priorities mapped onto 32 bits, that works out to 4 priorities per bit.