The System Call Interface

Each process thread consists of a bound user/kernel thread pair (unless a user-space scheduler is used). At any given instant, this thread is executing instructions from either the process's user memory space or the privileged memory space of a legitimate kernel system call. In addition to the many process-based threads, the system must at times halt the execution of the currently scheduled thread and switch to a specific kernel-only thread to perform interrupt handling, scheduling, or a host of other operating system-related tasks.

The transition between these various modes of operation is the study of the system call interface and the interrupt system (see Figure 4-4). We discussed the occurrence and handling of internal system-level interruptions in Chapter 1, "HP PA-RISC Architecture," and we explore external interrupts in detail in Chapter 10, "I/O and Device Management."

Figure 4-4. User-to-Kernel Thread Transitions


For now, consider that there are only two means by which a processor's current execution mode may be changed from user mode to kernel mode. The first is as the result of some type of system interruption that results in an immediate switch into the kernel context. The second is the direct result of the current user thread making a request to use one of the many provided kernel routines known as system calls, syscall().

The return to the most deserving user thread is controlled by the same kernel-level mechanism, syscall(), regardless of which event caused the switch to kernel mode. This kernel routine is responsible for the delivery and notification of process-to-process signals and the handling of user-level context-switching, returning from a user-defined signal-handling routine, returning from a system-level interruption handler, and returning from a user thread-requested system call. It therefore must be able to deal with several different user and kernel stacks for the recovery of the user thread's context (a thread's run state context is stored in a process control block, or PCB, structure).

Refer to Figure 4-5 as we describe the various tasks performed by a thread during the execution of a system call.

Figure 4-5. System Calls: The Big Picture


We start our discussion by considering that a process's user thread has been selected for execution on an available processor by the kernel scheduler. As execution continues in this user thread's memory space, it becomes necessary to request the services of an HP-UX system call.

The System Call Stub

System calls number in the hundreds and are made available to user threads by way of the system call library. These routines differ from conventional library routines in that they must first transition a user thread to a kernel thread in order to access kernel-resident subroutines. Execution access of code located on kernel-owned pages requires a privilege level of 0. User threads operate with a privilege level of 3; kernel threads operate at a level of 0. By necessity, the mechanism by which a thread's privilege level is set is tightly controlled by the hardware and the kernel to prevent unauthorized use. A listing of available system calls and their symbolic name and number is provided by the file /usr/include/sys/scall_define.h on HP-UX systems.

The Gateway Page

The system call stub is a very small piece of code that is linked into the thread's user-code space by the linking-loader at the time the executable image is built. This small stub appears as a simple intramodule procedure call. Each system call is assigned a unique system call number. It is the job of the compiler (or assembly language programmer) to map high-level language requests to the appropriate system call number and pass it to the stub (in GR22) along with the correct number and type of arguments for the specific system call being requested. This simple stub code blindly forwards the call number, calling register context, return pointer, and passed arguments to a highly specialized routine called a promotion block. The return path to the requesting user thread's calling context is back through this stub when the call is completed.

The system call interface for narrow (32-bit) kernels has not changed with recent releases of the kernel; however, the interface for wide (64-bit) kernels follows the new procedure calling convention discussed earlier in this book. In addition, the gateway page layout for narrow and wide kernels differs somewhat. We begin our discussion by examining the B,GATE instruction and the concept of promotion blocks.

The HP PA-RISC processor family (both narrow and wide) provides a highly specialized option to the branch instruction called the GATE completer (or simply the gateway instruction).

 Mnemonic:  B,GATE   target,t
 Purpose:   Branch to target address, deposit current privilege level into general register t, and set the privilege level according to the rightmost two bits of the current page's TLB access rights.
 Example:   B,GATE   .+8, r0

This instruction results in a branch eight bytes (two words) forward (.+8) relative to the B,GATE instruction's current address. The current privilege level (for a user thread, the value would be 3) is stored in GR0 (r0; in this example it is simply written to the system register equivalent of the bit bucket), and the privilege level is set equal to the rightmost two bits of the TLB access rights for the page from which the branch was made.

Security in this scheme is provided by the simple fact that only the kernel may set a page's TLB access rights. To enable the B,GATE completer, the TLB access rights on a gateway_page are set to PDE_AR_GATE (0x4C). The promotion takes place only if the processor status word C bit is set to 1 (this disallows gate promotions in real mode). To maximize security, the kernel sets TLB access rights to PDE_AR_GATE on only the system call gateway pages during system initialization (see pdapage()).
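The promotion rule described above can be sketched in a few lines of Python. This is a simplified model built only from the description in the text: the function name b_gate and its parameters are hypothetical, and the real hardware performs additional access-rights checks that are ignored here.

```python
PDE_AR_GATE = 0x4C  # access-rights value the text gives for gateway pages

def b_gate(current_pl, page_access_rights, psw_c_bit):
    """Simplified model of B,GATE as described in the text.

    Returns (saved_pl, new_pl): saved_pl is what the instruction deposits in
    the target register t, new_pl is the privilege level after the branch.
    """
    saved_pl = current_pl
    if psw_c_bit == 1:
        # Promotion level comes from the rightmost two bits of the access rights.
        new_pl = page_access_rights & 0b11
    else:
        # Real mode (C bit clear): no promotion takes place.
        new_pl = current_pl
    return saved_pl, new_pl

# A user thread (level 3) executing B,GATE on a gateway page in virtual mode.
print(b_gate(3, PDE_AR_GATE, 1))  # (3, 0)
```

The rightmost two bits of 0x4C are both zero, which is why a gateway page promotes the thread to privilege level 0.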

Promotion Blocks

Individual system calls were originally branched to by means of a simple, four-instruction promotion block. Their basic function was to build an explicit virtual address pointing to the desired system call entry point. A long explicit pointer was needed, since kernel code was outside the scope of the calling thread's virtual memory map. The basic promotion block for a narrow kernel was as follows:

 B,GATE   .+8,r0              # branch to current address+8
                              # and raise the privilege level
 LDIL     L%target,r1         # start building a target address
                              # in r1
 BE       R%target(sr7,r1)    # complete the target address
                              # using sr7 and branch
 DEPI     3,31,2,r31          # set the privilege bits to 3 in
                              # the stub's return pointer, note
                              # this occurs in the BE delay slot
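To see how the LDIL/BE pair builds the target address, the following Python sketch splits an address into the parts selected by L% (the upper 21 bits) and R% (the lower 11 bits) and then recombines them. The function names and the sample address are hypothetical; the sample was chosen so that the split resembles the LDIL/BE pair for syscall #0 in Listing 4.1, assuming adb prints the BE displacement in decimal.

```python
def split_target(addr):
    """Split a 32-bit address into the parts selected by L% and R%.

    L% keeps the upper 21 bits (low 11 bits cleared); R% keeps the low 11 bits.
    """
    left = addr & ~0x7FF & 0xFFFFFFFF
    right = addr & 0x7FF
    return left, right

def promotion_block_target(ldil_left, be_displacement):
    """Address reached by LDIL L%target,r1 followed by BE R%target(sr,r1)."""
    return ldil_left + be_displacement

target = 0x34850  # hypothetical kernel entry point, chosen for illustration
left, right = split_target(target)
print(hex(left), right)                          # 0x34800 80
print(hex(promotion_block_target(left, right)))  # 0x34850
```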

On narrow systems, a thread's fourth quadrant was always mapped to virtual space 0 (see Chapter 6), and space register 7 (sr7) could always be assumed to have the value 0. This is no longer the case for wide kernels, so we must add an additional step to our code to ensure that the explicit target address will lie in the first quadrant of virtual address space 0 (the kernel space). We assign and initialize sr2 for that task. The promotion block on a wide HP-UX 11.0 kernel takes the following form:

 MTSP     r0,sr2              # set up sr2 for later use (sr2=0)
 B,GATE   .+8,r0              # branch to current address+8 and
                              # raise the privilege level
 LDIL     L%target,r1         # start building a target address
                              # in r1
 BE,N     R%target(sr2,r1)    # complete the target address
                              # and branch

 Note: The BE instruction uses nullification of the delay slot to allow for tightly packing the four-instruction promotion blocks. The task of setting the privilege bits in the stub's return pointer (DEPI) has been moved to the first instruction of the target routine (to preserve the four-instruction block size).

A fundamental change to promotion block coding came about with the release of HP-UX 11.i. On wide kernels, the new promotion block requires five instructions. The DEPI instruction has been returned to the gateway page code for enhanced security. First, the new five-instruction block for a wide thread on a wide 11.i system:

 B,GATE   .+8,r0              # branch to current address+8 and
                              # raise the privilege level
 MTSP     r0,sr2              # set up sr2 for later use
 LDIL     L%target,r1         # start building a target address
                              # in r1
 BE       R%target(sr2,r1)    # complete the target address
                              # and branch
 DEPI     3,63,2,r31          # set the privilege bits to 3
                              # in the stub's return pointer
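The effect of the DEPI instruction on the return pointer can be modeled as a bit-field deposit. The sketch below assumes PA-RISC bit numbering (bit 0 is the most significant bit, so the field ending at bit 31 or 63 is the rightmost one); the function name depi and the sample pointer values are hypothetical.

```python
def depi(value, pos, length, reg, width=32):
    """Model of DEPI value,pos,length,reg using PA-RISC bit numbering.

    Bit 0 is the most significant bit, so the least significant bit is
    bit width-1 (31 on narrow, 63 on wide). The field ends at bit `pos`.
    """
    shift = (width - 1) - pos
    mask = ((1 << length) - 1) << shift
    return (reg & ~mask) | ((value << shift) & mask)

# Hypothetical stub return pointers; only the low two bits matter here.
narrow_r31 = 0x00012344
wide_r31 = 0x4000000000012344

print(hex(depi(3, 31, 2, narrow_r31, width=32)))  # 0x12347
print(hex(depi(3, 63, 2, wide_r31, width=64)))    # 0x4000000000012347
```

In both cases only the two rightmost bits change, which is why the narrow form uses bit position 31 and the wide form uses 63.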

Now that the DEPI instruction has been moved back into the gateway64_page, the same is required of the gateway_page. To this end a small additional routine (named gateway_generic) was created in this page's space to keep the narrow promotion blocks at four instructions each. The promotion block for a narrow thread on a wide HP-UX 11.i system is as follows:

 B,GATE   .+8,r0              # branch to current address+8
                              # and raise the privilege level
 LDIL     L%target,r1         # start building a target address
                              # in r1
 B        (gateway_generic)   # branch to our new block of code
 LDO      R%target(r1),r1     # complete the target address
    .  .  .  .  .  .  .
 (gateway_generic)
 MTSP     r0,sr2              # set up sr2 for later use
 BE       (sr2,r1)            # branch external to the target
                              # we have built
 DEPI     3,63,2,r31          # set the privilege bits to 3 in
                              # the stub's return pointer.

Lightweight System Calls

To minimize system overhead cost, some system calls are implemented via simple assembly language routines called lightweight system calls. The lightweight calls simply set or return a value from a process's or thread's proc_t, kthread_t, or user_t. They never block, sleep, or have to worry about multiprocessor issues. Lightweight system calls include sigblock(), sigvector(), getpid(), getuid(), geteuid(), getgid(), getegid(), and umask() for an HP-UX 10.x kernel.

`gateway_page` on a Narrow Kernel

On early HP-UX systems, there were fewer system calls, which allowed a single page of system memory to contain a simple redirection routine and up to 253 individual promotion blocks. The gateway_page needed to be accessible from a user thread's memory map, so it was located at the beginning of quadrant 4, VAS 0 (explicit address of 0x0.C0000000, implicit address of 0xC0000000). This address was hardcoded into the system call stub code for narrow applications and inserted into all SOM executables by the compilers.

As the number of system calls grew, exceeding 253, redirection code at the beginning of the gateway_page was modified so that all system call numbers greater than 253 would be sent to the same address as system call 0. On the kernel side a vector jump table (sysent[]) is maintained and indexed by the system call number (passed by the system call stub in r22) to direct the request to the correct target in the kernel.

Selected Portions of the `gateway_page` from a 32-Bit HP-UX 11.0 Kernel

Listing 4.1 was created using the adb utility:

 # adb -k /stand/vmunix /dev/kmem >> temp
   gateway_page,400?ia

The resulting temp file was then edited. Comments were added and redundant promotion blocks truncated. Please note that adb lists B,GATE as simply GATE.

Listing 4.1. `# adb -k /stand/vmunix /dev/kmem ; gateway_page,400?ia`

 ------------------------------------------
 0xC0000000:   B,N ffffffffc0000018      # Default jump to syscall#0 promotion block
 0xC0000004:   SUBI,>>= 253,r22,r0       # is r22 <= 253 ?
 0xC0000008:   B,N ffffffffc0000018      # if not then go to syscall#0
 0xC000000C:   ZDEP r22,30,30,r1         # else we need to calculate the jump
 0xC0000010:   BLR,N r1,r0               # to the correct promotion block
 0xC0000014:   NOP                       # these two instructions shift the value
                                         # of r22 2 bits to the left (x4) this
                                         # value is then used for a local branch
 0xC0000018:   GATE ffffffffc0000020,r0  # Promotion block for syscall#0
 0xC000001C:   LDIL L%0x34800,r1         # This is the call to sysinit()
 0xC0000020:   BE 80(sr7,r1)             # (a heavyweight call example)
 0xC0000024:   DEPI 3,31,2,r31
 ------------------------------------------
 0xC0000158:   GATE ffffffffc0000160,r0  # Promotion Block for syscall #20
 0xC000015C:   LDIL L%0x385000,r1        # Notice the address is different
 0xC0000160:   BE 648(sr7,r1)            # This is a lightweight call example
 0xC0000164:   DEPI 3,31,2,r31
 ------------------------------------------
 0xC0000FE8:   GATE ffffffffc0000ff0,r0  # Promotion Block for syscall #253
 0xC0000FEC:   LDIL L%0x34800,r1         # another heavyweight call
 0xC0000FF0:   BE 80(sr7,r1)
 0xC0000FF4:   DEPI 3,31,2,r31
 0xC0000FF8:   BREAK                     # End of promotion blocks!
 0xC0000FFC:   BREAK
 0xC0001000:                             # first address of the next page
 ------------------------------------------
 Note: As shown in the listing, system calls 1 through 253 each have a private promotion block defined, thus allowing them to define a unique branch target. Most of the calls jump to syscallinit, but some reference lightweight calls instead. All the calls above 253 collectively use the promotion block for syscall #0.
 ------------------------------------------
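The addresses in Listing 4.1 suggest a simple layout: the promotion block for syscall #0 begins at offset 0x18, and each block occupies four 4-byte instructions. The following Python sketch models only the net effect of the redirection code (which block a given call number reaches), not the individual instructions; the function and constant names are hypothetical.

```python
GATEWAY_PAGE = 0xC0000000      # implicit address of gateway_page (from the text)
FIRST_BLOCK_OFFSET = 0x18      # syscall #0 block starts here in Listing 4.1
BLOCK_SIZE = 4 * 4             # four 4-byte instructions per promotion block
MAX_DIRECT_SYSCALL = 253

def narrow_promotion_block(syscall_number):
    """Address of the promotion block the narrow redirection code selects."""
    if syscall_number > MAX_DIRECT_SYSCALL:
        syscall_number = 0     # high-numbered calls share the syscall #0 block
    return GATEWAY_PAGE + FIRST_BLOCK_OFFSET + BLOCK_SIZE * syscall_number

for n in (0, 20, 253, 300):
    print(n, hex(narrow_promotion_block(n)))
```

The results for call numbers 20 and 253 agree with the addresses 0xC0000158 and 0xC0000FE8 shown in the listing, and any number above 253 falls back to the block at 0xC0000018.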

One restriction in this model was that no new lightweight calls could be defined with call numbers above 253. Another issue is that since the address of gateway_page is hardcoded into the system call stub it may not be relocated without breaking support of existing binaries.

The Wide Gateway: `gateway_page`, `sysvec`, and `gateway64_page` on a 64-bit Kernel

With the advent of the 64-bit HP-UX kernel, a new gateway page model was implemented. To address the issue of assigning lightweight system call numbers above 253, a user-space vector jump table (indexed by the system call number), sysvec, was created and accessed by a new system call stub for wide applications. The addresses contained in sysvec point to promotion blocks stored on a newly created gateway64_page. The kernel still contains a sysent[] vector jump array, so the gateway64_page needs to contain only a generic promotion block for all the heavyweight system calls and one additional promotion block for each distinct lightweight system call.

To provide continued support of existing narrow applications, a gateway_page is located at the virtual address 0xC0000000, and the gateway64_page and the sysvec page (or pages) are located immediately after. Unlike on narrow systems, these pages are not in the fourth quadrant of virtual space 0; instead, they are in the first quadrant of the space allocated for sharing objects between narrow and wide processes. This space number is stored in the kernel parameter q1_64bit_spaceid.

During kernel initialization, the gateway_page and gateway64_pages are configured with TLB access rights set to PDE_AR_GATE. The following kernel global variables help define the location and size of the gateway and sysvec pages for a wide kernel:

 gate64_vaddr        Virtual address of the gateway64_page
 gate64_pages        Number of pages needed for the 64-bit promotion blocks
 sysvec_vaddr        Virtual address of the 64-bit sysvec table
 sysvec_pages        Number of pages needed for the sysvec table
 sysvec_entries      Number of 64-bit promotion blocks
 q1_64bit_spaceid    Number of the virtual space holding the gateway pages

To make a listing of any of these pages on a wide kernel system, we must first discover their address. The q4 tool may help in this endeavor provided the kernel has been prepared to work with q4 (reference the q4pxdb command in Chapter 16).

Listing the `gateway_page` on a Wide System

The first step is to use q4 to find the address of gateway_page:

 q4> &gateway_page
 0440000      147456   0x24000

The last value returned is the address of gateway_page in the kernel. It may be used with adb to create the listing:

 $ adb -k /stand/vmunix /dev/mem > /tmp/gateway_page
   24000,400?ia

We did not include a copy of this output for a wide HP-UX 11.00 system because it is virtually the same as that for the narrow kernel, which we have already shown.
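The three fields in the q4 output appear to be the same value printed in octal, decimal, and hexadecimal (an inference from the numbers shown, not a statement from the q4 documentation). A minimal Python check of that reading, using a hypothetical helper name:

```python
def same_address(octal_text, decimal_text, hex_text):
    """Check that three q4 output fields describe one and the same value."""
    return int(octal_text, 8) == int(decimal_text, 10) == int(hex_text, 16)

# Fields copied from the q4 output for &gateway_page shown above.
print(same_address("0440000", "147456", "0x24000"))  # True
```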

Selected Portions of the `sysvec` Page from a 64-Bit HP-UX 11.0 Kernel

Use q4 to find the address of the sysvec page:

 q4> &sysvec
 0460000    155648  0x26000

Next, use adb to create the listing:

 $ adb -k /stand/vmunix /dev/mem > /tmp/sysvec
   26000,101?4X

The requested output format is four hex values per line. See Listing 4.2.

Listing 4.2. `# adb -k /stand/vmunix /dev/mem ; 26000,101?4X`

 ------------------------------------------------------------
 sysvec:
 sysvec:     0          25000       0          25000
             0          25000       0          25000
             0          25000       0          25000
             0          25000       0          25000
             0          25000       0          25000
             0          25000       0          25000
             0          25000       0          25000
             0          25000       0          25000
             0          25000       0          25000
             0          25000       0          25000
 (idx=20)    0(LW-Call) 25010       0          25000
             0          25000       0          25000
 (idx=24)    0(LW-Call) 25020       0          25000
             0          25000       0          25000
 (idx=28)    0(LW-Call) 25030       0          25000
             0          25000       0          25000
 --------------------------------------------------------
 (Note: Many entries have been truncated for brevity)
 --------------------------------------------------------
             0          25000       0          25000
 sysvec_end:8000240000
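Listing 4.2 can be read as a table of 64-bit addresses, each printed by adb as two 32-bit words, two entries per row. The Python sketch below rebuilds a small table on that assumption: every slot defaults to the generic block at 0x25000, and the three lightweight slots visible in the listing point elsewhere. The helper names and the 32-entry table size are hypothetical.

```python
GENERIC_BLOCK = 0x25000  # gateway64_page: shared heavyweight promotion block

def build_sysvec(entries, lightweight):
    """Build a sysvec-like table: every slot defaults to the generic block."""
    table = [GENERIC_BLOCK] * entries
    for number, block_address in lightweight.items():
        table[number] = block_address
    return table

def join_words(high, low):
    """Combine the two 32-bit words adb prints into one 64-bit entry."""
    return (high << 32) | low

# Lightweight slots visible in Listing 4.2 (indexes 20, 24, and 28).
sysvec = build_sysvec(32, {20: join_words(0, 0x25010),
                           24: join_words(0, 0x25020),
                           28: join_words(0, 0x25030)})

print(hex(sysvec[3]))   # 0x25000
print(hex(sysvec[20]))  # 0x25010
```

A wide system call stub would index such a table with the system call number to find its promotion block, so a new lightweight call only needs a new table entry, whatever its number.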

The `gateway64_page` from a 64-Bit HP-UX 11.0 Kernel

Use q4 to find the address of the gateway64_page:

 q4> &gateway64_page
 0450000    151552   0x25000

Next, create the listing using adb:

 $ adb -k /stand/vmunix /dev/mem > /tmp/gateway64_page
   25000,0x4c?ia

Listing 4.3 shows the individual promotion blocks for the 64-bit system calls. Note that there is only one block per specific call type.

Listing 4.3. `# adb -k /stand/vmunix /dev/mem ; 25000,0x4c?ia`

 -------------------------------------------------
 gateway64_page:
 gateway64_page:                    MTSP r0,sr2
 gateway64_page+4:                  GATE gateway64_page+0x000c,r0
 gateway64_page+8:                  LDIL L%0x34000,r1
 gateway64_page+0xC:                BE,N 440(sr2,r1)
 lw_getpid_gate64:
 lw_getpid_gate64:                  MTSP r0,sr2
 lw_getpid_gate64+4:                GATE lw_getpid_gate64+0x000c,r0
 lw_getpid_gate64+8:                LDIL L%0x3b2000,r1
 lw_getpid_gate64+0xC:              BE,N 120(sr2,r1)
 lw_getuid_gate64:
 lw_getuid_gate64:                  MTSP r0,sr2
 lw_getuid_gate64+4:                GATE lw_getuid_gate64+0x000c,r0
 lw_getuid_gate64+8:                LDIL L%0x3b2000,r1
 lw_getuid_gate64+0xC:              BE,N 420(sr2,r1)
 cnx_lw_pmon_read_gate64:
 cnx_lw_pmon_read_gate64:           MTSP r0,sr2
 cnx_lw_pmon_read_gate64+4:         GATE cnx_lw_pmon_read_gate64+0x000c,r0
 cnx_lw_pmon_read_gate64+8:         LDIL L%0x3b3000,r1
 cnx_lw_pmon_read_gate64+0xC:       BE,N 212(sr2,r1)
 lw_getgid_gate64:
 lw_getgid_gate64:                  MTSP r0,sr2
 lw_getgid_gate64+4:                GATE lw_getgid_gate64+0x000c,r0
 lw_getgid_gate64+8:                LDIL L%0x3b2000,r1
 lw_getgid_gate64+0xC:              BE,N 728(sr2,r1)
 lw_test_gate64:
 lw_test_gate64:                    MTSP r0,sr2
 lw_test_gate64+4:                  GATE lw_test_gate64+0x000c,r0
 lw_test_gate64+8:                  LDIL L%0x3b2000,r1
 lw_test_gate64+0xC:                BE,N 104(sr2,r1)
 lw_mcas_util_gate64:
 lw_mcas_util_gate64:               MTSP r0,sr2
 lw_mcas_util_gate64+4:             GATE lw_mcas_util_gate64+0x000c,r0
 lw_mcas_util_gate64+8:             LDIL L%0x3b3000,r1
 lw_mcas_util_gate64+0xC:           BE,N 1800(sr2,r1)
 lw_set_userthreadid_gate64:
 lw_set_userthreadid_gate64:        MTSP r0,sr2
 lw_set_userthreadid_gate64+4:      GATE lw_set_userthreadid_gate64+0x000c,r0
 lw_set_userthreadid_gate64+8:      LDIL L%0x3b2800,r1
 lw_set_userthreadid_gate64+0xC:    BE,N 996(sr2,r1)
 lw_umask_gate64:
 lw_umask_gate64:                   MTSP r0,sr2
 lw_umask_gate64+4:                 GATE lw_umask_gate64+0x000c,r0
 lw_umask_gate64+8:                 LDIL L%0x3b2800,r1
 lw_umask_gate64+0xC:               BE,N 772(sr2,r1)
 lw_lwp_getprivate_gate64:
 lw_lwp_getprivate_gate64:          MTSP r0,sr2
 lw_lwp_getprivate_gate64+4:        GATE lw_lwp_getprivate_gate64+0x000c,r0
 lw_lwp_getprivate_gate64+8:        LDIL L%0x3b2800,r1
 lw_lwp_getprivate_gate64+0xC:      BE,N 1656(sr2,r1)
 lw_lwp_setprivate_gate64:
 lw_lwp_setprivate_gate64:          MTSP r0,sr2
 lw_lwp_setprivate_gate64+4:        GATE lw_lwp_setprivate_gate64+0x000c,r0
 lw_lwp_setprivate_gate64+8:        LDIL L%0x3b2800,r1
 lw_lwp_setprivate_gate64+0xC:      BE,N 1544(sr2,r1)
 lw_lf_send_gate64:
 lw_lf_send_gate64:                 MTSP r0,sr2
 lw_lf_send_gate64+4:               GATE lw_lf_send_gate64+0x000c,r0
 lw_lf_send_gate64+8:               LDIL L%0x3b2800,r1
 lw_lf_send_gate64+0xC:             BE,N 2044(sr2,r1)
 lw_sigvec_gate64:
 lw_sigvec_gate64:                  MTSP r0,sr2
 lw_sigvec_gate64+4:                GATE lw_sigvec_gate64+0x000c,r0
 lw_sigvec_gate64+8:                LDIL L%0x3b2000,r1
 lw_sigvec_gate64+0xC:              BE,N 1364(sr2,r1)
 lw_sigblock_gate64:
 lw_sigblock_gate64:                MTSP r0,sr2
 lw_sigblock_gate64+4:              GATE lw_sigblock_gate64+0x000c,r0
 lw_sigblock_gate64+8:              LDIL L%0x3b2800,r1
 lw_sigblock_gate64+0xC:            BE,N 420(sr2,r1)
 lw_sigsetmask_gate64:
 lw_sigsetmask_gate64:              MTSP r0,sr2
 lw_sigsetmask_gate64+4:            GATE lw_sigsetmask_gate64+0x000c,r0
 lw_sigsetmask_gate64+8:            LDIL L%0x3b2000,r1
 lw_sigsetmask_gate64+0xC:          BE,N 1036(sr2,r1)
 lw_lwp_self_gate64:
 lw_lwp_self_gate64:                MTSP r0,sr2
 lw_lwp_self_gate64+4:              GATE lw_lwp_self_gate64+0x000c,r0
 lw_lwp_self_gate64+8:              LDIL L%0x3b2800,r1
 lw_lwp_self_gate64+0xC:            BE,N 1244(sr2,r1)
 lw_lf_next_scn_gate64:
 lw_lf_next_scn_gate64:             MTSP r0,sr2
 lw_lf_next_scn_gate64+4:           GATE lw_lf_next_scn_gate64+0x000c,r0
 lw_lf_next_scn_gate64+8:           LDIL L%0x3b3000,r1
 lw_lf_next_scn_gate64+0xC:         BE,N 104(sr2,r1)
 lw_get_thread_times_gate64:
 lw_get_thread_times_gate64:        MTSP r0,sr2
 lw_get_thread_times_gate64+4:      GATE lw_get_thread_times_gate64+0x000c,r0
 lw_get_thread_times_gate64+8:      LDIL L%0x3b2800,r1
 lw_get_thread_times_gate64+0xC:    BE,N 1768(sr2,r1)
 lw_nosys_gate64:
 lw_nosys_gate64:                   MTSP r0,sr2
 lw_nosys_gate64+4:                 GATE lw_nosys_gate64+0x000c,r0
 lw_nosys_gate64+8:                 LDIL L%0x3ae800,r1
 lw_nosys_gate64+0xC:               BE,N 304(sr2,r1)
 lw_nosys_gate64+10:
 (This last promotion block is used to trap unimplemented system calls)

Listing 4.4 is a partial listing of system calls and their numbers taken from the file /usr/include/sys/scall_define.h. It is included for your consideration as you look at the listings of the promotion blocks:

Listing 4.4. `# more /usr/include/sys/scall_define.h`

 misc.dsc:            system call 0 is nosys;
 pm.dsc:              system call 1 is exit;
 pm.dsc:              system call 2 is fork;
 fs.dsc:              system call 3 is read;
 fs.dsc:              system call 4 is write;
 fs.dsc:              system call 5 is open;
 fs.dsc:              system call 6 is close;
 pm.dsc:              system call 7 is wait;
 fs.dsc:              system call 8 is creat;
 fs.dsc:              system call 9 is link;
 fs.dsc:              system call 10 is unlink;
 vm.dsc:              system call 11 is execv;
 fs.dsc:              system call 12 is chdir;
 pm.dsc:              system call 13 is time;
 fs.dsc:              system call 14 is mknod;
 fs.dsc:              system call 15 is chmod;
 fs.dsc:              system call 16 is chown;
 vm.dsc:              system call 17 is brk;
 fs.dsc:              system call 18 is lchmod;
 fs.dsc:              system call 19 is lseek;
 pm.dsc:              system call 20 is getpid;
 fs.dsc:              system call 21 is mount;
 fs.dsc:              system call 22 is umount;
 pm.dsc:              system call 23 is setuid;
 pm.dsc:              system call 24 is getuid;
 pm.dsc:              system call 25 is stime;
 pm.dsc:              system call 26 is ptrace;
 signals.dsc:         system call 27 is alarm;
 pm.dsc:              system call 28 is cnx_lw_pmon_read;
 pm.dsc:              system call 29 is pause;
    ...

Low-Level Initialization

Once the appropriate promotion block is selected, the system will either transfer control to a lightweight system call (also known as a level-2 call) or it will enter the kernel syscallinit() function (a level-1 call) to begin low-level initialization.

In the case of a lightweight call, the action is performed and an immediate return is made to the system call stub. Very little time is spent in the kernel code, so little in fact that the thread accounting system does not even attempt to charge any CPU utilization to the thread's system time bucket. Instead, it is simply counted as user time.

High-Level Initialization

syscallinit() contains an assembly-level initialization code routine designed to set up a save state and stack frame for use as we transition further into the kernel's higher level. As the same kernel code may be called simultaneously by different user threads on SMP systems, each thread brings with it a privately owned kernel stack area. This kstack (three pages for a narrow application and six pages for a wide application) immediately follows the thread's uarea (every thread has its own unique uarea/kstack memory region, discussed in greater detail in Chapter 6). The primary job of syscallinit() is to make sure that the stack pointer (sp), the stub return address (r31), the original caller's return address (rp), the user's data pointer (dp), and the user's space registers are safely stored away so that when all is said and done we will be able to return to the original calling context. The next step is to call the high-level initialization routine syscall().

The syscall() kernel routine is written in C code and starts by collecting performance metrics relating to this thread. The measurement system is initialized, and we now start charging CPU utilization ticks to the thread's "system time" bucket. The current system call number is saved in the thread's uarea (at u_syscall) and is used as the index into the sysent[] vector jump table. The kernel also contains the array sysaux[] with specific 64-bit information about the pending call, such as the number of registers used by the call. Any extra arguments passed via the registers are copied to the uarea (u_arg[]), which satisfies the differences between the 32-bit and 64-bit calling conventions. Once the actual system call address is retrieved, a call is made to setjmp().
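The table-driven dispatch described here can be illustrated with a toy model. This is only a sketch of the idea: the UArea class, the handler functions, and the returned value are hypothetical stand-ins, and the real routine also handles metrics, sysaux[], and setjmp(), which are omitted.

```python
class UArea:
    """Hypothetical stand-in for the per-thread uarea fields named in the text."""
    def __init__(self):
        self.u_syscall = None
        self.u_arg = []

def sys_getpid(uarea):
    return 1234          # placeholder value, not a real process ID

def sys_nosys(uarea):
    raise NotImplementedError("unimplemented system call")

# Toy sysent[]: index is the system call number (0 and 20 as in Listing 4.4).
sysent = {0: sys_nosys, 20: sys_getpid}

def syscall(uarea, number, register_args):
    """Sketch of the dispatch step: record the call, stage args, then call."""
    uarea.u_syscall = number              # saved for later reference
    uarea.u_arg = list(register_args)     # copy register arguments to u_arg[]
    handler = sysent.get(number, sys_nosys)
    return handler(uarea)

u = UArea()
print(syscall(u, 20, []), u.u_syscall)    # 1234 20
```

Indexing a table by call number keeps the gateway code generic: one heavyweight promotion block can serve every call, because the choice of handler is deferred until the number is looked up inside the kernel.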

Argument coercion may be required if the calling user thread is 32 bits and we are running a 64-bit kernel. This action is transparent for the most part, as the values stored in the thread's u_arg[] will have their sign bit extended to 64 bits. The argument coercion stub makes the actual call to the address stored at sysent[r22].
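Sign extension of a 32-bit argument to 64 bits can be shown with a short Python sketch. The function name is hypothetical, and the model covers only the sign-extension step mentioned above, not any other coercion the kernel may perform.

```python
def sign_extend_32_to_64(value):
    """Sign-extend a 32-bit argument into a 64-bit register image."""
    value &= 0xFFFFFFFF
    if value & 0x80000000:                 # sign bit set: fill upper 32 bits
        return value | 0xFFFFFFFF00000000
    return value

print(hex(sign_extend_32_to_64(0x7FFFFFFF)))  # 0x7fffffff
print(hex(sign_extend_32_to_64(0xFFFFFFFE)))  # 0xfffffffffffffffe
```

A 32-bit value of -2 (0xFFFFFFFE) therefore still reads as -2 when the 64-bit kernel interprets the widened register.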

The System Call

Now we have arrived at the actual kernel code we requested. At this point, any of many paths may be followed. We may be asking for an I/O operation, in which case our thread will most likely be put to sleep and assigned a wait channel. We might simply perform the task requested and start on the return path, or an interruption may halt our progress. In any case, let's take stock of where we are and what we have working for us.

The thread is now in kernel mode, and required stack space is being provided by the thread's own private kstack area. A save state of the originating context is safely held in the uarea, and we are proceeding with the full authority and privilege of the kernel.

The next step is to begin the return trip to resume execution in our user thread context. We could find ourselves at this juncture as the result of several different scenarios. Perhaps an interrupt signifying the completion of an I/O operation was received, causing some other thread to be preempted; a user-defined signal handler may have completed and made a call to sig_return(); or a system call could be simply finishing its job. The point is that the system call that initiates the return to a user thread is not necessarily the one that started the sequence: the kernel interrupt system, memory manager, I/O system, or any of a number of system services may have affected the sequence of events.

High-Level Return

The high-level return path starts in the midst of the kernel syscall() routine. The return trip starts by checking to see if the thread we are about to return to has received any signals. This is the only time that a thread checks for signals, and the programmer may have chosen to ignore or block certain signals. We cover the signal-handling mechanism fully in a separate chapter of this book. For now, we simply want to mention that this is the point at which signals are checked for.

Next, the system call's return value is checked to see if an error occurred during the execution of the call or if the return was interrupted by the receipt of a signal. If all is well, the system call return value is passed back through the low-level return code by placing it into ret of the save state structure (which was created during the low-level initialization).

One last task before transferring control to the low-level return code is to check the processor's runrun flag. This flag is stored in a per-processor mp_info_t (this structure is covered in Chapter 12) and is set if an interrupt handler places a thread on a processor's run-queue and notices that it belongs to a higher priority run-queue than the one from which that processor's current thread is executing (remember that on SMP systems, interrupts may be handled for an I/O operation requested from a different processor than the one on which the thread is running). If the runrun flag is set, the current thread is placed back on its run-queue by the setrq() kernel routine, and the kernel routine swtch() is called to make a context-switch to the kthread at the front of the processor's strongest run-queue. Once syscall() is satisfied that it has identified the most deserving thread, the measurement system is instructed to start charging CPU utilization to the user time bucket, and the low-level return routine is called.
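The runrun check can be modeled with a small priority queue. Everything in this sketch other than the names runrun, setrq(), and swtch() is hypothetical: the Processor class, the thread tuples, the convention that a smaller number means a stronger priority, and the clearing of the flag are all assumptions made for illustration.

```python
import heapq

class Processor:
    """Hypothetical per-processor state; only the runrun flag is from the text."""
    def __init__(self):
        self.runrun = False
        self.run_queue = []        # heap of (priority, name); lower sorts first

def setrq(cpu, priority, name):
    """Place a thread on the processor's run queue."""
    heapq.heappush(cpu.run_queue, (priority, name))

def swtch(cpu):
    """Pick the thread at the front of the strongest run queue."""
    return heapq.heappop(cpu.run_queue)

def return_to_user(cpu, current):
    """Decide which thread resumes in user mode at the end of syscall()."""
    if cpu.runrun:
        setrq(cpu, *current)       # current thread goes back on its run queue
        current = swtch(cpu)       # context-switch to the strongest thread
        cpu.runrun = False         # assumed: flag is cleared once acted on
    return current

cpu = Processor()
setrq(cpu, 150, "io_waiter")       # readied by an interrupt handler
cpu.runrun = True
print(return_to_user(cpu, (180, "caller")))  # (150, 'io_waiter')
```

When the flag is clear, the calling thread simply continues; when it is set, the caller is queued and the stronger thread is the one that returns to user mode.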

Low-Level Return

syscallrtn() is an assembly code routine that basically restores the registers that were saved upon entry to the low-level initialization code: the stack pointer (sp), the stub return address (r31), the original caller's return address (rp), the data pointer (dp), and the user's space registers are all restored prior to branching back to the system call stub's return address (r31). The system call's return value is passed to the original calling thread by way of r22, the same register that was used by the system call stub to pass the system call number. While this describes the normal system call return path, there are a couple of other possibilities that could have brought us here.

If we have arrived here due to the trapping of a signal for which the programmer has defined a signal handler, we will check to see if a save state needs to be placed on a signal stack for future reference, and then we will return directly to the handler routine. If we are returning from a user-defined signal handler, then the save state will need to be recovered from the signal stack prior to completing the return.

If the return is the result of a context-switch, we will recover the save state from the thread's uarea and proceed directly back to the user thread's context. Flags are also checked to see if this thread belongs to a process that is being traced; if so, various actions are taken depending on the level and type of tracing being used.

The final step involves making a branch-external instruction to the stub's return address in r31. This completes the system call and in the process demotes the thread to the user privilege level of 3.
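The demotion performed by this final branch can be sketched as follows. The model assumes that the two rightmost bits of the return pointer carry the requested privilege level (the bits set to 3 by DEPI in the promotion block) and that a branch can lower privilege but never raise it; the function name and sample pointer are hypothetical.

```python
def branch_external(current_pl, target_pointer):
    """Simplified model of the final BE through r31.

    The low two bits of the pointer carry a privilege level; the branch may
    lower privilege (raise the number) but, in this model, never raise it.
    """
    requested_pl = target_pointer & 0b11
    new_pl = max(current_pl, requested_pl)
    instruction_address = target_pointer & ~0b11
    return instruction_address, new_pl

# Hypothetical stub return pointer whose low bits were set to 3 by DEPI.
address, privilege = branch_external(0, 0x00012347)
print(hex(address), privilege)  # 0x12344 3
```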
