2.8. System Calls

System calls are the set of application programming interfaces (APIs) that allow programs to have the kernel perform a privileged service on their behalf. Common examples include memory allocation, file I/O, signal management, and interprocess communication. Standards define the names of the system calls, the arguments they take, the way they behave from the application's perspective, and the values they return to the calling program. System calls are described in section 2 of the man pages.

Because system calls are privileged operations that can only be done by the kernel, making a system call results in the calling process transitioning from operating in user mode to operating in kernel mode. With the process in kernel mode, there is visibility into the kernel's address space (among other things). The platform's trap mechanism manages the transition to kernel mode. That is, when a system call is executed, a trap (a vectored transfer of control to a trap handler) is taken, and the system call trap handler takes over.

Much of the system call entry and setup work depends on the processor architecture. The main system call code (the actual system calls) is implemented in C and can be found in usr/src/uts/common/syscall. The trap mechanism, however, is platform specific, so the mechanics of handling a system call trap and setting up the thread state and registers for system call execution differ between SPARC systems and AMD64 systems. The following text walks through a system call on a SPARC system. Note that all the code described is written in SPARC assembly language. Some knowledge of SPARC assembly, along with register windows and general register use, is helpful, though not a requirement.

2.8.1. System Calls on SPARC Architectures

An application making a system call actually calls a libc wrapper function that performs any required posturing and then enters the kernel with a software trap instruction. This means that user code and compilers do not need to know the path into the kernel and that binaries can work on later versions of the OS where perhaps the path has been modified, system call numbers were newly overloaded, etc.

Solaris for SPARC supports three software traps for entering the kernel, as listed in Table 2.3.

Table 2.3. Software Traps for System Calls on SPARC Architectures

Software Trap   Instruction   Description
0x0             ta 0x0        Used for system calls for binaries running in SunOS 4.x binary compatibility mode
0x8             ta 0x8        32-bit (ILP32) binary running on 64-bit (LP64) kernel
0x40            ta 0x40       64-bit (LP64) binary running on 64-bit (LP64) kernel


Because Solaris 10 no longer includes a 32-bit kernel, the ILP32 syscall on an ILP32 kernel is no longer implemented.

In the wrapper function, the syscall arguments are rearranged if necessary. The kernel function implementing the syscall may expect the arguments in a different order from that of the syscall API, for example, or multiple related system calls may share a single system call number and select behavior based on an additional argument passed into the kernel. The wrapper function then places the system call number in register %g1 and executes one of the above trap-always instructions (for example, the 32-bit libc library uses ta 0x8, and the 64-bit libc uses ta 0x40). There's a lot more activity and posturing in the wrapper functions than described here, but for our purposes we simply note that it all boils down to a ta instruction to enter the kernel.
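Seen from the application, the split between a wrapper and a system call number can be sketched as follows. This is a minimal illustration, not libc's implementation: it asks for the process ID through the generic syscall() entry point, which takes the system call number as its first argument, rather than through the dedicated getpid() wrapper. The helper name pid_via_syscall_number is hypothetical, and the sketch assumes a platform whose <sys/syscall.h> defines SYS_getpid and whose libc provides syscall().

```c
#define _GNU_SOURCE		/* exposes syscall() in glibc; harmless elsewhere */
#include <sys/types.h>
#include <sys/syscall.h>	/* SYS_getpid: the system call number */
#include <unistd.h>		/* getpid(), syscall() */

/*
 * Hypothetical helper: request the process ID by system call number
 * instead of through the dedicated libc wrapper.
 */
pid_t
pid_via_syscall_number(void)
{
	return ((pid_t)syscall(SYS_getpid));
}
```

Both routes end in the same kernel handler, so pid_via_syscall_number() and getpid() should report the same process ID.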

2.8.1.1. Handling a System Call Trap

A SPARC trap instruction (ta n) executed in userland by the wrapper function results in a trap type 0x100 + n being taken, and we move from trap-level 0 (TL0) (where all userland and most kernel code executes) to trap-level 1 (TL1) in nucleus context. Code that executes in nucleus context has to be hand-crafted in assembler since nucleus context does not comply with the ABI conventions and is generally much more restricted in what it can do. The task of the trap handler executing at TL1 is to provide the necessary glue in order to get us back to TL0 and running privileged (kernel) C code that implements the actual system call.
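The mapping from a ta n instruction to its trap type is simple arithmetic, sketched below. The macro and function names are hypothetical; only the 0x100 base and the three trap numbers come from the text and Table 2.3.

```c
#include <stdint.h>

#define	TT_SWTRAP_BASE	0x100u	/* trap type taken for "ta 0" */

/*
 * Hypothetical helper: the trap type taken when userland
 * executes "ta n".
 */
uint32_t
sw_trap_type(uint32_t n)
{
	return (TT_SWTRAP_BASE + n);
}
```

With the trap numbers from Table 2.3, ta 0x0, ta 0x8, and ta 0x40 give trap types 0x100, 0x108, and 0x140.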

The trap table entries for the sun4u and sun4v architectures for these traps are identical. In the following examples, we explore the two primary syscall traps and ignore the SunOS 4.x trap. Note that a trap table handler has just eight instructions dedicated to it in the trap table; it must use these to do a little work and then branch elsewhere.

    /*
     * SYSCALL is used for system calls on both ILP32 and LP64 kernels
     * depending on the "which" parameter (should be either syscall_trap
     * or syscall_trap32).
     */
    #define SYSCALL(which)                   \
            TT_TRACE(trace_gen)              ;\
            set     (which), %g1             ;\
            ba,pt   %xcc, sys_trap           ;\
            sub     %g0, 1, %g4              ;\
            .align  32
    ...
    ...
    trap_table:
    scb:
    trap_table0:
            /* hardware traps */
            ...
            ...
            /* user traps */
            GOTO(syscall_trap_4x);           /* 100  old system call */
            ...
            SYSCALL(syscall_trap32);         /* 108  ILP32 system call on LP64 */
            ...
            SYSCALL(syscall_trap)            /* 140  LP64 system call */
            ...
                                            See usr/src/uts/sun4u/ml/trap_table.s


In both cases we branch to sys_trap, requesting a TL0 handler of syscall_trap32 for an ILP32 syscall and syscall_trap for an LP64 syscall. In both cases, we request that the processor interrupt level (PIL) remain as it currently is (always 0 since we came from userland). The sys_trap code is generic glue that takes us from nucleus (TL > 0) context back to TL0, running a specified handler (address in %g1, usually written in C) at a chosen PIL. The specified handler is called with arguments as given by registers %g2 and %g3 at the time we branch to sys_trap. The SYSCALL macro above does not move anything into these registers, so no arguments are passed to the handler. sys_trap handlers are always called with a first argument pointing to a struct regs that provides access to all the register values at the time of branching to sys_trap; for syscalls these include the system call number in %g1 and the arguments in the output registers. Note that %g1 as prepared in the wrapper and %g1 as used in the SYSCALL macro for the trap table entry are not the same register. On a trap we move from the regular global registers (in which userland executes) to the alternate global registers, but the sys_trap glue collects all the correct user registers and makes them available in the struct regs it passes to the handler.

The sys_trap glue is also responsible for setting up our return linkage. When the TL0 handling is complete, the handler returns, restoring the stack pointer and program counter as constructed in sys_trap. Since we trapped from userland, user_rtt is interposed as the glue into which TL0 handling code returns, which gets us back out of the kernel and into userland again when the system call completes.

2.8.2. A Tour through a System Call

We follow the ILP32 syscall route; the route for LP64 is analogous, with trivial differences such as not having to clear the upper 32 bits of arguments or deal with other items related to the data width. The syscall_trap code runs at TL0 as a sys_trap handler, so it could be written in C. However, for performance it is coded in assembler. Our task is to look up and call the nominated system call handler and perform the required housekeeping along the way.

    syscall_trap32(struct regs *rp);

    ENTRY_NP(syscall_trap32)
    ldx     [THREAD_REG + T_CPU], %g1       ! get cpu pointer
    mov     %o7, %l0                        ! save return addr
                                            See usr/src/uts/sparc/v9/ml/syscall_trap.s


First note that we do not obtain a new register window here; we stay within the window that sys_trap crafted for itself. Normally, this would mean that we would have to live within the output registers, but by agreement, handlers called through sys_trap are permitted to use registers %l0 through %l3.

We begin by loading a pointer to the CPU on which this thread is executing into %g1 and saving the return PC (as constructed by sys_trap, and passed in %o7) in %l0.

    !
    ! If the trapping thread has the address mask bit clear, then it's
    !   a 64-bit process, and has no business calling 32-bit syscalls.
    !
    ldx     [%o0 + TSTATE_OFF], %l1         ! saved %tstate.am is that
    andcc   %l1, TSTATE_AM, %l1             !   of the trapping proc
    be,pn   %xcc, _syscall_ill32            !
    mov     %o0, %l1                        ! save reg pointer
                                            See usr/src/uts/sparc/v9/ml/syscall_trap.s


The comment says it all. The AM (address mask) bit of the PSTATE register, as it was when we executed the ta instruction, is available in the %tstate register after the trap; sys_trap preserved that value for us in the regs structure before it could be modified by further traps. Assuming we're not a 64-bit process making a 32-bit syscall, here's what happens.

    srl     %i0, 0, %o0                      ! copy 1st arg, clear high bits
    srl     %i1, 0, %o1                      ! copy 2nd arg, clear high bits
    ldx     [%g1 + CPU_STATS_SYS_SYSCALL],  %g2
    inc     %g2                              ! cpu_stats.sys.syscall++
    stx     %g2, [%g1 + CPU_STATS_SYS_SYSCALL]
                                            See usr/src/uts/sparc/v9/ml/syscall_trap.s


The libc wrapper placed up to the first six arguments in %o0 through %o5, with the rest, if any, on the stack. During sys_trap, a SAVE instruction obtained a new register window, so those arguments are now available in the corresponding input registers, despite our not performing a save in syscall_trap32 itself. We're going to call the real handler, so we prepare the arguments in our outputs, which we're sharing with sys_trap, but outputs are understood to be volatile across calls. The shift-right-logical by 0 bits is a 32-bit operation (that is, not srlx), so it performs no shifting, but it does clear the upper 32 bits of the arguments. We also increment the statistic counting the number of system calls made by this CPU; this statistic is in the cpu_t, and the offset is generated by the genasym tool.
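The effect of srl with a shift count of 0 can be modeled in C as a truncation to 32 bits followed by zero extension. The function below is a hypothetical model of that one instruction on a 64-bit register value, not kernel code.

```c
#include <stdint.h>

/*
 * Hypothetical model of "srl %iN, 0, %oN": no bits are shifted,
 * but the result keeps only the low 32 bits, zero-extended to 64.
 */
uint64_t
srl_by_zero(uint64_t reg)
{
	return ((uint64_t)(uint32_t)reg);
}
```

Any stale data in the upper half of an argument register is discarded this way, which is what a 32-bit caller expects the kernel to see.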

    !
    ! Set new state for LWP
    !
    ldx     [THREAD_REG + T_LWP], %l2
    mov     LWP_SYS, %g3
    srl     %i2, 0, %o2                      ! copy 3rd arg, clear high bits
    stb     %g3, [%l2 + LWP_STATE]
    srl     %i3, 0, %o3                      ! copy 4th arg, clear high bits
    ldx     [%l2 + LWP_RU_SYSC], %g2         ! pesky statistics
    srl     %i4, 0, %o4                      ! copy 5th arg, clear high bits
    addx    %g2, 1, %g2
    stx     %g2, [%l2 + LWP_RU_SYSC]
    srl     %i5, 0, %o5                      ! copy 6th arg, clear high bits
    ! args for direct syscalls now set up
                                            See usr/src/uts/sparc/v9/ml/syscall_trap.s


We continue preparing arguments as above. Interleaved with these instructions, we change the lwp_state member of the associated LWP structure to signify that it is running in-kernel (LWP_SYS; it would have been LWP_USER before this update) and increment the count of syscalls made by this particular LWP.

Next we write a TRAPTRACE entry (on DEBUG kernels only); these entries are visible with MDB's ::traptrace dcmd.

    !
    ! Test for pre-system-call handling
    !
    ldub    [THREAD_REG + T_PRE_SYS], %g3   ! pre-syscall proc?
    #ifdef SYSCALLTRACE
    sethi   %hi(syscalltrace), %g4
    ld      [%g4 + %lo(syscalltrace)], %g4
    orcc    %g3, %g4, %g0                   ! pre_syscall OR syscalltrace?
    #else
    tst     %g3                             ! is pre_syscall flag set?
    #endif /* SYSCALLTRACE */
    bnz,pn  %icc, _syscall_pre32            ! yes - pre_syscall needed
      nop

    ! Fast path invocation of new_mstate
    mov     LMS_USER, %o0
    call    syscall_mstate
    mov     LMS_SYSTEM, %o1

    lduw    [%l1 + O0_OFF + 4], %o0         ! reload 32-bit args
    lduw    [%l1 + O1_OFF + 4], %o1
    lduw    [%l1 + O2_OFF + 4], %o2
    lduw    [%l1 + O3_OFF + 4], %o3
    lduw    [%l1 + O4_OFF + 4], %o4
    lduw    [%l1 + O5_OFF + 4], %o5
    ! lwp_arg now set up
    3:
                                            See usr/src/uts/sparc/v9/ml/syscall_trap.s


If the curthread->t_pre_sys flag is set, then we branch to _syscall_pre32 to call pre_syscall. If that action does not abort the call, then _syscall_pre32 reloads the outputs with the args (they were lost on the call to pre_syscall), using lduw instructions to load just the lower 32-bit word of each arg from the regs area, and branches back to label 3 above. If we don't have pre-syscall work to perform, then we call syscall_mstate(LMS_USER, LMS_SYSTEM) to record the transition from user to system state for microstate accounting. Microstate accounting is always performed in Solaris 10 (in previous releases, it needed to be explicitly enabled).

After the unconditional call to syscall_mstate, we reload the arguments from the regs struct into the output registers (as after the pre-syscall work). Evidently our earlier srl work on the args is a complete waste of time (although not expensive), since we always end up loading the args from the passed regs structure. This is a holdover from the days when microstate accounting was not always enabled.

    !
    ! Call the handler.  The %o's have been set up.
    !
    lduw    [%l1 + G1_OFF + 4], %g1         ! get 32-bit code
    set     sysent32, %g3                   ! load address of vector table
    cmp     %g1, NSYSCALL                   ! check range
    sth     %g1, [THREAD_REG + T_SYSNUM]    ! save syscall code
    bgeu,pn %ncc, _syscall_ill32
      sll   %g1, SYSENT_SHIFT, %g4          ! delay - get index
    add     %g3, %g4, %g5                   ! g5 = addr of sysentry
    ldx     [%g5 + SY_CALLC], %g3           ! load system call handler

    brnz,a,pt %g1, 4f                       ! check for indir()
    mov     %g5, %l4                        ! save addr of sysentry
    !
    ! Yuck.  If %g1 is zero, that means we're doing a syscall() via the
    ! indirect system call.  That means we have to check the
    ! flags of the targeted system call, not the indirect system call
    ! itself.  See return value handling code below.
    !
    set     sysent32, %l4                   ! load address of vector table
    cmp     %o0, NSYSCALL                   ! check range
    bgeu,pn %ncc, 4f                        ! out of range, let C handle it
      sll   %o0, SYSENT_SHIFT, %g4          ! delay - get index
    add     %g4, %l4, %l4                   ! compute & save addr of sysent
    4:
    call    %g3                             ! call system call handler
    nop
                                            See usr/src/uts/sparc/v9/ml/syscall_trap.s


We load the nominated syscall number into %g1, sanity-check it for range, look up the entry at that index in the sysent32 table of 32-bit system calls, and extract the registered handler (the real implementation). Ignoring the indirect syscall work, we then call the handler, and the real work of the syscall is executed.
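The range check and table lookup can be modeled in C as below. This is a simplified sketch, not the kernel's sysent definition: the names ending in _model, the four-entry table, the two-argument handler signature, and the sample handlers are all hypothetical.

```c
#include <stdint.h>

#define	NSYSCALL_MODEL	4	/* illustrative size, not the kernel's NSYSCALL */

typedef int64_t (*callc_model_t)(uint32_t, uint32_t);

struct sysent_model {
	callc_model_t	sy_callc;	/* handler (cf. SY_CALLC) */
	uint16_t	sy_flags;	/* flags (cf. SY_FLAGS) */
};

/* Sample handlers standing in for real system call implementations. */
static int64_t
nosys_model(uint32_t a0, uint32_t a1)
{
	(void) a0;
	(void) a1;
	return (-1);
}

static int64_t
add_model(uint32_t a0, uint32_t a1)
{
	return ((int64_t)a0 + (int64_t)a1);
}

static const struct sysent_model sysent_table_model[NSYSCALL_MODEL] = {
	{ nosys_model, 0 },
	{ add_model, 0 },
	{ nosys_model, 0 },
	{ nosys_model, 0 },
};

/*
 * Range-check the code, index the table, and call the handler.
 * Returns 0 and stores the handler's result in *rval, or -1 if the
 * code is out of range (cf. the branch to _syscall_ill32).
 */
int
dispatch_model(uint32_t code, uint32_t a0, uint32_t a1, int64_t *rval)
{
	if (code >= NSYSCALL_MODEL)
		return (-1);
	*rval = sysent_table_model[code].sy_callc(a0, a1);
	return (0);
}
```

In the assembly, the shift by SYSENT_SHIFT plays the role of the array indexing here: it scales the syscall number by the size of one table entry.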

    !
    ! If handler returns long long, then we need to split the 64 bit
    ! return value in %o0 into %o0 and %o1 for ILP32 clients.
    !
    lduh    [%l4 + SY_FLAGS], %g4           ! load sy_flags
    andcc   %g4, SE_64RVAL | SE_32RVAL2, %g0 ! check for 64-bit return
    bz,a,pt %xcc, 5f
      srl   %o0, 0, %o0                     ! 32-bit only
    srl     %o0, 0, %o1                     ! lower 32 bits into %o1
    srlx    %o0, 32, %o0                    ! upper 32 bits into %o0
                                            See usr/src/uts/sparc/v9/ml/syscall_trap.s


Once the system call executes, we set up the return value. For ILP32 clients we need to massage 64-bit return types into two adjacent and paired registers.
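The split performed by the srl and srlx pair can be written out in C. The struct and function names are hypothetical; the assignment of halves follows the comments in the assembly (lower 32 bits into %o1, upper 32 bits into %o0).

```c
#include <stdint.h>

/* Hypothetical model of the %o0/%o1 register pair seen by an ILP32 caller. */
struct rval_pair_model {
	uint32_t	o0;	/* upper 32 bits of the 64-bit result */
	uint32_t	o1;	/* lower 32 bits of the 64-bit result */
};

struct rval_pair_model
split_rval64(uint64_t rval)
{
	struct rval_pair_model p;

	p.o1 = (uint32_t)rval;		/* srl  %o0, 0, %o1  */
	p.o0 = (uint32_t)(rval >> 32);	/* srlx %o0, 32, %o0 */
	return (p);
}
```

The order matters in the assembly as well: the lower half is copied out first, because the srlx overwrites %o0.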

    !
    ! Check for post-syscall processing.
    ! This tests all members of the union containing t_astflag, t_post_sys,
    ! and t_sig_check with one test.
    !
    ld      [THREAD_REG + T_POST_SYS_AST], %g1
    tst     %g1                             ! need post-processing?
    bnz,pn  %icc, _syscall_post32           ! yes - post_syscall or AST set
    mov     LWP_USER, %g1
    stb     %g1, [%l2 + LWP_STATE]          ! set lwp_state
    stx     %o0, [%l1 + O0_OFF]             ! set rp->r_o0
    stx     %o1, [%l1 + O1_OFF]             ! set rp->r_o1
    clrh    [THREAD_REG + T_SYSNUM]         ! clear syscall code
    ldx     [%l1 + TSTATE_OFF], %g1         ! get saved tstate
    ldx     [%l1 + nPC_OFF], %g2            ! get saved npc (new pc)
    mov     CCR_IC, %g3
    sllx    %g3, TSTATE_CCR_SHIFT, %g3
    add     %g2, 4, %g4                     ! calc new npc
    andn    %g1, %g3, %g1                   ! clear carry bit for no error
    stx     %g2, [%l1 + PC_OFF]
    stx     %g4, [%l1 + nPC_OFF]
    stx     %g1, [%l1 + TSTATE_OFF]
                                            See usr/src/uts/sparc/v9/ml/syscall_trap.s


If post-syscall processing is required, the code branches to _syscall_post32, which calls post_syscall, and then "returns" by jumping to the return address passed by sys_trap (which is always user_rtt for syscalls). If post-syscall processing is not required, then the code changes the lwp_state back to LWP_USER and saves the return value (possibly in two registers as above) in the regs structure, clears the curthread->t_sysnum since a system call is no longer executing, and steps the PC and nPC values so that the RETRY instruction at the end of user_rtt, to which the code is about to "return," does not simply reexecute the ta instruction.
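The bookkeeping on the saved registers in the no-post-processing path can be modeled in C as follows. This is a sketch under stated assumptions: the struct is a hypothetical three-field stand-in for struct regs, and the two constants (carry bit 0x1 within the condition codes, condition codes starting at bit 32 of the saved %tstate) are assumed values chosen to mirror CCR_IC and TSTATE_CCR_SHIFT rather than values taken from the kernel headers.

```c
#include <stdint.h>

#define	CCR_IC_MODEL		0x1ULL	/* assumed: carry bit within CCR */
#define	TSTATE_CCR_SHIFT_MODEL	32	/* assumed: CCR position in %tstate */

/* Hypothetical stand-in for the saved-register fields used here. */
struct regs_model {
	uint64_t	r_tstate;
	uint64_t	r_pc;
	uint64_t	r_npc;
};

/*
 * Model of the successful return path: clear the carry bit to signal
 * "no error" and step PC/nPC past the trapping ta instruction.
 */
void
syscall_return_ok_model(struct regs_model *rp)
{
	uint64_t npc = rp->r_npc;

	rp->r_tstate &= ~(CCR_IC_MODEL << TSTATE_CCR_SHIFT_MODEL);
	rp->r_pc = npc;		/* resume at the instruction after ta */
	rp->r_npc = npc + 4;	/* and the one following it */
}
```

Without the step, the saved PC would still point at the ta instruction, and the return to userland would trap straight back into the kernel.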

    ! fast path outbound microstate accounting call
    mov     LMS_SYSTEM, %o0
    call    syscall_mstate
    mov     LMS_USER, %o1
    jmp     %l0 + 8
    nop
                                            See usr/src/uts/sparc/v9/ml/syscall_trap.s


The code then captures the transition of the thread state from system to user for microstate accounting and returns through user_rtt as arranged by sys_trap. user_rtt's task is to get us back out of the kernel to resume at the instruction indicated in %tstate (for which the PC and nPC were stepped) and continue execution in userland.

Once a system call has completed, a value is returned to the calling program. The programmer must ensure that return values are checked before execution continues. System calls generally return a minus one (-1) value if they could not complete for some reason and set a system-defined error number (errno) that provides additional information about why the system call failed.
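A minimal sketch of this checking pattern in C follows, using open(2) as the example call. The helper name open_errno and the paths used to exercise it are hypothetical; the point is that errno is read immediately after the failing call, before any other library call can overwrite it.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to open a file read-only. Returns 0 on
 * success, or the errno value set by the failed open() call.
 */
int
open_errno(const char *path)
{
	int fd = open(path, O_RDONLY);

	if (fd == -1)
		return (errno);	/* capture errno right after the failure */
	(void) close(fd);
	return (0);
}
```

For a path that does not exist, the helper is expected to return ENOENT, which the caller can pass to strerror() or compare against the values documented for the call in section 2 of the man pages.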

The equivalent code for system calls on x64 platforms can be found in usr/src/uts/i86pc/ml. The source files syscall_asm.s and syscall_asm_amd64.s contain the assembly language code that handles the system call entry point, register setup, state transition, etc. The code is actually fairly well documented by comments. However, as with SPARC code, some knowledge of x64 assembler and hardware register use will help.




Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture (2nd Edition)
ISBN: 0131482092
EAN: 2147483647
Year: 2004
Pages: 244
