4.10. Page Fault

Throughout its lifespan, a process might attempt to access an address that belongs to its address space but is not loaded in RAM. It might alternatively access a page that is in RAM, but act on it in a way that violates the page's permission settings (for example, writing in a read-only area). When either of these happens, the system generates a page fault. The page fault is an exception that the hardware raises and the kernel traps; the kernel's page fault handler then manages the error in the program's page access, fetching the missing page from storage and allocating it when the access turns out to be legitimate.
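The second case can be demonstrated from user space. The following sketch (our illustration, not from the kernel sources) maps a page read-only and then writes to it; the write raises a protection page fault, which the kernel resolves by delivering a SIGSEGV signal to the process:

 ----------------------------------------------------------------------------- 
 (illustrative user-space sketch, not from the kernel sources)
 #include <signal.h>
 #include <stdlib.h>
 #include <sys/mman.h>
 #include <unistd.h>

 static void segv_handler(int sig)
 {
     /* The write below faulted: the page fault handler found a
      * permission violation and delivered SIGSEGV to this process. */
     static const char msg[] = "caught SIGSEGV: protection page fault\n";
     (void)sig;
     write(STDOUT_FILENO, msg, sizeof(msg) - 1);
     _exit(0);
 }

 int main(void)
 {
     char *p;

     signal(SIGSEGV, segv_handler);

     /* Map one anonymous page with read-only permission. */
     p = mmap(NULL, 4096, PROT_READ,
              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
     if (p == MAP_FAILED)
         return 1;

     p[0] = 'x';    /* write to a read-only page: protection fault */
     return 0;
 }
 ----------------------------------------------------------------------------- 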

Each architecture has an architecture-dependent function that handles page faults. Both x86 and PPC call the function do_page_fault(). The x86 page fault handler do_page_fault(*regs, error_code) is located in /arch/i386/mm/fault.c. The PowerPC page fault handler do_page_fault(*regs, address, error_code) is located in /arch/ppc/mm/fault.c. The similarities are close enough that a discussion of do_page_fault() for the x86 covers the functionality of the PowerPC version.

The major difference in how the two architectures handle the page fault is in how the fault information is gathered and stored before do_page_fault() is called. We first explain the specifics of the x86 page fault handling and proceed to explain the do_page_fault() function. We follow this explanation by highlighting the differences seen in PowerPC.

4.10.1. x86 Page Fault Exception

The x86 page fault handler do_page_fault() is called as the result of hardware interrupt 14. This interrupt occurs when the processor finds either of the following conditions to be true:

  1. Paging is enabled, and the present bit is clear in the page-directory or page-table entry needed for this address.

  2. Paging is enabled, and the current privilege level is less than that needed to access the requested page.

Upon raising this interrupt, the processor saves two valuable pieces of information:

  1. The nature of the error in the lower 4 bits of a word pushed on the stack. (Bit 3 is not used by do_page_fault().) Table 4.7 shows what each bit value corresponds to.

    Table 4.7. Page Fault error_code

                  Bit 2     Bit 1     Bit 0
    Value = 0     Kernel    Read      Page not present
    Value = 1     User      Write     Protection fault


  2. The 32-bit linear address that caused the exception in cr2.

The regs parameter of do_page_fault() points to a struct containing the processor registers saved at the time of the fault, and the low three bits of the error_code parameter describe the source of the fault, as shown in Table 4.7.
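To make the encoding concrete, the following sketch (our illustration; this helper is not part of fault.c) decodes an error_code value according to Table 4.7:

 ----------------------------------------------------------------------------- 
 (illustrative sketch, not part of fault.c)
 /* Decode the x86 page fault error code bits from Table 4.7. */
 static void decode_error_code(unsigned long error_code)
 {
     int protection = error_code & 1;  /* bit 0: 0 = page not present, 1 = protection fault */
     int is_write   = error_code & 2;  /* bit 1: 0 = read, 1 = write */
     int user_mode  = error_code & 4;  /* bit 2: 0 = kernel mode, 1 = user mode */

     printk("%s-mode %s access, %s\n",
            user_mode  ? "user"  : "kernel",
            is_write   ? "write" : "read",
            protection ? "protection fault" : "page not present");
 }
 ----------------------------------------------------------------------------- 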

4.10.2. Page Fault Handler

For both architectures, the do_page_fault() function uses the just-given information and takes one of several actions. These code segments follow a fairly complicated series of checks to end up with one of the following:

  • The offending address being found by handle_mm_fault()

  • The famous oops dump (no_context:); bad_page_fault() for PowerPC

  • A segmentation fault (bad_area:); bad_page_fault() for PowerPC

  • An error returned to the caller (fixup)

-----------------------------------------------------------------------------
arch/i386/mm/fault.c
212  asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code)
213  {
214   struct task_struct *tsk;
215   struct mm_struct *mm;
216   struct vm_area_struct * vma;
217   unsigned long address;
218   unsigned long page;
219   int write;
220   siginfo_t info;
221
222   /* get the address */
223   __asm__("movl %%cr2,%0":"=r" (address));
...
232   tsk = current;
233
234   info.si_code = SEGV_MAPERR;
-----------------------------------------------------------------------------

Line 223

The address at which the page fault occurred is stored in the cr2 control register. The linear address is read and the local variable address is set to hold the value.

Line 232

The task_struct pointer tsk is set to point at current, the task_struct of the running task.

Now, we are ready to find out more about where the address that generated the page fault comes from. Figure 4.14 illustrates the flow of the following lines of code:

-----------------------------------------------------------------------------
arch/i386/mm/fault.c
246  if (unlikely(address >= TASK_SIZE)) {
247   if (!(error_code & 5))
248    goto vmalloc_fault;
...
253   goto bad_area_nosemaphore;
254  }
...
257  mm = tsk->mm;
...
-----------------------------------------------------------------------------

Figure 4.14. Page Fault I


Lines 246-248

This code checks whether the address at which the page fault occurred is in kernel module space (that is, in a noncontiguous memory area); such addresses have a linear address >= TASK_SIZE. If so, the handler checks whether bits 0 and 2 of error_code are both clear. Recall from Table 4.7 that this indicates the error was caused by a kernel-mode attempt to access a page that is not present. In that case, the code at label vmalloc_fault: is executed.

Line 253

If we get here, it means that although the access touched a noncontiguous memory area address, it either occurred in user mode, hit a protection fault, or both. In this case, we jump to the label bad_area_nosemaphore:.

Line 257

This sets the local variable mm to point to the current task's memory descriptor. If the current task is a kernel thread, this value is NULL. This becomes significant in the next code lines.
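The reason the pointer can be NULL is that a kernel thread owns no user address space of its own; it runs on a borrowed one. A minimal sketch of the idea (our illustration, not from fault.c):

 ----------------------------------------------------------------------------- 
 (illustrative sketch, not from fault.c)
 /* A kernel thread owns no user address space, so its mm pointer is
  * NULL; the address space it is currently borrowing is recorded in
  * active_mm instead. */
 struct mm_struct *mm = tsk->mm;          /* NULL for a kernel thread */

 if (!mm)
     mm = tsk->active_mm;                 /* the borrowed address space */
 ----------------------------------------------------------------------------- 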

At this point, we have determined that the page fault did not occur in a noncontiguous memory area. Again, Figure 4.15 illustrates the flow of the following lines of code:

 ----------------------------------------------------------------------------- arch/i386/mm/fault.c ... 262  if (in_atomic() || !mm) 263   goto bad_area_nosemaphore; 264 265  down_read(&mm->mmap_sem); 266 267  vma = find_vma(mm, address); 268  if (!vma) 269   goto bad_area; 270  if (vma->vm_start <= address) 271   goto good_area; 272  if (!(vma->vm_flags & VM_GROWSDOWN)) 273   goto bad_area; 274  if (error_code & 4) { ... 281   if (address + 32 < regs->esp) 282    goto bad_area; 283  } 284  if (expand_stack(vma, address)) 285   goto bad_area; ... ----------------------------------------------------------------------------- 

Figure 4.15. Page Fault II


Lines 262-263

In this code block, we check whether the fault occurred while executing within an interrupt handler (or other atomic context) or within a kernel thread, which has no user address space (mm is NULL). If it did, we jump to label bad_area_nosemaphore:.

Line 265

At this point, we are about to search through the memory areas of the current process, so we set a read lock on the memory descriptor's semaphore.

Lines 267-269

Given that, at this point, we know the fault did not occur in a kernel thread or in an interrupt handler, we search the address space of the process to see if the address is in one of its memory areas. If it is not there, we jump to label bad_area:.

Lines 270-271

If the region that find_vma() returned starts at or below the faulting address, the address lies within a valid region of the process address space and we jump to label good_area:.

Lines 272-273

Otherwise, the address falls below the region that was found. We check whether that region is allowed to grow downward (VM_GROWSDOWN), as a stack region is. If it is not, we jump to the label bad_area:.

Lines 274-284

Otherwise, the offending address might be the result of a stack operation. For a fault raised in user mode, we first verify that the address is not unreasonably far below the stack pointer (more than 32 bytes below regs->esp) and then try to expand the stack. If expanding the stack does not help, we jump to the label bad_area:.

Now, we proceed to explain what each of the label jump points do. We begin with the label vmalloc_fault, which is illustrated in Figure 4.16:

-----------------------------------------------------------------------------
arch/i386/mm/fault.c
473  vmalloc_fault:
     {
      int index = pgd_index(address);
      pgd_t *pgd, *pgd_k;
      pmd_t *pmd, *pmd_k;
      pte_t *pte_k;

      asm("movl %%cr3,%0":"=r" (pgd));
      pgd = index + (pgd_t *)__va(pgd);
      pgd_k = init_mm.pgd + index;

491   if (!pgd_present(*pgd_k))
       goto no_context;

      pmd = pmd_offset(pgd, address);
      pmd_k = pmd_offset(pgd_k, address);
      if (!pmd_present(*pmd_k))
       goto no_context;
      set_pmd(pmd, *pmd_k);

      pte_k = pte_offset_kernel(pmd_k, address);
506   if (!pte_present(*pte_k))
507    goto no_context;
508   return;
509  }
-----------------------------------------------------------------------------

Figure 4.16. Label vmalloc_fault


Lines 473-509

The current process's Page Global Directory is referenced (by way of cr3) and saved in the variable pgd, and the kernel Page Global Directory is referenced by pgd_k (likewise for the pmd and pte variables). If the offending address is not valid in the kernel paging system, the code jumps to the no_context: label. Otherwise, the kernel's page middle directory entry is copied into the process's page tables with set_pmd(), so the process can resolve the kernel address.

Now, we look at the label good_area:. At this point, we know that the memory area holding the offending address exists within the address space of the process. Now, we need to ensure that the access permissions were correct. Figure 4.17 shows the flow diagram:

-----------------------------------------------------------------------------
arch/i386/mm/fault.c
290  good_area:
291   info.si_code = SEGV_ACCERR;
292   write = 0;
293   switch (error_code & 3) {
294    default:  /* 3: write, present */
...
       /* fall through */
300    case 2:   /* write, not present */
301     if (!(vma->vm_flags & VM_WRITE))
302      goto bad_area;
303     write++;
304     break;
305    case 1:   /* read, present */
306     goto bad_area;
307    case 0:   /* read, not present */
308     if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
309      goto bad_area;
310   }
-----------------------------------------------------------------------------

Figure 4.17. Label good_area


Lines 294-304

If the page fault was caused by a memory access that was a write (recall from Table 4.7 that in this case bit 1 of the error code is set), we check whether our memory area is writable. If it is not, we have a mismatch of permissions and we jump to the label bad_area:. If it is writable, we fall through the case statement and eventually proceed to handle_mm_fault() with the local variable write set to 1.

Lines 305-309

If the page fault was caused by a read or execute access and the page is present, we jump to the label bad_area: because this constitutes a clear permissions violation. If the page is not present, we check to see if the memory area has read or execute permissions. If it does not, we jump to the label bad_area: because even if we were to fetch the page, the permissions would not allow the operation. If it does, we fall out of the case statement and eventually proceed to handle_mm_fault() with the local variable write set to 0.

The following label marks the code we fall through to when the permission checks come out OK. It is appropriately labeled survive:.

-----------------------------------------------------------------------------
arch/i386/mm/fault.c
     survive:
318   switch (handle_mm_fault(mm, vma, address, write)) {
      case VM_FAULT_MINOR:
       tsk->min_flt++;
       break;
      case VM_FAULT_MAJOR:
       tsk->maj_flt++;
       break;
      case VM_FAULT_SIGBUS:
       goto do_sigbus;
      case VM_FAULT_OOM:
       goto out_of_memory;
329   default:
       BUG();
     }
-----------------------------------------------------------------------------

Lines 318-329

The function handle_mm_fault() is called with the current memory descriptor (mm), the descriptor to the offending address' area, the offending address, and whether the access was a read/execute or write. The switch statement catches us if we fail at handling the fault, which ensures we exit gracefully.

The following code snippet describes the flow of the labels bad_area: and bad_area_nosemaphore:. When we jump to this point, we know that one of the following is true:

  1. The address generating the page fault is not in the process address space because we've searched its memory areas and did not find one that matched.

  2. The address generating the page fault is not in the process address space and the region that would contain it cannot grow to hold it.

  3. The address generating the page fault is in the process address space but the permissions of the memory area did not match the action we wanted to perform.

Now, we need to determine whether the access came from within kernel mode. The following code and Figure 4.18 illustrate the flow of these labels:

-----------------------------------------------------------------------------
arch/i386/mm/fault.c
348  bad_area:
349   up_read(&mm->mmap_sem);
350
351  bad_area_nosemaphore:
352   /* User mode accesses just cause a SIGSEGV */
353   if (error_code & 4) {
354    if (is_prefetch(regs, address))
355     return;
356
357    tsk->thread.cr2 = address;
358    tsk->thread.error_code = error_code;
359    tsk->thread.trap_no = 14;
360    info.si_signo = SIGSEGV;
361    info.si_errno = 0;
362    /* info.si_code has been set above */
363    info.si_addr = (void *)address;
364    force_sig_info(SIGSEGV, &info, tsk);
365    return;
366   }
-----------------------------------------------------------------------------

Figure 4.18. Label bad_area


Line 348

The function up_read() releases the read lock on the semaphore of the process's memory descriptor. Notice that we jump to the label bad_area: only after we have taken a read lock on the memory descriptor's semaphore to look through its memory areas and see whether our address was within the process address space. Otherwise, we jump to the label bad_area_nosemaphore:. The only difference between the two is the lifting of the read lock on the semaphore.

Lines 351-353

Because the address is not valid in the address space, we now check to see whether the error was generated in user mode. Recall from Table 4.7 that bit 2 of the error code (the value tested by error_code & 4) is set when the fault occurred in user mode.

Lines 354-366

We have determined that the error occurred in user mode, so we record the fault information in the task's thread structure (including trap number 14, the page fault trap) and send the task a SIGSEGV signal.

The following code snippet describes the flow of the label no_context. When we jump to this point, we know that either

  • One of the page tables is missing.

  • The memory access was made while in kernel mode (it did not come from user mode).

Figure 4.19 illustrates the flow diagram of the label no_context:

-----------------------------------------------------------------------------
arch/i386/mm/fault.c
388  no_context:
390   if (fixup_exception(regs))
      return;
...
432   die("Oops", regs, error_code);
      bust_spinlocks(0);
      do_exit(SIGKILL);
-----------------------------------------------------------------------------

Figure 4.19. Label no_context


Line 390

The function fixup_exception() uses the instruction pointer passed in (regs->eip) to search an exception table for the offending instruction. If the instruction is in the table, it must have already been compiled with "hidden" fault handling code built in. The page fault handler, do_page_fault(), uses the fault handling code as a return address and jumps to it. The code can then flag an error.

Line 432

If there is not an entry in the exception table for the offending instruction, the code that jumped to label no_context ends up with the oops screen dump.
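Conceptually, the exception table is a list of pairs mapping the address of a faulting instruction to the address of its fixup code. The following simplified sketch (our illustration; the real kernel keeps the table sorted and searches it more efficiently) shows the idea behind the lookup:

 ----------------------------------------------------------------------------- 
 (illustrative sketch, not the kernel's implementation)
 struct exception_table_entry {
     unsigned long insn;     /* address of the faulting instruction */
     unsigned long fixup;    /* address of its fault handling code  */
 };

 static int fixup_exception_sketch(struct pt_regs *regs,
                                   const struct exception_table_entry *table,
                                   int nentries)
 {
     int i;

     for (i = 0; i < nentries; i++) {
         if (table[i].insn == regs->eip) {
             /* Resume execution at the fixup code, which reports
              * the error to the caller instead of oopsing. */
             regs->eip = table[i].fixup;
             return 1;
         }
     }
     return 0;   /* no fixup entry: fall through to the oops */
 }
 ----------------------------------------------------------------------------- 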

4.10.3. PowerPC Page Fault Exception

The PowerPC page fault handler do_page_fault() is called as a result of an instruction storage or data storage exception. Because of the subtle differences between the various versions of the PowerPC processors, the error codes are in a slightly different format, but they yield similar information. The bits of interest are whether the offending operation was a read or a write, and whether it was a protection fault. The PowerPC page fault handler do_page_fault() does not itself initiate the oops error.

In PowerPC, the label no_context code is combined with the label bad_area code and placed in a function called bad_page_fault(), which ends by producing a segmentation fault. This function also performs the fixup handling that traverses the exception table.



