3.8. Asynchronous Execution Flow

We mentioned that processes can transition from one state to another by means of interrupts, for instance, going from TASK_INTERRUPTIBLE to TASK_RUNNING. One way this happens is through asynchronous execution, which includes exceptions and interrupts. We have also mentioned that processes move in and out of user and kernel mode. We now describe how exceptions work and follow with an explanation of how interrupts work.

3.8.1. Exceptions

Exceptions, also known as synchronous interrupts, are events that occur entirely within the processor's hardware. These events are synchronous to the execution of the processor; that is, they occur not during but after the execution of a code instruction. Examples of processor exceptions include the referencing of a virtual memory location that is not physically present (known as a page fault) and a calculation that results in a divide by 0. The important thing to note with exceptions (sometimes called soft irqs, not to be confused with the softirq bottom-half mechanism listed in Table 3.8) is that they typically happen after an instruction's execution. This differentiates them from external or asynchronous events, which are discussed later in Section 3.8.2, "Interrupts."

Most modern processors (the x86 and the PPC included) allow the programmer to initiate an exception by executing certain instructions. These instructions can be thought of as hardware-assisted subroutine calls. An example of this is the system call.

3.8.1.1. System Calls

Linux provides user mode programs with entry points into the kernel by which services or hardware access can be requested from the kernel. These entry points are standardized and predefined in the kernel. Many of the C library routines available to user mode programs, such as the fork() function in Figure 3.9, bundle code and one or more system calls to accomplish a single function. When a user process calls one of these functions, certain values are placed into the appropriate processor registers and a software interrupt is generated. This software interrupt then calls the kernel entry point. Although not recommended, system calls (syscalls) can also be accessed from kernel code. Where a syscall should be called from is the source of some discussion because syscalls called from the kernel can perform better. This gain in performance must be weighed against the added complexity and reduced maintainability of the code. In this section, we explore the "traditional" syscall implementation where syscalls are called from user space.

Syscalls have the ability to move data between user space and kernel space. Two functions are provided for this purpose: copy_to_user() and copy_from_user(). As in all kernel programming, validation (of pointers, lengths, descriptors, and permissions) is critical when moving data. These functions have the validation built in. Interestingly, they return the number of bytes not transferred.

By its nature, the implementation of the syscall is hardware specific. Traditionally, with Intel architecture, all syscalls have used software interrupt 0x80.^[5]

^[5] In an effort to gain in performance with the newer (PIV+) Intel processors, work has been done with the implementation of vsyscalls. vsyscalls are based on calls to user space memory (in particular, a "vsyscall" page) and use the faster sysenter and sysexit instructions (when available) over the traditional int 0x80 call. Similar performance work is also being pursued on many PPC implementations.

Parameters of the syscall are passed in the general registers with the unique syscall number in %eax. The implementation of the system call on the x86 architecture limits the number of parameters to 5. If more than 5 are required, a pointer to a block of parameters can be passed. Upon execution of the assembler instruction int 0x80, a specific kernel mode routine is called by way of the exception-handling capabilities of the processor. Let's look at an example of how a system call entry is initialized:

 set_system_gate(SYSCALL_VECTOR,&system_call);

This macro creates a user privilege descriptor at entry 128 (SYSCALL_VECTOR), which points to the address of the syscall handler in entry.S (system_call).

As we see in the next section on interrupts, PPC interrupt routines are "anchored" to certain memory locations; the external interrupt handler is anchored to address 0x500, the system timer is anchored to address 0x900, and so on. The system call instruction sc vectors to address 0xc00. Let's explore the code segment from head.S where the handler is set for the PPC system call:

-----------------------------------------------------------------------
arch/ppc/kernel/head.S
484  /* System call */
485   . = 0xc00
486  SystemCall:
487   EXCEPTION_PROLOG
488   EXC_XFER_EE_LITE(0xc00, DoSyscall)
-----------------------------------------------------------------------

Line 485

The anchoring of the address. This line tells the loader that the next instruction is located at address 0xc00. Because labels follow similar rules, the label SystemCall along with the first line of code in the macro EXCEPTION_PROLOG both start at address 0xc00.

Line 488

This macro dispatches the DoSyscall() handler.

For both architectures, the syscall number and any parameters are stored in the processor's registers.

When the x86 exception handler processes the int 0x80, it indexes into the system call table. The file arch/i386/kernel/entry.S contains low-level interrupt handling routines and the system call table, sys_call_table. Likewise for the PPC, the syscall low-level routines are in arch/ppc/kernel/entry.S and the sys_call_table is in arch/ppc/kernel/misc.S.

The syscall table is an assembly code implementation of an array in C with 4-byte entries. Each entry is initialized to the address of a function. By convention, we must prepend the name of our function with "sys_." Because the position in the table determines the syscall number, we must add the name of our function to the end of the list. Even with different assembly languages, the tables are nearly identical between the architectures. However, the PPC table has only 255 entries at the time of this writing, while the x86 table has 275.

The files include/asm-i386/unistd.h and include/asm-ppc/unistd.h associate the system calls with their positional numbers in the sys_call_table. The "sys_" is replaced with a "__NR_" in these files. These files also have macros to assist the user program in loading the registers with parameters. (See the assembly programming section in Chapter 2, "Exploration Toolkit," for a crash course in C and assembler variables and inline assembler.)

Let's look at how we would add a system call named sys_ourcall. The system call must be added to the sys_call_table. The addition of our system call into the x86 sys_call_table is shown here:

-----------------------------------------------------------------------
arch/i386/kernel/entry.S
607  .data
608  ENTRY(sys_call_table)
609  .long sys_restart_syscall /* 0 - old "setup()" system call, used for restarting*/
...
878  .long sys_tgkill   /* 270 */
879  .long sys_utimes
880  .long sys_fadvise64_64
881  .long sys_ni_syscall   /* sys_vserver */
882  .long sys_ourcall   /* our syscall will be 274 */
883
884  nr_syscalls=(.-sys_call_table)/4
-----------------------------------------------------------------------

In x86, our system call would be number 274. If we were to add a syscall named sys_ourcall in PPC, the entry would be number 255. Here, we show how the association of our system call with its positional number would look in the x86 header, include/asm-i386/unistd.h, where __NR_ourcall is entry number 274 at the end of the table:

-----------------------------------------------------------------------
include/asm-i386/unistd.h
/*
 * This file contains the system call numbers.
 */
#define __NR_restart_syscall  0
#define __NR_exit   1
#define __NR_fork   2
...
#define __NR_utimes   271
#define __NR_fadvise64_64  272
#define __NR_vserver  273
#define __NR_ourcall   274
/* #define NR_syscalls 274  this is the old value before our syscall */
#define NR_syscalls 275
-----------------------------------------------------------------------

The next section discusses interrupts and the hardware that alerts the kernel to the need to handle them. Where exceptions as a group differ is in what their handlers do once called. Although exceptions travel the same route as interrupts at handling time, exceptions tend to send signals back to the current process rather than work with hardware devices.

3.8.2. Interrupts

Interrupts are asynchronous to the execution of the processor, which means that interrupts can happen in between instructions. The processor receives notification of an interrupt by way of an external signal to one of its pins (INTR or NMI). This signal comes from a hardware device called an interrupt controller. Interrupts and interrupt controllers are hardware and system specific. From architecture to architecture, many differences exist in how interrupt controllers are designed. This section touches on the major hardware differences and functions, tracing the kernel code from the architecture-independent to the architecture-dependent parts.

An interrupt controller is needed because the processor must be in communication with one of several peripheral devices at any given moment. Older x86 computers used a cascaded pair of Intel 8259 interrupt controllers configured in such a way^[6] that the processor was able to discern between 15 discrete interrupt lines (IRQ) (see Figure 3.16). When the interrupt controller has a pending interrupt (for example, when you press a key), it asserts its INT line, which is connected to the processor. The processor then acknowledges this signal by asserting its acknowledge line connected to the INTA line on the interrupt controller. At this moment, the interrupt controller transfers the IRQ data to the processor. This sequence is known as an interrupt-acknowledge cycle.

^[6] An IRQ from the first 8259 (usually IRQ2) is connected to the output of the second 8259.

Figure 3.16. Cascaded Interrupt Controllers

Newer x86 processors have a local Advanced Programmable Interrupt Controller (APIC). The local APIC (which is built into the processor package) receives interrupt signals from the following:

Processor's interrupt pins (LINT0, LINT1)
Internal timer
Internal performance monitor
Internal temperature sensor
Internal APIC error
Another processor (inter-processor interrupts)
An external I/O APIC (via an APIC bus on multiprocessor systems)

After the APIC receives an interrupt signal, it routes the signal to the processor core (internal to the processor package). The I/O APIC shown in Figure 3.17 is part of a processor chipset and is designed to receive 24 programmable interrupt inputs.

Figure 3.17. I/O APIC

The x86 processors with local APIC can also be configured with 8259 type interrupt controllers instead of the I/O APIC architecture (or the I/O APIC can be configured to interface to an 8259 controller). To find out if a system is using the I/O APIC architecture, enter the following on the command line:

 lkp:~# cat /proc/interrupts

If you see I/O-APIC listed, it is in use. Otherwise, you see XT-PIC, which means it is using the 8259 type architecture.

The PowerPC interrupt controllers for the Power Mac G4 and G5 are integrated into the Key Largo and K2 I/O controllers. Entering this on the command line:

 lkp:~# cat /proc/interrupts

on a G4 machine yields OpenPIC, which is an Open Programmable Interrupt Controller standard initiated by AMD and Cyrix in 1995 for multiprocessor systems. MPIC is the IBM implementation of OpenPIC, and is used in several of their CHRP designs. Old-world Apple machines had an in-house interrupt controller and, for the 4xx embedded processors, the interrupt controller core is integrated into the ASIC chip.

Now that we have had the necessary discussion of how, why, and when interrupts are delivered to the kernel by the hardware, we can analyze a real-world example of the kernel handling the Hardware System Timer interrupt and expand on where the interrupt is delivered. As we go through the System Timer code, we see that at interrupt time, the hardware-to-software interface is implemented in both the x86 and PPC architectures with jump tables that select the proper handler code for a given interrupt.

Each interrupt of the x86 architecture is assigned a unique number or vector. At interrupt time, this vector is used to index into the Interrupt Descriptor Table (IDT). (See the Intel Programmer's Reference for the format of the x86 gate descriptor.) The IDT allows the hardware to assist the software with address resolution and privilege checking of handler code at interrupt time. The PPC architecture is somewhat different in that it uses an interrupt table created at compile time to execute the proper interrupt handler. (Later in this section, there is more on the software aspects of initialization and use of the jump tables, when we compare x86 and PPC interrupt handling for the system timer.) The next section discusses interrupt handlers and their implementation. We follow that with a discussion of the system timer as an example of the Linux implementation of interrupts and their associated handlers.

We now talk about the different kinds of interrupt handlers.

3.8.2.1. Interrupt Handlers

Interrupt and exception handlers look much like regular C functions. They may (and often do) call hardware-specific assembly routines. Linux interrupt handlers are broken into a high-performance top half and a low-performance bottom half:

Top half. Must execute as quickly as possible. Top-half handlers, depending on how they are registered, can run with all local (to a given processor) interrupts disabled (a fast handler). Code in a top-half handler needs to be limited to responding directly to the hardware and/or performing time-critical tasks. To remain in the top-half handler for a prolonged period of time could significantly impact system performance. To keep performance high and latency (which is the time it takes to acknowledge a device) low, the bottom-half architecture was introduced.
Bottom half. Allows the handler writer to delay the less critical work until the kernel has more time.^[7] Remember, the interrupt came in asynchronously with the execution of the system; the kernel might have been doing something more time critical at that moment. With the bottom-half architecture, the handler writer can have the kernel run the less critical handler code at a later time.
^[7] Earlier versions of Linux used a top-half/bottom-half handler for the system timer. It has since been rewritten to be a high-performance top half only.

Table 3.8 illustrates the four most common methods of bottom-half interrupt handling.

Table 3.8. Bottom-Half Interrupt Handling Methods
"Old" bottom halves	These pre-SMP handlers are being phased out because of the fact that only one bottom half can run at a time regardless of the number of processors. This system has been removed in the 2.6 kernel and is mentioned only for reference.
Work queues	The top-half code is said to run in interrupt context, which means it is not associated with any process. With no process association, the code cannot sleep or block. Work queues run in process context and have the abilities of any kernel thread. Work queues have a rich set of functions for creation, scheduling, canceling, and so on. For more information on work queues, see the "Work Queues and Interrupts" section in Chapter 10.
Softirqs	Softirqs run in interrupt context and are similar to bottom halves except that softirqs of the same type can run on multiple processors simultaneously. Only 32 softirqs are available in the system. The system timer uses softirqs.
Tasklets	Similar to softirqs except that no limit exists. All tasklets are funneled through one softirq, and the same tasklet cannot run simultaneously on multiple processors. The tasklet interface is simpler to use and implement compared to softirqs.

3.8.2.2. IRQ Structures

Three main structures contain all the information related to IRQs: irq_desc_t, irqaction, and hw_interrupt_type. Figure 3.18 illustrates how they interrelate.

Figure 3.18. IRQ Structures

Struct irq_desc_t

The irq_desc_t structure is the primary IRQ descriptor. irq_desc_t structures are stored in a globally accessible array of size NR_IRQS (whose value is architecture dependent) called irq_desc.

-----------------------------------------------------------------------
include/linux/irq.h
60  typedef struct irq_desc {
61   unsigned int status;   /* IRQ status */
62   hw_irq_controller *handler;
63   struct irqaction *action; /* IRQ action list */
64   unsigned int depth;   /* nested irq disables */
65   unsigned int irq_count;  /* For detecting broken interrupts */
66   unsigned int irqs_unhandled;
67   spinlock_t lock;
68  } ____cacheline_aligned irq_desc_t;
69
70  extern irq_desc_t irq_desc [NR_IRQS];
-----------------------------------------------------------------------

Line 61

The value of the status field is determined by setting flags that describe the status of the IRQ line. Table 3.9 shows the flags.

Table 3.9. irq_desc_t->status Flags
Flag	Description
`IRQ_INPROGRESS`	Indicates that we are in the process of executing the handler for that IRQ line.
`IRQ_DISABLED`	Indicates that the IRQ is disabled by software so that its handler is not executed even if the physical line itself is enabled.
`IRQ_PENDING`	A middle state that indicates that the occurrence of the interrupt has been acknowledged, but the handler has not been executed.
`IRQ_REPLAY`	The previous IRQ has not been acknowledged.
`IRQ_AUTODETECT`	The state the IRQ line is set to when being probed.
`IRQ_WAITING`	Used when probing.
`IRQ_LEVEL`	The IRQ is level triggered as opposed to edge triggered.
`IRQ_MASKED`	This flag is unused in the kernel code.
`IRQ_PER_CPU`	Used to indicate that the IRQ line is local to the CPU calling.

Line 62

The handler field is a pointer to the hw_irq_controller. The hw_irq_controller is a typedef for the hw_interrupt_type structure, which is the interrupt controller descriptor used to describe low-level hardware.

Line 63

The action field holds a pointer to the irqaction struct. This structure, described later in more detail, keeps track of the interrupt handler routine to be executed when the IRQ is enabled.

Line 64

The depth field is a counter of nested IRQ disables. The IRQ_DISABLED flag is cleared only when the value of this field is 0.

Lines 65-66

The irq_count field, along with the irqs_unhandled field, identifies IRQs that might be stuck. They are used in x86 and PPC64 in the function note_interrupt() (arch/<arch>/kernel/irq.c).

Line 67

The lock field holds the spinlock to the descriptor.

Struct irqaction

The kernel uses the irqaction struct to keep track of interrupt handlers and the association with the IRQ. Let's look at the structure and the fields we will view in later sections:

-----------------------------------------------------------------------
include/linux/interrupt.h
35  struct irqaction {
36   irqreturn_t (*handler)  (int, void *, struct pt_regs *);
37   unsigned long flags;
38   unsigned long mask;
39   const char *name;
40   void *dev_id;
41   struct irqaction *next;
42  };
-----------------------------------------------------------------------

Line 36

The field handler is a pointer to the interrupt handler that will be called when the interrupt is encountered.

Line 37

The flags field can hold flags such as SA_INTERRUPT, which indicates the interrupt handler will run with all interrupts disabled, or SA_SHIRQ, which indicates that the handler might share an IRQ line with another handler.

Line 39

The name field holds the name of the interrupt being registered.

Struct hw_interrupt_type

The hw_interrupt_type or hw_irq_controller structure contains all the data related to the system's interrupt controller. First, we look at the structure, and then we look at how it is implemented for a couple of interrupt controllers:

-----------------------------------------------------------------------
include/linux/irq.h
40  struct hw_interrupt_type {
41   const char * typename;
42   unsigned int (*startup)(unsigned int irq);
43   void (*shutdown)(unsigned int irq);
44   void (*enable)(unsigned int irq);
45   void (*disable)(unsigned int irq);
46   void (*ack)(unsigned int irq);
47   void (*end)(unsigned int irq);
48   void (*set_affinity)(unsigned int irq, cpumask_t dest);
49  };
-----------------------------------------------------------------------

Line 41

The typename holds the name of the Programmable Interrupt Controller (PIC). (PICs are discussed in detail later.)

Lines 42-48

These fields hold pointers to PIC-specific programming functions.

Now, let's look at our PPC controller. In this case, we look at the PowerMac's PIC:

-----------------------------------------------------------------------
arch/ppc/platforms/pmac_pic.c
170  struct hw_interrupt_type pmac_pic = {
171   " PMAC-PIC ",
172   NULL,
173   NULL,
174   pmac_unmask_irq,
175   pmac_mask_irq,
176   pmac_mask_and_ack_irq,
177   pmac_end_irq,
178   NULL
179  };
-----------------------------------------------------------------------

As you can see, the name of this PIC is PMAC-PIC, and it has four of the seven functions defined. The pmac_unmask_irq and the pmac_mask_irq functions enable and disable the IRQ line, respectively. The function pmac_mask_and_ack_irq acknowledges that an IRQ has been received, and pmac_end_irq takes care of cleaning up when we are done executing the interrupt handler.

-----------------------------------------------------------------------
arch/i386/kernel/i8259.c
59  static struct hw_interrupt_type i8259A_irq_type = {
60   "XT-PIC",
61   startup_8259A_irq,
62   shutdown_8259A_irq,
63   enable_8259A_irq,
64   disable_8259A_irq,
65   mask_and_ack_8259A,
66   end_8259A_irq,
67   NULL
68  };
-----------------------------------------------------------------------

The x86 8259 PIC is called XT-PIC, and it defines the first six functions. The first two, startup_8259A_irq and shutdown_8259A_irq, start up and shut down the actual IRQ line, respectively.

3.8.2.3. An Interrupt Example: System Timer

The system timer is the heartbeat for the operating system. The system timer and its interrupt are initialized during system initialization at boot-up time. The initialization of an interrupt at this time uses interfaces different from those used when an interrupt is registered at runtime. We point out these differences as we go through the example.

As more complex support chips are produced, the kernel designer has gained several options for the source of the system timer. The most common timer implementation for the x86 architecture is the Programmable Interval Timer (PIT) and, for the PowerPC, it is the decrementer.

The x86 architecture has historically implemented the PIT with the Intel 8254 timer. The 8254 is used as a 16-bit down counter, interrupting on terminal count. That is, a value is written to a register and the 8254 decrements this value until it gets to 0. At that moment, it activates an interrupt to the IRQ 0 input on the 8259 interrupt controller, which was previously mentioned in this section.

The system timer implemented in the PowerPC architecture is the decrementer clock, which is a 32-bit down counter that runs at the same frequency as the CPU. Similar to the 8254, it activates an interrupt at its terminal count. Unlike the Intel architecture, the decrementer is built in to the processor.

Every time the system timer counts down and activates an interrupt, it is known as a tick. The rate or frequency of this tick is set by the HZ variable.

HZ

HZ is a variation on the abbreviation for Hertz (Hz), named for Heinrich Hertz (1857-1894). One of the pioneers of radio, Hertz was able to prove Maxwell's theories on electricity and magnetism by inducing a spark in a wire loop. Marconi then built on these experiments, leading to modern radio. In honor of the man and his work, the fundamental unit of frequency is named after him; one cycle per second is equal to one Hertz.

HZ is defined in include/asm-xxx/param.h. Let's take a look at what these values are in our x86 and PPC.

-----------------------------------------------------------------------
include/asm-i386/param.h
005 #ifdef __KERNEL__
006 #define HZ 1000 /* internal kernel timer frequency */
-----------------------------------------------------------------------

-----------------------------------------------------------------------
include/asm-ppc/param.h
008 #ifdef __KERNEL__
009 #define HZ 100 /* internal kernel timer frequency */
-----------------------------------------------------------------------

The value of HZ has typically been 100 across most architectures, but as machines become faster, the tick rate has increased on certain models. Looking at the two main architectures we are using for this book, we can see (above) that the default tick rate is 1000 for x86 and 100 for PPC. The period of 1 tick is 1/HZ. Thus, the period (or time between interrupts) is 1 millisecond when HZ is 1000 and 10 milliseconds when HZ is 100. We can see that as the value of HZ goes up, we get more interrupts in a given amount of time. While this yields better resolution from the timekeeping functions, it is important to note that more of the processor time is spent answering the system timer interrupts in the kernel. Taken to an extreme, this could slow the system response to user mode programs. As with all interrupt handling, finding the right balance is key.

We now begin walking through the code with the initialization of the system timer and its associated interrupts. The handler for the system timer is installed near the end of kernel initialization; we pick up the code segments as start_kernel(), the primary initialization function executed at system boot time, first calls trap_init(), then init_IRQ(), and finally time_init():

-----------------------------------------------------------------------
init/main.c
386  asmlinkage void __init start_kernel(void)
387  {
...
413  trap_init();
...
415  init_IRQ();
...
419  time_init();
...
  }
-----------------------------------------------------------------------

Line 413

The function trap_init() initializes the exception entries in the Interrupt Descriptor Table (IDT) for the x86 architecture running in protected mode. The IDT is a table set up in memory. The address of the IDT is set in the processor's IDTR register. Each element of the interrupt descriptor table is one of three gates. A gate is an x86 protected mode address that consists of a selector, an offset, and a privilege level. The gate's purpose is to transfer program control. The three types of gates in the IDT are task, where control is transferred to another task; interrupt, where control is passed to an interrupt handler with interrupts disabled; and trap, where control is passed to the interrupt handler with interrupts unchanged.

The PPC is architected to jump to specific addresses, depending on the exception. The function trap_init() is a no-op for the PPC. Later in this section, as we continue to follow the system timer code, we will contrast the PPC interrupt table with the x86 interrupt descriptor table initialized next.

-----------------------------------------------------------------------
arch/i386/kernel/traps.c
900  void __init trap_init(void)
901  {
902  #ifdef CONFIG_EISA
903   if (isa_readl(0x0FFFD9) == 'E'+('I'<<8)+('S'<<16)+('A'<<24)) {
904    EISA_bus = 1;
905   }
906  #endif
907
908  #ifdef CONFIG_X86_LOCAL_APIC
909   init_apic_mappings();
910  #endif
911
912   set_trap_gate(0,&divide_error);
913   set_intr_gate(1,&debug);
914   set_intr_gate(2,&nmi);
915   set_system_gate(3,&int3);  /* int3-5 can be called from all */
916   set_system_gate(4,&overflow);
917   set_system_gate(5,&bounds);
918   set_trap_gate(6,&invalid_op);
919   set_trap_gate(7,&device_not_available);
920   set_task_gate(8,GDT_ENTRY_DOUBLEFAULT_TSS);
921   set_trap_gate(9,&coprocessor_segment_overrun);
922   set_trap_gate(10,&invalid_TSS);
923   set_trap_gate(11,&segment_not_present);
924   set_trap_gate(12,&stack_segment);
925   set_trap_gate(13,&general_protection);
926   set_intr_gate(14,&page_fault);
927   set_trap_gate(15,&spurious_interrupt_bug);
928   set_trap_gate(16,&coprocessor_error);
929   set_trap_gate(17,&alignment_check);
930  #ifdef CONFIG_X86_MCE
931   set_trap_gate(18,&machine_check);
932  #endif
933   set_trap_gate(19,&simd_coprocessor_error);
934
935   set_system_gate(SYSCALL_VECTOR,&system_call);
936
937   /*
938   * default LDT is a single-entry callgate to lcall7 for iBCS
939   * and a callgate to lcall27 for Solaris/x86 binaries
940   */
941   set_call_gate(&default_ldt[0],lcall7);
942   set_call_gate(&default_ldt[4],lcall27);
943
944   /*
945   * Should be a barrier for any external CPU state.
946   */
947   cpu_init();
948
949   trap_init_hook();
950  }
-----------------------------------------------------------------------

Line 902

Look for EISA signature. isa_readl() is a helper routine that allows reading the EISA bus by mapping I/O with ioremap().

Lines 908-910

If an Advanced Programmable Interrupt Controller (APIC) exists, add its address to the system fixed address map. See include/asm-i386/fixmap.h for "special" system address helper routines, such as set_fixmap_nocache(). init_apic_mappings() uses this routine to set the physical address of the APIC.

Lines 912-935

Initialize the IDT with trap gates, system gates, and interrupt gates.

Lines 941-942

These special intersegment call gates support the Intel Binary Compatibility Standard for running other UNIX binaries on Linux.

Line 947

For the currently executing CPU, initialize its tables and registers.

Line 949

Used to initialize system-specific hardware, such as different kinds of APICs. This is a no-op for most x86 platforms.

Line 415

The call to init_IRQ() initializes the hardware interrupt controller. Both x86 and PPC architectures have several device implementations. For the x86 architecture, we explore the i8259 device. For PPC, we explore the code associated with the Power Mac.

The PPC implementation of init_IRQ() is in arch/ppc/kernel/irq.c. Depending on the particular hardware configuration, init_IRQ() calls one of several routines to initialize the PIC. For a Power Mac configuration, the function pmac_pic_init() in arch/ppc/platforms/pmac_pic.c is called for the G3, G4, and G5 I/O controllers. This is a hardware-specific routine that tries to identify the type of I/O controller and set it up appropriately. In this example, the PIC is part of the I/O controller device. The process for interrupt initialization is similar to x86, with the minor difference being the system timer is not started in the PPC version of init_IRQ(), but rather in the time_init() function, which is covered later in this section.

The x86 architecture has fewer options for the PIC. As previously discussed, the older systems use the cascaded 8259, while the later systems use the I/O APIC architecture. This code explores the APIC with the emulated 8259 type controllers:

-----------------------------------------------------------------------
arch/i386/kernel/i8259.c
342  void __init init_ISA_irqs (void)
343  {
344   int i;
345  #ifdef CONFIG_X86_LOCAL_APIC
346   init_bsp_APIC();
347  #endif
348   init_8259A(0);
...
351   for (i = 0; i < NR_IRQS; i++) {
352    irq_desc[i].status = IRQ_DISABLED;
353    irq_desc[i].action = 0;
354    irq_desc[i].depth = 1;
355
356    if (i < 16) {
357     /*
358      * 16 old-style INTA-cycle interrupts:
359      */
360     irq_desc[i].handler = &i8259A_irq_type;
361    } else {
362     /*
363      * 'high' PCI IRQs filled in on demand
364      */
365     irq_desc[i].handler = &no_irq_type;
366    }
367   }
368  }
...
409
410  void __init init_IRQ(void)
411  {
412   int i;
413
414   /* all the set up before the call gates are initialized */
415   pre_intr_init_hook();
...
422  for (i = 0; i < NR_IRQS; i++) {
423   int vector = FIRST_EXTERNAL_VECTOR + i;
424   if (vector != SYSCALL_VECTOR)
425    set_intr_gate(vector, interrupt[i]);
426  }
...
431  intr_init_hook();
...
437  setup_timer();
...
}
-----------------------------------------------------------------------

Line 410

This is the function entry point called from start_kernel(), which is the primary kernel initialization function called at system startup.

Lines 342-348

If the local APIC is available and desired, initialize it and put it in virtual wire mode for use with the 8259. Then, initialize the 8259 device using register I/O in init_8259A(0).

Lines 422-426

On line 424, syscalls are not included in this loop because they were already installed earlier in trap_init(). Linux uses an Intel interrupt gate (kernel-initiated code) as the descriptor for interrupts. This is set with the set_intr_gate() macro (on line 425). Exceptions use the Intel system gate and trap gate, which are set by the set_system_gate() and set_trap_gate() macros, respectively. These macros can be found in arch/i386/kernel/traps.c.

Line 431

Set up interrupt handlers for the local APIC (if used) and call setup_irq() in irq.c for the cascaded 8259.

Line 437

Start the 8253 PIT using register I/O.
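The vector assignment performed by the loop on lines 422-426 can be modeled in user space. The following is only a sketch: the array gate_owner[] is a hypothetical stand-in for the IDT, and the constant values (0x20, 0x80, and a table size of 224) are assumptions chosen for illustration that should be checked against the kernel headers in use.

```c
#include <assert.h>

/* Assumed values for illustration; consult the kernel headers in use. */
#define FIRST_EXTERNAL_VECTOR 0x20  /* vector 32, the first external interrupt */
#define SYSCALL_VECTOR        0x80  /* left alone: installed elsewhere */
#define NR_IRQS               224   /* hypothetical table size */
#define IDT_ENTRIES           256

/* Hypothetical stand-in for the IDT: which IRQ owns each vector (-1 = none). */
static int gate_owner[IDT_ENTRIES];

/*
 * Mirrors the shape of the loop in init_IRQ(): one gate per IRQ,
 * skipping the vector reserved for system calls.
 * Returns the number of gates installed.
 */
static int install_irq_gates_sketch(void)
{
    int installed = 0;

    for (int v = 0; v < IDT_ENTRIES; v++)
        gate_owner[v] = -1;

    for (int i = 0; i < NR_IRQS; i++) {
        int vector = FIRST_EXTERNAL_VECTOR + i;

        assert(vector < IDT_ENTRIES);
        if (vector != SYSCALL_VECTOR) {
            gate_owner[vector] = i;   /* stands in for set_intr_gate() */
            installed++;
        }
    }
    return installed;
}
```

With these assumed constants, IRQ0 lands on vector 32 and the syscall vector keeps whatever was installed for it earlier.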

Line 419

Now, we follow time_init() to install the system timer interrupt handler for both PPC and x86. In PPC, the system timer (abbreviated for the sake of this discussion) initializes the decrementer:

-----------------------------------------------------------------------
arch/ppc/kernel/time.c
void __init time_init(void)
{
...
317   ppc_md.calibrate_decr();
...
351   set_dec(tb_ticks_per_jiffy);
...
}
-----------------------------------------------------------------------

Line 317

Calculate the proper count for the system HZ value.

Line 351

Set the decrementer to the proper starting count.
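As a rough illustration of the arithmetic behind these two lines, the sketch below derives a per-jiffy count from a timebase frequency and models a decrementer that raises its exception when the count passes below zero. All names ending in _sketch and the 25 MHz frequency used in the usage note are hypothetical; the real calibration is platform specific and hidden behind ppc_md.calibrate_decr().

```c
#include <assert.h>

/* Timebase ticks in one jiffy: the kind of value handed to set_dec(). */
static unsigned long ticks_per_jiffy_sketch(unsigned long tb_freq_hz,
                                            unsigned int hz)
{
    assert(hz != 0);
    return tb_freq_hz / hz;
}

/* Toy decrementer state (hypothetical, for illustration only). */
static long dec_count_sketch;
static long dec_reload_sketch;

/* Load the starting count; returns the value loaded. */
static long set_dec_sketch(long count)
{
    dec_count_sketch = count;
    dec_reload_sketch = count;
    return dec_count_sketch;
}

/*
 * Advance the model by 'elapsed' timebase ticks. Each time the count
 * drops below zero, one timer interrupt is counted and the count is
 * re-armed by one reload period, a simplification of what the handler
 * does when it restarts the decrementer.
 */
static int advance_sketch(long elapsed)
{
    int fired = 0;

    assert(dec_reload_sketch > 0);
    dec_count_sketch -= elapsed;
    while (dec_count_sketch < 0) {
        fired++;
        dec_count_sketch += dec_reload_sketch;
    }
    return fired;
}
```

For example, an assumed 25 MHz timebase with HZ set to 1000 gives 25,000 ticks per jiffy; loading that count and advancing the model by 25,001 ticks produces one interrupt.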

The interrupt architecture of the PowerPC and its Linux implementation does not require the installation of the timer interrupt. The decrementer interrupt vector comes in at 0x900. The handler call is hard coded at this location and it is not shared:

-----------------------------------------------------------------------
arch/ppc/kernel/head.S
  /* Decrementer */
479  EXCEPTION(0x900, Decrementer, timer_interrupt, EXC_XFER_LITE)
-----------------------------------------------------------------------

More detail on the EXCEPTION macro for the decrementer is given later in this section. The handler for the decrementer is now ready to be executed when the terminal count is reached.

The following code snippets outline the x86 system timer initialization:

-----------------------------------------------------------------------
arch/i386/kernel/time.c
void __init time_init(void)
{
...
340   time_init_hook();
}
-----------------------------------------------------------------------

The function time_init() flows down to time_init_hook(), which is located in the machine-specific setup file setup.c:

-----------------------------------------------------------------------
arch/i386/machine-default/setup.c
072  static struct irqaction irq0 = { timer_interrupt, SA_INTERRUPT, 0, "timer", NULL, NULL};
...
081  void __init time_init_hook(void)
082  {
083   setup_irq(0, &irq0);
084  }
-----------------------------------------------------------------------

Line 72

We initialize the irqaction struct that corresponds to irq0.

Lines 81-84

The function call setup_irq(0, &irq0) puts the irqaction struct containing the handler timer_interrupt() on the queue of shared interrupts associated with irq0.

This code segment has a similar effect to calling request_irq() for the general case handler (those not loaded at kernel initialization time). The initialization code for the timer interrupt took a shortcut to get the handler into irq_desc[]. Runtime code uses disable_irq(), enable_irq(), request_irq(), and free_irq() in irq.c. All these routines are utilities to work with IRQs and touch an irq_desc struct at one point.
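The idea of hanging a handler on the chain for an IRQ can be sketched in plain C. This is a user-space model under stated assumptions, not the kernel's implementation: every type and function name ending in _sketch is hypothetical, the structure fields are reduced to the few needed here, and locking and flag checks are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced, hypothetical version of an irqaction: handler plus chain link. */
struct irqaction_sketch {
    int (*handler)(int irq, void *dev_id);
    const char *name;
    void *dev_id;
    struct irqaction_sketch *next;
};

#define NR_IRQS_SKETCH 16

/* One chain of actions per IRQ line, standing in for irq_desc[]. */
static struct irqaction_sketch *irq_chain_sketch[NR_IRQS_SKETCH];

/* Append an action to the end of the chain for 'irq'; 0 on success. */
static int setup_irq_sketch(unsigned int irq, struct irqaction_sketch *new_action)
{
    struct irqaction_sketch **p;

    assert(new_action != NULL && new_action->handler != NULL);
    if (irq >= NR_IRQS_SKETCH)
        return -1;

    new_action->next = NULL;
    for (p = &irq_chain_sketch[irq]; *p != NULL; p = &(*p)->next)
        ;                               /* walk to the tail of the chain */
    *p = new_action;
    return 0;
}

/* Call every handler on the chain; returns how many reported "handled". */
static int run_chain_sketch(unsigned int irq)
{
    int handled = 0;

    if (irq >= NR_IRQS_SKETCH)
        return 0;
    for (struct irqaction_sketch *a = irq_chain_sketch[irq]; a != NULL; a = a->next)
        handled += a->handler((int)irq, a->dev_id);
    return handled;
}

/* Example handler and action, loosely modeled on the irq0 entry above. */
static int timer_ticks_sketch;

static int timer_handler_sketch(int irq, void *dev_id)
{
    (void)irq;
    (void)dev_id;
    timer_ticks_sketch++;
    return 1;
}

static struct irqaction_sketch irq0_sketch = {
    timer_handler_sketch, "timer", NULL, NULL
};
```

Calling setup_irq_sketch(0, &irq0_sketch) and then run_chain_sketch(0) invokes the timer handler once, which is the effect the boot-time shortcut and request_irq() both aim for.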

Interrupt Time

For PowerPC, the decrementer is internal to the processor and has its own interrupt vector at 0x900. This contrasts with the x86 architecture, where the PIT is an external interrupt coming in from the interrupt controller. Interrupts from the PowerPC external interrupt controller come in on vector 0x500. A similar situation would arise in the x86 architecture if the system timer were based on the local APIC.

Tables 3.10 and 3.11 describe the interrupt vector tables of the x86 and PPC architectures, respectively.

Table 3.10. x86 Interrupt Vector Table
Vector Number/IRQ	Description
0	Divide error
1	Debug extension
2	NMI interrupt
3	Breakpoint
4	INTO-detected overflow
5	BOUND range exceeded
6	Invalid opcode
7	Device not available
8	Double fault
9	Coprocessor segment overrun (reserved)
10	Invalid task state segment
11	Segment not present
12	Stack fault
13	General protection
14	Page fault
15	(Intel reserved. Do not use.)
16	Floating point error
17	Alignment check
18	Machine check*
19-31	(Intel reserved. Do not use.)
32-255	Maskable interrupts

Table 3.11. PPC Offset of Interrupt Vector
Offset (Hex)	Interrupt Type
00000	Reserved
00100	System reset
00200	Machine check
00300	Data storage
00400	Instruction storage
00500	External
00600	Alignment
00700	Program
00800	Floating point unavailable
00900	Decrementer
00A00	Reserved
00B00	Reserved
00C00	System call
00D00	Trace
00E00	Floating point assist
00E10	Reserved
…	…
00FFF	Reserved
01000	Reserved, implementation specific
…	…
02FFF	(End of interrupt vector locations)

Note the similarities between the two architectures. These tables represent the hardware. The software interface to the Intel exception interrupt vector table is the Interrupt Descriptor Table (IDT) that was previously mentioned in this chapter.

As we proceed, we can see how the Intel architecture handles a hardware interrupt by way of an IRQ, to a jump table in entry.S, to a call gate (descriptor), to finally the handler code. Figure 3.19 illustrates this.

Figure 3.19. x86 Interrupt Flow

PowerPC, on the other hand, vectors to specific offsets in memory where the code to jump to the appropriate handler is located. As we see next, the PPC jump table in head.S is indexed by way of being fixed in memory. Figure 3.20 illustrates this.

Figure 3.20. PPC Interrupt Flow

This should become clearer as we now explore the PPC external (offset 0x500) and timer (offset 0x900) interrupt handlers.

Processing the PowerPC External Interrupt Vector

As previously discussed, the processor jumps to address 0x500 in the event of an external interrupt. Upon further investigation of the EXCEPTION() macro in the file head.S, we can see that the following lines of code are linked and loaded such that they are mapped to this memory region at offset 0x500. This architected jump table has the same effect as the x86 IDT:

-----------------------------------------------------------------------
arch/ppc/kernel/head.S
453   /* External interrupt */
454  EXCEPTION(0x500, HardwareInterrupt, do_IRQ, EXC_XFER_LITE)
-----------------------------------------------------------------------

The third parameter, do_IRQ(), is called next. Let's take a look at this function.

-----------------------------------------------------------------------
arch/ppc/kernel/irq.c
510  void do_IRQ(struct pt_regs *regs)
511  {
512   int irq, first = 1;
513   irq_enter();
...
523   while ((irq = ppc_md.get_irq(regs)) >= 0) {
524    ppc_irq_dispatch_handler(regs, irq);
525    first = 0;
526   }
527   if (irq != -2 && first)
528    /* That's not SMP safe ... but who cares ? */
529    ppc_spurious_interrupts++;
530   irq_exit();
531  }
-----------------------------------------------------------------------

Lines 513-530

Indicate to the preemption code that we are in a hardware interrupt.

Line 523

Read from the interrupt controller a pending interrupt and convert to an IRQ number (until all interrupts are handled).

Line 524

The ppc_irq_dispatch_handler() handles the interrupt. We look at this function in more detail next.
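The drain-until-empty loop on lines 523-529 can be modeled as follows. This is a sketch only: the pending queue is a hypothetical stand-in for the interrupt controller behind ppc_md.get_irq(), dispatching is reduced to a counter, and the meaning of a -2 return is taken from the listing's own test rather than from any documented contract.

```c
#include <assert.h>

#define MAX_PENDING_SKETCH 8

/* Hypothetical controller state: a small queue of raised IRQ numbers. */
static int pending_sketch[MAX_PENDING_SKETCH];
static int pending_len_sketch;
static int pending_pos_sketch;
static int spurious_sketch;

/* Raise an IRQ in the model; returns how many are now waiting.
 * The queue is never reset, so at most MAX_PENDING_SKETCH raises fit. */
static int raise_irq_sketch(int irq)
{
    assert(irq >= 0 && pending_len_sketch < MAX_PENDING_SKETCH);
    pending_sketch[pending_len_sketch++] = irq;
    return pending_len_sketch - pending_pos_sketch;
}

/* Next pending IRQ, or -1 when nothing is waiting. */
static int get_irq_sketch(void)
{
    if (pending_pos_sketch < pending_len_sketch)
        return pending_sketch[pending_pos_sketch++];
    return -1;
}

/*
 * Same shape as the loop in do_IRQ(): keep asking the controller for
 * work until it reports none. If the very first query found nothing
 * (and the result was not -2), count the interrupt as spurious.
 * Returns the number of IRQs dispatched.
 */
static int do_IRQ_sketch(void)
{
    int irq, first = 1, dispatched = 0;

    while ((irq = get_irq_sketch()) >= 0) {
        dispatched++;              /* stands in for the dispatch handler */
        first = 0;
    }
    if (irq != -2 && first)
        spurious_sketch++;
    return dispatched;
}
```

Raising two IRQs and calling do_IRQ_sketch() once dispatches both in a single entry; calling it again with nothing pending bumps the spurious counter instead.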

The function ppc_irq_dispatch_handler() is nearly identical to the x86 function do_IRQ():

-----------------------------------------------------------------------
arch/ppc/kernel/irq.c
428  void ppc_irq_dispatch_handler(struct pt_regs *regs, int irq)
429  {
430   int status;
431   struct irqaction *action;
432   irq_desc_t *desc = irq_desc + irq;
433
434   kstat_this_cpu.irqs[irq]++;
435   spin_lock(&desc->lock);
436   ack_irq(irq);
...
441   status = desc->status & ~(IRQ_REPLAY | IRQ_WAITING);
442   if (!(status & IRQ_PER_CPU))
443    status |= IRQ_PENDING; /* we _want_ to handle it */
...
449   action = NULL;
450   if (likely(!(status & (IRQ_DISABLED | IRQ_INPROGRESS)))) {
451    action = desc->action;
452    if (!action || !action->handler) {
453     ppc_spurious_interrupts++;
454     printk(KERN_DEBUG "Unhandled interrupt %x, disabled\n",irq);
455     /* We can't call disable_irq here, it would deadlock */
456     ++desc->depth;
457     desc->status |= IRQ_DISABLED;
458     mask_irq(irq);
459     /* This is a real interrupt, we have to eoi it,
460        so we jump to out */
461     goto out;
462    }
463    status &= ~IRQ_PENDING; /* we commit to handling */
464    if (!(status & IRQ_PER_CPU))
465     status |= IRQ_INPROGRESS; /* we are handling it */
466   }
467   desc->status = status;
...
489   for (;;) {
490    spin_unlock(&desc->lock);
491    handle_irq_event(irq, regs, action);
492    spin_lock(&desc->lock);
493
494    if (likely(!(desc->status & IRQ_PENDING)))
495     break;
496    desc->status &= ~IRQ_PENDING;
497   }
498  out:
499   desc->status &= ~IRQ_INPROGRESS;
...
511  }
-----------------------------------------------------------------------

Line 432

Get the IRQ from parameters and gain access to the appropriate irq_desc.

Line 435

Acquire the spinlock on the IRQ descriptor in case of concurrent accesses to the same interrupt by different CPUs.

Line 436

Send an acknowledgment to the hardware. The hardware then reacts accordingly, preventing further interrupts of this type from being processed until this one is finished.

Lines 441-443

The flags IRQ_REPLAY and IRQ_WAITING are cleared. In this case, IRQ_REPLAY indicates that the IRQ was dropped earlier and is being resent. IRQ_WAITING indicates that the IRQ is being tested. (Both cases are outside the scope of this discussion.) Unless the IRQ is marked per-CPU (IRQ_PER_CPU), the IRQ_PENDING flag is set, which indicates that we want to handle the interrupt.

Line 450

This block of code checks for conditions under which we would not process the interrupt. If IRQ_DISABLED or IRQ_INPROGRESS are set, we can skip over this block of code. The IRQ_DISABLED flag is set when we do not want the system to respond to a particular IRQ line being serviced. IRQ_INPROGRESS indicates that an interrupt is being serviced by a processor. This is used in the case a second processor in a multiprocessor system tries to raise the same interrupt.

Lines 451-462

Here, we check to see if the handler exists. If it does not, we break out and jump to the "out" label in line 498.

Lines 463-465

At this point, we cleared all three conditions for not servicing the interrupt, so we are committing to doing so. The flag IRQ_INPROGRESS is set and the IRQ_PENDING flag is cleared, which indicates that the interrupt is being handled.

Lines 489-497

The interrupt is serviced. Before an interrupt is serviced, the spinlock on the interrupt descriptor is released. After the spinlock is released, the routine handle_irq_event() is called. This routine executes the interrupt's handler. Once done, the spinlock on the descriptor is acquired once more. If the IRQ_PENDING flag has not been set (by another CPU) during the course of the IRQ handling, break out of the loop. Otherwise, service the interrupt again.
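The flag handling walked through above can be condensed into a small state model. The sketch below is a simplification under several assumptions: the flag values and all names ending in _sketch are hypothetical, locking is omitted, the parameter rearm simulates another CPU setting the pending flag while the handler runs, and the path taken when the IRQ is disabled or already in progress simply returns, because that part of the real function is elided in the listing.

```c
#include <assert.h>

/* Hypothetical flag values; only their distinctness matters here. */
enum {
    F_INPROGRESS = 1,
    F_DISABLED   = 2,
    F_PENDING    = 4,
    F_REPLAY     = 8,
    F_WAITING    = 32,
    F_PER_CPU    = 256
};

/* Stands in for desc->status after the most recent dispatch. */
static unsigned int desc_status_sketch;

/*
 * Model of the status handling in the dispatch handler.
 * Returns how many times the handler would have run.
 */
static int dispatch_sketch(unsigned int initial_status, int has_handler, int rearm)
{
    unsigned int status;
    int runs = 0;

    assert(rearm >= 0);
    desc_status_sketch = initial_status;

    status = desc_status_sketch & ~(unsigned int)(F_REPLAY | F_WAITING);
    if (!(status & F_PER_CPU))
        status |= F_PENDING;                 /* we want to handle it */

    if (status & (F_DISABLED | F_INPROGRESS)) {
        desc_status_sketch = status;         /* leave it pending, do not run */
        return 0;
    }

    if (!has_handler) {                      /* nobody to call: disable the line */
        desc_status_sketch |= F_DISABLED;
        desc_status_sketch &= ~(unsigned int)F_INPROGRESS;
        return 0;
    }

    status &= ~(unsigned int)F_PENDING;      /* we commit to handling */
    if (!(status & F_PER_CPU))
        status |= F_INPROGRESS;              /* we are handling it */
    desc_status_sketch = status;

    for (;;) {
        runs++;                              /* stands in for handle_irq_event() */
        if (rearm > 0) {                     /* simulated second arrival */
            rearm--;
            desc_status_sketch |= F_PENDING;
        }
        if (!(desc_status_sketch & F_PENDING))
            break;
        desc_status_sketch &= ~(unsigned int)F_PENDING;
    }
    desc_status_sketch &= ~(unsigned int)F_INPROGRESS;
    return runs;
}
```

In this model, an idle line with a handler runs it once; two simulated re-arrivals make the loop run the handler three times before the in-progress flag is dropped.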

Processing the PowerPC System Timer Interrupt

As noted in time_init(), the decrementer is hard coded to 0x900. We can assume the terminal count has been reached and the handler timer_interrupt() in arch/ppc/kernel/time.c is called at this time:

-----------------------------------------------------------------------
arch/ppc/kernel/head.S
  /* Decrementer */
479  EXCEPTION(0x900, Decrementer, timer_interrupt, EXC_XFER_LITE)
-----------------------------------------------------------------------

Here is the timer_interrupt() function.

-----------------------------------------------------------------------
arch/ppc/kernel/time.c
145  void timer_interrupt(struct pt_regs * regs)
146  {
...
152   if (atomic_read(&ppc_n_lost_interrupts) != 0)
153    do_IRQ(regs);
154
155   irq_enter();
...
159   if (!user_mode(regs))
160    ppc_do_profile(instruction_pointer(regs));
...
165   write_seqlock(&xtime_lock);
166
167   do_timer(regs);
...
189   if (ppc_md.set_rtc_time(xtime.tv_sec+1 + time_offset) == 0)
...
195   write_sequnlock(&xtime_lock);
...
198   set_dec(next_dec);
...
208   irq_exit();
209  }
-----------------------------------------------------------------------

Line 152

If an interrupt was lost, go back and call do_IRQ(), the external interrupt handler at 0x500.

Line 159

If the interrupted code was running in kernel mode, record its instruction pointer for kernel profiling.

Lines 165 and 195

Lock out this block of code.

Line 167

do_timer() is the same function used in the x86 timer interrupt (coming up next).

Line 189

Update the RTC.

Line 198

Restart the decrementer for the next interrupt.

Line 208

Return from the interrupt.

The interrupted code now runs as normal until the next interrupt.

Processing the x86 System Timer Interrupt

Upon activation of an interrupt (in our example, the PIT has counted down to 0 and activated IRQ0), the interrupt controller activates an interrupt line going into the processor. The assembly code in entry.S has an entry point that corresponds to each descriptor in the IDT. IRQ0 is the first external interrupt and is vector 32 in the IDT. The code is then ready to jump to entry point 32 in the jump table in entry.S:

-----------------------------------------------------------------------
arch/i386/kernel/entry.S
385  vector=0
386  ENTRY(irq_entries_start)
387  .rept NR_IRQS
388   ALIGN
389  1:  pushl $vector-256
390   jmp common_interrupt
391  .data
392  .long 1b
393  .text
394  vector=vector+1
395  .endr
396
397   ALIGN
398  common_interrupt:
399   SAVE_ALL
400   call do_IRQ
401   jmp ret_from_intr
-----------------------------------------------------------------------

This code is a fine piece of assembler magic. The repeat construct .rept (on line 387) and its closing statement (on line 395) create the interrupt jump table at compile time. Notice that as this block of code is repeatedly created, the vector number is incremented (on line 394), so each entry pushes a different value at line 389. By pushing the vector, the kernel code now knows what IRQ it is working with at interrupt time.
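The value pushed at line 389 is the table index minus 256, so it is always negative. A minimal sketch of how such a value can be turned back into an IRQ number follows. The function names are hypothetical, and recovering the number by keeping only the low eight bits is an assumption about how the C handler consumes the pushed value, not something shown in the listing.

```c
#include <assert.h>

/* The immediate that the table entry for index 'vector' would push. */
static int pushed_value_sketch(int vector)
{
    assert(vector >= 0 && vector < 256);
    return vector - 256;
}

/*
 * Recover the index from the pushed value by keeping the low 8 bits.
 * The conversion to unsigned is well defined in C (modulo 2^N), so
 * -256 + n maps back to n for any n in 0..255.
 */
static int irq_from_pushed_sketch(int pushed)
{
    return (int)((unsigned int)pushed & 0xffu);
}
```

For instance, the first entry pushes -256, which decodes back to IRQ 0, and the entry for index 15 pushes -241, which decodes to 15.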

When we left off the code trace for x86, the code jumps to the proper entry point in the jump table and saves the IRQ on the stack. The code then jumps to the common handler at line 398 and calls do_IRQ() (arch/i386/kernel/irq.c) at line 400. This function is almost identical to ppc_irq_dispatch_handler(), which was described in the section, "Processing the PowerPC External Interrupt Vector" so we will not repeat it here.

Based on the incoming IRQ, the function do_IRQ() accesses the proper element of irq_desc and jumps to each handler in the chain of action structures. Here, we have finally made it to the actual handler function for the PIT: timer_interrupt(). See the following code segments from time.c. Maintaining the same order as in the source file, the handler starts at line 274:

-----------------------------------------------------------------------
arch/i386/kernel/time.c
274  irqreturn_t timer_interrupt(int irq, void *dev_id, struct pt_regs *regs)
275  {
...
287   do_timer_interrupt(irq, NULL, regs);
...
290   return IRQ_HANDLED;
291  }
-----------------------------------------------------------------------

Line 274

This is the entry point for the system timer interrupt handler.

Line 287

This is the call to do_timer_interrupt().

-----------------------------------------------------------------------
arch/i386/kernel/time.c
208  static inline void do_timer_interrupt(int irq, void *dev_id,
209       struct pt_regs *regs)
210  {
...
227   do_timer_interrupt_hook(regs);
...
250  }
-----------------------------------------------------------------------

Line 227

Call to do_timer_interrupt_hook(). This function is essentially a wrapper around the call to do_timer(). Let's look at it:

-----------------------------------------------------------------------
include/asm-i386/mach-default/do_timer.h
016  static inline void do_timer_interrupt_hook(struct pt_regs *regs)
017  {
018   do_timer(regs);
...
025   x86_do_profile(regs);
...
030  }
-----------------------------------------------------------------------

Line 18

This is where the call to do_timer() gets made. This function performs the bulk of the work for updating the system time.

Line 25

The x86_do_profile() routine looks at the eip register for the code that was running before the interrupt. Over time, this data indicates how often processes are running.

At this point, the system timer interrupt returns from do_IRQ() to entry.S for system housekeeping and the interrupted thread resumes.

As previously discussed, the system timer is the heartbeat of the Linux operating system. Although we have used the timer as an example for interrupts in this chapter, its use is prevalent throughout the entire operating system.