Section 8.3. Architecture-Dependent Memory Initialization | The Linux Kernel Primer. A Top-Down Approach for x86 and PowerPC Architectures

8.3. Architecture-Dependent Memory Initialization

We now take a moment to discuss the hardware memory-management features of PPC and x86. Both architectures provide hardware support for real and virtual addressing environments. As in all operating systems, Linux memory management depends on the underlying hardware architecture. This section describes the hardware initialization of both architectures. Because the initialization of memory management is extremely hardware dependent, the hardware specifications need to be understood in order to follow the initialization process. Memory management is one of the first subsystems to be initialized and, because of its highly architecture-dependent nature, its initialization begins prior to the execution of start_kernel().

8.3.1. PowerPC Hardware Memory Management

This section describes the hardware-supported features of address translation, also known as "storage control" in the PowerPC world, that are specific to the PowerPC architecture. We follow up with a discussion on how Linux uses (or disregards, for the sake of portability) these features from system power-on through kernel initialization.

8.3.1.1. Real Addressing Mode

From embedded up to high performance, all PowerPC processors come out of hardware reset in real mode.^[3] PowerPC real-addressing mode is defined as having the processor in a state of disabled address translation. Address translation is controlled by the instruction relocate (IR) and data relocate (DR) bits in the Machine State Register (MSR). For instruction fetches, if the IR bit is 0, the effective address (EA) is the same as the real address. For load and store instructions, the DR bit in the MSR plays the same role.

^[3] Even the 440 series of processors, which technically have no real mode, start with a "shadow" TLB that maps linear addresses to physical addresses.

The MSR, which is illustrated in Figure 8.3, is a 64- or 32-bit register that describes the current state of the processor. On a 32-bit implementation, the IR and DR are bits 26 and 27.

Figure 8.3. PowerPC Machine State Register (MSR)
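The bit numbers quoted for IR and DR follow the PowerPC convention in which bit 0 is the most significant bit. As a minimal sketch of what that means for a 32-bit MSR (the names ppc32_bit, SKETCH_MSR_IR, SKETCH_MSR_DR, and msr_is_real_mode are hypothetical and are not taken from the kernel source), the conversion to ordinary masks could look like this:

```c
#include <stdint.h>

/* PowerPC numbers bits from the most significant end: bit 0 is the MSB
 * and bit 31 is the LSB of a 32-bit register. ppc32_bit() is a
 * hypothetical helper that turns such a bit number into a mask. */
static inline uint32_t ppc32_bit(unsigned n)
{
    return UINT32_C(1) << (31u - n);
}

/* IR and DR are bits 26 and 27 of the 32-bit MSR. */
#define SKETCH_MSR_IR ppc32_bit(26)   /* instruction relocate */
#define SKETCH_MSR_DR ppc32_bit(27)   /* data relocate */

/* Real mode means both translation bits are clear. */
static inline int msr_is_real_mode(uint32_t msr)
{
    return (msr & (SKETCH_MSR_IR | SKETCH_MSR_DR)) == 0;
}
```

With this numbering, IR and DR correspond to the masks 0x20 and 0x10.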

Because address translation in Linux is a combination of hardware and software structures, real mode is fundamental to the boot process of initializing the memory subsystem and the memory-management structures of Linux. The need to enable address translation is exemplified by the inherent limitations of real mode. Real mode is only capable of addressing the implemented address width; this is 64- or 32-bit in most applications. The two major limitations are as follows:

There is no hardware protection for load/store operations.
Any access (instruction or data) to or from an address that does not have a device physically attached to the bus might cause a Machine Check (also known as a Checkstop), which in most cases, is unrecoverable.

8.3.1.2. Address Translation

Real addressing is the absence of address translation. Address translation opens the door to virtual addressing, where not every possible address is physically available at any given instant but, through the clever use of hardware and software, every possible address can be made virtually available when accessed.

With address translation enabled, the PowerPC architecture translates an EA by one of two methods: Segmented Address Translation or Block Address Translation (see Figure 8.4). If the EA can be translated by both methods, Block Address Translation takes precedence. Address translation is said to be enabled when MSR_IR=1, or MSR_DR=1, or both. Segmented Address Translation breaks virtual memory into segments, which are divided into 4KB pages, each representing physical memory. Block Address Translation breaks memory into regions ranging from 128KB to 256MB.

Figure 8.4. 32-Bit Address Translation

Memory Addressing Terminology

When we reference memory, we really only have two distinct methodologies or modes: real addressing, where each increment of the address specifies a specific base unit (usually a byte) in physical memory; and virtual addressing, where the address is a computation in hardware and/or software. Here are some example terms used for each:

Real addressing. Physical, bus
Virtual addressing. Effective, protected, and translated

In PowerPC, effective address space is considered a subset of virtual address space.

Terms such as linear, flat, and logical can apply to both modes.

Segmented Address Translation: Direct Store Segment (T=1)

The next level of translation is determined by the T bit, which is located in the Segment Register. Bits 0:3 of the EA select one of 16 segment registers (SRs) in the PowerPC 7xx series. Figure 8.5 illustrates the segment register.

Figure 8.5. Segment Register

With the T bit set, the segment is deemed a direct store segment to an I/O device, and there is no reference to hardware page tables. The I/O address is made up of a permission bit, the BUID, the controller-specific field, and bits 4:31 of the EA. Linux does not use direct store segmentation.

Segmented Address Translation: Ordinary Segment (T=0)

When the T bit is not set, the virtual segment ID (VSID) field is used.

Referring to Figure 8.6, a 52-bit virtual address (VA) is formed by concatenating bits 20:31 of the EA (the offset within a given page), bits 4:19 of the EA, and bits 8:31 of the selected segment register VSID field. The most significant 40 bits of the VA make up the virtual page number (VPN). The PowerPC architecture uses a Hashed Page Table to map VPNs to real page numbers (the real address of a desired page in memory). The hash function uses the VPN and the value in Storage Description Register 1 (SDR1) to store and retrieve a Page Table Entry (PTE). The PTE, which is illustrated in Figure 8.7, is an 8-byte structure that contains all the necessary attributes of a page in memory.

Figure 8.6. Segment Translation

Figure 8.7. Page Table Entry
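The concatenation that forms the virtual address can be made concrete with a short sketch. The code below assumes a 32-bit implementation with 16 segment registers; the type and function names (ppc_va_parts, ppc_split_ea, ppc_virtual_address, ppc_vpn) are hypothetical, and the hash step that uses SDR1 is deliberately left out:

```c
#include <stdint.h>

/* Sketch of ordinary-segment (T=0) translation on a 32-bit PowerPC.
 * The low 24 bits of a segment register (bits 8:31 in PowerPC
 * numbering) hold the VSID. */
typedef struct {
    uint32_t vsid;         /* 24-bit virtual segment ID */
    uint32_t page_index;   /* EA bits 4:19, 16 bits */
    uint32_t offset;       /* EA bits 20:31, 12 bits */
} ppc_va_parts;

static ppc_va_parts ppc_split_ea(uint32_t ea, const uint32_t sr[16])
{
    ppc_va_parts p;
    uint32_t sr_index = ea >> 28;            /* EA bits 0:3 pick the SR */

    p.vsid       = sr[sr_index] & 0x00FFFFFFu;
    p.page_index = (ea >> 12) & 0xFFFFu;
    p.offset     = ea & 0xFFFu;
    return p;
}

/* 52-bit VA = VSID (24 bits) | page index (16 bits) | offset (12 bits). */
static uint64_t ppc_virtual_address(ppc_va_parts p)
{
    return ((uint64_t)p.vsid << 28) |
           ((uint64_t)p.page_index << 12) |
           p.offset;
}

/* The VPN is the most significant 40 bits of the VA. */
static uint64_t ppc_vpn(ppc_va_parts p)
{
    return ppc_virtual_address(p) >> 12;
}
```

For example, with segment register 12 holding a VSID of 0x123456, the EA 0xC0001ABC yields the VA 0x1234560001ABC and the VPN 0x1234560001.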

Block Address Translation

As its name implies, Block Address Translation (BAT) is an addressing mechanism that allows for mapping blocks of contiguous memory from 128KB to 256MB. BAT registers are privileged special purpose registers (SPRs) in the PowerPC architecture. Figure 8.8 illustrates the BAT register.

Figure 8.8. BAT Register
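The range of block sizes can be illustrated with a small sketch. It rests on an assumption drawn from the 32-bit PowerPC architecture rather than from the text above: the upper BAT register carries an 11-bit block-length (BL) mask made of contiguous low-order ones. The names bat_bl_is_valid and bat_block_size are hypothetical.

```c
#include <stdint.h>

/* Assumption (from the 32-bit PowerPC architecture, not from the text
 * above): BL is an 11-bit mask of contiguous low-order ones. */
#define BAT_MIN_BLOCK (128u * 1024u)   /* smallest block: 128KB */

/* Returns 1 if bl is one of the encodings 0, 1, 3, 7, ..., 0x7FF. */
static int bat_bl_is_valid(uint32_t bl)
{
    return bl <= 0x7FFu && (bl & (bl + 1u)) == 0;
}

/* Block size in bytes: 128KB for BL=0, doubling with each extra one
 * bit, up to 256MB for BL=0x7FF. */
static uint32_t bat_block_size(uint32_t bl)
{
    return (bl + 1u) * BAT_MIN_BLOCK;
}
```

Under that assumption, the smallest encoding gives a 128KB block and the largest gives 256MB.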

The formation of a real address from a BAT register can be seen in Figure 8.9. Four Instruction BAT (IBAT) registers and four Data BAT (DBAT) registers can be read or written using mtspr and mfspr PPC instructions.^[4]

^[4] Block Address Translation is not implemented on all PowerPC processors. Notably, it was not implemented on G4 or G5. It is implemented in the 4xx-embedded processors.

Figure 8.9. BAT Real

Translation Lookaside Buffers

The Translation Lookaside Buffers (TLBs) can be thought of as a hardware cache with hardware protection for the paging system. The TLB varies in length with PowerPC architectures and contains an index of the most recently used PTEs. The paging software must be sure to keep the TLBs in sync with the page table. When the processor cannot find a page in the hash table,^[5] the Linux page tables are then searched. If the page is still not found, a normal page fault is generated. Information on optimization of the synchronization between the Linux page tables and PPC hash tables can be found in the document, "Low Level Optimizations in the PowerPC/Linux Kernels," by Paul Mackerras.

^[5] Hash tables are not implemented on all PowerPC processors. They are absent in the 4xx- and 8xx-embedded systems, where a TLB miss generates an exception in the hardware and the paging software then brings the page in.

Storage Access Mode Control

When address translation is enabled (MSR_IR=1, or MSR_DR=1, or both) and accomplished by way of Segmented Address Translation or Block Address Translation, the storage mode is determined by four control bits: W, I, M, and G. For Segmented Address Translation, they are bits 25:28 of the second word of a PTE, and the same bits for the second SPR of the DBAT. (The G-bit is reserved in the IBAT.) Two more bits, Referenced and Changed, which are located in the PTE, are available for Segmented Address Translation. The R and C bits are set by hardware or software. (See the following sidebar for a discussion of the W, I, M, G, R, and C bits.)

Control Bits

The W, I, M, G, R, and C bits control how the processor accesses the cache and main memory:

W (Write Through). If data is in the cache and a store operation is performed on it, if W=1, the copy in main memory must also be updated.
I (Cache Inhibit). When I=1, updates bypass the cache and go straight through to main memory.
M (Memory Coherence). When M=1, hardware memory coherency is enforced.
G (Guarded). When G=1, speculative execution is suppressed.
R (Referenced). When R=1, the Page Table entry has been referenced.
C (Changed). When C=1, the Page Table entry has been changed.
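As a rough illustration of where the four storage-mode bits sit, the following sketch decodes W, I, M, and G from the second word of a 32-bit PTE, using the bit positions 25:28 given earlier (PowerPC numbering, bit 0 most significant). The macro and type names are hypothetical, and the R and C bits are omitted because their positions are not given in this section.

```c
#include <stdint.h>

/* PowerPC bit n of a 32-bit word, where bit 0 is the MSB. */
#define PPC_BIT32(n) (UINT32_C(1) << (31 - (n)))

#define PTE_W PPC_BIT32(25)  /* write through */
#define PTE_I PPC_BIT32(26)  /* cache inhibit */
#define PTE_M PPC_BIT32(27)  /* memory coherence */
#define PTE_G PPC_BIT32(28)  /* guarded */

typedef struct {
    unsigned w, i, m, g;     /* each field is 0 or 1 */
} wimg_bits;

/* Extract the storage-mode bits from the second PTE word. */
static wimg_bits pte_wimg(uint32_t pte_word1)
{
    wimg_bits b;
    b.w = (pte_word1 & PTE_W) != 0;
    b.i = (pte_word1 & PTE_I) != 0;
    b.m = (pte_word1 & PTE_M) != 0;
    b.g = (pte_word1 & PTE_G) != 0;
    return b;
}
```

A mapping that is both cache inhibited and guarded, as one might choose for device registers, would therefore carry the value 0x28 in these bits.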

8.3.1.3. How Linux Uses PPC Address Translation

We now look at the code that influences memory management in PPC.

The following code is the first in the kernel distribution to get control. This routine calls back into the Firmware for allocation of temporary regions by using the claim() function. The kernel is then decompressed into its proper location:

----------------------------------------------------------------------
arch/ppc/boot/openfirmware/newworldmain.c
40  void boot(int a1, int a2, void *prom)
...
54    claim(initrd_start, RAM_END - initrd_start, 0);
55    printf("initial ramdisk moving 0x%x <- 0x%p (%x bytes)\n\r",
56      initrd_start, (char *)(&__ramdisk_begin), initrd_size);
57    memcpy((char *)initrd_start, (char *)(&__ramdisk_begin), initrd_size);
...
63    /* claim 3MB starting at PROG_START */
64    claim(PROG_START, PROG_SIZE, 0);
65    dst = (void *) PROG_START;
66    if (im[0] == 0x1f && im[1] == 0x8b) {
67      /* claim some memory for scratch space */
68      avail_ram = (char *) claim(0, SCRATCH_SIZE, 0x10);
69      begin_avail = avail_high = avail_ram;
70      end_avail = avail_ram + SCRATCH_SIZE;
71      printf("heap at 0x%p\n", avail_ram);
72      printf("gunzipping (0x%p <- 0x%p:0x%p)...", dst, im, im+len);
73      gunzip(dst, PROG_SIZE, im, &len);
74      printf("done %u bytes\n", len);
75      printf("%u bytes of heap consumed, max in use %u\n",
76        avail_high - begin_avail, heap_max);
...
86    sa = (unsigned long)PROG_START;
87    printf("start address = 0x%x\n", sa);
88
89    (*(kernel_start_t)sa)(a1, a2, prom);
----------------------------------------------------------------------

Line 40

Entry point to this file is the function boot(a1, a2, *prom).

Line 54

Function claim() is called to allocate memory just below 1M and ramdisk is copied into that memory.

Line 64

Function claim() is called to allocate 3M of memory, starting at 0x1_0000 for the image.

Line 68

Function claim() is called to allocate 8K of memory starting at 0x00 for scratch/heap.

Line 73

The image is gunzipped to address 0x1_0000 (PROG_START).

Line 89

Jump to 0x1_0000 ((*(kernel_start_t)sa)) with parameters (a1, a2, and prom), where a1 holds the value in r3 (equal to the boot ramdisk start), a2 holds the value in r4 (equal to the boot ramdisk size, or 0xdeadbeef in the case of no ramdisk), and prom holds the value in r5 (code stored in system ROM).

The next code block readies the hardware memory-management features of the various PowerPC processors. The first 16M of RAM is mapped to 0xc0000000:

----------------------------------------------------------------------
arch/ppc/kernel/head.S
131  __start:
...
150   bl  early_init    in <arch/ppc/kernel/setup.c> (283)
...
170   bl  mmu_off ...
171       RFI: SRR0=>IP, SRR1=>MSR
172  #ifndef CONFIG_POWER4
173   bl  clear_bats
174   bl  flush_tlbs
175
176   bl  initial_bats
177  #if !defined(CONFIG_APUS) && defined(CONFIG_BOOTX_TEXT)
178   bl  setup_disp_bat
179  #endif
180  #else /* CONFIG_POWER4 */
181   bl  reloc_offset
182   bl  initial_mm_power4
183  #endif /* CONFIG_POWER4 */
185  /*
186   * Call setup_cpu for CPU 0 and initialize 6xx Idle
187   */
188   bl  reloc_offset
189   li  r24,0            /* cpu# */
190   bl  call_setup_cpu   /* Call setup_cpu for this CPU */
195  #ifdef CONFIG_POWER4
196   bl  reloc_offset
197   bl  init_idle_power4
198  #endif /* CONFIG_POWER4 */
199
210   bl  reloc_offset
211   mr  r26,r3
212   addis  r4,r3,KERNELBASE@h  /* current address of _start */
213   cmpwi  0,r4,0              /* are we already running at 0? */
214   bne  relocate_kernel
215
...
224  turn_on_mmu:
225   mfmsr  r0
226   ori  r0,r0,MSR_DR|MSR_IR
227   mtspr  SRR1,r0
228   lis  r0,start_here@h
229   ori  r0,r0,start_here@l
230   mtspr  SRR0,r0
231   SYNC
232   RFI     /* enables MMU */
----------------------------------------------------------------------

Line 131

This is the entry point to this code. Get minimal mmu environment set up. (Note that APUS stands for Amiga Power Up System.)

Line 150

There might be a difference between where the kernel is loaded and where it is linked. The function early_init returns the physical address of the current code.

Line 170

Shut off the memory-management unit of the PPC. If both IR and DR are already off, the routine simply returns; otherwise, it shuts off relocation.

Lines 173-176

If not power4 or G5, clear the BAT registers, flush TLBs, and set up BATs to map the first 16M of RAM to 0xc0000000.

Note the various labels for kernel memory used throughout the kernel:

----------------------------------------------------------------------
arch/ppc/defconfig
CONFIG_KERNEL_START=0xc0000000
----------------------------------------------------------------------

and

----------------------------------------------------------------------
include/asm-ppc/page.h
#define PAGE_OFFSET  CONFIG_KERNEL_START
#define KERNELBASE  PAGE_OFFSET
----------------------------------------------------------------------

Lines 181-182

By using segmentation, set up kernel memory for power4 and G5.

Lines 188-198

setup_cpu() initializes the kernel and user features, such as cache configuration, or whether an FPU or MMU exists. (Note that at this writing, init_idle_power4 is a noop.)

Line 210

Relocate kernel to KERNELBASE or 0x00, depending on the platform.

Lines 224-232

Turn on the MMU (if it is not already) by enabling IR and DR in MSR. Then, execute an RFI instruction causing a jump to the label start_here:. (Note: The RFI instruction loads the MSR with the contents of SRR1 and branches to the address in SRR0.)

The following code is where the kernel starts. It sets up all memory in the system based on the command line:

----------------------------------------------------------------------
arch/ppc/kernel/head.S
1337  start_here:
...
1364   bl  machine_init
1365   bl  MMU_init
...
1385   lis  r4,2f@h
1386   ori  r4,r4,2f@l
1387   tophys(r4,r4)
1388   li  r3,MSR_KERNEL & ~(MSR_IR|MSR_DR)
1389   FIX_SRR1(r3,r5)
1390   mtspr  SRR0,r4
1391   mtspr  SRR1,r3
1392   SYNC
1393   RFI
1394  /* Load up the kernel context */
1395  2:  bl  load_up_mmu
...
1411  /* Now turn on the MMU for real! */
1412   li  r4,MSR_KERNEL
1413   FIX_SRR1(r4,r5)
1414   lis  r3,start_kernel@h
1415   ori  r3,r3,start_kernel@l
1416   mtspr  SRR0,r3
1417   mtspr  SRR1,r4
1418   SYNC
1419   RFI
----------------------------------------------------------------------

Line 1337

This line is the entry point to this section.

Line 1364

machine_init() (see the file arch/ppc/kernel/setup.c, line 532) sets up machine-specific information, such as NVRAM, L2, CPU cache line size, debugging, and so on.

Line 1365

MMU_init() (see the file arch/ppc/mm/init.c, line 234) discovers the total memory size for highmem and lowmem. It then initializes the MMU hardware (MMU_init_hw(), line 267), sets up the Hash Page Table (arch/ppc/mm/hashtable.s), maps all RAM starting at KERNELBASE (mapin_ram(), line 272), maps all I/O (setup_io_mappings(), line 285), and initializes context management (mmu_context_init(), line 288).

Line 1385

Shut off IR and DR to set up SDR1. This holds the real address of the Page Table and how many bits from the hash are used in the Page Table Index.

Line 1395

Clear TLBs, load SDR1 (hash table base and size), set up segmentation, and, depending on the particular PPC platform, initialize the BAT registers.

Lines 1412-1419

Turn on IR and DR, and RFI to start_kernel in /init/main.c. Note that at interrupt time in the PowerPC architecture, the Instruction Address Register (IAR) holds the address the processor must return to after servicing the interrupt. This value is saved in the Save Restore Register 0 (SRR0). The Machine State Register is in turn saved in the Save Restore Register 1 (SRR1). In shorthand, at interrupt time:

IAR->SRR0
MSR->SRR1

The RFI instruction, which is normally executed at the end of an interrupt routine, is the inverse of this procedure, where SRR0 is restored to the IAR and SRR1 is restored to the MSR. In shorthand:

SRR0->IAR
SRR1->MSR

The code in lines 1385-1419 uses this methodology to turn memory management on and off by this three-step process:

1.	Sets the desired bits for the MSR (refer to Figure 8.3) in SRR1.
2.	Sets the desired address we want to jump to in SRR0.
3.	Executes the RFI instruction.
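The three steps can be modeled in a few lines of C. This is only a toy model of the register traffic, not kernel code: the toy_cpu structure and the toy_* functions are hypothetical, and the mask values 0x20 and 0x10 simply follow from IR and DR being bits 26 and 27 of a 32-bit MSR.

```c
#include <stdint.h>

#define TOY_MSR_IR 0x20u   /* instruction relocate */
#define TOY_MSR_DR 0x10u   /* data relocate */

/* Just the registers involved in the RFI sequence. */
typedef struct {
    uint32_t msr;    /* machine state register */
    uint32_t iar;    /* instruction address register */
    uint32_t srr0;   /* save/restore register 0 */
    uint32_t srr1;   /* save/restore register 1 */
} toy_cpu;

/* RFI: SRR0->IAR and SRR1->MSR. */
static void toy_rfi(toy_cpu *cpu)
{
    cpu->iar = cpu->srr0;
    cpu->msr = cpu->srr1;
}

/* Switch translation on or off and continue at target. */
static void toy_set_translation(toy_cpu *cpu, uint32_t target, int enable)
{
    uint32_t msr = cpu->msr;

    if (enable)
        msr |= TOY_MSR_IR | TOY_MSR_DR;
    else
        msr &= ~(TOY_MSR_IR | TOY_MSR_DR);

    cpu->srr1 = msr;      /* step 1: desired MSR bits into SRR1 */
    cpu->srr0 = target;   /* step 2: destination address into SRR0 */
    toy_rfi(cpu);         /* step 3: RFI loads both at once */
}
```

The point of the model is that the new MSR value and the new instruction address take effect together, which is why the kernel uses RFI rather than writing the MSR and then branching.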

8.3.2. x86 Intel-Based Hardware Memory Management

At power-on, all Intel processors are in real address mode. Real addressing is a compatibility mode for the early Intel processors: as processors grew more complex, newer processors still needed to run the legacy code that remained in use. In real address mode, the processor can execute a program written for the 8086 and 8088 processors using the same instructions and, more importantly, the same method of addressing or address translation. The end result of address translation is how the processor accesses the system memory. The early Intel processors had a 20-bit address bus, which could access 1MB of memory. This is the limitation put on the early code in the system. In real address mode, the linear address is the same as the physical address. As we move through the code that initializes memory management, we see more of the features of the later processors being used in the hardware and more complex structures added to the software.

The code in setup.S performs several important functions with respect to memory initialization:

----------------------------------------------------------------------
arch/i386/boot/setup.S
307  #define SMAP 0x534d4150
308
309  meme820:
310   xorl  %ebx, %ebx          # continuation counter
311   movw  $E820MAP, %di       # point into the whitelist
312                             # so we can have the bios
313                             # directly write into it.
314
315  jmpe820:
316   movl  $0x0000e820, %eax   # e820, upper word zeroed
317   movl  $SMAP, %edx         # ascii 'SMAP'
318   movl  $20, %ecx           # size of the e820rec
319   pushw  %ds                # data record.
320   popw  %es
321   int  $0x15                # make the call
322   jc  bail820               # fall to e801 if it fails
323
324   cmpl  $SMAP, %eax         # check the return is 'SMAP'
325   jne  bail820              # fall to e801 if it fails
326
...
333  good820:
334   movb  (E820NR), %al       # up to 32 entries
335   cmpb  $E820MAX, %al
336   jnl  bail820
337
338   incb  (E820NR)
339   movw  %di, %ax
340   addw  $20, %ax
341   movw  %ax, %di
342  again820:
343   cmpl  $0, %ebx            # check to see if
344   jne  jmpe820              # %ebx is set to EOF
345  bail820:
----------------------------------------------------------------------

Lines 307-345

Looking at the code segment, we first see (on line 321) a call to the BIOS int15h function with ax= 0xe820. This returns the addresses and lengths of the many different types of memory of which BIOS is aware. This simple memory map represents the basic pool from which all the pages of memory in Linux are obtained. As seen from further studying of the code, the memory map can be obtained by three methods: 0xe820, 0xe801, and 0x88. All three methods have to do with compatibility with existing BIOS distributions and their platforms.
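To give a feel for the data that the 0xe820 call leaves behind, here is a minimal sketch in C. The 20-byte record size comes from the listing; the field layout (64-bit base address, 64-bit length, 32-bit type) and the use of type 1 for usable RAM follow the conventional e820 record format and should be read as assumptions here. The names e820_record, E820_TYPE_RAM, and e820_usable_bytes are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define E820_TYPE_RAM 1u   /* assumed type code for usable memory */

/* One 20-byte record as written by the BIOS; packed so that the
 * compiler adds no padding after the 32-bit type field. */
#pragma pack(push, 1)
typedef struct {
    uint64_t addr;   /* start of the region */
    uint64_t size;   /* length of the region in bytes */
    uint32_t type;   /* kind of memory */
} e820_record;
#pragma pack(pop)

/* Add up the sizes of all regions reported as usable RAM. */
static uint64_t e820_usable_bytes(const e820_record *map, size_t count)
{
    uint64_t total = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        if (map[i].type == E820_TYPE_RAM)
            total += map[i].size;
    }
    return total;
}
```

A kernel walks a table like this to decide which physical ranges can be handed to the page allocator and which must be left alone.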


----------------------------------------------------------------------
arch/i386/boot/setup.S
595  # Now we move the system to its rightful place ... but we check if we have a
596  # big-kernel. In that case we *must* not move it ...
597   testb $LOADED_HIGH, %cs:loadflags
598   jz do_move0                   # .. then we have a normal low
599                                 # loaded zImage
600                                 # .. or else we have a high
601                                 # loaded bzImage
602   jmp end_move                  # ... and we skip moving
603
604  do_move0:
605   movw $0x100, %ax              # start of destination segment
606   movw %cs, %bp                 # aka SETUPSEG
607   subw $DELTA_INITSEG, %bp      # aka INITSEG
608   movw %cs:start_sys_seg, %bx   # start of source segment
609   cld
610  do_move:
611   movw %ax, %es                 # destination segment
612   incb %ah                      # instead of add ax,#0x100
613   movw %bx, %ds                 # source segment
614   addw $0x100, %bx
615   subw %di, %di
616   subw %si, %si
617   movw $0x800, %cx
618   rep
619   movsw
620   cmpw %bp, %bx                 # assume start_sys_seg > 0x200,
621                                 # so we will perhaps read one
622                                 # page more than needed, but
623                                 # never overwrite INITSEG
624                                 # because destination is a
625                                 # minimum one page below source
626   jb do_move
627
628  end_move:
----------------------------------------------------------------------

Lines 595-628

This code moves the kernel image, which is created by build.c and loaded by LILO, to its rightful place. The image is made up of the init sector (at segment 0x9000, address 0x90000), the setup sector (at segment 0x9020, address 0x90200), and the compressed image. The compressed image is originally loaded at address 0x10000. If it is LARGE (>0x7FF), it is left in place; otherwise, it is moved down to 0x1000.

----------------------------------------------------------------------
arch/i386/boot/setup.S
723   # Try enabling A20 through the keyboard controller
724  #endif /* CONFIG_X86_VOYAGER */
725  a20_kbc:
726   call  empty_8042
727
728  #ifndef CONFIG_X86_VOYAGER
729   call  a20_test      # Just in case the BIOS worked
730   jnz  a20_done       # but had a delayed reaction.
731  #endif
732
733   movb  $0xD1, %al    # command write
734   outb  %al, $0x64
735   call  empty_8042
736
737   movb  $0xDF, %al    # A20 on
738   outb  %al, $0x60
739   call  empty_8042
----------------------------------------------------------------------

Forming the 20-bit Physical Address in Intel Real Address Mode

The Intel 8088 processor in the original IBM PC had only 20 address lines [0...19]. This allowed the system to access up to 1 megabyte plus approximately 64K bytes of memory (0 to 0x10_FFEF) internally, but physically (on the bus) the last 64K of addressable memory was actually the first 64K of real memory!

Internal to the processor, a 20-bit address is formed from a 16-bit segment selector and a 16-bit segment offset. The selector is shifted left 4 bits and added to the offset, which is extended by 4 bits. The sum of these registers is the physical address seen on the bus.

For example:

To obtain the highest address, we load a segment selector (CS, DS, ES, and so on) with a value of 0xFFFF and an index register (SI, DI, and so on) with a value of 0xFFFF. Internal to the processor, the segment selector is shifted left 4 bits and added to the offset.

0xFFFF shifted left 4 bits	=	0x0F_FFF0
Add the offset	+	0x00_FFFF
Internal sum	=	0x10_FFEF
External Physical Address	=	0x00_FFEF

This resulting Physical Address is the same as a segment selector with the value of 0x0000 and an offset value of 0xFFEF (0000:FFEF).
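The arithmetic in this example is easy to reproduce. The following sketch (the function names are hypothetical) computes the internal sum and then shows what reaches the bus when address line A20 is unavailable or masked:

```c
#include <stdint.h>

/* Internal real-mode address: the segment shifted left 4 bits plus
 * the 16-bit offset. The result can be as large as 0x10FFEF. */
static uint32_t real_mode_internal(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* Address seen on the bus. With A20 disabled (or on a processor that
 * has only address lines 0..19), bit 20 of the sum is dropped, so
 * addresses above 1MB wrap around into low memory. */
static uint32_t real_mode_physical(uint16_t segment, uint16_t offset,
                                   int a20_enabled)
{
    uint32_t addr = real_mode_internal(segment, offset);

    if (!a20_enabled)
        addr &= ~(UINT32_C(1) << 20);
    return addr;
}
```

For FFFF:FFFF the internal sum is 0x10FFEF; with A20 masked the bus sees 0x00FFEF, the same location as 0000:FFEF.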

Accessing the highest address and above would wrap back into low memory at 0xFFEF. Certain programs written for this processor would depend on this 20-bit wrap-around behavior. The Intel 286 and later processors, which have wider address buses, incorporated Real Addressing mode to maintain backward compatibility with the 8088 and 8086. On its own, Real Addressing mode did not take into account legacy software that depended on the 20-bit wrap-around. The A20M# signal pin was added to mimic this "feature" of the earlier processors. Asserting this signal would mask off the A20 signal, allowing the low memory to be accessed once again.

A logic gate was used to enable or disable the memory bus A20 signal. The original design to assert this gate was to use an extra I/O signal from the keyboard controller that was controlled by I/O ports 0x60 and 0x64. A "Fast Gate A20" method was later developed which used I/O port 0x92 designed into the system board. Since all x86 processors come out of reset in Real Address mode, it is wise for boot code to make certain address line A20 is enabled by one or both of these methods.

Lines 723-739

This code is a fascinating throwback to the early Intel processors. This is a mere nuisance in the setup of Memory Management.

----------------------------------------------------------------------
arch/i386/boot/setup.S
790  # set up gdt and idt
791   lidt  idt_48               # load idt with 0,0
792   xorl  %eax, %eax           # Compute gdt_base
793   movw  %ds, %ax             # (Convert %ds:gdt to a linear ptr)
794   shll  $4, %eax
795   addl  $gdt, %eax
796   movl  %eax, (gdt_48+2)
797   lgdt  gdt_48               # load gdt with whatever is
798                              # appropriate
...
981  gdt:
982   .fill GDT_ENTRY_BOOT_CS,8,0
983
984   .word  0xFFFF              # 4Gb - (0x100000*0x1000 = 4Gb)
985   .word  0                   # base address = 0
986   .word  0x9A00              # code read/exec
987   .word  0x00CF              # granularity = 4096, 386
988                              # (+5th nibble of limit)
989
990   .word  0xFFFF              # 4Gb - (0x100000*0x1000 = 4Gb)
991   .word  0                   # base address = 0
992   .word  0x9200              # data read/write
993   .word  0x00CF              # granularity = 4096, 386
994                              # (+5th nibble of limit)
995  gdt_end:
996   .align  4
997
998   .word  0                   # alignment byte
999  idt_48:
1000  .word  0                   # idt limit = 0
1001  .word  0, 0                # idt base = 0L
1002
1003  .word  0                   # alignment byte
1004 gdt_48:
1005  .word  gdt_end - gdt - 1   # gdt limit
1006  .word  0, 0                # gdt base (filled in later)
----------------------------------------------------------------------

Lines 790-797

The structures and data for the provisional GDT and IDT are compiled into the end of setup.S. These tables are implemented in their simplest form.

Lines 981-1006

These lines are the compiled-in values for the provisional GDT. The GDT has a code and data descriptor, each representing 4GB of memory starting at 0x00. The IDT is left initialized to 0x00 and is filled in later.
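To see why four .word values describe 4GB of memory starting at 0x00, it helps to unpack a descriptor. The sketch below decodes the four 16-bit words in the order they appear in the listing, using the standard x86 segment descriptor layout; the type and function names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint32_t base;          /* 32-bit segment base */
    uint32_t limit;         /* 20-bit segment limit */
    uint8_t  access;        /* access byte (type, DPL, present) */
    int      granular_4k;   /* G flag: limit counted in 4KB units */
    int      is_32bit;      /* D/B flag: 32-bit segment */
} gdt_fields;

/* w[0..3] are the four words of one descriptor, lowest address first. */
static gdt_fields gdt_decode(const uint16_t w[4])
{
    gdt_fields f;

    f.limit = (uint32_t)w[0] | ((uint32_t)(w[3] & 0x000Fu) << 16);
    f.base  = (uint32_t)w[1] |
              ((uint32_t)(w[2] & 0x00FFu) << 16) |
              ((uint32_t)(w[3] & 0xFF00u) << 16);
    f.access      = (uint8_t)(w[2] >> 8);
    f.granular_4k = (w[3] & 0x0080u) != 0;
    f.is_32bit    = (w[3] & 0x0040u) != 0;
    return f;
}

/* Number of bytes the segment spans. */
static uint64_t gdt_span_bytes(gdt_fields f)
{
    uint64_t units = (uint64_t)f.limit + 1;

    return f.granular_4k ? units << 12 : units;
}
```

Applied to the code descriptor (0xFFFF, 0, 0x9A00, 0x00CF), the sketch yields a base of 0, a limit of 0xFFFFF, and, because the granularity flag is set, a span of 0x100000 pages of 4KB each, or 4GB. The data descriptor differs only in its access byte (0x92 instead of 0x9A).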

As far as memory management on an Intel platform is concerned, entering protected mode is one of the most important phases. At this point, the hardware begins to build a virtual address space for the operating system.

Protected Mode

The Intel method of memory management is called protected mode. The protection refers to multiple independent segmented address spaces that are protected from each other. The other half of Intel memory management is paging or page translation. System programmers can make use of various combinations of segmentation and paging, but Linux uses a flat model where segmentation is all but eliminated. In the flat model, each process has access to its entire 32-bit address space (4GB).

----------------------------------------------------------------------
arch/i386/boot/setup.S
830   movw  $1, %ax     # protected mode (PE) bit
831   lmsw  %ax         # This is it!
832   jmp  flush_instr
833
834  flush_instr:
835   xorw  %bx, %bx    # Flag to indicate a boot
836   xorl  %esi, %esi  # Pointer to real-mode code
837   movw  %cs, %si
838   subw  $DELTA_INITSEG, %si
839   shll  $4, %esi
----------------------------------------------------------------------

Lines 830-831

Set the PE bit in the Machine Status Word to enter protected mode. The jmp instruction begins executing in protected mode.

Lines 834-839

Save a 32-bit pointer to the real-mode code for decompressing and loading the kernel later on in startup_32().

Recall that in real addressing mode, code is executed by using 16-bit instructions. The current file is compiled using the .code16 assembler directive, which enforces this mode; this is also known as a 16-bit module in the Intel Programmer's Reference. To jump from a 16-bit module to a 32-bit module, the Intel architecture (and assembler magic) allows us to build a 32-bit instruction in a 16-bit module.

Build and execute the 32-bit jump:

----------------------------------------------------------------------
arch/i386/boot/setup.S
841  # jump to startup_32 in arch/i386/kernel/head.S
842  #
843  # NOTE: For high loaded big kernels we need a
844  #  jmpi 0x100000,__BOOT_CS
845  #
846  #  but we haven't yet reloaded the CS register, so the default size
847  #  of the target offset still is 16 bit.
848  #  However, using an operand prefix (0x66), the CPU will properly
849  #  take our 48 bit far pointer. (INTeL 80386 Programmer's Reference
850  #  Manual, Mixing 16-bit and 32-bit code, page 16-6)
851
852   .byte 0x66, 0xea          # prefix + jmpi-opcode
853  code32:  .long  0x1000     # will be set to 0x100000
854                            # for big kernels
855   .word  __BOOT_CS
----------------------------------------------------------------------

Line 852

This line builds the 32-bit jump instruction.

After this jump is executed, the system uses the provisional GDT and the code is executing in 32-bit protected mode, starting at the label startup_32 in arch/i386/kernel/head.S line 57.

8.3.2.1. Protected Mode

Until this point, the discussion has been how to get the Intel system ready to set up paging. As we trace through the code in head.S, we see what initialization needs to take place and how Linux uses the x86-based protected mode paging system. This is the final code before the kernel is started in main.c. For complete information on the many possible modes and settings that relate to memory initialization and Intel processors, look at the Intel Architecture Software Developers Manual, Volume 3.

----------------------------------------------------------------------
arch/i386/kernel/head.S
057  ENTRY(startup_32)
058
059  /*
060   * Set segments to known values.
061   */
062   cld
063   lgdt boot_gdt_descr - __PAGE_OFFSET
064   movl $(__BOOT_DS),%eax
065   movl %eax,%ds
066   movl %eax,%es
067   movl %eax,%fs
068   movl %eax,%gs
...
081  /*
082   * Initialize page tables. This creates a PDE and a set of page
083   * tables, which are located immediately beyond _end. The variable
084   * init_pg_tables_end is set up to point to the first "safe" location.
085   * Mappings are created both at virtual address 0 (identity mapping)
086   * and PAGE_OFFSET for up to _end+sizeof(page tables)+INIT_MAP_BEYOND_END.
087   *
088   * Warning: don't use %esi or the stack in this code. However, %esp
089   * can be used as a GPR if you really need it...
090   */
091  page_pde_offset = (__PAGE_OFFSET >> 20);
092
093   movl $(pg0 - __PAGE_OFFSET), %edi
094   movl $(swapper_pg_dir - __PAGE_OFFSET), %edx
095   movl $0x007, %eax                 /* 0x007 = PRESENT+RW+USER */
096  10:
097   leal 0x007(%edi),%ecx             /* Create PDE entry */
098   movl %ecx,(%edx)                  /* Store identity PDE entry */
099   movl %ecx,page_pde_offset(%edx)   /* Store kernel PDE entry */
100   addl $4,%edx
101   movl $1024, %ecx
102  11:
103   stosl
104   addl $0x1000,%eax
105   loop 11b
106   /* End condition: we must map up to and including INIT_MAP_BEYOND_END */
107   /* bytes beyond the end of our own page tables; the +0x007 is the attribute bits */
108   leal (INIT_MAP_BEYOND_END+0x007)(%edi),%ebp
109   cmpl %ebp,%eax
110   jb 10b
111   movl %edi,(init_pg_tables_end - __PAGE_OFFSET)
112
113  #ifdef CONFIG_SMP
...
156  3:
157  #endif /* CONFIG_SMP */
158
159  /*
160   * Enable paging
161   */
162   movl $swapper_pg_dir-__PAGE_OFFSET,%eax
163   movl %eax,%cr3           /* set the page table pointer.. */
164   movl %cr0,%eax
165   orl $0x80000000,%eax
166   movl %eax,%cr0           /* ..and set paging (PG) bit */
167   ljmp $__BOOT_CS,$1f      /* Clear prefetch and normalize %eip */
168  1:
169   /* Set up the stack pointer */
170   lss stack_start,%esp
...
177   pushl $0
178   popfl
179
180  #ifdef CONFIG_SMP
181   andl %ebx,%ebx
182   jz 1f                    /* Initial CPU cleans BSS */
183   jmp checkCPUtype
184  1:
185  #endif /* CONFIG_SMP */
186
187  /*
188   * start system 32-bit setup. We need to re-do some of the things done
189   * in 16-bit mode for the "real" operations.
190   */
191   call setup_idt
192
193  /*
194   * Copy bootup parameters out of the way.
195   * Note: %esi still has the pointer to the real-mode data.
196   */
197   movl $boot_params,%edi
198   movl $(PARAM_SIZE/4),%ecx
199   cld
200   rep
201   movsl
202   movl boot_params+NEW_CL_POINTER,%esi
203   andl %esi,%esi
204   jnz 2f                   # New command line protocol
205   cmpw $(OLD_CL_MAGIC),OLD_CL_MAGIC_ADDR
206   jne 1f
207   movzwl OLD_CL_OFFSET,%esi
208   addl $(OLD_CL_BASE_ADDR),%esi
209  2:
210   movl $saved_command_line,%edi
211   movl $(COMMAND_LINE_SIZE/4),%ecx
212   rep
213   movsl
214  1:
215  checkCPUtype:
...
279   lgdt cpu_gdt_descr
280   lidt idt_descr
...
303   call start_kernel
----------------------------------------------------------------------

Line 57

This line is the 32-bit protected mode entry point for the kernel code. Currently, the code uses the provisional GDT.

Line 63

This code initializes the GDTR with the base address of the boot GDT. This boot GDT is the same as the provisional GDT used in setup.S (4GB code and data starting at address 0x00000000) and is used only by this boot code.

Lines 64-68

Initialize the remaining segment registers with __BOOT_DS, which resolves to 24 (see include/asm-i386/segment.h). A selector holds its descriptor index in bits 3 through 15, so the value 24 (0x18) selects the descriptor at index 3 (byte offset 24, counting entries from 0) in the GDT that was loaded on line 63.
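To see why a selector value of 24 does not mean "entry 24," it helps to split the selector into its three fields. The following is a minimal sketch, not kernel code; the structure and the helper decode_selector() are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* The three fields packed into a 16-bit x86 segment selector. */
struct selector_fields {
    unsigned index; /* bits 3..15: descriptor index within the table */
    unsigned ti;    /* bit 2: table indicator, 0 = GDT, 1 = LDT */
    unsigned rpl;   /* bits 0..1: requested privilege level */
};

/* Hypothetical helper: split a selector value into its fields. */
static struct selector_fields decode_selector(uint32_t sel)
{
    struct selector_fields f;

    assert(sel <= 0xFFFF);      /* a selector is 16 bits wide */
    f.index = sel >> 3;         /* each descriptor occupies 8 bytes */
    f.ti    = (sel >> 2) & 1;
    f.rpl   = sel & 3;
    return f;
}
```

Passing 24 to decode_selector() yields index 3, table indicator 0 (GDT), and privilege level 0. Because each descriptor is 8 bytes long, a selector with the low three bits clear is also the byte offset of its descriptor within the table.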

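Lines 91-111

Create page directory entries (PDEs) in swapper_pg_dir that reference the boot page tables, which start at pg0. Each page table is entered into the page directory twice: once at the slot that maps virtual address 0 (the identity mapping) and once at the slot that maps PAGE_OFFSET (kernel memory). Both mappings therefore lead to the same physical pages.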
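The loop on lines 96-110 is compact enough that it can be easier to follow as a C model. The sketch below mirrors the register usage of the assembly but is only an illustration: the structure boot_tables, the function build_boot_tables(), the limit MAX_BOOT_TABLES, and the assumption that PAGE_OFFSET is 0xC0000000 are ours, and the physical address of pg0 is passed in as a parameter because the real value depends on where the kernel image ends.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE       0x1000u
#define PTRS_PER_TABLE  1024u
#define PAGE_ATTRS      0x007u               /* PRESENT+RW+USER, as in the listing */
#define KERNEL_PDE_BASE (0xC0000000u >> 22)  /* assumes PAGE_OFFSET = 0xC0000000 */
#define MAX_BOOT_TABLES 8u                   /* arbitrary limit for this model */

struct boot_tables {
    uint32_t pgd[PTRS_PER_TABLE];                 /* models swapper_pg_dir */
    uint32_t pt[MAX_BOOT_TABLES][PTRS_PER_TABLE]; /* models pg0 and its successors */
    uint32_t tables_end;                          /* models init_pg_tables_end */
    unsigned n_tables;                            /* page tables actually filled */
};

/* Hypothetical model of lines 93-111: pg0_phys plays the role of
 * (pg0 - __PAGE_OFFSET) and map_beyond_end that of INIT_MAP_BEYOND_END. */
static void build_boot_tables(struct boot_tables *bt,
                              uint32_t pg0_phys, uint32_t map_beyond_end)
{
    uint32_t table_phys = pg0_phys;  /* %edi: where the next table is written */
    uint32_t entry = PAGE_ATTRS;     /* %eax: next PTE, starting at frame 0 */
    unsigned n = 0;                  /* stands in for the advancing %edx */

    do {
        assert(n < MAX_BOOT_TABLES);
        uint32_t pde = table_phys + PAGE_ATTRS;   /* line 97 */
        bt->pgd[n] = pde;                         /* line 98: identity PDE */
        bt->pgd[KERNEL_PDE_BASE + n] = pde;       /* line 99: kernel PDE */
        for (unsigned i = 0; i < PTRS_PER_TABLE; i++) {
            bt->pt[n][i] = entry;                 /* line 103: stosl */
            entry += PAGE_SIZE;                   /* line 104 */
        }
        table_phys += PAGE_SIZE;  /* stosl advanced %edi by one full table */
        n++;
    } while (entry < table_phys + map_beyond_end + PAGE_ATTRS); /* lines 108-110 */

    bt->n_tables = n;
    bt->tables_end = table_phys;                  /* line 111 */
}
```

As a usage example under assumed values, calling build_boot_tables() with pg0_phys set to 0x00400000 and map_beyond_end set to 128KB fills two page tables, which together map 2 x 1024 x 4KB = 8MB of physical memory at both virtual address 0 and PAGE_OFFSET.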

Lines 113-157

This code block sets up secondary (non-boot) processors to use the page tables. For this discussion, we focus on the boot processor.

Lines 162-164

The cr3 register is the entry point for x86 hardware paging. This register is initialized to point to the base of the Page Directory, which, in this case, is swapper_pg_dir.

Lines 165-168

Set the PG (paging) bit in cr0 of the boot processor. The PG bit enables the paging mechanism in the x86 architecture. The jump instruction (on line 167) is recommended when changing the PG bit to ensure that all instructions within the processor are serialized at the moment of entering or exiting paging mode.
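The constant 0x80000000 on line 165 is simply bit 31 of cr0. The short sketch below shows the same read-modify-write in C on an ordinary integer; it does not touch the real register, and the macro names and the helper cr0_enable_paging() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CR0_PE (1u << 0)   /* protection enable */
#define CR0_PG (1u << 31)  /* paging enable, 0x80000000 */

/* Hypothetical helper: compute the cr0 value written on line 166. */
static uint32_t cr0_enable_paging(uint32_t cr0)
{
    assert(cr0 & CR0_PE);  /* paging can only be enabled in protected mode */
    return cr0 | CR0_PG;   /* all other bits are left unchanged */
}
```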

Line 170

Initialize the stack to the start of the data segment (see also lines 401-403).

Lines 177-178

The eflags register is a read/write system register that contains the status of interrupts, modes, and permissions. This register is cleared by pushing a 0 onto the stack and directly popping it into the register with the popfl instruction.

Lines 180-185

The general-purpose register ebx is used as a flag that tells this code whether it is running on the boot processor. Because we are tracing this code as the boot processor, ebx has been cleared (0), and we jump to the call to setup_idt.

Line 191

The routine setup_idt initializes an Interrupt Descriptor Table (IDT) where each entry points to a dummy handler. The IDT, discussed in Chapter 7, "Scheduling and Kernel Synchronization," is a table of functions (or handlers) that are called when the processor needs to immediately execute time-critical code.

Lines 197-214

The user can pass certain parameters to Linux at boot time. They are stored here for later use.

Lines 215-303

The code on these lines does a large amount of necessary (but tedious) x86 processor-version checking and some minor initialization. Depending on whether the cpuid instruction is available, certain bits are set in the eflags register and cr0. One notable setting in cr0 is bit 4, the extension type (ET). This bit indicates the support of math-coprocessor instructions in older x86 processors. The most important lines of code in this block are lines 279-280. This is where the GDT and the IDT are loaded (by way of the lgdt and lidt instructions) into the gdtr and idtr registers. Finally, on line 303, we call the routine start_kernel().

With the code in head.S, the system can now translate a logical address to a linear address and, finally, to a physical address (see Figure 8.10). Starting with a logical address, the selector (in the CS, DS, ES, etc., registers) references one of the descriptors in the GDT. The offset is the flat address that we seek. The base address from the descriptor and the offset are combined to form the linear address. Because the boot descriptors describe 4GB segments starting at address 0x00000000, the linear address here is equal to the offset.
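The first stage of this translation, from a selector and offset to a linear address, can be sketched in a few lines of C. This is a simplified model rather than a description of the hardware: the structure flat_descriptor keeps only a base and a limit, and the function logical_to_linear() is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified descriptor: real descriptors also carry type and privilege bits. */
struct flat_descriptor {
    uint32_t base;   /* linear address where the segment starts */
    uint32_t limit;  /* highest valid offset within the segment */
};

/* Hypothetical model of segmentation: selector + offset -> linear address. */
static uint32_t logical_to_linear(const struct flat_descriptor *gdt,
                                  unsigned n_entries,
                                  uint16_t selector, uint32_t offset)
{
    unsigned index = selector >> 3;       /* descriptor index in the GDT */

    assert(index < n_entries);            /* selector must name a valid entry */
    assert(offset <= gdt[index].limit);   /* offset must lie inside the segment */
    return gdt[index].base + offset;
}
```

With a descriptor whose base is 0 and whose limit is 0xFFFFFFFF, the function returns the offset unchanged, which is why the boot-time segments are described as flat.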

Figure 8.10. Boot-Time Paging

In the code walkthrough, we saw how the Page Directory (swapper_pg_dir) and Page Table (pg0) were created and that cr3 was initialized to point to the Page Directory. As previously discussed, the processor learns where to find the paging structures from the value in cr3, and setting the PG bit in cr0 tells the processor to start using them. In the linear address, bits 22:31 select the Page Directory Entry (PDE), bits 12:21 select the Page Table Entry (PTE), and bits 0:11 give the offset into the physical page (4KB in this example).
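The three bit fields can be extracted with shifts and masks. The helpers below are a minimal sketch with hypothetical names; they assume 4KB pages and the two-level, 32-bit paging layout described above.

```c
#include <assert.h>
#include <stdint.h>

/* Bits 22:31 select one of 1024 page directory entries. */
static unsigned pde_index(uint32_t linear)   { return linear >> 22; }

/* Bits 12:21 select one of 1024 page table entries. */
static unsigned pte_index(uint32_t linear)   { return (linear >> 12) & 0x3FF; }

/* Bits 0:11 are the byte offset within the 4KB page. */
static unsigned page_offset(uint32_t linear) { return linear & 0xFFF; }

/* Inverse operation: rebuild a linear address from its three fields. */
static uint32_t linear_from_parts(unsigned pde, unsigned pte, unsigned off)
{
    assert(pde < 1024 && pte < 1024 && off < 4096);
    return ((uint32_t)pde << 22) | ((uint32_t)pte << 12) | off;
}
```

For example, the linear address 0xC0100ABC splits into PDE index 768, PTE index 256, and page offset 0xABC. Index 768 is the first kernel entry of the page directory when PAGE_OFFSET is 0xC0000000.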

The system now has 8MB of memory mapped out using a provisional paging system. The next step is to call the function start_kernel() in init/main.c.

8.3.3. PowerPC and x86 Code Convergence

Notice that both the PowerPC code and the x86 code have now converged on start_kernel() in init/main.c. This routine, which is located in the architecture-independent section of the code, calls architecture-specific routines to finish memory initialization.

For x86, one of the first functions called from start_kernel() is setup_arch() in arch/i386/kernel/setup.c, which then calls paging_init() in arch/i386/mm/init.c, which then calls pagetable_init() in the same file. The remainder of system memory is mapped here to produce the final page tables.

In the PowerPC world, much has already been done. The setup_arch() function in arch/ppc/kernel/setup.c calls paging_init() in arch/ppc/mm/init.c. The one notable function performed in paging_init() for PPC is to set all pages to be in the DMA zone.