8.3. Architecture-Dependent Memory Initialization

We now take a moment to discuss the hardware memory-management features of PPC and x86. Both the x86 and PowerPC architectures have hardware memory-management features to support real and virtual addressing environments. As in all operating systems, Linux memory management depends on the underlying hardware architecture. This section describes the hardware initialization of both architectures. Because the initialization of memory management is extremely hardware dependent, the hardware specifications need to be understood in order to follow the initialization process. Memory management is one of the first subsystems to be initialized and, because of its highly architecture-dependent nature, its initialization begins prior to the execution of start_kernel().

8.3.1. PowerPC Hardware Memory Management

This section describes the hardware-supported features of address translation specific to the PowerPC architecture, which are also known as "storage control" in the PowerPC world. We follow up with a discussion on how Linux uses (or disregards, for the sake of portability) these features from system power-on through kernel initialization.

8.3.1.1. Real Addressing Mode

From embedded up to high performance, all PowerPC processors come out of hardware reset in real mode.[3] PowerPC real-addressing mode is defined as having the processor in a state of disabled address translation. Address translation is controlled by the instruction relocate (IR) and data relocate (DR) bits in the Machine State Register (MSR). For instruction fetches, if the IR bit is 0, the effective address (EA) is the same as the real address. For load and store instructions, the DR bit in the MSR plays a similar role.
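As a rough illustration of how the IR and DR bits gate translation, the following C sketch models a 32-bit MSR value. It assumes the 32-bit layout in which IR and DR are bits 26 and 27 under the PowerPC convention that bit 0 is the most significant bit. The macro PPC_BIT32 and the helper names are hypothetical and exist only for this example; they are not kernel interfaces.

```c
#include <stdint.h>
#include <stdbool.h>

/* PowerPC numbers bits from the most significant end, so bit n of a
 * 32-bit register corresponds to the mask 1 << (31 - n). */
#define PPC_BIT32(n)  ((uint32_t)1 << (31 - (n)))

#define MSR_IR  PPC_BIT32(26)   /* instruction relocate (0x20) */
#define MSR_DR  PPC_BIT32(27)   /* data relocate        (0x10) */

/* Hypothetical helpers, for illustration only. */
static bool fetch_is_translated(uint32_t msr)
{
    return (msr & MSR_IR) != 0;
}

static bool data_is_translated(uint32_t msr)
{
    return (msr & MSR_DR) != 0;
}

/* Real mode: both relocate bits are clear, so EA == real address. */
static bool in_real_mode(uint32_t msr)
{
    return (msr & (MSR_IR | MSR_DR)) == 0;
}
```

With these definitions, an MSR value of 0 is in real mode, while setting only MSR_IR translates instruction fetches but leaves loads and stores untranslated.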
The MSR, which is illustrated in Figure 8.3, is a 64- or 32-bit register that describes the current state of the processor. On a 32-bit implementation, IR and DR are bits 26 and 27.

Figure 8.3. PowerPC Machine State Register (MSR)

Because address translation in Linux is a combination of hardware and software structures, real mode is fundamental to the boot process of initializing the memory subsystem and the memory-management structures of Linux. The need to enable address translation is exemplified by the inherent limitations of real mode. Real mode is only capable of addressing the implemented address width; this is 64- or 32-bit in most applications. The two major limitations are as follows:
8.3.1.2. Address Translation

Real addressing is simply the absence of address translation. Address translation opens the door to virtual addressing, where every possible address is not physically available at any given instant, but through the clever use of hardware and software, every possible address can be made virtually available when accessed. With address translation enabled, the PowerPC architecture translates an EA by one of two methods: Segmented Address Translation or Block Address Translation (see Figure 8.4). If the EA can be translated by both methods, Block Address Translation takes precedence. Address translation is said to be enabled when MSR[IR]=1, or MSR[DR]=1, or both. Segmented Address Translation breaks virtual memory into segments, which are divided into 4KB pages, each representing physical memory. Block Address Translation breaks memory into regions ranging from 128KB to 256MB.

Figure 8.4. 32-Bit Address Translation
Segmented Address Translation: Direct Store Segment

The next level of translation is determined by the T bit, which is located in the Segment Register. Bits 0:3 of the EA select one of 16 segment registers (SRs) in the PowerPC 7xx series. Figure 8.5 illustrates the segment register.

Figure 8.5. Segment Register

With the T bit set, the segment is deemed a direct store segment to an I/O device, and there is no reference to hardware page tables. The I/O address is made up of a permission bit, the BUID, the controller-specific field, and bits 4:31 of the EA. Linux does not use direct store segmentation.

Segmented Address Translation: Ordinary Segment

When the T bit is not set, the virtual segment ID (VSID) field is used. Referring to Figure 8.6, a 52-bit virtual address (VA) is formed by concatenating bits 20:31 of the EA (the offset within a given page), bits 4:19 of the EA, and bits 8:31 of the selected segment register (the VSID field). The most significant 40 bits of the VA make up the virtual page number (VPN). The PowerPC architecture uses a Hashed Page Table to map VPNs to real page numbers (the real address of a desired page in memory). The hash function uses the VPN and the value in Storage Description Register 1 (SDR1) to store and retrieve a Page Table Entry (PTE). The PTE, which is illustrated in Figure 8.7, is an 8-byte structure that contains all the necessary attributes of a page in memory.

Figure 8.6. Segment Translation

Figure 8.7. Page Table Entry

Block Address Translation

As its name implies, Block Address Translation (BAT) is an addressing mechanism that allows for mapping blocks of contiguous memory from 128KB to 256MB. BAT registers are privileged special purpose registers (SPRs) in the PowerPC architecture. Figure 8.8 illustrates the BAT register.

Figure 8.8. BAT Register

The formation of a real address from a BAT register can be seen in Figure 8.9.
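The ordinary-segment address formation described above (Figure 8.6) can be sketched in C as follows. This is a minimal model under stated assumptions: the segment registers are represented as a plain array, the function names are hypothetical, and the hashed page table lookup that follows in hardware is not modeled.

```c
#include <stdint.h>
#include <stdbool.h>

#define SR_T_BIT      0x80000000u  /* bit 0 of a segment register */
#define SR_VSID_MASK  0x00FFFFFFu  /* bits 8:31 of a segment register */

/* Bits 0:3 of the EA select one of the 16 segment registers. */
static unsigned sr_index(uint32_t ea)
{
    return ea >> 28;
}

/* T = 1 marks a direct store segment, which Linux does not use. */
static bool is_direct_store(uint32_t sr_value)
{
    return (sr_value & SR_T_BIT) != 0;
}

/* Form the 52-bit VA for an ordinary segment (T = 0):
 * 24-bit VSID | 16-bit page index (EA 4:19) | 12-bit offset (EA 20:31). */
static uint64_t ea_to_va(uint32_t ea, const uint32_t sr[16])
{
    uint64_t vsid       = sr[sr_index(ea)] & SR_VSID_MASK;
    uint64_t page_index = (ea >> 12) & 0xFFFFu;
    uint64_t offset     = ea & 0xFFFu;

    return (vsid << 28) | (page_index << 12) | offset;
}

/* The most significant 40 bits of the VA are the virtual page number. */
static uint64_t va_to_vpn(uint64_t va)
{
    return va >> 12;
}
```

For example, with segment register 12 holding a VSID of 0xABCDEF, the EA 0xC0123456 selects that register, contributes page index 0x0123 and offset 0x456, and yields the VPN 0xABCDEF0123.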
Four Instruction BAT (IBAT) registers and four Data BAT (DBAT) registers can be read or written using mtspr and mfspr PPC instructions.[4]
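To make block translation more concrete, here is a small C sketch of how a BAT pair might be checked against an EA. The field positions (BEPI and BRPN in bits 0:14, the block length mask BL in bits 19:29 of the upper register) are assumptions taken from the 32-bit PowerPC BAT register format rather than from this text, the valid bits and protection bits are ignored, and the function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

#define BAT_BLOCK_MIN  (128u * 1024u)  /* smallest block: 128KB */

#define BATU_BEPI  0xFFFE0000u  /* upper BAT, bits 0:14: effective page index */
#define BATU_BL    0x00001FFCu  /* upper BAT, bits 19:29: block length mask */
#define BATL_BRPN  0xFFFE0000u  /* lower BAT, bits 0:14: real page number */

/* Block size in bytes: (BL + 1) * 128KB, where BL is 0, 1, 3, ... 0x7FF. */
static uint32_t bat_block_size(uint32_t batu)
{
    uint32_t bl = (batu & BATU_BL) >> 2;

    return (bl + 1) * BAT_BLOCK_MIN;
}

/* Returns true and stores the real address if the EA falls in the block.
 * Simplification: the Vs/Vp valid bits and PP protection are not checked. */
static bool bat_translate(uint32_t batu, uint32_t batl,
                          uint32_t ea, uint32_t *ra)
{
    uint32_t offset_mask = bat_block_size(batu) - 1;

    if ((ea & ~offset_mask) != (batu & BATU_BEPI & ~offset_mask))
        return false;

    *ra = (batl & BATL_BRPN & ~offset_mask) | (ea & offset_mask);
    return true;
}
```

A BL value of 0 gives the 128KB minimum and 0x7FF gives the 256MB maximum. A 16MB block (BL = 0x7F) with an effective base of 0xC0000000 and a real base of 0 would translate 0xC0123456 to 0x00123456.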
Figure 8.9. BAT Real

Translation Lookaside Buffers

The Translation Lookaside Buffers (TLBs) can be thought of as a hardware cache with hardware protection for the paging system. The size of the TLB varies among PowerPC implementations, and the TLB contains an index of the most recently used PTEs. The paging software must be sure to keep the TLBs in sync with the page table. When the processor cannot find a page in the hash table,[5] the Linux page tables are then searched. If the page is still not found, a normal page fault is generated. Information on optimizing the synchronization between the Linux page tables and the PPC hash tables can be found in the document "Low Level Optimizations in the PowerPC/Linux Kernels," by Paul Mackerras.
Storage Access Mode Control

When address translation is enabled (MSR[IR]=1, or MSR[DR]=1, or both) and accomplished by way of Segmented Address Translation or Block Address Translation, the storage mode is determined by four control bits: W, I, M, and G. For Segmented Address Translation, they are bits 25:28 of the second word of a PTE; for Block Address Translation, they are the same bits in the second SPR of the DBAT. (The G bit is reserved in the IBAT.) Two more bits, Referenced and Changed, which are located in the PTE, are available for Segmented Address Translation. The R and C bits are set by hardware or software. (See the following sidebar for a discussion of the W, I, M, G, R, and C bits.)
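As a small illustration of the W, I, M, and G bits, the following C sketch decodes them from the second word of a PTE. The masks follow from bits 25:28 under the PowerPC convention that bit 0 is the most significant bit of the 32-bit word; the struct and function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Masks within the second 32-bit word of a PTE (mask = 1 << (31 - bit)). */
#define PTE_W  0x040u  /* bit 25: write-through */
#define PTE_I  0x020u  /* bit 26: caching inhibited */
#define PTE_M  0x010u  /* bit 27: memory coherence */
#define PTE_G  0x008u  /* bit 28: guarded */

/* Hypothetical decoded view of the storage mode. */
struct storage_mode {
    bool write_through;
    bool cache_inhibited;
    bool coherent;
    bool guarded;
};

static struct storage_mode decode_wimg(uint32_t pte_word1)
{
    struct storage_mode mode;

    mode.write_through   = (pte_word1 & PTE_W) != 0;
    mode.cache_inhibited = (pte_word1 & PTE_I) != 0;
    mode.coherent        = (pte_word1 & PTE_M) != 0;
    mode.guarded         = (pte_word1 & PTE_G) != 0;
    return mode;
}
```

A mapping for a memory-mapped device, for instance, would typically set I and G so that accesses bypass the cache and are not performed speculatively.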
8.3.1.3. How Linux Uses PPC Address Translation

We now look at the code that influences memory management in PPC. The following code is the first in the kernel distribution to get control. This routine calls back into the firmware to allocate temporary regions by using the claim() function. The kernel is then decompressed into its proper location:

----------------------------------------------------------------------
arch/ppc/boot/openfirmware/newworldmain.c
40 void boot(int a1, int a2, void *prom)
...
54     claim(initrd_start, RAM_END - initrd_start, 0);
55     printf("initial ramdisk moving 0x%x <- 0x%p (%x bytes)\n\r",
56         initrd_start, (char *)(&__ramdisk_begin), initrd_size);
57     memcpy((char *)initrd_start, (char *)(&__ramdisk_begin), initrd_size);
...
63     /* claim 3MB starting at PROG_START */
64     claim(PROG_START, PROG_SIZE, 0);
65     dst = (void *) PROG_START;
66     if (im[0] == 0x1f && im[1] == 0x8b) {
67         /* claim some memory for scratch space */
68         avail_ram = (char *) claim(0, SCRATCH_SIZE, 0x10);
69         begin_avail = avail_high = avail_ram;
70         end_avail = avail_ram + SCRATCH_SIZE;
71         printf("heap at 0x%p\n", avail_ram);
72         printf("gunzipping (0x%p <- 0x%p:0x%p)...", dst, im, im+len);
73         gunzip(dst, PROG_SIZE, im, &len);
74         printf("done %u bytes\n", len);
75         printf("%u bytes of heap consumed, max in use %u\n",
76             avail_high - begin_avail, heap_max);
...
86     sa = (unsigned long)PROG_START;
87     printf("start address = 0x%x\n", sa);
88
89     (*(kernel_start_t)sa)(a1, a2, prom);
----------------------------------------------------------------------

Line 40
The entry point to this file is the function boot(a1, a2, *prom).

Line 54
Function claim() is called to allocate memory just below 1M, and the ramdisk is copied into that memory.

Line 64
Function claim() is called to allocate 3M of memory, starting at 0x1_0000, for the image.

Line 68
Function claim() is called to allocate 8K of memory starting at 0x00 for scratch/heap.

Line 73
The image is gunzipped to address 0x1_0000 (PROG_START).
Line 89
Jump to 0x1_0000 ((*kernel_start_t)sa) with parameters (a1, a2, and prom), where a1 holds the value in r3 (equal to the boot ramdisk start), a2 holds the value in r4 (equal to the boot ramdisk size, or 0xdeadbeef in the case of no ramdisk), and prom holds the value in r5 (code stored in system ROM).

The next code block readies the hardware memory-management features of the various PowerPC processors. The first 16M of RAM is mapped to 0xc0000000:

----------------------------------------------------------------------
arch/ppc/kernel/head.S
131 __start:
...
150     bl early_init in <arch/ppc/kernel/setup.c> (283)
...
170     bl mmu_off
...
171     RFI: SRR0=>IP, SRR1=>MSR
172 #ifndef CONFIG_POWER4
173     bl clear_bats
174     bl flush_tlbs
175
176     bl initial_bats
177 #if !defined(CONFIG_APUS) && defined(CONFIG_BOOTX_TEXT)
178     bl setup_disp_bat
179 #endif
180 #else /* CONFIG_POWER4 */
181     bl reloc_offset
182     bl initial_mm_power4
183 #endif /* CONFIG_POWER4 */
185 /*
186  * Call setup_cpu for CPU 0 and initialize 6xx Idle
187  */
188     bl reloc_offset
189     li r24,0 /* cpu# */
190     bl call_setup_cpu /* Call setup_cpu for this CPU */
195 #ifdef CONFIG_POWER4
196     bl reloc_offset
197     bl init_idle_power4
198 #endif /* CONFIG_POWER4 */
199
210     bl reloc_offset
211     mr r26,r3
212     addis r4,r3,KERNELBASE@h /* current address of _start */
213     cmpwi 0,r4,0 /* are we already running at 0? */
214     bne relocate_kernel
215
...
224 turn_on_mmu:
225     mfmsr r0
226     ori r0,r0,MSR_DR|MSR_IR
227     mtspr SRR1,r0
228     lis r0,start_here@h
229     ori r0,r0,start_here@l
230     mtspr SRR0,r0
231     SYNC
232     RFI /* enables MMU */
----------------------------------------------------------------------

Line 131
This is the entry point to this code. It gets a minimal MMU environment set up. (Note that APUS stands for Amiga Power Up System.)

Line 150
There might be a difference between where the kernel is loaded and where it is linked. The function early_init returns the physical address of the current code.
Line 170
Shut off the memory-management unit of the PPC. If both IR and DR are enabled, leave them on; otherwise, shut off relocation.

Lines 173-176
If not power4 or G5, clear the BAT registers, flush TLBs, and set up BATs to map the first 16M of RAM to 0xc0000000. Note the various labels for kernel memory used throughout the kernel:

----------------------------------------------------------------------
arch/ppc/defconfig
CONFIG_KERNEL_START=0xc0000000
----------------------------------------------------------------------

and

----------------------------------------------------------------------
include/asm-ppc/page.h
#define PAGE_OFFSET CONFIG_KERNEL_START
#define KERNELBASE PAGE_OFFSET
----------------------------------------------------------------------

Lines 181-182
By using segmentation, set up kernel memory for power4 and G5.

Lines 188-198
setup_cpu() initializes the kernel and user features, such as cache configuration, or whether an FPU or MMU exists. (Note that at this writing, init_idle_power4 is a noop.)

Line 210
Relocate the kernel to KERNELBASE or 0x00, depending on the platform.

Lines 224-232
Turn on the MMU (if it is not already on) by enabling IR and DR in the MSR. Then, execute an RFI instruction, causing a jump to the label start_here:. (Note: The RFI instruction loads the MSR with the contents of SRR1 and branches to the address in SRR0.)

The following code is where the kernel starts. It sets up all memory in the system based on the command line:

----------------------------------------------------------------------
arch/ppc/kernel/head.S
1337 start_here:
...
1364     bl machine_init
1365     bl MMU_init
...
1385     lis r4,2f@h
1386     ori r4,r4,2f@l
1387     tophys(r4,r4)
1388     li r3,MSR_KERNEL & ~(MSR_IR|MSR_DR)
1389     FIX_SRR1(r3,r5)
1390     mtspr SRR0,r4
1391     mtspr SRR1,r3
1392     SYNC
1393     RFI
1394 /* Load up the kernel context */
1395 2:  bl load_up_mmu
...
1411 /* Now turn on the MMU for real!
*/
1412     li r4,MSR_KERNEL
1413     FIX_SRR1(r4,r5)
1414     lis r3,start_kernel@h
1415     ori r3,r3,start_kernel@l
1416     mtspr SRR0,r3
1417     mtspr SRR1,r4
1418     SYNC
1419     RFI
----------------------------------------------------------------------

Line 1337
This line is the entry point to this section.

Line 1364
machine_init() (see the file arch/ppc/kernel/setup.c, line 532) sets up machine-specific information, such as NVRAM, L2, CPU cache line size, debugging, and so on.

Line 1365
MMU_init() (see the file arch/ppc/mm/init.c, line 234) discovers the total memory size for highmem and lowmem. It then initializes the MMU hardware (MMU_init_hw(), line 267), sets up the Hash Page Table (arch/ppc/mm/hashtable.s), maps all RAM starting at KERNELBASE (mapin_ram(), line 272), maps all I/O (setup_io_mappings(), line 285), and initializes context management (mmu_context_init(), line 288).

Line 1385
Shut off IR and DR to set up SDR1. This register holds the real address of the Page Table and determines how many bits from the hash are used in the Page Table Index.

Line 1395
Clear TLBs, load SDR1 (hash table base and size), set up segmentation, and, depending on the particular PPC platform, initialize the BAT registers.

Lines 1412-1419
Turn on IR and DR, and RFI to start_kernel in /init/main.c.

Note that at interrupt time in the PowerPC architecture, the Instruction Address Register (IAR) holds the address the processor must return to after servicing the interrupt. This value is saved in Save Restore Register 0 (SRR0). The Machine State Register is in turn saved in Save Restore Register 1 (SRR1). In shorthand, at interrupt time:
The RFI instruction, which is normally executed at the end of an interrupt routine, is the inverse of this procedure, where SRR0 is restored to the IAR and SRR1 is restored to the MSR. In shorthand:
The code in lines 1385-1419 uses this methodology to turn memory management on and off by this three-step process:
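As an illustration of the save/restore behavior described above, the following C sketch models the four registers involved and the way head.S uses RFI to change the MSR and the instruction address in a single step. It is a deliberately simplified model: real hardware also modifies the MSR on interrupt entry, which is not shown here, and the struct and function names are hypothetical.

```c
#include <stdint.h>

#define MSR_IR  0x20u  /* instruction relocate */
#define MSR_DR  0x10u  /* data relocate */

/* Hypothetical model of the registers involved. */
struct ppc_cpu {
    uint32_t iar;   /* instruction address register */
    uint32_t msr;   /* machine state register */
    uint32_t srr0;  /* save/restore register 0 */
    uint32_t srr1;  /* save/restore register 1 */
};

/* Interrupt entry: the return address and machine state are saved.
 * (Simplified: the hardware's changes to the MSR are omitted.) */
static void take_interrupt(struct ppc_cpu *cpu, uint32_t vector)
{
    cpu->srr0 = cpu->iar;
    cpu->srr1 = cpu->msr;
    cpu->iar  = vector;
}

/* RFI: the inverse, restoring IAR from SRR0 and MSR from SRR1. */
static void rfi(struct ppc_cpu *cpu)
{
    cpu->iar = cpu->srr0;
    cpu->msr = cpu->srr1;
}

/* The pattern used in head.S: preload SRR1 with the desired MSR and
 * SRR0 with the target address, then execute RFI. */
static void jump_with_msr(struct ppc_cpu *cpu, uint32_t target,
                          uint32_t new_msr)
{
    cpu->srr1 = new_msr;
    cpu->srr0 = target;
    rfi(cpu);
}
```

Calling jump_with_msr() with MSR_IR | MSR_DR set in new_msr corresponds to turning translation on while branching, and calling it with those bits cleared corresponds to dropping back to real mode.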
8.3.2. x86 Intel-Based Hardware Memory Management

At power-on, all Intel processors are in real address mode. Real addressing is a compatibility mode for the early Intel processors. As processors grew more complex, legacy code remained in use that newer processors still needed to be able to run. In real address mode, the processor can execute a program written for the 8086 and 8088 processors using the same instructions and, more importantly, the same method of addressing or address translation. The end result of address translation is how the processor accesses the system memory. The early Intel processors had a 20-bit address bus, which could access approximately 1MB of memory. This is the limitation put on the early code in the system. In real address mode, the linear address is the same as the physical address. As we move through the code that initializes memory management, we see more of the features of the later processors being used in the hardware and more complex structures added to the software. The code in setup.S performs several important functions with respect to memory initialization:

----------------------------------------------------------------------
arch/i386/boot/setup.S
307 #define SMAP 0x534d4150
308
309 meme820:
310     xorl %ebx, %ebx          # continuation counter
311     movw $E820MAP, %di       # point into the whitelist
312                              # so we can have the bios
313                              # directly write into it.
314
315 jmpe820:
316     movl $0x0000e820, %eax   # e820, upper word zeroed
317     movl $SMAP, %edx         # ascii 'SMAP'
318     movl $20, %ecx           # size of the e820rec
319     pushw %ds                # data record.
320     popw %es
321     int $0x15                # make the call
322     jc bail820               # fall to e801 if it fails
323
324     cmpl $SMAP, %eax         # check the return is 'SMAP'
325     jne bail820              # fall to e801 if it fails
326
...
333 good820:
334     movb (E820NR), %al       # up to 32 entries
335     cmpb $E820MAX, %al
336     jnl bail820
337
338     incb (E820NR)
339     movw %di, %ax
340     addw $20, %ax
341     movw %ax, %di
342 again820:
343     cmpl $0, %ebx            # check to see if
344     jne jmpe820              # %ebx is set to EOF
345 bail820:
----------------------------------------------------------------------

Lines 307-345
Looking at the code segment, we first see (on line 321) a call to the BIOS int15h function with ax=0xe820. This returns the addresses and lengths of the many different types of memory of which the BIOS is aware. This simple memory map represents the basic pool from which all the pages of memory in Linux are obtained. As seen from further study of the code, the memory map can be obtained by three methods: 0xe820, 0xe801, and 0x88. All three methods exist for compatibility with existing BIOS distributions and their platforms.
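To give a feel for the data that the 0xe820 call collects, here is a hedged C sketch of a 20-byte map record and a helper that totals the usable RAM. The record size matches the $20 passed in %ecx on line 318; the struct layout (64-bit base, 64-bit length, 32-bit type), the type value 1 for usable RAM, and all names are assumptions made for this example. The packed attribute is a GCC/Clang extension.

```c
#include <stdint.h>
#include <stddef.h>

#define E820_RAM  1u  /* assumed type code for usable RAM */

/* One 20-byte record as written by the BIOS (assumed layout). */
struct e820_entry {
    uint64_t addr;  /* start of the memory region */
    uint64_t size;  /* length of the region in bytes */
    uint32_t type;  /* region type */
} __attribute__((packed));

/* Sum the sizes of all regions reported as usable RAM. */
static uint64_t usable_ram(const struct e820_entry *map, size_t count)
{
    uint64_t total = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        if (map[i].type == E820_RAM)
            total += map[i].size;
    }
    return total;
}
```

A table with a usable region below 640KB, a reserved region after it, and a usable region starting at 1MB would be summed over the first and third entries only.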
Lines 595-628
This code handles the kernel image created by build.c and loaded by LILO. The image is made up of the init sector (at address 0x9000), the setup sector (at address 0x9200), and the compressed image. The image is originally loaded at address 0x10000. If it is LARGE (>0x7FF), it is left in place; otherwise, it is moved down to 0x1000.

----------------------------------------------------------------------
arch/i386/boot/setup.S
723 # Try enabling A20 through the keyboard controller
724 #endif /* CONFIG_X86_VOYAGER */
725 a20_kbc:
726     call empty_8042
727
728 #ifndef CONFIG_X86_VOYAGER
729     call a20_test            # Just in case the BIOS worked
730     jnz a20_done             # but had a delayed reaction.
731 #endif
732
733     movb $0xD1, %al          # command write
734     outb %al, $0x64
735     call empty_8042
736
737     movb $0xDF, %al          # A20 on
738     outb %al, $0x60
739     call empty_8042
----------------------------------------------------------------------
Lines 723-739
This code enables the A20 address line through the keyboard controller, as the comment on line 723 indicates. It is a fascinating throwback to the early Intel processors, but it is a mere nuisance in the setup of memory management.

----------------------------------------------------------------------
arch/i386/boot/setup.S
790 # set up gdt and idt
791     lidt idt_48              # load idt with 0,0
792     xorl %eax, %eax          # Compute gdt_base
793     movw %ds, %ax            # (Convert %ds:gdt to a linear ptr)
794     shll $4, %eax
795     addl $gdt, %eax
796     movl %eax, (gdt_48+2)
797     lgdt gdt_48              # load gdt with whatever is
798                              # appropriate
...
981 gdt:
982     .fill GDT_ENTRY_BOOT_CS,8,0
983
984     .word 0xFFFF             # 4Gb - (0x100000*0x1000 = 4Gb)
985     .word 0                  # base address = 0
986     .word 0x9A00             # code read/exec
987     .word 0x00CF             # granularity = 4096, 386
988                              # (+5th nibble of limit)
989
990     .word 0xFFFF             # 4Gb - (0x100000*0x1000 = 4Gb)
991     .word 0                  # base address = 0
992     .word 0x9200             # data read/write
993     .word 0x00CF             # granularity = 4096, 386
994                              # (+5th nibble of limit)
995 gdt_end:
996     .align 4
997
998     .word 0                  # alignment byte
999 idt_48:
1000     .word 0                 # idt limit = 0
1001     .word 0, 0              # idt base = 0L
1002
1003     .word 0                 # alignment byte
1004 gdt_48:
1005     .word gdt_end - gdt - 1 # gdt limit
1006     .word 0, 0              # gdt base (filled in later)
----------------------------------------------------------------------

Lines 790-797
The structures and data for the provisional GDT and IDT are compiled into the end of setup.S. These tables are implemented in their simplest form.

Lines 981-1006
These lines are the compiled-in values for the provisional GDT. The GDT has a code and a data descriptor, each representing 4GB of memory starting at 0x00. The IDT is left initialized to 0x00 and is filled in later.

As far as memory management on an Intel platform is concerned, entering protected mode is one of the most important phases. At this point, the hardware begins to build a virtual address space for the operating system.
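To see why the four .word values in each provisional GDT entry describe a 4GB segment based at 0, the following C sketch unpacks a descriptor given as four 16-bit words in listing order. The field layout is the standard x86 segment descriptor format; the struct and function names are hypothetical and are not taken from the kernel.

```c
#include <stdint.h>

/* Hypothetical unpacked view of a segment descriptor. */
struct seg_desc {
    uint32_t base;    /* 32-bit segment base */
    uint32_t limit;   /* 20-bit limit */
    uint8_t  access;  /* access byte, e.g. 0x9A code, 0x92 data */
    uint8_t  flags;   /* high nibble of byte 6: G, D/B, L, AVL */
};

/* w[0..3] are the four .word values in the order they appear. */
static struct seg_desc decode_descriptor(const uint16_t w[4])
{
    struct seg_desc d;

    d.limit  = (uint32_t)w[0] | ((uint32_t)(w[3] & 0x000Fu) << 16);
    d.base   = (uint32_t)w[1]
             | ((uint32_t)(w[2] & 0x00FFu) << 16)
             | ((uint32_t)(w[3] & 0xFF00u) << 16);
    d.access = (uint8_t)(w[2] >> 8);
    d.flags  = (uint8_t)((w[3] >> 4) & 0x0Fu);
    return d;
}

/* Segment size in bytes; with G set, the limit counts 4KB units. */
static uint64_t segment_size(const struct seg_desc *d)
{
    uint64_t units = (uint64_t)d->limit + 1;

    return (d->flags & 0x8u) ? units * 4096u : units;
}
```

For the code descriptor (0xFFFF, 0, 0x9A00, 0x00CF), the limit decodes to 0xFFFFF with the granularity bit set, so the size is 0x100000 * 0x1000 bytes, which is the 4GB noted in the listing's comments.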
----------------------------------------------------------------------
arch/i386/boot/setup.S
830     movw $1, %ax             # protected mode (PE) bit
831     lmsw %ax                 # This is it!
832     jmp flush_instr
833
834 flush_instr:
835     xorw %bx, %bx            # Flag to indicate a boot
836     xorl %esi, %esi          # Pointer to real-mode code
837     movw %cs, %si
838     subw $DELTA_INITSEG, %si
839     shll $4, %esi
----------------------------------------------------------------------

Lines 830-831
Set the PE bit in the Machine Status Word to enter protected mode. The jmp instruction begins executing in protected mode.

Lines 834-839
Save a 32-bit pointer to the real-mode code for decompressing and loading the kernel later on in startup_32().

Recall that in real addressing mode, code is executed by using 16-bit instructions. The current file is compiled using the .code16 assembler directive, which enforces this mode; this is also known as a 16-bit module in the Intel Programmer's Reference. To jump from a 16-bit module to a 32-bit module, the Intel architecture (and assembler magic) allows us to build a 32-bit instruction in a 16-bit module.

Build and execute the 32-bit jump:

----------------------------------------------------------------------
arch/i386/boot/setup.S
841 # jump to startup_32 in arch/i386/kernel/head.S
842 #
843 # NOTE: For high loaded big kernels we need a
844 # jmpi 0x100000,__BOOT_CS
845 #
846 # but we haven't yet reloaded the CS register, so the default size
847 # of the target offset still is 16 bit.
848 # However, using an operand prefix (0x66), the CPU will properly
849 # take our 48 bit far pointer. (INTeL 80386 Programmer's Reference
850 # Manual, Mixing 16-bit and 32-bit code, page 16-6)
851
852     .byte 0x66, 0xea         # prefix + jmpi-opcode
853 code32: .long 0x1000         # will be set to 0x100000
854                              # for big kernels
855     .word __BOOT_CS
----------------------------------------------------------------------

Line 852
This line builds the 32-bit jump instruction.
After this jump is executed, the system uses the provisional GDT, and the code is executing in 32-bit protected mode, starting at the label startup_32 in arch/i386/kernel/head.S, line 57.

8.3.2.1. Protected Mode

Until this point, the discussion has been about how to get the Intel system ready to set up paging. As we trace through the code in head.S, we see what initialization needs to take place and how Linux uses the x86-based protected mode paging system. This is the final code before the kernel is started in main.c. For complete information on the many possible modes and settings that relate to memory initialization and Intel processors, look at the Intel Architecture Software Developer's Manual, Volume 3.

----------------------------------------------------------------------
arch/i386/kernel/head.S
057 ENTRY(startup_32)
058
059 /*
060  * Set segments to known values.
061  */
062     cld
063     lgdt boot_gdt_descr - __PAGE_OFFSET
064     movl $(__BOOT_DS),%eax
065     movl %eax,%ds
066     movl %eax,%es
067     movl %eax,%fs
068     movl %eax,%gs
...
081 /*
082  * Initialize page tables. This creates a PDE and a set of page
083  * tables, which are located immediately beyond _end. The variable
084  * init_pg_tables_end is set up to point to the first "safe" location.
085  * Mappings are created both at virtual address 0 (identity mapping)
086  * and PAGE_OFFSET for up to _end+sizeof(page tables)+INIT_MAP_BEYOND_END.
087  *
088  * Warning: don't use %esi or the stack in this code. However, %esp
089  * can be used as a GPR if you really need it...
090  */
091 page_pde_offset = (__PAGE_OFFSET >> 20);
092
093     movl $(pg0 - __PAGE_OFFSET), %edi
094     movl $(swapper_pg_dir - __PAGE_OFFSET), %edx
095     movl $0x007, %eax /* 0x007 = PRESENT+RW+USER */
096 10:
097     leal 0x007(%edi),%ecx /* Create PDE entry */
098     movl %ecx,(%edx) /* Store identity PDE entry */
099     movl %ecx,page_pde_offset(%edx) /* Store kernel PDE entry */
100     addl $4,%edx
101     movl $1024, %ecx
102 11:
103     stosl
104     addl $0x1000,%eax
105     loop 11b
106 /* End condition: we must map up to and including INIT_MAP_BEYOND_END */
107 /* bytes beyond the end of our own page tables; the +0x007 is the attribute bits */
108     leal (INIT_MAP_BEYOND_END+0x007)(%edi),%ebp
109     cmpl %ebp,%eax
110     jb 10b
111     movl %edi,(init_pg_tables_end - __PAGE_OFFSET)
112
113 #ifdef CONFIG_SMP
...
156 3:
157 #endif /* CONFIG_SMP */
158
159 /*
160  * Enable paging
161  */
162     movl $swapper_pg_dir-__PAGE_OFFSET,%eax
163     movl %eax,%cr3 /* set the page table pointer.. */
164     movl %cr0,%eax
165     orl $0x80000000,%eax
166     movl %eax,%cr0 /* ..and set paging (PG) bit */
167     ljmp $__BOOT_CS,$1f /* Clear prefetch and normalize %eip */
168 1:
169 /* Set up the stack pointer */
170     lss stack_start,%esp
...
177     pushl $0
178     popfl
179
180 #ifdef CONFIG_SMP
181     andl %ebx,%ebx
182     jz 1f /* Initial CPU cleans BSS */
183     jmp checkCPUtype
184 1:
185 #endif /* CONFIG_SMP */
186
187 /*
188  * start system 32-bit setup. We need to re-do some of the things done
189  * in 16-bit mode for the "real" operations.
190  */
191     call setup_idt
192
193 /*
194  * Copy bootup parameters out of the way.
195  * Note: %esi still has the pointer to the real-mode data.
196  */
197     movl $boot_params,%edi
198     movl $(PARAM_SIZE/4),%ecx
199     cld
200     rep
201     movsl
202     movl boot_params+NEW_CL_POINTER,%esi
203     andl %esi,%esi
204     jnz 2f # New command line protocol
205     cmpw $(OLD_CL_MAGIC),OLD_CL_MAGIC_ADDR
206     jne 1f
207     movzwl OLD_CL_OFFSET,%esi
208     addl $(OLD_CL_BASE_ADDR),%esi
209 2:
210     movl $saved_command_line,%edi
211     movl $(COMMAND_LINE_SIZE/4),%ecx
212     rep
213     movsl
214 1:
215 checkCPUtype:
...
279     lgdt cpu_gdt_descr
280     lidt idt_descr
...
303     call start_kernel
----------------------------------------------------------------------

Line 57
This line is the 32-bit protected mode entry point for the kernel code. Currently, the code uses the provisional GDT.

Line 63
This code initializes the GDTR with the base address of the boot GDT. This boot GDT is the same as the provisional GDT used in setup.S (4GB code and data starting at address 0x00000000) and is used only by this boot code.

Lines 64-68
Initialize the remaining segment registers with __BOOT_DS, which resolves to 24 (see /include/asm-i386/segment.h). Because each descriptor is 8 bytes long, this selector value refers to the descriptor at byte offset 24, that is, index 3 (starting at 0) in the GDT; the final GDT is set later in this code.

Lines 91-111
Create a page directory entry (PDE) in swapper_pg_dir that references a page table (pg0) with 0-based (identity) entries and duplicate PAGE_OFFSET (kernel memory) entries.

Lines 113-157
This code block initializes secondary (non-boot) processors to the page tables. For this discussion, we focus on the boot processor.

Lines 162-164
The cr3 register is the entry point for x86 hardware paging. This register is initialized to point to the base of the Page Directory, which, in this case, is swapper_pg_dir.

Lines 165-168
Set the PG (paging) bit in cr0 of the boot processor. The PG bit enables the paging mechanism in the x86 architecture. The jump instruction (on line 167) is recommended when changing the PG bit to ensure that all instructions within the processor are serialized at the moment of entering or exiting paging mode.
Line 170
Initialize the stack to the start of the data segment (see also lines 401-403).

Lines 177-178
The eflags register is a read/write system register that contains the status of interrupts, modes, and permissions. This register is cleared by pushing a 0 onto the stack and directly popping it into the register with the popfl instruction.

Lines 180-185
The general-purpose register ebx is used as a flag that indicates whether the processor running this code is the boot processor. Because we are tracing this code as the boot processor, ebx has been cleared (0), and we jump to the call to setup_idt.

Line 191
The routine setup_idt initializes an Interrupt Descriptor Table (IDT) where each entry points to a dummy handler. The IDT, discussed in Chapter 7, "Scheduling and Kernel Synchronization," is a table of functions (or handlers) that are called when the processor needs to immediately execute time-critical code.

Lines 197-214
The user can pass certain parameters to Linux at boot time. They are stored here for later use.

Lines 215-303
The code listed on these lines does a large amount of necessary (but tedious) x86 processor-version checking and some minor initialization. By way of the cpuid instruction (or lack thereof), certain bits are set in the eflags register and cr0. One notable setting in cr0 is bit 4, the extension type (ET). This bit indicates the support of math-coprocessor instructions in older x86 processors. The most important lines of code in this block are lines 279-280. This is where the IDT and the GDT are loaded (by way of the lidt and lgdt instructions) into the idtr and gdtr registers. Finally, on line 303, we jump to the routine start_kernel().

With the code in head.S, the system can now map a logical address to a linear address and finally to a physical address (see Figure 8.10). Starting with a logical address, the selector (in the CS, DS, ES, etc., registers) references one of the descriptors in the GDT. The offset is the flat address that we seek.
The information from the descriptor and the offset are combined to form the linear address.

Figure 8.10. Boot-Time Paging

In the code walkthrough, we saw how the Page Directory (swapper_pg_dir) and Page Table (pg0) were created and that cr3 was initialized to point to the Page Directory. As previously discussed, the processor learns where to look for the paging components from the setting of cr3, and setting the PG bit in cr0 tells the processor to start using them. In the linear address, bits 22:31 indicate the Page Directory Entry (PDE), bits 12:21 indicate the Page Table Entry (PTE), and bits 0:11 indicate the offset into the physical page (in this example, 4KB in size). The system now has 8MB of memory mapped out using a provisional paging system. The next step is to call the function start_kernel() in init/main.c.

8.3.3. PowerPC and x86 Code Convergence

Notice that both the PowerPC code and the x86 code have now converged on start_kernel() in init/main.c. This routine, which is located in the architecture-independent section of the code, calls architecture-specific routines to finish memory initialization. The first function called in this file is setup_arch() in arch/i386/kernel/setup.c, which then calls paging_init() in arch/i386/mm/init.c, which then calls pagetable_init() in the same file. The remainder of system memory is allocated here to produce the final page tables. In the PowerPC world, much has already been done. The setup_arch() function in arch/ppc/kernel/setup.c then calls paging_init() in arch/ppc/mm/init.c. The one notable function performed in paging_init() for PPC is to set all pages to be in the DMA zone.
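To make the x86 boot-time paging of Section 8.3.2.1 more concrete, the following C sketch builds a provisional page directory and two page tables in ordinary arrays and then walks them the way the hardware would. It is a simulation under stated assumptions: two tables are used to match the 8MB mentioned in the text, TABLE_BASE is a made-up physical address for the tables, PAGE_OFFSET is taken to be 0xC0000000, and every name here is hypothetical rather than taken from head.S.

```c
#include <stdint.h>

#define PAGE_OFFSET  0xC0000000u  /* assumed kernel virtual base */
#define PAGE_ATTRS   0x007u       /* PRESENT+RW+USER, as in head.S */
#define ENTRIES      1024u        /* entries per directory or table */
#define BOOT_TABLES  2u           /* 2 tables x 4MB = 8MB (assumed) */
#define TABLE_BASE   0x00400000u  /* hypothetical physical address of tables */

static uint32_t page_dir[ENTRIES];
static uint32_t page_tables[BOOT_TABLES][ENTRIES];

/* Fill the tables with identity mappings and point two directory
 * slots at each table: one at index t and one at PAGE_OFFSET.
 * (page_pde_offset in head.S is this index times 4 bytes.) */
static void build_boot_tables(void)
{
    uint32_t phys = 0;
    uint32_t t, i;

    for (t = 0; t < BOOT_TABLES; t++) {
        uint32_t pde = (TABLE_BASE + t * 0x1000u) | PAGE_ATTRS;

        page_dir[t] = pde;                         /* identity PDE */
        page_dir[(PAGE_OFFSET >> 22) + t] = pde;   /* kernel PDE */
        for (i = 0; i < ENTRIES; i++) {
            page_tables[t][i] = phys | PAGE_ATTRS;
            phys += 0x1000u;
        }
    }
}

/* Walk the tables: bits 22:31 index the directory, bits 12:21 index
 * the table, bits 0:11 are the offset. Returns 0 on success, -1 if
 * the address is not mapped. */
static int translate(uint32_t linear, uint32_t *phys)
{
    uint32_t pde = page_dir[linear >> 22];
    uint32_t pte;

    if (!(pde & 1u))
        return -1;
    pte = page_tables[((pde & 0xFFFFF000u) - TABLE_BASE) >> 12]
                     [(linear >> 12) & 0x3FFu];
    if (!(pte & 1u))
        return -1;
    *phys = (pte & 0xFFFFF000u) | (linear & 0xFFFu);
    return 0;
}
```

After build_boot_tables(), both 0x00100000 and 0xC0100000 translate to physical address 0x00100000, which is the double mapping (identity plus PAGE_OFFSET) that lets the code keep running while it switches from physical to kernel virtual addresses. Addresses at or beyond 8MB are left unmapped in this model.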