8.5. The Beginning: start_kernel()

This discussion begins with the jump to the start_kernel() (init/main.c) function, the first architecture-independent part of the code to be called.

With the jump to start_kernel(), we execute Process 0, which is otherwise known as the root thread. Process 0 spawns off Process 1, known as the init process. Process 0 then becomes the idle thread for the CPU. When /sbin/init is called, we have only those two processes running:

----------------------------------------------------------------------
init/main.c
396  asmlinkage void __init start_kernel(void)
397  {
398   char * command_line;
399   extern char saved_command_line[];
400   extern struct kernel_param __start___param[], __stop___param[];
...
405   lock_kernel();
406   page_address_init();
407   printk(linux_banner);
408   setup_arch(&command_line);
409   setup_per_cpu_areas();
...
415   smp_prepare_boot_cpu();
...
422   sched_init();
423
424   build_all_zonelists();
425   page_alloc_init();
426   printk("Kernel command line: %s\n", saved_command_line);
427   parse_args("Booting kernel", command_line, __start___param,
428     __stop___param - __start___param,
429     &unknown_bootoption);
430   sort_main_extable();
431   trap_init();
432   rcu_init();
433   init_IRQ();
434   pidhash_init();
435   init_timers();
436   softirq_init();
437   time_init();
...
444   console_init();
445   if (panic_later)
446    panic(panic_later, panic_param);
447   profile_init();
448   local_irq_enable();
449  #ifdef CONFIG_BLK_DEV_INITRD
450   if (initrd_start && !initrd_below_start_ok &&
451     initrd_start < min_low_pfn << PAGE_SHIFT) {
452    printk(KERN_CRIT "initrd overwritten (0x%08lx < 0x%08lx) - "
453     "disabling it.\n", initrd_start, min_low_pfn << PAGE_SHIFT);
454    initrd_start = 0;
455   }
456  #endif
457   mem_init();
458   kmem_cache_init();
459   if (late_time_init)
460    late_time_init();
461   calibrate_delay();
462   pidmap_init();
463   pgtable_cache_init();
464   prio_tree_init();
465   anon_vma_init();
466  #ifdef CONFIG_X86
467   if (efi_enabled)
468    efi_enter_virtual_mode();
469  #endif
470   fork_init(num_physpages);
471   proc_caches_init();
472   buffer_init();
473   unnamed_dev_init();
474   security_scaffolding_startup();
475   vfs_caches_init(num_physpages);
476   radix_tree_init();
477   signals_init();
478   /* rootfs populating might need page-writeback */
479   page_writeback_init();
480  #ifdef CONFIG_PROC_FS
481   proc_root_init();
482  #endif
483   check_bugs();
...
490   init_idle(current, smp_processor_id());
...
493   rest_init();
494  }
-----------------------------------------------------------------------

8.5.1. The Call to lock_kernel()

Line 405

In the 2.6 Linux kernel, the default configuration is to have a preemptible kernel. A preemptible kernel means that the kernel itself can be interrupted by a higher priority task, such as a hardware interrupt, and control is passed to the higher priority task. The kernel must save enough state so that it can return to executing when the higher priority task finishes.

Early versions of Linux implemented kernel preemption and SMP locking by using the Big Kernel Lock (BKL). Later versions of Linux correctly abstracted preemption into various calls, such as preempt_disable(). The BKL is still with us in the initialization process. It is a recursive spinlock that can be taken several times by a given CPU. A side effect of using the BKL is that it disables preemption, which is an important side effect during initialization.

Locking the kernel prevents it from being interrupted or preempted by any other task. Linux uses the BKL to do this. When the kernel is locked, no other process can execute. This is the antithesis of a preemptible kernel that can be interrupted at any point. In the 2.6 Linux kernel, we use the BKL to lock the kernel upon startup and initialize the various kernel objects without fear of being interrupted. The kernel is unlocked on line 493 within the rest_init() function. Thus, all of start_kernel() occurs with the kernels locked. Let's look at what happens in lock_kernel():

----------------------------------------------------------------------
include/linux/smp_lock.h
42 static inline void lock_kernel(void)
43 {
44   int depth = current->lock_depth+1;
45   if (likely(!depth))
46     get_kernel_lock();
47   current->lock_depth = depth;
48 }
-----------------------------------------------------------------------

Lines 44-48

The init task starts with a special lock_depth of -1, which means "kernel lock not held." In lock_kernel(), depth is current->lock_depth + 1, so depth is 0 on the first acquisition, and only then is the underlying spinlock actually taken. Nested calls see a depth greater than 0 and simply store the incremented count, which is what makes the BKL recursive: a CPU that already holds the lock does not deadlock trying to take it again, while other CPUs spin in get_kernel_lock(). A similar trick is used in unlock_kernel(), where we test (--current->lock_depth < 0) and release the spinlock only when the count drops back below zero. Let's see what happens in get_kernel_lock():

----------------------------------------------------------------------
include/linux/smp_lock.h
10 extern spinlock_t kernel_flag;
11
12 #define kernel_locked()   (current->lock_depth >= 0)
13
14 #define get_kernel_lock()  spin_lock(&kernel_flag)
15 #define put_kernel_lock()  spin_unlock(&kernel_flag)
...
59 #define lock_kernel()       do { } while(0)
60 #define unlock_kernel()       do { } while(0)
61 #define release_kernel_lock(task)    do { } while(0)
62 #define reacquire_kernel_lock(task)    do { } while(0)
63 #define kernel_locked()       1
-----------------------------------------------------------------------

Lines 10-15

These macros describe the big kernel locks that use standard spinlock routines. In multiprocessor systems, it is possible that two CPUs might try to access the same data structure. Spinlocks, which are explained in Chapter 7, prevent this kind of contention.

Lines 59-63

In the case where the kernel is not preemptible and not operating over multiple CPUs, we simply do nothing for lock_kernel() because nothing can interrupt us anyway.

The kernel has now seized the BKL and does not let go of it until it is released within rest_init() at the end of start_kernel(); as a result, none of the following initialization steps can be preempted.

8.5.2. The Call to page_address_init()

Line 406

The call to page_address_init() is the first step involved with the initialization of the memory subsystem in this architecture-independent portion of the code. The definition of page_address_init() varies according to three different compile-time parameter definitions. Two of them result in page_address_init() being stubbed out, with the body of the function defined as do { } while (0), as shown in the following code; the third is the operation we explore here in more detail. Let's look at the different definitions and discuss when each is enabled:

----------------------------------------------------------------------
include/linux/mm.h
376 #if defined(WANT_PAGE_VIRTUAL)
382 #define page_address_init() do { } while(0)
385 #if defined(HASHED_PAGE_VIRTUAL)
388 void page_address_init(void);
391 #if !defined(HASHED_PAGE_VIRTUAL) && !defined(WANT_PAGE_VIRTUAL)
394 #define page_address_init() do { } while(0)
----------------------------------------------------------------------

The #define for WANT_PAGE_VIRTUAL is set when the system has direct memory mapping, in which case simply calculating the virtual address of the memory location is sufficient to access it. In cases where not all of RAM is mapped into the kernel address space (as is often the case when highmem is configured), we need a more involved way to acquire the memory address. This is why the initialization of page addressing is defined only in the case where HASHED_PAGE_VIRTUAL is set.

We now look at the case where the kernel has been told to use HASHED_PAGE_VIRTUAL and where we need to initialize the virtual memory that the kernel is using. Keep in mind that this happens only if highmem has been configured; that is, the amount of RAM the kernel can access is larger than that mapped by the kernel address space (generally 4GB).

In the process of following the function definition, various kernel objects are introduced and revisited. Table 8.2 shows the kernel objects introduced during the process of exploring page_address_init().

Table 8.2. Objects Introduced During the Call to page_address_init()

Object Name           Description
page_address_map      Struct
page_address_slot     Struct
page_address_pool     Global variable
page_address_maps     Global variable
page_address_htable   Global variable


----------------------------------------------------------------------
mm/highmem.c
510 static struct page_address_slot {
511  struct list_head lh;
512  spinlock_t lock;
513 } ____cacheline_aligned_in_smp page_address_htable[1<<PA_HASH_ORDER];
...
591 static struct page_address_map page_address_maps[LAST_PKMAP];
592
593 void __init page_address_init(void)
594 {
595   int i;
596
597   INIT_LIST_HEAD(&page_address_pool);
598   for (i = 0; i < ARRAY_SIZE(page_address_maps); i++)
599     list_add(&page_address_maps[i].list, &page_address_pool);
600   for (i = 0; i < ARRAY_SIZE(page_address_htable); i++) {
601     INIT_LIST_HEAD(&page_address_htable[i].lh);
602     spin_lock_init(&page_address_htable[i].lock);
603   }
604   spin_lock_init(&pool_lock);
605 }
----------------------------------------------------------------------

Line 597

The main purpose of this line is to initialize the page_address_pool global variable, a list_head that heads the list of free entries drawn from the page_address_maps array (line 591). Figure 8.11 illustrates page_address_pool.

Figure 8.11. Data Structures Surrounding the Page Address Map Pool


Lines 598-599

We add each entry of page_address_maps to the doubly linked free list headed by page_address_pool. We describe the page_address_map structure in detail next.

Lines 600-603

We initialize each page address hash table's list_head and spinlock. The page_address_htable variable holds the list of entries that hash to the same bucket. Figure 8.12 illustrates the page address hash table.

Figure 8.12. Page Address Hash Table


Line 604

We initialize the page_address_pool's spinlock.

Let's look at the page_address_map structure to better understand the lists we just saw initialized. This structure's main purpose is to maintain the association between a page and its virtual address. Such bookkeeping would be wasteful if pages mapped linearly to their virtual addresses; it becomes necessary only when the addressing is hashed:

----------------------------------------------------------------------
mm/highmem.c
490 struct page_address_map {
491   struct page *page;
492   void *virtual;
493   struct list_head list;
494 };
-----------------------------------------------------------------------

As you can see, the object keeps a pointer to the page structure associated with this mapping, the page's virtual address, and a list_head struct that maintains its position in whichever doubly linked list (the free pool or a hash bucket) it currently belongs to.

8.5.3. The Call to printk(linux_banner)

Line 407

This call is responsible for the first console output made by the Linux kernel. This introduces the global variable linux_banner:

----------------------------------------------------------------------
init/version.c
31  const char *linux_banner =
32   "Linux version " UTS_RELEASE " (" LINUX_COMPILE_BY "@"
     LINUX_COMPILE_HOST ") (" LINUX_COMPILER ") " UTS_VERSION "\n";
-----------------------------------------------------------------------

The version.c file defines linux_banner as just shown. This string gives the user a reference for the Linux kernel release, the user and host that built it, the gcc version it was compiled with, and the build version string.

8.5.4. The Call to setup_arch

Line 408

The setup_arch() function in arch/i386/kernel/setup.c is marked with the __init attribute (refer to Chapter 2 for a description of __init), so it runs only once, at system initialization time. The setup_arch() function receives a pointer through which it returns any Linux command-line data entered at boot time, and it initializes many of the architecture-specific subsystems, such as memory, I/O, processors, and consoles:

----------------------------------------------------------------------
arch/i386/kernel/setup.c
1083  void __init setup_arch(char **cmdline_p)
1084  {
1085   unsigned long max_low_pfn;
1086
1087   memcpy(&boot_cpu_data, &new_cpu_data, sizeof(new_cpu_data));
1088   pre_setup_arch_hook();
1089   early_cpu_init();
1090
1091   /*
1092   * FIXME: This isn't an official loader_type right
1093   * now but does currently work with elilo.
1094   * If we were configured as an EFI kernel, check to make
1095   * sure that we were loaded correctly from elilo and that
1096   * the system table is valid. If not, then initialize normally.
1097   */
1098  #ifdef CONFIG_EFI
1099   if ((LOADER_TYPE == 0x50) && EFI_SYSTAB)
1100    efi_enabled = 1;
1101  #endif
1102
1103   ROOT_DEV = old_decode_dev(ORIG_ROOT_DEV);
1104   drive_info = DRIVE_INFO;
1105   screen_info = SCREEN_INFO;
1106   edid_info = EDID_INFO;
1107   apm_info.bios = APM_BIOS_INFO;
1108   ist_info = IST_INFO;
1109   saved_videomode = VIDEO_MODE;
1110   if( SYS_DESC_TABLE.length != 0 ) {
1111    MCA_bus = SYS_DESC_TABLE.table[3] & 0x2;
1112    machine_id = SYS_DESC_TABLE.table[0];
1113    machine_submodel_id = SYS_DESC_TABLE.table[1];
1114    BIOS_revision = SYS_DESC_TABLE.table[2];
1115   }
1116   aux_device_present = AUX_DEVICE_INFO;
1117
1118  #ifdef CONFIG_BLK_DEV_RAM
1119   rd_image_start = RAMDISK_FLAGS & RAMDISK_IMAGE_START_MASK;
1120   rd_prompt = ((RAMDISK_FLAGS & RAMDISK_PROMPT_FLAG) != 0);
1121   rd_doload = ((RAMDISK_FLAGS & RAMDISK_LOAD_FLAG) != 0);
1122  #endif
1123   ARCH_SETUP
1124   if (efi_enabled)
1125    efi_init();
1126   else
1127    setup_memory_region();
1128
1129   copy_edd();
1130
1131   if (!MOUNT_ROOT_RDONLY)
1132    root_mountflags &= ~MS_RDONLY;
1133   init_mm.start_code = (unsigned long) _text;
1134   init_mm.end_code = (unsigned long) _etext;
1135   init_mm.end_data = (unsigned long) _edata;
1136   init_mm.brk = init_pg_tables_end + PAGE_OFFSET;
1137
1138   code_resource.start = virt_to_phys(_text);
1139   code_resource.end = virt_to_phys(_etext)-1;
1140   data_resource.start = virt_to_phys(_etext);
1141   data_resource.end = virt_to_phys(_edata)-1;
1142
1143   parse_cmdline_early(cmdline_p);
1144
1145   max_low_pfn = setup_memory();
1146
1147   /*
1148   * NOTE: before this point _nobody_ is allowed to allocate
1149   * any memory using the bootmem allocator.
1150   */
1152  #ifdef CONFIG_SMP
1153   smp_alloc_memory(); /* AP processor realmode stacks in low memory*/
1154  #endif
1155   paging_init();
1156
1157  #ifdef CONFIG_EARLY_PRINTK
1158   {
1159    char *s = strstr(*cmdline_p, "earlyprintk=");
1160    if (s) {
1161     extern void setup_early_printk(char *);
1162
1163     setup_early_printk(s);
1164     printk("early console enabled\n");
1165    }
1166   }
1167  #endif
...
1170   dmi_scan_machine();
1171
1172  #ifdef CONFIG_X86_GENERICARCH
1173   generic_apic_probe(*cmdline_p);
1174  #endif
1175   if (efi_enabled)
1176    efi_map_memmap();
1177
1178   /*
1179   * Parse the ACPI tables for possible boot-time SMP configuration.
1180   */
1181   acpi_boot_init();
1182
1183  #ifdef CONFIG_X86_LOCAL_APIC
1184   if (smp_found_config)
1185    get_smp_config();
1186  #endif
1187
1188   register_memory(max_low_pfn);
1189
1190  #ifdef CONFIG_VT
1191  #if defined(CONFIG_VGA_CONSOLE)
1192   if (!efi_enabled || (efi_mem_type(0xa0000) != EFI_CONVENTIONAL_MEMORY))
1193    conswitchp = &vga_con;
1194  #elif defined(CONFIG_DUMMY_CONSOLE)
1195   conswitchp = &dummy_con;
1196  #endif
1197  #endif
1198  }
-----------------------------------------------------------------------

Line 1087

Copy new_cpu_data, the CPU information gathered early in boot, into boot_cpu_data, the cpuinfo_x86 structure describing the boot processor. This is similar for PPC.

Line 1088

Activate any machine-specific identification routines. This can be found in arch/xxx/machine-default/setup.c.

Line 1089

Identify the specific processor.

Lines 1103-1116

Get the system boot parameters.

Lines 1118-1122

Get RAM disk if set in arch/<arch>/defconfig.

Lines 1124-1127

Initialize Extensible Firmware Interface (if set in /defconfig) or just print out the BIOS memory map.

Line 1129

Save off the Enhanced Disk Drive parameters gathered at boot time.

Lines 1133-1141

Initialize memory-management structs from the BIOS-provided memory map.

Line 1143

Begin parsing out the Linux command line. (See arch/<arch>/kernel/setup.c.)

Line 1145

Initialize/reserve boot memory. (See arch/i386/kernel/setup.c.)

Lines 1153-1155

Get a page for SMP initialization or initialize paging beyond the 8M that's already initialized in head.S. (See arch/i386/mm/init.c.)

Lines 1157-1167

Get printk() running even though the console is not fully initialized.

Line 1170

This line is the Desktop Management Interface (DMI), which gathers information about the specific system-hardware configuration from BIOS. (See arch/i386/kernel/dmi_scan.c.)

Lines 1172-1174

If the configuration calls for it, look for the APIC given on the command line. (See arch/i386/machine-generic/probe.c.)

Lines 1175-1176

If using Extensible Firmware Interface, remap the EFI memory map. (See arch/i386/kernel/efi.c.)

Line 1181

Look for local and I/O APICs. (See arch/i386/kernel/acpi/boot.c.) Locate and checksum System Description Tables. (See drivers/acpi/tables.c.) For a better understanding of ACPI, go to the ACPI4LINUX project on the Web.

Lines 1183-1186

Scan for SMP configuration. (See arch/i386/kernel/mpparse.c.) This section can also use ACPI for configuration information.

Line 1188

Request I/O and memory space for standard resources. (See arch/i386/kernel/std_resources.c for an idea of how resources are registered.)

Lines 1190-1197

Set up the VGA console switch structure. (See drivers/video/console/vgacon.c.)

A similar but shorter version of setup_arch() can be found in arch/ppc/kernel/setup.c for the PowerPC. This function initializes a large part of the ppc_md structure. A call to pmac_feature_init() in arch/ppc/platforms/pmac_feature.c does an initial probe and initialization of the pmac hardware.

8.5.5. The Call to setup_per_cpu_areas()

Line 409

The routine setup_per_cpu_areas() exists for the setup of a multiprocessing environment. If the Linux kernel is compiled without SMP support, setup_per_cpu_areas() is stubbed out to do nothing, as follows:

----------------------------------------------------------------------
init/main.c
317  static inline void setup_per_cpu_areas(void) { }
-----------------------------------------------------------------------

If the Linux kernel is compiled with SMP support, setup_per_cpu_areas() is defined as follows:

----------------------------------------------------------------------
init/main.c
327 static void __init setup_per_cpu_areas(void)
328 {
329   unsigned long size, i;
330   char *ptr;
331   /* Created by linker magic */
332   extern char __per_cpu_start[], __per_cpu_end[];
333
334   /* Copy section for each CPU (we discard the original) */
335   size = ALIGN(__per_cpu_end - __per_cpu_start, SMP_CACHE_BYTES);
336 #ifdef CONFIG_MODULES
337   if (size < PERCPU_ENOUGH_ROOM)
338     size = PERCPU_ENOUGH_ROOM;
339 #endif
340
341   ptr = alloc_bootmem(size * NR_CPUS);
342
343   for (i = 0; i < NR_CPUS; i++, ptr += size) {
344     __per_cpu_offset[i] = ptr - __per_cpu_start;
345     memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
346   }
347 }
-----------------------------------------------------------------------

Lines 329-332

The variables for managing a consecutive block of memory are initialized. The "linker magic" variables are defined during linking in the appropriate architecture's kernel directory (for example, arch/i386/kernel/vmlinux.lds.S).

Lines 334-341

We determine the size of memory a single CPU requires and allocate that memory for each CPU in the system as a single contiguous block of memory.

Lines 343-346

We cycle through the newly allocated memory, initializing each CPU's chunk of memory. Conceptually, we have taken a chunk of data that's valid for a single CPU (__per_cpu_start to __per_cpu_end) and copied it for each CPU on the system. This way, each CPU has its own data with which to play.

8.5.6. The Call to smp_prepare_boot_cpu()

Line 415

Similar to setup_per_cpu_areas(), smp_prepare_boot_cpu() is stubbed out when the Linux kernel does not support SMP:

----------------------------------------------------------------------
include/linux/smp.h
106 #define smp_prepare_boot_cpu()     do {} while (0)
-----------------------------------------------------------------------

However, if the Linux kernel is compiled with SMP support, we need to allow the booting CPU to access its console drivers and the per-CPU storage that we just initialized. Marking CPU bitmasks achieves this.

A CPU bitmask is defined as follows:

----------------------------------------------------------------------
include/asm-generic/cpumask.h
10 #if NR_CPUS > BITS_PER_LONG && NR_CPUS != 1
11 #define CPU_ARRAY_SIZE   BITS_TO_LONGS(NR_CPUS)
12
13 struct cpumask
14 {
15   unsigned long mask[CPU_ARRAY_SIZE];
16 };
-----------------------------------------------------------------------

This means that we have a platform-independent bitmask that contains the same number of bits as the system has CPUs.

smp_prepare_boot_cpu() is implemented in the architecture-dependent section of the Linux kernel but, as we soon see, it is the same for i386 and PPC systems:

----------------------------------------------------------------------
arch/i386/kernel/smpboot.c
66 /* bitmap of online cpus */
67 cpumask_t cpu_online_map;
...
70 cpumask_t cpu_callout_map;
...
1341 void __devinit smp_prepare_boot_cpu(void)
1342 {
1343   cpu_set(smp_processor_id(), cpu_online_map);
1344   cpu_set(smp_processor_id(), cpu_callout_map);
1345 }
-----------------------------------------------------------------------

----------------------------------------------------------------------
arch/ppc/kernel/smp.c
49 cpumask_t cpu_online_map;
50 cpumask_t cpu_possible_map;
...
331 void __devinit smp_prepare_boot_cpu(void)
332 {
333   cpu_set(smp_processor_id(), cpu_online_map);
334   cpu_set(smp_processor_id(), cpu_possible_map);
335 }
-----------------------------------------------------------------------

In both these functions, cpu_set() simply sets the bit smp_processor_id() in the cpumask_t bitmap. Setting a bit implies that the value of the set bit is 1.

8.5.7. The Call to sched_init()

Line 422

The call to sched_init() marks the initialization of all objects that the scheduler manipulates to manage the assignment of CPU time among the system's processes. Keep in mind that, at this point, only one process exists: the init task, which is currently executing sched_init():

----------------------------------------------------------------------
kernel/sched.c
3896 void __init sched_init(void)
3897 {
3898   runqueue_t *rq;
3899   int i, j, k;
3900
...
3919   for (i = 0; i < NR_CPUS; i++) {
3920     prio_array_t *array;
3921
3922     rq = cpu_rq(i);
3923     spin_lock_init(&rq->lock);
3924     rq->active = rq->arrays;
3925     rq->expired = rq->arrays + 1;
3926     rq->best_expired_prio = MAX_PRIO;
...
3938     for (j = 0; j < 2; j++) {
3939       array = rq->arrays + j;
3940       for (k = 0; k < MAX_PRIO; k++) {
3941         INIT_LIST_HEAD(array->queue + k);
3942         __clear_bit(k, array->bitmap);
3943       }
3944       // delimiter for bitsearch
3945       __set_bit(MAX_PRIO, array->bitmap);
3946     }
3947   }
3948   /*
3949   * We have to do a little magic to get the first
3950   * thread right in SMP mode.
3951   */
3952   rq = this_rq();
3953   rq->curr = current;
3954   rq->idle = current;
3955   set_task_cpu(current, smp_processor_id());
3956   wake_up_forked_process(current);
3957
3958   /*
3959   * The boot idle thread does lazy MMU switching as well:
3960   */
3961   atomic_inc(&init_mm.mm_count);
3962   enter_lazy_tlb(&init_mm, current);
3963 }
-----------------------------------------------------------------------

Lines 3919-3926

Each CPU's run queue is initialized: The active queue, expired queue, and spinlock are all initialized in this segment. Recall from Chapter 7 that spin_lock_init() sets the spinlock to 1, which indicates that the data object is unlocked.

Figure 8.13 illustrates the initialized run queue.

Figure 8.13. Initialized Run Queue rq


Lines 3938-3947

For each of the two priority arrays, we initialize the list associated with every priority and clear all bits in the bitmap to show that no process is on that queue; line 3945 then permanently sets the bit at MAX_PRIO as a delimiter so that bitmap searches always terminate. (If all this is confusing, refer to Figure 8.14. Also, see Chapter 7 for an overview of how the scheduler manages its run queues.) This code chunk just ensures that everything is ready for the introduction of a process. As of line 3947, the scheduler knows that no processes exist; it ignores the current and idle processes for now.

Figure 8.14. rq->arrays


Lines 3952-3956

We add the current process to the current CPU's run queue and call wake_up_forked_process() on ourselves to initialize current into the scheduler. Now, the scheduler knows that exactly one process exists: the init process.

Lines 3961-3962

When lazy MMU switching is enabled, it allows a multiprocessor Linux system to perform context switches at a faster rate. The TLB (translation lookaside buffer) caches recent page translation addresses; flushing it is expensive, so we avoid the flush when possible. enter_lazy_tlb() notes that the mm_struct init_mm is not being used across multiple CPUs and can be lazily switched. On a uniprocessor system, this becomes a NULL function.

The sections that were omitted in the previous code deal with initialization of SMP machines. As a quick overview, those sections bootstrap each CPU to the default settings necessary to allow for load balancing, group scheduling, and thread migration. They are omitted here for clarity and brevity.

8.5.8. The Call to build_all_zonelists()

Line 424

The build_all_zonelists() function splits up the memory according to the zone types ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM. As mentioned in Chapter 6, "Filesystems," zones are linear separations of physical memory that are used mainly to address hardware limitations. Suffice it to say that this is the function where these memory zones are built. After the zones are built, pages are stored in page frames that fall within zones.

The call to build_all_zonelists() introduces numnodes and NODE_DATA. The global variable numnodes holds the number of nodes (or partitions) of physical memory.

The partitions are determined according to CPU access time. Note that, at this point, the page tables have already been fully set up:

----------------------------------------------------------------------
mm/page_alloc.c
1345  void __init build_all_zonelists(void)
1346  {
1347   int i;
1348
1349   for(i = 0 ; i < numnodes ; i++)
1350    build_zonelists(NODE_DATA(i));
1351   printk("Built %i zonelists\n", numnodes);
1352  }
----------------------------------------------------------------------

build_all_zonelists() calls build_zonelists() once for each node and finishes by printing out the number of zonelists created. This book does not go into more detail regarding nodes. Suffice it to say that, in our one-CPU example, numnodes is 1, and each node can have all three types of zones. The NODE_DATA macro returns the node's descriptor from the node descriptor list.

8.5.9. The Call to page_alloc_init

Line 425

The function page_alloc_init() simply registers a function in a notifier chain.[6] The function-registered page_alloc_cpu_notify() is a page-draining function[7] associated with dynamic CPU configuration.

[6] Chapter 2 discusses notifier chains.

[7] Page draining refers to removing pages that are in use by a CPU that will no longer be used.

Dynamic CPU configuration refers to bringing up and down CPUs during the running of the Linux system, an event referred to as "hotplugging the CPU." Although technically, CPUs are not physically inserted and removed during machine operation, they can be turned on and off in some systems, such as the IBM p-Series 690. Let's look at the function:

----------------------------------------------------------------------
mm/page_alloc.c
1787  #ifdef CONFIG_HOTPLUG_CPU
1788  static int page_alloc_cpu_notify(struct notifier_block *self,
1789      unsigned long action, void *hcpu)
1790  {
1791   int cpu = (unsigned long)hcpu;
1792   long *count;
1793   if (action == CPU_DEAD) {
...
1796    count = &per_cpu(nr_pagecache_local, cpu);
1797    atomic_add(*count, &nr_pagecache);
1798    *count = 0;
1799    local_irq_disable();
1800    __drain_pages(cpu);
1801    local_irq_enable();
1802   }
1803   return NOTIFY_OK;
1804  }
1805  #endif /* CONFIG_HOTPLUG_CPU */
1806
1807  void __init page_alloc_init(void)
1808  {
1809   hotcpu_notifier(page_alloc_cpu_notify, 0);
1810  }
-----------------------------------------------------------------------

Line 1809

This line registers the page_alloc_cpu_notify() routine in the hotplug-CPU notifier chain. The hotcpu_notifier() routine creates a notifier_block that points to the page_alloc_cpu_notify() function and, with a priority of 0, registers the object in the cpu_chain notifier chain (kernel/cpu.c).

Line 1788

page_alloc_cpu_notify() has the parameters that correspond to a notifier call, as Chapter 2 explained. The system-specific pointer points to an integer that specifies the CPU number.

Lines 1794-1802

If the CPU is dead, free up its pages: the CPU's local page-cache count is folded into the global count, and its per-CPU page lists are drained. The variable action is set to CPU_DEAD when a CPU is brought down. (See __drain_pages() in this same file.)

8.5.10. The Call to parse_args()

Line 427

The parse_args() function parses the arguments passed to the Linux kernel.

For example, nfsroot is a kernel parameter that sets the NFS root filesystem for systems without disks. You can find a complete list of kernel parameters in Documentation/kernel-parameters.txt:

----------------------------------------------------------------------
kernel/params.c
116 int parse_args(const char *name,
117    char *args,
118    struct kernel_param *params,
119    unsigned num,
120    int (*unknown)(char *param, char *val))
121 {
122   char *param, *val;
123
124   DEBUGP("Parsing ARGS: %s\n", args);
125
126   while (*args) {
127     int ret;
128
129     args = next_arg(args, &param, &val);
130     ret = parse_one(param, val, params, num, unknown);
131     switch (ret) {
132     case -ENOENT:
133       printk(KERN_ERR "%s: Unknown parameter '%s'\n",
134        name, param);
135       return ret;
136     case -ENOSPC:
137       printk(KERN_ERR
138        "%s: '%s' too large for parameter '%s'\n",
139        name, val ?: "", param);
140       return ret;
141     case 0:
142       break;
143     default:
144       printk(KERN_ERR
145        "%s: '%s' invalid for parameter '%s'\n",
146        name, val ?: "", param);
147       return ret;
148     }
149   }
150
151   /* All parsed OK. */
152   return 0;
153 }
-----------------------------------------------------------------------

Lines 116-125

The parameters passed to parse_args() are the following:

  • name. A character string to be displayed if any errors occur while the kernel attempts to parse the kernel parameter arguments. In standard operation, this means that an error message, "Booting Kernel: Unknown parameter X," is displayed.

  • args. The kernel parameter list of form foo=bar,bar2 baz=fuz wix.

  • params. Points to the kernel parameter structure that holds all the valid parameters for the specific kernel. Depending on how a kernel was compiled, some parameters might exist and others might not.

  • num. The number of kernel parameters in this specific kernel, not the number of arguments in args.

  • unknown. Points to a function to call if a kernel parameter is specified that is not recognized.

Lines 126–153

We loop through the string args, setting param to point to the first parameter and val to its value (val can be NULL). This is done via next_arg(). For example, on the first call to next_arg() with args being foo=bar,bar2 baz=fuz wix, param is set to foo and val to bar,bar2. The space after bar2 is overwritten with a \0 and args is set to point at the first character of baz.

We pass our pointers param and val into parse_one(), which does the work of setting the actual kernel parameter data structures:

 ---------------------------------------------------------------------- kernel/params.c 46 static int parse_one(char *param, 47      char *val, 48      struct kernel_param *params, 49      unsigned num_params, 50      int (*handle_unknown)(char *param, char *val)) 51 { 52   unsigned int i; 53  54   /* Find parameter */ 55   for (i = 0; i < num_params; i++) { 56     if (parameq(param, params[i].name)) { 57       DEBUGP("They are equal! Calling %p\n", 58        params[i].set); 59       return params[i].set(val, &params[i]); 60     } 61   } 62  63   if (handle_unknown) { 64     DEBUGP("Unknown argument: calling %p\n", handle_unknown); 65     return handle_unknown(param, val); 66   } 67  68   DEBUGP("Unknown argument '%s'\n", param); 69   return -ENOENT; 70 } ----------------------------------------------------------------------- 

Lines 46–54

These parameters are the same as those described under parse_args() with param and val pointing to a subsection of args.

Lines 55–61

We loop through the defined kernel parameters to see if any match param. If we find a match, we use val to call the associated set function. Thus, the set function handles multiple, or null, arguments.

Lines 62–66

If the kernel parameter was not found, we call the handle_unknown() function that was passed in via parse_args().

After parse_one() is called for each parameter-value combination specified in args, we have set the kernel parameters and are ready to continue starting the Linux kernel.

8.5.11. The Call to trap_init()

Line 431

In Chapter 3, we introduced exceptions and interrupts. The function trap_init() is specific to the handling of interrupts in the x86 architecture. Briefly, this function initializes a table referenced by the x86 hardware. Each element in the table has a function to handle kernel or user-related issues, such as an invalid instruction or a reference to a page not currently in memory. Although the PowerPC can have these same issues, its architecture handles them in a somewhat different manner. (Again, all this is discussed in Chapter 3.)

8.5.12. The Call to rcu_init()

Line 432

The rcu_init() function initializes the Read-Copy-Update (RCU) subsystem of the Linux 2.6 kernel. RCU controls access to critical sections of code and enforces mutual exclusion in systems where the cost of acquiring locks becomes significant in comparison to the chip speed. The Linux implementation of RCU is beyond the scope of this book. We occasionally mention calls to the RCU subsystem in our code analysis, but the specifics are left out. For more information on the Linux RCU subsystem, consult the Linux Scalability Effort pages at http://lse.sourceforge.net/locking/rcupdate.html:

 ---------------------------------------------------------------------- kernel/rcupdate.c 297 void __init rcu_init(void) 298 { 299   rcu_cpu_notify(&rcu_nb, CPU_UP_PREPARE, 300       (void *)(long)smp_processor_id()); 301   /* Register notifier for non-boot CPUs */ 302   register_cpu_notifier(&rcu_nb); 303 } ----------------------------------------------------------------------- 

8.5.13. The Call to init_IRQ()

Line 433

The function init_IRQ() in arch/i386/kernel/i8259.c initializes the hardware interrupt controller, the interrupt vector table and, if on x86, the system timer. Chapter 3 includes a thorough discussion of interrupts for both x86 and PPC, where the Real-Time Clock is used as an interrupt example:

 ---------------------------------------------------------------------- arch/i386/kernel/i8259.c 410 void __init init_IRQ(void) 411 { 412  int i; ... 422  for (i = 0; i < (NR_VECTORS - FIRST_EXTERNAL_VECTOR); i++) { 423   int vector = FIRST_EXTERNAL_VECTOR + i; 424   if (i >= NR_IRQS) 425    break; ... 430   if (vector != SYSCALL_VECTOR)  431    set_intr_gate(vector, interrupt[i]); 432  } ... 437  intr_init_hook(); ... 443  setup_timer(); ... 449  if (boot_cpu_data.hard_math && !cpu_has_fpu) 450   setup_irq(FPU_IRQ, &fpu_irq); 451 } ----------------------------------------------------------------------- 

Lines 422–432

Initialize the interrupt vectors. This associates the x86 (hardware) IRQs with the appropriate handling code.

Line 437

Set up machine-specific IRQs, such as the Advanced Programmable Interrupt Controller (APIC).

Line 443

Initialize the timer clock.

Lines 449–450

Set up for FPU if needed.

The following code is the PPC implementation of init_IRQ():

 ---------------------------------------------------------------------- arch/ppc/kernel/irq.c 700  void __init init_IRQ(void) 701  { 702   int i; 703 704   for (i = 0; i < NR_IRQS; ++i) 705    irq_affinity[i] = DEFAULT_CPU_AFFINITY; 706 707   ppc_md.init_IRQ(); 708  } ----------------------------------------------------------------------- 

Line 704

In multiprocessor systems, an interrupt can have an affinity for a specific processor.

Line 707

For a PowerMac platform, this routine is found in arch/ppc/platforms/pmac_pic.c. It sets up the Programmable Interrupt Controller (PIC) portion of the I/O controller.

8.5.14. The Call to softirq_init()

Line 436

The softirq_init() function prepares the boot CPU to accept notifications from tasklets. Let's look at the internals of softirq_init():

 ---------------------------------------------------------------------- kernel/softirq.c 317 void __init softirq_init(void) 318 { 319   open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL); 320   open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL); 321 } ... 327 void __init softirq_init(void) 328 { 329  open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL); 330  open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL); 331 tasklet_cpu_notify(&tasklet_nb, (unsigned long)CPU_UP_PREPARE, 332         (void *)(long)smp_processor_id()); 333 register_cpu_notifier(&tasklet_nb); 334 } ----------------------------------------------------------------------- 

Lines 319–320

We initialize the actions to take when we get a TASKLET_SOFTIRQ or HI_SOFTIRQ interrupt. By passing NULL, we store NULL as the data argument associated with tasklet_action() and tasklet_hi_action() (in the cases of Line 319 and Line 320, respectively). The following implementation of open_softirq() shows how the Linux kernel stores the tasklet initialization information:

 ---------------------------------------------------------------------- kernel/softirq.c 177 void open_softirq(int nr, void (*action)(struct softirq_action*), void * data) 178 { 179   softirq_vec[nr].data = data; 180   softirq_vec[nr].action = action; 181 } ---------------------------------------------------------------------- 
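The softirq_vec[] registration shown above can be modeled in a few lines of user-space C; the vector size, indices, and dispatcher below are simplified stand-ins for the kernel's:

```c
/* User-space sketch of the softirq_vec[] pattern: open_softirq() records an
 * action and its data in a vector; a dispatcher later invokes the action.
 * NR_SOFTIRQS and the indices are simplified, not the kernel's values. */
#include <stddef.h>

#define TASKLET_SOFTIRQ 0
#define HI_SOFTIRQ      1
#define NR_SOFTIRQS     2

struct softirq_action {
    void (*action)(struct softirq_action *);
    void *data;
};

static struct softirq_action softirq_vec[NR_SOFTIRQS];

/* Same body as the kernel's open_softirq() shown above. */
void open_softirq_sim(int nr, void (*action)(struct softirq_action *), void *data)
{
    softirq_vec[nr].data = data;
    softirq_vec[nr].action = action;
}

static int fired = 0;
static void tasklet_action_sim(struct softirq_action *a) { (void)a; fired++; }

/* Dispatcher: run the registered handler for a raised softirq, if any. */
void raise_and_run(int nr)
{
    if (softirq_vec[nr].action)
        softirq_vec[nr].action(&softirq_vec[nr]);
}
```

The key design point carried over from the kernel is that registration and dispatch are decoupled: open_softirq() only fills in the vector slot, and nothing runs until the softirq is actually raised.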

8.5.15. The Call to time_init()

Line 437

The function time_init() selects and initializes the system timer. This function, like trap_init(), is very architecture dependent; Chapter 3 covered this when we explored timer interrupts. The system timer gives Linux its temporal view of the world, which allows it to schedule when a task should run and for how long. The High Performance Event Timer (HPET) from Intel is the successor to the 8254 PIT and RTC hardware. The HPET uses memory-mapped I/O, which means that the HPET control registers are accessed as if they were memory locations, so memory must be configured properly before the I/O regions can be accessed. If CONFIG_HPET_TIMER is set (see arch/i386/defconfig), time_init() needs to be delayed until after mem_init() has set up memory regions. See the following code:

 ---------------------------------------------------------------------- arch/i386/kernel/time.c 376 void __init time_init(void) 377 { ... 378 #ifdef CONFIG_HPET_TIMER 379  if (is_hpet_capable()) { 380   late_time_init = hpet_time_init; 381   return; 382  } ... 387 #endif 388  xtime.tv_sec = get_cmos_time(); 389  wall_to_monotonic.tv_sec = -xtime.tv_sec; 390  xtime.tv_nsec = (INITIAL_JIFFIES % HZ) * (NSEC_PER_SEC / HZ); 391  wall_to_monotonic.tv_nsec = -xtime.tv_nsec; 392 393  cur_timer = select_timer(); 394  printk(KERN_INFO "Using %s for high-res timesource\n",cur_timer->name); 395 396  time_init_hook(); 397 }   -----------------------------------------------------------------------  

Lines 379–387

If the HPET is configured, time_init() must run after memory has been initialized. The code for late_time_init() (on lines 358–373) is the same as time_init().

Lines 388–391

Initialize the xtime time structure used for holding the time of day.

Line 393

Select the first timer that initializes. This can be overridden. (See arch/i386/kernel/timers/timer.c.)

8.5.16. The Call to console_init()

Line 444

A computer console is a device where the kernel (and other parts of a system) output messages. It also has login capabilities. Depending on the system, the console can be on the monitor or through a serial port. The function console_init() is an early call to initialize the console device, which allows for boot-time reporting of status:

 ---------------------------------------------------------------------- drivers/char/tty_io.c 2347 void __init console_init(void) 2348 { 2349  initcall_t *call; ... 2352  (void) tty_register_ldisc(N_TTY, &tty_ldisc_N_TTY); ... 2358 #ifdef CONFIG_EARLY_PRINTK 2359   disable_early_printk(); 2360 #endif ...   2366  call = &__con_initcall_start; 2367  while (call < &__con_initcall_end) { 2368   (*call)(); 2369   call++; 2370  } 2371 }   ----------------------------------------------------------------------- 

Line 2352

Set up the line discipline.

Line 2359

Keep the early printk support if desired. Early printk support allows the system to report status during the boot process before the system console is fully initialized. It specifically initializes a serial port (ttyS0, for example) or the system's VGA to a minimum functionality. Early printk support is started in setup_arch(). (For more information, see the code discussion on line 408 in this section and the files kernel/printk.c and arch/i386/kernel/early_printk.c.)

Line 2366

Initialize the console.

8.5.17. The Call to profile_init()

Line 447

profile_init() allocates memory for the kernel to store profiling data in. Profiling is the term used in computer science to describe data collection during program execution. Profiling data is used to analyze performance and otherwise study the program being executed (in our case, the Linux kernel itself):

 ---------------------------------------------------------------------- kernel/profile.c 30 void __init profile_init(void) 31 { 32   unsigned int size; 33  34   if (!prof_on) 35     return; 36  37   /* only text is profiled */ 38   prof_len = _etext - _stext; 39   prof_len >>= prof_shift; 40  41   size = prof_len * sizeof(unsigned int) + PAGE_SIZE - 1; 42   prof_buffer = (unsigned int *) alloc_bootmem(size); 43 } ----------------------------------------------------------------------- 

Lines 34–35

Don't do anything if kernel profiling is not enabled.

Lines 38–39

_etext and _stext are defined in kernel/head.S. We determine the profile length as delimited by _etext and _stext and then shift the value by prof_shift, which was defined as a kernel parameter.

Lines 41–42

We allocate a contiguous block of memory for storing profiling data of the size requested by the kernel parameters.
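The buffer-size arithmetic on lines 38 through 41 can be restated as a stand-alone function; the addresses used to exercise it below are made-up example values, not real kernel symbols:

```c
/* Sketch of the profile_init() arithmetic: the profiled text region
 * (_stext.._etext) is divided into 2^prof_shift-byte buckets, with one
 * unsigned int counter per bucket, rounded up toward a whole page. */
#define PAGE_SIZE 4096UL

unsigned long profile_buffer_bytes(unsigned long stext, unsigned long etext,
                                   unsigned int prof_shift)
{
    unsigned long prof_len = (etext - stext) >> prof_shift; /* bucket count */
    return prof_len * sizeof(unsigned int) + PAGE_SIZE - 1; /* as on line 41 */
}
```

A 1MB text segment with prof_shift of 2, for instance, yields 262,144 counters, so the allocation request is a little over 1MB of profiling memory.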

8.5.18. The Call to local_irq_enable()

Line 448

The function local_irq_enable() allows interrupts on the current CPU. It is usually paired with local_irq_disable(). In previous kernel versions, the sti(), cli() pair were used for this purpose. Although these macros still resolve to sti() and cli(), the keyword to note here is local. These affect only the currently running processor:

 ---------------------------------------------------------------------- include/asm-i386/system.h  446  #define local_irq_disable()  __asm__ __volatile__("cli": : :"memory") 447  #define local_irq_enable()  __asm__ __volatile__("sti": : :"memory") ---------------------------------------------------------------------- 

Lines 446–447

Referring to the "Inline Assembly" section in Chapter 2, the item in the quotes is the assembly instruction and memory is on the clobber list.
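The same constraint syntax can be tried in user space; "nop" stands in for the privileged sti/cli instructions, while the empty output and input lists and the "memory" clobber are identical in form to the kernel macros (GCC-specific, x86):

```c
/* Runnable illustration of the inline-assembly form used by the kernel
 * macros above. The "memory" clobber tells GCC not to cache memory values
 * in registers across this statement, making it a compiler barrier. */
static inline void irq_macro_style_barrier(void)
{
    /* "nop" replaces sti/cli so this runs unprivileged; the constraint
       syntax (output : input : clobber) is otherwise the same. */
    __asm__ __volatile__("nop" : : : "memory");
}
```

The __volatile__ keyword additionally prevents GCC from deleting or reordering the statement even though it has no outputs, which matters for instructions whose only effect is a hardware side effect.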

8.5.19. initrd Configuration

Lines 449–456

This #ifdef block is a sanity check on initrd, the initial RAM disk.

A system using initrd loads the kernel and mounts the initial RAM disk as the root filesystem. Programs can run from this RAM disk and, when the time comes, a new root filesystem, such as the one on a hard drive, can be mounted and the initial RAM disk unmounted.

This operation simply checks to ensure that the initial RAM disk specified is valid. If it isn't, we set initrd_start to 0, which tells the kernel to not use an initial RAM disk.[8]

[8] For more information, refer to Documentation/initrd.txt.
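The sanity check on lines 450 through 455 can be sketched as a small function; PAGE_SHIFT and the parameter handling below are simplified stand-ins for the kernel's globals:

```c
/* Sketch of the initrd sanity check: if the RAM disk starts below the
 * first page frame the kernel will reuse (min_low_pfn), it would be
 * overwritten, so its start address is zeroed to disable it. */
#define PAGE_SHIFT 12

unsigned long initrd_sanity(unsigned long initrd_start,
                            unsigned long min_low_pfn,
                            int initrd_below_start_ok)
{
    if (initrd_start && !initrd_below_start_ok &&
        initrd_start < (min_low_pfn << PAGE_SHIFT))
        return 0;           /* invalid: tell the kernel not to use it */
    return initrd_start;    /* valid: leave the address alone */
}
```

Note that initrd_below_start_ok acts as an override flag, exactly as in the kernel condition: when set, a low-placed RAM disk is accepted anyway.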

8.5.20. The Call to mem_init()

Line 457

For both x86 and PPC, the call to mem_init() finds all free pages and sends that information to the console. Recall from Chapter 4 that the Linux kernel breaks available memory into zones. Currently, Linux has three zones:

  • Zone_DMA. Memory less than 16MB.

  • Zone_Normal. Memory starting at 16MB but less than 896MB. (The kernel uses the last 128MB.)

  • Zone_HIGHMEM. Memory greater than 896MB.
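The zone boundaries in this list can be sketched as a simple classifier; the enum names and byte constants below follow the text rather than the kernel headers:

```c
/* Classify a physical address into the three zones described above.
 * Boundaries per the text: 16MB for DMA, 896MB for normal memory. */
enum mem_zone { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM };

enum mem_zone zone_of(unsigned long phys)
{
    if (phys < 16UL * 1024 * 1024)      /* ISA DMA-capable region */
        return ZONE_DMA;
    if (phys < 896UL * 1024 * 1024)     /* directly mapped by the kernel */
        return ZONE_NORMAL;
    return ZONE_HIGHMEM;                /* mapped on demand via kmap() */
}
```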

The function mem_init() finds the total number of free page frames in all the memory zones. This function prints out informational kernel messages regarding the beginning state of the memory. This function is architecture dependent because it manages early memory allocation data. Each architecture supplies its own function, although they all perform the same tasks. We first look at how x86 does it and follow it up with PPC:

[View full width]

---------------------------------------------------------------------- arch/i386/mm/init 445 void __init mem_init(void) 446 { 447 extern int ppro_with_ram_bug(void); 448 int codesize, reservedpages, datasize, initsize; 449 int tmp; 450 int bad_ppro; ... 459 #ifdef CONFIG_HIGHMEM 460 if (PKMAP_BASE+LAST_PKMAP*PAGE_SIZE >= FIXADDR_START) { 461 printk(KERN_ERR "fixmap and kmap areas overlap - this will crash\n"); 462 printk(KERN_ERR "pkstart: %lxh pkend:%lxh fixstart %lxh\n", 463 PKMAP_BASE, PKMAP_BASE+LAST_PKMAP*PAGE_SIZE, FIXADDR_START); 464 BUG(); 465 } 466 #endif 467 468 set_max_mapnr_init(); ... 476 /* this will put all low memory onto the freelists */ 477 totalram_pages += __free_all_bootmem(); 478 479 480 reservedpages = 0; 481 for (tmp = 0; tmp < max_low_pfn; tmp++) ... 485 if (page_is_ram(tmp) && PageReserved(pfn_to_page(tmp))) 486 reservedpages++; 487 488 set_highmem_pages_init(bad_ppro); 490 codesize = (unsigned long) &_etext - (unsigned long) &_text; 491 datasize = (unsigned long) &_edata - (unsigned long) &_etext; 492 initsize = (unsigned long) &__init_end - (unsigned long) &__init_begin; 493 494 kclist_add(&kcore_mem, __va(0), max_low_pfn << PAGE_SHIFT); 495 kclist_add(&kcore_vmalloc, (void *)VMALLOC_START, 496 VMALLOC_END-VMALLOC_START); 497 498 printk(KERN_INFO "Memory: %luk/%luk available (%dk kernel code, %dk reserved, %dk data, %dk init, %ldk highmem)\n", 499 (unsigned long) nr_free_pages() << (PAGE_SHIFT-10), 500 num_physpages << (PAGE_SHIFT-10), 501 codesize >> 10, 502 reservedpages << (PAGE_SHIFT-10), 503 datasize >> 10, 504 initsize >> 10, 505 (unsigned long) (totalhigh_pages << (PAGE_SHIFT-10)) 506 ); ... 521 #ifndef CONFIG_SMP 522 zap_low_mappings(); 523 #endif 524 } -----------------------------------------------------------------------

Line 459

This line is a straightforward error check to ensure that the fixmap and kmap areas do not overlap.

Line 468

The function set_max_mapnr_init() (arch/i386/mm/init.c) simply sets the value of num_physpages, which is a global variable (defined in mm/memory.c) that holds the number of available page frames.

Line 477

The call to __free_all_bootmem() marks the freeing up of all low-memory pages. During boot time, all pages are reserved. At this late point in the bootstrapping phase, the available low-memory pages are released. The flow of the function calls is shown in Figure 8.15.

Figure 8.15. __free_all_bootmem() Call Hierarchy


Let's look at the core portion of free_all_bootmem_core() to understand what is happening:

[View full width]

---------------------------------------------------------------------- mm/bootmem.c 257 static unsigned long __init free_all_bootmem_core(pg_data_t *pgdat) 258 { 259 struct page *page; 260 bootmem_data_t *bdata = pgdat->bdata; 261 unsigned long i, count, total = 0; ... 295 page = virt_to_page(bdata->node_bootmem_map); 296 count = 0; 297 for (i = 0; i < ((bdata->node_low_pfn-(bdata->node_boot_start >> PAGE_SHIFT))/8 + PAGE_SIZE-1)/PAGE_SIZE; i++,page++) { 298 count++; 299 ClearPageReserved(page); 300 set_page_count(page, 1); 301 __free_page(page); 302 } 303 total += count; 304 bdata->node_bootmem_map = NULL; 305 306 return total; 307 } -----------------------------------------------------------------------

For all the available low-memory pages, we clear the PG_reserved flag[9] in the flags field of the page struct. Next, we set the count field of the page struct to 1 to indicate that it is in use and call __free_page(), which passes it to the buddy allocator. As Chapter 4 explains in its discussion of the buddy system, this function releases a page and adds it to a free list.

[9] Recall from Chapter 6 that this flag is set in pages that are to be pinned in memory and that it is set for low memory during early bootstrapping.

The function __free_all_bootmem() returns the number of low memory pages available, which is added to the running count of totalram_pages (an unsigned long defined in mm/page_alloc.c).
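The page-release loop just described can be sketched in user space; the bitmap and counters below simulate the kernel's page flags and free lists rather than touching real memory management structures:

```c
/* User-space sketch of the free_all_bootmem_core() loop: walk the pages,
 * clear the simulated PG_reserved flag, hand each page to a simulated
 * buddy free list, and return how many pages were released. */
#define NPAGES 64

static unsigned char reserved[NPAGES];  /* stands in for PG_reserved */
static unsigned long free_list_count;   /* pages handed to the "buddy" */

static void __free_page_sim(int pfn) { (void)pfn; free_list_count++; }

unsigned long free_all_bootmem_sim(void)
{
    unsigned long count = 0;
    for (int i = 0; i < NPAGES; i++) {
        if (!reserved[i])
            continue;
        reserved[i] = 0;        /* ClearPageReserved() */
        __free_page_sim(i);     /* add to the free list */
        count++;
    }
    return count;               /* caller adds this to totalram_pages */
}
```

The return value plays the same role as the kernel's: it is the number of pages released, which the caller accumulates into its running total of available RAM.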

Lines 480–486

These lines update the count of reserved pages.

Line 488

The call to set_highmem_pages_init() marks the initialization of high-memory pages. Figure 8.16 illustrates the calling hierarchy of set_highmem_pages_init().

Figure 8.16. highmem_pages_init Calling Hierarchy


Let's look at the bulk of the code performed in one_highpage_init():

 ---------------------------------------------------------------------- arch/i386/mm/init.c 253  void __init one_highpage_init(struct page *page, int pfn, int bad_ppro) 254  { 255         if (page_is_ram(pfn) && !(bad_ppro && page_kills_ppro(pfn))) { 256                  ClearPageReserved(page); 257                 set_bit(PG_highmem, &page->flags); 258                  set_page_count(page, 1); 259                  __free_page(page); 260                  totalhigh_pages++; 261          } else 262                  SetPageReserved(page); 263  } ---------------------------------------------------------------------- 

Much like __free_all_bootmem(), all high-memory pages have their page struct flags field cleared of the PG_reserved flag, have PG_highmem set, and have their count field set to 1. __free_page() is also called to add these pages to the free lists and the totalhigh_pages counter is incremented.

Lines 490–506

This code block gathers and prints out information regarding the size of memory areas and the number of available pages.

Lines 521–523

The function zap_low_mappings() clears the boot-time page-table entries (PGDs) that map low memory and flushes the TLB.

The function mem_init() marks the end of the boot phase of memory allocation and the beginning of the memory allocation that will be used throughout the system's life.

The PPC code for mem_init() finds and initializes all pages for all zones:

 ---------------------------------------------------------------------- arch/ppc/mm/init.c 393  void __init mem_init(void) 394   { 395   unsigned long addr; 396   int codepages = 0; 397   int datapages = 0; 398   int initpages = 0; 399   #ifdef CONFIG_HIGHMEM 400   unsigned long highmem_mapnr; 402   highmem_mapnr = total_lowmem >> PAGE_SHIFT; 403   highmem_start_page = mem_map + highmem_mapnr; 404  #endif /* CONFIG_HIGHMEM */ 405   max_mapnr = total_memory >> PAGE_SHIFT; 407   high_memory = (void *) __va(PPC_MEMSTART + total_lowmem); 408   num_physpages = max_mapnr;  /* RAM is assumed contiguous */ 410   totalram_pages += free_all_bootmem(); 412  #ifdef CONFIG_BLK_DEV_INITRD 413   /* if we are booted from BootX with an initial ramdisk, 414    make sure the ramdisk pages aren't reserved. */ 415   if (initrd_start) { 416  for (addr = initrd_start; addr < initrd_end; addr += PAGE_SIZE) 417    ClearPageReserved(virt_to_page(addr)); 418  } 419  #endif /* CONFIG_BLK_DEV_INITRD */ 421  #ifdef CONFIG_PPC_OF 422   /* mark the RTAS pages as reserved */ 423   if ( rtas_data ) 424    for (addr = (ulong)__va(rtas_data); 425     addr < PAGE_ALIGN((ulong)__va(rtas_data)+rtas_size) ; 426     addr += PAGE_SIZE) 427     SetPageReserved(virt_to_page(addr)); 428  #endif 429  #ifdef CONFIG_PPC_PMAC 430   if (agp_special_page) 431    SetPageReserved(virt_to_page(agp_special_page)); 432  #endif 433   if ( sysmap ) 434    for (addr = (unsigned long)sysmap; 435     addr < PAGE_ALIGN((unsigned long)sysmap+sysmap_size) ; 436     addr += PAGE_SIZE) 437     SetPageReserved(virt_to_page(addr)); 439   for (addr = PAGE_OFFSET; addr < (unsigned long)high_memory; 440    addr += PAGE_SIZE) { 441    if (!PageReserved(virt_to_page(addr))) 442     continue; 443    if (addr < (ulong) etext) 444     codepages++; 445    else if (addr >= (unsigned long)&__init_begin 446      && addr < (unsigned long)&__init_end) 447     initpages++; 448    else if (addr < (ulong) klimit) 449     datapages++; 450   } 452  
#ifdef CONFIG_HIGHMEM 453   { 454    unsigned long pfn; 456   for (pfn = highmem_mapnr; pfn < max_mapnr; ++pfn) { 457     struct page *page = mem_map + pfn; 459     ClearPageReserved(page); 460     set_bit(PG_highmem, &page->flags); 461     set_page_count(page, 1); 462     __free_page(page); 463     totalhigh_pages++; 464    } 465    totalram_pages += totalhigh_pages; 466   } 467  #endif /* CONFIG_HIGHMEM */ 469  printk("Memory: %luk available (%dk kernel code, %dk data, %dk init, %ldk highmem)\n", 470     (unsigned long)nr_free_pages()<< (PAGE_SHIFT-10), 471     codepages<< (PAGE_SHIFT-10), datapages<< (PAGE_SHIFT-10), 472     initpages<< (PAGE_SHIFT-10), 473     (unsigned long) (totalhigh_pages << (PAGE_SHIFT-10))); 474   if (sysmap) 475    printk("System.map loaded at 0x%08x for debugger, size: %ld bytes\n", 476     (unsigned int)sysmap, sysmap_size); 477  #ifdef CONFIG_PPC_PMAC 478   if (agp_special_page) 479    printk(KERN_INFO "AGP special page: 0x%08lx\n", agp_special_page);  480  #endif 482   /* Make sure all our pagetable pages have page->mapping 483    and page->index set correctly. */ 484   for (addr = KERNELBASE; addr != 0; addr += PGDIR_SIZE) { 485    struct page *pg; 486    pmd_t *pmd = pmd_offset(pgd_offset_k(addr), addr); 487    if (pmd_present(*pmd)) { 488     pg = pmd_page(*pmd); 489     pg->mapping = (void *) &init_mm; 490     pg->index = addr; 491    } 492   } 493   mem_init_done = 1; 494  } ----------------------------------------------------------------------- 

Lines 399–410

These lines find the amount of memory available. If HIGHMEM is used, those pages are also counted. The global variable totalram_pages is modified to reflect this.

Lines 412–419

If booted with an initial RAM disk, make sure its pages are not left marked as reserved.

Lines 421–432

Depending on the boot environment, reserve pages for the Run-Time Abstraction Services (RTAS) and AGP (video), if needed.

Lines 433–450

If required, reserve some pages for system map.

Lines 452–467

If using HIGHMEM, clear any reserved pages and modify the global variable totalram_pages.

Lines 469–480

Print memory information to the console.

Lines 482–492

Loop through the kernel page directory and set the mapping and index fields of each page-table page.

8.5.21. The Call to late_time_init()

Lines 459–460

The function late_time_init() uses HPET (refer to the discussion under "The Call to time_init" section). This function is used only with the Intel architecture and HPET. This function has essentially the same code as time_init(); it is just called after memory initialization to allow the HPET to be mapped into physical memory.

8.5.22. The Call to calibrate_delay()

Line 461

The function calibrate_delay() in init/main.c calculates and prints the value of the much celebrated "BogoMIPS," a measurement that indicates the number of delay() iterations your processor can perform in a clock tick. calibrate_delay() allows delays to be approximately the same across processors of different speeds. The resulting value, at most an indicator of how fast a processor is running, is stored in loops_per_jiffy, and the udelay() and mdelay() functions use it to set the number of delay() iterations to perform:

 ---------------------------------------------------------------------- init/main.c void __init calibrate_delay(void) {   unsigned long ticks, loopbit;   int lps_precision = LPS_PREC; 186   loops_per_jiffy = (1<<12);   printk("Calibrating delay loop... "); 189   while (loops_per_jiffy <<= 1) {    /* wait for "start of" clock tick */    ticks = jiffies;    while (ticks == jiffies)     /* nothing */;    /* Go .. */    ticks = jiffies;    __delay(loops_per_jiffy);    ticks = jiffies - ticks;    if (ticks)     break; 200   } /* Do a binary approximation to get loops_per_jiffy set to equal one clock  (up to lps_precision bits) */ 204   loops_per_jiffy >>= 1;   loopbit = loops_per_jiffy; 206   while ( lps_precision-- && (loopbit >>= 1) ) {    loops_per_jiffy |= loopbit;    ticks = jiffies;    while (ticks == jiffies);     ticks = jiffies;    __delay(loops_per_jiffy);    if (jiffies != ticks)  /* longer than 1 tick */     loops_per_jiffy &= ~loopbit; 214   } /* Round the value and print it */   217   printk("%lu.%02lu BogoMIPS\n",    loops_per_jiffy/(500000/HZ), 219    (loops_per_jiffy/(5000/HZ)) % 100); } ---------------------------------------------------------------------- 

Line 186

Start at 0x1000 (1<<12).

Lines 189–200

Keep doubling loops_per_jiffy until the time taken by __delay(loops_per_jiffy) exceeds one jiffy.

Line 204

Divide loops_per_jiffy by 2.

Lines 206–214

Successively try setting each descending bit of loops_per_jiffy, keeping a bit only if the resulting delay still completes within one clock tick.

Lines 217–219

Print the value out as if it were a float.
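The doubling-then-binary-approximation algorithm can be exercised against a simulated timer; TRUE_LPJ below is a made-up target value that the fake __delay() is pretended to match, not a real measurement:

```c
/* Sketch of the calibrate_delay() search: double loops_per_jiffy until one
 * __delay() spans a tick, then refine the low bits by binary approximation.
 * delay_exceeds_tick() simulates "did the delay outlast one jiffy?". */
#define TRUE_LPJ 123456UL
#define LPS_PREC 8

static int delay_exceeds_tick(unsigned long loops) { return loops > TRUE_LPJ; }

unsigned long calibrate_delay_sim(void)
{
    unsigned long lpj = 1UL << 12;
    while ((lpj <<= 1) && !delay_exceeds_tick(lpj))
        ;                               /* doubling phase (lines 189-200) */
    lpj >>= 1;                          /* back off to the last in-tick value */

    int prec = LPS_PREC;
    unsigned long bit = lpj;
    while (prec-- && (bit >>= 1)) {     /* binary approximation (lines 206-214) */
        if (!delay_exceeds_tick(lpj | bit))
            lpj |= bit;                 /* keep the bit: still within one tick */
    }
    return lpj;
}
```

With eight bits of precision the search lands within one low-order bit of the target, which is why LPS_PREC bounds both the accuracy and the calibration time.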

8.5.23. The Call to pgtable_cache_init()

Line 463

The key function in this x86 code block is the system function kmem_cache_create(). This function creates a named cache. The first parameter is a string used to identify it in /proc/slabinfo:

 ---------------------------------------------------------------------- arch/i386/mm/init.c 529 kmem_cache_t *pgd_cache; 530 kmem_cache_t *pmd_cache; 531  532 void __init pgtable_cache_init(void) 533 { 534   if (PTRS_PER_PMD > 1) { 535     pmd_cache = kmem_cache_create("pmd", 536           PTRS_PER_PMD*sizeof(pmd_t), 537           0, 538           SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN, 539           pmd_ctor, 540           NULL); 541     if (!pmd_cache) 542       panic("pgtable_cache_init(): cannot create pmd cache"); 543   }     544   pgd_cache = kmem_cache_create("pgd", 545         PTRS_PER_PGD*sizeof(pgd_t), 546         0, 547         SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN, 548         pgd_ctor, 549         PTRS_PER_PMD == 1 ? pgd_dtor : NULL); 550   if (!pgd_cache) 551     panic("pgtable_cache_init(): Cannot create pgd cache"); 552 } ---------------------------------------------------------------------- ---------------------------------------------------------------------- arch/ppc64/mm/init.c 976 void pgtable_cache_init(void) 977 { 978   zero_cache = kmem_cache_create("zero", 979         PAGE_SIZE, 980         0, 981         SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN, 982         zero_ctor, 983         NULL); 984   if (!zero_cache) 985     panic("pgtable_cache_init(): could not create zero_cache!\n"); 986 } ---------------------------------------------------------------------- 

Lines 532–542

Create the pmd cache.

Lines 544–551

Create the pgd cache.

On the PPC, which has hardware-assisted hashing, pgtable_cache_init() is a no-op:

 ---------------------------------------------------------------------- include/asm-ppc/pgtable.h 685  #define pgtable_cache_init()  do { } while (0) 
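The kmem_cache_create() pattern (a named cache of fixed-size objects with a per-object constructor) can be sketched in user space. This models only the interface shape, not the SLAB allocator itself, and the names below are illustrative:

```c
/* User-space sketch of a named object cache in the style of
 * kmem_cache_create(): fixed object size, optional constructor run on
 * every freshly allocated object. malloc() stands in for SLAB pages. */
#include <stdlib.h>
#include <string.h>

typedef struct cache_sim {
    const char *name;           /* as shown in /proc/slabinfo */
    size_t objsize;
    void (*ctor)(void *obj);    /* like pgd_ctor/pmd_ctor above */
} cache_sim_t;

cache_sim_t *cache_create(const char *name, size_t objsize,
                          void (*ctor)(void *))
{
    cache_sim_t *c = malloc(sizeof(*c));
    if (!c)
        return NULL;
    c->name = name;
    c->objsize = objsize;
    c->ctor = ctor;
    return c;
}

void *cache_alloc(cache_sim_t *c)
{
    void *obj = malloc(c->objsize);
    if (obj && c->ctor)
        c->ctor(obj);           /* constructor initializes the new object */
    return obj;
}

/* Example constructor, analogous to zero_ctor on PPC64: zero the object. */
static void zero_ctor_sim(void *obj) { memset(obj, 0, 16); }
```

The design point mirrored here is that the constructor is attached to the cache, not to the allocation call, so every object comes back pre-initialized regardless of who allocates it.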

8.5.24. The Call to buffer_init()

Line 472

The buffer_init() function in fs/buffer.c sets up the buffer_head structures, which hold data from filesystem devices:

 ---------------------------------------------------------------------- fs/buffer.c 3031  void __init buffer_init(void) {   int i;   int nrpages; 3036   bh_cachep = kmem_cache_create("buffer_head",     sizeof(struct buffer_head), 0,     0, init_buffer_head, NULL); 3039   for (i = 0; i < ARRAY_SIZE(bh_wait_queue_heads); i++)    init_waitqueue_head(&bh_wait_queue_heads[i].wqh); 3044   nrpages = (nr_free_buffer_pages() * 10) / 100;   max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));   hotcpu_notifier(buffer_cpu_notify, 0); 3048  }   ---------------------------------------------------------------------- 

Line 3036

Create the buffer_head SLAB cache.

Line 3039

Create a table of buffer hash wait queues.

Line 3044

Limit low-memory occupancy to 10 percent.
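The 10 percent cap computed on line 3044 can be expressed as a stand-alone calculation; the inputs used to exercise it below are example values, not measured ones:

```c
/* Sketch of the max_buffer_heads arithmetic: take 10 percent of the free
 * buffer pages, then convert pages to buffer_head slots by dividing the
 * page size by the size of one buffer head. */
#define PAGE_SIZE 4096UL

unsigned long max_buffer_heads_for(unsigned long nr_free_buffer_pages,
                                   unsigned long bh_size)
{
    unsigned long nrpages = (nr_free_buffer_pages * 10) / 100; /* 10% cap */
    return nrpages * (PAGE_SIZE / bh_size);
}
```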

8.5.25. The Call to security_scaffolding_startup()

Line 474

The 2.6 Linux kernel contains code for loading kernel modules that implement various security features. security_scaffolding_startup() simply verifies that a security operations object exists, and if it does, calls the security module's initialization functions.

How security modules can be created and what kind of issues a writer might face are beyond the scope of this text. For more information, consult Linux Security Modules (http://lsm.immunix.org/) and the Linux-security-module mailing list (http://mail.wirex.com/mailman/listinfo/linux-security-module).

8.5.26. The Call to vfs_caches_init()

Line 475

The VFS subsystem depends on memory caches, called SLAB caches, to hold the structures it manages. Chapter 4 discusses SLAB caches in detail. The vfs_caches_init() function initializes the SLAB caches that the subsystem uses. Figure 8.17 shows an overview of the main function hierarchy called from vfs_caches_init(). We explore each function in this call hierarchy in detail; you can refer to the hierarchy to keep track of the functions as we look at each of them.

Figure 8.17. vfs_caches_init() Call Hierarchy


Table 8.3 summarizes the objects introduced by the vfs_caches_init() function or by one of the functions it calls.

 ---------------------------------------------------------------------- fs/dcache.c 1623  void __init vfs_caches_init(unsigned long mempages) 1624  { 1625   names_cachep = kmem_cache_create("names_cache",  1626     PATH_MAX, 0,  1627     SLAB_HWCACHE_ALIGN, NULL, NULL); 1628   if (!names_cachep) 1629    panic("Cannot create names SLAB cache"); 1630   1631   filp_cachep = kmem_cache_create("filp",  1632     sizeof(struct file), 0, 1633     SLAB_HWCACHE_ALIGN, filp_ctor, filp_dtor); 1634   if(!filp_cachep) 1635    panic("Cannot create filp SLAB cache"); 1636   1637   dcache_init(mempages); 1638   inode_init(mempages); 1639   files_init(mempages);  1640   mnt_init(mempages); 1641   bdev_cache_init(); 1642   chrdev_init(); 1643  } ----------------------------------------------------------------------- 

Table 8.3. Objects Introduced by vfs_caches_init()

Object Name         Description
names_cachep        Global variable
filp_cachep         Global variable
inode_cache         Global variable
dentry_cache        Global variable
mnt_cache           Global variable
namespace           Struct
mount_hashtable     Global variable
root_fs_type        Global variable
file_system_type    Struct (discussed in Chapter 6)
bdev_cachep         Global variable


Line 1623

The routine takes in the global variable num_physpages (whose value is calculated during mem_init()) as a parameter that holds the number of physical pages available in the system's memory. This number influences the creation of SLAB caches, as we see later.

Lines 1625-1629

The next step is to create the names_cachep memory area. Chapter 4 describes the kmem_cache_create() function in detail. This cache holds objects of size PATH_MAX, the maximum number of characters a pathname may have. (This value is set in linux/limits.h as 4,096.) At this point, the newly created cache is empty of objects. The actual memory areas of size PATH_MAX are allocated upon the first, and potentially subsequent, calls to getname().

As discussed in Chapter 6, the getname() routine is called at the beginning of some of the file-related system calls (for example, sys_open()) to read the file pathname from the process address space. Objects are freed from the cache with the putname() routine.

If the names_cache cache cannot be created, the kernel jumps to the panic routine, exiting the function's flow of control.

Lines 1631-1635

The filp_cachep cache is created next, with objects the size of the file structure. The object holding the file structure is allocated by the get_empty_filp() (fs/file_table.c) routine, which is called, for example, upon creation of a pipe or the opening of a file. The file descriptor object is deallocated by a call to the file_free() (fs/file_table.c) routine.

Line 1637

The dcache_init() (fs/dcache.c) routine creates the SLAB cache that holds dentry descriptors.[10] The cache itself is called dentry_cache. A dentry descriptor is created for each hierarchical component of the pathnames that processes refer to when accessing a file or directory. The structure associates each file or directory component with the inode that represents it, which speeds up subsequent lookups of that component's inode.

[10] Recall that dentry is short for directory entry.

Line 1638

The inode_init() (fs/inode.c) routine initializes the inode hash table and the wait queue head array used for storing hashed inodes that the kernel wants to lock. The wait queue heads (wait_queue_head_t) for hashed inodes are stored in an array called i_wait_queue_heads. This array gets initialized at this point of the system's startup process.

The inode_hashtable gets created at this point. This table speeds up inode searches. The last thing that occurs is the creation of the SLAB cache used to hold inode objects, called inode_cache. The memory areas for this cache are allocated upon calls to alloc_inode() (fs/inode.c) and freed upon calls to destroy_inode() (fs/inode.c).

Line 1639

The files_init() routine is called to determine the maximum amount of memory allowed for files per process. The max_files field of the files_stat structure is set. This is then referenced upon file creation to determine if there is enough memory to open the file. Let's look at this routine:

----------------------------------------------------------------------
fs/file_table.c
292  void __init files_init(unsigned long mempages)
293  {
294   int n;
...
299   n = (mempages * (PAGE_SIZE / 1024)) / 10;
300   files_stat.max_files = n;
301   if (files_stat.max_files < NR_FILE)
302    files_stat.max_files = NR_FILE;
303  }
----------------------------------------------------------------------

Line 299

The page size is divided by 1,024, the amount of space (1KB) that a file, along with its associated inode and cache, roughly occupies. This value is then multiplied by the number of pages to get the total number of such 1KB "blocks" available for files. The division by 10 means that, by default, memory usage for files is limited to no more than 10 percent of the available memory.

Lines 301-302

NR_FILE (include/linux/fs.h) is set to 8,192; it acts as a floor, so max_files is never set below this default.

Line 1640

The next routine, called mnt_init(), creates the cache that will hold the vfsmount objects the VFS uses for mounting filesystems. The cache is called mnt_cache. The routine also creates the mount_hashtable array, which stores references to objects in mnt_cache for faster access. It then issues calls to initialize the sysfs filesystem and mounts the root filesystem. Let's closely look at the creation of the hash table:


----------------------------------------------------------------------
fs/namespace.c
1137 void __init mnt_init(unsigned long mempages)
{
1139  struct list_head *d;
1140  unsigned long order;
1141  unsigned int nr_hash;
1142  int i;
...
1149  order = 0;
1150  mount_hashtable = (struct list_head *)
1151    __get_free_pages(GFP_ATOMIC, order);
1152
1153  if (!mount_hashtable)
1154   panic("Failed to allocate mount hash table\n");
...
1161  nr_hash = (1UL << order) * PAGE_SIZE / sizeof(struct list_head);
1162  hash_bits = 0;
1163  do {
1164   hash_bits++;
1165  } while ((nr_hash >> hash_bits) != 0);
1166  hash_bits--;
...
1172  nr_hash = 1UL << hash_bits;
1173  hash_mask = nr_hash - 1;
1174
1175  printk("Mount-cache hash table entries: %d (order: %ld, %ld bytes)\n",
      nr_hash, order, (PAGE_SIZE << order));
...
1179  d = mount_hashtable;
1180  i = nr_hash;
1181  do {
1182   INIT_LIST_HEAD(d);
1183   d++;
1184   i--;
1185  } while (i);
...
1189 }
----------------------------------------------------------------------

Lines 1139-1144

The hash table array consists of a full page of memory. Chapter 4 explains in detail how the routine __get_free_pages() works. In a nutshell, it returns a pointer to a memory area of 2^order pages. In this case, order is 0, so we allocate one page to hold the hash table.

Lines 1161-1173

The next step is to determine the number of entries in the table. nr_hash is set to the number of list heads that fit in the allocated pages. The loop then computes hash_bits, the position of the most significant set bit in nr_hash. Line 1172 rounds nr_hash down to that single leftmost bit, the largest power of two that fits, and line 1173 derives the bitmask from the new nr_hash value.

Lines 1179-1185

Finally, we initialize the hash table through a call to the INIT_LIST_HEAD macro, which takes in a pointer to the memory area where a new list head is to be initialized. We do this nr_hash times (or the number of entries that the table can hold).

Let's walk through an example: We assume a PAGE_SIZE of 4KB and a struct list_head of 8 bytes. Because order is equal to 0, one page is allocated and nr_hash becomes 4,096 / 8 = 512; that is, 512 list_head structs can fit in one 4KB table. (The (1UL << order) term is the number of pages that have been allocated. For example, if the order had been 1, meaning we had requested 2^1 pages for the hash table, 0000 0001 bit-shifted once to the left becomes 0000 0010, or 2 in decimal notation.) Next, we calculate the number of bits the hash key needs. Walking through the loop with beginning values hash_bits = 0 and nr_hash = 512 (10 0000 0000 in binary), we get the following:

  • Iterations 1 through 9: hash_bits counts up from 1 to 9, and (512 >> hash_bits) is still nonzero; on iteration 9, (512 >> 9) = 1.

  • Iteration 10: hash_bits = 10, and (512 >> 10) = 0, so the loop exits.

After breaking out of the while loop, hash_bits is decremented to 9, nr_hash is recomputed as 1 << 9 = 512 (unchanged, because 512 is already a power of two), and hash_mask is set to 511 (01 1111 1111). Had nr_hash not been an exact power of two (say, 500), the same procedure would round it down to 256 and yield a mask of 255.

After the mnt_init() routine initializes mount_hashtable and creates mnt_cache, it issues three calls:

----------------------------------------------------------------------
fs/namespace.c
...
1189   sysfs_init();
1190   init_rootfs();
1191   init_mount_tree();
1192  }
----------------------------------------------------------------------

sysfs_init() is responsible for the creation of the sysfs filesystem. init_rootfs() and init_mount_tree() are together responsible for mounting the root filesystem. We closely look at each routine in turn.

----------------------------------------------------------------------
init_rootfs()
fs/ramfs/inode.c
218  static struct file_system_type rootfs_fs_type = {
219   .name   = "rootfs",
220   .get_sb  = rootfs_get_sb,
221   .kill_sb  = kill_litter_super,
222  };
...
237  int __init init_rootfs(void)
238  {
239   return register_filesystem(&rootfs_fs_type);
240  }
----------------------------------------------------------------------

The rootfs filesystem is an initial filesystem the kernel mounts. It is a simple and quite empty directory that becomes overmounted by the real filesystem at a later point in the kernel boot-up process.

Lines 218222

This code block is the declaration of the rootfs_fs_type file_system_type struct. Only the two methods for getting and killing the associated superblock are defined.

Lines 237240

The init_rootfs() routine merely registers this rootfs with the kernel. This makes all the information regarding the filesystem type (the information stored in the file_system_type struct) available within the kernel.

----------------------------------------------------------------------
init_mount_tree()
fs/namespace.c
1107  static void __init init_mount_tree(void)
1108  {
1109   struct vfsmount *mnt;
1110   struct namespace *namespace;
1111   struct task_struct *g, *p;
1112
1113   mnt = do_kern_mount("rootfs", 0, "rootfs", NULL);
1114   if (IS_ERR(mnt))
1115    panic("Can't create rootfs");
1116   namespace = kmalloc(sizeof(*namespace), GFP_KERNEL);
1117   if (!namespace)
1118    panic("Can't allocate initial namespace");
1119   atomic_set(&namespace->count, 1);
1120   INIT_LIST_HEAD(&namespace->list);
1121   init_rwsem(&namespace->sem);
1122   list_add(&mnt->mnt_list, &namespace->list);
1123   namespace->root = mnt;
1124
1125   init_task.namespace = namespace;
1126   read_lock(&tasklist_lock);
1127   do_each_thread(g, p) {
1128    get_namespace(namespace);
1129    p->namespace = namespace;
1130   } while_each_thread(g, p);
1131   read_unlock(&tasklist_lock);
1132
1133   set_fs_pwd(current->fs, namespace->root, namespace->root->mnt_root);
1134   set_fs_root(current->fs, namespace->root, namespace->root->mnt_root);
1135  }
-----------------------------------------------------------------------

Lines 1116-1123

Initialize the process namespace. This structure keeps pointers to the mount tree-related structures and the corresponding dentry. The namespace object is allocated, the count set to 1, the list field of type list_head is initialized, the semaphore that locks the namespace (and the mount tree) is initialized, and the root field corresponding to the vfsmount structure is set to point to our newly allocated vfsmount.

Line 1125

The current task's (the init task's) process descriptor namespace field is set to point at the namespace object we just allocated and initialized. (The current process is Process 0.)

Lines 1133-1134

The following two routines set the values of four fields in the fs_struct associated with our process. fs_struct holds fields for the root and current working directory entries, which these two routines set.

We just finished exploring what happens in the mnt_init() function. Let's continue exploring vfs_caches_init().

----------------------------------------------------------------------
1641  bdev_cache_init()
fs/block_dev.c
290  void __init bdev_cache_init(void)
291  {
292   int err;
293   bdev_cachep = kmem_cache_create("bdev_cache",
294     sizeof(struct bdev_inode),
295     0,
296     SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT,
297     init_once,
298     NULL);
299   if (!bdev_cachep)
300    panic("Cannot create bdev_cache SLAB cache");
301   err = register_filesystem(&bd_type);
302   if (err)
303    panic("Cannot register bdev pseudo-fs");
304   bd_mnt = kern_mount(&bd_type);
305   err = PTR_ERR(bd_mnt);
306   if (IS_ERR(bd_mnt))
307    panic("Cannot create bdev pseudo-fs");
308   blockdev_superblock = bd_mnt->mnt_sb;  /* For writeback */
309  }
----------------------------------------------------------------------

Lines 293-298

Create the bdev_cache SLAB cache, which holds bdev_inodes.

Line 301

Register the bdev special filesystem. It has been defined as follows:

----------------------------------------------------------------------
fs/block_dev.c
294  static struct file_system_type bd_type = {
295   .name   = "bdev",
296   .get_sb  = bd_get_sb,
297   .kill_sb  = kill_anon_super,
298  };
----------------------------------------------------------------------

As you can see, the file_system_type struct of the bdev special filesystem has only two routines defined: one for fetching the filesystem's superblock and the other for removing/freeing the superblock. At this point, you might wonder why block devices are registered as filesystems. In Chapter 6, we saw that systems that are not technically filesystems can use filesystem kernel structures; that is, they do not have mount points but can make use of the VFS kernel structures that support filesystems. Block devices are one instance of a pseudo filesystem that makes use of the VFS filesystem kernel structures. As with bdev, these special filesystems generally define only a limited number of fields because not all of them make sense for the particular application.

Lines 304-308

The call to kern_mount() sets up all the mount-related VFS structures and returns a vfsmount structure; these lines set the global variable bd_mnt to point to it and blockdev_superblock to point to its superblock. (See Chapter 6 for more information on vfsmount.)

This function initializes the character device objects that surround the driver model:

----------------------------------------------------------------------
1642  chrdev_init
fs/char_dev.c
     void __init chrdev_init(void)
     {
433   subsystem_init(&cdev_subsys);
434   cdev_map = kobj_map_init(base_probe, &cdev_subsys);
435  }
----------------------------------------------------------------------

8.5.27. The Call to radix_tree_init()

Line 476

The 2.6 Linux kernel uses a radix tree to manage pages within the page cache. Here, we create the SLAB cache from which the radix tree's nodes are allocated and precompute the tree's per-height index limits:

----------------------------------------------------------------------
lib/radix-tree.c
798 void __init radix_tree_init(void)
799 {
800   radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
801       sizeof(struct radix_tree_node), 0,
802       SLAB_PANIC, radix_tree_node_ctor, NULL);
803   radix_tree_init_maxindex();
804   hotcpu_notifier(radix_tree_callback, 0);
-----------------------------------------------------------------------
----------------------------------------------------------------------
lib/radix-tree.c
768 static __init void radix_tree_init_maxindex(void)
769 {
770   unsigned int i;
771
772   for (i = 0; i < ARRAY_SIZE(height_to_maxindex); i++)
773     height_to_maxindex[i] = __maxindex(i);
774 }
-----------------------------------------------------------------------

Notice how radix_tree_init() creates the node cache and radix_tree_init_maxindex() fills in the radix tree's lookup table, height_to_maxindex[].

hotcpu_notifier() (on line 804) refers to Linux 2.6's capability to hotswap CPUs. When a CPU is hotswapped, the kernel calls radix_tree_callback(), which attempts to cleanly free the parts of the page cache that were linked to the hotswapped CPU.

8.5.28. The Call to signals_init()

Line 477

The signals_init() function in kernel/signal.c initializes the kernel signal queue:

----------------------------------------------------------------------
kernel/signal.c
2565  void __init signals_init(void)
2566  {
2567   sigqueue_cachep =
2568     kmem_cache_create("sigqueue",
2569       sizeof(struct sigqueue),
2570       __alignof__(struct sigqueue),
2571       0, NULL, NULL);
2572   if (!sigqueue_cachep)
2573    panic("signals_init(): cannot create sigqueue SLAB cache");
2574  }
-----------------------------------------------------------------------

Lines 2567-2571

Allocate SLAB memory for sigqueue.

8.5.29. The Call to page_writeback_init()

Line 479

The page_writeback_init() function initializes the values controlling when a dirty page is written back to disk. Dirty pages are not immediately written back to disk; they are written after a certain amount of time passes or a certain number or percent of the pages in memory are marked as dirty. This init function attempts to determine the optimum number of pages that must be dirty before triggering a background write and a dedicated write. Background dirty-page writes take up much less processing power than dedicated dirty-page writes:

----------------------------------------------------------------------
mm/page-writeback.c
488 /*
489 * If the machine has a large highmem:lowmem ratio then scale back the default
490 * dirty memory thresholds: allowing too much dirty highmem pins an excessive
491 * number of buffer_heads.
492 */
493 void __init page_writeback_init(void)
494 {
495   long buffer_pages = nr_free_buffer_pages();
496   long correction;
497
498   total_pages = nr_free_pagecache_pages();
499
500   correction = (100 * 4 * buffer_pages) / total_pages;
501
502   if (correction < 100) {
503     dirty_background_ratio *= correction;
504     dirty_background_ratio /= 100;
505     vm_dirty_ratio *= correction;
506     vm_dirty_ratio /= 100;
507   }
508   mod_timer(&wb_timer, jiffies + (dirty_writeback_centisecs * HZ) / 100);
509   set_ratelimit();
510   register_cpu_notifier(&ratelimit_nb);
511 }
-----------------------------------------------------------------------

Lines 495-507

If we are operating on a machine whose page cache is large compared to the number of buffer-capable pages, we lower the dirty-page writeback thresholds. Without this correction, the allowed amount of dirty highmem would pin an inordinate number of buffer_heads. (This is the meaning of the comment before page_writeback_init().)

The default background writeback, dirty_background_ratio, starts when 10 percent of the pages are dirty. A dedicated writeback, vm_dirty_ratio, starts when 40 percent of the pages are dirty.

Line 508

We modify the writeback timer, wb_timer, to be triggered periodically (every 5 seconds by default).

Line 509

Next, set_ratelimit() is called; it is excellently documented, so I defer to its inline comments:

----------------------------------------------------------------------
mm/page-writeback.c
450 /*
451 * If ratelimit_pages is too high then we can get into dirty-data overload
452 * if a large number of processes all perform writes at the same time.
453 * If it is too low then SMP machines will call the (expensive)
454 * get_writeback_state too often.
455 *
456 * Here we set ratelimit_pages to a level which ensures that when all CPUs are
457 * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
458 * thresholds before writeback cuts in.
459 *
460 * But the limit should not be set too high. Because it also controls the
461 * amount of memory which the balance_dirty_pages() caller has to write back.
462 * If this is too large then the caller will block on the IO queue all the
463 * time. So limit it to four megabytes - the balance_dirty_pages() caller
464 * will write six megabyte chunks, max.
465 */
466
467 static void set_ratelimit(void)
468 {
469   ratelimit_pages = total_pages / (num_online_cpus() * 32);
470   if (ratelimit_pages < 16)
471     ratelimit_pages = 16;
472   if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
473     ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE;
474 }
-----------------------------------------------------------------------

Line 510

The final command of page_writeback_init() registers the ratelimit notifier block, ratelimit_nb, with the CPU notifier. The ratelimit notifier block calls ratelimit_handler() when notified, which in turn, calls set_ratelimit(). The purpose of this is to recalculate ratelimit_pages when the number of online CPUs changes:

----------------------------------------------------------------------
mm/page-writeback.c
483 static struct notifier_block ratelimit_nb = {
484   .notifier_call = ratelimit_handler,
485   .next   = NULL,
486 };
-----------------------------------------------------------------------

Finally, we need to examine what happens when the wb_timer (from line 508) goes off and calls wb_timer_fn():

----------------------------------------------------------------------
mm/page-writeback.c
414 static void wb_timer_fn(unsigned long unused)
415 {
416   if (pdflush_operation(wb_kupdate, 0) < 0)
417     mod_timer(&wb_timer, jiffies + HZ); /* delay 1 second */
418 }
-----------------------------------------------------------------------

Lines 416-417

When the timer goes off, the kernel triggers pdflush_operation(), which awakens one of the pdflush threads to perform the actual writeback of dirty pages to disk. If pdflush_operation() cannot awaken any pdflush thread, it sets the writeback timer to trigger again in 1 second and retry. See Chapter 9, "Building the Linux Kernel," for more information on pdflush.

8.5.30. The Call to proc_root_init()

Lines 480-482

As Chapter 2 explained, the CONFIG_* #define refers to a compile-time variable. If, at compile time, the proc filesystem is selected, the next step in initialization is the call to proc_root_init():

----------------------------------------------------------------------
fs/proc/root.c
40  void __init proc_root_init(void)
41  {
42   int err = proc_init_inodecache();
43   if (err)
44    return;
45   err = register_filesystem(&proc_fs_type);
46   if (err)
47    return;
48   proc_mnt = kern_mount(&proc_fs_type);
49   err = PTR_ERR(proc_mnt);
50   if (IS_ERR(proc_mnt)) {
51    unregister_filesystem(&proc_fs_type);
52    return;
53   }
54   proc_misc_init();
55   proc_net = proc_mkdir("net", 0);
56  #ifdef CONFIG_SYSVIPC
57   proc_mkdir("sysvipc", 0);
58  #endif
59  #ifdef CONFIG_SYSCTL
60   proc_sys_root = proc_mkdir("sys", 0);
61  #endif
62  #if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
63   proc_mkdir("sys/fs", 0);
64   proc_mkdir("sys/fs/binfmt_misc", 0);
65  #endif
66   proc_root_fs = proc_mkdir("fs", 0);
67   proc_root_driver = proc_mkdir("driver", 0);
68   proc_mkdir("fs/nfsd", 0); /* somewhere for the nfsd filesystem to be mounted */
69  #if defined(CONFIG_SUN_OPENPROMFS) || defined(CONFIG_SUN_OPENPROMFS_MODULE)
70   /* just give it a mountpoint */
71   proc_mkdir("openprom", 0);
72  #endif
73   proc_tty_init();
74  #ifdef CONFIG_PROC_DEVICETREE
75   proc_device_tree_init();
76  #endif
77   proc_bus = proc_mkdir("bus", 0);
78  }
-----------------------------------------------------------------------

Line 42

This line initializes the inode cache that holds the inodes for this filesystem.

Line 45

The file_system_type structure proc_fs_type is registered with the kernel. Let's closely look at the structure:

----------------------------------------------------------------------
fs/proc/root.c
33  static struct file_system_type proc_fs_type = {
34   .name   = "proc",
35   .get_sb  = proc_get_sb,
36   .kill_sb  = kill_anon_super,
37  };
----------------------------------------------------------------------

The file_system_type structure, which defines the filesystem's name simply as proc, has the routines for retrieving and freeing the superblock structures.

Line 48

We mount the proc filesystem. See the sidebar on kern_mount for more details as to what happens here.

Lines 54-78

The call to proc_misc_init() is what creates most of the entries you see in the /proc filesystem. It creates entries with calls to create_proc_read_entry(), create_proc_entry(), and create_proc_seq_entry(). The remainder of the code block consists of calls to proc_mkdir() to create directories under /proc, the call to proc_tty_init() to create the tree under /proc/tty, and, if CONFIG_PROC_DEVICETREE is set at configuration time, the call to proc_device_tree_init() to create the /proc/device-tree subtree.

8.5.31. The Call to init_idle()

Line 490

init_idle() is called near the end of start_kernel() with parameters current and smp_processor_id() to prepare start_kernel() for rescheduling:

----------------------------------------------------------------------
kernel/sched.c
2643 void __init init_idle(task_t *idle, int cpu)
2644 {
2645   runqueue_t *idle_rq = cpu_rq(cpu), *rq = cpu_rq(task_cpu(idle));
2646   unsigned long flags;
2647
2648   local_irq_save(flags);
2649   double_rq_lock(idle_rq, rq);
2650
2651   idle_rq->curr = idle_rq->idle = idle;
2652   deactivate_task(idle, rq);
2653   idle->array = NULL;
2654   idle->prio = MAX_PRIO;
2655   idle->state = TASK_RUNNING;
2656   set_task_cpu(idle, cpu);
2657   double_rq_unlock(idle_rq, rq);
2658   set_tsk_need_resched(idle);
2659   local_irq_restore(flags);
2660
2661   /* Set the preempt count _outside_ the spinlocks! */
2662 #ifdef CONFIG_PREEMPT
2663   idle->thread_info->preempt_count = (idle->lock_depth >= 0);
2664 #else
2665   idle->thread_info->preempt_count = 0;
2666 #endif
2667 }
-----------------------------------------------------------------------

Line 2645

We store the CPU request queue of the CPU that we're on and the CPU request queue of the CPU that the given task idle is on. In our case, with current and smp_processor_id(), these request queues will be equal.

Lines 2648-2649

We save the IRQ flags and obtain the lock on both request queues.

Line 2651

We set the current task of the CPU request queue of the CPU that we're on to the task idle.

Lines 2652-2656

These statements remove the task idle from its request queue and move it to the CPU request queue of cpu.

Lines 2657-2659

We release the request queue locks on the run queues that we previously locked. Then, we mark task idle for rescheduling and restore the IRQs that we previously saved. We finally set the preemption counter if kernel preemption is configured.

8.5.32. The Call to rest_init()

Line 493

The rest_init() routine is fairly straightforward. It essentially creates what we call the init thread, releases the kernel lock taken during initialization, and calls the idle thread:

----------------------------------------------------------------------
init/main.c
388  static void noinline rest_init(void)
389  {
390   kernel_thread(init, NULL, CLONE_FS | CLONE_SIGHAND);
391   unlock_kernel();
392   cpu_idle();
393  }
-----------------------------------------------------------------------

Line 388

You might have noticed that this is the first routine start_kernel() calls that is not marked __init. Recall from Chapter 2 that when a function is preceded by __init, the memory used to hold its code and variables is cleared/freed once initialization nears completion. This is done through a call to free_initmem(), which we see in a moment when we explore what happens in init(). The reason rest_init() is not an __init function is that it starts the init thread before it completes (with the call to cpu_idle()). Because the init thread executes the call to free_initmem(), there is the possibility of a race condition in which free_initmem() is called before rest_init() (the root thread) has finished.

Line 390

This line creates the init thread, which is also referred to as the init process or process 1. For brevity, all we say here is that this thread shares all kernel data structures with the calling process. The kernel thread calls the init() function, which we look at in the next section.

Line 391

The unlock_kernel() routine does nothing if only a single processor exists. Otherwise, it releases the BKL.

Line 392

The call to cpu_idle() is what turns the root thread into the idle thread. This routine yields the processor to the scheduler and is returned to when the scheduler has no other pending process to run.

At this point, we have completed the bulk of the Linux kernel initialization. We now briefly look at what happens in the call to init().




The Linux Kernel Primer: A Top-Down Approach for x86 and PowerPC Architectures. ISBN: 131181637. Year: 2005. Pages: 134.
