Section 14.4. Kernel Module Initialization

   



In earlier versions of BSD, the initialization of the underlying hardware was handled exclusively by custom-written code. Adding a new service or subsystem required an intimate understanding of the entire operating system, which slowed the evolution of the kernel. FreeBSD instead breaks all kernel services into modules and logically sequences these modules at boot time, making it easier to experimentally add new features to the operating system. All subsystems are listed in the /sys/sys/kernel.h include file.

There are two types of kernel modules used in FreeBSD. Modules that can be loaded and unloaded from the system at run time are called kernel loadable modules and are discussed later in this section. Kernel modules that must be loaded at boot time and cannot be removed once they are running are considered to be permanent kernel modules. A permanent module is exported to the system using the SYSINIT macro to identify the module, its initialization routine, and the order in which it should be called.

 SYSINIT(name, subsystem, order, function, identifier) 

All modules are organized into a two-level hierarchy to provide an orderly startup of the system. The subsystem argument is the first level of the hierarchy: each subsystem is assigned a particular numeric constant that creates the first-level ordering of the modules to be loaded. The second level of the hierarchy is handled by the order argument; if two modules are in the same subsystem, their order arguments determine which is initialized first. The function argument is the initialization function that the kernel calls at system startup.

After the assembly-language code has completed its work, it calls the first kernel routine written in C: the mi_startup() routine. The mi_startup() routine first sorts the list of modules that need to be started and then calls each module's function routine. Some initializations are machine dependent, relating only to a particular hardware architecture; others are machine independent, comprising services that do not need to be aware of the underlying hardware. Each kernel module is implemented as hardware dependent or independent according to its needs. Once the assembly-language startup sequence has completed, any other assembly-language code necessary to bring up the system is called through C entry points. This separation of dependent and independent initializations makes the task of coding the startup sequence on a new platform easier than in earlier releases of BSD.

Basic Services

Before FreeBSD can do any useful work, it must set up some basic services that are used by all the subsystems inside the kernel. These services include support for mutexes, the lock manager, and the kernel memory manager. They are shown in Table 14.2. All these services are initialized early in the startup sequence so that they can be used by the rest of the kernel.

Table 14.2. Basic services.

    Module                     First Routine
    SI_SUB_MTX_POOL_STATIC     mtx_pool_setup_static()
    SI_SUB_LOCKMGR             lockmgr_init()
    SI_SUB_VM                  vm_mem_init()
    SI_SUB_KMEM                kmeminit()
    SI_SUB_KVM_RSRC            vmmapentry_rsrc_init()
    SI_SUB_WITNESS             witness_initialize()
    SI_SUB_MTX_POOL_DYNAMIC    mtx_pool_setup_dynamic()
    SI_SUB_LOCK                selectinit()
    SI_SUB_EVENTHANDLER        eventhandler_init()
    SI_SUB_KLD                 linker_init()
    SI_SUB_CPU                 cpu_startup()


Mutexes, which were covered in Section 4.3, are initialized in two different pools, one static and one dynamic. The initialization is split because dynamic pools use the kernel memory allocator, which must be set up after the static mutexes. Because kernel services depend tightly on one another, changes to the ordering of kernel services need to be carefully considered.

Directly after the lock manager is initialized, the system enables the virtual memory system with a call to vm_mem_init(). Once the vm_mem_init() routine has completed its work, all memory allocations by the kernel or processes are for virtual addresses that are translated by the memory-management hardware into physical addresses. With the virtual memory system running, the kernel now starts its own internal allocator. One last piece of memory-subsystem initialization is to set limits on the resources used by the kernel virtual memory system, which is handled by the vmmapentry_rsrc_init() routine. Kernel modules can now ask for memory using the kernel's malloc() routine.

Moving from a uniprocessor kernel to one that supports SMP required the locking of the kernel's data structures against multiple, independent threads. Debugging multiprocessor code is difficult enough at the process level, where the operating system can provide some help to the programmer. In an operating-system kernel, this problem becomes more complex because a crashed kernel is hard to debug. To aid kernel debugging, FreeBSD added a kernel library that can watch all the locks being taken and released. The witness_initialize() routine instantiates the SI_SUB_WITNESS module as a basic service to provide the witness library to the modules that handle locking.

Some services require their locks to be allocated as part of the basic services. The select system call and the subsystem that handles kernel objects are two such services. The SI_SUB_LOCK module provides a place for services to register their need to have their lock initialization routines called early in the startup sequence.

The event-handler module allows various services to register functions to be called by the kernel when an event occurs. The event-handler service is heavily used by the shutdown sequence, described in Section 14.6, but also handles such diverse events as a device being cloned; the virtual memory system running low on memory; and processes being forked, executed, and exiting. The event-handler module, SI_SUB_EVENTHANDLER, is started as part of the basic services so that other modules can use it when they are started.

The last of the basic services to be started is the kernel-module loader that loads dynamic kernel modules into the system at boot or run-time. The module_init() routine creates the basic data structures to handle module loading and unloading as well as registering an event handler to unload all the modules when the system is shut down. The kernel-module loader's use of event handlers is why this module must be initialized after the event-handler module just discussed.

With the basic services running, the kernel can now finish bringing up the CPU. The SI_SUB_CPU module is really divided among several submodules that are responsible for different components of system startup. The first submodule to be initialized is the Advanced Programmable Interrupt Controller (APIC), which provides hardware interrupt support as well as coordination between multiple CPUs on SMP systems. The apic_init() routine starts the APIC device and then probes the system for other CPUs. Without the APIC device, multiple CPUs in an SMP system would have no way to coordinate their work, so this step is taken early in the system startup sequence. Although the system is not yet ready to start the other CPUs, it does allocate and initialize the data structures that support them. The next subsystem to start is the CPU itself. Although it is already running, having been started by the assembly-language boot code, several other pieces of initialization take place here. The real-time clock is started by the startrtclock() routine, and the CPU information is printed on the console, as are the memory statistics. The buffer system is also started by the CPU module so that the kernel can read data from disk. FreeBSD is heavily dependent on stable storage for swapping and paging memory and for retrieving kernel modules and programs, so the buffer system is set up early in the boot sequence to give the kernel access to these resources.

On SMP systems, the cpu_mp module is forced to start after the APIC and CPU modules because its order argument is set to SI_ORDER_SECOND, whereas the APIC and CPU both have SI_ORDER_FIRST. Because the APIC is essential to an SMP system, the SMP initialization must follow the APIC initialization. The cpu_mp module's initialization function, mp_start(), calls machine-dependent routines that do the real work of starting the other processors. When this routine completes, all the processors in an SMP system are active but will not be given any work until after the entire startup sequence completes.

Kernel Thread Initialization

The kernel has several processes that are set up at boot time and that are necessary whenever the kernel is running. The modules that represent the kernel processes are shown in Table 14.3. The swapper, init, and idle processes are all set up after the CPU has been initialized, but they are not scheduled to run. No processes or kernel threads execute until all the kernel modules have been initialized.

Table 14.3. Kernel process modules.

    Module                First Routine
    SI_SUB_INTRINSIC      proc0_init()
    SI_SUB_VM_CONF        vm_init_limits()
    SI_SUB_RUN_QUEUE      runq_init()
    SI_SUB_KTRACE         ktrace_init()
    SI_SUB_CREATE_INIT    create_init()
    SI_SUB_SCHED_IDLE     idle_setup()


Once the kernel is running, all new processes are created by forking an existing process. However, as there are no processes when the system is first started, the startup code must handcraft the first process. The swapper is always the first process and gets PID 0. Creation of the swapper process started in the assembly-language startup, but now the system has enough services working to nearly finish the job. The proc0_init() routine not only sets up process 0 but also initializes all the global data structures for handling processes and threads, the file descriptor table, and the limits structures. The proc0_init() routine also creates a virtual-memory map that serves as the prototype for all the other processes that will eventually be created.

After the data structures for the swapper process have been initialized, the system then creates the init process by calling the create_init() routine. The init process is set up directly after the swapper process so that it will always have PID 1. Because privileged users communicate with the init process by sending it signals, ensuring that it has a well-known PID means that users will not have to look it up before being able to communicate.

Each CPU in a system has an idle process. It is the idle process's responsibility to halt a CPU when there is no work for it to do. Once the swapper and init processes have been set up, the kernel initializes an idle process for each CPU in the system. If the system has only one CPU, then only one idle process is created. Like all the other kernel processes, the idle process is not started at this point but only created.

Device Module Initialization

With all the kernel's basic services in place and the basic processes created, it is now possible to initialize the rest of the devices in the system, which include the disks, network interfaces, and clocks. Table 14.4 shows the main components used to initialize the device modules.

Table 14.4. Device modules.

    Module              First Routine
    SI_SUB_MBUF         mbuf_init()
    SI_SUB_INTR         intr_init()
    SI_SUB_SOFTINTR     start_softintr(), start_netisr()
    SI_SUB_DEVFS        devfs_init(), devs_set_ready()
    SI_SUB_INIT_IF      if_init()
    SI_SUB_DRIVERS      many different routines
    SI_SUB_CONFIGURE    configure_first()
    SI_SUB_VFS          vfsinit()
    SI_SUB_CLOCKS       initclocks()
    SI_SUB_CLIST        clist_init()


Before any devices can be initialized (in particular, network interfaces), the mbuf subsystem must be set up so that the network interfaces have a set of buffers to use in their own initialization. The mbuf subsystem initialization is handled by the SI_SUB_MBUF module and its mbuf_init() routine. As we discussed in Section 11.3, the mbuf subsystem has two kinds of memory to manage: small mbufs and mbuf clusters. Each type of memory is allocated from its own kernel memory area to keep the allocator simple and to prevent the remaining memory from being fragmented by different-size allocations.

At this point in the startup sequence, hardware interrupts are not enabled on the system. The kernel now sets up all the interrupt threads that will handle interrupts when the system begins to run. The interrupts are set up by two modules: SI_SUB_INTR, which sets up the interrupt threads that handle device interrupts, and SI_SUB_SOFTINTR, which creates soft-interrupt threads. Soft-interrupt threads are used by services that handle asynchronous events that are not generated by hardware. Soft-interrupt threads provide the soft clock, which supports the callout subsystem. They also provide the network thread that dequeues inbound packets from the network interface queues and moves them through the network stack.

As part of bringing up real hardware devices, the kernel first initializes the device filesystem and then readies the network stack to handle devices with a call to the if_init() routine. The if_init() routine does not initialize any network interfaces; it only sets up the data structures that will support them. Finally, the devices themselves are initialized by the SI_SUB_DRIVERS and SI_SUB_CONFIGURE modules. All devices in the system are initialized by autoconfiguration as described in Section 7.5.

Once the devices are configured, the virtual filesystem is initialized. Bringing up the VFS is a multistage process that entails initializing the VFS itself, the vnode subsystem, and the name cache and pathname translation subsystem that maps pathnames to inodes. Support for named pipes is also initialized as part of the VFS.

The next systems to be set up are those relating to the real-time clock provided by the hardware. The initclocks() routine, which is part of the SI_SUB_CLOCKS module, calls the architecture-specific cpu_initclocks() routine to initialize the hardware clock on the system and start it running. Once the hardware clock is running, other services such as support for the Network Time Protocol (NTP), device polling, and the time counter are started. The last data structures to be initialized are in the terminal subsystem, handled by the SI_SUB_CLIST module. All that needs to be done is to allocate the initial set of cblocks and add them to the clist.

Kernel Loadable Modules

Some kernel modules can be loaded, shut down, and unloaded while the system is running. Providing a system where kernel services can be loaded and unloaded at run time has several advantages over a system where all kernel services must be linked in at build time. For systems programmers, being able to load and unload modules at run time means that they are able to develop their code more quickly. Only those modules absolutely necessary to the system, such as the memory manager and scheduler, need to be directly linked into the kernel. A kernel module can be compiled, loaded, debugged, unloaded, modified, compiled, and loaded again without having to link it directly into the kernel or having to reboot the system. When a system is placed in the field, the use of kernel modules makes it possible to upgrade only selected parts of the system, as necessary. Upgrading in the field is absolutely necessary in embedded applications where the system may be physically unreachable, but it is also convenient in more traditional environments where many systems might have to change at the same time, for instance, in a large server farm.

One problem with allowing kernel modules to be loaded and unloaded at run time is that of security. In versions of BSD before FreeBSD, the kernel was inviolate and completely protected from user change at run time. The only way to interact with a running kernel was through the system-call interface. System calls were defined at kernel build time and only provided a narrow channel of communication. This tightly controlled interface provided a layer of security. Although users could crash their own programs, absent a serious bug in the kernel, they could not crash the operating system or other users' processes. With the advent of loadable kernel modules, any user that can acquire root privileges can modify the underlying kernel. Certain services cannot be unloaded, which is a form of protection, but any service that is properly written can be loaded, including those that are malicious. Once a malicious module is loaded into the kernel, there is no protection against it wreaking havoc. A serious flaw in the kernel-module system is that it does not support the digital signing of modules. If a developer wishes to provide a service via a loadable kernel module to the general public, for instance, as a part of a larger application, it would be helpful to consumers of the module if the kernel could verify that the module really came from the stated developer. As yet, a module signing and verification service is not a part of the loadable kernel-module system. For these security reasons, most people who run FreeBSD in a commercial setting continue to use wholly contained kernels and do not load random services at run time.

Loadable kernel modules are declared in the following way:

 DECLARE_MODULE(name, data, subsystem, order) 

Each module has a name, subsystem, and order that serve the same purposes here as they do in the SYSINIT macro. The key difference is the use of the data argument, which is a data structure that is defined in the following way:

 typedef int (*modeventhand_t)(
         struct module *module,
         int command,
         void *argument);

 typedef struct moduledata {
         const char *name;
         modeventhand_t event_handler;
         void *data;
 } moduledata_t;

All modules have an associated version that is declared with the MODULE_VERSION macro. Without a version, it would be impossible to differentiate between different revisions of the same module, making field upgrades difficult. One last macro used by kernel modules is SYSCALL_MODULE_HELPER, which developers use to add new system calls to the kernel.

To have kernel modules that can be loaded both at boot time and at run time, two different cases must be handled. When a module is loaded at boot time, it has already been processed by the kernel's build process, which means that its system-call entry points and other data are already known to the kernel. This knowledge simplifies the loading process. All that needs to be done is to call the module's event handler with the MOD_LOAD command. At run time, the module needs to be loaded into memory, registered with the kernel, and its system calls dynamically added to the system-call table. Once all that work is done, it can be initialized by calling its event handler. All the run-time loading is handled by the kldload system call and the module_register() routine. To keep the interface that programmers use simple, all this functionality is hidden by the DECLARE_MODULE macro and the use of the single event-handler routine. When creating a kernel module, a programmer needs to be concerned only with writing the module event handler and exporting the module handler's system calls via the macros.

Interprocess Communication Startup

The modules involved in the startup of the IPC interfaces are shown in Table 14.5. The first three modules to be loaded and started as part of supporting IPC are System V semaphores, shared memory, and message queues. These local-IPC facilities were covered in Section 11.8, and their initialization is simple. What makes them interesting is that they are the first modules that can be loaded and unloaded at run time.

Table 14.5. Interprocess communication modules.

    Module                         First Routine
    SI_SUB_SYSV_SEM                sysvsem_modload()
    SI_SUB_SYSV_SHM                sysvshm_modload()
    SI_SUB_SYSV_MSG                sysvmsg_modload()
    SI_SUB_PROTO_IF                if_check()
    SI_SUB_PROTO_DOMAIN            domaininit()
    SI_SUB_PROTO_IFATTACHDOMAIN    if_attachdomain()


As an example, we will use the SI_SUB_SYSV_SEM module, which implements System V semaphores. This module is declared to the system as follows:

 static moduledata_t sysvsem_mod = {
         "sysvsem",
         &sysvsem_modload,
         NULL
 };

 SYSCALL_MODULE_HELPER(semsys);
 SYSCALL_MODULE_HELPER(_semctl);
 SYSCALL_MODULE_HELPER(semget);
 SYSCALL_MODULE_HELPER(semop);

 DECLARE_MODULE(sysvsem, sysvsem_mod,
         SI_SUB_SYSV_SEM, SI_ORDER_FIRST);
 MODULE_VERSION(sysvsem, 1);

All the system calls provided by the module are declared in the SYSCALL_MODULE_HELPER macros and should be familiar from Section 11.8. The DECLARE_MODULE macro ensures that an entry in the startup sequence is added for the module and that its order is first. There are no other modules in this subsystem, so the order argument is not important. The routine that the startup sequence calls first is not sysvsem_modload() but module_register_init(), which then calls the module's sysvsem_modload() routine. Since this module was linked into the kernel at build time, all its system calls and other data are already linked with the kernel. Once the module_register_init() routine completes, the module is loaded and ready for use.

Following the local-IPC subsystems, several pseudo-devices are started. The pseudo-devices represent diverse services within the kernel that present themselves as devices to the rest of the system. Examples of pseudo-devices include the frame buffer, virtual LAN support, network firewalling, and the cryptography subsystems. All the pseudo-devices are declared as kernel loadable modules.

The last three modules, SI_SUB_PROTO_IF, SI_SUB_PROTO_DOMAIN, and SI_SUB_PROTO_IFATTACHDOMAIN, initialize the networking code discussed in Part IV. The SI_SUB_PROTO_IF module has a single routine, if_check(), that checks each network interface in the system to see if it has been initialized correctly. SI_SUB_PROTO_DOMAIN is a set of modules that initialize the various communication domains supported by the kernel. Once the basic data structures for domains are initialized by the domaininit() routine, the communication domains themselves are initialized. Each domain is responsible for declaring itself to the system by using the DOMAIN_SET macro, which provides an initialization routine for the kernel to call, just like the SYSINIT macro. Unlike many of the preceding services, which were declared as modules, network domains are declared such that they cannot be removed. Domains are locked into the kernel once they are loaded because the socket layer may be using a domain's data structures for active communication, and the result of unloading a domain while it is in use is undefined. Making it possible to unload a communication domain at run time would require many changes to the socket layer and domain modules.

Start Kernel Threads

The collection of modules shown in Table 14.6 finishes the setup of all the threads necessary to have a working kernel. The swapper, init, pagezero, pagedaemon, bufdaemon, vnlru, and syncer kernel threads, as well as the kernel threads for the network filesystem (NFS), are started here. To this point the system has been running in uniprocessor mode. On a multiprocessor system the SI_SUB_SMP module calls the last of the machine-dependent code, the release_aps() routine, to start up the other processors in the system. Now it is time to start the system running. The SI_SUB_RUN_SCHEDULER module is treated specially in the startup sequence because it must always execute last. In the list of modules, it is given the highest possible subsystem number, all 1s, so that it is not accidentally placed earlier in the list. The scheduler() routine is called and never returns. It immediately begins scheduling kernel threads and user-level processes to run on the system's CPUs.

Table 14.6. Kernel thread modules.

    Module                    First Routine
    SI_SUB_INTRINSIC_POST     proc0_post()
    SI_SUB_KTHREAD_INIT       kick_init()
    SI_SUB_KTHREAD_PAGE       vm_pageout()
    SI_SUB_KTHREAD_VM         vm_daemon()
    SI_SUB_KTHREAD_BUF        buf_daemon()
    SI_SUB_KTHREAD_UPDATE     vnlru_proc()
    SI_SUB_KTHREAD_IDLE       ald_daemon()
    SI_SUB_SMP                release_aps()
    SI_SUB_RUN_SCHEDULER      scheduler()



   
 


The Design and Implementation of the FreeBSD Operating System
ISBN: 0201702452
Year: 2003