1.9. Overview of the Linux Kernel

There are various components to the Linux kernel. Throughout this book, we use the words component and subsystem interchangeably to refer to these categorical and functional divisions of the kernel.

In the following sections, we discuss some of those components and how they are implemented in the Linux kernel. We also cover some key features of the operating system that provide insight into how things are implemented in the kernel. We break up the components into filesystem, processes, scheduler, and device drivers. Although this is not intended to be a comprehensive list, it provides a reference for the rest of this book.

1.9.1. User Interface

Users communicate with the system by way of programs. A user first logs in to the system through a terminal or a virtual terminal. In Linux, a program (mingetty for virtual terminals or agetty for serial terminals) monitors the inactive terminal, waiting for a user to signal the wish to log in. To do this, the user enters an account name, and the getty program calls the login program, which prompts for a password, checks it against a list of names and passwords, and lets the user into the system if there is a match or exits and terminates the process if there is not. The getty programs are respawned once terminated, which means that a new getty starts whenever the process exits.

Once authenticated in the system, users need a way to tell the system what they want to do. If the user is authenticated successfully, the login program executes a shell. Although technically not part of the operating system, the shell is the primary user interface to the operating system. A shell is a command interpreter and consists of a listening process. The listening process (one that blocks until the condition of receiving input is met) then interprets and executes the requests typed in by the user. The shell is one of the programs found in the top layer of Figure 1.1.

The shell displays a command prompt (which is generally configurable, depending on the shell) and waits for user input. A user can then interact with the system's devices and programs by entering commands in a syntax defined by the shell.

The programs a user can call are executable files stored within the filesystem. The shell initiates the execution of a request by spawning a child process. The child process might then make system calls. After the system calls return and the child process terminates, the shell goes back to listening for user requests.
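The spawn-and-wait cycle just described can be sketched in a few lines. This is a minimal illustration, not how any particular shell is written; the function name run_command and the exit status 127 for a program that cannot be executed are our own choices (the latter follows a common shell convention).

```python
import os
import sys


def run_command(argv):
    """Run argv the way a shell does: fork a child, exec the program, wait."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the requested program.
        try:
            os.execvp(argv[0], argv)
        except OSError:
            # exec failed (for example, no such file); leave without
            # running any of the parent's cleanup code.
            os._exit(127)
    # Parent (the "shell"): block until the child terminates.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    # Child was killed by a signal; report it as a negative number.
    return -os.WTERMSIG(status)
```

A call such as run_command([sys.executable, "-c", "import sys; sys.exit(3)"]) returns the child's exit status, after which the caller is free to read the next request.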

1.9.2. User Identification

A user logs in with a unique account name. However, he is also associated with a unique user ID (UID). The kernel uses this UID to validate the user's permissions with respect to file accesses. When a user logs in, he is granted access to his home directory, which is where he can create, modify, and destroy files. It is important in a multiuser system, such as Linux, to associate users with access permission and/or restrictions to prevent users from interfering with the activity of other users and accessing their data. The superuser or root is a special user with unrestricted permissions; this user's UID is 0.

A user is also a member of one or more groups, each of which has its own unique group ID (GID). On many Linux distributions, a newly created user is automatically made a member of a group whose name is identical to the username. A user can also be manually added to other groups that have been defined by the system administrator.

A file or a program (an executable file) is associated with permissions as they apply to users and groups. Any particular user can determine who is allowed to access his files and who is not. A file will be associated with a particular UID and a particular GID.
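As a rough illustration of these identifiers, the following sketch asks the system who the calling process is running as. The function name current_identity and the shape of the returned dictionary are our own; the underlying calls (getuid, getgid, getgroups, and the account database lookup) are standard.

```python
import os
import pwd


def current_identity():
    """Collect the UID, GID, and group memberships of the calling process."""
    uid = os.getuid()
    try:
        name = pwd.getpwuid(uid).pw_name
    except KeyError:
        name = None  # the UID has no entry in the account database
    return {
        "uid": uid,
        "name": name,
        "gid": os.getgid(),
        "groups": sorted(os.getgroups()),
        "is_root": uid == 0,  # the superuser always has UID 0
    }
```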

1.9.3. Files and Filesystems

A filesystem provides a method for the storage and organization of data. Linux supports the concept of the file as a device-independent sequence of bytes. By means of this abstraction, a user can access a file regardless of what device (for example, hard disk, tape drive, disk drive) stores it. Files are grouped inside a container called a directory. Because directories can be nested in each other (which means that a directory can contain another directory), the filesystem structure is that of a hierarchical tree. The root of the tree is the top-most node under which all other directories and files are stored. It is identified by a forward slash (/). A filesystem is stored in a hard-drive partition, or unit of storage.

1.9.3.1. Directories, Files, and Pathnames

Every file in a tree has a pathname that indicates its name and location. Every file also belongs to a directory. A pathname that takes the current working directory, or the directory the user is located in, as its root is called a relative pathname, because the file is named relative to the current working directory. An absolute pathname is a pathname that is taken from the root of the filesystem (that is, a pathname that starts with a /). In Figure 1.2, the absolute pathname of user paul's file.c is /home/paul/src/file.c. If we are located inside paul's home directory, the relative pathname is simply src/file.c.

Figure 1.2. Hierarchical File Structure

The concepts of absolute versus relative pathnames come into play because the kernel associates processes with the current working directory and with a root directory. The current working directory is the directory from which the process was called and is identified by a . (pronounced "dot"). As an aside, the parent directory is the directory that contains the working directory and is identified by a .. (pronounced "dot dot"). Recall that when a user logs in, she is "located" in her home directory. If Anna tells the shell to execute a particular program, such as ls, as soon as she logs in, the process that executes ls has /home/anna as its current working directory (whose parent directory is /home) and / will be its root directory. The root is always its own parent.
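The relationship between relative pathnames, the current working directory, and the root can be made concrete with a short sketch. The helper resolve is a name we introduce for illustration; it works purely on the text of the pathname and does not consult the filesystem, so it ignores symbolic links.

```python
import os


def resolve(pathname, cwd):
    """Return the absolute pathname that `pathname` names when the
    current working directory is `cwd`."""
    # os.path.join discards cwd if pathname is already absolute;
    # normpath then collapses "." and ".." components lexically.
    return os.path.normpath(os.path.join(cwd, pathname))
```

With paul's home directory as the working directory, resolve("src/file.c", "/home/paul") gives /home/paul/src/file.c, resolve("..", "/home/anna") gives /home, and resolve("..", "/") gives / because the root is its own parent.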

1.9.3.2. Filesystem Mounting

In Linux, as in all UNIX-like systems, a filesystem is only accessible if it has been mounted. A filesystem is mounted with the mount system call and is unmounted with the umount system call. A filesystem is mounted on a mount point, which is a directory used as the root access to the mounted filesystem. A directory mount point should be empty. Any files originally located in the directory used as a mount point are inaccessible after the filesystem is mounted and remain so until the filesystem is unmounted. The /etc/mtab file holds the table of mounted filesystems while /etc/fstab holds the filesystem table, which is a table listing all the system's filesystems and their attributes. /etc/mtab lists the device of the mounted filesystem and associates it with its mount point and any options with which it was mounted.^[8]

^[8] The options are passed as parameters to the mount system call.
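To show how the table of mounted filesystems associates a device with its mount point and options, here is a small parser for text in the /etc/mtab layout (device, mount point, filesystem type, options, followed by two numeric fields). The function name and the sample entries are hypothetical and exist only for illustration; mount points containing escaped spaces are not handled.

```python
def parse_mount_table(text):
    """Parse mtab-style text into a list of dictionaries."""
    mounts = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete entry
        device, mount_point, fstype, options = fields[:4]
        mounts.append({
            "device": device,
            "mount_point": mount_point,
            "type": fstype,
            "options": options.split(","),
        })
    return mounts


# Hypothetical entries, not taken from a real system.
SAMPLE_MTAB = """\
/dev/hda1 / ext3 rw 0 0
proc /proc proc rw 0 0
/dev/hda3 /home ext3 rw,noatime 0 0
"""
```

Calling parse_mount_table(SAMPLE_MTAB) yields three entries, the last of which maps /dev/hda3 to /home with the options rw and noatime.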

1.9.3.3. File Protection and Access Rights

Files have access permissions to provide some degree of privacy and security. Access rights or permissions are stored as they apply to three distinct categories of users: the file's owner, a designated group, and everyone else. The three types of users can be granted varying access rights as applied to the three types of access to a file: read, write, and execute. When we execute a file listing with ls -al, we get a view of the file permissions:

 lkp:~# ls -al /home/sophia
 drwxr-xr-x 22 sophia sophia      4096 Mar 14 15:13 .
 drwxr-xr-x 24 root   root        4096 Mar  7 18:47 ..
 drwxrwx---  3 sophia department  4096 Mar  4 08:37 sources

The first entry lists the access permissions of sophia's home directory. According to this, she has granted everyone the ability to enter her home directory and list its contents but not to modify it. She herself has read, write, and execute permission.^[9] The second entry indicates the access rights of the parent directory /home. /home is owned by root but it allows everyone to read and execute. In sophia's home directory, she has a directory called sources, for which she has granted read, write, and execute permissions to herself and to members of the group called department, and no permissions to anyone else.

^[9] Execute permission, as applied to a directory, indicates that a user can enter it. Execute permission as applied to a file indicates that it can be run and is used only on executable files.
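The mode strings in the listing are a rendering of permission bits stored with each file. As a sketch, the following decodes a numeric mode into the three categories and three access types; describe_permissions is our own helper name, and the octal modes 0o755 and 0o770 are the values we infer from the strings drwxr-xr-x and drwxrwx--- in the listing.

```python
import stat


def describe_permissions(mode):
    """Split a numeric file mode into read/write/execute flags per category."""
    categories = (
        ("user", stat.S_IRUSR, stat.S_IWUSR, stat.S_IXUSR),
        ("group", stat.S_IRGRP, stat.S_IWGRP, stat.S_IXGRP),
        ("other", stat.S_IROTH, stat.S_IWOTH, stat.S_IXOTH),
    )
    return {
        name: {
            "read": bool(mode & r),
            "write": bool(mode & w),
            "execute": bool(mode & x),
        }
        for name, r, w, x in categories
    }


# Modes inferred from the listing: the home directory and "sources".
home_mode = stat.S_IFDIR | 0o755
sources_mode = stat.S_IFDIR | 0o770
```

Passing either mode to stat.filemode() reproduces the string that ls prints, which is a convenient way to check the decoding.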

1.9.3.4. File Modes

In addition to access rights, a file has three additional modes: sticky, suid, and sgid. Let's look at each mode more closely.

sticky

A file with the sticky bit enabled has a "t" in the last character of the mode field (for example, -rwx-----t). Back in the day when disk accesses were slower than they are today, when memory was not as large, and when demand-based methodologies hadn't been conceived,^[10] an executable file could have the sticky bit enabled and ensure that the kernel would keep it in memory despite its state of execution. When applied to a program that was heavily used, this could increase performance by reducing the amount of time spent accessing the file's information from disk.

^[10] This refers to techniques that exploit the principle of locality with respect to loaded program chunks. We see more of this in detail in Chapter 4.

When the sticky bit is enabled in a directory, it prevents the removal or renaming of files by users who have write permission in that directory (with the exception of root and the owner of the file).

suid

An executable with the suid bit set has an "s" where the "x" character goes for the user-permission bits (for example, -rws------). When a user executes an executable file, the process is associated with the user who called it. If an executable has the suid bit set, the process inherits the UID of the file owner and thus access to its set of access rights. This introduces the concepts of the real user ID as opposed to the effective user ID. As we soon see when we look at processes in the "Processes" section, a process' real UID corresponds to that of the user that started the process. The effective UID is often the same as the real UID unless the setuid bit was set in the file. In that case, the effective UID holds the UID of the file owner.

suid has been exploited by hackers who call executable files owned by root with the suid bit set and redirect the program operations to execute instructions that they would otherwise not be allowed to execute with root permissions.

sgid

An executable with the sgid bit set has an "s" where the "x" character goes for the group permission bits (for example, -rwxrws---). The sgid bit acts just like the suid bit but as applied to the group. A process also has a real group ID and an effective group ID that holds the GID of the user and the GID of the file group, respectively.
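The three modes are stored as extra bits alongside the permission bits. The sketch below tests for them; special_bits is a name we introduce, and the octal values in the usage note are simply modes we chose so that they render as the example strings given in this section.

```python
import stat


def special_bits(mode):
    """Return the names of the special mode bits set in `mode`."""
    names = []
    if mode & stat.S_ISUID:
        names.append("suid")
    if mode & stat.S_ISGID:
        names.append("sgid")
    if mode & stat.S_ISVTX:
        names.append("sticky")
    return names
```

For instance, stat.filemode(stat.S_IFREG | 0o4700) gives -rws------, stat.filemode(stat.S_IFREG | 0o2770) gives -rwxrws---, and stat.filemode(stat.S_IFREG | 0o1701) gives -rwx-----t. A running process can compare its real and effective IDs with os.getuid() and os.geteuid().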

1.9.3.5. File Metadata

File metadata is all the information about a file that does not include its content. For example, metadata includes the type of file, the size of the file, the UID of the file owner, the access rights, and so on. As we soon see, some file types (devices, pipes, and sockets) contain no data, only metadata. All file metadata, with the exception of the filename, is stored in an inode or index node. An inode is a block of information, and every file has its own inode. A file descriptor is the handle through which a process refers to an open file; the kernel maps it to the internal data structures that manage access to the file data. A process obtains a file descriptor when it opens a file.
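One way to see that the filename is not part of the inode is to rename a file and compare its metadata before and after. The following sketch does this in a temporary directory; the function names and the file names old-name and new-name are ours.

```python
import os
import stat
import tempfile


def file_metadata(path):
    """Return a few of the fields the inode stores for `path`."""
    st = os.stat(path)
    return {
        "inode": st.st_ino,
        "size": st.st_size,
        "uid": st.st_uid,
        "gid": st.st_gid,
        "mode": stat.filemode(st.st_mode),
        "links": st.st_nlink,
    }


def rename_keeps_inode():
    """Rename a scratch file and return its metadata before and after."""
    with tempfile.TemporaryDirectory() as directory:
        old = os.path.join(directory, "old-name")
        new = os.path.join(directory, "new-name")
        with open(old, "wb") as f:
            f.write(b"hello")
        before = file_metadata(old)
        os.rename(old, new)  # changes the directory entry, not the inode
        after = file_metadata(new)
    return before, after
```

The two dictionaries returned by rename_keeps_inode() are equal: the inode number, size, owner, and mode all survive the change of name.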

1.9.3.6. Types of Files

UNIX-like systems have various file types.

Regular File

A regular file is identified by a dash in the first character of the mode field (for example, -rw-rw-rw-). A regular file can contain ASCII data or binary data if it is an executable file. The kernel does not care what type of data is stored in a file and thus makes no distinctions between them. User programs, however, might care. Regular files have their data stored in zero or more data blocks.^[11]

^[11] An empty file has zero data blocks.

Block Devices

A block device is identified by a "b" in the first character of the mode field (for example, brw-------). These files represent a hardware device on which I/O is performed in discretely sized blocks in powers of 2. Block devices include disk drives and are accessed through the /dev directory in the filesystem.^[12] Disk accesses can be time consuming; therefore, data transfer for block devices goes through the kernel's buffer cache, which stores data temporarily to reduce the number of costly disk accesses. At certain intervals, the kernel looks at the data in the buffer cache that has been updated and synchronizes it with the disk. This provides great increases in performance; however, a computer crash can result in loss of the buffered data if it had not yet been written to disk. Synchronization with the disk drive can be forced with a call to the sync, fsync, or fdatasync system calls, which take care of writing buffered data to disk. The device file itself does not use any data blocks because it stores no data. Only an inode is required to hold its information.

^[12] The mount system call requires a block file.
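A program that cannot afford to lose buffered data typically forces synchronization itself. The following is a minimal sketch of that pattern using fsync; write_durably, demo, and the file name journal.bin are names we made up for the example.

```python
import os
import tempfile


def write_durably(path, data):
    """Write `data` to `path` and ask the kernel to flush it to the device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        written = 0
        while written < len(data):
            # write() may accept fewer bytes than requested, so loop.
            written += os.write(fd, data[written:])
        # Block until the buffered data for this file has been handed
        # to the storage device.
        os.fsync(fd)
    finally:
        os.close(fd)
    return written


def demo():
    """Write a scratch file durably and read it back."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "journal.bin")
        written = write_durably(path, b"buffered no more")
        with open(path, "rb") as f:
            return written, f.read()
```

Without the fsync call the write would still succeed, but the data could remain in the cache until the kernel's next periodic flush.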

Character Devices

A character device is identified by a "c" in the first character of the mode field (for example, crw-------). These files represent a hardware device that is not block structured and on which I/O occurs in streams of bytes and is transferred directly between the device driver and the requesting process. These devices include terminals and serial devices and are accessed through the /dev directory in the filesystem. Pseudo devices, or device drivers that do not represent hardware but instead perform some unrelated kernel-side function, can also be character devices. These devices are also known as raw devices because there is no intermediary cache to hold the data. Similar to a block device, a character device does not use any data blocks because it stores no data. Only an inode is required to hold its information.

Link

A symbolic link is identified by an "l" in the first character of the mode field (for example, lrwxrwxrwx). A link is a pointer to a file. This type of file allows there to be multiple references to a particular file while only one copy of the file and its data actually exists in the filesystem. There are two types of links: hard links and symbolic, or soft, links. Both can be created with the ln command. A hard link has limitations that are absent in the symbolic link. These include being limited to linking files within the same filesystem, being unable to link to directories, and being unable to link to nonexistent files. A link reflects the permissions of the file to which it points.
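The difference between the two kinds of link shows up when the original name is removed. The sketch below creates one of each in a temporary directory and reports what happens; link_demo and the file names are our own, and the calls used are the link and symlink wrappers rather than the ln command.

```python
import os
import tempfile


def link_demo():
    """Create a hard link and a symbolic link, then remove the original name."""
    with tempfile.TemporaryDirectory() as directory:
        target = os.path.join(directory, "data.txt")
        hard = os.path.join(directory, "hard.txt")
        soft = os.path.join(directory, "soft.txt")
        with open(target, "w") as f:
            f.write("payload")

        os.link(target, hard)     # second name for the same inode
        os.symlink(target, soft)  # small file that stores the target's path

        result = {
            "same_inode": os.stat(target).st_ino == os.stat(hard).st_ino,
            "link_count": os.stat(target).st_nlink,
            "soft_is_link": os.path.islink(soft),
        }

        os.remove(target)
        with open(hard) as f:
            # The data is still reachable through the remaining hard link.
            result["hard_survives"] = f.read() == "payload"
        # The symbolic link still exists but now points at nothing.
        result["soft_dangles"] = os.path.islink(soft) and not os.path.exists(soft)
        return result
```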

Named Pipes

A pipe file is identified by a "p" in the first character of the mode field (for example, prw-------). A pipe is a file that facilitates communication between programs by acting as a data conduit; data is written into it by one program and read by another. The pipe essentially buffers its input data from the first process. Named pipes are also known as FIFOs because they relay the information to the reading program on a first in, first out basis. Much like the device files, no data blocks are used by pipe files, only the inode.
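A short sketch of two processes communicating through a named pipe follows. The function name fifo_round_trip and the file name demo.fifo are hypothetical; the pipe is created in a temporary directory so that the example cleans up after itself.

```python
import os
import tempfile


def fifo_round_trip(message):
    """Send `message` (bytes) from a child process to the parent via a FIFO."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "demo.fifo")
        os.mkfifo(path, 0o600)

        pid = os.fork()
        if pid == 0:
            # Child: the writer. Opening blocks until a reader opens too.
            try:
                with open(path, "wb") as fifo:
                    fifo.write(message)
            finally:
                os._exit(0)  # never fall back into the parent's code

        # Parent: the reader. read() returns once the writer closes its end.
        with open(path, "rb") as fifo:
            received = fifo.read()
        os.waitpid(pid, 0)
        return received
```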

Sockets

A socket is identified by an "s" in the first character of the mode field (for example, srw-------). Sockets are special files that also facilitate communication between two processes. One difference between pipes and sockets is that sockets can facilitate communication between processes on different computers connected by a network. Socket files are also not associated with any data blocks. Because this book does not cover networking, we do not go over the internals of sockets.
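The first character of the mode field is derived from type bits stored in the file's mode. As a sketch, the following maps a numeric mode to the file types described in this section; file_type is a name we introduce for illustration.

```python
import stat


def file_type(mode):
    """Name the file type encoded in a numeric mode."""
    checks = (
        (stat.S_ISREG, "regular file"),
        (stat.S_ISDIR, "directory"),
        (stat.S_ISBLK, "block device"),
        (stat.S_ISCHR, "character device"),
        (stat.S_ISLNK, "symbolic link"),
        (stat.S_ISFIFO, "named pipe"),
        (stat.S_ISSOCK, "socket"),
    )
    for predicate, name in checks:
        if predicate(mode):
            return name
    return "unknown"
```

To classify a real file, pass os.lstat(path).st_mode; lstat is used rather than stat so that a symbolic link is reported as a link instead of being followed.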

1.9.3.7. Types of Filesystems

Linux filesystems support an interface that allows various filesystem types to coexist. A filesystem type is determined by the way the block data is broken down and manipulated in the physical device and by the type of physical device. Some examples of types of filesystems include network mounted, such as NFS, and disk based, such as ext3, which is one of the Linux default filesystems. Some special filesystems, such as /proc, provide access to kernel data and address space.

1.9.3.8. File Control

When a file is accessed in Linux, control passes through a number of stages. First, the program that wants to access the file makes a system call, such as open(), read(), or write(). Control then passes to the kernel that executes the system call. There is a high-level abstraction of a filesystem called VFS, which determines what type of specific filesystem (for example, ext2, minix, and msdos) the file exists upon, and control is then passed to the appropriate filesystem driver.

The filesystem driver handles the management of the file upon a given logical device. A hard drive could have msdos and ext2 partitions. The filesystem driver knows how to interpret the data stored on the device and keeps track of all the metadata associated with a file. Thus, the filesystem driver stores the actual file data and incidental information such as the timestamp, group and user modes, and file permissions (read/write/execute).

The filesystem driver then calls a lower-level device driver that handles the actual reading of the data off of the device. This lower-level driver knows about blocks, sectors, and all the hardware information that is necessary to take a chunk of data and store it on the device. The lower-level driver passes the information up to the filesystem driver, which interprets and formats the raw data and passes the information to the VFS, which finally transfers the data back to the originating program.
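From the program's side, this whole chain is hidden behind a handful of calls. The sketch below reads a file using thin wrappers around open(), read(), and close(); the names read_all and demo, and the deliberately small chunk size, are ours.

```python
import os
import tempfile


def read_all(path, chunk_size=4096):
    """Read a whole file through the open/read/close system call wrappers."""
    fd = os.open(path, os.O_RDONLY)  # the kernel hands back a file descriptor
    try:
        chunks = []
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:  # an empty result signals end of file
                break
            chunks.append(chunk)
    finally:
        os.close(fd)
    return b"".join(chunks)


def demo():
    """Write a scratch file, then read it back three bytes at a time."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "sample.txt")
        with open(path, "wb") as f:
            f.write(b"0123456789")
        return read_all(path, chunk_size=3)
```

Nothing in read_all depends on which filesystem holds the file; that decision is made inside the kernel on each call.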

1.9.4. Processes

If we consider the operating system to be a framework that developers can build upon, we can consider processes to be the basic unit of activity undertaken and managed by this framework. More specifically, a process is a program that is in execution. A single program can be executed multiple times so there might be more than one process associated with a particular program.

The concept of processes became significant with the introduction of multiuser systems in the 1960s. Consider a single-user operating system where the CPU executes only a single process. In this case, no other program can be executed until the currently running process is complete. When multiple users are introduced (or if we want the ability to perform multiple tasks concurrently), we need to define a way to switch between the tasks.

The process model makes the execution of multiple tasks possible by defining execution contexts. In Linux, each process operates as though it were the only process. The operating system then manages these contexts by assigning the processor to work on one or the other according to a predefined set of rules. The scheduler defines and executes these rules. The scheduler tracks the length of time the process has run and switches it out to ensure that no one process hogs the CPU.

The execution context consists of all the parts associated with the program such as its data (and the memory address space it can access), its registers, its stack and stack pointer, and the program counter value. Except for the data and the memory addressing, the rest of the components of a process are transparent to the programmer. However, the operating system needs to manage the stack, stack pointer, program counter, and machine registers. In a multiprocess system, the operating system must also be responsible for the context switch between processes and the management of system resources that processes contend for.

1.9.4.1. Process Creation and Control

A process is created from another process with a call to the fork() system call. When a process calls fork(), we say that the process spawned a new process, or that it forked. The new process is considered the child process and the original process is considered the parent process. All processes have a parent, with the exception of the init process. All processes are spawned from the first process, init, which comes about during the bootstrapping phase. This is discussed further in the next section.

As a result of this child/parent model, a system has a process tree that can define the relationships between all the running processes. Figure 1.3 illustrates a process tree.

Figure 1.3. Process Tree

When a child process is created, the parent process might want to know when it is finished. The wait() system call is used to pause the parent process until its child has exited.

A process can also replace the program it is executing with another program. This is done, for example, by the mingetty program previously described. When a user requests access into the system, mingetty requests his username and then replaces itself with the login program, to which it passes the username as a parameter. This replacement is done with a call to one of the exec() system calls.
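A consequence of this replacement is that the process keeps its identity while the program changes. The following sketch forks a child, has the child exec a fresh interpreter that prints its own PID, and lets the parent compare that number with the PID fork() returned. The function name exec_keeps_pid is hypothetical, and a Python interpreter stands in for the login program.

```python
import os
import sys


def exec_keeps_pid():
    """Return (pid from fork, pid reported by the program run after exec)."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: route standard output into the pipe, then replace
        # this program with a new one.
        try:
            os.close(read_end)
            os.dup2(write_end, 1)
            os.execv(sys.executable,
                     [sys.executable, "-c", "import os; print(os.getpid())"])
        finally:
            os._exit(127)  # reached only if exec failed

    # Parent: read what the new program printed, then wait for it.
    os.close(write_end)
    with os.fdopen(read_end) as pipe:
        reported = int(pipe.read())
    os.waitpid(pid, 0)
    return pid, reported
```

Both numbers are the same: exec changed what the child was running, not which process it was.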

1.9.4.2. Process IDs

Every process has a unique identifier known as the process ID (PID). A PID is a non-negative integer. Process IDs are handed out in incrementing sequential order as processes are created. When the maximum PID value is hit, the values wrap and PIDs are handed out starting at the lowest available number greater than 1. There are two special processes: process 0 and process 1. Process 0 is the process that is responsible for system initialization and for spawning off process 1, which is also known as the init process. All processes in a running Linux system are descendants of process 1. After it spawns init, process 0 becomes the idle task, which runs only when no other process is runnable. Chapter 8, "Booting the Kernel," discusses this process in "The Beginning: start_kernel()" section.

Two system calls are used to identify processes. The getpid() system call retrieves the PID of the current process, and the getppid() system call retrieves the PID of the process' parent.
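These two calls can be seen working together in a parent and child pair. In this sketch, whose function name child_reports_ids is our own, the child sends its view of both identifiers back to the parent over a pipe.

```python
import os


def child_reports_ids():
    """Return (pid from fork, child's getpid(), child's getppid())."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: report who we are and who our parent is.
        try:
            os.close(read_end)
            os.write(write_end, f"{os.getpid()} {os.getppid()}".encode())
        finally:
            os._exit(0)

    os.close(write_end)
    with os.fdopen(read_end) as pipe:
        child_pid, parent_pid = map(int, pipe.read().split())
    os.waitpid(pid, 0)
    return pid, child_pid, parent_pid
```

The value fork() returns to the parent matches what getpid() returns in the child, and the child's getppid() matches the parent's own getpid().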

1.9.4.3. Process Groups

A process can be a member of a process group, whose members all share the same process group ID. A process group facilitates associating a set of processes. This is something you might want to do, for example, if you want to ensure that otherwise unrelated processes receive a kill signal at the same time. The process whose PID is identical to the process group ID is considered the group leader. Process group IDs can be manipulated by calling the getpgid() and setpgid() system calls, which retrieve and set the process group ID of the indicated process, respectively.
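As an illustration, the sketch below forks a child that starts out in its parent's process group and then makes itself the leader of a new one. The function name become_group_leader is hypothetical; passing 0 to getpgid() and setpgid() means "the calling process."

```python
import os


def become_group_leader():
    """Return (child pid, child's group before, child's group after)."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(read_end)
            before = os.getpgid(0)  # inherited from the parent
            os.setpgid(0, 0)        # new group whose ID is our own PID
            after = os.getpgid(0)
            os.write(write_end, f"{os.getpid()} {before} {after}".encode())
        finally:
            os._exit(0)

    os.close(write_end)
    with os.fdopen(read_end) as pipe:
        child_pid, before, after = map(int, pipe.read().split())
    os.waitpid(pid, 0)
    return child_pid, before, after
```

After the call to setpgid(), the child's process group ID equals its PID, which is exactly the condition that makes it the group leader.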

1.9.4.4. Process States

Processes can be in different states depending on the scheduler and the availability of the system resources for which the process contends. A process might be in a runnable state if it is currently being executed or in a run queue, which is a structure that holds references to processes that are in line to be executed. A process can be sleeping if it is waiting for a resource or has yielded to another process, dead if it has been killed, and defunct or zombie if a process has exited before its parent was able to call wait() on it.
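The zombie state is easy to provoke deliberately. The following sketch assumes a Linux system with /proc mounted: it forks a child that exits immediately, waits for the exit without collecting it, and reads the state letter the kernel reports for the child. The function names are ours; the state is the field that follows the parenthesized command name in /proc/PID/stat.

```python
import os


def read_state(pid):
    """Return the one-letter state of `pid` from /proc/PID/stat."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The command name is parenthesized and may contain spaces, so the
    # state is the first field after the last ')'.
    return data[data.rindex(")") + 1:].split()[0]


def observe_zombie():
    """Fork a child that exits at once and report its state before reaping."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)
    # Block until the child has exited, but leave it unreaped (WNOWAIT).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    state = read_state(pid)
    os.waitpid(pid, 0)  # now reap it, which removes the zombie
    return state
```

On such a system observe_zombie() reports Z; once waitpid() has collected the child, its entry disappears from /proc altogether.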

1.9.4.5. Process Descriptor

Each process has a process descriptor that contains all the information describing it. The process descriptor contains such information as the process state, the PID, the command used to start it, and so on. This information can be displayed with a call to ps (process status). A call to ps might yield something like this:

 lkp:~# ps aux | more
 USER   PID  TTY    STAT COMMAND
 root     1  ?      S    init [3]
 root     2  ?      SN   [ksoftirqd/0]
 ...
 root    10  ?      S<   [aio/0]
 ...
 root  2026  ?      Ss   /sbin/syslogd -a /var/lib/ntp/dev/log
 root  2029  ?      Ss   /sbin/klogd -c 1 -2 x
 ...
 root  3324  tty2   Ss+  /sbin/mingetty tty2
 root  3325  tty3   Ss+  /sbin/mingetty tty3
 root  3326  tty4   Ss+  /sbin/mingetty tty4
 root  3327  tty5   Ss+  /sbin/mingetty tty5
 root  3328  tty6   Ss+  /sbin/mingetty tty6
 root  3329  ttyS0  Ss+  /sbin/agetty -L 9600 ttyS0 vt102
 root 14914  ?      Ss   sshd: root@pts/0
 ...
 root 14917  pts/0  Ss   -bash
 root 17682  pts/0  R+   ps aux
 root 17683  pts/0  R+   more

The list of process information shows the process with PID 1 to be the init process. This list also shows the mingetty and agetty programs listening in on the virtual and serial terminals, respectively. Notice that their PIDs are consecutive, because init spawned them one after another. Finally, the list shows the bash session on which the ps aux | more command was issued. Notice that the | used to indicate a pipe is not a process in itself. Recall that we said pipes facilitate communication between processes. The two processes are ps aux and more.

As you can see, the STAT column indicates the state of the process, with S referring to sleeping processes and R to running or runnable processes.

1.9.4.6. Process Priority

In single-processor computers, we can have only one process executing at a time. Processes are assigned priorities as they contend with each other for execution time. This priority is dynamically altered by the kernel based on how much a process has run and what its priority has been until that moment. A process is allotted a timeslice to execute after which it is swapped out for another process by the scheduler, as we describe next.

Higher priority processes are executed first and more often. The user can set a process priority with a call to nice(). This call refers to the niceness of a process toward another, meaning how much the process is willing to yield. A high priority has a negative value, whereas a low priority has a positive value. The higher the value we pass nice, the more we are willing to yield to another process.
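A brief sketch of nice() in use follows. The function name lower_priority_in_child is our own; the work is done in a child process so that the caller's priority is left alone. Note that an unprivileged process can only raise its nice value (that is, lower its priority), and that Linux caps the value at 19.

```python
import os


def lower_priority_in_child(increment):
    """Return the child's nice value before and after adding `increment`."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(read_end)
            before = os.nice(0)         # an increment of 0 just reads the value
            after = os.nice(increment)  # a larger value yields more readily
            os.write(write_end, f"{before} {after}".encode())
        finally:
            os._exit(0)

    os.close(write_end)
    with os.fdopen(read_end) as pipe:
        before, after = map(int, pipe.read().split())
    os.waitpid(pid, 0)
    return before, after
```

Calling lower_priority_in_child(5) from a process with the default nice value of 0 would be expected to return (0, 5).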

1.9.5. System Calls

System calls are the main mechanism by which user programs communicate with the kernel. System calls are generally wrapped inside library calls that manage the setup of the registers and data that each system call needs before executing. The user programs then link in the library with the appropriate routines to make the kernel request.

System calls generally apply to specific subsystems. This means that a user space program can interact with any particular kernel subsystem by means of these system calls. For example, files have file-handling system calls, and processes have process-specific system calls. Throughout this book, we identify the system calls associated with particular kernel subsystems. For example, when we talk about filesystems, we look at the read(), write(), open(), and close() system calls. This provides you with a view of how filesystems are implemented and managed within the kernel.
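To make the role of the library wrapper visible, the sketch below calls the C library's getpid() routine directly instead of going through a language-level helper. It assumes a Linux system on which the running interpreter is dynamically linked against the C library; the function name getpid_via_libc is ours.

```python
import ctypes
import os


def getpid_via_libc():
    """Call the C library's getpid() wrapper directly."""
    # Passing None loads the symbols already linked into this process,
    # which include the C library on a typical Linux system.
    libc = ctypes.CDLL(None, use_errno=True)
    # The wrapper sets up the registers and traps into the kernel for us.
    return libc.getpid()
```

The result is the same number os.getpid() returns, since both routes end at the same system call.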

1.9.6. Linux Scheduler

The Linux scheduler handles the task of moving control from one process to another. With the inclusion of kernel pre-emption in Linux 2.6, any process, including the kernel, can be interrupted at nearly any time and control passed to a new process.

For example, when an interrupt occurs, Linux must stop executing the current process and handle the interrupt. In addition, a multitasking operating system, such as Linux, ensures that no one process hogs the CPU for an extended time. The scheduler handles both of these tasks: On one hand, it swaps the current process with a new process; on the other hand, it keeps track of processes' usage of the CPU and indicates that they be swapped if they have run too long.

How the Linux scheduler determines which process to give control of the CPU is explained in depth in Chapter 7, "Scheduling and Kernel Synchronization"; however, a quick summary is that the scheduler determines priority based on past performance (how much CPU the process has used before) and on the criticality of the process (interrupts are more critical than the log system).

The Linux scheduler also manages how processes execute on multiprocessor machines (SMP). There are some interesting features for load balancing across multiple CPUs as well as the ability to tie processes to a specific CPU. That being said, the basic scheduling functionality operates identically across CPUs.

1.9.7. Linux Device Drivers

Device drivers are how the kernel interfaces with hard disks, memory, sound cards, Ethernet cards, and many other input and output devices.

The Linux kernel usually includes a number of these drivers in a default installation; Linux wouldn't be of much use if you couldn't enter any data via your keyboard. Device drivers are encapsulated in a module. Although Linux is a monolithic kernel, it achieves a high degree of modularization by allowing each device driver to be dynamically loaded. Thus, a default kernel can be kept relatively small and incrementally extended based upon the actual configuration of the system on which Linux runs.

In the 2.6 Linux kernel, device drivers have two major ways of displaying their status to a user of the system: the /proc and /sys filesystems. In a nutshell, /proc is usually used to debug and monitor devices and /sys is used to change settings. For example, if you have an RF tuner on an embedded Linux device, the default tuner frequency could be visible, and possibly changeable, under the devices entry in sysfs.

In Chapters 5, "Input/Output," and 10, "Adding Your Code to the Kernel," we closely look at device drivers for both character and block devices. More specifically, we tour the /dev/random device driver and see how it gathers entropy information from other devices on the Linux system.