Section 10.1. Traversing the Source | The Linux Kernel Primer. A Top-Down Approach for x86 and PowerPC Architectures

10.1. Traversing the Source

This section covers introductory concepts for system calls and drivers (also called modules) under Linux. System calls are what user programs use to communicate with the operating system to request services. Adding a system call is one way to create a new kernel service. Chapter 3, "Processes: The Principal Model of Execution," describes the internals of system call implementation. This chapter describes the practical aspects of incorporating your own system calls into the Linux kernel.

Device drivers encompass the interface that the Linux kernel uses to allow a programmer to control the system's input/output devices. Entire books have been written specifically on Linux device drivers. This chapter distills this topic down to its essentials. In this section, we follow a device driver from how the device is represented in the filesystem and then through the specific kernel code that controls it. In the next section, we show how to use what we've learned in the first part to construct a functional character driver. The final parts of Chapter 10 describe how to write system calls and how to build the kernel. We start by exploring the filesystem and show how these files tie into the kernel.

10.1.1. Getting Familiar with the Filesystem

Devices in Linux can be accessed via /dev. For example, ls -l /dev/random yields the following:

 crw-rw-rw- 1 root  root  1, 8 Oct 2 08:08 /dev/random

The leading "c" tells us that the device is a character device; a "b" identifies a block device. After the owner and group columns are two numbers that are separated by a comma (in this case, 1, 8). The first number is the driver's major number and the second its minor number. When a device driver registers with the kernel, it registers a major number. When a given device is opened, the kernel uses the device file's major number to find the driver that has registered with that major number.^[1] The minor number is passed through the kernel to the device driver itself because a single driver can control multiple devices. For example, /dev/urandom has a major number of 1 and a minor number of 9. This means that the device driver registered with major number 1 handles both /dev/random and /dev/urandom.

^[1] mknod creates block and character device files.
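The same numbers that ls -l prints can be read from a program. The following is a minimal userspace sketch, assuming a Linux system with glibc (the major() and minor() macros come from <sys/sysmacros.h>); the function names device_numbers() and print_device() are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* major(), minor(), makedev() on Linux/glibc */
#include <sys/types.h>

/*
 * Look up a device file and report its type and numbers.
 * Returns 'c' for a character device, 'b' for a block device,
 * 0 if the path is not a device file, and -1 if stat() fails.
 */
int device_numbers(const char *path, unsigned int *maj, unsigned int *min)
{
    struct stat st;

    assert(path != NULL && maj != NULL && min != NULL);
    if (stat(path, &st) != 0)
        return -1;
    if (!S_ISCHR(st.st_mode) && !S_ISBLK(st.st_mode))
        return 0;

    *maj = major(st.st_rdev);   /* selects the driver */
    *min = minor(st.st_rdev);   /* selects the device within that driver */
    return S_ISCHR(st.st_mode) ? 'c' : 'b';
}

/* Print one line per path, in the spirit of the ls -l output. */
void print_device(const char *path)
{
    unsigned int maj, min;
    int type = device_numbers(path, &maj, &min);

    if (type > 0)
        printf("%c %u, %u %s\n", type, maj, min, path);
    else
        printf("%s: not a device file or not accessible\n", path);
}
```

Calling print_device("/dev/random") and print_device("/dev/urandom") on a typical system should show the shared major number and the two different minor numbers described above.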

To generate a random number, we simply read from /dev/random. The following is one possible way to read 4 bytes of random data:^[2]

^[2] head -c4 gathers the first 4 bytes and od -x formats the bytes in hexadecimal.

 lkp@lkp:~$ head -c4 /dev/urandom | od -x
 0000000 823a 3be5
 0000004

If you repeat this command, you will notice that the 4 bytes (823a 3be5 in the output above) change every time. To demonstrate how the Linux kernel uses device drivers, we follow the steps that the kernel takes when a user accesses /dev/random.
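A program can do the same thing as the head and od pipeline with open() and read(). The sketch below is one possible way to do it, not a definitive implementation; read_exact() and random_u32() are hypothetical helper names, and error handling is reduced to a single failure code.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Read exactly n bytes from fd, retrying on short reads. Returns 0 or -1. */
int read_exact(int fd, void *buf, size_t n)
{
    uint8_t *p = buf;

    assert(buf != NULL || n == 0);
    while (n > 0) {
        ssize_t got = read(fd, p, n);
        if (got <= 0)
            return -1;          /* error or unexpected end of file */
        p += got;
        n -= (size_t)got;
    }
    return 0;
}

/* Fetch a 32-bit random value from the given device node. Returns 0 or -1. */
int random_u32(const char *path, uint32_t *out)
{
    int fd = open(path, O_RDONLY);
    int rc;

    if (fd < 0)
        return -1;
    rc = read_exact(fd, out, sizeof(*out));
    close(fd);
    return rc;
}
```

A caller would typically use random_u32("/dev/urandom", &value) and print the result with the %08x format. Each read() here ends up in the driver's read file operation, which is the path the rest of this section follows.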

We know that the /dev/random device file has a major number of 1. We can determine what driver controls the node by checking /proc/devices:

 lkp@lkp:~$ less /proc/devices
 Character devices:
   1 mem

Let's examine the mem device driver and search for occurrences of "random":

 -----------------------------------------------------------------------
 drivers/char/mem.c
 653 static int memory_open(struct inode * inode, struct file * filp)
 654 {
 655   switch (iminor(inode)) {
 656     case 1:
 ...
 676     case 8:
 677       filp->f_op = &random_fops;
 678       break;
 679     case 9:
 680       filp->f_op = &urandom_fops;
 681       break;
 -----------------------------------------------------------------------

Lines 655–681

This switch statement initializes driver structures based on the minor number of the device being operated on. Specifically, filps and fops are being set.

This leads us to ask, "What is a filp? What is a fop?"

10.1.2. Filps and Fops

A filp is simply a file struct pointer, and a fop is a file_operations struct pointer. The kernel uses the file_operations structure to determine what functions to call when the file is operated on. Here are selected sections of the structures that are used in the random device driver:

 -----------------------------------------------------------------------
 include/linux/fs.h
 556 struct file {
 557   struct list_head  f_list;
 558   struct dentry   *f_dentry;
 559   struct vfsmount   *f_vfsmnt;
 560   struct file_operations *f_op;
 561   atomic_t    f_count;
 562   unsigned int   f_flags;
 ...
 581   struct address_space *f_mapping;
 582 };
 -----------------------------------------------------------------------

 -----------------------------------------------------------------------
 include/linux/fs.h
 863 struct file_operations {
 864   struct module *owner;
 865   loff_t (*llseek) (struct file *, loff_t, int);
 866   ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
 867   ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
 868   ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 869   ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t);
 870   int (*readdir) (struct file *, void *, filldir_t);
 871   unsigned int (*poll) (struct file *, struct poll_table_struct *);
 872   int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
 ...
 888 };
 -----------------------------------------------------------------------

Functions that a driver implements must conform to the prototypes listed in the file_operations structure. The random device driver declares which file operations it provides in the following way:

 -----------------------------------------------------------------------
 drivers/char/random.c
 1824 struct file_operations random_fops = {
 1825   .read   = random_read,
 1826   .write   = random_write,
 1827   .poll   = random_poll,
 1828   .ioctl   = random_ioctl,
 1829 };
 1830
 1831 struct file_operations urandom_fops = {
 1832   .read   = urandom_read,
 1833   .write   = random_write,
 1834   .ioctl   = random_ioctl,
 1835 };
 -----------------------------------------------------------------------

Lines 1824–1829

The random device provides the operations of read, write, poll, and ioctl.

Lines 1831–1835

The urandom device provides the operations of read, write, and ioctl.
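The pattern of choosing an operations table by minor number and then calling through it can be modeled in ordinary userspace C. The following is a simplified sketch, not kernel code: struct demo_fops, demo_open(), and demo_read() are hypothetical stand-ins, and the "read" functions just fill the buffer with a marker byte so the dispatch is visible.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-in for the kernel's struct file_operations. */
struct demo_fops {
    long (*read)(char *buf, size_t nbytes);
    int  (*poll)(void);
};

static long demo_random_read(char *buf, size_t n)  { memset(buf, 'r', n); return (long)n; }
static long demo_urandom_read(char *buf, size_t n) { memset(buf, 'u', n); return (long)n; }
static int  demo_random_poll(void)                 { return 1; }

/* Designated initializers: members that are left out are NULL. */
static const struct demo_fops demo_random_fops = {
    .read = demo_random_read,
    .poll = demo_random_poll,
};
static const struct demo_fops demo_urandom_fops = {
    .read = demo_urandom_read,
};

/* Mirrors the switch in memory_open(): pick a table by minor number. */
const struct demo_fops *demo_open(unsigned int minor_nr)
{
    switch (minor_nr) {
    case 8:  return &demo_random_fops;
    case 9:  return &demo_urandom_fops;
    default: return NULL;   /* no such device in this model */
    }
}

/* Call through the table, as the kernel does when servicing a read. */
long demo_read(const struct demo_fops *f, char *buf, size_t n)
{
    assert(buf != NULL);
    if (f == NULL || f->read == NULL)
        return -1;
    return f->read(buf, n);
}
```

Because the urandom table leaves poll unset, demo_open(9)->poll is NULL, which is the same way the kernel learns that a driver does not implement an operation.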

The poll operation allows a programmer to check, before performing an operation, whether that operation would block. This suggests, and it is indeed the case, that /dev/random blocks if a request is made for more bytes of entropy than are in its entropy pool.^[3] /dev/urandom does not block, but it might not return completely random data if the entropy pool is too small. For more information, consult your system's man pages, specifically man 4 random.

^[3] In the random device driver, entropy refers to system data that cannot be predicted. Typically, it is harvested from keystroke timing, mouse movements, and other irregular input.

Digging deeper into the code, notice that when a read operation is performed on /dev/random, the kernel passes control to the function random_read() (see line 1825). random_read() is defined as follows:

 -----------------------------------------------------------------------
 drivers/char/random.c
 1588 static ssize_t
 1589 random_read(struct file * file, char __user * buf, size_t  nbytes, loff_t *ppos)
 -----------------------------------------------------------------------

The function parameters are as follows:

file. Points to the file structure of the device.
buf. Points to an area of user memory where the result is to be stored.
nbytes. The size of data requested.
ppos. Points to a position within the file that the user is accessing.

This brings up an interesting issue: If the driver executes in kernel space, but the buffer is memory in user space, how do we safely get access to the data in buf? The next section explains the process of moving data between user and kernel memory.

10.1.3. User Memory and Kernel Memory

If we were to simply use memcpy() to copy the buffer from kernel space to user space, the copy operation might not work because the user space addresses could be swapped out when memcpy() occurs. Linux has the functions copy_to_user() and copy_from_user(), which allow drivers to move data between kernel space and user space. In random_read(), this is done in the function extract_entropy(), but there is an additional twist:

 -----------------------------------------------------------------------
 drivers/char/random.c
 1349 static ssize_t extract_entropy(struct entropy_store *r, void * buf,
 1350        size_t nbytes, int flags)
 1351 {
 ...
 1452     /* Copy data to destination buffer */
 1453     i = min(nbytes, HASH_BUFFER_SIZE*sizeof(__u32)/2);
 1454     if (flags & EXTRACT_ENTROPY_USER) {
 1455       i -= copy_to_user(buf, (__u8 const *)tmp, i);
 1456       if (!i) {
 1457         ret = -EFAULT;
 1458         break;
 1459       }
 1460     } else
 1461       memcpy(buf, (__u8 const *)tmp, i);
 -----------------------------------------------------------------------

extract_entropy() has the following parameters:

r. A pointer to an internal store of entropy; it is ignored for the purposes of our discussion.
buf. A pointer to an area of memory that should be filled with data.
nbytes. The amount of data to write to buf.
flags. Informs the function whether buf is in kernel or user memory.

extract_entropy() returns ssize_t, which is the size, in bytes, of the random data generated.

Lines 1454–1455

If flags tells us that buf points to a location in user memory, we use copy_to_user() to copy the kernel memory pointed to by tmp to the user memory pointed to by buf.

Lines 1460–1461

If buf points to a location in kernel memory, we simply use memcpy() to copy the data.

Obtaining random bytes is something that both kernel space and user space programs are likely to use; a kernel space program can avoid the overhead of copy_to_user() by not setting the flag. For example, the kernel can implement an encrypted filesystem and can avoid the overhead of copying to user space.
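The "i -= copy_to_user(...)" idiom works because copy_to_user() returns the number of bytes it could not copy. The following userspace sketch models that contract; fake_copy_to_user() is a hypothetical stand-in that simulates a fault by limiting how many bytes the destination accepts, and deliver() is an illustrative name, not a kernel function.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for copy_to_user(). Like the kernel
 * function, it returns the number of bytes it could NOT copy. A "fault"
 * is simulated by capping the number of writable destination bytes.
 */
static size_t fake_copy_to_user(void *dst, const void *src, size_t n,
                                size_t writable)
{
    size_t done = n < writable ? n : writable;

    memcpy(dst, src, done);
    return n - done;
}

/*
 * Same shape as the listing: subtract the uncopied count from i and
 * treat "nothing copied" as -EFAULT. Returns bytes delivered or -EFAULT.
 */
long deliver(void *dst, const void *src, size_t i, int to_user,
             size_t writable)
{
    assert(dst != NULL && src != NULL);
    if (to_user) {
        i -= fake_copy_to_user(dst, src, i, writable);
        if (i == 0)
            return -EFAULT;
    } else {
        memcpy(dst, src, i);    /* kernel buffer: a plain copy is enough */
    }
    return (long)i;
}
```

A partial copy is therefore reported as a short count rather than an error; only a copy that transfers nothing at all becomes -EFAULT.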

10.1.4. Wait Queues

We detoured slightly to explain how to move data between user and kernel memory. Let's return to random_read() and examine how it uses wait queues.

Occasionally, a driver might need to wait for some condition to be true, perhaps access to a system resource. In this case, we don't want the kernel to wait for the access to complete. It is problematic to cause the kernel to wait because all other system processing halts while the wait occurs.^[4] By declaring a wait queue, you can postpone processing until a later time when the condition you are waiting on has occurred.

^[4] Actually, the CPU running the kernel task will wait. On a multi-CPU system, other CPUs can continue to run.

Two structures are used for this process of waiting: a wait queue and a wait queue head. A module should create a wait queue head and have parts of the module that use sleep_on and wake_up macros to manage things. This is precisely what occurs in random_read():

 -----------------------------------------------------------------------
 drivers/char/random.c
 1588 static ssize_t
 1589 random_read(struct file * file, char __user * buf, size_t nbytes, loff_t *ppos)
 1590 {
 1591   DECLARE_WAITQUEUE(wait, current);
 ...
 1597   while (nbytes > 0) {
 ...
 1608     n = extract_entropy(sec_random_state, buf, n,
 1609          EXTRACT_ENTROPY_USER |
 1610          EXTRACT_ENTROPY_LIMIT |
 1611          EXTRACT_ENTROPY_SECONDARY);
 ...
 1618     if (n == 0) {
 1619       if (file->f_flags & O_NONBLOCK) {
 1620         retval = -EAGAIN;
 1621         break;
 1622       }
 1623       if (signal_pending(current)) {
 1624         retval = -ERESTARTSYS;
 1625         break;
 1626       }
 ...
 1632       set_current_state(TASK_INTERRUPTIBLE);
 1633       add_wait_queue(&random_read_wait, &wait);
 1634
 1635       if (sec_random_state->entropy_count / 8 == 0)
 1636         schedule();
 1637
 1638       set_current_state(TASK_RUNNING);
 1639       remove_wait_queue(&random_read_wait, &wait);
 ...
 1645       continue;
 1646  }
 -----------------------------------------------------------------------

Line 1591

The wait queue wait is initialized on the current task. The macro current refers to a pointer to the current task's task_struct.

Lines 1608–1611

We extract a chunk of random data from the device.

Lines 1618–1626

If we could not extract the necessary amount of entropy from the entropy pool and we are non-blocking or there is a signal pending, we return an error to the caller.

Lines 1631–1633

Set up the wait queue. random_read() uses its own wait queue, random_read_wait, instead of the system wait queue.

Lines 1635–1636

At this point, we are on a blocking read and if we don't have 1 byte worth of entropy, we release control of the processor by calling schedule(). (The entropy_count variables hold bits and not bytes; thus, the division by 8 to determine whether we have a full byte of entropy.)

Lines 1638–1639

When we are eventually restarted, we clean up our wait queue.

NOTE

The random device in Linux blocks until the entropy pool holds at least one byte's worth of entropy before returning data. The urandom device does not have this requirement and returns regardless of the amount of entropy available in the pool.
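From user space, the O_NONBLOCK branch on lines 1619–1621 shows up as a read() that fails with EAGAIN instead of sleeping. The sketch below demonstrates that behavior on a pipe rather than on /dev/random, so that the result does not depend on the state of the entropy pool; set_nonblocking() and try_read() are hypothetical helper names.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Switch an open descriptor to non-blocking mode. Returns 0 or -1. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Try to read without sleeping.
 * Returns the number of bytes read, or a negative errno value;
 * -EAGAIN means the read would have blocked.
 */
long try_read(int fd, void *buf, size_t n)
{
    ssize_t got;

    assert(buf != NULL);
    got = read(fd, buf, n);
    return got < 0 ? -(long)errno : (long)got;
}
```

A caller that receives -EAGAIN can do other work and retry later, or use poll() to be told when data is available.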

Let's closely look at what happens when a task calls schedule():

 -----------------------------------------------------------------------
 kernel/sched.c
 2184 asmlinkage void __sched schedule(void)
 2185 {
 ...
 2209   prev = current;
 ...
 2233   switch_count = &prev->nivcsw;
 2234   if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
 2235     switch_count = &prev->nvcsw;
 2236     if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
 2237         unlikely(signal_pending(prev))))
 2238       prev->state = TASK_RUNNING;
 2239     else
 2240       deactivate_task(prev, rq);
 2241   }
 2242 ...
 -----------------------------------------------------------------------

Line 2209

A pointer to the current task's task structure is stored in the prev variable. In cases where the task itself called schedule(), current points to that task.

Line 2233

We store the task's context switch counter, nivcsw, in switch_count. This is incremented later if the switch is successful.^[5]

^[5] See Chapters 4 and 7 for more information on how context switch counters are used.

Line 2234

We only enter this if statement when the task's state, prev->state, is non-zero and there is not a kernel preemption. In other words, we enter this statement when a task's state is not TASK_RUNNING, and the kernel has not preempted the task.

Lines 2235–2241

If the task is interruptible, we're fairly certain that it wanted to release control. If a signal is pending for the task that wanted to release control, we set the task's state to TASK_RUNNING so that it has the opportunity to be chosen for execution by the scheduler when control is passed to another task. If no signal is pending, which is the common case, we deactivate the task and set switch_count to nvcsw. The scheduler increments switch_count later. Thus, nvcsw or nivcsw is incremented.

The schedule() function then picks the next task in the scheduler's run queue and switches control to that task.^[6]

^[6] For detailed information, see the "switch_to()" section in Chapter 7.

By calling schedule(), we allow a task to yield control of the processor to another kernel task when the current task knows it will be waiting for some reason. Other tasks in the kernel can make use of this time and, hopefully, when control returns to the function that called schedule(), the reason for waiting will have been removed.

Returning from our digression on the scheduler to the random_read() function, eventually, the kernel gives control back to random_read() and we clean up our wait queue and continue. This repeats the loop and, if the system has generated enough entropy, we should be able to return with the requested number of random bytes.

random_read() sets its state to TASK_INTERRUPTIBLE before calling schedule() so that it can be interrupted by signals while it is on a wait queue. The driver's own code wakes the task when extra entropy is collected, by calling wake_up_interruptible() in batch_entropy_process() and random_ioctl(). TASK_UNINTERRUPTIBLE is usually used when the task is waiting for hardware to respond, whereas TASK_INTERRUPTIBLE is normally used when the task is waiting on software.

The code that random_read() uses to pass control to another task (see lines 1632–1639, drivers/char/random.c) is a variant of interruptible_sleep_on() from the scheduler code.

 -----------------------------------------------------------------------
 kernel/sched.c
 2489 #define SLEEP_ON_VAR         \
 2490   unsigned long flags;       \
 2491   wait_queue_t wait;        \
 2492   init_waitqueue_entry(&wait, current);
 2493
 2494 #define SLEEP_ON_HEAD         \
 2495   spin_lock_irqsave(&q->lock,flags);    \
 2496   __add_wait_queue(q, &wait);      \
 2497   spin_unlock(&q->lock);
 2498
 2499 #define SLEEP_ON_TAIL         \
 2500   spin_lock_irq(&q->lock);      \
 2501   __remove_wait_queue(q, &wait);     \
 2502   spin_unlock_irqrestore(&q->lock, flags);
 2503
 2504 void fastcall __sched interruptible_sleep_on(wait_queue_head_t *q)
 2505 {
 2506   SLEEP_ON_VAR
 2507
 2508   current->state = TASK_INTERRUPTIBLE;
 2509
 2510   SLEEP_ON_HEAD
 2511   schedule();
 2512   SLEEP_ON_TAIL
 2513 }
 -----------------------------------------------------------------------

q is a wait_queue_head structure that coordinates the module's sleeping and waiting.

Lines 2494–2497

Atomically add our task to a wait queue q.

Lines 2499–2502

Atomically remove the task from the wait queue q.

Lines 2504–2513

Add to the wait queue. Cede control of the processor to another task. When we are given control, remove ourselves from the wait queue.

random_read() uses its own wait queue code instead of the standard macros, but essentially does an interruptible_sleep_on() with the exception that, if we have more than a full byte's worth of entropy, we don't yield control but loop again to try and get all the requested entropy. If there isn't enough entropy, random_read() waits until it's awoken with wake_up_interruptible() from entropy-gathering processes of the driver.

10.1.5. Work Queues and Interrupts

Device drivers in Linux routinely have to deal with interrupts generated by the devices with which they are interfacing. Interrupts trigger an interrupt handler in the device driver and cause all currently executing code, both user space and kernel space, to cease execution. Clearly, it is desirable to have the driver's interrupt handler execute as quickly as possible to prevent long waits in kernel processing.

However, this leads us to the standard dilemma of interrupt handling: How do we handle an interrupt that requires a significant amount of work? The standard answer is to use top-half and bottom-half routines. The top-half routine quickly handles accepting the interrupt and schedules a bottom-half routine, which has the code to do the majority of the work and is executed when possible. Normally, the top-half routine runs with interrupts disabled to ensure that an interrupt handler isn't interrupted by the same interrupt. Thus, the device driver does not have to handle recursive interrupts. The bottom-half routine normally runs with interrupts enabled so that other interrupts can be handled while it continues the bulk of the work.

In prior Linux kernels, this division of top-half and bottom-half, also known as fast and slow interrupts, was handled by task queues. New to the 2.6 Linux kernel is the concept of a work queue, which is now the standard way to deal with bottom-half interrupts.

When the kernel receives an interrupt, the processor stops executing the current task and immediately handles the interrupt. When the CPU enters this mode, it is commonly referred to as being in interrupt context. The kernel, in interrupt context, then determines which interrupt handler to pass control to. When a device driver wants to handle an interrupt, it uses request_irq() to request the interrupt number and register the handler function to be called when this interrupt is seen. This registration is normally done at module initialization time. The top-half interrupt function registered with request_irq() does minimal management and then schedules the appropriate work to be done upon a work queue.

Like request_irq() in the top half, work queues are normally registered at module initialization. They can be initialized statically with the DECLARE_WORK() macro or the work structure can be allocated and initialized dynamically by calling INIT_WORK(). Here are the definitions of those macros:

 -----------------------------------------------------------------------
 include/linux/workqueue.h
 30 #define DECLARE_WORK(n, f, d)         \
 31   struct work_struct n = __WORK_INITIALIZER(n, f, d)
 ...
 45 #define INIT_WORK(_work, _func, _data)       \
 46   do {             \
 47     INIT_LIST_HEAD(&(_work)->entry);    \
 48     (_work)->pending = 0;       \
 49     PREPARE_WORK((_work), (_func), (_data));  \
 50     init_timer(&(_work)->timer);     \
 51   } while (0)
 -----------------------------------------------------------------------

Both macros take the following arguments:

n or work. The name of the work structure to create or initialize.
f or func. The function to run when the work structure is removed from a work queue.
d or data. Holds the data to pass to the function f, or func, when it is run.

The interrupt handler function registered with request_irq() would then accept an interrupt and send the relevant data from the top half of the interrupt handler to the bottom half by setting the work_struct data section and calling schedule_work() on the work queue.

The code present in the work queue function operates in process context and can thus perform work that is impossible to do in interrupt context, such as copying to and from user space or sleeping.

Tasklets are similar to work queues but operate entirely in interrupt context. This is useful when you have little to do in the bottom half and want to save the overhead of a top-half and bottom-half interrupt handler. Tasklets are initialized with the DECLARE_TASKLET() macro:

 -----------------------------------------------------------------------
 include/linux/interrupt.h
 136 #define DECLARE_TASKLET(name, func, data) \
 137 struct tasklet_struct name = { NULL, 0, ATOMIC_INIT(0), func, data }
 -----------------------------------------------------------------------

name. The name of the tasklet structure to create.
func. The function to call when the tasklet is scheduled.
data. Holds the data to pass to the func function when the tasklet executes.

To schedule a tasklet, use tasklet_schedule():

 -----------------------------------------------------------------------
 include/linux/interrupt.h
 171 extern void FASTCALL(__tasklet_schedule(struct tasklet_struct *t));
 172
 173 static inline void tasklet_schedule(struct tasklet_struct *t)
 174 {
 175   if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
 176     __tasklet_schedule(t);
 177 }
 -----------------------------------------------------------------------

t. A pointer to the tasklet structure created with DECLARE_TASKLET().

In the top-half interrupt handler, you can call tasklet_schedule() and be guaranteed that, sometime in the future, the function declared in the tasklet is executed. Tasklets differ from work queues in that different tasklets can run simultaneously on different CPUs. If a tasklet is already scheduled, and scheduled again before the tasklet executes, it is only executed once. As tasklets run in interrupt context, they cannot sleep or copy data to user space. For the same reason, if different tasklets need to communicate, the only safe way to synchronize is by using spinlocks.
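The "scheduled twice, executed once" behavior follows from the test-and-set on the TASKLET_STATE_SCHED bit. The following is a userspace model of that logic using C11 atomics, offered as a sketch rather than as the kernel's implementation; struct demo_tasklet, demo_tasklet_schedule(), and demo_tasklet_run() are hypothetical names, and the model has no queue or softirq behind it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_STATE_SCHED 1u   /* plays the role of TASKLET_STATE_SCHED */

/* Userspace model of a tasklet: a "scheduled" bit plus a callback. */
struct demo_tasklet {
    atomic_uint state;
    void (*func)(unsigned long);
    unsigned long data;
};

/* Returns 1 if this call marked the tasklet pending, 0 if it already was. */
int demo_tasklet_schedule(struct demo_tasklet *t)
{
    /* atomic_fetch_or returns the old value: a test-and-set of the bit. */
    unsigned int old = atomic_fetch_or(&t->state, DEMO_STATE_SCHED);

    return (old & DEMO_STATE_SCHED) == 0;
}

/* Run the tasklet if it is pending. The bit is cleared before the call
 * so the function may be scheduled again while it runs. */
int demo_tasklet_run(struct demo_tasklet *t)
{
    unsigned int old = atomic_fetch_and(&t->state, ~DEMO_STATE_SCHED);

    if ((old & DEMO_STATE_SCHED) == 0)
        return 0;
    assert(t->func != NULL);
    t->func(t->data);
    return 1;
}

/* Example callback: accumulates its argument in a global counter. */
unsigned long demo_hits;
void demo_count(unsigned long n) { demo_hits += n; }
```

Scheduling the model tasklet twice and then running it invokes the callback once; a second run finds the bit clear and does nothing.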

10.1.6. System Calls

There are other ways to add code to the kernel besides device drivers. Linux kernel system calls (syscalls) are the method by which user space programs can access kernel services and system hardware. Many of the C library routines available to user mode programs bundle code and one or more system calls to accomplish a single function. In fact, syscalls can also be accessed from kernel code.

By its nature, syscall implementation is hardware specific. In the Intel architecture, all syscalls use software interrupt 0x80. Parameters of the syscall are passed in the general registers. The implementation of syscall on the x86 architecture limits the number of parameters to 5. If more than 5 are required, a pointer to a block of parameters can be passed. Upon execution of the assembler instruction int 0x80, a specific kernel mode routine is called by way of the exception-handling capabilities of the processor.
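User programs rarely issue the trap instruction by hand. As a minimal sketch, assuming a Linux system with glibc, the syscall() library function can invoke a system call by number; it uses whatever entry mechanism the C library selects for the architecture, which is not necessarily int 0x80. The name raw_getpid() is hypothetical.

```c
#define _GNU_SOURCE          /* makes syscall() visible under strict modes */
#include <assert.h>
#include <sys/syscall.h>     /* SYS_getpid */
#include <unistd.h>          /* syscall(), getpid() */

/* Invoke the getpid system call by number, bypassing the getpid() wrapper. */
long raw_getpid(void)
{
    long pid = syscall(SYS_getpid);

    assert(pid > 0);         /* getpid cannot fail */
    return pid;
}
```

The value returned should match the one from the ordinary getpid() wrapper, which illustrates that a C library routine is often a thin layer over a single system call.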

10.1.7. Other Types of Drivers

Until now, all the device drivers we dealt with have been character drivers. These are usually the easiest to understand, but you might want to write other drivers that interface with the kernel in different ways.

Block devices are similar to character devices in that they can be accessed via the filesystem. /dev/hda is the device file for the primary IDE hard drive on the system. Block devices are registered and unregistered in similar ways to character devices by using the functions register_blkdev() and unregister_blkdev().

A major difference between block drivers and character drivers is that block drivers do not provide their own read and write functionality; instead, they use a request method.

The 2.6 kernel has undergone major changes in the block device subsystem. Old functions, such as block_read() and block_write() and kernel structures like blk_size and blksize_size, have been removed. This section focuses solely on the 2.6 block device implementation.

If you need the Linux kernel to work with a disk (or a disk-like) device, you need to write a block device driver. The driver must inform the kernel what kind of disk it's interfacing with. It does this by using the gendisk structure:

 -----------------------------------------------------------------------
 include/linux/genhd.h
 82 struct gendisk {
 83   int major;      /* major number of driver */
 84   int first_minor;
 85   int minors;
 86   char disk_name[32];    /* name of major driver */
 87   struct hd_struct **part;  /* [indexed by minor] */
 88   struct block_device_operations *fops;
 89   struct request_queue *queue;
 90   void *private_data;
 91   sector_t capacity;
 ...
 -----------------------------------------------------------------------

Line 83

major is the major number for the block device. This can be either statically set or dynamically generated by using register_blkdev(), as it was in character devices.

Lines 84–85

first_minor and minors are used to determine the number of partitions within the block device. minors contains the maximum number of minor numbers the device can have. first_minor contains the first minor device number of the block device.

Line 86

disk_name is a 32-character name for the block device. It appears in the /dev filesystem, sysfs and /proc/partitions.

Line 87

part is the set of partitions (hd_struct structures) associated with the block device.

Line 88

fops is a pointer to a block_device_operations structure that contains the operations open, release, ioctl, media_changed, and revalidate_disk. (See include/linux/fs.h.) In the 2.6 kernel, each device has its own set of operations.

Line 89

queue is a pointer to the request queue that helps manage the device's pending operations.

Line 90

private_data points to information that will not be accessed by the kernel's block subsystem. Typically, this is used to store data that is used in low-level, device-specific operations.

Line 91

capacity is the size of the block device in 512-byte sectors. If the device is removable, such as a floppy disk or CD, a capacity of 0 signifies that no disk is present. If your device doesn't use 512-byte sectors, you need to set this value as if it did. For example, if your device has 1,000 256-byte sectors, that's equivalent to 500 512-byte sectors.
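The conversion in that example is simple arithmetic. A small sketch, with a hypothetical helper name and constant, and with no guard against overflow of the intermediate product:

```c
#include <assert.h>
#include <stdint.h>

#define KERNEL_SECTOR_SIZE 512u   /* unit assumed by the capacity field */

/*
 * Convert a device's native geometry to 512-byte sectors.
 * Any trailing partial 512-byte sector is dropped.
 */
uint64_t capacity_in_kernel_sectors(uint64_t hw_sectors,
                                    uint32_t hw_sector_size)
{
    assert(hw_sector_size > 0);   /* reject a nonsensical geometry */
    return hw_sectors * hw_sector_size / KERNEL_SECTOR_SIZE;
}
```

For the device in the text, capacity_in_kernel_sectors(1000, 256) evaluates to 500. A device with 2,048-byte hardware sectors reports four kernel sectors for each of its own.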

In addition to having a gendisk structure, a block device also needs a spinlock structure for use with its request queue.

Both the spinlock and fields in the gendisk structure must be initialized by the device driver. (Go to http://en.wikipedia.org/wiki/Ram_disk for a demonstration of initializing a RAM disk block device driver.) After the device is initialized and ready to handle requests, the add_disk() function should be called to add the block device to the system.

Finally, if the block device can be used as a source of entropy for the system, the module initialization can also call add_disk_randomness(). (For more information, see drivers/char/random.c.)

Now that we covered the basics of block device initialization, we can examine its complement, exiting and cleaning up the block device driver. This is easy in the 2.6 version of Linux.

del_gendisk(struct gendisk *) removes the gendisk from the system and cleans up its partition information. This call should be followed by put_disk(struct gendisk *), which releases kernel references to the gendisk. The block device is unregistered via a call to unregister_blkdev(int major, char[16] device_name), which then allows us to free the gendisk structure.

We also need to clean up the request queue associated with the block device driver. This is done by using blk_cleanup_queue(struct request_queue *). Note: If you can only reference the request queue via the gendisk structure, be sure to call blk_cleanup_queue() before freeing the gendisk.

In the block device initialization and shutdown overview, we could easily avoid talking about the specifics of request queues. But now that the driver is set up, it has to actually do something, and request queues are how a block device accomplishes its major functions of reading and writing.

 -----------------------------------------------------------------------
 include/linux/blkdev.h
 576 extern request_queue_t *blk_init_queue(request_fn_proc *, spinlock_t *);
 ...
 -----------------------------------------------------------------------

Line 576

To create a request queue, we use blk_init_queue and pass it a pointer to a spinlock to control queue access and a pointer to a request function that is called whenever the device is accessed. The request function should have the following prototype:

 static void my_request_function( request_queue_t *q );

The body of the request function usually relies on a number of helper functions. To determine the next request to be processed, it calls elv_next_request(), which returns a pointer to a request structure, or NULL if there is no next request.

In the 2.6 kernel, the block device driver iterates through BIO structures in the request structure. BIO stands for Block I/O and is fully defined in include/linux/bio.h.

The BIO structure contains a pointer to a list of bio_vec structures, which are defined as follows:

 -----------------------------------------------------------------------
 include/linux/bio.h
 47 struct bio_vec {
 48   struct page  *bv_page;
 49   unsigned int bv_len;
 50   unsigned int bv_offset;
 51 };
 -----------------------------------------------------------------------

Each biovec uses its page structure to hold data buffers that are eventually written to or read from disk. The 2.6 kernel has numerous bio helpers to iterate over the data contained within bio structures.

To determine the size of BIO operation, you can either consult the bio_size field within the BIO struct to get a result in bytes or use the bio_sectors() macro to get the size in sectors. The block operation type, READ or WRITE, can be determined by using bio_data_dir().

To iterate over the biovec list in a BIO structure, use the bio_for_each_segment() macro. Within that loop, further macros can be used to examine the current biovec: bio_page(), bio_offset(), bio_curr_sectors(), and bio_data(). More information can be found in include/linux/bio.h and Documentation/block/biodoc.txt.

Some combination of the information contained in the biovec and the page structures allows you to determine what data to read or write to the block device. The low-level details of how to read and write the device are tied to the hardware the block device driver is using.
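To make the segment walk concrete, here is a minimal userspace sketch of the idea behind iterating over a BIO's biovecs. It is not kernel code: struct model_bio_vec, struct seg_bio, model_for_each_segment(), and model_bio_gather() are hypothetical names, and a plain buffer pointer stands in for the page structure.

```c
#include <stddef.h>
#include <string.h>

/* Userspace model of the three bio_vec fields; a plain buffer
 * pointer replaces struct page *bv_page. */
struct model_bio_vec {
    unsigned char *bv_buf;     /* stands in for the page */
    unsigned int   bv_len;     /* bytes of data in this segment */
    unsigned int   bv_offset;  /* where the data starts in the buffer */
};

/* Hypothetical BIO model: an array of segments plus a count. */
struct seg_bio {
    struct model_bio_vec *bi_io_vec;
    unsigned short        bi_vcnt;
};

/* Hypothetical loop macro in the spirit of bio_for_each_segment(). */
#define model_for_each_segment(bvl, bio, i)     \
    for ((i) = 0, (bvl) = (bio)->bi_io_vec;     \
         (i) < (bio)->bi_vcnt;                  \
         (i)++, (bvl)++)

/* Copy every segment's payload into dst; return the bytes copied.
 * Stops early if the next segment would overflow dst. */
static size_t model_bio_gather(const struct seg_bio *bio,
                               unsigned char *dst, size_t dst_len)
{
    const struct model_bio_vec *bvl;
    size_t copied = 0;
    int i;

    model_for_each_segment(bvl, bio, i) {
        if (copied + bvl->bv_len > dst_len)
            break;
        memcpy(dst + copied, bvl->bv_buf + bvl->bv_offset, bvl->bv_len);
        copied += bvl->bv_len;
    }
    return copied;
}
```

A real driver would hand each segment to the hardware instead of copying it into a flat buffer; the copy only makes the offset and length bookkeeping visible.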

Now that we know how to iterate over a BIO structure, we just have to figure out how to iterate over a request structure's list of BIO structures. This is done using another macro: rq_for_each_bio:

 -----------------------------------------------------------------------
 include/linux/blkdev.h
 495 #define rq_for_each_bio(_bio, rq)  \
 496   if ((rq->bio))     \
 497     for (_bio = (rq)->bio; _bio; _bio = _bio->bi_next)
 -----------------------------------------------------------------------

Line 495

_bio is the current BIO structure and rq is the request to iterate over.
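The traversal pattern can be tried out in userspace with a simplified model. In the sketch below, struct chain_bio, struct model_request, model_rq_for_each_bio(), and model_request_bytes() are hypothetical names; only the list-walking shape is borrowed from the kernel macro.

```c
#include <stddef.h>

/* Hypothetical BIO model: a byte count and a link to the next BIO. */
struct chain_bio {
    unsigned int      bi_size;
    struct chain_bio *bi_next;
};

/* Hypothetical request model: just the head of the BIO list. */
struct model_request {
    struct chain_bio *bio;
};

/* Same shape as the kernel macro: skip empty requests, then walk
 * the singly linked list until the NULL terminator. */
#define model_rq_for_each_bio(_bio, rq)                      \
    if ((rq)->bio)                                           \
        for (_bio = (rq)->bio; _bio; _bio = _bio->bi_next)

/* Sum the sizes of all BIOs attached to a request. */
static unsigned int model_request_bytes(const struct model_request *rq)
{
    const struct chain_bio *bio;
    unsigned int total = 0;

    model_rq_for_each_bio(bio, rq) {
        total += bio->bi_size;
    }
    return total;
}
```

Because the macro expands to an if followed by a for, it is safest to give the loop body its own braces, as done here.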

After each BIO is processed, the driver should update the kernel on its progress. This is done by using end_that_request_first().

 -----------------------------------------------------------------------
 include/linux/blkdev.h
 557 extern int end_that_request_first(struct request *, int, int);
 -----------------------------------------------------------------------

Line 557

The first int argument should be non-zero unless an error has occurred, and the second int argument represents the number of sectors that the device processed.

When end_that_request_first() returns 0, the entire request has been processed and the cleanup needs to begin. This is done by calling blkdev_dequeue_request() and end_that_request_last(), in that order, both of which take the request as the sole argument.
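The completion protocol described above can be sketched as a userspace model. None of the names below are kernel APIs: struct progress_request and the model_* functions are hypothetical, and the sketch assumes the device always succeeds and transfers a fixed number of sectors per step.

```c
/* Hypothetical request model that tracks completion state only. */
struct progress_request {
    unsigned int nr_sectors;  /* sectors still to be transferred */
    int errors;               /* number of failed steps reported */
    int queued;               /* 1 while still on the request queue */
    int finished;             /* 1 once final cleanup has run */
};

/* Modeled on the described behavior of end_that_request_first():
 * returns 0 once the whole request is done, 1 while work remains. */
static int model_end_request_first(struct progress_request *rq,
                                   int uptodate, unsigned int nr_sectors)
{
    if (!uptodate)
        rq->errors++;
    if (nr_sectors >= rq->nr_sectors) {
        rq->nr_sectors = 0;
        return 0;
    }
    rq->nr_sectors -= nr_sectors;
    return 1;
}

static void model_dequeue_request(struct progress_request *rq)
{
    rq->queued = 0;
}

static void model_end_request_last(struct progress_request *rq)
{
    rq->finished = 1;
}

/* Process a request `chunk` sectors at a time. Returns the number of
 * progress reports made, or -1 if chunk is 0 (no progress possible). */
static int model_handle_request(struct progress_request *rq,
                                unsigned int chunk)
{
    int steps = 0;
    int pending;

    if (chunk == 0)
        return -1;
    do {
        /* A real driver would move `chunk` sectors of data here. */
        pending = model_end_request_first(rq, 1, chunk);
        steps++;
    } while (pending);

    /* Cleanup order matters: dequeue first, then finish. */
    model_dequeue_request(rq);
    model_end_request_last(rq);
    return steps;
}
```

In this model, a 20-sector request handled 8 sectors at a time needs three progress reports before the dequeue and final cleanup run.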

After this, the request function has done its job and the block subsystem uses the block device driver's request queue function to perform disk operations. The device might also need to handle certain ioctl functions, as our RAM disk handles partitioning, but those, again, depend on the type of block device.

This section has only touched on the basics of block devices. There are Linux hooks for DMA operations, clustering, request queue command preparation, and many other features of more advanced block devices. For further reading, refer to the Documentation/block directory.

10.1.8. Device Model and sysfs

New in the 2.6 kernel is the Linux device model, to which sysfs is intimately related. The device model stores a set of internal data related to the devices and drivers on a system. The system tracks what devices exist and breaks them down into classes: block, input, bus, etc. The system also keeps track of what drivers exist and how they relate to the devices they manage. The device model exists within the kernel, and sysfs is a window into this model. Because some devices and drivers do not expose themselves through sysfs, a good way of thinking of sysfs is the public view of the kernel's device model.

Certain devices have multiple entries within sysfs.

Only one copy of the data is stored within the device model, but there are various ways of accessing that piece of data, as the symbolic links in the sysfs tree show.

The sysfs hierarchy relates to the kernel's kobject and kset structures. This model is fairly complex, but most driver writers don't have to delve too far into the details to accomplish many useful tasks.^[7] By using the sysfs concept of attributes, you work with kobjects, but in an abstracted way. Attributes are parts of the device or driver model that can be accessed or changed via the sysfs filesystem. They could be internal module variables controlling how the module manages tasks or they could be directly linked to various hardware settings. For example, an RF transmitter could have a base frequency it operates upon and individual tuners implemented as offsets from this base frequency. Changing the base frequency can be accomplished by exposing a module attribute of the RF driver to sysfs.

^[7] See Documentation/filesystems/sysfs.txt in the kernel source.

When an attribute is accessed, sysfs calls a function to handle that access, show() for read and store() for write. There is a one-page limit on the size of data that can be passed to show() or store() functions.

With this outline of how sysfs works, we can now get into the specifics of how a driver registers with sysfs, exposes some attributes, and registers specific show() and store() functions to operate when those attributes are accessed.

The first task is to determine what device class your new device and driver should fall under (for example, usb_device, net_device, pci_device, sys_device, and so on). All these structures have a char *name field within them. sysfs uses this name field to display the new device within the sysfs hierarchy.

After a device structure is allocated and named, you must create and initialize a device_driver structure:

 -----------------------------------------------------------------------
 include/linux/device.h
 102 struct device_driver {
 103   char     * name;
 104   struct bus_type   * bus;
 105
 106   struct semaphore  unload_sem;
 107   struct kobject   kobj;
 108   struct list_head  devices;
 109
 110   int  (*probe)  (struct device * dev);
 111   int  (*remove)  (struct device * dev);
 112   void (*shutdown)  (struct device * dev);
 113   int  (*suspend)  (struct device * dev, u32 state, u32 level);
 114   int  (*resume)  (struct device * dev, u32 level);
 115 };
 -----------------------------------------------------------------------

Line 103

name refers to the name of the driver that is displayed in the sysfs hierarchy.

Line 104

bus is usually filled in automatically; a driver writer need not worry about it.

Lines 105-115

The programmer does not need to set the rest of the fields. They should be automatically initialized at the bus level.

We can register our driver during initialization by calling driver_register(), which passes most of the work to bus_add_driver(). Similarly, upon driver exit, be sure to add a call to driver_unregister().

 -----------------------------------------------------------------------
 drivers/base/driver.c
 86 int driver_register(struct device_driver * drv)
 87 {
 88   INIT_LIST_HEAD(&drv->devices);
 89   init_MUTEX_LOCKED(&drv->unload_sem);
 90   return bus_add_driver(drv);
 91 }
 -----------------------------------------------------------------------

After driver registration, driver attributes can be created via driver_attribute structures and a helpful macro, DRIVER_ATTR:

 -----------------------------------------------------------------------
 include/linux/device.h
 133 #define DRIVER_ATTR(_name,_mode,_show,_store) \
 134 struct driver_attribute driver_attr_##_name = {     \
 135   .attr = {.name = __stringify(_name), .mode = _mode, .owner = THIS_MODULE },  \
 136   .show = _show,        \
 137   .store = _store,        \
 138 };
 -----------------------------------------------------------------------

Line 135

name is the name of the attribute for the driver. mode is the bitmap describing the level of protection of the attribute. include/linux/stat.h contains many of these modes; two examples are S_IRUGO (read-only for everyone) and S_IWUSR (write access for the file's owner, which for sysfs attributes is root).

Line 136

show is the name of the driver function to use when the attribute is read via sysfs. If reads are not allowed, NULL should be used.

Line 137

store is the name of the driver function to use when the attribute is written via sysfs. If writes are not allowed, NULL should be used.
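To see what the macro's token pasting and stringification produce, the following userspace sketch mimics its shape. Everything prefixed with model_ or MODEL_ is hypothetical, the show() and store() signatures are simplified for illustration, the owner field is omitted, and S_IRUGO is redefined locally because userspace headers normally do not provide it.

```c
#include <stddef.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* S_IRUGO is a kernel convenience macro; define it for userspace. */
#ifndef S_IRUGO
#define S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)
#endif

/* Two-level stringification, so macro arguments expand first. */
#define MODEL_STRINGIFY_1(x) #x
#define MODEL_STRINGIFY(x)   MODEL_STRINGIFY_1(x)

struct model_attribute {
    const char *name;
    mode_t      mode;
};

struct model_driver;  /* opaque placeholder for a driver structure */

/* Simplified callback signatures, assumed for this sketch only. */
struct model_driver_attribute {
    struct model_attribute attr;
    ssize_t (*show)(struct model_driver *drv, char *buf);
    ssize_t (*store)(struct model_driver *drv, const char *buf,
                     size_t count);
};

/* Unlike the kernel macro, this one leaves the trailing semicolon
 * to the caller. */
#define MODEL_DRIVER_ATTR(_name, _mode, _show, _store)              \
struct model_driver_attribute model_driver_attr_##_name = {         \
    .attr  = { .name = MODEL_STRINGIFY(_name), .mode = _mode },     \
    .show  = _show,                                                 \
    .store = _store,                                                \
}

/* Report a fixed version string; returns the bytes written. */
static ssize_t version_show(struct model_driver *drv, char *buf)
{
    (void)drv;
    memcpy(buf, "1\n", 3);
    return 2;
}

/* Read-only attribute: no store function, so NULL is passed. */
static MODEL_DRIVER_ATTR(version, S_IRUGO, version_show, NULL);
```

The invocation on the last line creates a variable named model_driver_attr_version whose attr.name is the string "version", which mirrors how DRIVER_ATTR derives both the variable name and the sysfs file name from a single argument.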

The driver functions that implement show() and store() for a specific driver must adhere to the prototypes shown here:

 -----------------------------------------------------------------------
 include/linux/sysfs.h
 34 struct sysfs_ops {
 35   ssize_t (*show)(struct kobject *, struct attribute *,char *);
 36   ssize_t (*store)(struct kobject *,struct attribute *,const char *, size_t);
 37 };
 -----------------------------------------------------------------------

Recall that the size of data read and written to sysfs attributes is limited to PAGE_SIZE bytes. The show() and store() driver attribute functions should ensure that this limit is enforced.
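One way to respect the page limit is sketched below as ordinary userspace C, using the RF transmitter's base frequency as the attribute. The names base_freq_khz, base_freq_show(), base_freq_store(), and MODEL_PAGE_SIZE are hypothetical, the kobject and attribute arguments are dropped for brevity, and a 4096-byte page is assumed.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

#define MODEL_PAGE_SIZE 4096  /* assumption: 4 KB pages */

/* Hypothetical driver setting exposed as an attribute. */
static unsigned long base_freq_khz = 100000;

/* show(): format the value into buf, never writing more than a page.
 * Returns the number of bytes placed in buf, or -1 on error. */
static ssize_t base_freq_show(char *buf)
{
    int n = snprintf(buf, MODEL_PAGE_SIZE, "%lu\n", base_freq_khz);

    if (n < 0)
        return -1;
    if (n >= MODEL_PAGE_SIZE)
        n = MODEL_PAGE_SIZE - 1;  /* output was truncated */
    return n;
}

/* store(): accept a decimal number, optionally followed by a newline.
 * Returns the number of bytes consumed, or -1 on invalid input. */
static ssize_t base_freq_store(const char *buf, size_t count)
{
    char tmp[32];
    char *end;
    unsigned long val;

    /* Reject empty input and anything beyond our limits. */
    if (count == 0 || count > MODEL_PAGE_SIZE || count >= sizeof tmp)
        return -1;
    memcpy(tmp, buf, count);
    tmp[count] = '\0';

    if (tmp[0] < '0' || tmp[0] > '9')
        return -1;
    val = strtoul(tmp, &end, 10);
    if (*end != '\0' && *end != '\n')
        return -1;

    base_freq_khz = val;
    return (ssize_t)count;
}
```

Bounding the write with snprintf() and validating count before copying keeps both directions within one page; in kernel code the same checks would be made against PAGE_SIZE.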

This information should allow you to add basic sysfs functionality to kernel device drivers. For further sysfs and kobject reading, see the Documentation/driver-model directory.

Another type of device driver is a network device driver. Network devices send and receive packets of data and are not necessarily hardware devices; the loopback device, for example, is a software network device.