The Callout Table

   

The kernel sometimes needs to be able to schedule events to happen in the future. For example, a driver might want a timeout on an I/O operation. The driver starts an I/O and is interrupted by the I/O system when the I/O completes, but if for some reason the I/O doesn't complete, the driver needs to know about that too. The driver sets a callout for some number of ticks in the future, specifying a routine to be called at that time. If the I/O completes successfully, then the callout is cancelled. If the callout routine gets called, then the driver knows something went wrong.

Interface

The basic function for setting a timeout is called timeout(). It takes three arguments: a pointer to a function to call, a void* argument to pass to the function, and an integer representing the number of ticks in the future at which the function should be called. There are also a few other variations on this routine, including timeout_on_spu(), which allows the caller to specify on which system processor unit (SPU) the timeout function should run.

There are also functions for canceling a timeout or for finding timeouts. The untimeout() function is passed a function pointer and an argument, just like the first two arguments to timeout(). If a timeout is registered for that function and that argument, then the timeout is cancelled and the return value will be the number of ticks remaining until the timeout expires. The find_timeout() function does the same, but removal of the timeout is optional. In fact, untimeout just calls find_timeout with the remove value set to one.

Data Structures

The primary data structure used for callouts is the struct callout, or callout_t. The structure is used for two different things: either a "real callout" that is, to represent a callout event or a "callout header." At boot time a fixed number of callout structures are allocated. The number is determined by the ncallout system tunable. These callout structures are divided between the SPUs so that each SPU has its own pool of callout structures. The reason we divide them into per-SPU pools is to reduce lock contention. If we had a single pool, then we'd have to have a single lock controlling access to the pool. Because a callout scheduled on one SPU is triggered by that SPU, it makes sense for each SPU to have its own pool of callouts. An SPU can access its pool without blocking operations on another SPU. In the mpinfo structure for the SPU there is a field called callout_info, which points to a struct callout_info_spu (typedef callout_info_spu_t). This structure maintains information about the callouts that pertain to that SPU. This structure has (among other things) a pointer to the first callout_t in its pool (ci_callout), one past the last callout_t in the pool (ci_calloutEND), and a pointer to a linked list of free callouts (ci_freelist).

When a new callout is scheduled, the system gets an available callout structure from the ci_freelist, fills in the appropriate data, and then attaches it to a list of callouts that are scheduled. If an SPU runs out of free callouts, it can go to the pools belonging to the other SPUs and "borrow" callout structures from those free lists.

The callout_info structure also maintains three sets of linked lists of active events based on when they are scheduled to occur. These are the near-term list pointed to by ci_time_nr, the midterm list pointed to by ci_time_md, and the far-future list pointed to by ci_time_ff. The division of events into groups like this avoids the overhead of having to search the entire list of callouts at every interrupt. We only have to search the near-term list to see if anything has expired. When the near-term list is empty, we migrate events from the other lists.

To enable us to find a particular callout, there is also a hashtable and hash chains. A hashing function determines which of several hash chains to search for a callout, then that chain is searched to see if the callout is on the list. Figure 13-1 shows the relationship of the callout information in the mpinfo structure and the various lists of callouts.

Figure 13-1. Per-Processor Structures for Callouts

graphics/13fig01.jpg


Callout Processing and Expiration

In the previous section we mentioned the three sets of callouts the near term, the midterm, and the far future. While the far future is a single linked list, the midterm and near term are both collections of lists. The code is designed such that the near and the midterm can have a different number of lists, but currently both have 128 lists. These sets of lists are arranged as a circular array, as shown in Figure 13-2.

Figure 13-2. Callout Table Headers

graphics/13fig02.jpg


You can think of these as analogous to a clock. On a clock each time the minute hand makes a complete revolution, the hour hand moves forward one tick. Similarly, each list in the midterm wheel represents the same amount of time as the entire near-term wheel. On the near-term wheel all callouts that expire in a particular tick are on the same list. With 128 lists in the wheel, that means anything expiring in the next 128 ticks is near term. The ci_time_nr_next pointer keeps track of which list is the next one due to expire. The ci_time_nr_futurist pointer is the end of the queue the callouts that expire the latest of those considered near term. Each time we get a clock tick interrupt, we check the expiry time for those callouts pointed to by ci_time_nr_next. If the callout is due (or past due), we call the callout functions and then move ci_time_nr_next to the next list.

In the midterm wheel, each list represents 128 ticks. When the near-term wheel makes one full revolution that is, when ci_time_nr_next catches up to ci_time_nr_futurist all of the callouts pointed to by ci_time_md_next get moved onto the near-term wheel. We've essentially moved the hand on the midterm wheel up a tick. When the midterm wheel has made a full revolution we go to the far-future list.

The far-future list is just a single list, kept in sorted order. As the midterm wheel needs replenishing, callouts are moved from the far-future list.

All of this magic happens every clock interrupt. The clock interrupt calls per_spu_hardclock(), which in turn calls expire_callouts_spu(). Expire_callouts_spu() calls redistribute_buckets() to handle the moving of callouts from one wheel to the next.

While all this sounds like a lot of work, it's actually much less overhead than the alternative. It used to be that all callouts were kept in one big linked list, sorted by expiry time. With that method, you only had to check the head of the list to see if the next callout was due to expire. The problem was the process of keeping the list sorted. With the possibility of thousands of callouts on the list, it becomes very expensive in terms of processing power to walk down the list and find the right place to insert a new callout.



HP-UX 11i Internals
HP-UX 11i Internals
ISBN: 0130328618
EAN: 2147483647
Year: 2006
Pages: 167

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net