4.5. Slab Allocator's Lifecycle

Now, we explore the interaction of caches and the slab allocator throughout the lifecycle of the kernel. The kernel needs to make sure that certain structures are in place to support memory area requests on the part of processes and the creation of specialized caches on the part of dynamically loadable modules. A few global structures play key roles for the slab allocator. Some of these were mentioned in passing previously in the chapter. Let's look at these global variables.

4.5.1. Global Variables of the Slab Allocator

There are a number of global variables that are associated with the slab allocator. These include

- cache_cache: the cache descriptor of the cache that holds all other cache descriptors
- cache_chain_sem: the semaphore that guards access to the cache chain
- cache_chain: the list on which all cache descriptors in the system are strung
- malloc_sizes[]: the array that holds the general cache descriptors used by kmalloc()
Before the slab allocator is initialized, these structures are already in place. Let's look at their creation:

-----------------------------------------------------------------------------
mm/slab.c
486 static kmem_cache_t cache_cache = {
487     .lists = LIST3_INIT(cache_cache.lists),
488     .batchcount = 1,
489     .limit = BOOT_CPUCACHE_ENTRIES,
490     .objsize = sizeof(kmem_cache_t),
491     .flags = SLAB_NO_REAP,
492     .spinlock = SPIN_LOCK_UNLOCKED,
493     .color_off = L1_CACHE_BYTES,
494     .name = "kmem_cache",
495 };
496
497 /* Guard access to the cache-chain. */
498 static struct semaphore cache_chain_sem;
499
500 struct list_head cache_chain;
-----------------------------------------------------------------------------

The cache_cache cache descriptor has the SLAB_NO_REAP flag set: even if memory is low, this cache is retained throughout the life of the kernel. Note that the cache_chain semaphore is only defined, not initialized. The initialization occurs during system initialization, in the call to kmem_cache_init(). We explore that function in detail shortly. Another structure that is statically initialized is the malloc_sizes[] array:

-----------------------------------------------------------------------------
mm/slab.c
462 struct cache_sizes malloc_sizes[] = {
463 #define CACHE(x) { .cs_size = (x) },
464 #include <linux/kmalloc_sizes.h>
465     { 0, }
466 #undef CACHE
467 };
-----------------------------------------------------------------------------

This piece of code initializes the malloc_sizes[] array and sets the cs_size field according to the values defined in include/linux/kmalloc_sizes.h: each CACHE(x) line in that header expands to the initializer { .cs_size = (x) }, producing one array entry per supported size. As previously mentioned, the cache sizes can span from 32 bytes to 131,072 bytes depending on the specific kernel configuration.[10]
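Once kmem_cache_init() has populated the cs_cachep and cs_dmacachep fields of each entry (as shown later in this section), kmalloc() simply walks malloc_sizes[] until it finds the first general cache whose objects are large enough for the request. The following is a simplified sketch of that lookup, not the verbatim kernel code; the real 2.6 implementation in mm/slab.c adds debugging hooks and an inline fast path:

-----------------------------------------------------------------------------
/* Sketch: how kmalloc() selects a general cache from malloc_sizes[].
 * The terminating { 0, } entry stops the walk. */
void *kmalloc_sketch(size_t size, int flags)
{
    struct cache_sizes *csizep = malloc_sizes;

    for (; csizep->cs_size; csizep++) {
        if (size > csizep->cs_size)
            continue;   /* too small; try the next larger cache */

        /* DMA-capable requests are served from the parallel DMA caches. */
        return kmem_cache_alloc(flags & GFP_DMA ?
                csizep->cs_dmacachep : csizep->cs_cachep, flags);
    }
    return NULL;    /* larger than the biggest general cache */
}
-----------------------------------------------------------------------------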
With these global variables in place, the kernel proceeds to initialize the slab allocator by calling kmem_cache_init() from init/main.c.[11] This function takes care of initializing the cache chain, its semaphore, the general caches, and the kmem_cache cache; in essence, all the global variables that are used by the slab allocator for slab management. At this point, specialized caches can be created. The function used to create caches is kmem_cache_create().
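For illustration, here is how a dynamically loadable module might create a specialized cache using the 2.6 interface examined later in this section. The struct my_struct type, my_cachep variable, and my_module_init() function are hypothetical names invented for this example:

-----------------------------------------------------------------------------
struct my_struct {              /* hypothetical object type */
    int data;
};

static kmem_cache_t *my_cachep; /* hypothetical module-private cache */

static int my_module_init(void)
{
    /* One cache of fixed-size objects, aligned to hardware cache
     * lines; no constructor or destructor. */
    my_cachep = kmem_cache_create("my_struct_cache",
            sizeof(struct my_struct), 0,
            SLAB_HWCACHE_ALIGN, NULL, NULL);
    if (!my_cachep)
        return -ENOMEM;
    return 0;
}
-----------------------------------------------------------------------------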
4.5.2. Creating a Cache

The creation of a cache involves three steps:

1. Creating, allocating, and initializing the cache descriptor
2. Aligning objects (and, where applicable, slab descriptors) within the slab
3. Adding the cache to the cache_chain list
General caches are set up during system initialization by kmem_cache_init() (mm/slab.c). Specialized caches are created by way of a call to kmem_cache_create(). We now look at each of these functions.

4.5.2.1. kmem_cache_init()

This is where the cache_chain and general caches are created. This function is called during the initialization process. Notice that the function has __init preceding the function name. As discussed in Chapter 2, "Exploration Toolkit," this indicates that the function is loaded into memory that gets wiped after the bootstrap and initialization process is over.

-----------------------------------------------------------------------------
mm/slab.c
659 void __init kmem_cache_init(void)
660 {
661     size_t left_over;
662     struct cache_sizes *sizes;
663     struct cache_names *names;
...
669     if (num_physpages > (32 << 20) >> PAGE_SHIFT)
670         slab_break_gfp_order = BREAK_GFP_ORDER_HI;
671
672
-----------------------------------------------------------------------------

Lines 661–663

The variables sizes and names are the pointers used to walk the arrays backing the kmalloc general caches (the caches with geometrically distributed sizes). At this point, these arrays are located in the __init data area. Be aware that kmalloc() does not exist yet: kmalloc() uses the malloc_sizes array, and that is precisely what we are setting up now. At this point, all we have is the statically allocated cache_cache descriptor.

Lines 669–670

This code block determines how many pages a slab can use, which is entirely determined by how much memory is available. On both x86 and PPC, the macro PAGE_SHIFT (include/asm/page.h) evaluates to 12, so we are verifying whether num_physpages holds a value greater than 8K pages, which is the case on a machine with more than 32MB of memory. If so, slab_break_gfp_order is set to BREAK_GFP_ORDER_HI, which allows slabs to span more than one page. Otherwise, one page is allocated per slab.

-----------------------------------------------------------------------------
mm/slab.c
690     init_MUTEX(&cache_chain_sem);
691     INIT_LIST_HEAD(&cache_chain);
692     list_add(&cache_cache.next, &cache_chain);
693     cache_cache.array[smp_processor_id()] = &initarray_cache.cache;
694
695     cache_estimate(0, cache_cache.objsize, 0,
696         &left_over, &cache_cache.num);
697     if (!cache_cache.num)
698         BUG();
699
...
-----------------------------------------------------------------------------

Line 690

This line initializes the cache_chain semaphore cache_chain_sem.

Line 691

Initialize the cache_chain list where all the cache descriptors are stored.

Line 692

Add the cache_cache descriptor to the cache_chain list.

Line 693

Create the per-CPU caches. The details of this are beyond the scope of this book.

Lines 695–698

This block is a sanity check verifying that at least one cache descriptor can be allocated in cache_cache. It also sets the cache_cache descriptor's num field and calculates how much space will be left over. The leftover space is used for slab coloring, a method by which the kernel reduces cache alignment-related performance hits.

-----------------------------------------------------------------------------
mm/slab.c
705     sizes = malloc_sizes;
706     names = cache_names;
707
708     while (sizes->cs_size) {
...
714         sizes->cs_cachep = kmem_cache_create(
715             names->name, sizes->cs_size,
716             0, SLAB_HWCACHE_ALIGN, NULL, NULL);
717         if (!sizes->cs_cachep)
718             BUG();
719 ...
725
726         sizes->cs_dmacachep = kmem_cache_create(
727             names->name_dma, sizes->cs_size,
728             0, SLAB_CACHE_DMA|SLAB_HWCACHE_ALIGN, NULL, NULL);
729         if (!sizes->cs_dmacachep)
730             BUG();
731
732         sizes++;
733         names++;
734     }
-----------------------------------------------------------------------------

Line 708

This line checks whether we have reached the end of the sizes array. The sizes array's last element is always set to 0, so the loop condition holds until we hit the last cell of the array.

Lines 714–718

Create the next kmalloc cache for normal allocation and verify that it is not empty. See the section "kmem_cache_create()."

Lines 726–730

This block creates the caches for DMA allocation.

Lines 732–733

Go to the next element in the sizes and names arrays.

The remainder of the kmem_cache_init() function handles the replacement of the temporary bootstrapping data with kmalloc allocated data. We leave out the explanation of this because it is not directly pertinent to the actual initialization of the cache descriptors.

4.5.2.2. kmem_cache_create()

Times arise when the memory regions provided by the general caches are not sufficient. This function is called when a specialized cache needs to be created. The steps required to create a specialized cache are not unlike those required to create a general cache: create, allocate, and initialize the cache descriptor; align objects; align slab descriptors; and add the cache to the cache chain. This function does not have __init in front of the function name because persistent memory is available when it is called:

-----------------------------------------------------------------------------
mm/slab.c
1027 kmem_cache_t *
1028 kmem_cache_create (const char *name, size_t size, size_t offset,
1029     unsigned long flags, void (*ctor)(void*, kmem_cache_t *, unsigned long),
1030     void (*dtor)(void*, kmem_cache_t *, unsigned long))
1031 {
1032     const char *func_nm = KERN_ERR "kmem_create: ";
1033     size_t left_over, align, slab_size;
1034     kmem_cache_t *cachep = NULL;
...
-----------------------------------------------------------------------------

Let's look at the function parameters of kmem_cache_create():

- name: The name used to identify the cache. This gets stored in the name field of the cache descriptor and is displayed in /proc/slabinfo.
- size: This parameter specifies the size (in bytes) of the objects that are contained in this cache. This value is stored in the objsize field of the cache descriptor.
- offset: This value determines where the objects are placed within a page.
- flags: The flags parameter is related to the slab. Refer to Table 4.4 for a description of the cache descriptor's flags field and possible values.
- ctor and dtor: These are, respectively, the constructor and destructor that are called upon creation or destruction of objects in this memory region.

The function performs sizable debugging and sanity checks that we do not cover here. See the code for more details:

-----------------------------------------------------------------------------
mm/slab.c
1079     /* Get cache's description obj. */
1080     cachep = (kmem_cache_t *) kmem_cache_alloc(&cache_cache, SLAB_KERNEL);
1081     if (!cachep)
1082         goto opps;
1083     memset(cachep, 0, sizeof(kmem_cache_t));
1084
...
1144     do {
1145         unsigned int break_flag = 0;
1146 cal_wastage:
1147         cache_estimate(cachep->gfporder, size, flags,
1148             &left_over, &cachep->num);
...
1174     } while (1);
1175
1176     if (!cachep->num) {
1177         printk("kmem_cache_create: couldn't create cache %s.\n", name);
1178         kmem_cache_free(&cache_cache, cachep);
1179         cachep = NULL;
1180         goto opps;
1181     }
-----------------------------------------------------------------------------

Lines 1079–1084

This is where the cache descriptor is allocated. Following this is the portion of the code that is involved with the alignment of objects in the slab. We leave this portion out of this discussion.

Lines 1144–1174

This is where the number of objects in the cache is determined. The bulk of the work is done by cache_estimate(). Recall that the value is to be stored in the num field of the cache descriptor.

Lines 1176–1181

If not even one object fits in a slab, the cache cannot be created: the descriptor is freed and the function returns NULL.

-----------------------------------------------------------------------------
mm/slab.c
...
1201     cachep->flags = flags;
1202     cachep->gfpflags = 0;
1203     if (flags & SLAB_CACHE_DMA)
1204         cachep->gfpflags |= GFP_DMA;
1205     spin_lock_init(&cachep->spinlock);
1206     cachep->objsize = size;
1207     /* NUMA */
1208     INIT_LIST_HEAD(&cachep->lists.slabs_full);
1209     INIT_LIST_HEAD(&cachep->lists.slabs_partial);
1210     INIT_LIST_HEAD(&cachep->lists.slabs_free);
1211
1212     if (flags & CFLGS_OFF_SLAB)
1213         cachep->slabp_cache = kmem_find_general_cachep(slab_size,0);
1214     cachep->ctor = ctor;
1215     cachep->dtor = dtor;
1216     cachep->name = name;
1217
...
1242
1243     cachep->lists.next_reap = jiffies + REAPTIMEOUT_LIST3 +
1244         ((unsigned long)cachep)%REAPTIMEOUT_LIST3;
1245
1246     /* Need the semaphore to access the chain. */
1247     down(&cache_chain_sem);
1248     {
1249         struct list_head *p;
1250         mm_segment_t old_fs;
1251
1252         old_fs = get_fs();
1253         set_fs(KERNEL_DS);
1254         list_for_each(p, &cache_chain) {
1255             kmem_cache_t *pc = list_entry(p, kmem_cache_t, next);
1256             char tmp;
...
1265             if (!strcmp(pc->name,name)) {
1266                 printk("kmem_cache_create: duplicate cache %s\n",name);
1267                 up(&cache_chain_sem);
1268                 BUG();
1269             }
1270         }
1271         set_fs(old_fs);
1272     }
1273
1274     /* cache setup completed, link it into the list */
1275     list_add(&cachep->next, &cache_chain);
1276     up(&cache_chain_sem);
1277 opps:
1278     return cachep;
1279 }
-----------------------------------------------------------------------------

Just prior to this, the slab is aligned to the hardware cache and colored; the color and color_off fields of the cache descriptor are filled out.

Lines 1201–1217

This code block initializes the cache descriptor fields, much like we saw in kmem_cache_init().

Lines 1243–1244

The time for the next cache reap is set.

Lines 1247–1276

After acquiring the cache_chain semaphore, the cache chain is walked to verify that no cache with the same name already exists. Once that is confirmed, the new cache descriptor, now fully initialized, is added to the cache_chain list and the semaphore is released.

4.5.3. Slab Creation and cache_grow()

When a cache is created, it starts empty of slabs. In fact, slabs are not allocated until a request for an object demonstrates a need for a new slab. This happens when the cache descriptor's lists.slabs_partial and lists.slabs_free fields are empty. At this point, we won't relate how the request for memory translates into the request for an object within a particular cache. For now, we take for granted that this translation has occurred and concentrate on the technical implementation within the slab allocator.

A slab is created within a cache by cache_grow(). When we create a slab, we not only allocate and initialize its descriptor; we also allocate the actual memory. To this end, we need to interface with the buddy system to request the pages. This is done by kmem_getpages() (mm/slab.c).
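As a rough sketch of that interface (not the verbatim kernel function, which also updates page-state accounting), kmem_getpages() amounts to an order-sized request to the buddy allocator:

-----------------------------------------------------------------------------
/* Sketch: acquire 2^gfporder contiguous pages from the buddy system
 * for a new slab.  Accounting details are elided. */
static void *kmem_getpages_sketch(kmem_cache_t *cachep, int flags)
{
    void *addr;

    flags |= cachep->gfpflags;  /* e.g., GFP_DMA for DMA caches */
    addr = (void *)__get_free_pages(flags, cachep->gfporder);
    return addr;                /* NULL if the buddy system cannot comply */
}
-----------------------------------------------------------------------------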
4.5.3.1. cache_grow()

The cache_grow() function grows the number of slabs within a cache by 1. It is called only when no free objects are available in the cache, that is, when lists.slabs_partial and lists.slabs_free are empty:

-----------------------------------------------------------------------------
mm/slab.c
1546 static int cache_grow (kmem_cache_t * cachep, int flags)
1547 {
...
-----------------------------------------------------------------------------

The parameters passed to the function are

- cachep: The cache descriptor of the cache to be grown
- flags: The allocation flags passed down from the original request for memory; among other things, they indicate whether the allocation may sleep (__GFP_WAIT)
-----------------------------------------------------------------------------
mm/slab.c
1572     check_irq_off();
1573     spin_lock(&cachep->spinlock);
...
1581
1582     spin_unlock(&cachep->spinlock);
1583
1584     if (local_flags & __GFP_WAIT)
1585         local_irq_enable();
-----------------------------------------------------------------------------

Lines 1572–1573

Verify that interrupts are already disabled and lock the cache descriptor in preparation for manipulating its fields.

Lines 1582–1585

Unlock the cache descriptor and, if the allocation is allowed to sleep, re-enable interrupts.

-----------------------------------------------------------------------------
mm/slab.c
...
1597     if (!(objp = kmem_getpages(cachep, flags)))
1598         goto failed;
1599
1600     /* Get slab management. */
1601     if (!(slabp = alloc_slabmgmt(cachep, objp, offset, local_flags)))
1602         goto opps1;
...
1605     i = 1 << cachep->gfporder;
1606     page = virt_to_page(objp);
1607     do {
1608         SET_PAGE_CACHE(page, cachep);
1609         SET_PAGE_SLAB(page, slabp);
1610         SetPageSlab(page);
1611         inc_page_state(nr_slab);
1612         page++;
1613     } while (--i) ;
1614
1615     cache_init_objs(cachep, slabp, ctor_flags);
-----------------------------------------------------------------------------

Lines 1597–1598

Interface with the buddy system to acquire page(s) for the slab.

Lines 1601–1602

Place the slab descriptor where it needs to go. Recall that slab descriptors can be stored within the slab itself or within the first general purpose cache.

Lines 1605–1613

The pages need to be associated with the cache and slab descriptors.

Line 1615

Initialize all the objects in the slab.

-----------------------------------------------------------------------------
mm/slab.c
1616     if (local_flags & __GFP_WAIT)
1617         local_irq_disable();
1618     check_irq_off();
1619     spin_lock(&cachep->spinlock);
1620
1621     /* Make slab active. */
1622     list_add_tail(&slabp->list, &(list3_data(cachep)->slabs_free));
1623     STATS_INC_GROWN(cachep);
1624     list3_data(cachep)->free_objects += cachep->num;
1625     spin_unlock(&cachep->spinlock);
1626     return 1;
1627 opps1:
1628     kmem_freepages(cachep, objp);
1629 failed:
1630     if (local_flags & __GFP_WAIT)
1631         local_irq_disable();
1632     return 0;
1633 }
-----------------------------------------------------------------------------

Lines 1616–1619

Because we are about to access and change descriptor fields, we need to disable interrupts and lock the data.

Lines 1622–1624

Add the new slab descriptor to the lists.slabs_free field of the cache descriptor and update the statistics that track these counts.

Lines 1625–1626

Unlock the spinlock and return 1 because everything succeeded.

Lines 1627–1628

This path is taken if the slab-management allocation fails; the pages just acquired from the buddy system are freed.

Lines 1629–1632

On failure, disable interrupts again if they were enabled for the allocation, restoring the state in which the function was entered, and return 0.

4.5.4. Slab Destruction: Returning Memory and kmem_cache_destroy()

Both caches and slabs can be destroyed. Caches can be shrunk or destroyed to return memory to the free memory pool. The kernel calls these functions when memory is low. In either case, slabs are being destroyed and the pages corresponding to them are being returned for the buddy system to recycle. kmem_cache_destroy() gets rid of a cache; we explore this function in depth. Caches can be reaped and shrunk by kmem_cache_reap() and kmem_cache_shrink(), respectively (both in mm/slab.c). The function that interfaces with the buddy system to release pages is kmem_freepages() (mm/slab.c).

4.5.4.1. kmem_cache_destroy()

There are a few instances when a cache would need to be removed.
Dynamically loadable modules (assuming no persistent memory across loading and unloading) that create caches must destroy them upon unloading to free up the memory and to ensure that the cache won't be duplicated the next time the module is loaded. Thus, specialized caches are generally destroyed in this manner.

The steps to destroy a cache are the reverse of the steps to create one. Alignment issues are not a concern upon destruction of a cache, only the deletion of descriptors and the freeing of memory. The steps to destroy a cache can be summarized as

1. Remove the cache descriptor from the cache_chain list
2. Delete the cache's slabs and return their pages to the buddy system
3. Free the cache descriptor itself
-----------------------------------------------------------------------------
mm/slab.c
1421 int kmem_cache_destroy (kmem_cache_t * cachep)
1422 {
1423     int i;
1424
1425     if (!cachep || in_interrupt())
1426         BUG();
1427
1428     /* Find the cache in the chain of caches. */
1429     down(&cache_chain_sem);
1430     /*
1431      * the chain is never empty, cache_cache is never destroyed
1432      */
1433     list_del(&cachep->next);
1434     up(&cache_chain_sem);
1435
1436     if (__cache_shrink(cachep)) {
1437         slab_error(cachep, "Can't free all objects");
1438         down(&cache_chain_sem);
1439         list_add(&cachep->next,&cache_chain);
1440         up(&cache_chain_sem);
1441         return 1;
1442     }
1443
...
1450     kmem_cache_free(&cache_cache, cachep);
1451
1452     return 0;
1453 }
-----------------------------------------------------------------------------

The function parameter cachep is a pointer to the cache descriptor of the cache that is to be destroyed.

Lines 1425–1426

This sanity check ensures that we are not in interrupt context and that the cache descriptor is not NULL.

Lines 1429–1434

Acquire the cache_chain semaphore, delete the cache from the cache chain, and release the cache_chain semaphore.

Lines 1436–1442

This is where the bulk of the work related to freeing the unused slabs takes place. If __cache_shrink() returns true, slabs remain in the cache (some objects are still allocated) and, therefore, it cannot be destroyed. In that case, we reverse the previous step and relink the cache descriptor into the cache_chain, again acquiring the cache_chain semaphore first and releasing it once we finish.

Line 1450

We finish by releasing the cache descriptor back to cache_cache.
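Continuing the hypothetical module example shown earlier, the module's exit routine would destroy its cache as follows; my_module_exit() and my_cachep are, again, names invented for the example:

-----------------------------------------------------------------------------
static void my_module_exit(void)
{
    /* kmem_cache_destroy() returns nonzero if objects are still
     * allocated from the cache, in which case the cache is relinked
     * into the cache_chain instead of being destroyed. */
    if (kmem_cache_destroy(my_cachep))
        printk(KERN_ERR "my_module: cache still has active objects\n");
}
-----------------------------------------------------------------------------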