| 
 | 
 | 
| This study references only the new Linux 2.6 components that have a profound impact on system performance; it does not cover all the new features of Linux 2.6. rmap and objrmapOne new VM feature of the Linux 2.6 kernel is referred to as reversed mapping (rmap). In the Linux 2.4 kernel, a single pte to (struct) page mapping exists, a technique that is sufficient for process addressing. However, this implementation presented a challenge for the page replacement algorithm because it was necessary to deal with shared pages. It was not feasible to acquire a list of ptes that map a particular (struct) page without traversing the page tables for every active process on the system. Given a physical memory page, rmap allows the VM to identify the processes that are utilizing the page. From an implementation perspective, Linux 2.6 introduces a union pte component into the page structure. As a page is shared among processes, a pte chain is created and linked to this field. The slab allocator manages the pte chain elements, and each node can locate up to NRPTE number of ptes (NRPTE is related to the L1 cache size of the target architecture). To summarize, the 2.4 kernel scans through all the page tables, one process at a time, invalidating entries for suitable pages. The introduction of the new data structure allows the kernel to identify the page table entries that reference a certain physical page. The problem with this implementation is that the introduced approach memory is intensive and requires a substantial amount of maintenance. Operations such as page faults, munmap, or fork encounter a slowdown in execution speed compared to Linux 2.4. The fork() system call slows down because it is necessary by utilizing rmap to add a new reverse mapping entry for every page in the process's address space. The introduction of object-based reversed mapping (ObjRMAP) differs from rmap. In ObjRMAP, the struct page structure utilizes the mapping file to point to an address_space structure describing the object that backs up that particular page. The address_space structure incorporates two linked lists that contain the vm_area_struct (vma) structures for each process that has a mapping into the file. The vma provides the information needed to locate a page's virtual address that can be utilized to identify the correct page table entry. The object-based approach removes the direct entry-based approach found in rmap, and the page structure points directly to the page-table entry. Identifying a page-table entry via ObjRMAP by consulting the address_space as well as the vma structures takes longer but should be accomplished with a much-reduced overhead factor compared to the rmap implementation. Largepages SupportMost modern computer architectures support more than one page size. As an example, the IA-32 architecture supports either 4KB or 4MB pages; the Linux operating system utilized largepages only for mapping the actual kernel image. Largepage usage is primarily intended to provide performance improvements for high-performance computing (HPC) and other memory-intensive applications. Any memory-access-intensive application that utilizes large amounts of virtual memory may obtain performance improvements by using largepages. (Linux utilizes either 2MB or 4MB largepages, AIX uses 16MB largepages, and Solaris uses 4MB.) The largepage performance improvements are attributable to reduced translation lookaside buffer (TLB) misses, due to the TLB being able to map a larger virtual memory range (by increasing the reach of the TLB). Largepages also improve the process of memory prefetching by eliminating the need to restart prefetch operations on 4KB boundaries. In Linux 2.6, the actual number of available largepages can be configured through the proc (/proc/sys/vm/nr_hugepages) interface. Because the actual allocation of largepages depends on the availability of physically contiguous memory regions, the recommendation is to launch the allocation at system startup. The core of the largepage implementation in Linux 2.6 is referred to as the hugetlbfs, a pseudo file system (implemented in fs/hugetlbfs/inode.c) based on ramfs. The idea behind the implementation is that largepages are used to back up any file that exists in the file system. The actual pseudo file system is initialized at system startup and registered as an internal file system. A process may access largepages either through the shmget() interface to set up a shared region that is backed by largepages or by utilizing the mmap() call on a file that has been opened in the huge page file system. To operate on a file that uses largepages, a hugetlbfs file system has to be mounted first. After the mount operation completes, standard file operation calls such as open() can be used to set up a file descriptor. In the case that the mmap() system call is used on an open file, the hugetlbfs registered mmap() function creates the appropriate vma structure for the process. It is recommended that you use the /proc/meminfo interface to ensure that the size, as well as the number of largepages, is set up as expected. Fragmentation may prevent the allocation process from completing successfully. Page Allocation and ReplacementPhysical page allocation in Linux 2.6 is still based on a buddy algorithm, but the mechanism itself has been extended to resemble a lazy buddy scheme that defers the coalescing part. The buffer release activity of the coalescing portion involves a two-step process, where the first step consists of pushing a buffer back onto the free list. This makes the buffer available to other requests. The second step involves referencing the buffer as free (normally in a bitmap) and combining the buffer (if feasible) with adjacent buffers. A regular buddy system implementation performs both steps on each release. The lazy buddy scheme performs the first step but defers the second step and executes it depending on the state of the buffer class. The Linux 2.6 implementation of the discussed lazy buddy concept utilizes per-CPU page lists. These pagesets contain two lists of pages (marked as hot and cold), where the hot pages have been used recently, and the expectation is that they are still present in the CPU cache. On allocation, the pageset for the running CPU is consulted first to determine if the pages are available; if so, the allocation proceeds accordingly. To determine when the pageset should be released or allocated, a low and high watermark-based approach is used. As the low watermark is reached, a batch of pages is allocated and placed on the list. As the high watermark is reached, a batch of pages is freed simultaneously. As a result of the new and improved implementation, the spinlock overhead is significantly reduced. In the 2.4 Linux kernel, kswapd, a global daemon, acted as the page replacement activator. The kswapd was responsible for keeping a certain amount of memory available, and under memory pressure would initiate the process of freeing pages. In Linux 2.6, a kswapd daemon is available for each node in the system. The idea behind this design is to avoid situations where kswapd would have to free pages on behalf of a remote node. In Linux 2.4, a spinlock has to be acquired before removing pages from the LRU list. In Linux 2.6, operations involving the LRU list take place via pagevec structures. The pagevec structures either add or remove pages to or from the LRU lists in sets of up to PAGEVEC_SIZE pages. In the case of removing pages, the zone lru_lock lock is acquired and the pages are placed on a temporary list. After the list of pages to be removed is assembled, a call to shrink_list() is initiated. The actual process of freeing pages can now be performed without acquiring the zone lru_lock spinlock. In the case of adding pages, a new page vector struct is initialized via pagevec_init(). Pages are added to the vector through pagevec_add() and then are committed to being placed onto the LRU list via pagevec_release(). Slab AllocatorLinux, like most other UNIX operating systems, provides a general-purpose memory allocator that provides arbitrarily sized memory allocations. The basic objective of any slab allocator is to efficiently allocate small memory buffers that are normally less than a page or not an even multiple of a page size. Another objective of a slab allocator is to effectively operate on memory objects that are frequently being allocated and freed, a process that may lead to memory fragmentation. The slab allocator allows caching these frequently used objects (such as sock, inode, or mm_struct), which minimizes the overhead of initializing and destroying these basic entities every time a new object is requested. Many of the changes in the slab allocator for Linux 2.6 are related to address lock contention scenarios. In addition, Linux 2.6 introduces a slab shrinker callback feature. In Linux 2.4, kmem_cache_reap() is called in low-memory situations. In Linux 2.6, the 2.4 setup has been augmented as caches register a shrinker callback via set_shrinker(). The set_shrinker() function populates a struct with a pointer to the callback and a weight that indicates the complexity of re-creating the object. During a page reclaim scenario, each shrinker is called twice. The first call passes 0 as a parameter, indicating that the callback should identify how many pages could be freed (if called properly). A basic heuristic is applied next to determine if it is worth the cost of using the callback. If so, the callback is called again with a parameter that indicates the number of objects to be freed. VM TunablesThe following definitions summarize the VM tunables for Linux 2.6, provide some insight into the objectives of adjusting the parameters, and discuss the implications of doing so. All the parameters can be adjusted through the proc (/proc/sys/vm) file system interface. The parameters are listed in alphabetical order as listed in the proc file system. 
 CPU SchedulerThe thread scheduler represents the subsystem responsible for decomposing the available CPU resources among the runnable threads. The scheduler's behavior and decision-making process have a direct correlation to thread fairness and scheduling latency. Fairness can be described as the capability of all threads to not only encounter forward progress but to do so in an even manner. The opposite of fairness is known as starvation, a scenario where a given thread encounters no forward progress. Fairness is frequently hard to justify because it represents an exercise in compromise between global and localized performance. Scheduling latency describes the actual delay between a thread entering the runnable state (TSRUN) and actually running on a processor (TSONPROC). An inefficient scheduling latency behavior results in a perceptible delay in application response time. An efficient and effective CPU scheduler is paramount to the operation of the computing platform; it decides which task (thread) to run at what time and for how long. Real-time and time-shared jobs are distinguished, each class revealing different objectives. Both are implemented through different scheduling disciplines embedded in the scheduler. Linux assigns a static priority to each task that can be modified through the nice() interface. Linux has a range of priority classes, distinguishing between real time and time-sharing tasks. The lower the priority value, the higher a task's logical priority, or in other words its general importance. In this context, the discussion always references the logical priority when elaborating on priority increases and decreases. Real-time tasks always have higher priority than time-sharing tasks. The Linux 2.6 scheduler, referred to as the O(1) scheduler, is a multiqueue scheduler that assigns an actual run-queue to each CPU, promoting a CPU local scheduling approach. The O(1) label describes the time complexity for retrieval; in other words, it refers to the property that no matter the workload on the system, the next task to run will be chosen in a constant amount of time. The previous incarnation of the Linux scheduler utilized the concept of goodness to determine which thread to execute next. All runnable tasks were kept on a single run-queue that represented a linked list of threads in a TSRUN (TASK_RUNNABLE) state. In Linux 2.6, the single run-queue lock was replaced with a per-CPU lock, ensuring better scalability on SMP systems. The per-CPU run-queue scheme adopted by the O(1) scheduler decomposes the run-queue into a number of buckets (in priority order) and utilizes a bitmap to identify the buckets that hold runnable tasks. Locating the next task to execute requires a read of the bitmap to identify the first bucket with runnable tasks and choosing the first task in that run-queue of the bucket. The per-CPU run-queue consists of two vectors of task lists, labeled as the active vector and the expired vector. Each vector index represents a list of runnable tasks, each at its respective priority level. After executing for a period of time, a task moves from the active list to the expired list to ensure that all runnable tasks get an opportunity to execute. As the active array becomes empty, the expired and active vectors are swapped by modifying the pointers. Occasionally, a load-balancing algorithm is invoked to rebalance the run-queues to ensure that a similar number of tasks are available per CPU. As mentioned, the scheduler has to decide which task to run next and for how long. Time quanta in the Linux kernel are defined as multiples of a system tick. A tick is defined as the fixed delta (1/HZ) between two consecutive timer interrupts. In Linux 2.6, the HZ parameter is set to 1,000, indicating that the interrupt routine scheduler_tick() is invoked once every millisecond, at which time the currently executing task is charged with one tick. Setting the HZ parameter to 1,000 does not improve the system's responsiveness, because the actual scheduler timeslice is not affected by this setting. For systems that primarily execute in a number-crunching mode, the HZ=1,000 setting may not be appropriate. In such an environment, the aggregate system's performance could benefit from setting the HZ parameter to a lower value (around 100). Besides the static priority (static_prio), each task maintains an effective priority (prio). The distinction is made to account for certain priority bonuses or penalties based on the recent sleep average (sleep_avg) of a given task. The sleep average accounts for the number of ticks in which a task was recently descheduled. The effective priority of a task determines its location in the priority list of the run-queue. A task is declared interactive when its effective priority exceeds its static priority by a certain level. This scenario can only be based on the task accumulating sleep average ticks. The interactive estimator framework embedded in Linux 2.6 operates automatically and transparently. In Linux, high-priority tasks reach the interactivity state with a much smaller sleep average than do lower-priority tasks. Because a task's interactivity is estimated via the sleep average, I/O-bound tasks are potential candidates to reach the interactivity status, whereas CPU-bound tasks normally are not perceived as interactive tasks. The actual timeslice is defined as the maximum time a task can execute without yielding the CPU (voluntarily) to another task and is simply a linear function of the static priority of normal tasks. The priority itself is projected into a range of MIN_TIMESLICE to MAX_TIMESLICE, where the default values are set to 10 and 200 milliseconds, respectively. The higher the priority of a task, the greater the task's timeslice for every timer tick the task's running timeslice is decremented. If it is decremented to 0, the scheduler replenishes the timeslice, recomputes the effective priority, and re-enqueues the task either into the active vector, if the task is classified as interactive, or into the expired vector, if the task is considered noninteractive. At this point, the system dispatches another task from the active array. This scenario ensures that tasks in the active array are executed first, before any expired task has the opportunity to run again. If a task is descheduled, its timeslice is not replenished at wakeup time; however, its effective priority might have changed due to any accumulated sleep time. If all the runnable tasks have exhausted their timeslice and have been moved to the expired list, the expired and active vector are swapped. This technique ensures that the O(1) scheduler does not have to traverse a potentially large list of tasks, as is required in the Linux 2.4 scheduler. The issue with the new design is that due to any potential interactivity, scenarios may arise where the active queue continues to have runnable tasks that are not migrated to the expired list. There may be tasks starving for CPU time; to circumvent the starvation issue, the task that first migrated to the expired list is older than the STARVATION_LIMIT (which is set to 10 seconds), and the active and expired arrays are switched. | 
| 
 | 
 | 
