New Features in Linux 2.6

This study references only the new Linux 2.6 components that have a profound impact on system performance; it does not cover all the new features of Linux 2.6.

rmap and objrmap

One new VM feature of the Linux 2.6 kernel is referred to as reverse mapping (rmap). In the Linux 2.4 kernel, only a single pte (page table entry) to (struct) page mapping exists, a technique that is sufficient for process addressing. However, this implementation presented a challenge for the page replacement algorithm because it had to deal with shared pages: it was not feasible to acquire a list of the ptes that map a particular (struct) page without traversing the page tables for every active process on the system. Given a physical memory page, rmap allows the VM to identify the processes that are utilizing the page. From an implementation perspective, Linux 2.6 introduces a union pte component into the page structure. As a page becomes shared among processes, a pte chain is created and linked to this field. The slab allocator manages the pte chain elements, and each node can locate up to NRPTE ptes (NRPTE is related to the L1 cache size of the target architecture). To summarize, the 2.4 kernel scans through all the page tables, one process at a time, invalidating entries for suitable pages, whereas the new data structure allows the 2.6 kernel to identify directly the page table entries that reference a certain physical page.

The problem with this implementation is that the approach is memory intensive and requires a substantial amount of maintenance. Operations such as page faults, munmap(), or fork() encounter a slowdown in execution speed compared to Linux 2.4. The fork() system call slows down because rmap requires adding a new reverse mapping entry for every page in the process's address space.

The introduction of object-based reverse mapping (objrmap) differs from rmap. In objrmap, the struct page structure utilizes its mapping field to point to an address_space structure describing the object that backs that particular page. The address_space structure incorporates two linked lists that contain the vm_area_struct (vma) structures for each process that has a mapping into the file. The vma provides the information needed to locate a page's virtual address, which can then be used to identify the correct page table entry. The object-based approach therefore removes the per-page pte chains found in rmap, where the page structure points directly to the page table entries. Identifying a page table entry via objrmap by consulting the address_space as well as the vma structures takes longer, but it should be accomplished with a much-reduced memory and maintenance overhead compared to the rmap implementation.
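
To make the structural difference more concrete, the following is a simplified C sketch of the two schemes; the type and field names are illustrative stand-ins for the kernel's struct page, pte_chain, and address_space definitions, not the actual code.

    /* Simplified, illustrative sketch of the two reverse-mapping schemes
     * described above; these are NOT the actual kernel definitions. */

    #define NRPTE 15                 /* illustrative; sized to fit an L1 cache line */

    /* rmap: each physical page keeps a chain of back-pointers to the
     * ptes that map it. */
    struct pte_chain {
        struct pte_chain *next;      /* next node in the chain            */
        unsigned long *ptes[NRPTE];  /* up to NRPTE pte back-pointers     */
    };

    struct page_rmap {
        union {
            unsigned long *direct;   /* single mapper: the pte itself     */
            struct pte_chain *chain; /* shared page: chain of pte nodes   */
        } pte;
    };

    /* objrmap: the page points at the object (file) backing it; the vmas
     * hanging off the address_space are walked to derive the ptes. */
    struct vm_area;                  /* stand-in for vm_area_struct */

    struct address_space_objrmap {
        struct vm_area *i_mmap;        /* private mappings of the file */
        struct vm_area *i_mmap_shared; /* shared mappings of the file  */
    };

    struct page_objrmap {
        struct address_space_objrmap *mapping; /* object backing this page     */
        unsigned long index;                   /* page offset within the object */
    };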

Largepages Support

Most modern computer architectures support more than one page size. As an example, the IA-32 architecture supports either 4KB or 4MB pages; historically, however, the Linux operating system utilized largepages only for mapping the actual kernel image. Largepage usage is primarily intended to provide performance improvements for high-performance computing (HPC) and other memory-intensive applications. Any memory-access-intensive application that utilizes large amounts of virtual memory may obtain performance improvements by using largepages. (Linux utilizes either 2MB or 4MB largepages, AIX uses 16MB largepages, and Solaris uses 4MB.) The largepage performance improvements are attributable to reduced translation lookaside buffer (TLB) misses, because the TLB can map a larger virtual memory range, thereby increasing its reach. Largepages also improve the process of memory prefetching by eliminating the need to restart prefetch operations on 4KB boundaries.
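
For illustration, assuming a hypothetical data TLB with 64 entries, the TLB reach grows from 64 x 4KB = 256KB with base pages to 64 x 4MB = 256MB with IA-32 largepages, which dramatically reduces the likelihood of a TLB miss for large working sets.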

In Linux 2.6, the actual number of available largepages can be configured through the proc (/proc/sys/vm/nr_hugepages) interface. Because the actual allocation of largepages depends on the availability of physically contiguous memory regions, the recommendation is to launch the allocation at system startup. The core of the largepage implementation in Linux 2.6 is referred to as the hugetlbfs, a pseudo file system (implemented in fs/hugetlbfs/inode.c) based on ramfs. The idea behind the implementation is that largepages are used to back any file that exists in the file system. The actual pseudo file system is initialized at system startup and registered as an internal file system. A process may access largepages either through the shmget() interface, to set up a shared region that is backed by largepages, or by utilizing the mmap() call on a file that has been opened in the huge page file system. To operate on a file that uses largepages, a hugetlbfs file system has to be mounted first. After the mount operation completes, standard file operation calls such as open() can be used to set up a file descriptor. If the mmap() system call is used on an open file, the hugetlbfs-registered mmap() function creates the appropriate vma structure for the process. It is recommended that you use the /proc/meminfo interface to ensure that the size, as well as the number, of largepages is set up as expected, because fragmentation may prevent the allocation process from completing successfully.
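
As an illustration of the mmap() path just described, the following minimal user-space sketch maps a largepage-backed file. It assumes a hugetlbfs instance is already mounted at the hypothetical mount point /mnt/huge and that enough largepages were reserved via nr_hugepages; the file name and mapping length are illustrative.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define LENGTH (4UL * 1024 * 1024)   /* must be a multiple of the largepage size */

    int main(void)
    {
        int fd = open("/mnt/huge/scratch", O_CREAT | O_RDWR, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* The mmap() hook registered by hugetlbfs builds a vma that is
         * backed by largepages. */
        char *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED) {
            perror("mmap");   /* often a fragmentation or nr_hugepages issue */
            close(fd);
            return 1;
        }

        addr[0] = 1;          /* touch the mapping */
        munmap(addr, LENGTH);
        close(fd);
        return 0;
    }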

Page Allocation and Replacement

Physical page allocation in Linux 2.6 is still based on a buddy algorithm, but the mechanism itself has been extended to resemble a lazy buddy scheme that defers the coalescing part. Releasing a buffer under a buddy scheme involves a two-step process: the first step consists of pushing the buffer back onto the free list, making it available to other requests; the second step involves marking the buffer as free (normally in a bitmap) and combining it (if feasible) with adjacent buffers. A regular buddy system implementation performs both steps on each release. The lazy buddy scheme performs the first step but defers the second step, executing it depending on the state of the buffer class. The Linux 2.6 implementation of this lazy buddy concept utilizes per-CPU page lists. These pagesets contain two lists of pages (marked as hot and cold), where the hot pages have been used recently and are expected to still be present in the CPU cache. On allocation, the pageset for the running CPU is consulted first to determine whether pages are available; if so, the allocation proceeds from there. To determine when the pageset should be drained or replenished, a low and high watermark-based approach is used. As the low watermark is reached, a batch of pages is allocated and placed on the list; as the high watermark is reached, a batch of pages is freed at once. As a result of the new and improved implementation, the spinlock overhead is significantly reduced.
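
A minimal sketch of the per-CPU pageset idea follows; the structure layout and helper names are illustrative (refill_from_buddy() is hypothetical), not the kernel's own code.

    /* Simplified sketch of the per-CPU pageset scheme described above. */

    struct page { struct page *next; };    /* stand-in for struct page */

    struct per_cpu_pages {
        int count;             /* pages currently on this list                 */
        int low, high;         /* watermarks: refill below low, drain above high */
        int batch;             /* pages moved per refill or drain              */
        struct page *list;     /* free pages, most recently freed first        */
    };

    struct per_cpu_pageset {
        struct per_cpu_pages hot;   /* likely still in the CPU cache */
        struct per_cpu_pages cold;  /* cache-cold pages              */
    };

    /* Hypothetical helper: would take the zone spinlock once and move
     * 'batch' pages between the buddy free lists and the pageset. */
    static void refill_from_buddy(struct per_cpu_pages *pcp) { (void)pcp; }

    /* Allocation consults the running CPU's hot list first, so the zone
     * spinlock is only touched once per batch instead of once per page. */
    struct page *pcp_alloc(struct per_cpu_pageset *ps)
    {
        struct per_cpu_pages *pcp = &ps->hot;
        struct page *page;

        if (pcp->count <= pcp->low)
            refill_from_buddy(pcp);

        page = pcp->list;
        if (page) {
            pcp->list = page->next;
            pcp->count--;
        }
        return page;
    }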

In the 2.4 Linux kernel, kswapd, a global daemon, acted as the page replacement activator. kswapd was responsible for keeping a certain amount of memory available, and under memory pressure it would initiate the process of freeing pages. In Linux 2.6, a kswapd daemon is available for each node in the system. The idea behind this design is to avoid situations where kswapd would have to free pages on behalf of a remote node. In Linux 2.4, a spinlock had to be acquired before removing pages from the LRU list. In Linux 2.6, operations involving the LRU list take place via pagevec structures. The pagevec structures either add or remove pages to or from the LRU lists in sets of up to PAGEVEC_SIZE pages. In the case of removing pages, the zone lru_lock lock is acquired and the pages are placed on a temporary list. After the list of pages to be removed is assembled, a call to shrink_list() is initiated. The actual process of freeing pages can now be performed without holding the zone lru_lock spinlock. In the case of adding pages, a new page vector struct is initialized via pagevec_init(). Pages are added to the vector through pagevec_add() and then are committed to being placed onto the LRU list via pagevec_release().
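
The batched insertion path might be sketched as follows (kernel context). The pagevec calls shown exist in 2.6; which call ultimately drains the batch onto the LRU follows the description above and may differ between releases.

    #include <linux/mm.h>
    #include <linux/pagevec.h>

    static void add_pages_to_lru(struct page **pages, int nr)
    {
        struct pagevec pvec;
        int i;

        pagevec_init(&pvec, 0);          /* 0: treat the pages as cache-hot */

        for (i = 0; i < nr; i++) {
            /* pagevec_add() returns the slots left; once PAGEVEC_SIZE
             * pages have been collected, commit the whole batch at once. */
            if (!pagevec_add(&pvec, pages[i]))
                pagevec_release(&pvec);
        }
        pagevec_release(&pvec);          /* flush any remaining pages */
    }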

Slab Allocator

Linux, like most other UNIX operating systems, provides a general-purpose memory allocator that supports arbitrarily sized memory allocations. The basic objective of any slab allocator is to efficiently allocate small memory buffers that are normally less than a page, or not an even multiple of a page size, in length. Another objective of a slab allocator is to operate effectively on memory objects that are frequently allocated and freed, a pattern that may lead to memory fragmentation. The slab allocator allows caching these frequently used objects (such as sock, inode, or mm_struct), which minimizes the overhead of initializing and destroying these basic entities every time a new object is requested. Many of the changes in the slab allocator for Linux 2.6 aim at addressing lock contention scenarios. In addition, Linux 2.6 introduces a slab shrinker callback feature. In Linux 2.4, kmem_cache_reap() is called in low-memory situations. In Linux 2.6, this setup has been augmented: caches register a shrinker callback via set_shrinker(). The set_shrinker() function populates a struct with a pointer to the callback and a weight that indicates the complexity of re-creating the object. During a page reclaim scenario, each shrinker is called twice. The first call passes 0 as a parameter, indicating that the callback should only report how many pages could be freed. A basic heuristic is applied next to determine whether it is worth the cost of using the callback. If so, the callback is called again with a parameter that indicates the number of objects to free.
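
A hedged sketch of registering such a callback through the early-2.6 set_shrinker() interface follows; the cache being shrunk (my_cache_*) is hypothetical, and the callback signature may vary between 2.6 releases.

    #include <linux/init.h>
    #include <linux/mm.h>

    /* Hypothetical cache bookkeeping helpers (real logic omitted). */
    static int my_cache_count(void) { return 0; }
    static int my_cache_free(int nr) { return nr; }

    /* Called during page reclaim: first with nr_to_scan == 0, asking how
     * many objects could be freed, then (if deemed worthwhile) with the
     * number of objects to free. Returns the remaining object count. */
    static int my_cache_shrink(int nr_to_scan, unsigned int gfp_mask)
    {
        if (nr_to_scan)
            my_cache_free(nr_to_scan);
        return my_cache_count();
    }

    static struct shrinker *my_shrinker;

    static int __init my_cache_init(void)
    {
        /* DEFAULT_SEEKS acts as the weight: the relative cost of
         * re-creating one of the cached objects. */
        my_shrinker = set_shrinker(DEFAULT_SEEKS, my_cache_shrink);
        return my_shrinker ? 0 : -1;
    }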

VM Tunables

The following definitions summarize the VM tunables for Linux 2.6, provide some insight into the objectives of adjusting the parameters, and discuss the implications of doing so. All the parameters can be adjusted through the proc (/proc/sys/vm) file system interface; a small example of adjusting a tunable through this interface follows the list. The parameters are discussed in alphabetical order, as they appear in the proc file system.

  • The dirty_background_ratio parameter determines the fraction of dirty memory that can be held before a background writeback operation is attempted. In Linux 2.6, a pool of pdflush kernel threads is responsible for the VM writeback operations, as well as for periodically synchronizing file system metadata. If the background writeback percentage is exceeded, the pdflush daemon processes the writeback operations asynchronously.

  • The dirty_expire_centisecs parameter governs the number of centiseconds that data is allowed to remain dirty, a time period determined by consulting the timestamp on all files that have dirty pages cached in memory. This parameter represents an upper bound and ensures that the data will eventually be written out to the disk subsystem. A centisecond represents a one-hundredth-of-a-second time unit.

  • The dirty_ratio parameter indicates the fraction of memory that represents an excessive amount of dirty memory. As a task performs file write operations in an environment where there is an excessive amount of dirty memory, the system is forced to write out dirty memory pages until the percentage of dirty pages is less than the threshold specified on the system. This is significant because I/O operations are considered high-latency. A task's performance is impacted if it has to wait on writeback operations to be completed before the system can satisfy an allocation request. This is predominantly considered an issue on large-memory systems.

  • The dirty_writeback_centisecs parameter outlines the time interval between writebacks by the pdflush kernel threads (in centiseconds). Whereas the already discussed dirty_expire_centisecs parameter governs the age of the dirty memory pages, the dirty_writeback_centisecs parameter identifies how often the pdflush threads are invoked. The recommendation is to specify the two parameters in a way that dirty_writeback_centisecs is smaller than dirty_expire_centisecs (a ratio of 1:6 should be reasonable for most workloads). Larger systems that experience a kernel-intensive workload may benefit from increasing the time as well as frequency intervals (but still obeying the 1:6 ratio), because the background writeback operations may interfere with the workload's own performance behavior.

  • The lower_zone_protection parameter identifies the weight assigned to a discouraging factor for memory zone fallbacks. The Linux operating system decomposes physical memory into three distinct zones. The ZONE_DMA section is addressable by (legacy) I/O devices, the ZONE_NORMAL by the kernel, and the ZONE_HIGHMEM is used for user space and certain kernel (temporary) buffers. The basic philosophy behind this concept is that the ZONE_DMA section can be utilized for the same purpose as ZONE_NORMAL, and that the ZONE_NORMAL section can be used for the same purpose as ZONE_HIGHMEM, but not vice versa. As the ZONE_HIGHMEM section becomes saturated, ZONE_NORMAL can be used (through a fallback operation) for allocation purposes. ZONE_NORMAL is the only section that can be used for kernel data structures and is vital to the system. In Linux, a heuristic called the incremental min is used to discourage the discussed fallback to vital memory zones (at least ZONE_NORMAL). Setting lower_zone_protection to the default 0 disables the incremental min; setting the parameter to 4 quadruples its effect. This means the parameter now governs the aggressiveness of the lower zone defense algorithm. The lower_zone_protection tunable lets you increase the amount of protection that lower zones receive against allocations that could potentially use higher zones in the system.

  • The min_free_kbytes parameter specifies the size of the memory pool that is reserved for urgent allocations, such as those initiated by an interrupt handler. On larger systems, this parameter impacts the ratio of memory available to interrupt-time as opposed to runtime allocations. On smaller memory systems, the parameter is kept small to avoid allocation failures. The general consensus is that the low watermark parameter is used to determine when to reject user space allocations and to initiate page replacement scenarios. The value should therefore not be set to any large fraction of physical memory. This holds true especially if user space operations are expected to be the primary consumer of the memory subsystem.

  • The nr_pdflush_threads parameter governs the current size of the pdflush (writeback) thread pool. The VM subsystem utilizes a dynamic algorithm to determine the number of actual threads. The focus is on reserving a certain number of pdflush threads in case of a low-memory situation, where the dynamic allocation of additional threads may fail. The lower and upper bounds, referred to in Linux as MIN_PDFLUSH_THREADS and MAX_PDFLUSH_THREADS, are set to 2 and 8, respectively. Increasing the maximum value (and recompiling the kernel) may be beneficial in an environment with many disk drives and a seek-bound workload, to increase the VM I/O concurrency. Another situation where increasing the maximum value may improve performance is a writeback-related CPU-bound situation in which additional processors are available in the system and could be used to process the aggregate workload.

  • The overcommit_memory parameter represents a flag that enables memory overcommitment. When the flag is set to the default 0, the kernel initiates a check before each malloc() call to ensure that sufficient memory is available. If the flag is set to 1, the system pretends that there is always enough memory. When set to 0, the flag ensures that only as much user virtual memory is allocated as can be dealt with through the page replacement infrastructure, meaning that every page has to have a backing store. Anonymous memory pages require swap space, whereas file pages require disk space that can be reserved. Setting the flag to 1 results in more effective memory utilization behavior; however, under certain workload situations it may lead to scenarios where some processes get killed. To recap, setting the flag to 0 results in the kernel performing heuristic memory overcommit handling by estimating the amount of memory available and failing requests that are blatantly invalid.

    Unfortunately, because memory is allocated using a heuristic rather than a precise algorithm, this setting can sometimes allow the available memory on a system to be overloaded. Setting the flag to 1 results in the kernel performing no memory overcommit handling. Under this setting, the potential for memory overload is increased, but so is performance for memory-intensive tasks (such as those executed by some scientific applications). An additional supported case is to set the flag to 2. In this case, the kernel fails any memory request that would bring the total committed address space above the size of swap space plus the percentage of physical RAM specified in the overcommit_ratio parameter, discussed next. In Linux, setting the flag to 2 reduces the risk of encountering a memory overcommit situation.

  • The overcommit_ratio parameter specifies the percentage of physical memory that is considered when the overcommit_memory parameter (just discussed) is set to 2. The default value is set to 50.

  • The page-cluster parameter represents an integer value; 2 raised to the power of this value identifies the number of pages the kernel reads in at once from swap (the actual swap read-ahead window). The default values of 2 for systems with less than 16MB of memory and 3 for larger systems are considered reasonable settings in most circumstances. The parameter is used to improve page I/O efficiency, but the system may encounter excessive I/O and memory consumption if the parameter is set too high.

  • The swappiness parameter indicates the relative preference of the VM subsystem to reclaim pages by unmapping and paging out pages, as opposed to reclaiming only pages that are not mapped by any process. The actual decision depends on a binary switch that is activated when half the percentage of memory that is mapped into the processes' page tables, plus the value of swappiness, exceeds 100. The system further utilizes a distress factor in the decision process that increases by a factor of 2 every time page replacement has to be retried. If the system encounters a situation where it is difficult to locate unmapped pages that can be reclaimed, the VM subsystem reverses its course and starts paging out anonymous memory. A preference for not paging out anonymous memory can be expressed by specifying a lower value for swappiness. If kswapd is utilizing significant CPU resources or the system is spending time in an iowait state, increasing the value for swappiness may increase overall system performance.
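
As referenced before the list, the tunables can be read and written like ordinary files under /proc/sys/vm. The following minimal user-space sketch adjusts one of them; the chosen tunable and value (swappiness = 30) are purely illustrative, and root privileges are required.

    #include <stdio.h>

    static int write_vm_tunable(const char *name, const char *value)
    {
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/sys/vm/%s", name);

        f = fopen(path, "w");
        if (!f) {
            perror(path);
            return -1;
        }
        fprintf(f, "%s\n", value);
        return fclose(f);
    }

    int main(void)
    {
        /* Equivalent to: echo 30 > /proc/sys/vm/swappiness */
        return write_vm_tunable("swappiness", "30") ? 1 : 0;
    }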

CPU Scheduler

The thread scheduler is the subsystem responsible for distributing the available CPU resources among the runnable threads. The scheduler's behavior and decision-making process have a direct correlation to thread fairness and scheduling latency. Fairness can be described as the capability of all threads not only to make forward progress but to do so in an even manner. The opposite of fairness is known as starvation, a scenario in which a given thread makes no forward progress. Fairness is frequently hard to achieve because it represents an exercise in compromise between global and localized performance. Scheduling latency describes the actual delay between a thread entering the runnable state (TSRUN) and actually running on a processor (TSONPROC). Inefficient scheduling latency behavior results in a perceptible delay in application response time. An efficient and effective CPU scheduler is paramount to the operation of the computing platform; it decides which task (thread) to run at what time and for how long. The scheduler distinguishes between real-time and time-shared jobs, each class revealing different objectives; both are implemented through different scheduling disciplines embedded in the scheduler.

Linux assigns a static priority to each task; this priority can be modified through the nice() interface. Linux has a range of priority classes, distinguishing between real-time and time-sharing tasks. The lower the priority value, the higher a task's logical priority, or in other words, its general importance. In this context, the discussion always references the logical priority when elaborating on priority increases and decreases. Real-time tasks always have a higher priority than time-sharing tasks.

The Linux 2.6 scheduler, referred to as the O(1) scheduler, is a multiqueue scheduler that assigns an actual run-queue to each CPU, promoting a CPU local scheduling approach. The O(1) label describes the time complexity for retrieval; in other words, it refers to the property that no matter the workload on the system, the next task to run will be chosen in a constant amount of time.

The previous incarnation of the Linux scheduler utilized the concept of goodness to determine which thread to execute next. All runnable tasks were kept on a single run-queue that represented a linked list of threads in a TSRUN (TASK_RUNNABLE) state. In Linux 2.6, the single run-queue lock was replaced with a per-CPU lock, ensuring better scalability on SMP systems. The per-CPU run-queue scheme adopted by the O(1) scheduler decomposes the run-queue into a number of buckets (in priority order) and utilizes a bitmap to identify the buckets that hold runnable tasks. Locating the next task to execute requires reading the bitmap to identify the first bucket with runnable tasks and then choosing the first task in that bucket's run-queue.

The per-CPU run-queue consists of two vectors of task lists, labeled as the active vector and the expired vector. Each vector index represents a list of runnable tasks, each at its respective priority level. After executing for a period of time, a task moves from the active list to the expired list to ensure that all runnable tasks get an opportunity to execute. As the active array becomes empty, the expired and active vectors are swapped by modifying the pointers.
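
A simplified sketch of this selection path follows; the structure layout and the linear bit scan are illustrative stand-ins for the kernel's prio_array and its constant-time sched_find_first_bit() helper, not the actual implementation.

    #include <stddef.h>

    #define MAX_PRIO 140     /* illustrative: 100 real-time + 40 time-sharing levels */
    #define BITS_PER_LONG (8 * sizeof(unsigned long))

    struct task { struct task *next; int prio; };

    struct prio_array {
        unsigned long bitmap[(MAX_PRIO + BITS_PER_LONG - 1) / BITS_PER_LONG];
        struct task *queue[MAX_PRIO];   /* one runnable-task list per priority */
    };

    struct runqueue {
        struct prio_array *active;
        struct prio_array *expired;
    };

    /* Find the highest-priority non-empty bucket. The kernel uses an
     * architecture-optimized, constant-time bit search instead of a loop. */
    static int first_set_bit(const unsigned long *map, int bits)
    {
        int i;
        for (i = 0; i < bits; i++)
            if (map[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)))
                return i;
        return bits;
    }

    struct task *pick_next(struct runqueue *rq)
    {
        int idx = first_set_bit(rq->active->bitmap, MAX_PRIO);

        /* When the active array runs dry, swap the two array pointers. */
        if (idx == MAX_PRIO) {
            struct prio_array *tmp = rq->active;
            rq->active = rq->expired;
            rq->expired = tmp;
            idx = first_set_bit(rq->active->bitmap, MAX_PRIO);
        }
        return (idx < MAX_PRIO) ? rq->active->queue[idx] : NULL;
    }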

Occasionally, a load-balancing algorithm is invoked to rebalance the run-queues to ensure that a similar number of tasks are available per CPU. As mentioned, the scheduler has to decide which task to run next and for how long. Time quanta in the Linux kernel are defined as multiples of a system tick. A tick is defined as the fixed delta (1/HZ) between two consecutive timer interrupts. In Linux 2.6, the HZ parameter is set to 1,000, indicating that the interrupt routine scheduler_tick() is invoked once every millisecond, at which time the currently executing task is charged with one tick. Setting the HZ parameter to 1,000 does not improve the system's responsiveness, because the actual scheduler timeslice is not affected by this setting. For systems that primarily execute in a number-crunching mode, the HZ=1,000 setting may not be appropriate. In such an environment, the aggregate system's performance could benefit from setting the HZ parameter to a lower value (around 100).

Besides the static priority (static_prio), each task maintains an effective priority (prio). The distinction is made to account for certain priority bonuses or penalties based on the recent sleep average (sleep_avg) of a given task. The sleep average accounts for the number of ticks during which a task was recently descheduled. The effective priority of a task determines its location in the priority list of the run-queue. A task is declared interactive when its effective priority exceeds its static priority by a certain level, a scenario that can only result from the task accumulating sleep average ticks. The interactivity estimator framework embedded in Linux 2.6 operates automatically and transparently. In Linux, high-priority tasks reach the interactivity state with a much smaller sleep average than do lower-priority tasks. Because a task's interactivity is estimated via the sleep average, I/O-bound tasks are potential candidates to reach the interactivity status, whereas CPU-bound tasks normally are not perceived as interactive tasks.

The actual timeslice is defined as the maximum time a task can execute without (voluntarily) yielding the CPU to another task, and it is simply a linear function of the static priority of normal tasks. The priority is projected onto a range of MIN_TIMESLICE to MAX_TIMESLICE, where the default values are set to 10 and 200 milliseconds, respectively. The higher the priority of a task, the greater the task's timeslice. For every timer tick, the running task's timeslice is decremented. If it is decremented to 0, the scheduler replenishes the timeslice, recomputes the effective priority, and re-enqueues the task either into the active vector, if the task is classified as interactive, or into the expired vector, if the task is considered noninteractive. At this point, the system dispatches another task from the active array. This scenario ensures that tasks in the active array are executed first, before any expired task has the opportunity to run again. If a task is descheduled, its timeslice is not replenished at wakeup time; however, its effective priority might have changed because of any accumulated sleep time.

If all the runnable tasks have exhausted their timeslices and have been moved to the expired list, the expired and active vectors are swapped. This technique ensures that the O(1) scheduler does not have to traverse a potentially large list of tasks, as is required in the Linux 2.4 scheduler. The issue with the new design is that, because of the interactivity logic, scenarios may arise where the active queue continues to hold runnable tasks that are never migrated to the expired list, and tasks on the expired list may starve for CPU time. To circumvent this starvation issue, if the task that first migrated to the expired list is older than the STARVATION_LIMIT (which is set to 10 seconds), the active and expired arrays are switched anyway.
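
A small sketch of the linear priority-to-timeslice mapping described above follows, using the 10 ms and 200 ms bounds given in the text; the exact constants and rounding in the kernel may differ.

    #include <stdio.h>

    #define MIN_TIMESLICE_MS 10
    #define MAX_TIMESLICE_MS 200
    #define MAX_PRIO         140   /* priorities 0..139; 100..139 are time-sharing */
    #define MAX_USER_PRIO    40    /* number of time-sharing (nice) levels */

    /* Higher logical priority (lower static_prio value) yields a longer timeslice. */
    static int task_timeslice_ms(int static_prio)
    {
        return MIN_TIMESLICE_MS +
               (MAX_TIMESLICE_MS - MIN_TIMESLICE_MS) *
               (MAX_PRIO - 1 - static_prio) / (MAX_USER_PRIO - 1);
    }

    int main(void)
    {
        /* nice 0 corresponds to static_prio 120 in the 2.6 numbering. */
        printf("nice -20 -> %d ms\n", task_timeslice_ms(100)); /* ~200 ms */
        printf("nice   0 -> %d ms\n", task_timeslice_ms(120)); /* ~100 ms */
        printf("nice +19 -> %d ms\n", task_timeslice_ms(139)); /* ~10 ms  */
        return 0;
    }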
