Maui's real power is unleashed when the defaults are replaced with more advanced configuration. Specifically, sites can map mission objectives into scheduling policies: selecting how resources are to be used, how users are to be treated, and how jobs are to be scheduled. To this end, Maui can be thought of as an integrated scheduling toolkit providing a number of capabilities that may be used individually or in concert to obtain the desired system behavior. These capabilities include
job prioritization,
node allocation policies,
throttling policies,
fairshare,
reservations,
quality of service,
backfill,
node sets, and
preemption.
Most capabilities are disabled by default; thus, a site need configure only the features of interest. In the following subsections, we describe each of these capabilities. While our description will be adequate for configuring these capabilities, the online Maui Administrators Manual should be consulted for full details.
In general, prioritization is the process of determining which of many options best fulfills overall goals. In the case of scheduling, a site often has multiple, independent goals such as maximizing system utilization, giving preference to users in specific projects, or making certain that no job sits in the queue for more than a given period of time. The most common approach to representing a multifaceted set of site goals is to assign to each objective an overall weight (value or priority) that can be associated with each potential scheduling decision. With the jobs prioritized, the scheduler can roughly fulfill site objectives by starting the jobs in priority order.
Maui allows component and subcomponent weights to be associated with many aspects of a job. In order to realize this fine-grained control, a simple priority-weighting hierarchy is used in which the contribution of priority components is calculated as PRIORITY-FACTOR-VALUE * SUBFACTORWEIGHT * FACTORWEIGHT. Component and subcomponent weights are listed in Table 16.1. Values for all weights may be set in the 'maui.cfg' file by using the associated component-weight parameter specified as the name of the weight followed by the string WEIGHT (e.g., SERVICEWEIGHT or PROCWEIGHT).
SERVICE (Level of service)
    QUEUETIME (Current queue time in minutes)
    XFACTOR (Current expansion factor)
    BYPASS (Number of times the job was bypassed via backfill)
TARGET (Proximity to service target, exponential)
    TARGETQUEUETIME (Delta to queue-time target in minutes)
    TARGETXFACTOR (Delta to Xfactor target)
RESOURCE (Resources requested)
    PROC (Requested processors)
    MEM (Requested memory in MBytes)
    SWAP (Requested virtual memory in MBytes)
    DISK (Requested local disk in MBytes)
    NODE (Requested number of nodes)
    WALLTIME (Requested wall time in seconds)
    PS (Requested processor-seconds)
    PE (Requested processor-equivalents)
FS (Fairshare utilization)
    FSUSER (User fairshare percentage)
    FSGROUP (Group fairshare percentage)
    FSACCOUNT (Account fairshare percentage)
    FSCLASS (Class fairshare percentage)
    FSQOS (QoS fairshare percentage)
CRED (Job credentials)
    USER (User priority)
    GROUP (Group priority)
    ACCOUNT (Account priority)
    CLASS (Class priority)
    QOS (QoS priority)
By default, Maui prioritizes jobs exclusively on their submission time. By using priority components, however, a site can incorporate additional information, such as current level of service, quality of service targets, resources requested, and historical usage. The contribution of any single component can be limited by specifying a priority component cap, such as RESCAP, which prevents the contribution of a single component from exceeding the specified value. In the end, a job's priority is equivalent to the sum of all enabled priority components.
Each component or subcomponent may be used for different purposes. For example, WALLTIME can be used to favor (or disfavor) jobs based on their duration; ACCOUNT can be used to favor jobs associated with a particular project; QUEUETIME can be used to favor those jobs that have been waiting the longest. By mixing and matching priority weights, sites can obtain the desired job-start behavior. To aid in tuning job priority, Maui provides the diagnose -p command, which summarizes the impact of the current priority-weight settings on idle jobs.
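The weighting hierarchy described above can be sketched as follows. This is an illustrative model only, not Maui's internal implementation; the job data, the particular weight values, and the component/subcomponent mapping are hypothetical, but the arithmetic mirrors the PRIORITY-FACTOR-VALUE * SUBFACTORWEIGHT * FACTORWEIGHT rule and the RESCAP-style component cap.

```python
# Sketch of Maui-style two-level priority weighting (hypothetical data; weight
# names mirror maui.cfg parameters such as SERVICEWEIGHT and QUEUETIMEWEIGHT).

# Component weights (FACTORWEIGHT) and subcomponent weights (SUBFACTORWEIGHT)
weights = {"SERVICE": 1, "RESOURCE": 1}
subweights = {"QUEUETIME": 10, "XFACTOR": 100, "PROC": 5, "MEM": 0}  # MEM=0 disables it
# Map each subcomponent to its parent component
parent = {"QUEUETIME": "SERVICE", "XFACTOR": "SERVICE",
          "PROC": "RESOURCE", "MEM": "RESOURCE"}
RESCAP = 2000  # cap on the RESOURCE component's total contribution

def job_priority(subfactors):
    """Sum FACTOR-VALUE * SUBFACTORWEIGHT * FACTORWEIGHT per component, capping RESOURCE."""
    per_component = {}
    for name, value in subfactors.items():
        comp = parent[name]
        per_component[comp] = (per_component.get(comp, 0)
                               + value * subweights[name] * weights[comp])
    if "RESOURCE" in per_component:
        per_component["RESOURCE"] = min(per_component["RESOURCE"], RESCAP)
    return sum(per_component.values())

# A job queued 30 minutes, expansion factor 1.5, requesting 64 processors:
# SERVICE contributes 30*10 + 1.5*100 = 450; RESOURCE contributes 64*5 = 320.
print(job_priority({"QUEUETIME": 30, "XFACTOR": 1.5, "PROC": 64}))  # -> 770.0
```

Setting a subcomponent weight to zero, as with MEM above, removes that factor from consideration entirely, which is why disabled-by-default components cost nothing until configured.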
While most subcomponents are metric based (i.e., number of seconds queued or number of nodes requested), the credential subcomponents are based on priorities specified by the administrator. Maui allows use of the *CFG parameters to rank jobs by individual job credentials. For example, to favor jobs submitted by users bob and john and members of the group staff, a site might specify the following:
USERCFG[bob]     PRIORITY=100
USERCFG[john]    PRIORITY=500
GROUPCFG[staff]  PRIORITY=1000
USERWEIGHT       1
GROUPWEIGHT      1
CREDWEIGHT       1
Note that both component and subcomponent weights are specified to enable these credential priorities to take effect. Further details about the use of these component factors, as well as anecdotal usage information, are available in the Maui Administrators Manual.
Complementing the specification of job prioritization is that of node allocation. When the scheduler selects a job to run, it must also determine which resources to allocate to the job. Depending on the use of the cluster, different allocation policies can be specified using NODEALLOCATIONPOLICY. Parameter values include the following:
MINRESOURCE: This algorithm selects the nodes with the minimum configured resources that still meet the requirements of the job. The algorithm leaves more richly endowed nodes available for other jobs that may specifically request these additional resources.
LASTAVAILABLE: This algorithm is particularly useful when making reservations for backfill. It determines the earliest time a job can run and then selects the resources available at a time such that, whenever possible, currently idle resources are left unreserved and are thus available for backfilling.
PRIORITY: This policy allows a site to create its own node allocation prioritization scheme, taking into account issues such as installed software, jobs currently running on the node, available processors, or other local node configurations. This allocation policy requires specification of the PRIORITYF attribute of the NODECFG parameter. For example, to base node allocation priority on available node memory, historical utilization, and machine speed, a site may specify something like NODECFG[DEFAULT] PRIORITYF='AMEM - 10*USAGE + SPEED'.
CPULOAD: This policy attempts to allocate the most lightly loaded nodes first.
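The MINRESOURCE policy above can be illustrated with a short sketch. The node inventory and resource fields here are hypothetical; the point is only the selection rule: filter to nodes that satisfy the job, then prefer the least-endowed ones so richer nodes remain free for jobs that need them.

```python
# Sketch of MINRESOURCE-style allocation: among feasible nodes, prefer those
# with the least configured resources. Node data is hypothetical.

nodes = [
    {"name": "node01", "procs": 2, "mem": 512},
    {"name": "node02", "procs": 4, "mem": 2048},
    {"name": "node03", "procs": 2, "mem": 1024},
]

def allocate_minresource(nodes, need_procs, need_mem, count):
    # Keep only nodes that meet the job's requirements
    feasible = [n for n in nodes if n["procs"] >= need_procs and n["mem"] >= need_mem]
    # Least-endowed first: sort by configured processors, then memory
    feasible.sort(key=lambda n: (n["procs"], n["mem"]))
    return [n["name"] for n in feasible[:count]]

# A 2-node job needing 2 processors and 512 MB per node skips the big node02.
print(allocate_minresource(nodes, 2, 512, 2))  # -> ['node01', 'node03']
```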
The next issue often confronting sites is the management of fairness. At first glance, fairness seems like a simple concept, but in actual practice it can be very difficult to map onto a cluster. Should all users get to run the same number of jobs or use the same number of nodes? Do these usage constraints cover the present time only or a specified time frame? If historical information is used, what is the metric of consumption? What is the time frame? Does fair consumption necessarily mean equal consumption? How should resources be allocated if user X bought two-thirds of the nodes and user Y purchased the other third? Is fairness based on a static metric, or is it conditional on current resource demand?
While Maui is not able to address all these issues, it does provide some flexible tools that help with 90 percent of the battle. Specifically, these tools are throttling policies and fairshare used to control immediate and historical usage, respectively.
The term "throttling policies" is collectively applied to a set of policies that constrain real-time resource consumption. Maui supports limits on the number of processors, nodes, proc-seconds, jobs, and processor equivalents allowed at any given time. Limits may be applied on a per user, group, account, QoS, or queue basis via the *CFG set of parameters. For example, specifying USERCFG[bob] MAXJOB=3 MAXPROC=32 will constrain user bob to running no more than 3 jobs and 32 total processors at any given time. Specifying GROUPCFG[DEFAULT] MAXNODE=64 will limit each group to using no more than 64 nodes simultaneously unless overriding limits for a particular group are specified. ACCOUNTCFG, QOSCFG, and CLASSCFG round out the *CFG family of parameters providing a means to throttle instantaneous use on accounts, QoS's, and classes, respectively.
With each of the parameters, hard and soft limits can be used to apply a form of demand-sensitive limits. While hard limits cannot be violated under any conditions, soft limits may be violated if no other jobs can run. For example, specifying USERCFG[DEFAULT] MAXNODE=16,24 will normally allow each user to cumulatively allocate up to 16 nodes, leaving the remaining resources available to jobs from other users. If no other jobs can use these resources, a user may run on up to 24 nodes simultaneously.
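The soft/hard limit decision can be sketched as a simple predicate. This is an illustrative model, not Maui's code; the limit values match the MAXNODE=16,24 example above, and the "other jobs waiting" flag stands in for Maui's fuller test of whether any other queued job could use the idle resources.

```python
# Sketch of hard/soft throttling limits (USERCFG[DEFAULT] MAXNODE=16,24 style).
# A job may exceed the soft limit only when no other queued job could run.

SOFT, HARD = 16, 24

def may_start(user_nodes_in_use, job_nodes, other_jobs_waiting):
    total = user_nodes_in_use + job_nodes
    if total <= SOFT:
        return True       # within the soft limit: always allowed
    if total <= HARD and not other_jobs_waiting:
        return True       # soft limit exceeded only when nothing else can run
    return False          # the hard limit can never be violated

assert may_start(10, 6, other_jobs_waiting=True)        # 16 nodes: within soft limit
assert not may_start(10, 8, other_jobs_waiting=True)    # 18 > soft, others waiting
assert may_start(10, 8, other_jobs_waiting=False)       # 18 <= hard, nothing else can run
assert not may_start(20, 8, other_jobs_waiting=False)   # 28 > hard limit: never allowed
```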
Throttling policies are effective in preventing cluster "hogging" by an individual user or group. They also provide a simple mechanism of fairness and cycle distribution. Such policies may lead to lower overall system utilization, however. For instance, resources might go unused if these policies prevent all queued jobs from running. When possible, throttling policies should be set to the highest feasible level, and the cycle distribution should be managed by tools such as fairshare, allocation management systems, and QoS-based prioritization.
Fairshare algorithms attempt to distribute resources over time according to specified usage targets. As noted earlier, however, this general statement leaves much to interpretation, including the distribution usage metric and the monitored time frame.
The Maui parameter FSPOLICY specifies the usage metric allowing sites to determine how resource distribution is to be measured. The parameters FSINTERVAL, FSDEPTH, and FSDECAY control how historical usage information is to be weighted.
To control resource distribution, Maui uses fairshare targets that can be applied to users, groups, accounts, queues, and QoS mechanisms with both default and specific targets available. Each target may be one of four different types: target, floor, ceiling, or cap. In most cases, Maui adjusts job priorities to meet fairshare targets. With the standard target, Maui attempts to adjust priorities at all times in an attempt to meet the target. In the case of floors, Maui will increase job priority only to maintain at least the targeted usage. With ceilings, the converse occurs. Finally, with fairshare caps, job eligibility rather than job priority is adjusted to prevent jobs from running if the cap is exceeded during the specified fairshare interval.
The example below shows a possible fairshare configuration.
# maui.cfg
FSPOLICY          DEDICATEDPS
FSDEPTH           7
FSINTERVAL        24:00:00
FSDECAY           0.80
USERCFG[DEFAULT]  FSTARGET=10.0
USERCFG[john]     FSTARGET=25.0+
GROUPCFG[staff]   FSTARGET=20.0-
In this case, fairshare usage will track delivered system processor seconds over a seven-day period with a 0.8 decay factor. All users will have a fairshare target of 10 percent of these processor seconds—with the exception of john, who will have a floor of 25 percent. Also, the group staff will have a fairshare ceiling of 20 percent. At any time, the status of fairshare can be examined by using the diagnose -f command.
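The decay-weighted usage described above can be made concrete with a short sketch. The per-day usage figures are hypothetical; the calculation simply weights each 24-hour fairshare window by FSDECAY raised to the window's age, over FSDEPTH windows, matching the configuration in the example.

```python
# Sketch of decay-weighted fairshare usage: FSDEPTH=7 windows of FSINTERVAL
# (24 hours), each weighted by FSDECAY^age. Usage figures are hypothetical
# dedicated processor-seconds (the DEDICATEDPS metric).

FSDECAY, FSDEPTH = 0.80, 7

def effective_usage(per_interval):
    """Weight each interval's usage by decay^age; index 0 is the newest window."""
    return sum(u * FSDECAY**i for i, u in enumerate(per_interval[:FSDEPTH]))

user = [3600, 1800, 0, 7200, 0, 0, 900]          # one user's proc-seconds per day
system = [36000] * 7                              # total delivered per day

share = 100.0 * effective_usage(user) / effective_usage(system)
print(f"{share:.1f}% of delivered cycles (fairshare target: 10.0%)")
```

Because this user's decayed share falls below the 10 percent target, Maui would boost the priority of the user's queued jobs; a user above target would see the opposite adjustment.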
In managing any cluster system, half of the administrative effort involves configuring it to handle the steady-state situation. The other half encompasses the handling of the vast array of special one-time requests. Maui provides two features, advance reservations and QoS, which greatly ease the handling of these special requests.
Reservations allow a site to set aside a block of resources for various purposes such as cluster maintenance, special user projects, or benchmarking nodes. In general, a reservation consists of a time frame, a resource list, and an access control list. The time frame can be specified as a simple start and end time, while the resource list can consist of either a list of specific hosts or a general resource description. The access control list indicates who or what will be allowed to use the specified resources during the reservation time frame. Reservations can be created dynamically by scheduler administrators using the setres command or managed directly by Maui via config file parameters.
For example, to reserve nodeA and nodeB for a four-hour maintenance window starting at 2:30 P.M., one could issue the following command:
> setres -s 14:30 -d 4:00:00 'node[AB]'
For reservations requesting allocation of a given quantity of resources, the TASK keyword can be used in the resource description. For example, the following reservation allocates 20 processors with the feature fast to users john and sam starting on April 14 at 5:00 P.M.
> setres -u john:sam -f fast -s 17:00_04/14 TASKS==20
With no duration or end time specified, this reservation will default to an infinite length and will remain in place until removed by a scheduler administrator using the releaseres command.
Access to reservations is controlled by an access control list (ACL). Reservation access is based on job credentials, such as user or group, and job attributes, such as wall time requested. Reservation ACLs can include multiple access types and individuals. For example, a reservation might reserve resources for users A and B, jobs in class C, and jobs that request less than 30 minutes of wall time. Reservations may also overlap each other if desired, in which case access is granted only if the job meets the access policies of all active reservations.
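The two-level ACL logic described above can be sketched briefly. The rule modeled here, under stated assumptions about ACL semantics, is: a job matches a single reservation's ACL if it matches any entry (user, class, or wall-time bound), and a job may use resources covered by overlapping reservations only if it passes every active reservation's ACL. All names and limits are hypothetical.

```python
# Sketch of reservation ACL evaluation with overlapping reservations.
# All job and reservation data is hypothetical.

def matches_acl(job, acl):
    """A job gains access if it matches ANY entry in the reservation's ACL."""
    return (job["user"] in acl.get("users", ())
            or job["class"] in acl.get("classes", ())
            or job["walltime"] <= acl.get("max_walltime", -1))

def allowed(job, active_reservations):
    """Overlapping reservations: the job must pass EVERY active reservation's ACL."""
    return all(matches_acl(job, r) for r in active_reservations)

# Resources reserved for users A and B, class C jobs, and jobs under 30 minutes
res1 = {"users": ("A", "B"), "classes": ("C",), "max_walltime": 30 * 60}

job = {"user": "A", "class": "batch", "walltime": 3600}
print(allowed(job, [res1]))  # -> True (user A is on the ACL)

job2 = {"user": "X", "class": "batch", "walltime": 3600}
print(allowed(job2, [res1]))  # -> False (no ACL entry matches)
```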
At many sites, reservations are used on a permanent or periodic basis. In such cases, it is best to use standing reservations. Standing reservations allow a site to apply reservations as an ongoing part of cluster policies. The attributes of the SRCFG parameter are used to configure standing reservations. For example, to specify the periodicity of a given reservation, the SRCFG PERIOD attribute can be set to DAY, WEEK, or INFINITE. Additional parameter attributes are available to determine what time of the day or week the reservation should be enabled. To demonstrate, the following configuration can be used to create a reservation named development that, during primetime hours, will set aside 16 nodes for exclusive use by jobs requiring less than 30 minutes.
SRCFG[development] PERIOD=DAY DAYS=MON,TUE,WED,THU,FRI
SRCFG[development] STARTTIME=8:00:00 ENDTIME=17:00:00
SRCFG[development] TASKCOUNT=16 TIMELIMIT=00:30:00
Occasionally, a site may want to allow access to a set of resources only if there are no other resources available. Maui enables this conditional usage through reservation affinity. When any reservation access list is specified, each access value can be associated with positive, negative, or neutral affinity by using the "+", "-", or "=" characters. If nothing is specified, positive affinity is assumed. For example, consider the following reservation line:
SRUSERLIST[special] bob john steve= bill-
With this specification, bob's and john's jobs receive the default positive affinity and are essentially attracted to the reservation. For these jobs, Maui will attempt to use resources in the special reservation first, before considering any other resources. Jobs belonging to steve, on the other hand, can use these resources but are not attracted to them. Finally, bill's jobs will use resources in the special reservation only if no other resources are available. Detailed information about reservations can be obtained by using the showres and diagnose -r commands.
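The affinity behavior above amounts to an ordering over candidate resource pools, which can be sketched as follows. This is an illustrative model only; the two-pool setup is hypothetical, and the ranks encode the rule that "+" users try the reservation first, "=" users treat it as ordinary resources, and "-" users fall back to it last.

```python
# Sketch of reservation affinity ordering for the access list
# "bob john steve= bill-" (bob and john default to positive affinity).

AFFINITY = {"bob": "+", "john": "+", "steve": "=", "bill": "-"}

def resource_order(user, pools):
    """Order candidate resource pools by this user's affinity for the reservation."""
    rank = {"+": 0, "=": 1, "-": 2}
    aff = AFFINITY.get(user)
    def key(pool):
        if pool == "special":            # the reserved resources
            return rank.get(aff, 3)      # users not on the list rank last of all
        return 1                         # ordinary resources sort with neutral
    return sorted(pools, key=key)

print(resource_order("bob", ["general", "special"]))   # -> ['special', 'general']
print(resource_order("bill", ["general", "special"]))  # -> ['general', 'special']
```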
Allocation management systems allow a site to control total resource access in real time. While interfaces to support other systems exist, the allocation management system most commonly used with the Maui scheduler is QBank, provided by Pacific Northwest National Laboratory. This system and others like it allow sites to provide distinct resource allocations much like the creation of a bank account. As jobs run, the resources used are translated into a charge and debited from the appropriate account. In the case of QBank, expiration dates may be associated with allocations, private and shared accounts maintained, per-machine allocations created, and so forth.
Within Maui, the allocation manager interface is controlled through the AMCFG parameter such as in the example below:
AMCFG[qbank] TYPE=QBANK HOST=bank.univ.edu
AMCFG[qbank] CHARGEPOLICY=DEBITSUCCESSFULWC DEFERJOBONFAILURE=TRUE
AMCFG[qbank] FALLBACKACCOUNT=freecycle
This configuration enables a connection to an allocation manager located on bank.univ.edu using the QBank interface. The unit of charge is configured to be dedicated processor-seconds, and users are charged only if their job completes successfully. If the job does not have adequate allocations in the specified account, Maui will attempt to redirect the job to use allocations in the freecycle account. In many cases, a fallback account is configured so as to be associated with lower priorities and/or additional limitations. If the job is not approved by the allocation manager, Maui will defer the job for a period of time and try it again later.
Maui's Quality of Service (QoS) feature allows sites to control access to special functions, resources, and service levels. Each QoS consists of an access control list controlling which users, groups, accounts, and job queues can access the QoS privileges. Associated with each QoS are special service-related priority weights and service targets. Additionally, each QoS can be configured to span resource partitions, preempt other jobs, and the like.
Maui also enables a site to charge a premium rate for the use of some QoS services. For example, the following configuration will cause user john's jobs to use QoS hiprio by default and allow members of the group bio to access it by request:
USERCFG[john]   QLIST=hiprio:normal QDEF=hiprio
GROUPCFG[bio]   QLIST=hiprio:medprio:development QDEF=medprio
QOSCFG[hiprio]  PRIORITY=50 QTTARGET=30 FLAGS=PREEMPTOR
QOSCFG[hiprio]  OMAXJOB=20 MAXPROC=150
Jobs using QoS hiprio receive the following privileges and constraints:
A priority boost of 50 * QOSWEIGHT * CREDWEIGHT
A queue-time target of 30 minutes
The ability to preempt lower-priority PREEMPTEE jobs
The ability to override MAXJOB policy limits defined elsewhere
A cumulative limit of 150 processors allocated to QoS hiprio jobs
A site may have dozens of QoS objects described and may allow users access to any number of these. Depending on the type of service desired, users may then choose the QoS that best meets their needs.
The Maui scheduler provides several features to optimize performance in terms of system utilization, job throughput, and average job turnaround time.
Backfill is a now-common method used to improve both system utilization and average job turnaround time by running jobs out of order. Simply put, backfill enables the scheduler to run any job so long as it does not delay the start of higher-priority jobs. Generally, the algorithm prevents delay of high-priority jobs through some form of reservation. Backfill can be thought of as a process of filling in the resource holes left by the high-priority jobs. Since holes are being filled, it makes sense that the jobs most commonly backfilled are the ones requiring the least time and/or resources. With backfill enabled, sites typically report system utilization improvements of 10 to 25 percent and slight improvement in average job response time.
At installation, backfill scheduling is enabled in Maui, but this is configurable with the parameter BACKFILLPOLICY. While the default configuration generally is adequate, sites may want to adjust the job selection policy, the reservation policy, the depth of reservations, or other aspects of backfill scheduling. The online documentation indicates the general effects of changing the backfill algorithm or any of the associated backfill parameters.
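The reservation-protected backfill idea can be sketched in miniature. This is not Maui's algorithm, merely an illustration of the conservative variant under simplifying assumptions: the top-priority job receives a reservation at the earliest time enough nodes free up, and a lower-priority job may start now only if it fits in the currently idle nodes and finishes before that reservation begins. All job and node figures are hypothetical.

```python
# Sketch of conservative backfill on a hypothetical 8-node cluster.

TOTAL_NODES = 8

def backfill(running, queue, now=0):
    """running: [(end_time, nodes)]; queue: priority-ordered [(name, nodes, walltime)]."""
    free = TOTAL_NODES - sum(n for _, n in running)
    head_name, head_nodes, _ = queue[0]
    if head_nodes <= free:
        return [head_name]                  # the top-priority job starts immediately
    # Reserve the head job's start: walk running jobs in end-time order
    # until enough nodes have been released.
    avail, start = free, now
    for end, n in sorted(running):
        avail += n
        if avail >= head_nodes:
            start = end
            break
    # Backfill: start any lower-priority job that fits in the idle nodes
    # and completes before the reservation begins.
    started = []
    for name, nodes, wall in queue[1:]:
        if nodes <= free and now + wall <= start:
            started.append(name)
            free -= nodes
    return started

running = [(100, 4), (250, 2)]              # 6 of 8 nodes busy
queue = [("big", 8, 300), ("small", 2, 90), ("long", 2, 150)]
print(backfill(running, queue))  # -> ['small'] ('long' would fit but not finish in time)
```

Here "big" cannot start until time 250, so "small" fills the 2-node hole and, ending at time 90, causes no delay; liberal backfill variants relax the finish-before check against more than one reservation.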
While backfill can improve the scheduler's performance in terms of job selection, other facilities can be used to further optimize scheduling decisions. At a high level, the efficiency of a cluster, in terms of actual work accomplished, is a function of both scheduling performance and individual job efficiency. In many clusters, job efficiency can vary widely based on two key factors: node selection and node mix. Node selection reflects how well a single task of a job executes on a single node, while node mix accounts for performance changes resulting from communication issues or disparities in node performance.
Since most parallel jobs written using popular message-passing libraries such as MPI or PVM do not internally load balance their workload, they often run only as fast as the slowest node allocated. Consequently, these jobs run most effectively on homogeneous sets of nodes. While many clusters start out as homogeneous, they quickly evolve as new generations of compute nodes are integrated into the system. Research has shown that this integration, while improving scheduling performance because of increased scheduler selection, can actually decrease average job efficiency.
A feature called node sets allows jobs to request sets of common resources without specifying exactly what resources are required. Node set policy can be specified globally or on a per-job basis and can be based on node processor speed, memory, network interfaces, or locally defined node attributes. In addition to forcing jobs onto homogeneous nodes, these policies may also be used to guide jobs to one or more types of nodes on which a particular job performs best, similar to job preferences available in other systems. For example, an I/O-intensive job may run best on a certain range of processor speeds, running slower on slower nodes while wasting cycles on faster nodes. A job may specify ANYOF:PROCSPEED:450:500:650 to request nodes with processor speeds of 450, 500, or 650 MHz. Alternatively, if a simple procspeed-homogeneous node set is desired, ONEOF:PROCSPEED may be specified. On the other hand, a communication-sensitive job may request a network-based node set with the configuration ONEOF:NETWORK:VIA:MYRINET:ETHERNET, in which case Maui will first attempt to locate adequate nodes where all nodes contain VIA network interfaces. If such a set cannot be found, Maui will look for sets of nodes containing the other specified network interfaces. In highly heterogeneous clusters, the use of node sets has been found to improve job throughput by 10 to 15 percent.
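The ONEOF/ANYOF distinction can be sketched with a small selection routine. The node inventory here is hypothetical; the point is that ONEOF must satisfy the job from a single homogeneous attribute value, while ANYOF may mix any of the listed values.

```python
# Sketch of ONEOF/ANYOF node-set selection on a PROCSPEED-style attribute.
# Node data (name, speed in MHz) is hypothetical.

nodes = [("n1", 450), ("n2", 450), ("n3", 500), ("n4", 650), ("n5", 650), ("n6", 650)]

def oneof_procspeed(nodes, count):
    """ONEOF: pick a single speed class with enough nodes (homogeneous allocation)."""
    for speed in sorted({s for _, s in nodes}):
        matching = [n for n, s in nodes if s == speed]
        if len(matching) >= count:
            return matching[:count]
    return []

def anyof_procspeed(nodes, speeds, count):
    """ANYOF: accept nodes of any listed speed (possibly mixed allocation)."""
    matching = [n for n, s in nodes if s in speeds]
    return matching[:count] if len(matching) >= count else []

print(oneof_procspeed(nodes, 2))                    # -> ['n1', 'n2'] (all 450 MHz)
print(anyof_procspeed(nodes, {450, 500, 650}, 4))   # -> ['n1', 'n2', 'n3', 'n4']
```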
Many sites possess workloads of varying importance. While some jobs may require resources immediately, other jobs are less time sensitive but have an insatiable hunger for compute cycles. These latter jobs often have turnaround times on the order of weeks or months. The concept of cycle stealing, popularized by systems such as Condor, handles such situations well and enables systems to run low-priority preemptible jobs whenever something more pressing is not running. These other systems are often employed on compute farms of desktops, where jobs must vacate whenever interactive system use is detected.
Maui's QoS-based preemption system allows a dedicated, noninteractive cluster to be used in much the same way. Certain QoS objects may be marked with the flag PREEMPTOR and others with the flag PREEMPTEE. With this configuration, low-priority "preemptee" jobs can be started whenever idle resources are available. These jobs will be allowed to run until a "preemptor" job arrives, at which point the preemptee job will be checkpointed if possible and vacated. This strategy allows almost immediate resource access for the preemptor job. Using this approach, a cluster can maintain nearly 100 percent system utilization while still delivering excellent turnaround time to the jobs of greatest value.
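The preemptor/preemptee interaction can be sketched as follows. This is an illustrative model, not Maui's implementation; the cluster size and job list are hypothetical, and checkpointing is represented simply by naming the jobs to vacate.

```python
# Sketch of QoS-based preemption on a hypothetical 16-node cluster: an arriving
# PREEMPTOR job may vacate PREEMPTEE jobs to obtain resources immediately.

TOTAL_NODES = 16

def admit_preemptor(running, job_nodes):
    """Return the preemptee jobs to checkpoint/vacate, or None if the job cannot start."""
    free = TOTAL_NODES - sum(j["nodes"] for j in running)
    vacated = []
    for j in running:
        if free >= job_nodes:
            break
        if j["qos_flag"] == "PREEMPTEE":    # only preemptee jobs may be displaced
            vacated.append(j["name"])
            free += j["nodes"]
    return vacated if free >= job_nodes else None

running = [
    {"name": "steal1", "nodes": 8, "qos_flag": "PREEMPTEE"},  # cycle-stealing job
    {"name": "prod1", "nodes": 6, "qos_flag": "NONE"},        # ordinary job: untouchable
]
print(admit_preemptor(running, 8))  # -> ['steal1'] (2 idle + 8 vacated >= 8 needed)
```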
Use of the preemption system is not limited to controlling low-priority jobs. Sites can use this feature to support optimistic backfill scheduling, enable deadline-based scheduling, and provide QoS guarantees.
High-performance computing clusters are complicated. First, such clusters have an immense array of attributes that affect overall system performance, including processor speed, memory, networks, I/O systems, enterprise services, and application and system software. Second, each of these attributes is evolving over time, as is the usage pattern of the system's users. Third, sites are presented with an equally immense array of buttons, knobs, and levers which they can push, pull, kick, and otherwise manipulate. How does one evaluate the success of a current configuration? And how does one establish a causal effect between pushing one of the many provided buttons and improved system performance when the system is constantly changing in multiple simultaneous dimensions?
To help alleviate this problem, Maui offers several useful features.
Maui possesses many internal diagnostic functions that both locate problems and present system state information. For example, the priority diagnostic aggregates priority relevant information, presenting configuration settings and their impact on the current idle workload; administrators can see the contribution associated with each priority factor on a per job and systemwide average basis. The node diagnostic presents significant node-relevant information together with messages regarding any unexpected conditions. Other diagnostics are available for jobs, reservations, QoS, fairshare, priorities, fairness policies, users, groups, and accounts.
Maui maintains internal statistics and records detailed information about each job as it completes. The showstats command provides detailed usage information for users, groups, accounts, nodes, and the system as a whole. The showgrid command presents scheduler performance statistics in a job size/duration matrix to aid in analyzing the effectiveness of current policies.
The completed job statistics are maintained in a flat file located in the 'stats' directory. These statistics are useful for two primary purposes: driving simulations (described later) and profiling actual system usage. The profiler command allows the processing of these historical scheduler statistics and generation of usage reports for specific time frames or for selected users, groups, accounts, or types of jobs.
Maui supports a scheduling mode called test. In this mode, the scheduler initializes, contacts the resource manager and other peer services, and conducts scheduling cycles exactly as it would if running in NORMAL or production mode. Jobs are prioritized, reservations created, policies and limits enforced, and admin and end-user commands enabled. Because test mode disables Maui's ability to impact the system, a site can safely verify scheduler operation and validate new policies and constraints. In fact, Maui can be run in test mode on a production system while another scheduler or even another version of Maui is running on the same system. This unique ability allows new versions and configurations to be fully tested without any exposure to potential failures and with no cluster downtime.
To run Maui in test mode, simply set the MODE attribute of the SCHEDCFG parameter to TEST and start Maui. Normal scheduler commands can be used to evaluate configuration and performance. Diagnostic commands can be used to look for any potential issues. Further, the Maui log file can be used to determine which jobs Maui attempted to start and which resources Maui attempted to allocate.
In addition to test mode, Maui supports a mode known as interactive. This mode also allows for evaluation of new versions and configurations using a different approach. Instead of disabling all resource and job control functions, however, Maui sends the desired change request to the screen and asks for permission to complete it. The administrator must specifically accept each command request before Maui will execute it.
If another instance of Maui is running in production mode and a site wishes to evaluate a different configuration or new version using one of the above evaluation modes, this is easily done, but care should be taken to avoid conflicts with the primary scheduler. Potential conflicts include statistics files, logs, checkpoint files, and user interface ports. One of the easiest ways to avoid these conflicts is to create a new "test" directory with its own log and stats subdirectories. The new 'maui.cfg' file can be created from scratch or based on the existing 'maui.cfg' file already in use. In either case, make certain that the SCHEDCFG PORT attribute differs from that used by the production scheduler. If testing is being done with the production binary executable, the MAUIHOMEDIR environment variable should be set to point to the new test directory in order to prevent Maui from loading the production 'maui.cfg' file.
The Maui simulation facility allows a site to evaluate cluster performance in an almost arbitrary environment. This is done by creating a resource and workload tracefile to specify the desired cluster and workload to be evaluated. These traces, specified via the SIMWORKLOADTRACEFILE and SIMRESOURCETRACEFILE parameters, can accurately and reproducibly replicate the workload and resources recorded at the site or may represent an entirely new cluster and workload. In order to run a simulation, an adjusted 'maui.cfg' file is created with the policies of interest in place and the MODE attribute of the SCHEDCFG parameter set to SIMULATION. Once started, Maui can be stepped through simulated time using the schedctl -S command. In the simulation, all Maui commands continue to function as before, allowing interactive querying of status, adjustment of parameters, or even submission or cancellation of jobs.
This feature enables sites to analyze the impact of different scheduling policies on their own workload and system configuration. The effects of new reservations or job prioritizations can be evaluated in a zero-exposure environment, allowing sites to determine ideal policies without experimenting on a production system. Sites can also evaluate the impact of additional or modified workloads or changes in available resources. What impact will removing a block of resources for maintenance have on average queue time? How much benefit will a new reservation dedicated exclusively to development jobs have on development job turnaround time? How much pain will it cause nondevelopment jobs? Using simulation makes it easier and safer to obtain answers to such questions.
This same simulation feature can also be used to test a new algorithm against workload and resource traces from various supercomputing centers. Moreover, with the simulator, sites can create and plug in modules to emulate the behavior of various job types on different hardware platforms, across bottlenecking networks, or under various data migration conditions.
Further information on the capabilities and use of simulation is given in the Maui Administrators Manual.