The next several chapters introduce cluster workload management concepts and present in detail three specific software packages, Condor, Maui, and PBS, which are commonly used to manage the workload on Beowulf clusters.
Managing the workload on a Beowulf cluster is one of the most visible cluster management activities, because the cluster exists to run user applications. The following are examples of workload management activities that are critical to cluster management:
Managing node availability
Configuring node attributes important to the workload
Managing user/group/project fair usage quotas
Configuring and tuning scheduling policy
Managing dedicated or maintenance reservations
Tracking user/group/project usage history
After selecting and installing workload management software, a cluster administrator will perform these activities to ensure that cluster usage is consistent with the cluster's goals.
Every cluster has a different set of goals, and how to implement an appropriate workload management policy for those goals depends on the software packages in use. You should consult Condor, Maui, PBS, or other workload management software documentation for details on options available to implement the policies you need.
Regardless of which workload management tools you use, you should find out how to perform the following activities to assist in failure investigation and recovery:
Taking a node offline so it is not considered for future jobs.
Placing a system, user, or project reservation on a node so that the node is closed to general use but remains available for investigating a hardware or software failure.
Modifying the properties or attributes of a node to reflect a change in the availability of a failing component (such as the interconnect), or to indicate that the node is running a test operating system or software collection.
Adding or removing individual nodes from the list of known nodes.
Suspending all job execution without losing previously queued jobs.
Canceling running jobs.
Placing a hold on a queued job to ensure that it doesn't run and trigger some type of harmful failure.
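To make the effect of these operations concrete, the following is a minimal sketch of a toy, in-memory workload manager written in Python. It is not how Condor, Maui, or PBS work internally, and every name in it (ToyWorkloadManager, Node, Job, the "fast_interconnect" attribute) is hypothetical and chosen only for illustration. It assumes one job per node and a simple first-fit placement rule, which real schedulers do not necessarily use.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    online: bool = True                  # False = taken offline by the administrator
    reserved_for: Optional[str] = None   # None = open to everyone
    attributes: set = field(default_factory=set)


@dataclass
class Job:
    job_id: int
    user: str
    requires: set = field(default_factory=set)
    state: str = "queued"                # queued, held, running, or canceled
    node: Optional[str] = None


class ToyWorkloadManager:
    """Hypothetical model of the administrative operations listed above."""

    def __init__(self):
        self.nodes = {}
        self.jobs = {}
        self.scheduling_enabled = True   # False = all job starts suspended

    def add_node(self, name, attributes=()):
        self.nodes[name] = Node(name, attributes=set(attributes))

    def remove_node(self, name):
        del self.nodes[name]

    def submit(self, job_id, user, requires=()):
        self.jobs[job_id] = Job(job_id, user, set(requires))

    def hold(self, job_id):
        # Only a queued job can be held; a held job is never started.
        if self.jobs[job_id].state == "queued":
            self.jobs[job_id].state = "held"

    def cancel(self, job_id):
        job = self.jobs[job_id]
        job.state, job.node = "canceled", None

    def _eligible(self, node, job):
        # A node is usable if it is online, not reserved for someone else,
        # and advertises every attribute the job requires.
        return (node.online
                and node.reserved_for in (None, job.user)
                and job.requires <= node.attributes)

    def schedule(self):
        # With scheduling suspended, queued jobs simply stay queued.
        if not self.scheduling_enabled:
            return
        busy = {j.node for j in self.jobs.values() if j.state == "running"}
        for job in self.jobs.values():
            if job.state != "queued":
                continue
            for node in self.nodes.values():
                if node.name not in busy and self._eligible(node, job):
                    job.state, job.node = "running", node.name
                    busy.add(node.name)
                    break
```

In this sketch, setting a node's online flag to False, assigning reserved_for, or removing an entry from attributes each prevents ordinary jobs from landing on the node without touching the queue, which is the property to look for when you read the documentation of the package you actually use.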