13.9 Configuration Management

Configuration management refers to the activities performed on a machine that adapt it to a particular organization: its network services, security policies, management policies, access policies, and so on. In other words, it is the set of activities that integrate a machine into the cluster and the organization.

Why mention configuration management in the context of Beowulf clusters? First, because it is critical to making a Beowulf cluster functional, and second, because it can be a challenge if you do not follow a carefully designed configuration management process.

Most Beowulf clusters have a variety of machine configurations that share many common characteristics but also vary in important ways. For example, management nodes, login nodes, file servers, and compute nodes may all need to be configured similarly to set and maintain accurate clocks, but each has a slightly different access policy and collection of configured and available software services.

Describing configuration information and propagating it to machines is often performed by the cluster installation software. Chapter 6 describes cluster setup using various tools. Each of these cluster installation tools provides some type of configuration management capability. In some cases it is no more than the basic capability you would use on stand-alone machines.

Regardless of which tool you use, an important aspect of cluster management is maintaining a central repository of the configuration information used in a cluster. Without this repository, whenever a machine fails and needs to be rebuilt, it may be difficult to determine what configuration information was applied to make the machine functional. Every time you rebuild a compute node, you would rather not have to look at other compute nodes to remember which files contain important configuration information that must be applied to the rebuilt machine, and then go through a diff process against those nodes to make sure you remembered everything.
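The check that a rebuilt node matches the repository can itself be automated. The following is a minimal sketch under stated assumptions, not the mechanism of any particular cluster tool: it assumes the repository records a SHA-256 digest for each managed file, and the names `sha256_of` and `find_drift`, the file paths, and the host `ntp.example.org` are hypothetical.

```python
"""Sketch: check a node's files against digests recorded in a central repository."""
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def find_drift(expected, root):
    """Compare files under root against the expected digests.

    expected maps a path relative to root to its SHA-256 hex digest.
    Returns a list of (relative path, reason), ordered by path, where
    reason is "missing" or "modified".
    """
    problems = []
    for rel_path, digest in sorted(expected.items()):
        target = Path(root) / rel_path
        if not target.is_file():
            problems.append((rel_path, "missing"))
        elif sha256_of(target) != digest:
            problems.append((rel_path, "modified"))
    return problems


if __name__ == "__main__":
    # A temporary directory stands in for the rebuilt node's filesystem.
    with tempfile.TemporaryDirectory() as root:
        conf = Path(root) / "etc" / "ntp.conf"
        conf.parent.mkdir(parents=True)
        conf.write_text("server ntp.example.org\n")  # hypothetical content
        expected = {
            "etc/ntp.conf": sha256_of(conf),
            "etc/hosts.allow": "0" * 64,  # placeholder digest; file is absent
        }
        for rel_path, reason in find_drift(expected, root):
            print(f"{rel_path}: {reason}")
```

With a record like this, a rebuilt node is compared against the repository rather than against its neighbors.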

The important point to remember is that the most effective way to deal with configuration management is to maintain some type of central repository from which you push configuration changes. If this repository can be organized by node type or some other organizational approach, all the better. That way, when you need to change something on all compute nodes, or all login nodes, or every node in the cluster, you do not have to update a separate centrally managed file for each node, only the files for the appropriate classes of nodes.
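The class-based organization described above can be sketched as a small lookup. This is an illustration only, not the layout used by any particular tool: the class names (`all`, `compute`, `login`), node names, file paths, and the function `files_for_node` are all hypothetical. Files defined for a node's own class override those defined for every node.

```python
"""Sketch: resolve a node's configuration files from a class-organized repository."""

# Hypothetical repository: class name -> {target path on node: source in repository}.
# The "all" class applies to every node; other classes add or override entries.
REPOSITORY = {
    "all": {
        "/etc/ntp.conf": "all/ntp.conf",
        "/etc/hosts.allow": "all/hosts.allow",
    },
    "compute": {
        "/etc/hosts.allow": "compute/hosts.allow",
    },
    "login": {
        "/etc/hosts.allow": "login/hosts.allow",
        "/etc/motd": "login/motd",
    },
}

# Hypothetical mapping from node name to node class.
NODE_CLASS = {
    "n001": "compute",
    "n002": "compute",
    "login1": "login",
}


def files_for_node(node, repository=REPOSITORY, node_class=NODE_CLASS):
    """Return {target path: repository source} for one node.

    Entries from the node's own class override entries from "all".
    """
    resolved = dict(repository.get("all", {}))
    resolved.update(repository.get(node_class[node], {}))
    return resolved


if __name__ == "__main__":
    for node in sorted(NODE_CLASS):
        print(node)
        for target, source in sorted(files_for_node(node).items()):
            print(f"  {target} <- {source}")
```

In this arrangement, changing the clock configuration for the whole cluster means editing one file under `all`, while tightening access on the compute nodes touches only the `compute` class.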

If you need functionality in this area that is not part of your cluster distribution or installation suite, you may find one of the following tools helpful:

cfengine, http://www.cfengine.org/
sanity/cfg, http://www-unix.mcs.anl.gov/systems/software/msys/
and various proprietary vendor solutions

Administration Challenges Unique to Clusters

One appealing way to think of cluster management is as management of a collection of individual machines. This approach is appealing since it sidesteps the complexity of the whole by focusing on the management of the individual components. Although managing a cluster this way may work at a basic level, it isn't very effective and doesn't consider the intended architecture and usage model of a cluster.

The Linux cluster's claim to fame is its ability to produce supercomputer-class results at a fraction of the cost. This means, among other things, that at a practical level the collection of components needs to be usable by applications and manageable by administrators as a single machine.

This is where the cluster management challenge begins. To overcome this challenge, cluster administrators must approach cluster management at the cluster level and therefore need tools for logging, monitoring, build and configuration management, workload management, and so forth that are aware of, and operate at, the cluster level.

Today we have many cluster management tools that make it easier to work with the entire cluster. But there is still significant room for improvement. One example is in the area of fault detection, analysis, and recovery. The major supercomputer vendors have worked for decades to make their machines fault tolerant. By contrast, today's cluster management tools for the most part ignore the issue of fault detection and recovery. This deficiency undoubtedly constitutes the greatest cluster management challenge.

One approach used to bridge the gap between cluster-level management and machine-specific administration tools is scripting, or automation. The premise behind this approach is to provide scriptable interfaces to all the actions performed at the machine level and to use cluster-level tools to automatically iterate the same action over many components or machines. While this concept sounds simple and achievable, it is unfortunately not always possible, since hardware and software at the machine level are often not designed for complete hands-off administration.
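The idea of iterating one scriptable action over many machines can be sketched in a few lines. This is a minimal illustration, not how any existing cluster tool is implemented: it assumes passwordless ssh access to each node, and the function names `run_on_node` and `run_on_all`, the node names, and the default timeout and worker count are hypothetical.

```python
"""Sketch: run the same command on many nodes in parallel and collect results."""
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_on_node(node, command, timeout=30):
    """Run command on one node over ssh; return (exit code, output or error text)."""
    try:
        result = subprocess.run(
            # BatchMode=yes makes ssh fail instead of prompting for a password.
            ["ssh", "-o", "BatchMode=yes", node, command],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return result.returncode, result.stdout.strip()
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Treat an unreachable or hung node as a failure rather than aborting the run.
        return 255, str(exc)


def run_on_all(nodes, command, runner=run_on_node, max_workers=32):
    """Apply runner to every node concurrently; return {node: (exit code, output)}."""
    nodes = list(nodes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda node: runner(node, command), nodes)
        return dict(zip(nodes, results))


if __name__ == "__main__":
    # Hypothetical node names; replace with the nodes of your own cluster.
    results = run_on_all(["n001", "n002", "n003"], "uptime")
    for node, (code, output) in sorted(results.items()):
        print(f"{node}: [{code}] {output}")
    failed = sorted(node for node, (code, _) in results.items() if code != 0)
    if failed:
        print("failed:", ", ".join(failed))
```

Collecting an exit code per node matters as much as the parallelism: a cluster-level tool has to report which machines did not respond, because at scale some node is usually down.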

One of the most basic and useful tools for invoking a scriptable command on a group of machines is the "parallel distributed shell," or pdsh. This tool is the cluster-aware equivalent of rsh or ssh. With it you can define various sets of nodes and perform operations on those collections in parallel. For example, to check the uptime and load across an entire cluster with pdsh, use the command:

     pdsh -a uptime

To download pdsh or learn more about it, visit: http://www.llnl.gov/linux/pdsh/pdsh.html