NPACI Rocks clustering software leverages RedHat's Kickstart utility to manage the software and configuration of all nodes. It is built on the observation that real clusters contain many node types (hereafter referred to as "appliance types" or "appliances"). Rocks decomposes the configuration of each appliance into several small, single-purpose package and configuration modules. Further, all site- and machine-specific information is managed in an SQL (MySQL) database as a single "oracle" of cluster-wide information.
The Rocks configuration modules can easily be shared between cluster nodes and, more importantly, between cluster sites. For example, a single module describes the software components and configuration of the ssh service. Cluster appliance types that require ssh are built with this module. The configuration is completely transferable, as is, to all Rocks clusters.
In Rocks, a single object-oriented framework is used to build the configuration/installation module hierarchy, resulting in multiple cluster appliances being constructed from the same core software and configuration description. This framework is composed of XML files and a Python engine that converts the component descriptions of an appliance into a RedHat-compliant Kickstart file.
Anaconda is RedHat's installer, which interprets Kickstart files. The Kickstart file describes everything that must be done, from disk partitioning, to package installation, and finally post- or site-configuration, to create a completely functional node. Figure 6.3 presents a sample Kickstart file. It has three sections: command, packages, and post. The command section contains the answers to almost all of the questions posed by an interactive installation (e.g., location of the distribution, disk partitioning parameters, and language support). The packages section lists the names of the RedHat packages (RPMs) to be installed on the machine. Finally, the post section contains scripts that are run during the installation to further configure the installed packages. The post section is the most complicated because this is where site-specific customization is done. Rocks, for example, does not repackage available software; it simply provides a mechanism to easily supply the needed post-configuration.
url --url http://10.1.1.1/install/i386
zerombr yes
clearpart --all
part / --size 4096
lang en_US
keyboard us
mouse genericps/2
timezone --utc GMT
skipx
install
reboot

%packages
@Base
pdksh

%post
cat > /etc/motd << EOF
Kickstarted on `date`
EOF

Figure 6.3: A sample Kickstart file.
While a Kickstart file is a text-based description of all the software packages and software configuration to be deployed on a node, it is both static and monolithic. At best, this requires separate files for each appliance type. At worst, it requires a separate file for each host. The overwhelming advantage of Kickstart is that it provides a de facto standard for installing software, performing the system probing required to install and configure the correct device drivers, and automating the selection of these drivers on a per-machine basis. A Kickstart file is quite generic in that references to specific versions of packages are not needed. Nor is specific identification of Ethernet, disk, video, memory, motherboard, or other hardware devices needed.
Because the Kickstart file does not contain package versions, the resolution of specific version information must be held somewhere. For RedHat, this information is kept in a distribution tree. The distribution is simply a collection of RedHat packages (RPMs) in a particular directory structure, along with a RedHat-specific index file that maps a generic package name to its fully-qualified version. In this way, the Kickstart file may list an openssh-clients package, but the Anaconda installer will resolve this to the full name openssh-clients-3.1p1-6.i386.rpm by referencing the distribution's index file. Rocks provides some critical software (rocks-dist) that greatly simplifies the creation of custom distributions. Multiple distributions can sit on a single server, and end-users can easily integrate site-specific software. In addition, distributions can be built with the latest package updates so that when a Rocks appliance installs itself, it applies completely updated software in a single step. This eliminates the install-then-patch scenario.
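As a rough sketch of the workflow, building a custom distribution with rocks-dist looks something like the following (the /home/install location and contrib directory follow common Rocks conventions, but exact paths and subcommands vary between releases):

# Mirror packages (including updates) from a Rocks distribution server
cd /home/install
rocks-dist mirror

# Add site-specific RPMs to the local contrib area
cp site-package-1.0-1.i386.rpm contrib/RPMS/

# Rebuild the distribution tree; this regenerates the index file that
# Anaconda uses to resolve generic names to fully-qualified versions
rocks-dist dist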
It is important to understand that the distribution contains all possible packages that might be installed on a cluster appliance node. The Kickstart file describes exactly which of these will be installed and how each software subsystem will be configured to make a particular appliance. Rocks allows a head node to serve out multiple distributions. This facilitates testing nodes against new software simply by pointing the installer at a new distribution. See Chapter 20 and the Jazz cluster for a real-world experience of the necessity of having test hardware.
Description mechanisms for other distributions and operating systems exist and include SuSE's YaST (and YaST2), Debian's FAI (Fully Automatic Installation), and Sun Solaris JumpStart. The structure of each of these text descriptions is actually quite similar, since the same problems of hardware probing, software installation, and software post-configuration must be solved. The specifics of package naming, partitioning commands, and other details are quite different among these methods.
The key functionality missing from Kickstart, preventing it from being the only installation tool needed for clusters, is a macro language and a framework for code reuse. A macro language would improve the programmability of Kickstart, and code reuse significantly ameliorates the problem of software skew across appliances by having configuration shared among appliance types be truly shared (instead of being copies that require vigilance to keep in sync).
Rocks uses the concept of package and configuration modules as building blocks for creating entire appliances. Rocks modules are small XML files that encapsulate package names and post-configuration into logical "chunks" of functionality. Rocks uses XML because de facto standard software exists for parsing it.
Once the functionality of a system is broken into small, single-purpose modules, a framework describing the inheritance model is used to derive the full functionality of complete systems, each of which shares a common base configuration. Figure 6.4 is a representation of such a framework, which describes the configuration of all appliances in a Rocks cluster. The framework is a directed graph: each vertex represents the configuration for a specific service (software package(s), service configuration, local machine configuration, etc.), and relationships between services are represented with edges. At the top of the graph are four vertices which indicate the configuration of the "laptop", "desktop", "frontend", and "compute" cluster appliances.
Figure 6.4: Description (Kickstart) Graph. This graph completely describes all of the appliances of a Rocks Cluster.
When a node is built using Rocks, the Kickstart file for that particular node is generated and customized on the fly by starting at an appliance entry vertex and traversing the graph. The modules (XML files) are parsed, and customization data is read from the Rocks SQL database. Figure 6.5 shows some detail of the configuration graph. Two appliance types are illustrated here: standalone and node. Both share everything contained in the base module and hence will be identically installed and configured for everything in base and the modules below it. In this example, a module called c-development is attached only to standalone. With this type of construction it is quite easy to see (and therefore focus on) the differences between appliances.
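The traversal itself can be sketched in a few lines of Python (the structures and file layout here are hypothetical; the real Rocks engine also handles XML entities, database-driven <var> substitution, and architecture annotations):

import xml.dom.minidom

def traverse(graph, vertex, visited):
    """Depth-first walk of the configuration graph, yielding each
    module name exactly once so shared modules are not duplicated."""
    if vertex in visited:
        return
    visited.add(vertex)
    yield vertex
    for child in graph.get(vertex, []):
        yield from traverse(graph, child, visited)

def make_kickstart(graph, appliance):
    packages, post = [], []
    for module in traverse(graph, appliance, set()):
        dom = xml.dom.minidom.parse(module + '.xml')
        packages += [p.firstChild.data.strip()
                     for p in dom.getElementsByTagName('package')]
        post += [s.firstChild.data
                 for s in dom.getElementsByTagName('post')]
    # In Rocks, <var name="..."/> values would be filled in from the
    # SQL database at this point.
    return ('%packages\n' + '\n'.join(packages) +
            '\n\n%post\n' + '\n'.join(post))

# Edges from Figure 6.5: both appliances share base, but only
# standalone pulls in c-development.
graph = {'standalone': ['c-development', 'base'], 'node': ['base']}
print(make_kickstart(graph, 'standalone'))  # assumes the .xml files exist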
Figure 6.5: Description Graph Detail. This illustrates how two modules, 'standalone.xml' and 'base.xml', share base configuration while differing in other specifics.
It is interesting to note that the interconnection graph is kept in a file separate from the modules themselves. This means that if a user desires to have the c-development module as part of the base installation, one simply makes that change in the graph file by attaching c-development to the base module. Also, as Figure 6.5 shows, edges can be annotated with an architecture type (i386 and ia64 in this example). This allows the same generic structure to describe appliances across significant architectural boundaries. Real differences, such as the grub (for ia32) and elilo (for ia64) boot loaders, can be teased out without completely replicating the rest of the configuration.
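For illustration, a fragment of such a graph file might look like the following (the tag names are a sketch based on the structure described above; the exact schema differs between Rocks versions):

<?xml version="1.0" standalone="no"?>
<graph>
  <!-- Both appliance types inherit everything in base -->
  <edge from="standalone" to="base"/>
  <edge from="node"       to="base"/>

  <!-- Attached only to standalone; re-pointing this edge so that it
       originates from base would install it on every appliance -->
  <edge from="standalone" to="c-development"/>

  <!-- Architecture annotations select the proper boot loader -->
  <edge from="base" to="grub"  arch="i386"/>
  <edge from="base" to="elilo" arch="ia64"/>
</graph>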
In an earlier section, it was stated that image-based systems put the bulk of their configuration into creating an image, while description-based methods put the bulk of their configuration into the description (e.g., Kickstart) file. In Rocks, the modules are small XML files with a simple structure, as illustrated in Figures 6.6 and 6.7.
<?xml version="1.0" standalone="no"?> <!DOCTYPE kickstart SYSTEM "dtds/node.dtd" [<!ENTITY ssh "openssh">]> <kickstart> <package>&ssh;</package> <package>&ssh;-clients</package> <package>&ssh;-server</package> <package>&ssh;-askpass</package> <!-- Required for X11 Forwarding --> <package>XFree86</package> <package>XFree86-libs</package> <post> <!-- default client setup --> cat > /etc/ssh/ssh_config << 'EOF' Host * CheckHostIP no ForwardX11 yes ForwardAgent yes StrictHostKeyChecking no UsePrivilegedPort no FallBackToRsh no Protocol 1,2 EOF </post> </kickstart>
<?xml version="1.0" standalone="no"?> <!DOCTYPE kickstart SYSTEM "dtds/node.dtd"> <kickstart> <main> <lang><var name="Kickstart_Lang"/></lang> <keyboard><var name="Kickstart_Keyboard"/></keyboard> <mouse><var name="Kickstart_Mouse"/></mouse> <timezone><var name="Kickstart_Timezone"/></timezone> <rootpw>--iscrypted <var name="RootPassword"/></rootpw> <install/> <reboot/> </main> </kickstart>
Figure 6.6 shows the XML file for the "ssh" module in the graph. The single purpose of this module is to describe the packages and configuration associated with the installation of the ssh service and client on a machine. The package and post XML tags map directly to Kickstart keywords. Figure 6.7 shows how global settings such as the root password and mouse selection can be described in the same way. Rocks also offers options for partitioning hard drives that range from a fully-automated scheme (which works on IDE, SCSI, and RAID arrays) to completely manual (administrator-controlled) partitioning. The real advantage here is that ssh configuration policy is defined once instead of being replicated across all appliance types.
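To make the mapping concrete, after the &ssh; entity is expanded, the module in Figure 6.6 contributes roughly the following fragment to a generated Kickstart file:

%packages
openssh
openssh-clients
openssh-server
openssh-askpass
XFree86
XFree86-libs

%post
# the body of the module's <post> section is copied here verbatim
cat > /etc/ssh/ssh_config << 'EOF'
...
EOF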
Rocks uses this graph structure to create description files for appliances. In the background, a MySQL database holds cluster-wide configuration information. When a node requests an IP address, a DHCP server on the head node replies with a filename tag that contains a URL for the node's Kickstart file (a sketch of such a DHCP entry appears after the list below). The node contacts the web server, and a CGI script is run that (1) looks up the node and appliance type in the database, and (2) traverses and expands the graph for that appliance and node type to dynamically create the Kickstart file. Once the description is downloaded, the native installer takes over: it downloads packages from the location specified in the Kickstart file, installs them, performs the specified post-installation tasks, and then reboots. Rocks uses the same structure to bootstrap a head node, except that the Kickstart generation framework and the Linux distribution are held on the local boot CD and interactive screens gather the local information. In summary, we annotate the installation steps with the steps that Rocks takes:
Install Head Node—Boot Rocks-augmented CD
Configure Cluster Services on Head Node—automatically done in step 1
Define Configuration of a Compute Node—basic setup installed; the graph or node files can be edited to customize further
For each compute node—repeat the following steps
Detect Ethernet Hardware Address of New Node—use the insert-ethers tool
Install complete OS onto new node—Kickstart
Complete Configuration of new node—already described in Kickstart file
Restart Services on head node that are cluster-aware (e.g. PBS, Sun Grid Engine)—part of insert-ethers
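As noted above, the DHCP reply that starts this sequence can be visualized with a dhcpd.conf host entry of roughly the following form (the addresses, hardware address, and CGI path are illustrative):

host compute-0-0 {
        hardware ethernet 00:50:8b:aa:bb:cc;    # learned by insert-ethers
        option host-name "compute-0-0";
        fixed-address 10.1.255.254;
        # the filename tag carries the URL of the Kickstart-generating CGI
        filename "https://10.1.1.1/install/sbin/kickstart.cgi";
}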
The key features of Rocks are that it is RedHat-specific, uses descriptions to build appliances, leverages the RedHat installer to do hardware detection, and will take hardware with no installed OS to an operating cluster in a short period of time. The description files are almost completely hardware-independent, allowing the construction of Beowulfs with heterogeneous physical nodes to be handled as easily as homogeneous ones.