Initial setup of a cluster is not trivial, but neither is it untenable. Suppose that 68 brand new computers have just arrived. Sixty-four are slated as compute nodes, two function as login and job launching nodes, and two are dedicated to providing I/O services, often using plain old NFS. The hardware-related challenges revolve around selecting your network setup (see Chapter 5), laying out the systems in an organized fashion, physically wiring them, and getting your electricians to believe just how much power the cluster will actually consume. For the moment, let's assume that the cluster has been physically constructed and is sitting with power turned off and no software on any system (frontend, storage, compute, etc.). It's raw hardware just waiting to be unleashed.
This subsection starts with a bold proposition: "There are no homogeneous clusters." Standard Beowulfs have at least two types of nodes, login and compute, so homogeneity of function is already split. As clusters get larger, some nodes take on specialized service roles to handle the aggregate load. System logging nodes, dedicated I/O nodes, additional public login nodes, and dedicated installation nodes are just a few personalities that might need to be supported. In a tale of two clusters (Chapter 20), one can see the various node types that make up a real cluster. It's more than just head node and compute. Role specialization isn't the only way that a model of homogeneity can break: differences in hardware are also quite common at design time and throughout the life of the cluster. Even though many clusters may start out with compute nodes of a homogeneous hardware type, they often don't stay that way. Hardware is simply moving too quickly to expect that future expansions of a cluster could be identical to current nodes. Equipment breaks, and the replacement parts might have different memory types, updated processors, or a different network adapter. Even when all nodes are purchased at the same time in an attempt to ensure hardware homogeneity, small differences can still get in the way.
Some years ago, the author worked with NT-based clusters. Our team had purchased 64 9.1 GB SCSI drives, all with the same part number, all with the same specifications. They differed slightly: some had 980 cylinders and some had 981. From the manufacturer's perspective, both provided the advertised space. The problem occurred in the imaging program (ImageCast, in this case). An image was taken from a 981-cylinder drive. Attempts to re-image the 980-cylinder drives failed because of the single-cylinder difference. Image-based programs have certainly improved since then, but these types of small differences can cause many lost hours. We solved the problem by building the model node on a 980-cylinder disk; the resulting image just happened to work on the larger drives as well. We were lucky. We might have been forced to maintain two images just because of a one-cylinder difference in the local hard drive. The reality is that in commodity components, small low-level differences exist. Your setup and management methodology must be able to handle these subtle differences without administrative intervention.
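The workaround described above generalizes: when nominally identical disks differ slightly, take the image from the smallest one so that it fits everywhere. The following is a minimal Python sketch of that selection, assuming a hypothetical inventory that maps node names to cylinder counts. The node names and the helpers `pick_model_node` and `geometry_summary` are illustrative and are not part of any imaging tool.

```python
from collections import Counter


def pick_model_node(cylinders_by_node):
    """Return the node with the smallest disk, so its image fits every other disk."""
    if not cylinders_by_node:
        raise ValueError("inventory is empty")
    # Break ties by node name so the choice is deterministic.
    return min(cylinders_by_node, key=lambda node: (cylinders_by_node[node], node))


def geometry_summary(cylinders_by_node):
    """Count how many nodes share each cylinder count."""
    return Counter(cylinders_by_node.values())


# Hypothetical inventory: same part number, slightly different geometry.
inventory = {"node01": 981, "node02": 980, "node03": 981, "node04": 980}

model = pick_model_node(inventory)
summary = geometry_summary(inventory)
print(model, dict(summary))
```

A summary with more than one distinct geometry is an early warning that a single image may not apply cleanly to every node.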
The previous example may make clusters sound ominous, disorganized, and impossible to provision, and the reader may feel that building a real, functioning, and stable cluster is a hopeless cause, as if small differences could wreak havoc at the provisioning stage. Fear not. Clusters are everywhere. They include some of the fastest machines in the world, they are stable, and they can be provisioned easily in a way that manages inhomogeneity at both the functional and hardware levels.
Commonly, several types of functionality are needed to build a working cluster. As clusters grow in the number of nodes, specialization of particular nodes to perform specific tasks becomes much more prevalent. In the largest clusters, functional specialization of nodes is a necessity. The specialization is a direct outcome of needing to scale certain services. On a small cluster, the head node can "do it all": system logging, Ganglia monitoring, installation serving, compilation, login, and serving out home areas. As the cluster grows, these services need to be spread across physical machines so that each can handle the load.
Any node in the cluster is differentiated by the types of services and software that are configured on it. Nodes can change their logical functionality just by deploying and configuring a different software stack. A common differentiation in mid-sized clusters has nodes of the following types (we will henceforth call these appliances):
Head Node/Frontend Node: This node is the public persona of the cluster. This is where users log in, compile, and submit jobs.
Compute Node: Where most of the work happens.
I/O server: Often an NFS server, but more aggressive parallel file systems such as PVFS can be used.
System logging server
Grid gateway node
Batch Scheduler and cluster-wide monitoring
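One way to picture an appliance is as a named set of services layered on a common base. The sketch below makes that idea concrete in Python. The service names, the appliance keys, and the helper functions are all hypothetical, and this is not the configuration format of any particular toolkit.

```python
# Hypothetical appliance definitions: each appliance is a set of services.
# All names here are illustrative assumptions.
BASE = {"sshd", "ntp"}

APPLIANCES = {
    "frontend": BASE | {"login", "compilers", "job-submission"},
    "compute": BASE | {"batch-execution"},
    "io-server": BASE | {"nfs-server"},
    "syslog-server": BASE | {"central-syslog"},
    "grid-gateway": BASE | {"grid-services"},
    "scheduler": BASE | {"batch-scheduler", "monitoring"},
}


def services_for(appliance):
    """Return the sorted services that define a given appliance type."""
    return sorted(APPLIANCES[appliance])


def small_cluster_head():
    """Merge every non-compute role, as a single head node on a small cluster would."""
    merged = set()
    for name, services in APPLIANCES.items():
        if name != "compute":
            merged |= services
    return sorted(merged)


print(services_for("io-server"))
print(small_cluster_head())
```

Because an appliance here is only data, changing a node's role amounts to assigning it a different entry, which mirrors the point that nodes change function by deploying a different software stack.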
When setting up a cluster, decisions are made as to how many I/O servers, how many system logging servers, and how many installation servers are needed to support a given number of compute nodes. For small- to mid-sized clusters (perhaps up to 128 nodes), the services are all hosted by a single front-end or head node (or a small number of them), so no real decision has to be made. However, even in mid-sized clusters, special attention is often paid to improving file handling capability by provisioning a sub-cluster of nodes dedicated to I/O. Chiba City at Argonne, for example, has different "towns" (visualization, storage, and compute) that clearly define functional differences.
In common cluster construction, one builds a head node, a set of I/O nodes (collectively, an I/O cluster), and a set of compute nodes. This chapter assumes that these types of "appliance" classifications have already been made by the cluster designer, but that at this point, nothing is installed or set up.
The issue that overshadows all others in cluster setup and management is creating and maintaining a software environment that is consistent across all nodes and node types. Small anomalies, such as different versions of the standard C library, can cause both performance and correctness problems. Programming clusters is challenging enough without users needing to figure out that nodes are behaving differently because of software version "skew" across the cluster. It is for this reason that cluster installation and setup is so intimately tied to ongoing management. It simply does no good to install a new node (either an expansion node or a replacement for a failed node) that differs in software versions or configuration from the running cluster. The new node must be brought into parity with the rest of the cluster. Two popular open-source clustering toolkits, NPACI Rocks and OSCAR, take radically different approaches to provisioning and management. Both toolkits' perspectives on installation will be described in some detail in this chapter.
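Version skew of the kind described above can be detected mechanically once each node's package list is available. The following is a minimal Python sketch, assuming a hypothetical inventory of the form node name to {package: version}. The node names, version strings, and the `find_skew` helper are illustrative assumptions; this is not how Rocks or OSCAR check consistency.

```python
from collections import defaultdict


def find_skew(packages_by_node):
    """Return {package: {version: [nodes]}} for packages that are not uniform.

    A package missing from a node is reported under the version None.
    """
    all_packages = set()
    for packages in packages_by_node.values():
        all_packages.update(packages)

    skew = {}
    for package in sorted(all_packages):
        nodes_by_version = defaultdict(list)
        for node in sorted(packages_by_node):
            version = packages_by_node[node].get(package)
            nodes_by_version[version].append(node)
        # More than one distinct version (or a missing package) means skew.
        if len(nodes_by_version) > 1:
            skew[package] = dict(nodes_by_version)
    return skew


# Hypothetical inventory; real data would come from each node's package database.
inventory = {
    "compute-0-0": {"glibc": "2.3.2-95", "openssh": "3.6.1"},
    "compute-0-1": {"glibc": "2.3.2-95", "openssh": "3.6.1"},
    "compute-0-2": {"glibc": "2.3.2-27", "openssh": "3.6.1"},
}

skew = find_skew(inventory)
for package, versions in skew.items():
    print(package, versions)
```

Running the same check after adding or replacing a node is one simple way to confirm that the new node has been brought into parity with the rest of the cluster.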
It is worth noting that diskless clusters often have fewer issues with software skew because all nodes mount a common root file system over NFS. Even so, diskless clusters are significantly less popular because of the scaling problems of serving all system software from a central NFS server. Chapters 3 and 20 cover some of the advantages and disadvantages of diskless nodes.