As mentioned previously, there is more to a cluster than simply servers in a rack. Another main component of the IBM Cluster 1350 is a set of software that includes the Linux operating system and the IBM Cluster Systems Management (CSM) for Linux. IBM General Parallel File System (GPFS) for Linux optionally can be included to provide high performance and parallel access to storage. Once the customer provides the Linux operating system, IBM Global Services can perform all the software installation and configuration.
The following sections provide a brief overview of each of the software components described above that are supported by IBM. The remaining chapters of this book address CSM and GPFS in more detail.
As discussed in Chapter 1, "Clustering concepts and general overview" on page 3, Linux is increasingly popular as an operating system. Due to its openness and relatively low cost, Linux-based clusters represent a very viable option for those wanting to utilize clustering to enhance availability, scalability, and/or processing power.
To provide a common base upon which the complex mixture of hardware and software that comprises a cluster could be consistently deployed, IBM chose to support a standard Linux distribution on the Cluster 1350. With its large user base and automated "Kickstart" installation, Red Hat Linux was the ideal choice.
The IBM Cluster 1350 offering currently supports multiple versions of both the Red Hat and SuSE distributions of Linux. For a complete list of supported versions, refer to 3.6, "Software requirements to run CSM" on page 67. The customer should provide the version of the Linux operating system specified by IBM. Additional Linux distributions and versions may be considered in the future.
IBM Cluster Systems Management for Linux (CSM) provides a distributed system-management solution for clustered machines that are running the Linux operating system, as distributed by Red Hat. CSM is an IBM licensed program that forms an integral part of the IBM Cluster 1350 platform. It is also available as a separately orderable software product. Using CSM, an administrator can easily set up and maintain a Linux cluster by using functions like automated set up, hardware control, monitoring, and configuration file management. The concepts and software are derived from IBM Parallel System Support Programs for AIX (PSSP) and from applications available as open source tools.
CSM allows a cluster of nodes to be managed as a single entity from a single point of control: the management node. From this single point of control, hardware can be monitored and power-controlled; software can be installed, configured, and monitored; and problems can be diagnosed, using a command-line interface or scripts. There is no need to have keyboards and monitors attached to the nodes in the cluster, as all operations can be performed from a single management console using the functions discussed in the following list.
With the current version of CSM (1.3.1), it is possible to:
Install Linux and/or CSM on cluster nodes over the network
Add, remove, or change nodes
Remotely control power to nodes in the cluster
Access a node's console remotely
Run remote commands across groups of nodes in the cluster
Centrally manage configuration files for the cluster
Monitor whether nodes and applications are active
Monitor CPU, memory, and system utilization
Run automated responses when events occur in the cluster
Each of these features is covered in greater detail later in the book, but for now we provide a brief overview.
CSM tools are able to perform either a CSM-only install or a Linux and CSM install (known as full-install) on each cluster node. For a complete explanation of these two installation methods, refer to:
Chapter 5, "Cluster installation and configuration with CSM" on page 99
6.1.2, "Adding new nodes using the full installation process" on page 154
IBM understands that it may be difficult to accurately predict demand or growth within a company. With that in mind, adding or replacing nodes in a cluster with CSM is straightforward. Creating a new node definition and installing a node can easily be accomplished within an hour, subject to network constraints. Removing nodes is equally simple.
The service processors in all nodes of the cluster are accessible, via the RSAs, on the management VLAN and therefore to the cluster management node. This capability is used by CSM so the nodes may be powered on/off or reset remotely. Advanced system management functions, such as monitoring environmental conditions, are also available.
This function uses the serial ports of the cluster servers and a terminal server to access cluster servers during Linux installation or when network access to the servers is unavailable. A single node's console may be viewed, or multiple consoles tiled across an X Window System display to monitor installation progress.
One of the most powerful and useful features of CSM, dsh allows the execution of arbitrary commands or scripts on all or some of the servers in the cluster. From a simple command line on the management node, the administrator can run any desired command on a group of nodes.
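The fan-out pattern that dsh implements can be sketched as follows. This is an illustrative model only, not CSM's actual implementation; the node names, the choice of ssh as a transport, and the `runner` hook are all assumptions made for the example.

```python
# Illustrative model of the dsh fan-out pattern; NOT CSM's implementation.
# The ssh transport and node names are assumptions for the example.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def ssh_runner(node, command):
    """Default transport: run the command on the node via ssh (assumed reachable)."""
    result = subprocess.run(["ssh", node, command], capture_output=True, text=True)
    return result.stdout

def dsh(nodes, command, runner=ssh_runner):
    """Run `command` on every node in parallel; return {node: output}."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        outputs = pool.map(lambda n: runner(n, command), nodes)
    return dict(zip(nodes, outputs))

# Usage with a local stand-in runner, so no cluster is required:
fake = lambda node, command: f"{node}: ok"
print(dsh(["node1", "node2"], "uptime", runner=fake))
```

Running the commands in parallel rather than serially is what keeps the approach practical as the node count grows.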
The CFM tool enables an administrator to set up configuration files in a central place. Configuration files are stored centrally on the management node and then pushed out to the cluster nodes. For any particular configuration file, the administrator can set up one version for all of the nodes or specify alternative versions that should be used for particular node groups. CFM can be configured so the nodes are updated automatically, periodically, or only when desired by the administrator.
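CFM's "one default version, optional per-group overrides" rule can be modeled as below. The file names, the `name._group` convention, and the repository dictionary are invented for illustration and do not reflect CSM's actual on-disk format.

```python
# Toy model of CFM's resolution rule: each node receives the most specific
# version of a configuration file. The naming convention and repository
# layout here are illustrative assumptions, not CSM's actual format.
def select_version(filename, node_group, repo):
    """Return the group-specific override if one exists, else the default."""
    return repo.get(f"{filename}._{node_group}", repo.get(filename))

repo = {
    "ntp.conf": "server timehost\n",             # default for all nodes
    "ntp.conf._storage": "server timehost2\n",   # override for one node group
}
print(select_version("ntp.conf", "compute", repo))   # default applies
print(select_version("ntp.conf", "storage", repo))   # override applies
```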
Within a cluster, it is important that all the nodes remain operational, despite the stresses jobs may subject them to. CSM allows proactive monitoring of node resources with configurable responses. Pre-defined monitors include a file system or paging space approaching full; pre-defined responses include sending e-mail to root and broadcasting a message to all users. Additional monitors and responses may be configured and activated, even on a time-based schedule.
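The condition/response pattern described above can be sketched as follows; the 90% threshold and both responses are illustrative stand-ins for CSM's pre-defined conditions and responses, not its actual interfaces.

```python
# Sketch of the condition/response pattern: a condition samples a resource
# and, past a threshold, fires the configured responses. The threshold and
# both responses are illustrative assumptions.
def check_paging_space(percent_used, threshold=90, responses=()):
    """Return the messages produced by any responses that fired."""
    if percent_used < threshold:
        return []
    return [respond(percent_used) for respond in responses]

mail_root = lambda pct: f"mail to root: paging space {pct}% full"
wall_users = lambda pct: f"broadcast: paging space {pct}% full"

print(check_paging_space(95, responses=(mail_root, wall_users)))
```

Decoupling the condition from its responses is what lets an administrator attach additional responses, or schedule them, without touching the monitor itself.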
A distributed file system is often critical to the operation of a cluster. If a high I/O throughput is required for a distributed file system, IBM General Parallel File System (GPFS) for Linux can be ordered with the IBM Cluster 1350 to provide high speed and reliable parallel storage access from large numbers of nodes within the cluster.
As the name suggests, GPFS is designed to allow multiple Linux nodes optimal access to the file system, even the same file, at the same time. Much like NFS, it appears to applications as just another file system. Unlike NFS, however, GPFS does not sit on top of an existing file system such as ext2; it accesses local or network-attached disks directly.
Concurrent reads and writes from multiple nodes are key to parallel processing. GPFS increases the concurrency and aggregate bandwidth of the file system by spreading reads and writes across multiple disks on multiple servers. Much as RAID-0 stripes read and write operations across multiple disks, GPFS stripes operations across multiple servers.
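The round-robin striping idea can be illustrated with a short sketch. The block size and disk count are arbitrary here, and GPFS's real block allocation is considerably more sophisticated; this only shows why consecutive operations land on different disks and can therefore proceed in parallel.

```python
# Round-robin striping, as in RAID-0: consecutive fixed-size blocks go to
# successive "disks". Block size and disk count are arbitrary choices here;
# GPFS's real allocation is far more sophisticated.
def stripe(data, n_disks, block_size):
    disks = [bytearray() for _ in range(n_disks)]
    for i in range(0, len(data), block_size):
        disks[(i // block_size) % n_disks] += data[i:i + block_size]
    return disks

def unstripe(disks, block_size, total_len):
    out = bytearray()
    offsets = [0] * len(disks)
    while len(out) < total_len:
        d = (len(out) // block_size) % len(disks)   # disk holding the next block
        out += disks[d][offsets[d]:offsets[d] + block_size]
        offsets[d] += block_size
    return bytes(out)[:total_len]

disks = stripe(b"abcdefghij", n_disks=3, block_size=2)
print(disks)  # each disk holds every third block
```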
GPFS is a journalling file system and creates separate logs for each node. These logs record the allocation and modification of metadata, aiding in fast recovery and restoration of data consistency in the event of node failure. Additionally, fail-over support allows for automatic recovery in the event of a server node failure. GPFS can be configured with multiple copies of metadata (the file system data that describes the user data), allowing continued operation should the paths to a disk or the disk itself malfunction. Disks may be added or deleted while the file system is mounted.
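The journaling principle, recording each metadata change in a log before applying it and replaying the log on recovery, can be sketched in miniature. This toy model is not GPFS code; the operations and the state layout are invented for illustration.

```python
# Toy write-ahead journal: log each metadata change before applying it, so
# the log can be replayed after a node failure. Not GPFS code; operations
# and state layout are invented for illustration.
def apply_op(state, op):
    kind, key, value = op
    if kind == "set":
        state[key] = value
    elif kind == "del":
        state.pop(key, None)

def commit(state, log, op):
    log.append(op)        # 1. record the intent in the journal first
    apply_op(state, op)   # 2. only then mutate the metadata

def recover(log):
    """Rebuild consistent metadata by replaying the journal from scratch."""
    state = {}
    for op in log:
        apply_op(state, op)
    return state

state, log = {}, []
commit(state, log, ("set", "inode7", "blocks 3-5"))
commit(state, log, ("del", "inode7", None))
```

Because the log is written before the change, a crash between the two steps leaves, at worst, a logged-but-unapplied operation, which replay resolves.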
GPFS data can be exported using Network File System (NFS), including the capability to export the same data from multiple nodes. This allows for some interesting possibilities in redundant NFS service.
GPFS provides the best performance for larger data objects, but can also provide benefits for large aggregates of smaller objects.
Once installed and configured, the IBM Cluster 1350 is equipped with everything you need to install, control, and manage hardware and operating system environments. However, there are additional software products that may be required for many clusters to provide the administrators and users with the tools they need to work efficiently. These include but are not limited to:
Batch queuing systems
High availability software
Non-IBM distributed file systems
Parallel programming libraries
In addition, each of the applications that are expected to run on the cluster should be reviewed to understand the co-requisites and prerequisites each requires. In the case of Linux, you should consider not only the version of Red Hat but also the kernel version, and whether it is a custom-built kernel or one officially supported by Red Hat.