Clusters


Clusters are another area where we are going to see increasing amounts of interest and activity. At some point, you may have a situation where you need to know whether your workload would benefit in some way from running on a cluster.

I am going to set up a small UML cluster, using Oracle's ocfs2 to demonstrate it. The key part of this, which is not common as hardware, is a shared storage device. For UML, this is simply a file on the host that multiple UML instances can share. In hardware, this would require a shared bus of some sort, which you quite likely don't have and which would be expensive to buy, especially for testing. Since UML requires only a file on the host, using it for cluster experiments is much more convenient and less expensive.

Getting Started

First, since ocfs2 is somewhat experimental (it is in Andrew Morton's -mm tree, not in Linus' mainline tree at this writing), you will likely need to reconfigure and rebuild your UML kernel. Second, procedures for configuring a cluster may change, so I recommend getting Oracle's current documentation. The user guide is available from http://oss.oracle.com/projects/ocfs2/.

The ocfs2 configuration script requires that everything related to ocfs2 be built as modules, rather than just being compiled into the kernel. This means enabling ocfs2 (in the Filesystems menu) and configfs (which is the "Userspace-driven configuration filesystem" item in the Pseudo Filesystems submenu). These options both need to be set to "M."
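
Set to "M," those two menu entries correspond to the .config symbols shown below; the symbol names are from the kernels of this era and may differ on other versions, so treat this as a sketch of what the configuration ends up looking like. The UML kernel and modules are then rebuilt with ARCH=um as usual:

host% grep 'CONFIGFS_FS\|OCFS2_FS' .config
CONFIG_CONFIGFS_FS=m
CONFIG_OCFS2_FS=m
host% make ARCH=um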

After building the kernel and modules, you need to copy the modules into the UML filesystem you will be using. The easiest way to do this is to loopback-mount the filesystem on the host (at ./rootfs, in this example) and install the modules into it directly:

host% mkdir rootfs
host# mount root_fs.cluster rootfs -o loop
host# make modules_install INSTALL_MOD_PATH=`pwd`/rootfs
  INSTALL fs/configfs/configfs.ko
  INSTALL fs/isofs/isofs.ko
  INSTALL fs/ocfs2/cluster/ocfs2_nodemanager.ko
  INSTALL fs/ocfs2/dlm/ocfs2_dlm.ko
  INSTALL fs/ocfs2/dlm/ocfs2_dlmfs.ko
  INSTALL fs/ocfs2/ocfs2.ko
host# umount rootfs


You can also install the modules into an empty directory, create a tar file of it, copy that into the running UML instance over the network, and untar it, which is what I normally do, as complicated as it sounds.
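
If you prefer the tar-over-the-network route, it looks roughly like this; the temporary directory and the instance's IP address are just placeholders for whatever you are using, and running depmod inside the instance afterward keeps the module dependency information consistent:

host% mkdir /tmp/mods
host% make modules_install INSTALL_MOD_PATH=/tmp/mods
host% tar cf mods.tar -C /tmp/mods .
host% scp mods.tar root@192.168.0.253:/tmp
UML# tar xf /tmp/mods.tar -C /
UML# depmod -a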

Once you have the modules installed, it is time to set things up within the UML instance. Boot it on the filesystem you just installed the modules into, and log into it. We need to install the ocfs2 utilities, which I got from http://oss.oracle.com/projects/ocfs2-tools/. There's a Downloads link from which the source code is available. You may wish to see if your UML root filesystem already has the utilities installed, in which case you can skip down to setting up the cluster configuration file.

My system doesn't have the utilities, so, after setting up the network, I grabbed the 1.1.2 version of the tools:

UML# wget http://oss.oracle.com/projects/ocfs2-tools/dist/files/source/v1.1/ocfs2-tools-1.1.2.tar.gz
UML# gunzip ocfs2-tools-1.1.2.tar.gz
UML# tar xf ocfs2-tools-1.1.2.tar
UML# cd ocfs2-tools-1.1.2
UML# ./configure


I'll spare you the configure output; I had to install a few packages, such as e2fsprogs-devel (for libcom_err.so), readline-devel, and glib2-devel. I didn't install the Python development package, which is needed only for the graphical ocfs2console. I'll be demonstrating everything on the command line, so we won't need that.
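
On a Fedora root filesystem, those build dependencies would typically be pulled in with yum; the exact package names vary between distributions, so this is only an illustration:

UML# yum install e2fsprogs-devel readline-devel glib2-devel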

After configuring the tools, we do the usual make and make install:

UML# make && make install


make install will put things under /usr/local unless you configured a different prefix.
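
If you want the tools under /usr instead, so they land in the standard system paths, the usual autoconf prefix option applies; this is plain configure behavior, nothing specific to ocfs2-tools:

UML# ./configure --prefix=/usr
UML# make && make install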

At this point, we can do some basic checking by looking at the cluster status and loading the necessary modules. The guide I'm reading refers to the control script as /etc/init.d/o2cb, which I don't have. Instead, I have ./vendor/common/o2cb.init in the source directory, which seems to behave as the fictional /etc/init.d/o2cb.

UML# ./vendor/common/o2cb.init status
Module "configfs": Not loaded
Filesystem "configfs": Not mounted
Module "ocfs2_nodemanager": Not loaded
Module "ocfs2_dlm": Not loaded
Module "ocfs2_dlmfs": Not loaded
Filesystem "ocfs2_dlmfs": Not mounted


Nothing is loaded or mounted. The script makes it easy to change this:

UML# ./vendor/common/o2cb.init load
Loading module "configfs": OK
Mounting configfs filesystem at /config: OK
Loading module "ocfs2_nodemanager": OK
Loading module "ocfs2_dlm": OK
Loading module "ocfs2_dlmfs": OCFS2 User DLM kernel interface loaded OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK


We can check that the status has now changed:

UML# ./vendor/common/o2cb.init status
Module "configfs": Loaded
Filesystem "configfs": Mounted
Module "ocfs2_nodemanager": Loaded
Module "ocfs2_dlm": Loaded
Module "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted


Everything looks good. Now we need to set up the cluster configuration file. There is a template in documentation/samples/cluster.conf, which I copied to /etc/ocfs2/cluster.conf after creating /etc/ocfs2 and which I modified slightly to look like this:

UML# cat /etc/ocfs2/cluster.conf
node:
        ip_port = 7777
        ip_address = 192.168.0.253
        number = 0
        name = node0
        cluster = ocfs2
node:
        ip_port = 7777
        ip_address = 192.168.0.251
        number = 1
        name = node1
        cluster = ocfs2
cluster:
        node_count = 2
        name = ocfs2


The one change I made was to alter the IP addresses to what I intend to use for the two UML instances that will form the cluster. You should use IP addresses that work on your network.

The last thing to do before shutting down this instance is to create the mount point where the cluster filesystem will be mounted:

UML# mkdir /ocfs2


Shut this instance down, and we will boot the cluster after taking care of one last item on the host, creating the device that the cluster nodes will share:

host% dd if=/dev/zero of=ocfs seek=$[ 100 * 1024 ] bs=1K count=1
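
Seeking 100 * 1024 one-kilobyte blocks into the file and then writing a single block leaves a sparse file just over 100MB in size, which consumes almost no space on the host until the cluster starts writing to it. If you want to check, ls -l shows the apparent size while du shows the space actually allocated:

host% ls -l ocfs
host% du -h ocfs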


Booting the Cluster

Now we boot two UML instances on COW files with the filesystem we just used as their backing file. So, rather than using ubda=rootfs as we had before, we will use ubda=cow.node0,rootfs and ubda=cow.node1,rootfs for the two instances, respectively. I am also giving them umids of node0 and node1 in order to make them easy to reference with uml_mconsole later.
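
The resulting command lines look something like the following; the kernel binary name and the memory size are whatever you have been using elsewhere, and only the ubda and umid arguments matter here:

host% ./linux ubda=cow.node0,rootfs umid=node0 mem=128M
host% ./linux ubda=cow.node1,rootfs umid=node1 mem=128M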

The reason for mostly configuring ocfs2, shutting the UML instance down, and then starting up the cluster nodes is that the filesystem changes we made, such as installing the ocfs2 tools and the configuration file, will now be visible in both instances. This saves us from having to do all of the previous work twice.

With the two instances running, we need to give them their separate identities. The cluster.conf file specifies the node names as node0 and node1, so we need to change the host names of the two instances to match. In Fedora Core 4, which I am using, the names are stored in /etc/sysconfig/network. The host part of the HOSTNAME value needs to be changed to node0 in one instance and to node1 in the other. The domain name can be left alone.
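
After the edit, the file in the first instance would look something like this; the domain and any other entries are just placeholders for whatever your filesystem already contains:

UML1# cat /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=node0.example.com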

We also need to set the host names by hand, since the configuration file was changed after the instances had already booted:

UML1# hostname node0


and

UML2# hostname node1


Next, we need to bring up the network for both instances:

host% uml_mconsole node0 config eth0=tuntap,,,192.168.0.254
OK
host% uml_mconsole node1 config eth0=tuntap,,,192.168.0.252
OK


When configuring eth0 within the instances, it is important to assign the IP addresses specified earlier in the cluster.conf file. In my example above, node0 has IP address 192.168.0.253 and node1 has address 192.168.0.251:

UML1# ifconfig eth0 192.168.0.253 up


and

UML2# ifconfig eth0 192.168.0.251 up


At this point, we need to set up a filesystem on the shared device, so it's time to plug it in:

host% uml_mconsole node0 config ubdbc=ocfs


and

host% uml_mconsole node1 config ubdbc=ocfs


The c following the device name is a flag telling the block driver that this device will be used as a clustered device, so it shouldn't lock the file on the host. You should see this message in the kernel log after plugging the device:

Not locking "/home/jdike/linux/2.6/ocfs" on the host


Before making a filesystem, it is necessary to bring the cluster up in both nodes:

UML# ./vendor/common/o2cb.init online ocfs2
Loading module "configfs": OK
Mounting configfs filesystem at /config: OK
Loading module "ocfs2_nodemanager": OK
Loading module "ocfs2_dlm": OK
Loading module "ocfs2_dlmfs": OCFS2 User DLM kernel interface loaded OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Starting cluster ocfs2: OK


Now, on one of the nodes, we run mkfs:

mkfs.ocfs2 -b 4K -C 32K -N 8 -L ocfs2-test /dev/ubdb
mkfs.ocfs2 1.1.2-ALPHA
Overwriting existing ocfs2 partition.
(1552,0):__dlm_print_nodes:380 Nodes in my domain ("CB7FB73E8145436EB93D33B215BFE919"):
(1552,0):__dlm_print_nodes:384 node 0
Filesystem label=ocfs2-test
Block size=4096 (bits=12)
Cluster size=32768 (bits=15)
Volume size=104857600 (3200 clusters) (25600 blocks)
1 cluster groups (tail covers 3200 clusters, rest cover 3200 clusters)
Journal size=4194304
Initial number of node slots: 8
Creating bitmaps: done
Initializing superblock: done
Writing system files: done
Writing superblock: done
Writing lost+found: done
mkfs.ocfs2 successful


This specifies a block size of 4096 bytes, a cluster size of 32768 bytes, a maximum of eight node slots (so up to eight nodes can mount the filesystem at once), and a volume label of ocfs2-test.

At this point, we can mount the device in both nodes, and we have a working cluster:

UML1# mount /dev/ubdb /ocfs2 -t ocfs2
(1618,0):ocfs2_initialize_osb:1165 max_slots for this device: 8
(1618,0):ocfs2_fill_local_node_info:836 I am node 0
(1618,0):__dlm_print_nodes:380 Nodes in my domain ("B01E29FE0F2F43059F1D0A189779E101"):
(1618,0):__dlm_print_nodes:384 node 0
(1618,0):ocfs2_find_slot:266 taking node slot 0
JBD: Ignoring recovery information on journal
ocfs2: Mounting device (98,16) on (node 0, slot 0)


UML2# mount /dev/ubdb /ocfs2 -t ocfs2
(1442,0):o2net_set_nn_state:417 connected to node node0 (num 0) at 192.168.0.253:7777
(1522,0):ocfs2_initialize_osb:1165 max_slots for this device: 8
(1522,0):ocfs2_fill_local_node_info:836 I am node 1
(1522,0):__dlm_print_nodes:380 Nodes in my domain ("B01E29FE0F2F43059F1D0A189779E101"):
(1522,0):__dlm_print_nodes:384 node 0
(1522,0):__dlm_print_nodes:384 node 1
(1522,0):ocfs2_find_slot:266 taking node slot 1
JBD: Ignoring recovery information on journal
ocfs2: Mounting device (98,16) on (node 1, slot 1)


Now we start to see communication between the two nodes. This is visible in the output from the second mount and in the kernel log of node0 when node1 comes online.

To quickly demonstrate that we really do have a cluster, I will copy a file into the filesystem on node0 and see that it's visible on node1:

UML1# cd /ocfs2
UML1# cp ~/ocfs2-tools-1.1.2.tar .
UML1# ls -al
total 2022
drwxr-xr-x   3 root root    4096 Oct 14 16:24 .
drwxr-xr-x  28 root root    4096 Oct 14 16:17 ..
drwxr-xr-x   2 root root    4096 Oct 14 16:15 lost+found
-rw-r--r--   1 root root 2058240 Oct 14 16:24 ocfs2-tools-1.1.2.tar


On the second node, I'll unpack the tar file to see that it's really there.

UML2# cd /ocfs2
UML2# ls -al
total 2022
drwxr-xr-x   3 root root    4096 Oct 14 16:15 .
drwxr-xr-x  28 root root    4096 Oct 14 16:18 ..
drwxr-xr-x   2 root root    4096 Oct 14 16:15 lost+found
-rw-r--r--   1 root root 2058240 Oct 14 16:24 ocfs2-tools-1.1.2.tar
UML2# tar xf ocfs2-tools-1.1.2.tar
UML2# ls ocfs2-tools-1.1.2
COPYING         aclocal.m4      fsck.ocfs2     mount.ocfs2    rpmarch.guess
CREDITS         config.guess    glib-2.0.m4    mounted.ocfs2  runlog.m4
Config.make.in  config.sub      install-sh     o2cb_ctl       sizetest
MAINTAINERS     configure       libo2cb        ocfs2_hb_ctl   tunefs.ocfs2
Makefile        configure.in    libo2dlm       ocfs2cdsl      vendor
Postamble.make  debian          libocfs2       ocfs2console
Preamble.make   debugfs.ocfs2   listuuid       patches
README          documentation   mkfs.ocfs2     python.m4
README.O2CB     extras          mkinstalldirs  pythondev.m4


This is the simplest possible use of a clustered filesystem. At this point, if you were evaluating a cluster as an environment for running an application, you would copy its data into the filesystem, run it on the cluster nodes, and see how it does.

Exercises

For some casual usage here, we could put our users' home directories in the ocfs2 filesystem and experiment with having the same file accessible from multiple nodes. This would be a somewhat advanced version of NFS home directories.
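
As a starting point, and assuming the shared device and cluster setup from this chapter, you could simply mount the ocfs2 volume over /home on both nodes once the cluster is online, after copying the home directories into it:

UML1# mount /dev/ubdb /home -t ocfs2
UML2# mount /dev/ubdb /home -t ocfs2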

A more advanced project would be to boot the nodes into an ocfs2 root filesystem, making them as clustered as they can be, given only one filesystem. We would need to solve a couple of problems.

  • The cluster needs to be running before the root filesystem can be mounted. This would require an initramfs image containing the necessary modules, initialization script, and tools. A script within this image would need to bring up the network and run the o2cb control script to bring up the cluster (a rough sketch of such a script follows this list).

  • The cluster nodes need some private data to give them their separate identities. Part of this is the network configuration and node names. Since the network needs to be operating before the root filesystem can be mounted, some of this information would be in the initramfs image.

  • The rest of the node-private information would have to be provided in files on a private block device. These files would be bind-mounted from this device over a shared file within the cluster file system, like this:

UML# mount --bind /private/network /etc/sysconfig/network
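
To make the first item concrete, the initramfs /init script might look roughly like the following. This is an untested sketch: the module and device paths, the IP address, and the use of switch_root are all assumptions, and the image would also need cluster.conf and the o2cb tools, plus some error handling.

#!/bin/sh
# Hypothetical initramfs /init for an ocfs2 root - a sketch, not a recipe

mount -t proc proc /proc
mount -t sysfs sysfs /sys

# Load the cluster modules bundled into the image
for m in configfs ocfs2_nodemanager ocfs2_dlm ocfs2_dlmfs ocfs2; do
    insmod /modules/$m.ko
done

# Node-private identity: each node's image (or a private device) supplies this
ifconfig eth0 192.168.0.253 up

# Bring the cluster online, then mount the shared root
/o2cb.init online ocfs2
mount -t ocfs2 /dev/ubdb /newroot

# Hand control to the real init on the cluster filesystem
exec switch_root /newroot /sbin/init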


Without having done this myself, I am no doubt missing some other issues. However, none of this seems insurmountable, and it would make a good project for someone wanting to become familiar with setting up and running a cluster.

Other Clusters

I've demonstrated UML's virtual clustering capabilities using Oracle's ocfs2. This isn't the only clustering technology available; I chose it because it nicely demonstrates the use of a host file to replace an expensive piece of hardware, a shared disk. Other Linux cluster filesystems include Lustre from CFS, GFS from Red Hat, and, with a generous definition of clustering, NFS.

Further, filesystems aren't the only form of clustering technology. Clustering technologies range from simple failover, high-availability clusters to integrated single-system-image clusters, where the entire cluster looks and acts like a single machine.

Most of these run with UML, either because they are architecture-independent and will run on any architecture that Linux supports, or because they are developed using UML and are thus guaranteed to run with UML. Many satisfy both conditions.

If you are looking into using clusters because you have a specific need or are just curious about them, UML is a good way to experiment. It provides a way to bring multiple nodes up without needing multiple physical machines. It also lets you avoid buying exotic hardware that the clustering technology may require, such as the shared storage required by ocfs2. UML makes it much more convenient and less expensive to bring in multiple clustering technologies and experiment with them in order to determine which one best meets your needs.


