Creating a Batch Job-Scheduling System with No Single Point of Failure


In this final case study, we will build a batch job scheduler with no single point of failure. As before, we assume the cluster is protected by a firewall and that security barriers are not required between the cluster nodes. (As with a monolithic server of old, a root compromise means the cracker will have access to everything.)

We'll first use ssh to construct a method of accessing a shell or running a command on any cluster node from a single machine.

Run ssh-keygen

Recall from Chapter 4 that the ssh-keygen command can create a key without a passphrase, so that ssh can later be used without keying in a password each time. Run the following command on the cluster node manager:

 # ssh-keygen -t rsa

Do not enter a passphrase when prompted. Now copy the contents of the file /root/.ssh/id_rsa.pub on the cluster node manager to each cluster node's /root/.ssh/authorized_keys2 file.
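
One way to do this is with a short loop on the cluster node manager. The following is only a minimal sketch, assuming the cluster nodes are named clnode1 through clnode3 (adjust the list for your cluster) and that password authentication is still enabled, so you will be prompted once for each node's root password:

 #!/bin/bash
 # Sketch: append the cluster node manager's public RSA key to each
 # cluster node's /root/.ssh/authorized_keys2 file. The node names
 # below are assumptions; change them to match your cluster.
 for node in clnode1 clnode2 clnode3; do
     cat /root/.ssh/id_rsa.pub | \
         ssh root@$node 'mkdir -p /root/.ssh && cat >> /root/.ssh/authorized_keys2'
 done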

Modify the sshd_config File on Each Cluster Node

We are trusting in the ssh keys to protect the cluster from a root user attack (you have to know the private half of the RSA key used on the cluster node manager to connect as the root user to the cluster node). The root user account on the cluster node manager and the protection of the RSA key will now be as important as protecting the root password. You can use LIDS (http://www.lids.org) to hide this file for increased security, or you can create a separate cluster administrative account without root privileges.[8]

Modify the PermitRootLogin line in each cluster node's /etc/ssh/sshd_config file so it looks like this:

 PermitRootLogin without-password

Then start or restart the sshd daemon on each cluster node.

Note 

Make sure the sshd daemon is started each time the cluster nodes boot (see Chapter 1).
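
On a Red Hat-style distribution, for example, the commands might look like the following (a sketch; the exact commands depend on your distribution's init system):

 # service sshd restart
 # chkconfig sshd on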

Create the (RSA) known_hosts Entries on the Cluster Node Manager

Type a simple command such as the following on the cluster node manager:

 # ssh clnode1 hostname

Run the command once for each cluster node, substituting each node's name for clnode1 in this example. You should see a message like the following, which allows you to save each node's host key on the cluster node manager.

 The authenticity of host 'clnode1' can't be established.
 RSA key fingerprint is 04:54:c1:f3:33:ff:89:14:a0:8d:f3:5d:67:f5:43:21.
 Are you sure you want to continue connecting (yes/no)? yes
 Warning: Permanently added 'clnode1' (RSA) to the list of known hosts.
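
You can also collect all of the host keys in one pass with a short loop on the cluster node manager (a sketch, again assuming the nodes are named clnode1 through clnode3); answer yes at each prompt:

 # for node in clnode1 clnode2 clnode3; do ssh $node hostname; done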

Note 

If you are prompted for the password on each cluster node, review Chapter 4 for details on how to troubleshoot your ssh configuration.

The Batch Job Scheduler

We'll call the system where we launch batch jobs the batch job scheduler. In this case study, the batch job scheduler is the cluster node manager (which now has ssh access to all of the cluster nodes). The batch job scheduler is a highly available server pair that places the cron daemon under Heartbeat's control and uses crontab[9] entries to submit batch jobs with the clustersh script included on the CD-ROM in the chapter19 directory. A sample crontab entry on the batch job scheduler looks like this:

 30 14 * * *   clustersh /clusterdata/scripts/myscript 

This crontab entry says that every day at 2:30 p.m. the program /clusterdata/scripts/myscript should run on the cluster node with the least activity.

Note 

The clustersh script contains the host names of the nodes inside your cluster. You'll need to modify this script file before you can use it on your cluster (see the script for detailed documentation).

If a cluster node goes down, the clustersh script will not attempt to execute /clusterdata/scripts/myscript on it.
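
The clustersh script on the CD-ROM is the authoritative version and contains its own documentation; the following is only a minimal sketch of the idea behind such a wrapper, not the actual script. It assumes the node names clnode1 through clnode3 and simply runs the command on the reachable node with the lowest one-minute load average:

 #!/bin/bash
 # Sketch only -- not the clustersh script from the CD-ROM.
 # Runs the command given on the command line on the reachable
 # cluster node with the lowest one-minute load average.
 NODES="clnode1 clnode2 clnode3"    # assumption: adjust to your cluster

 best_node=""
 best_load=""
 for node in $NODES; do
     # Skip nodes that are down or unreachable over ssh.
     load=$(ssh -o ConnectTimeout=5 $node \
         "cat /proc/loadavg 2>/dev/null | cut -d' ' -f1") || continue
     [ -z "$load" ] && continue
     if [ -z "$best_node" ] || \
        [ "$(echo "$load < $best_load" | bc)" = "1" ]; then
         best_node=$node
         best_load=$load
     fi
 done

 if [ -z "$best_node" ]; then
     echo "No cluster nodes are reachable" >&2
     exit 1
 fi

 ssh $best_node "$@"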

To run the same script on all cluster nodes at 1:30 p.m. each day, the crontab entry would look like this:

 30 13 * * *   clustersh -a /clusterdata/scripts/myscript

By running cron jobs from the highly available batch job scheduler, you avoid a single point of failure for your regularly scheduled batch jobs.

Note 

The /etc/ha.d/haresources file (part of the Heartbeat configuration on the batch job scheduler) should specify that the crond daemon is started only on the primary server, and that it starts on the backup server only if the primary server goes down.
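
For example, a single haresources line like the following tells Heartbeat that crond normally runs on the primary node and fails over to the backup. This is a sketch; the node name bjsched1 and the virtual IP address are assumptions, and crond must be a resource script Heartbeat can find (on Red Hat-style systems, the /etc/init.d/crond init script):

 bjsched1 192.168.1.110 crond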

One danger of this configuration, however, is that you may change the crontab file on the primary batch job scheduler and forget to make the same change on the backup batch job scheduler. To automate copying the crontab file from the primary batch job scheduler to the backup, use rsync, as in the following script, whose last line copies the contents of the /var/spool/cron directory (where the crontab file resides):

 #!/bin/bash
 OPTS=" --force --delete --recursive --ignore-errors -a -e ssh -z "
 rsync $OPTS /etc/ha.d/          backup:/etc/ha.d/
 rsync $OPTS /etc/hosts          backup:/etc/hosts
 rsync $OPTS /etc/printcap       backup:/etc/printcap
 rsync $OPTS /var/spool/lpd/     backup:/var/spool/lpd/
 rsync $OPTS /etc/mon            backup:/etc/mon
 rsync $OPTS /usr/lib/mon/       backup:/usr/lib/mon/
 rsync $OPTS /var/spool/cron/    backup:/var/spool/cron/

The last line in this script tells rsync to copy the contents of the /var/spool/cron directory to the backup batch job scheduler (the machine with the host name backup).
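
You might then run this synchronization script periodically from root's crontab on the primary batch job scheduler so that the backup stays current. This is a sketch; the path /usr/local/sbin/syncback.sh is an assumption:

 # Copy the Heartbeat, printing, mon, and cron files to the backup
 # server at the top of every hour.
 0 * * * *   /usr/local/sbin/syncback.sh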

[8]See the discussion of the sudo command in Chapter 4 for a method of giving special administrative account access to only certain root-level commands or capabilities.

[9]The crontab file is the name of the file the cron daemon uses to store the list of cron jobs it should run. Edit the crontab file with crontab -e.


