Daniel Nurmi and Brian Toonen
Network configuration is a commonly overlooked aspect of cluster design. Although designers consider the details of the required network hardware, they frequently neglect the network design itself until after the cluster is installed and users start running code on the system.
The cluster network, the topic of this chapter, is most simply defined as the set of methods employed to connect the various cluster entities via networks. This high-level definition leads us to consider equally high-level issues of node connectivity, node visibility, and cluster networking services. We will quickly discover that these seemingly simple issues encompass more complex topics, such as how cluster users interact with the machine, how security requirements imposed on the system affect the network design, and how application performance varies with the cluster network design. The methods used to handle these issues are implemented in the cluster network design, which we define as an administrative network topology imposed on the cluster to organize security, performance, and usability policies.
This chapter aims to bring the concept of cluster network design and tuning to the forefront of cluster designers' minds during the design phase. Fundamentally, we hope to leave the reader with the sense that a cluster's network design heavily impacts its core operation.
The rest of this chapter is organized as follows. First, we introduce some important issues that face the cluster designer and show how the choice of network design directly affects them. Next, we introduce some fundamental concepts that are used throughout the rest of the chapter, such as the Internet protocols and basic Linux networking. Then, we construct a simple cluster from the leftmost side of the cluster network design continuum (fully connected, fully visible), covering the most fundamental configuration issues by taking a set of machines and setting up the network and network services so that the machines act as a cluster running parallel codes. We then use this theoretical system as a vehicle to introduce performance and security optimization techniques. We conclude with a brief discussion of diagnosing and correcting network problems.