Before diving into the world of designing and configuring a Tru64 UNIX cluster, a brief overview of the Tru64 UNIX operating system will help set the table. Consider this chapter the appetizer and the following chapters the main meal. To help you digest the main topics in this book, we will prepare you by discussing several concepts and features of Tru64 UNIX and the TruCluster Server software that is now part of the operating system.
The name of this operating system speaks volumes. First, it is a UNIX-based operating system. It falls towards the middle of the UNIX family tree because it draws some of its characteristics from both the BSD[1] and System V[2] sides of the family. It also has a healthy dose of core code created by Compaq engineers.
Figure 2-1 depicts several of the common UNIX variants. Note Tru64 UNIX at the bottom of the diagram.
Figure 2-1: Tru64 UNIX History
Second, Tru64 UNIX is truly a 64-bit operating system. The virtual addresses used in the system are indeed 64 bits, providing a huge virtual address space and supporting large file offsets and sizes. So, is that it? Are those the distinguishing features of the system? There are actually many more features of the operating system, which we will visit in the first part of this chapter. If you are familiar with another UNIX system, you may want to take a quick look at the first section of this chapter but plan on slowing down and carefully reading the TruCluster Server Overview (section 2.7).
Tru64 UNIX has rapidly expanded its capabilities to the point where it provides the ability to support a Single System Image (SSI) Cluster option (as discussed in the previous chapter). Tru64 UNIX has been an integral part of the computer mix at many sites for many years – even before full-bodied clustering (SSI) was available.
Which features attracted customers to Tru64 UNIX before the advent of clustering? As you will see, there are many. We'll point out some features in the next few sections and relate those features to TruCluster Server (the focus of this book).
Lurking at the very heart of Tru64 UNIX are elements of the Mach kernel. Mach is a system created at Carnegie Mellon University. It includes the notions of tasks and threads that figure prominently within the workings of Tru64 UNIX. A "task" represents a running program, while a "thread" is a schedulable entity within that program. Historically, programs were written with a single thread. Most of the advanced UNIX variants provide the ability to create multi-threaded programs so that they can take better advantage of multiple-CPU systems.
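The relationship between a task and its threads can be illustrated with a minimal sketch. Python is used here purely for illustration; it does not show the Mach or Tru64 UNIX interfaces, and the names run_task and worker are hypothetical. The point is only that several independently scheduled threads live inside one running program and share its data.

```python
import os
import threading


def run_task(num_threads=4):
    """Run several threads inside one process and record what each one saw."""
    results = []             # shared data: every thread sees the same list
    lock = threading.Lock()  # serializes updates to the shared list

    def worker(index):
        # Each thread is scheduled on its own but belongs to the same process.
        with lock:
            results.append((index, os.getpid()))

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)


observations = run_task()
# Every thread reports the same process ID: one "task", many threads.
pids = {pid for _, pid in observations}
```

Each worker appends to the same list without any copying between address spaces, which is the property that distinguishes threads from separate processes.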
The Mach kernel is also touted as a "Microkernel" (a small, compartmentalized kernel supported by many kernel mode threads and user mode processes), despite the fact that earlier releases, such as the V2.5 upon which Tru64 UNIX was built, were actually monolithic in nature. While not central to the function of TruCluster Server, this notion is important to the ongoing development of TruCluster Server and Tru64 UNIX in general. Essentially, key alterations in the system kernel can be implemented much more rapidly with a microkernel (or even a pseudo-microkernel) since the subsystems are very well defined and somewhat isolated from one another. Note that Tru64 UNIX is not strictly using the microkernel strategy but borrows heavily from it (we'd love to say the subsystems are completely distinct from one another, but that's just not true). This provides us with a flexible software product (Tru64 UNIX and TruCluster Server) to which new features can be added relatively quickly. Indeed, TruCluster Server itself is an example of rapid adaptation of the operating system to include new features and subsystems.
Some cluster components are implemented as kernel threads. Others are implemented as process-based code consisting of one or more threads. Still others are subsystems within the kernel. These components ultimately rely on system functions partially derived from Mach. Many other cluster components are implemented as driver-level kernel code. The following sections will briefly develop and introduce many of the key system components and cluster components. All cluster components will be discussed in subsequent chapters of the book.
The system uses virtual addresses, which are translated into physical addresses to provide access to data and code in memory (or I/O space). The previous sentence could be used to describe just about any modern operating system. Tru64 UNIX has solved the problem of representing a virtual address space consisting of 2^64 bytes of potential addressability (most other UNIX variants are years behind Compaq, now HP, in developing 64-bit systems). It does this using a clever three-level page table scheme that we don't need to detail here. The point is that it is a key feature of the system and is used heavily by all components including the TruCluster Server components.
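For readers who want a feel for what a three-level lookup involves, here is a rough sketch. The bit widths are assumptions, not taken from this chapter: a 13-bit page offset (8 KB pages) and 10 index bits per level. The function names and the use of nested dictionaries as page tables are hypothetical and stand in for the real kernel structures.

```python
PAGE_SHIFT = 13   # assumed: 8 KB pages, so the low 13 bits are the page offset
LEVEL_BITS = 10   # assumed: 1024 entries in each page-table page
LEVELS = 3        # three-level scheme, as described in the text


def split_virtual_address(va):
    """Split a virtual address into (level1, level2, level3) indexes and offset."""
    offset = va & ((1 << PAGE_SHIFT) - 1)
    vpn = va >> PAGE_SHIFT
    indexes = []
    for _ in range(LEVELS):
        indexes.append(vpn & ((1 << LEVEL_BITS) - 1))
        vpn >>= LEVEL_BITS
    indexes.reverse()  # report the top-level index first
    return tuple(indexes), offset


def translate(va, root):
    """Walk the three levels; a missing entry (KeyError) models a page fault."""
    (l1, l2, l3), offset = split_virtual_address(va)
    frame = root[l1][l2][l3]
    return (frame << PAGE_SHIFT) | offset


# Hypothetical mapping: the page at indexes (1, 2, 3) lives in physical frame 42.
root = {1: {2: {3: 42}}}
va = (1 << 33) | (2 << 23) | (3 << 13) | 0x123
pa = translate(va, root)
```

Because only the page-table pages that are actually needed must exist, a sparse tree like this can describe an enormous address space without a correspondingly enormous table.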
The Unified Buffer Cache (UBC) is an innovation through which Tru64 UNIX can tune itself, at least partially. The memory caching needs of the file systems tend to be in direct conflict with the memory needs of processes. If the system is experiencing a burst of I/O activity, the file system caching memory count (generally referred to as the UBC page count) will increase. If the virtual memory requests from processes become heavy, the pages are taken back from the UBC and used for process memory. And so the pendulum can swing back and forth throughout the life of your system without your lifting a finger. Pretty impressive, huh?
To be fair, Tru64 UNIX is not the only UNIX that uses this strategy.
As you will see, the UBC is used by several of the I/O components that make clustering possible.
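The pendulum effect described above can be sketched with a toy model. This is a deliberate simplification under stated assumptions: the class MemoryPool and its methods are hypothetical, and the real UBC is governed by kernel policies and tunables that are not modeled here. The sketch only shows file caching growing into free pages and process demand taking pages back from the cache.

```python
class MemoryPool:
    """Toy model: a fixed pool of pages shared by the UBC and processes."""

    def __init__(self, total_pages):
        self.free = total_pages
        self.ubc = 0
        self.process = 0

    def cache_file_data(self, pages):
        """File I/O grows the UBC, but only by consuming free pages."""
        granted = min(pages, self.free)
        self.free -= granted
        self.ubc += granted
        return granted

    def allocate_process_memory(self, pages):
        """Process demand uses free pages first, then reclaims UBC pages."""
        from_free = min(pages, self.free)
        self.free -= from_free
        from_ubc = min(pages - from_free, self.ubc)
        self.ubc -= from_ubc
        self.process += from_free + from_ubc
        return from_free + from_ubc

    def release_process_memory(self, pages):
        """Pages given up by processes return to the free list."""
        released = min(pages, self.process)
        self.process -= released
        self.free += released
        return released


pool = MemoryPool(total_pages=1000)
pool.cache_file_data(800)          # I/O burst: the UBC grows to 800 pages
pool.allocate_process_memory(500)  # 200 free pages plus 300 taken from the UBC
```

After these two calls the model holds 500 UBC pages and 500 process pages; releasing process memory and caching more file data swings the balance back toward the UBC.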
Shared libraries provide for the sharing of code at the function level. UNIX has always been good at sharing code at the process level, meaning that two users who both happen to be running the vi(1) editor at the same time, for example, will be sharing the single copy of the vi code that is brought into memory. But UNIX has traditionally been weak at sharing at the function level. So if one process were running the vi editor and the other were running emacs(1) (we'll assume that these two editors use many of the same functions), traditional UNIX would have brought two copies of the potentially shared functions into memory.
Shared libraries provide a mechanism where any program that uses shared library functions (think printf(3)) will reference the single copy of the function code that has been brought into memory. Note that the system is an "on demand" system, so none of the shared functions are in memory until the first request causes one to be brought in. Likewise, as soon as there are no users of the function, the memory that it occupies will be freed.
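The on-demand behavior just described amounts to loading on first use and freeing when the last user goes away. The following is a minimal sketch of that bookkeeping only; the class SharedLibraryCache and its methods are hypothetical and do not represent the Tru64 UNIX loader.

```python
class SharedLibraryCache:
    """Toy model: one in-memory copy per function, kept while it has users."""

    def __init__(self):
        self.resident = {}  # function name -> set of process IDs using it
        self.loads = 0      # how many times code had to be brought into memory

    def attach(self, pid, function):
        """Bring the function in on first request; later users share the copy."""
        if function not in self.resident:
            self.resident[function] = set()
            self.loads += 1
        self.resident[function].add(pid)

    def detach(self, pid, function):
        """Drop one user; free the copy when nobody references it any more."""
        users = self.resident.get(function)
        if users is None:
            return
        users.discard(pid)
        if not users:
            del self.resident[function]


cache = SharedLibraryCache()
cache.attach(pid=100, function="printf")  # first request loads the code
cache.attach(pid=200, function="printf")  # second process shares the same copy
```

Two processes are attached, yet the load counter stays at one. Once both detach, the entry disappears from the resident table.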
The process-level TruCluster Server code is linked against shared libraries. The following example shows that the Cluster Application Availability Daemon (caad(8)) is linked against shared libraries. We then document which shared libraries are referenced within the caad process.
# file /usr/sbin/caad
/usr/sbin/caad: COFF format alpha dynamically linked, demand paged executable or object module stripped - version 3.13-14

# odump -Dl /usr/sbin/caad

                        ***LIBRARY LIST SECTION***
        Name             Time-Stamp              CheckSum   Flags Version
/usr/sbin/caad:
        libpolicy.so     Jan 16 04:39:17 2002    0x7958cdb5     0 osf.1
        libevm.so        Jan 15 17:37:17 2002    0xde4a5d09     0 osf.1
        libclu.so        Jan 15 17:36:05 2002    0xd148a817     0 osf.1
        libm.so          Jan 15 17:20:50 2002    0x07757304     0 osf.1
        libpthread.so    Jan 15 17:26:48 2002    0x42a00c94     0 osf.1
        libcxx.so        Jan 15 17:29:14 2002    0x9060972e     0 cxx6.3
        libexc.so        Jan 15 17:20:58 2002    0xb0f9a902     0 osf.1
        libc.so          Jan 15 17:19:09 2002    0x1e4e245f     0 osf.1
The following command output shows where the shared library files reside in Tru64 UNIX and lists some of them.
# ls /usr/shlib
.mrg..so_locations      libarmui.so             libmsfs.so
.new..so_locations      libaud.so               libmxr.so
.proto..so_locations    libawt.so               libndb.so
TCR_libclu.so           libawt_g.so             libnet.so
X11                     libbkr.so               libnet_g.so
_null                   libc.so                 libnuma.so
diagui__unix.uid        libc_r.so               libots.so
ev6                     libcdrom.so             libots3.so
generic                 libcfg.so               libpacl.so
libDSNLinkAPI.so        libchf.so               libpolicy.so
libDXm.so               libclu.so               libproplist.so
libDXterm.so            libclua.so              libpset.so
libDeCOR.so             libcmalib.so            libpthread.so
libDtHelp.so            libcsa.so               libpthreaddebug.so
libDtMail.so            libcurses.so            libpthreads.so
...
Most of the physical memory on Tru64 UNIX is pageable. This means that the contents of the memory pages may be paged out to swap space (on disk), or swapped out to swap space if the system's free page list becomes critically low. Certain applications may require that portions of their memory be treated as non-pageable. This activity (referred to as "wiring down" a page) is limited to processes that are owned by root. The kernel may also wire down pageable pages to meet its dynamic memory requirements.
The following example concludes with a section displaying statistics on the wired pages within the system.
# vmstat -P

Total Physical Memory =   128.00 M
                      =    16384 pages

Physical Memory Clusters:

 start_pfn     end_pfn        type  size_pages / size_bytes
         0         256         pal         256 /   2.00M
       256       16287          os       16031 / 125.24M
     16287       16384         pal          97 / 776.00k

Physical Memory Use:

 start_pfn     end_pfn        type  size_pages / size_bytes
       256         288    scavenge          32 / 256.00k
       288        1036        text         748 /   5.84M
      1036        1180        data         144 /   1.12M
      1180        1400         bss         220 /   1.72M
      1400        1594      kdebug         194 /   1.52M
      1594        1600     cfgmgmt           6 /  48.00k
      1600        1601       locks           1 /   8.00k
      1601        1615        pmap          14 / 112.00k
      1615        1811   unixtable         196 /   1.53M
      1811        1814        logs           3 /  24.00k
      1814        2046    vmtables         232 /   1.81M
      2046       16287     managed       14241 / 111.26M
                                    =================================
         Total Physical Memory Use:      16031 / 125.24M

Managed Pages Break Down:

       free pages = 582
     active pages = 1901
   inactive pages = 4839
      wired pages = 3869
        ubc pages = 3082
        ==================
            Total = 14273

WIRED Pages Break Down:

   vm wired pages = 705
  ubc wired pages = 0
  meta data pages = 1467
     malloc pages = 995
     contig pages = 88
    user ptepages = 585
  kernel ptepages = 21
    free ptepages = 8
        ==================
            Total = 3869
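The figures in the vmstat -P output can be cross-checked with a little arithmetic. The sketch below uses only numbers copied from that output. The 8 KB page size is inferred from the first two lines (128.00 M over 16384 pages) rather than stated there, and the variable and function names are hypothetical.

```python
# Inferred page size: 128 MB of physical memory divided into 16384 pages.
PAGE_SIZE = (128 * 1024 * 1024) // 16384   # 8192 bytes

managed_breakdown = {
    "free": 582, "active": 1901, "inactive": 4839, "wired": 3869, "ubc": 3082,
}
wired_breakdown = {
    "vm wired": 705, "ubc wired": 0, "meta data": 1467, "malloc": 995,
    "contig": 88, "user ptepages": 585, "kernel ptepages": 21,
    "free ptepages": 8,
}


def pages_to_mb(pages):
    """Convert a page count to megabytes using the inferred page size."""
    return pages * PAGE_SIZE / (1024 * 1024)


managed_total = sum(managed_breakdown.values())   # matches "Total = 14273"
wired_total = sum(wired_breakdown.values())       # matches "wired pages = 3869"
os_cluster_mb = round(pages_to_mb(16031), 2)      # matches "125.24M"
managed_mb = round(pages_to_mb(14241), 2)         # matches "111.26M"
```

The wired breakdown sums exactly to the wired pages line. The managed breakdown total (14273) is 32 pages larger than the managed region (14241); that difference happens to equal the size of the scavenge region, although treating that as the explanation is an assumption.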
Another example of the rapidly changing nature of Tru64 UNIX is the inclusion of support for Non-Uniform Memory Access (NUMA) systems. Traditional Symmetric Multiprocessing (SMP) systems do not scale well as more processors are added. The Compaq (now HP) GS-series of computers (GS80, GS160, and GS320) can handle 8, 16, and 32 CPUs respectively (more in the future) in a manner yielding excellent scalability. This requires specialized hardware, but the operating system software is, once again, nothing more than good, old Tru64 UNIX. While some folks refer to the GS-series as a "cluster in a box," that is definitely not the intent of these machines (and certainly nothing that we recommend), although the hardware will support it.
The next sequence will take you through the conceptual developments in the world of computers that led to the notion of clusters. Along the way, several of the Tru64 UNIX features will be mentioned.
[1]Berkeley Software Distribution (BSD), developed at the University of California at Berkeley.
[2]UNIX System V, developed by the UNIX System Development Lab at AT&T.