10.3 From Disks to Filesystems

As we've seen, the basic Unix file storage unit is the disk partition. Filesystems are created on disk partitions, and all of the separate filesystems are combined into a single directory tree. The initial parts of this section discuss the process by which a physical disk becomes one or more filesystems on a Unix system, treating the topic at a conceptual level. Later subsections discuss the mechanics of adding a new disk to the various operating systems we are considering.

10.3.1 Defining Disk Partitions

Traditionally, the Unix operating system organizes disks into fixed-size partitions, whose sizes and locations are determined when the disk is first prepared (as we'll see). Unix treats disk partitions as logically independent devices, each of which is accessed as if it were a physically separate disk. For example, one physical disk may be divided into four partitions, each of which holds a separate filesystem. Alternatively, a physical disk may be configured to contain only one partition comprising its entire capacity.

Many Unix implementations allow several physical disks to be combined into a single logical device or partition upon which you can build a filesystem. Systems offering a logical volume manager carry this trend to its logical conclusion, allowing multiple physical disks to be combined into a single logical disk, which can then be divided into logical partitions. AIX uses only an LVM and does not use traditional partitions at all.

Physically, a disk consists of a vertical stack of equally spaced circular platters. Reading and writing is done by a stack of heads that move in and out along the radius as the platters spin around at high speed. The basic idea is not so different from an audio turntable (I hope you've seen one), although both sides of the platters can be accessed at once.^[13]

^[13] Also, the disk tracks are concentric, not continuous, as they are on an LP. If you don't know what an LP is, think of it as a really wide CD (about 12" diameter) with data on both sides.

Partitions consist of subcylinders^[14] of the disk: specific ranges of distance from the spindle (the vertical center of the stack of platters), such as from one inch to two inches, to make up an arbitrary example. Thus, a disk partition uses the same sized and located circular section on all the platters in the disk drive. In this way, disks are divided vertically, through the platters, not horizontally.

^[14] I'm using this term in a descriptive sense only. Technically, a disk cylinder consists of the same set of tracks on all the platters that make up the disk (where a track is the portion of the platter surface that can be accessed from one of the discrete radial positions that the head can take as it moves along the radius).

Partitions can be defined as part of adding a new disk. In some versions of Unix, default disk partitions are defined in advance by the operating system. These default definitions provide some amount of flexibility by defining more than one division scheme for the physical disk.

Figure 10-3 depicts a BSD-style partition scheme. Each drawing corresponds to a different disk layout: one way of dividing up the disk. The various cylinders graphically represent each partition's location on the disk. The solid black area at the center of each disk indicates the part of the disk that cannot be accessed, containing the bad block list and other disk data.

Figure 10-3. Sample disk partitioning scheme

Readers who prefer numeric to graphical representations can consider the numeric partitioning scheme in Table 10-4, which illustrates the same point.

Table 10-4. Sample disk partitioning scheme
Partition	Start	End
a	655360	671739
b	327680	655359
c	0	671739
d	163840	327679
e	0	163839
f	327680	671739
g	0	327679

Seven different partitions are defined for the disk, named by letters from a to g. Three drawings are needed to display all seven partitions because some of them are defined to occupy the same disk locations.

Traditionally, the c partition comprised the entire disk, including the forbidden area; this is why the c partition was never used under standard BSD. However, on most current systems using this sort of naming convention, you can use the c partition to build a filesystem that uses the entire disk. Check the documentation if you're unsure about the conventions on your system.

The other six defined partitions are a, b, and d through g. However, it is not possible to use them all at one time, because some of them include the same physical areas of the disk. Partitions d and e occupy the same space as partition g in the sample layout. Hence, a disk will use either partitions d and e, or partition g, but not both. Similarly, the a and b partitions use the same area of the disk as partition f, and partitions f and g use the same area as partition c.

This disk layout, then, offers three different ways of using the disk, divided into one, two, or four partitions, each of which may hold a filesystem or be used as a swap partition. Some disk partitioning schemes offer even more alternative layouts of the disk. Flexibility is designed in to meet the needs of different systems.

NOTE

This flexibility also has the following consequence: nothing prevents you from using a disk drive inconsistently. For example, nothing prevents you from mounting /dev/disk2d and /dev/disk2g from the same disk. However, this will have catastrophic consequences, because these two partitions overlap. Best practice is to modify partitions in a standard layout that you will not be using so that they have zero length (or delete them).
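
The overlap problem can be checked mechanically. The following Python sketch is only an illustration: it copies the Start and End values from Table 10-4, and the function names are my own inventions rather than part of any Unix utility. It reports which of the partitions you plan to use collide with one another.

```python
# Start and End values copied from Table 10-4 (both ends inclusive).
PARTITIONS = {
    "a": (655360, 671739),
    "b": (327680, 655359),
    "c": (0, 671739),
    "d": (163840, 327679),
    "e": (0, 163839),
    "f": (327680, 671739),
    "g": (0, 327679),
}


def overlaps(p, q):
    """Return True if partitions p and q share any part of the disk."""
    p_start, p_end = PARTITIONS[p]
    q_start, q_end = PARTITIONS[q]
    return p_start <= q_end and q_start <= p_end


def conflicts(in_use):
    """List every overlapping pair among the partitions in use."""
    names = sorted(in_use)
    return [(p, q)
            for i, p in enumerate(names)
            for q in names[i + 1:]
            if overlaps(p, q)]


print(conflicts(["d", "g"]))            # [('d', 'g')]: the unsafe pairing
print(conflicts(["a", "b", "d", "e"]))  # []: the four-partition layout
```

An empty list means the chosen partitions can safely coexist on the disk.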

These days, the following partition naming conventions generally apply:

The partition holding the root (or boot) filesystem is the first one on the disk and is named partition a or slice 0.
The primary swap partition is normally partition b/slice 1.
Partition c and slice 2 refer to the entire disk.

10.3.2 Adding Disks

In this section, we'll begin by examining the general process of adding a disk to a Unix system and then go on to consider the commands and procedures for the various operating systems. The following list outlines the steps needed to make a new disk accessible to users:

The disk must be physically attached to the computer system. Consult the manufacturer's instructions and your own system's hardware documentation for the procedure.
A suitable device driver for the disk's controller must be present in the operating system. If the new disk is being added to an existing controller, or you're also adding a new controller that is among those supported by the operating system, this is not a problem. Otherwise, you'll need to build a new kernel or load the appropriate kernel module (see Chapter 16).
The disk must be low-level formatted.^[15] These days, this is always done by the manufacturer.

^[15] What I'm referring to here is not what is meant when one "formats" a diskette or disk on a PC system. In general, microcomputer operating systems like Windows use the term format differently than Unix does. Formatting a disk on these systems is equivalent to making a filesystem under Unix (and most other operating systems). Unix disk formatting is equivalent to what Windows calls a low-level format. This step is almost never needed in either environment.
One or more partitions must be defined on the disk.
The special files required to access the disk's partitions must exist or be created.
A Unix filesystem must be created on each of the disk partitions to be used for user files.
The new filesystem should be checked with fsck.
The new filesystem should be entered into the filesystem configuration file.
The filesystem can be mounted (perhaps after creating a new directory for its mount point).
Any site-specific activities must be performed (such as configuring backups and installing disk quotas).

The processes used to handle these activities will be discussed in the sections that follow.

As usual, planning should precede implementation. Before performing any of these operations, the system administrator must decide how the disk will be used: which partitions will have filesystems created on them and what files (types of files) will be stored in them. The layout of your filesystems can influence your system's performance significantly. You should therefore take some care in planning the structure of your filesystem.

For best performance, heavily used filesystems should each have their own disk drive, and they should not share a disk with a swap partition. Preferably, heavily used filesystems should be located on drives attached to different controllers. This setup balances the load between disk drives and disk controllers. These issues are discussed in more detail in Section 15.5. Coming up with the optimal layout may require consulting with other people: the database administrator, software developers, and so on.

We now turn to the mechanics of adding a new disk. We'll begin by considering aspects of the process that are common to all systems. The subsequent subsections discuss adding a new SCSI disk to each of the various Unix versions we are considering.

Finding a Hardware/Software Balance

Some system administrators love tinkering with hardware; the most hard-core of them consider reseating the CPU boards as the first response to any system glitch. At the other extreme are system administrators who can program their way out of any emergency but throw up their hands when they have to install a new disk drive.

A good system administrator will be able to hold her own in both the hardware and software arenas. Most of us tend to prefer one to the other, but we can all become proficient in both areas in the long run. The best way to improve your skills in whatever areas you feel least comfortable is to find a safe test system where you can learn, experiment, play, and make mistakes in private and without risk. In time, you may even find that you actually enjoy doing jobs that used to bore, disgust, or intimidate you.

10.3.2.1 Preparing and connecting the disk

There are two main types of disks in wide use today: IDE disks and SCSI disks. IDE^[16] disks are low-cost devices developed for the microcomputer market, and they are generally used on PC-based Unix systems. SCSI disks are generally used on (non-Intel) Unix workstations and servers from the major hardware vendors. IDE disks generally do not perform as well as SCSI disks (claims made by ATA-2 drive vendors notwithstanding).

^[16] IDE expands to Integrated Drive Electronics. These disks are also known as ATA disks (AT Attachment). Current IDE disks are virtually always EIDE: extended IDE, a follow-on to the original standard. SCSI expands to Small Computer System Interface.

IDE disks are easy to attach to the system, and the manufacturer's instructions are generally good. When you add a second disk drive to an IDE controller, you will usually need to perform some minor reconfiguration for both the existing and new disks. One disk must be designated as the master device and the other as the slave device; generally, the existing disk becomes the master and the new disk is the slave.

The master/slave setting for a disk is specified by means of a jumper on the disk drive itself, and it is almost always located on the same face of the disk as the bus and power connector sockets. Consult the documentation for the disk you are using to determine the jumper location and proper setting. Doing so on the new drive is easy because you can do it before you install the disk. Remember to check the existing drive's configuration as well, because single drives are often left unjumpered by the manufacturer. Note that the master/slave setting is not an operational definition; the two disks are treated equally by the operating system.

SCSI disks are in wide use in both PC-based systems and traditional Unix computers. When performance counts, use SCSI disks, because high-end SCSI subsystems are many times faster than the best EIDE-based ones. They are also more expensive.

SCSI disks may be internal or external. These disks are designated by a number ranging from 0 to 6 known as their SCSI ID (the SCSI ID 7 is used by the controller itself). Normal SCSI adapters thus support up to seven devices, each of which must be assigned a unique SCSI ID; wide SCSI controllers support up to 15 devices (ID 7 is still used for the controller). SCSI IDs are generally set via jumpers on internal devices and via a thumbwheel or push button counter on external devices. Keep in mind that when you change the ID setting of a SCSI disk, the device must generally be power-cycled before the change will take effect.
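
When you are planning ID assignments for a chain, it can help to list the IDs that are still free. Here is a minimal Python sketch under the rules just described: ID 7 belongs to the controller, narrow buses offer IDs 0 through 6, and I am assuming that a wide bus uses IDs 0 through 15. The function name is hypothetical.

```python
CONTROLLER_ID = 7  # reserved for the controller on both narrow and wide buses


def free_scsi_ids(used, wide=False):
    """Return the SCSI IDs still available for new devices on one bus."""
    used = list(used)
    if len(set(used)) != len(used):
        raise ValueError("two devices share a SCSI ID")
    if CONTROLLER_ID in used:
        raise ValueError("ID 7 belongs to the controller")
    highest = 15 if wide else 7  # assumption: wide buses use IDs 0-15
    return [i for i in range(highest + 1)
            if i != CONTROLLER_ID and i not in used]


print(free_scsi_ids([0, 2]))                # [1, 3, 4, 5, 6]
print(len(free_scsi_ids([], wide=True)))    # 15
```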

On rare occasions, the ID display setting on an external SCSI disk will not match what is actually being set. When this happens, the counter is either attached incorrectly (backwards) or faulty (the SCSI ID does not change even though the counter does). When you are initially configuring a device, check the controller's power-on message to determine whether all devices are being recognized and to determine the actual SCSI ID assignments being used. Once again, these problems are rare, but I have seen two examples of the former and one example of the latter in my career.

SCSI disks come in many varieties; the current offerings are summarized in Table 10-5. You should be aware of the distinction between normal and differential SCSI devices. In the latter type, there are two physical wires for each signal within the bus, and such devices use the voltage difference between the two wires as the signal value. This design reduces noise on the bus and allows for longer total cable lengths. Special cables and terminators are needed for such SCSI devices (as well as adapter support), and you cannot mix differential and normal devices. Differential signaling has used two forms over the years, high voltage differential (HVD) and low voltage differential (LVD); the two forms cannot be mixed. The most recent standards employ the latter exclusively.

Table 10-5. SCSI versions
			Maximum total cable length
Version name	Maximum speed	Bus width	Single-ended	Differential
SCSI-1, SCSI-2	5 MB/s	8 bits	6 m	25 m (HVD)
Fast SCSI	10 MB/s	8 bits	3 m	25 m (HVD)
Fast Wide SCSI	20 MB/s	16 bits	3 m	25 m (HVD)
Ultra SCSI	20 MB/s	8 bits	1.5 m	25 m (HVD)
Wide Ultra SCSI	40 MB/s	16 bits	1.5 m	25 m (HVD)
Ultra2 SCSI	40 MB/s	8 bits	n/a	25 m (HVD), 12 m (LVD)
Wide Ultra2 SCSI	80 MB/s	16 bits	n/a	25 m (HVD), 12 m (LVD)
Ultra3 SCSI (a.k.a. Ultra160 SCSI)	160 MB/s	16 bits	n/a	12 m (LVD)
Ultra320 SCSI	320 MB/s	16 bits	n/a	12 m (LVD)

Table 10-5 can also serve as a simple history of SCSI. It shows the progressively faster speeds these devices have been able to obtain. Speed-ups come from a combination of a faster bus speed and using more bits for the bus (the "wide" devices). The most recent SCSI standards are all 16 bits, and the term "wide" has been dropped from the name because there are no "narrow" devices from which they need to be distinguished.

The maximum total cable length in the table refers to a chain consisting entirely of devices of that type. If you are using different (compatible) device types in the same chain, the maximum length is the minimum allowed for the various device types. Lowest common denominator wins in this case.
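
The lowest-common-denominator rule is easy to express in code. This small Python sketch copies the single-ended cable limits from Table 10-5 and takes the minimum over the device types in a chain; the dictionary and function names are mine, for illustration only.

```python
# Single-ended cable limits in meters, copied from Table 10-5.
SINGLE_ENDED_MAX_M = {
    "SCSI-1, SCSI-2": 6.0,
    "Fast SCSI": 3.0,
    "Fast Wide SCSI": 3.0,
    "Ultra SCSI": 1.5,
    "Wide Ultra SCSI": 1.5,
}


def max_chain_length(device_types):
    """Longest total cable allowed for a chain mixing these device types."""
    return min(SINGLE_ENDED_MAX_M[t] for t in device_types)


print(max_chain_length(["Fast SCSI", "Ultra SCSI"]))  # 1.5
```

The sketch does not check whether the device types may legitimately share a chain; it only applies the length rule.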

There are a variety of connectors that you will encounter on SCSI devices. These are the most common:

DB-25 connectors are 25-pin connectors that resemble those on serial cables. They have 25 rounded pins positioned in two rows about 1/8" apart. For example, these connectors are used on external SCSI Zip drives.
50-pin Centronics connectors were once the most common sort of SCSI connector. The pins on the connector are attached to the top and bottom of a narrow flat plastic bar about 2" long, and the connector is secured to the device by wire clips on each end.
50-pin micro connectors (also known as mini-micro connectors or SCSI II connectors) are distinguished by their flat, very closely spaced pins, also placed in two rows. This connector is much narrower than the others at about 1.5" in width.
68-pin connectors (also known as SCSI III connectors) are a 68-pin version of micro connectors designed for wide SCSI devices.

Figure 10-4 illustrates these connector types (shown in the external versions).

Figure 10-4. SCSI connectors

From left to right, Figure 10-4 shows a Centronics connector, two versions of the 50-pin mini-micro connector, and a DB-25 connector. 68-pin connectors look very similar to these 50-pin mini-micro connectors; they are simply wider. Figure 10-5 depicts the pin numbering schemes for these connectors.

Figure 10-5. SCSI connector pinouts

You can purchase cables that use any combination of these connectors and adapters to convert between them.

The various SCSI devices on a system are connected in a daisy chain (i.e., serially, in a single line). The first and last devices in the SCSI chain must be terminated for proper operation. For example, when the SCSI chain is entirely external, the final device will have a terminator attached and the SCSI adapter itself will usually provide termination for the beginning of the chain (check its documentation to determine whether this feature must be enabled or not). Similarly, when the chain is composed of both internal and external devices, the first device on the internal portion of the SCSI bus will have termination enabled (for example, via a jumper on an internal disk), and the final external device will again have a terminator attached.

Termination consists of regulating the voltages across the various lines comprising the SCSI bus. Terminators prevent the signal reflection that would occur on an open end. There are several different types of SCSI terminators:

Passive terminators are constructed from resistors. They attempt to ensure that the line voltages in the SCSI chain remain within their proper operating ranges. This type of termination is the least expensive, but it tends to work well only when there are just one or two devices in the SCSI chain and activity on the bus is minimal.
Active terminators use voltage regulators and resistors to force the line voltages to their proper ranges. While passive terminators simply reduce the incoming signal to the proper level (thus remaining susceptible to all power fluctuations within it), active terminators use a voltage regulator to ensure a steady standard for use in producing the target voltages. Active terminators are only slightly more expensive than passive terminators, and they are always more reliable. In fact, the SCSI II standard calls for active termination for all SCSI chains.
Forced perfect termination (FPT) uses a more complex and accurate voltage regulation scheme to force line voltages to their correct values. In this scheme, the voltage standard is taken from the output of two regulated voltages, and diodes are used to eliminate fluctuations within it. This results in increased stability over active termination. FPT will generally eliminate any flakiness in a SCSI chain, and you should consider it any time your chain consists of more than three devices (despite the fact that it is 2-3 times more expensive than active termination).
Some hybrid terminators are also available. In such devices, key lines are controlled via forced perfect termination, and the remaining lines are regulated with active termination. Such devices tend to be almost as expensive as FPT terminators and so are seldom preferable to them.

A few SCSI devices have built-in terminators that you select or deselect via a switch. External boxes containing multiple SCSI disks also often include termination. Check the device characteristics for your devices to determine if such features are present.

NOTE

Be aware that filesystems on SCSI disks are not guaranteed to survive a change of controller model (although they usually will); the standard does not specify that they must be interoperable. Thus, if you move a SCSI disk containing data from one system to another system with a different kind of SCSI controller, there's a chance you will not be able to access the existing data on the disk and will have to reformat it. Similarly, if you need to change the SCSI adapter in a computer, it is safest to replace it with another of the same model.

Having said this, I will note that I do move SCSI disks around fairly often, and I've only seen one failure of this kind. It's rare, but it does happen.

Once the disk is attached to the system, you are ready to configure it. The discussion that follows assumes that the new disk to be added is connected to the computer and is ready to accept partitions. These days, disks seldom if ever require low-level formatting, so we won't pay much attention to this process.

Before turning to the specific procedures for various operating systems, we'll look at the general issue of creating special files.

10.3.2.2 Making special files

Before filesystems can be created on a disk, the special files for the desired disk partitions must exist. Sometimes, they are already on the system when you go to look for them. On many systems, the boot process automatically creates the appropriate special files when it detects new hardware.

Otherwise, you'll have to create them yourself. Special files are created with the mknod command. mknod has the following syntax:

# mknod name c|b major minor

The first argument is the filename, and the second argument is the letter c or b, depending on whether you're making the character or block special file. The other two arguments are the major and minor device numbers for the device. These numbers serve to identify the proper device driver to the kernel. The major device number indicates the general device type (disk, serial line, etc.), and the minor device number indicates the specific member within that class.

These numbers are highly implementation-specific. To determine the numbers you need, use the ls -l command on some existing special files for disk partitions; the major and minor device numbers will appear in the size field. For example:

$ cd /dev/dsk; ls -l c1d*                  Major, minor device numbers.
brw-------   1 root root  0,144 Mar 13 19:14 c1d1s0
brw-------   1 root root  0,145 Mar 13 19:14 c1d1s1
brw-------   1 root root  0,146 Mar 13 19:14 c1d1s2
...
brw-------   1 root root  0,150 Mar 13 19:14 c1d1s6
brw-------   1 root root  0,151 Mar 13 19:14 c1d1s7
brw-------   1 root root  0,160 Mar 13 19:14 c1d2s0
brw-------   1 root root  0,161 Mar 13 19:14 c1d2s1
...
$ cd /dev/rdsk; ls -l c1d1*
crw-------   1 root root  3,144 Mar 13 19:14 c1d1s0
crw-------   1 root root  3,145 Mar 13 19:14 c1d1s1
...

In this example, the numbering pattern is pretty clear: block special files for disks on controller 1 have major device number 0; the corresponding character special files have major device number 3. The minor device number of the same partition of successive disks differs by 16. So if you want to make the special files for partition 2 on disk 3, its minor device number would be 162+16 = 178, and you'd use the following mknod commands:

# mknod /dev/dsk/c1d3s2  b 0 178
# mknod /dev/rdsk/c1d3s2 c 3 178
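
The arithmetic behind these numbers can be sketched in a few lines of Python. Keep in mind that the pattern (minor number 144 for slice 0 of disk 1, and a step of 16 per disk) is inferred solely from the sample listing above; real systems differ, and the function names here are hypothetical. The script only builds the command strings; it does not run mknod.

```python
BASE_MINOR = 144       # minor number of c1d1s0 in the sample listing
MINORS_PER_DISK = 16   # step between the same slice on successive disks


def minor_number(disk, slice_no):
    """Minor device number for a slice, following the sample pattern."""
    if disk < 1 or not 0 <= slice_no < MINORS_PER_DISK:
        raise ValueError("disk or slice number out of range")
    return BASE_MINOR + MINORS_PER_DISK * (disk - 1) + slice_no


def mknod_commands(disk, slice_no, block_major=0, char_major=3):
    """Build (but do not run) the mknod commands for controller 1."""
    name = f"c1d{disk}s{slice_no}"
    minor = minor_number(disk, slice_no)
    return [
        f"mknod /dev/dsk/{name} b {block_major} {minor}",
        f"mknod /dev/rdsk/{name} c {char_major} {minor}",
    ]


for command in mknod_commands(3, 2):
    print(command)
```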

Except on Linux and FreeBSD systems, be sure to make both the block and character special files.

On many systems, the /dev directory includes a shell script named MAKEDEV which automates running mknod. It takes the base name of the new device as an argument and creates the character and block special files defined for it. For example, the following command creates the special files for a SCSI disk under Linux:

# cd /dev
# ./MAKEDEV sdb

The command creates the special files /dev/sdb0 through /dev/sdb16.

10.3.2.3 FreeBSD

The first step is to attach the disk to the system and then reboot.^[17] FreeBSD should detect the new disk. You can check the boot messages or the output of the dmesg command to ensure that it has:

^[17] If the system has hot-swappable SCSI disks, you can use the camcontrol rescan bus command to detect them without rebooting.

da1 at adv0 bus 0 target 2 lun 0
da1: <SEAGATE ST15150N 0017> Fixed Direct Access SCSI-2 device
da1: 10.000MB/s transfers (10.000MHz, offset 15), Tagged Queueing Enabled
da1: 4095MB (8388315 512 byte sectors: 255H 63S/T 522C)

On Intel-based systems, disk ordering happens at boot time, so adding a new SCSI disk with a lower SCSI ID than an existing disk will cause special files to be reassigned^[18] and probably break your /etc/fstab setup. Try to assign SCSI IDs in order if you anticipate adding additional devices later.

^[18] This can happen at other times as well. For example, changes to fiber channel configurations such as switch reconfigurations might lead to unexpected device reassignments, because the operating system gets information on hardware addressing from the programmable switch.

FreeBSD disk partitioning is a bit more complex than for the other operating systems we are considering. It is a two-part process. First, the disk is divided into physical partitions, which BSD calls slices. One or more of these is assigned to FreeBSD. The FreeBSD slice is then itself subdivided into partitions. The latter are where filesystems actually get built.

The fdisk utility is used to divide a disk into slices. Here we create a single slice comprising the entire disk:

# fdisk -i /dev/da1
******* Working on device /dev/da1 *******
...
Information from DOS bootblock is:
The data for partition 1 is:
<UNUSED>
Do you want to change it? [n] y
Supply a decimal value for "sysid (165=FreeBSD)" [0] 165
Supply a decimal value for "start" [0]
Supply a decimal value for "size" [0] 19152
Explicitly specify beg/end address ? [n] n
sysid 165,(FreeBSD/NetBSD/386BSD)
    start 0, size 19152 (9 Meg), flag 0
        beg: cyl 0/ head 0/ sector 1;
        end: cyl 18/ head 15/ sector 63
Are we happy with this entry? [n] y
The data for partition 2 is:
<UNUSED>
Do you want to change it? [n] n
...
Do you want to change the active partition? [n] n
Should we write new partition table? [n] y
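
The size and the end address that fdisk reports in this session describe the same region, which a little arithmetic confirms. The Python sketch below assumes that the slice starts at the first sector of the disk and ends on the last sector of a cylinder, so that the end address reveals the geometry fdisk is using (heads are numbered from 0 and sectors from 1). The 512-byte sector size comes from the boot messages shown earlier; the function name is my own.

```python
SECTOR_BYTES = 512  # matches the "512 byte sectors" in the boot messages


def sectors_through(end_cyl, end_head, end_sector):
    """Sectors from cyl 0/head 0/sector 1 through the given end address.

    Assumes the end address is the last sector of its cylinder.
    """
    heads = end_head + 1             # heads are numbered from 0
    sectors_per_track = end_sector   # sectors are numbered from 1
    return (end_cyl + 1) * heads * sectors_per_track


size = sectors_through(18, 15, 63)
print(size)                              # 19152
print(size * SECTOR_BYTES / 2**20)       # size in megabytes
```

The result is a little under 9.4 MB, which fdisk truncates to "9 Meg".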

Unless you want to create multiple slices, this step is required only on the boot disk on an Intel-based system. However, if you're using a slice other than the first one, you'll need to create the special files to access it:

# cd /dev; ./MAKEDEV /dev/da1s2a
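
A name like da1s2a packs the disk, slice, and partition into one string. The following Python sketch takes such names apart. The pattern is inferred from the names used in this section (da1, da1a, da1s2a), and limiting partition letters to a through h is my assumption; treat it as an illustration rather than a complete description of FreeBSD device naming.

```python
import re

# driver name, unit number, optional "s<slice>", optional partition letter
_NAME = re.compile(r"^(?:/dev/)?([a-z]+)(\d+)(?:s(\d+))?([a-h])?$")


def parse_device(name):
    """Split a name such as da1s2a into (driver, unit, slice, partition)."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized device name: {name}")
    driver, unit, slice_no, partition = match.groups()
    return (driver, int(unit),
            int(slice_no) if slice_no else None,
            partition)


print(parse_device("/dev/da1s2a"))  # ('da', 1, 2, 'a')
print(parse_device("da1a"))         # ('da', 1, None, 'a')
```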

The disklabel command creates FreeBSD partitions within the FreeBSD slice:

# disklabel -r -w da1 auto

The auto parameter says to create a default layout for the slice. You can preview what disklabel will do by adding the -n option.

Once you have created a default label (division), you can edit it by running disklabel -e. This command starts an editor session from which you can modify the partitioning (using the editor specified in the EDITOR environment variable).

NOTE

disklabel is a very cranky utility, and often fails with the message:

disklabel: No space left on device

The message is completely spurious. This happens more often with larger disks than with smaller ones. If you encounter this problem, try running sysinstall, and select the Configure→Label menu path. This form of the utility can usually be coaxed to work, but even it will not accept all valid partition sizes. Caveat emptor.

Once you have made partitions, you create filesystems using the newfs command, as in this example:

# newfs /dev/da1a
/dev/da1a: 19152 sectors in 5 cylinders of 1 tracks, 4096 sectors
        9.4MB in 1 cyl groups (106 c/g, 212.00MB/g, 1280 i/g)
super-block backups (for fsck -b #) at:
 32

The following options can be used to customize the newfs operation:

-U: Enable soft updates (recommended).
-b size: Filesystem block size in bytes (the default is 16384; value must be a power of 2).
-f size: Filesystem fragment size: the smallest allocatable unit of disk space. The default is 2048 bytes. This parameter determines the minimum file size, among other things. It must be a power of 2 less than or equal to the filesystem block size and no smaller than one eighth of the filesystem block size. Experts recommend always making this value one eighth of the filesystem block size.
-i bytes: Number of bytes per inode (the default is 4 times the fragment size: 8192 with the default fragment size). This setting controls how many inodes are created for the new filesystem (the number of inodes equals the filesystem size divided by the bytes per inode). The default value generally works well.
-m free: Percentage of free space reserved. The default is 8%; you can usually safely decrease it to about 5% or even less for a very large disk.
-o speed | space: Set the optimization preference. speed means that the filesystem will attempt to minimize the time spent allocating disk blocks, while space means that it will try to minimize disk fragmentation. The default is space if the minimum free space percentage is less than 8%, and speed otherwise. Hence, speed is the default with the default free space percentage.
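
The rules governing these options can be checked before you run newfs. The Python sketch below encodes only what the option descriptions above state; the function names are hypothetical, and the inode figure is an estimate, because newfs rounds its allocations to fit cylinder groups.

```python
def is_power_of_two(n):
    return n > 0 and n & (n - 1) == 0


def valid_fragment_size(block_size, frag_size):
    """Check a -f value against a -b value using the rules above."""
    return (is_power_of_two(block_size)
            and is_power_of_two(frag_size)
            and block_size // 8 <= frag_size <= block_size)


def default_bytes_per_inode(frag_size=2048):
    """Default -i value: four times the fragment size."""
    return 4 * frag_size


def estimated_inodes(fs_bytes, bytes_per_inode):
    """Rough inode count: filesystem size divided by bytes per inode."""
    return fs_bytes // bytes_per_inode


def default_optimization(min_free_percent=8):
    """Default -o value for a given -m percentage."""
    return "space" if min_free_percent < 8 else "speed"


print(valid_fragment_size(16384, 2048))   # True: the defaults
print(valid_fragment_size(16384, 1024))   # False: below one eighth
print(default_optimization(5))            # space
```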

The tunefs command can be used to modify the values of -m and -o for an existing filesystem (using the same option letters). Similarly, -n can be used to enable/disable soft updates for an existing filesystem (it takes enable or disable as its argument).

Finally, we run fsck on the new filesystem:

# fsck /dev/da1a
** /dev/da1a
** Last Mounted on
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
1 files, 1 used, 4682 free (18 frags, 583 blocks, 0.4% fragmentation)
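
The figures in the final summary line fit together if the free count is expressed in fragments. This Python sketch assumes eight fragments per block, which is what the default newfs block and fragment sizes (16384 and 2048 bytes) would give; the function name is my own.

```python
FRAGS_PER_BLOCK = 8  # assumption: 16384-byte blocks, 2048-byte fragments


def free_fragments(loose_frags, whole_blocks):
    """Total free space in fragments: loose fragments plus whole blocks."""
    return loose_frags + whole_blocks * FRAGS_PER_BLOCK


print(free_fragments(18, 583))  # 4682, matching the "free" figure above
```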

In this instance, fsck finishes very quickly.

If you use the menu-driven version of disklabel in the sysinstall utility, the newfs and mount commands can be run for you automatically (and the utility does so by default).

The growfs command can be used to increase the size of an existing filesystem, as in this example:

# growfs /dev/da1a

By default, the filesystem is increased to the size of the underlying partition. You can specify a particular new size with the -s option if you want to.

10.3.2.4 Linux

After you attach the disk to the system, it should be detected when the system is booted. You can use the dmesg command to display boot messages. Here are some sample messages from a very old, but still working, Intel-based Linux system:

scsi0 : at 0x0388 irq 10 options CAN_QUEUE=32 CMD_PER_LUN=2 ...
scsi0 : Pro Audio Spectrum-16 SCSI
scsi : 1 host.
Detected scsi disk sda at scsi0, id 2, lun 0
scsi : detected 1 SCSI disk total.

The messages indicate that this disk is designated as sda.

On Intel-based systems, disk ordering happens at boot time, so adding a new SCSI disk with a lower SCSI ID than an existing disk will cause special files to be reassigned^[19] and probably break your /etc/fstab setup. Try to assign SCSI IDs in order if you anticipate adding additional devices later.

^[19] This can happen at other times as well. For example, changes to fiber channel configurations such as switch reconfigurations might lead to unexpected device reassignments because the operating system gets information on hardware addressing from the programmable switch.

If necessary, create the device special files for the disk (needed only when you have many, many disks). For example, these commands create the special files used to access the sixteenth SCSI disk:

# cd /dev; ./MAKEDEV sdp
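
The letter in the device name simply counts disks, so sdp is the sixteenth. A short Python sketch of the mapping follows; it covers only the first 26 disks (sda through sdz), and the function name is hypothetical.

```python
import string


def sd_name(n):
    """Name of the nth SCSI disk (counting from 1), up to 26 disks."""
    if not 1 <= n <= 26:
        raise ValueError("this sketch handles only sda through sdz")
    return "sd" + string.ascii_lowercase[n - 1]


print(sd_name(1))    # sda
print(sd_name(16))   # sdp
```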

Assuming we have our special files all in order, we will use fdisk or cfdisk (a screen-oriented version) to divide the disk into partitions (we'll be creating two partitions). The following commands will start these utilities:

# fdisk /dev/sda
# cfdisk /dev/sda

The available subcommands for these utilities are listed in Table 10-6.

Table 10-6. Linux partitioning utility subcommands
Action	fdisk	cfdisk
Create new partition.	n	n
Change partition type.	t	t
Make partition active (bootable).	a	b
Write partition table to disk.	w	W
Change display/entry size units.	u	u
Display partition table.	p	Always visible
List available subcommands.	m	At dialog bottom

cfdisk is often more convenient to use because the partition table is displayed continuously, and we'll use it here. cfdisk subcommands always operate on the current (highlighted) partition. Thus, in order to create a new partition, move the highlight to the line corresponding to Free Space and press n.

You first need to select either a primary or a logical (extended) partition. PC disk partitions are of two types: primary and extended. A disk may contain up to four partitions. Both partition types are a physical subset of the total disk. Extended partitions may be further subdivided into units known as logical partitions (or drives) and thereby provide a means for dividing a physical disk into more than four pieces.

Next, cfdisk prompts for the partition information:

Primary or logical [pl]: p
Size (in MB): 110

If you'd rather enter the size in a different set of units, use the u subcommand (units cycle among MB, sectors, and cylinders). Once these prompts are answered, you will be asked if you want the partition placed at the beginning or the end of the free space (if there is a choice).

Use the same procedure to create a second partition, and then activate the first partition with the b subcommand. Then, use the t subcommand to change the partition types of the two partitions. The most commonly needed type codes are 6 for Windows FAT16, 82 for a Linux swap partition, and 83 for a regular Linux partition.

Here is the final partition table (output has been simplified):

                         cfdisk 2.11i

                      Disk Drive: /dev/sda
                     Size: 3228696576 bytes
     Heads: 128   Sectors per Track: 63   Cylinders: 782

Name         Flags        Part Type     FS Type      Size (MB)
--------------------------------------------------------------
/dev/sda1    Boot         Primary       Linux            110.0
/dev/sda2                 Primary       Linux             52.5
                          Pri/Log       Free Space         0.5

(Yes, those sizes are small; I told you it was an old system.)
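The geometry figures in the cfdisk header determine the byte count it reports. As a rough check, and assuming 512-byte sectors (the function name geometry_bytes is invented for this illustration, and the shell is assumed to do 64-bit arithmetic):

```shell
# Capacity in bytes implied by a cylinder/head/sector geometry,
# assuming 512-byte sectors.
geometry_bytes() {
    heads=$1 sectors=$2 cylinders=$3
    echo $(( heads * sectors * cylinders * 512 ))
}

geometry_bytes 128 63 782    # prints 3228696576
```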

At this point, I reboot the system. In general, when I've changed the partition layout of the disk (in other words, done anything other than change the types assigned to the various partitions), I always reboot PC-based systems. Friends and colleagues accuse me of being mired in an obsolete Windows superstition by doing so and argue that this is not really necessary. However, many Linux utility writers (see fdisk) and filesystem designers (see mkreiserfs) agree with me.

Next, use the mkfs command to create a filesystem on the Linux partition. mkfs has been streamlined in the Linux version and requires little input:

# mkfs -t ext3 -j /dev/sda1

This command^[20] creates a journaled ext3 filesystem, the current default filesystem type for many Linux distributions. The ext3 filesystem is a journaled version of the ext2 filesystem, which was used on Linux systems for several years and is still in wide use. In fact, ext3 filesystems are backward-compatible and can be mounted in ext2 mode.

^[20] Actually, the fsck, mkfs, mount, and other commands are front ends to filesystem-specific versions. In this case, mkfs runs mke2fs.

If you want to customize mkfs's operation, the following options can be used:

-b bytes: Set filesystem block size in bytes (the default is 1024).
-c: Check the disk partition for bad blocks before making the filesystem.
-i n: Specify bytes/inode value: create one inode for each n bytes. The default value of 4096 usually creates more than you'll ever need, but probably isn't worth changing.
-m percent: Specify the percentage of filesystem space to reserve (accessible only by root and group 0). The default is 5% (half of what is typical on other Unix systems). In these days of multigigabyte disks, even this percentage may be worth rethinking.
-J device: Specify a separate device for the filesystem log.
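To get a feel for what the -i and -m values mean in practice, the arithmetic can be sketched in the shell. These are estimates only: the function names are invented for this example, and the real mkfs rounds its figures, so actual counts may differ somewhat.

```shell
# Approximate inode count: one inode per bytes_per_inode bytes (-i).
inode_estimate() {
    fs_bytes=$1 bytes_per_inode=$2
    echo $(( fs_bytes / bytes_per_inode ))
}

# Blocks held back at a given reserve percentage (-m).
reserved_blocks() {
    total_blocks=$1 percent=$2
    echo $(( total_blocks * percent / 100 ))
}

inode_estimate $(( 110 * 1024 * 1024 )) 4096    # prints 28160
reserved_blocks 507016 5                         # prints 25350
```

The first call uses the 110 MB partition created earlier; the second uses a block count of 507016 purely as a sample figure.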

Once the filesystem is built, run fsck:

# fsck -f -y /dev/sda1

The -f option is necessary to force fsck to run even though the filesystem is clean. The new filesystem may now be mounted and entered into /etc/fstab.

The tune2fs command may be used to list and alter fields within the superblock of an ext2 filesystem. Here is an example of its display output (shortened):

# tune2fs -l /dev/sdb1
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      filetype sparse_super
Filesystem state:         not clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              253952
Block count:              507016
Reserved block count:     25350
Free blocks:              30043
Free inodes:              89915
First block:              0
Block size:               4096
Last mount time:          Thu Apr  4 11:28:19 2002
Last write time:          Wed May 22 10:00:36 2002
Mount count:              1
Maximum mount count:      20
Last checked:             Thu Apr  4 11:28:01 2002
Check interval:           15552000 (6 months)
Next check after:         Tue Oct  1 12:28:01 2002
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)

The check-related items in the list (Mount count through Next check after) indicate when fsck will check the filesystem even if it is clean. The Linux version of fsck for ext3 filesystems checks the filesystem if either the maximum number of mounts without a check has been exceeded or the maximum time interval between checks has expired (20 times and 6 months in the preceding output; the check interval is given in seconds).

tune2fs's -i option may be used to specify the maximum time interval between checks in days, and the -c option may be used to specify the maximum number of mounts between checks. For example, the following command disables the time-between-checks function and sets the maximum number of mounts to 25:

# tune2fs -i 0 -c 25 /dev/sdb1
Setting maximal mount count to 25
Setting interval between check 0 seconds
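The two tests that trigger a check of a clean filesystem can be sketched as follows. This is not fsck's actual code: the function name and argument order are invented, times are taken as seconds since the epoch, and treating a mount count that has merely reached the maximum as due is an assumption.

```shell
# Sketch of the decision described above. All times are in seconds
# since the epoch; an interval of 0 means the time test is disabled.
# Treating "reached the maximum" as due is an assumption.
needs_check() {
    mount_count=$1 max_mounts=$2 last_checked=$3 interval=$4 now=$5
    if [ "$mount_count" -ge "$max_mounts" ]; then
        return 0
    fi
    if [ "$interval" -gt 0 ] && [ $(( now - last_checked )) -ge "$interval" ]; then
        return 0
    fi
    return 1
}

echo $(( 15552000 / 86400 ))    # prints 180 (days in the check interval)
```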

Another useful option to tune2fs is -m, which allows you to change the percentage of filesystem space held in reserve. The -u and -g options allow you to specify the user and group ID (respectively) allowed to access the reserved space.

You can convert an ext2 filesystem to ext3 with a command like this one:

# tune2fs -j /dev/sdb2

Existing ext2 and ext3 filesystems can be resized using the resize2fs command, which takes the filesystem and new size (in filesystem blocks, by default) as parameters. For example, the following commands will change the size of the specified filesystem to 200,000 blocks:

# umount /dev/sdc1
# e2fsck -f /dev/sdc1
e2fsck 1.23, 15-Aug-2001 for EXT2 FS 0.5b, 95/08/09
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/1: 11/247296 files (0.0% non-contiguous), 15979/493998 blocks
# resize2fs -p /dev/sdc1 200000
resize2fs 1.23 (15-Aug-2001)
Begin pass 1 (max = 1)
Extending the inode table     XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Begin pass 3 (max = 10)
Scanning inode table          XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
The filesystem on /dev/sdc1 is now 200000 blocks long.

The -p option says to display a progress bar as the operation runs. Naturally, when you are enlarging a filesystem, the underlying disk partition or logical volume (discussed later in this chapter) must be increased in size beforehand.

Increasing the size of a filesystem is always safe. If you want the new size to be the same as the size of the underlying disk partition (as is virtually always the case), you can omit the size parameter from the resize2fs command. To decrease the size of a filesystem, perform the resize2fs operation first, and then use fdisk or cfdisk to decrease the size of the underlying partition. Note that data loss is always possible, even likely, when decreasing the size of a filesystem, because no effort is made to migrate data within the filesystem prior to shortening it.
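The ordering of the steps differs for growing and shrinking, and it is easy to get backward. The following sketch only prints the sequence and runs nothing; resize_plan is a made-up helper, not a standard utility.

```shell
# Print (do not run) the steps for resizing an ext2/ext3 filesystem.
# direction is "grow" or "shrink"; the size is optional when growing.
resize_plan() {
    direction=$1 device=$2 size=${3:-}
    case $direction in
        grow|shrink) ;;
        *) echo "resize_plan: direction must be grow or shrink" >&2
           return 1 ;;
    esac
    if [ "$direction" = shrink ] && [ -z "$size" ]; then
        echo "resize_plan: shrinking needs an explicit size" >&2
        return 1
    fi
    echo "umount $device"
    echo "e2fsck -f $device"
    if [ "$direction" = grow ]; then
        echo "# enlarge the partition or logical volume before this step"
        echo "resize2fs -p $device${size:+ $size}"
    else
        echo "resize2fs -p $device $size"
        echo "# afterwards, reduce the partition with fdisk or cfdisk"
    fi
}

resize_plan shrink /dev/sdc1 200000
```

Printing the plan first gives you a chance to confirm that a verified backup exists before anything destructive happens.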

10.3.2.4.1 The Reiser filesystem

Some Linux distributions also offer the Reiser filesystem, designed by Hans Reiser (see http://www.reiserfs.org).^[21] The commands to create a Reiser filesystem are very similar:

^[21] The name is pronounced like the word riser (as in stairs) and rhymes with sizer and miser.

# mkreiserfs /dev/sdb3
<-------------mkreiserfs, 2001------------->
reiserfsprogs 3.x.0k-pre9
mkreiserfs: Guessing about desired format..
mkreiserfs: Kernel 2.4.10-4GB is running.
13107k will be used
Block 16 (0x2142) contains super block of format 3.5 with standard journal
Block count: 76860
Bitmap number: 3
Blocksize: 4096
Free blocks: 68646
Root block: 8211
Tree height: 2
Hash function used to sort names: "r5"
Objectid map size 2, max 1004
Journal parameters:
        Device [0x0]
        Magic [0x18bbe6ba]
        Size 8193 (including journal header) (first block 18)
        Max transaction length 1024
        Max batch size 900
        Max commit age 30
Space reserved by journal: 0
Correctness checked after mount 1
Fsck field 0x0
ATTENTION: YOU SHOULD REBOOT AFTER FDISK!
        ALL DATA WILL BE LOST ON '/dev/sdb3'!
Continue (y/n):y
Initializing journal - 0%....20%....40%....60%....80%....100%
Syncing..ok
ReiserFS core development sponsored by SuSE Labs (suse.com)
Journaling sponsored by MP3.com.
To learn about the programmers and ReiserFS, please go to
http://namesys.com
Have fun.
# reiserfsck -x /dev/sdb3
<-------------reiserfsck, 2001------------->
reiserfsprogs 3.x.0k-pre9
Will read-only check consistency of the filesystem on /dev/sdb3
        Will fix what can be fixed w/o --rebuild-tree
Will put log info to 'stdout'
Do you want to run this program?[N/Yes] (note need to type Yes):Yes
13107k will be used
###########
reiserfsck --check started at Wed May 22 11:36:07 2002
###########
Replaying journal..
No transactions found
Checking S+tree..ok
Comparing bitmaps..ok
Checking Semantic tree...ok
No corruptions found
There are on the filesystem:
        Leaves 1
        Internal nodes 0
        Directories 1
        Other files 0
        Data block pointers 0 (zero of them 0)
###########
reiserfsck finished at Wed May 22 11:36:19 2002
###########

Reiser filesystems may be resized with the resize_reiserfs -s command. They can also be resized when they are mounted. The latter operation uses a command like the following:

# mount -o remount,resize=200000 /dev/sdc1

This command changes the size of the specified filesystem to 200,000 blocks. Once again, increasing the size of a filesystem is always safe, while decreasing it requires great care to avoid data loss.

10.3.2.5 Solaris

In this section, we add a SCSI disk (SCSI ID 2) to a Solaris system.

After attaching the device, boot the system with boot -r, which tells the operating system to look for new devices and create the associated special files and links into the /devices tree.^[22] The new disk should be detected when the system is booted (output simplified):

^[22] You should verify that these steps are done correctly after the boot. If not, you can create the /devices entries and links in /dev by running the drvconfig and disks commands. Neither requires any arguments.

sd2 at esp0: target 2 lun 0
   corrupt label - wrong magic number
   Vendor 'QUANTUM', product 'CTS160S', 333936 512 byte blocks

The warning message about a corrupt label appears because no valid Sun label (a vendor-specific disk header block that Sun uses) has been written to the disk yet. If you miss the messages during the boot, use the dmesg command.

We now label the disk and then create partitions on it (which Solaris sometimes calls slices). Solaris uses the format utility for these tasks.^[23] Previously, it was often necessary to tell format about the characteristics of your disk. These days, however, the utility knows about most kinds of disks, which makes adding a new disk much simpler.

^[23] Solaris also contains a version of the fdisk utility designed for operating system installations. This is not what you should use to prepare a new disk.

Here is the command used to start format and write a generic label to the disk (if it is unlabeled):

# format /dev/rdsk/c0t2d0s2               Partition 2 = the entire disk.
selecting /dev/rdsk/c0t2d0s2
[disk formatted, no defect list found]
FORMAT MENU:
...Menu is printed here.
format> label                             Write generic disk label.
Ready to label disk, continue? y

Once the disk label is written, we can set up partitions. We'll be dividing this disk into two partitions. We use the partition subcommand to define them:

format> partition
PARTITION MENU:
        0      - change `0' partition
        1      - change `1' partition
        .. .
        7      - change `7' partition
        select - select a predefined table
        modify - modify a predefined partition table
        name   - name the current table
        print  - display the current table
        label  - write partition map and label to the disk
        quit
partition> 0                               Redefine partition 0.
Enter partition id tag[unassigned]: root   Specifies partition use.
Enter partition permission flags[wm]: wm   Read-write, mountable.
Enter new starting cyl[0]:
Enter partition size[0b, 0c, 0e, 0.00mb, 0.00gb]: 5.00gb
.. .
partition> 1
Enter partition id tag[unassigned]:
Enter partition permission flags[wm]: wm
Enter new starting cyl[0]: 10403
Enter partition size[0b, 0c, 0e, 0.00mb, 0.00gb]: 7257c
.. .
partition> print                           Print partition table.
Current partition table (unnamed):
Total disk cylinders available: 17660 + 2 (reserved cylinders)
Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm       0 - 10402        5.00GB    (10403/0/0) 10486224
  1 unassigned    wm   10403 - 17659        3.49GB    (7257/0/0)   7315056
  2 unassigned    wm       0                0         (0/0/0)            0
  .. .
  7 unassigned    wm       0                0         (0/0/0)            0

We define two partitions here, 0 and 1. In the first case, we specify a starting cylinder number of 0 and the partition size in GB. In the second case, we specify a starting cylinder and the length in cylinders. We took a look at the partition table between issuing these two commands to find these numbers.
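The numbers entered for partition 1 follow from simple cylinder arithmetic, which can be sketched in the shell. The figure of 1008 blocks per cylinder is inferred from the partition table above (10486224 blocks in 10403 cylinders); rounding the 5 GB request up to whole cylinders is likewise an inference from that output, and the function names are invented for this example.

```shell
# Assumed geometry for this disk: 1008 512-byte blocks per cylinder,
# inferred from the sample partition table.
BLOCKS_PER_CYL=1008

# Blocks occupied by a partition of the given length in cylinders.
cyl_to_blocks() {
    echo $(( $1 * BLOCKS_PER_CYL ))
}

# Whole cylinders needed to hold a size in GB, rounded up.
gb_to_cyl() {
    blocks=$(( $1 * 1024 * 1024 * 1024 / 512 ))
    echo $(( (blocks + BLOCKS_PER_CYL - 1) / BLOCKS_PER_CYL ))
}

gb_to_cyl 5                  # prints 10403, the first cylinder free for partition 1
echo $(( 17660 - 10403 ))    # prints 7257, the cylinders left for partition 1
cyl_to_blocks 7257           # prints 7315056
```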

The partition ID tag is a label specifying the intended use of the partition. Partition 0 will be used for the root filesystem and is labeled accordingly.

The permission flags are usually either wm (read-write and mountable) or wu (read-write and not mountable). The latter is used for swap partitions.

Once the partitions are defined, we write a label to the disk using the label subcommand:

partition> label
Ready to label disk, continue? y
partition> quit
format> quit

The partition submenu also has a name subcommand, which allows a custom partition table to be named and saved; it can be applied to a new disk with the select subcommand on the same menu.

Now, we create filesystems on the new disk partitions with the newfs command:

# newfs /dev/rdsk/c0t2d0s0
newfs: construct a new file system /dev/rdsk/c0t2d0s0: (y/n)? y
/dev/rdsk/c0t2d0s0: 10486224 sectors in 10403 cylinders of 16 tracks, 63 sectors
        5120.2MB in 119 cyl groups (88 c/g, 43.31MB/g, 5504 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
 32, 88800, 177568, 266336, 355104, 443872, 532640, 621408, 710176, .. .

The prudent course of action is to print out this list and store it somewhere for safekeeping, in case both the primary superblock and the one at address 32 get corrupted.^[24]

^[24] A tip from one of the book's technical reviewers: "If you lose your list of backup superblocks, make a filesystem on a device of the same size and read the locations of the superblocks when you newfs that new partition."
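If you save the newfs output to a file instead of (or in addition to) printing it, the backup addresses can be pulled out later with standard text tools. The following is a minimal sketch that assumes the output layout shown above; backup_superblocks is a hypothetical helper name.

```shell
# Read saved newfs output on stdin and print each backup superblock
# address on its own line. Assumes the list follows a line containing
# "super-block backups", as in the sample output.
backup_superblocks() {
    sed -n '/super-block backups/,$p' | sed '1d' |
        tr -d ' .\t' | tr ',' '\n' | sed '/^$/d'
}

# Example with the first few addresses from the output above.
backup_superblocks <<'EOF'
super-block backups (for fsck -F ufs -o b=#) at:
 32, 88800, 177568, 266336,
EOF
```

Any address from the list can then be supplied to fsck in the form that the newfs message itself suggests (fsck -F ufs -o b=address).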

Finally, we run fsck on the new filesystem:

# fsck -y /dev/rdsk/c0t2d0s0
** /dev/rdsk/c0t2d0s0
** Last Mounted on
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
2 files, 9 used, 5159309 free (13 frags, 644912 blocks,  0.0% fragmentation)

This process is repeated for the other disk partition.

You can customize the parameters for the new filesystem using these options to newfs:

-b size: Filesystem block size in bytes (the default is 8192; value must be a power of 2 from 4096 to 8192).
-f size: Filesystem fragment size: the smallest allocatable unit of disk space. The default is 1024 bytes (must be a power of 2 in the range of 1024 to 8192). This parameter determines the minimum file size, among other things. It must be less than or equal to the filesystem block size and no smaller than one eighth the filesystem block size.
-i bytes: Number of bytes per inode (the default is 2048). This setting controls how many inodes are created for the new filesystem (number of inodes equals filesystem size divided by bytes per inode). The default value of 2048 usually creates more than you'll ever need except for filesystems with many, many tiny files. You can usually increase this to 4096 without risk.
-m free: Percentage of free space reserved. The default is 10%; you can usually safely decrease it to about 5% or even less for a very large disk.
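The constraints on the -b and -f values interact, so it can be worth checking a pair before handing it to newfs. This sketch encodes only the limits listed above; the function names are invented, and newfs itself remains the final authority on what it accepts.

```shell
# Return success if the argument is a power of 2.
is_pow2() {
    [ "$1" -gt 0 ] && [ $(( $1 & ($1 - 1) )) -eq 0 ]
}

# Check a block size (-b) and fragment size (-f) pair against the
# limits listed above.
valid_newfs_sizes() {
    bsize=$1 fsize=$2
    is_pow2 "$bsize" && [ "$bsize" -ge 4096 ] && [ "$bsize" -le 8192 ] &&
    is_pow2 "$fsize" && [ "$fsize" -ge 1024 ] && [ "$fsize" -le 8192 ] &&
    [ "$fsize" -le "$bsize" ] && [ $(( fsize * 8 )) -ge "$bsize" ]
}

valid_newfs_sizes 8192 1024 && echo "defaults are consistent"
```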

The -N option to newfs may be used to have the command display all of the parameters it would pass to mkfs (the utility that does the actual work) without building the filesystem.

Logging is enabled for Solaris UFS filesystems at mount time, via the logging mount option.

10.3.2.6 AIX, HP-UX, and Tru64

These operating systems use a logical volume manager (LVM) by default. Adding disks to these systems is considered during the LVM discussion later in this chapter.

10.3.2.7 Remaking an existing filesystem

Occasionally, it may be necessary to reconfigure a disk. For example, you might want to select another layout, using a different set of partitions. You might want to change the value of a filesystem parameter, such as its block size. Or you might want to add an additional swap partition or get rid of an unneeded one. Sometimes, these operations require that you recreate the existing filesystems.

Recreating a filesystem will destroy all the existing data in the filesystem, so it is essential to perform a full backup first (and to verify that the tapes are readable; see Chapter 11). For example, the following commands may be used to reconfigure a filesystem with a 4K block size under Linux:

# umount /chem                          Dismount filesystem.
# dump 0 /dev/sda1                      Backup.
# restore -t                            Check tape is OK!
# mke2fs -b 4096 -j /dev/sda1           Remake filesystem.
# mount /chem                           Mount new filesystem.
# cd /chem; restore -r                  Restore files.

A very cautious administrator would make two copies of the backup tape.

10.3.3 Logical Volume Managers

This section looks at logical volume managers (LVMs). The LVM is the only disk management facility under AIX, and the corresponding facilities are also used by default under HP-UX and Tru64. Linux and Solaris 9 also offer LVM facilities. As usual, we'll begin this section with a conceptual overview of logical volume managers and then move on to the specifics for the various operating systems.

When dealing with an LVM, you will do well to forget everything you know about disk partitions under Unix. Not only is a completely different vocabulary employed, but some Unix terms like partition also are used with completely different meanings. However, once you get past the initial obstacles, the LVM point of view is very clear and sensible, and it is superior to the standard Unix approach to handling disks. A willing suspension of disbelief will come in very handy at first.

In general, an LVM brings the following benefits:

Filesystems and individual files can be larger than a single physical disk.
Filesystems may be dynamically extended in size without having to be rebuilt.
Software disk mirroring and RAID are often supported (for data protection and continued system availability even in the face of disk failures).
Software disk striping is often provided as part of an LVM for improved I/O performance.

10.3.3.1 Disks, volume groups, and logical volumes

To begin at the beginning, there are disks: real, material, solid objects that hurt your toe if they fall on it. However, such disks must be initialized and made into physical volumes before they may be used by the LVM. When they are made part of a volume group (defined in a moment), these disks are divided into allocable units of space known as physical partitions (AIX) or physical extents (HP-UX and Tru64). The default size for these units is generally 4 MB. Note that these partitions/extents are units of disk storage only; they have nothing to do with traditional Unix disk partitions.

A volume group is a named collection of disks. Volume groups can also include collections of disks accessed as a single hardware unit (e.g., a RAID array). Volume groups allow filesystems to span physical disks (although it is not required that they do so). Paradoxically, the volume group is the LVM equivalent of the Unix physical disk: that entity which can be split into subunits called logical volumes, each of which holds a single filesystem. Unlike Unix disk partitions, volume groups are infinitely flexible in how they may be divided into filesystems.

HP-UX allows volume groups to be subdivided into sets of disks called physical volume groups (PVGs). These groups of disks are accessed through separate controllers and/or buses, and the facility is designed to support high-availability systems by reducing the number of potential single points of hardware failure.

Logical volumes are the entities on which filesystems reside; they may also be used as swap devices, as dump devices, for storing boot programs, and by application programs in raw mode (analogously to a raw-mode disk partition). They consist of some number of fixed physical partitions (disk chunks) generally located arbitrarily within a volume group (although some implementations optionally allow specific physical volumes to be requested when a logical volume is created or extended). Hence, logical volumes may be any size that is a multiple of the physical partition size for their volume group. They may be easily increased in size after creation while the operating system is running. Logical volumes may also be shrunk (although not without consequences to any filesystem they may contain).

Logical volumes are composed of logical partitions (AIX) or logical extents (HP-UX). Many times, physical and logical partitions are identical (or at least map one-to-one). However, logical volumes have the capability of storing redundant copies of all data, if desired; from one to two additional copies of each data block may be stored. When only a single copy of the data is stored, one logical partition corresponds to one physical partition. If two copies are stored, one logical partition corresponds to two physical partitions: one original and one mirror. Similarly, in a doubly mirrored logical volume, each logical partition corresponds to three physical partitions.
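The relationship between logical volume size, logical partitions, and physical partitions comes down to a little arithmetic. The sketch below assumes the usual 4 MB partition size mentioned earlier; the function names are made up for this illustration.

```shell
# Logical partitions needed for a logical volume of size_mb, rounding
# up to a whole number of partitions (pp_mb is the partition size).
logical_partitions() {
    size_mb=$1 pp_mb=$2
    echo $(( (size_mb + pp_mb - 1) / pp_mb ))
}

# Physical partitions consumed: one per logical partition per copy
# (copies = 1 for no mirroring, 2 or 3 for mirrored volumes).
physical_partitions() {
    size_mb=$1 pp_mb=$2 copies=$3
    echo $(( $(logical_partitions "$size_mb" "$pp_mb") * copies ))
}

logical_partitions 10 4       # prints 3 (so 12 MB is actually allocated)
physical_partitions 10 4 3    # prints 9
```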

The main LVM data storage entities are illustrated in Figure 10-6 (representing an AIX system). The figure shows how three physical disks are combined into a single volume group (named chemvg). The separate disks composing it are suggested via shading.

Figure 10-6. Logical volume managers illustrated

Three user logical volumes are then defined from chemvg.^[25] Two of them, chome and cdata, store a single copy of their data using physical partitions from three separate disks. cdata is a striped logical volume, writing data to all three disks in parallel. It uses identically sized sections from each physical disk. chome illustrates the way that a filesystem can be spread across multiple physical disks, even noncontiguously in the case of hdisk3.

^[25] In addition to the logging logical volume required by AIX for the jfs journaled filesystem type.

The other logical volume, qsar, is a mirrored logical volume. It contains an equal number of physical partitions from all three disks; it stores three copies of its data (each on a separate disk), and one physical partition per disk is used for each of its logical partitions.

Once a logical volume exists, you can build a filesystem on it and mount it normally. At any point in its lifetime, a filesystem's size may be increased as long as there are free physical partitions within its volume group. There need not initially be any free logical partitions within its logical volume. Generally, both the logical volume and filesystem are resized using a single command.

Some operating systems can also reduce the size of an existing logical volume. If this operation is performed on a mounted filesystem, and the new size of the logical volume is still at least a little larger than the existing filesystem, it can be accomplished without losing any data. Under any other conditions, data loss is very, very likely indeed. This technique is not for the fainthearted.

Currently, there is no easy way to decrease the size of a filesystem under AIX or FreeBSD, even if there is unused space within the filesystem. If you want to make a filesystem smaller, you need to back up the current files (and verify that the tape is readable!), delete the filesystem and its logical volume, create a new, smaller logical volume and filesystem, and then restore the files. The freed logical partitions can then be allocated as desired within their volume group; they can be added to an existing logical volume, used to make a new logical volume and filesystem, used in a new or existing paging space, or held in reserve.

Table 10-7 lists the LVM-related terminology used by the various Unix operating systems.

Table 10-7. LVM terminology
Item	AIX	FreeBSD^[26]	HP-UX	Linux	Solaris	Tru64 AdvFS^[27]	Tru64 LSM
Facility	Logical Volume Manager	Vinum Volume Manager	Logical Volume Manager	Logical Volume Manager	Volume Manager	Advanced File System	Logical Storage Manager
Virtual disk	volume group	none	volume group	volume group	volume	domain	disk group
Logical volume	logical volume	volume	logical volume	logical volume	volume, soft partition	fileset	volume
Allocation unit	partition	subdisk	extent	extent	extent	extent	extent

^[26] As we'll see, the FreeBSD entity mappings here are not precise because the concepts are somewhat different.

^[27] Not a true LVM, AdvFS nevertheless shares many features with them.

10.3.3.2 Disk striping

Disk striping is an option that is increasingly available as an extension to Unix, especially on high-performance systems. Striping combines two or more physical disks (or disk partitions) into a single logical disk, viewed like any other filesystem device by the rest of Unix. Disk striping is used to increase I/O performance at least as often as it is used to create very large filesystems spanning more than one physical disk. Striped disks split I/O operations across the physical disks in the stripe, performing them in parallel, and are thus able to achieve significant performance improvements over a single disk (although not always the nearly linear speedups that are sometimes claimed). Striping is especially effective for single-process transfer rates to a very large file and for processes performing a large number of I/O operations. Disk striping performance is discussed in detail in Section 15.5.
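The way a striped device spreads consecutive blocks over its member disks can be sketched with a simple round-robin mapping. Real implementations differ in detail, so treat this as an illustration only; the function name and its arguments are invented.

```shell
# Map a logical block number to "disk offset" for a simple round-robin
# stripe. stripe_blocks is the stripe unit in blocks; disks are
# numbered from 0, and the offset is the block number on that disk.
stripe_locate() {
    block=$1 stripe_blocks=$2 ndisks=$3
    stripe=$(( block / stripe_blocks ))
    disk=$(( stripe % ndisks ))
    offset=$(( (stripe / ndisks) * stripe_blocks + block % stripe_blocks ))
    echo "$disk $offset"
}

stripe_locate 0 16 3     # prints 0 0
stripe_locate 16 16 3    # prints 1 0
stripe_locate 50 16 3    # prints 0 18
```

A large sequential transfer touches every disk in turn, which is where the parallelism comes from.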

Special-purpose striped-disk devices are available from many vendors. In addition, many Unix systems offer software disk-striping. They provide utilities for configuring physical disks into a striped device, and the striping itself is done by the operating system, at the cost of some additional overhead.

The following general considerations apply to software striped-disk configurations:

For maximum performance, the individual disks in the striped filesystem should be on separate disk controllers. However, it is permissible to place different disks on a given controller into separate stripe sets.
Some operating systems require that the individual disks be identical devices: the same size, the same partition layout, and often the same brand. If the layouts are different, the size of the smallest disk is often what is used for the filesystem and any additional space on the other disks will be unusable and wasted.
In general, disks used for striping should not be used for any purpose other than the I/O whose performance you want to optimize. Placing ordinary user files on striped disks seldom makes sense. Similarly, striping swap space makes sense only if paging performance is the most significant disk I/O performance factor on the system.
In no case should the device containing the root filesystem be used for disk striping. This is really a corollary of the previous item.
The stripe size selected for a striped filesystem is important. The optimal value depends on the typical data transfer characteristics and requirements for the application programs for which the filesystem is intended. Some experimentation with different stripe sizes will probably be necessary. Provided that processes using the striped filesystem perform large enough I/O operations, a larger stripe size will generally result in better I/O performance. However, the tradeoff is that larger stripe sizes mean a larger filesystem block size and, accordingly, less efficient allocation of available disk space.
Software disk striping is really designed for two to four disks. In most cases, adding disks beyond that yields only modest additional performance gains.
SCSI disks make the most sense when you're using software striping for performance.

Software disk-striping is generally accomplished via the LVM or similar facility.

10.3.3.3 Disk mirroring and RAID

Another approach to combining multiple disks into a single logical device is RAID (or Redundant Array of Inexpensive^[28] Disks). In general, RAID devices are designed for increased data integrity and availability (via redundant copies), not for improved performance (RAID 0 is an exception).

^[28] Some acronym expansions put "Independent" here.

There are at least six defined RAID levels, which differ in how the multiple disks within the unit are organized. Most available hardware RAID devices support some combination of the following levels (level 2 is not used in practice). Table 10-8 summarizes the commonly used RAID levels.

Table 10-8. Commonly used RAID levels
Level	Description	Advantages/Disadvantages
0	Disk striping only.	+ Best I/O performance for large transfers. + Largest storage capacity. - No data redundancy.
1	Disk mirroring: every disk drive is duplicated for 100% data redundancy.	+ Most complete data redundancy. + Good performance on small transfers. - Largest disk requirements for fault tolerance.
3	Disk striping with a parity disk; data is split across component disks on a byte-to-byte basis; the parity disk enables reconstruction of all data if a drive fails.	+ Data redundancy with minimal overhead. + Decent I/O performance for reads. - Parity disk is a bottleneck for writes. - Significant operating system overhead.
4	Disk striping with a parity disk; data is split across component disks on a per-block basis; the parity disk enables reconstruction of all data if a drive fails.	+ Data redundancy with minimal overhead. + Better than level 3 for large sequential writes. - Parity disk is a bottleneck for small writes. - Significant operating system overhead.
5	Same as level 4 except that the parity information is split across multiple component disks, in an attempt to prevent the parity disk from becoming an I/O bottleneck.	+ Data redundancy with minimal overhead. + Best performance for writes. - Not as fast as level 3 or 4 for reads. - Significant operating system overhead.

Figure 10-7 illustrates RAID 5 in action, using 5 disks.

Figure 10-7. The RAID 5 data distribution scheme
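The parity idea behind levels 3 through 5 can be sketched with a little shell arithmetic. This is only an illustration: it assumes XOR parity (the usual choice) over single small integers standing in for data blocks, and the helper name xor_all is hypothetical rather than part of any RAID tool.

```shell
# Illustrative only: XOR parity over one "stripe" of four data values.
# xor_all is a hypothetical helper, not part of any RAID product.
xor_all() {
    result=0
    for value in "$@"; do
        result=$(( result ^ value ))
    done
    echo "$result"
}

# Four data values, one per data disk; a fifth disk would hold the parity.
parity=$(xor_all 23 42 7 99)

# Suppose the disk holding 42 fails: XORing the parity with the
# surviving values regenerates the missing one.
recovered=$(xor_all "$parity" 23 7 99)
echo "parity=$parity recovered=$recovered"
```

The same property is what lets a RAID 3, 4, or 5 device keep running after a single drive failure: any one missing member of the stripe can be recomputed from the others.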

There are also some hybrid RAID levels:

RAID 0+1: Mirroring of striped disks. Two striped sets are mirrors of one another. Data is striped across each stripe set, and the same data is sent to both stripes. Thus, this RAID variation provides both I/O performance advantages and fault tolerance.
RAID 1+0 (sometimes called RAID 10): Striping across mirror sets. Similar in intent to RAID 0+1, it provides equivalent performance advantages and slightly better fault tolerance in that it is easier to rebuild the RAID device after a single disk failure (since the data on only one mirror set is affected).

Both these levels use a minimum of four disks.
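To compare the storage cost of the various levels, a rough capacity calculation is often enough. The sketch below assumes identical disks, two-way mirroring for level 1 and the hybrid levels, and one disk's worth of parity for levels 3 through 5; the function name raid_usable_gb is made up for this example.

```shell
# Usable capacity (in GB) for n identical disks of size_gb each.
# raid_usable_gb is a hypothetical helper; levels follow Table 10-8.
raid_usable_gb() {
    level=$1 n=$2 size_gb=$3
    case $level in
        0)          echo $(( n * size_gb )) ;;        # striping, no redundancy
        1|0+1|1+0)  echo $(( n * size_gb / 2 )) ;;    # assumes two-way mirroring
        3|4|5)      echo $(( (n - 1) * size_gb )) ;;  # one disk's worth of parity
        *)          echo "unknown level: $level" >&2; return 1 ;;
    esac
}

raid_usable_gb 0 5 18      # five 18 GB disks, striped
raid_usable_gb 5 5 18      # the same disks as RAID 5
raid_usable_gb 1+0 4 18    # four disks as striped mirror sets
```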

Most hardware RAID devices connect to standard SCSI or SCSI-2 controllers.^[29] Many systems also offer software RAID facilities within their LVM (as we shall see).

^[29] A small minority use Fibre Channel.

The following considerations apply to all software RAID implementations:

Be careful not to overload disk controllers when using software RAID, because this will significantly degrade performance for all RAID levels. Putting disks on separate controllers is almost always beneficial.
As with plain disk striping, the stripe size chosen for RAID 5 can affect performance. The optimum value depends heavily on the typical I/O operation type.
The sad fact is that if you want both high performance and fault tolerance, software RAID, and especially RAID 5, is likely to be a poor choice. RAID 1 works reasonably (with two-way mirroring), although it does add some overhead to the system. The additional overhead that RAID 5 places on the operating system is considerable, about 23% more than required for normal I/O operations. The bottom line for RAID 5 is to spend the money to get a hardware solution, and use software RAID 5 only if you can't afford anything better. Having said that, software RAID 5 often works well on a dedicated file server with a lot of CPU horsepower, some fast SCSI disks, and very few write operations.

10.3.3.4 AIX

AIX defines the root volume group, rootvg, automatically when the operating system is installed. Here is a typical setup:

# lsvg rootvg                            Display volume group attributes.
VOLUME GROUP:   rootvg                   VG IDENTIFIER:  0000018900004c0.. .
VG STATE:       active                   PP SIZE:        32 megabyte(s)
VG PERMISSION:  read/write               TOTAL PPs:      542 (17344 MB)
MAX LVs:        256                      FREE PPs:       69 (2208 MB)
LVs:            11                       USED PPs:       473 (15136 MB)
OPEN LVs:       10                       QUORUM:         2
TOTAL PVs:      1                        VG DESCRIPTORS: 2
STALE PVs:      0                        STALE PPs:      0
ACTIVE PVs:     1                        AUTO ON:        yes
MAX PPs per PV: 1016                     MAX PVs:        32
LTG size:       128 kilobyte(s)          AUTO SYNC:      no
HOT SPARE:      no
# lsvg -l rootvg                         List logical volumes in a volume group.
rootvg:
LV NAME             TYPE       LPs   PPs   PVs  LV STATE      MOUNT POINT
hd5                 boot       1     1     1    closed/syncd  N/A
hd6                 paging     16    16    1    open/syncd    N/A
hd8                 jfs2log    1     1     1    open/syncd    N/A
hd4                 jfs2       1     1     1    open/syncd    /
hd2                 jfs2       49    49    1    open/syncd    /usr
hd9var              jfs2       3     3     1    open/syncd    /var
hd3                 jfs2       1     1     1    open/syncd    /tmp
hd1                 jfs2       1     1     1    open/syncd    /home
hd10opt             jfs2       1     1     1    open/syncd    /opt
lg_dumplv           sysdump    32    32    1    open/syncd    N/A
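The megabyte figures in the lsvg listing are simply the partition counts multiplied by the physical partition size. A quick check, using a throwaway helper (pp_to_mb is a name invented here) and the numbers from the listing:

```shell
# pp_to_mb: convert a count of physical partitions to megabytes.
# Hypothetical helper; the counts and the 32 MB PP size come from
# the lsvg rootvg listing.
pp_to_mb() {
    echo $(( $1 * $2 ))
}

pp_size=32
echo "total: $(pp_to_mb 542 $pp_size) MB"
echo "free:  $(pp_to_mb 69 $pp_size) MB"
echo "used:  $(pp_to_mb 473 $pp_size) MB"
```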

Adding a new disk under AIX follows the same basic steps as for other Unix systems, although the commands used to perform them are quite different. Once you've attached the device to the system, reboot it. Usually, AIX will discover new devices at boot time and automatically create special files for them. Defined disks have special filenames like /dev/hdisk1. The cfgmgr command may be used to search for new devices between boots; it has no arguments.

The lsdev command will list the disks present on the system:

$ lsdev -C -c disk
hdisk0 Available 00-00-0S-0,0 1.0 GB SCSI Disk Drive
hdisk1 Available 00-00-0S-2,0 Other SCSI Disk Drive
.. .

The new disk must then be made part of a volume group. To create a new volume group, use the mkvg command:

# mkvg -y "chemvg" hdisk5 hdisk6

This command creates a volume group named chemvg consisting of the disks hdisk5 and hdisk6. mkvg's -s option can be used to specify the physical partition size in MB: from 1 to 1024 (4 is the default). The value must be a power of 2.^[30]

^[30] You will need to increase this parameter for disks larger than 4 GB (1016 * 4 MB), because the maximum number of physical partitions is 1016. You can increase the latter limit using the -t option to mkvg and chvg. The new maximum will be this option's value times 1016. This can be necessary when adding a large (18 GB or more) disk to an existing volume group containing significantly smaller disks. It may also eventually be necessary for future very, very large disks.
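The arithmetic in the footnote suggests a simple way to pick a value for -s: double the partition size until 1016 partitions cover the disk. The following is a sketch of that reasoning only; min_pp_size is a hypothetical helper, and it does not enforce the 1024 MB upper limit or consider the -t alternative.

```shell
# Smallest power-of-2 PP size (in MB) that lets a disk of disk_mb
# megabytes fit within max_pps physical partitions (1016 by default,
# per the footnote). min_pp_size is a hypothetical helper.
min_pp_size() {
    disk_mb=$1
    max_pps=${2:-1016}
    pp=1
    while [ $(( pp * max_pps )) -lt "$disk_mb" ]; do
        pp=$(( pp * 2 ))
    done
    echo "$pp"
}

min_pp_size 4064     # a disk that just fits with 4 MB partitions
min_pp_size 18432    # an 18 GB disk
```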

After a volume group is created, it must be activated with the varyonvg command:

# varyonvg chemvg

Thereafter, the volume group will be activated automatically at each boot time. Volume groups are deactivated with varyoffvg; all of their filesystems must be dismounted first.

A new disk may be added to an existing volume group with the extendvg command. For example, the following command adds the disk hdisk4 to the volume group named chemvg:

# extendvg chemvg hdisk4

The following other commands operate on volume groups:

chvg: Change volume group characteristics.
reducevg: Remove a disk from a volume group (removing all disks deletes the volume group).
importvg: Add an existing volume group to the system (used to move disks between systems and to activate existing volume groups after replacing the root disk).
exportvg: Remove a volume group from the system device database but don't alter it (used to move disks to another system).

Logical volumes are created with the mklv command, which has the following basic syntax:

mklv -y "lvname" volgrp n [disks]

lvname is the name of the logical volume, volgrp is the volume group name, and n is the number of logical partitions. For example, the command:

# mklv -y "chome" chemvg 64

makes a logical volume in the chemvg volume group consisting of 64 logical partitions (256 MB) named chome. The special files /dev/chome and /dev/rchome will automatically be created by mklv.
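Since mklv wants a count of logical partitions rather than a size, it can help to do the conversion explicitly. This small sketch rounds up to a whole number of partitions; lps_for_mb is an invented name, and the 4 MB default simply mirrors the default partition size mentioned earlier.

```shell
# Number of logical partitions needed to hold size_mb megabytes,
# rounding up. The second argument is the partition size in MB
# (4 if omitted). lps_for_mb is a hypothetical helper.
lps_for_mb() {
    size_mb=$1
    pp_mb=${2:-4}
    echo $(( (size_mb + pp_mb - 1) / pp_mb ))
}

lps_for_mb 256       # the chome example: 256 MB in 4 MB partitions
lps_for_mb 250       # not a multiple of 4, so one extra partition
lps_for_mb 500 32    # 500 MB in a volume group with 32 MB partitions
```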

The mklv command has many other options, which allow the administrator as much control over how the logical volume maps to physical disks as desired, down to the specific physical partition level. However, the default settings work very well for most applications.

The following commands operate on logical volumes:

extendlv: Increase the size of a logical volume.
chlv: Change the characteristics of a logical volume.
mklvcopy: Increase the number of data copies in a logical volume.
rmlv: Delete a logical volume.

A small logical volume in each volume group is used for logging and other disk management purposes. Such logical volumes are created automatically by AIX and have names like loglv00.

Once the logical volumes have been created, you can build filesystems on them. AIX has a version of mkfs, but crfs is a much more useful command for creating filesystems. There are two ways to create a filesystem:

Create a logical volume and then create a filesystem on it. The filesystem will occupy the entire logical volume.
Create a filesystem and let AIX create a logical volume for you automatically.

The second way is faster, but the logical volume name AIX chooses is quite generic (lv00 for the first one so created, and so on), and the size must be specified in 512-byte blocks rather than in logical partitions (which default to 4 MB units).

The crfs command has the following basic form:

crfs -v jfs2 -g vgname -a size=n -m mt-pt -A yesno -p prm

The options have the following meanings:

-v jfs2: The filesystem type is jfs2 ("enhanced journaled filesystem," using the logging logical volume in its volume group), the recommended local filesystem type.
-g vgname: Volume group name.
-a size=n: Size of the filesystem, in 512-byte blocks.
-m mt-pt: Mount point for the filesystem (created if necessary).
-A yesno: Whether the filesystem is mounted by mount -a commands.
-p prm: Permissions for the filesystem: ro for read-only or rw for read-write.
-a frag=n: Use a fragment size of n bytes for the filesystem. This value can range from 512 to 4096, in powers of 2, and it defaults to 4096. Smaller sizes will allocate disk space more efficiently for usage patterns consisting of many small files.
-a nbpi=n: Specify n as the number of bytes per inode. This setting controls how many inodes are created for the new filesystem (number of inodes equals filesystem size divided by bytes per inode). The default value of 4096 usually creates more than you'll ever need except for filesystems with many, many tiny files. The maximum value is 16384.
-a compress=LZ: Use transparent LZ compression on the files in the filesystem (this option is disabled by default).

For example, the following command creates a new filesystem in the chemvg volume group:

# crfs -v jfs2 -g chemvg -a size=50000 -a frag=1024 -m /organic2 -A yes
# mount /organic2

The new filesystem will be mounted at /organic2 (automatically at boot time), is 25 MB in size, and uses a fragment size of 1024 bytes. A new logical volume will be created automatically, and the filesystem will be entered into /etc/filesystems. The initial mount must be done by hand.
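Sizes given in 512-byte blocks are easy to get wrong, so a pair of conversion helpers may be useful. The sketch below takes 1 MB to mean 1024 * 1024 bytes (2048 blocks), which is an assumption; both function names are invented for this example.

```shell
# 512-byte blocks needed for a filesystem of the given size in MB
# (1 MB taken as 1024 * 1024 bytes, i.e., 2048 blocks).
blocks_for_mb() {
    echo $(( $1 * 2048 ))
}

# Whole megabytes covered by a given block count (rounded down).
mb_for_blocks() {
    echo $(( $1 / 2048 ))
}

blocks_for_mb 25       # value to use with -a size= for 25 MB
mb_for_blocks 50000    # the size=50000 example above
```

By this measure the 50000-block example is a little under 25 MB (it is 25.6 million bytes), which is why the figure quoted for it is approximate.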

The -d option is used to create a filesystem on an existing logical volume:

# crfs -v jfs2 -d chome -m /inorganic2 -A yes

This command creates a filesystem on the logical volume we created earlier. The size and volume group options are not needed in this case.

The chfs command may be used to increase the size of a filesystem. For example, the following command increases the size of the /inorganic2 filesystem (and of its logical volume chome) created above:

# chfs -a size=+50000 /inorganic2

An absolute or relative size may be specified for the size parameter (in 512-byte blocks). The size of a logical volume may be increased with the extendlv command, but it has no effect on filesystem size.

The following commands operate on AIX jfs and jfs2 filesystems:

chfs: Change filesystem characteristics.
rmfs: Remove a filesystem, its associated logical volume, and its entry in /etc/filesystems.

10.3.3.4.1 Replacing a failed disk

When you need to remove a disk from the system, most likely due to a hardware failure, there are two considerations to keep in mind:

If possible, perform the steps to remove a damaged non-root disk from the LVM configuration before letting field service replace it (otherwise, it will take some persistence to get the system to forget about the old disk).
Items must be removed in the reverse order from the way they were created: filesystems, then logical volumes, then volume groups.

The following commands remove hdisk4 from the LVM configuration (the volume group chemvg2 and the logical volume chlv2 holding the /chem2 filesystem are used as an example):

# umount /chem2                         Unmount filesystem.
# rmfs /chem2                           Repeat for all affected filesystems.
# rmlvcopy chlv2 2 hdisk4               Remove mirrors on hdisk4.
# chps -a n paging02                    Don't activate paging space at next boot.
# shutdown -r now                       Reboot the system.
# chpv -v r hdisk4                      Make physical disk unavailable.
# reducevg chemvg2 hdisk4               Remove disk from volume group.
# rmdev -l hdisk4 -d                    Remove definition of disk.

When the replacement disk is added to the system, it will be detected, and devices will be created for it automatically.

10.3.3.4.2 Getting information from the LVM

AIX provides many commands and options for listing information about LVM entities. Table 10-9 attempts to make it easier to figure out which one to use for a given task.

Table 10-9. AIX LVM informational commands
If you want to see:	Use this command:
All disks on the system	`lspv`
All volume groups	`lsvg`
All logical volumes	`lsvg -l $(lsvg)`
All filesystems	`lsfs`
All filesystems of a given type	`lsfs -v` `type`
What logical volumes are in a volume group	`lsvg -l` `vgname`
What filesystems are in a volume group	`lsvgfs` `vgname`
What disks are in a volume group	`lsvg -p` `vgname`
Which volume group a disk is in	`lsvg -n hdiskn`
Disk characteristics and settings	`lspv hdiskn`
Volume group settings	`lsvg` `vgname`
Logical volume characteristics	`lslv` `lvname`
Size of an unmounted local filesystem (in blocks)	`lsfs` `file-system`
Whether there is any unused space on a logical volume already containing a filesystem (compare lv size and fs size)	`lsfs -q` `file-system`
Disk usage summary map by region	`lspv -p hdiskn`
Locations of the free physical partitions on a disk broken down by region	`lspv hdiskn`
Locations of all free physical partitions in a volume group, by disk and disk region	`lsvg -p` `vgname`
Which logical volumes use a given disk, broken down by disk region	`lspv -l hdiskn`
What disks a logical volume is stored on, including disk region distribution	`lslv -l` `lvname`
Table showing the physical-to-logical partition mapping for a logical volume	`lslv -m` `lvname`
Table showing physical partition usage for a disk by logical volume	`lspv -M hdiskn`

10.3.3.4.3 Disk striping and disk mirroring

A striped logical volume is created by specifying mklv's -S option, indicating the stripe size, which must be a power of 2 from 4K to 128K. For example, this command creates a 500 MB logical volume striped across two disks consisting of a total of 125 logical partitions, each 4 MB in size:

# mklv -y cdata -S 64K chemvg 125 hdisk5 hdisk6

Note that the disk names are required on the mklv command when creating a striped logical volume.
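To see what a stripe size means in practice, consider where a given byte of the logical volume ends up. The sketch below uses a plain round-robin model, which is the general idea behind striping; it is not a description of AIX's exact on-disk layout, and stripe_location is a hypothetical helper.

```shell
# For a byte offset into a striped volume, report which disk (0-based)
# holds it and which stripe unit on that disk, assuming simple
# round-robin placement. stripe_location is a hypothetical helper.
stripe_location() {
    offset=$1 stripe_bytes=$2 ndisks=$3
    unit=$(( offset / stripe_bytes ))
    echo "disk=$(( unit % ndisks )) unit_on_disk=$(( unit / ndisks ))"
}

stripe=$(( 64 * 1024 ))          # 64K stripe size, as in the example
stripe_location 0 $stripe 2      # first byte of the volume
stripe_location 65536 $stripe 2  # first byte of the second stripe unit
stripe_location 200000 $stripe 2
```

Consecutive 64K units alternate between the two disks, so a large sequential transfer keeps both drives busy at once.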

Multiple data copies (mirroring) may be specified with the -c option, which takes the number of copies as its argument (the default is 1). For example, the following command creates a two-way mirrored logical volume:

# mklv -c 2 -s s -w y biovg 500 hdisk2 hdisk3

The command specifies two copies, a super strict allocation policy (which forces each mirror onto a separate physical disk, hence the disk list), and write synchronization during each I/O operation (which reduces I/O performance but guarantees data synchronization).

An entire volume group can also be mirrored. This is configured using the mirrorvg command.

Finally, the -a option is used to request placement of the new logical volume within a general region of the disk. For example, this command requests that the logical volume be placed in the center portion of the disk to as great an extent as possible:

# mklv -y chome -ac chemvg 64

Disks are divided into five regions named as follows (beginning at the outer edge): edge, middle, center, inner-middle, and inner-edge. The middle region is the default, and the other available arguments to -a are accordingly e, im, and ie.

AIX does not provide general software RAID, although one can use mirrors and stripes to achieve the same functionality as RAID 0, 1, and 1+0.

10.3.3.5 HP-UX

HP-UX provides another version of an LVM that is used by default. The vg00 volume group holds the system files, which are divided into several logical volumes:

# vgdisplay vg00                         Display volume group attributes.
--- Volume groups ---                    Output shortened.
VG Name                     /dev/vg00
VG Write Access             read/write
VG Status                   available
Max LV                      255
Cur LV                      8
Open LV                     8
Max PV                      16
Cur PV                      1
Act PV                      1
Max PE per PV               2500
PE Size (Mbytes)            4
Total PE                    2169
Alloc PE                    1613
Free PE                     556
Total Spare PVs             0
Total Spare PVs in use      0
# bdf                                    Output shows mounted logical volumes.
Filesystem          kbytes    used   avail %used Mounted on
/dev/vg00/lvol3     143360   22288  113567   16% /
/dev/vg00/lvol1      83733   32027   43332   42% /stand
/dev/vg00/lvol7    2097152  419675 1572833   21% /var
/dev/vg00/lvol6    1048576  515524  499746   51% /usr
/dev/vg00/lvol5      65536    1128   60386    2% /tmp
/dev/vg00/lvol4    2097152  632916 1372729   32% /opt
/dev/vg00/lvol8      20480    1388   17900    7% /home

The process of creating a volume group begins by designating the component disks (or disk partitions) as physical volumes, using the pvcreate command:

# pvcreate /dev/rdsk/c2t0d0

Next, a directory and character special file must be created in /dev for the volume group:

# mkdir /dev/vg01
# mknod /dev/vg01/group c 64 0x010000

The major number is always 64, and the minor number is of the form 0x0n0000, where n varies from 0 to 9 and must be unique across all volume groups (I assign them in order).
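If you create volume groups often, generating the minor number from n avoids typing mistakes. The following sketch only formats the number described above; group_minor is a name invented for this example, and it does not check whether the value is already in use.

```shell
# Build the minor number 0x0n0000 for the volume group's group file,
# where n is a single digit from 0 to 9 as described in the text.
# group_minor is a hypothetical helper.
group_minor() {
    n=$1
    if [ "$n" -lt 0 ] || [ "$n" -gt 9 ]; then
        echo "n must be between 0 and 9" >&2
        return 1
    fi
    printf '0x0%d0000\n' "$n"
}

group_minor 1    # the value used for vg01 above
group_minor 2    # what a second volume group might use
```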

The volume group may now be created with the vgcreate command, which takes the volume group directory in /dev and the component disks as its arguments:

# vgcreate /dev/vg01 /dev/dsk/c2t0d0

vgcreate's -s option may be used to specify an alternate physical extent size (in megabytes). The default of 4 may be too small for large disks. You can add an additional volume to an existing volume group with the vgextend command.

The vgcreate and vgextend commands also have a -g option, which allows you to define named subsets of the disks in the volume group, known as physical volume groups, as in this example that creates two physical volume groups in the vg01 volume group:

# vgcreate /dev/vg01 -g groupa /dev/dsk/c2t0d0 /dev/dsk/c2t4d0
# vgextend /dev/vg01 -g groupb /dev/dsk/c1t0d0 /dev/dsk/c1t1d0

The file /etc/lvmpvg holds the physical volume group data, and it may be edited directly rather than running vgcreate:

VG     /dev/vg01
PVG    groupa
/dev/dsk/c2t0d0
/dev/dsk/c2t4d0
PVG    groupb
/dev/dsk/c1t0d0
/dev/dsk/c1t1d0

Once the volume group is created, the lvcreate command may be used to create a logical volume. For example, the following command creates a 200 MB logical volume named chemvg:

# lvcreate -n chemvg -L 200 /dev/vg01

If the specified size is not an even multiple of the extent size (4 MB), the size is rounded up to the nearest multiple.
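The rounding rule is easy to reproduce when planning how much space a logical volume will really consume. A minimal sketch, with round_to_extent as an invented helper name and the 4 MB default extent size as its default:

```shell
# Round a requested size (in MB) up to a whole number of physical
# extents. The second argument is the extent size in MB (4 if omitted).
# round_to_extent is a hypothetical helper.
round_to_extent() {
    size_mb=$1
    pe_mb=${2:-4}
    echo $(( (size_mb + pe_mb - 1) / pe_mb * pe_mb ))
}

round_to_extent 200       # already a multiple of 4
round_to_extent 202       # rounded up to the next multiple
round_to_extent 250 16    # with a 16 MB extent size
```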

If the new logical volume is to be used for the root or boot filesystem or as a swap space, you must run the lvlnboot command with its -r, -b, or -s option (respectively). The command takes the logical volume device as its argument:

# lvlnboot -s /dev/vg01/swaplv

The -r option will create a combined boot/root volume if the specified logical volume is the first one on the physical volume.

Once a logical volume is built, a filesystem may be built upon it. For example:

# newfs /dev/vg01/rchemvg

The logical volume name is concatenated to the volume group directory in /dev to form the special filenames referring to the logical volume; note that newfs uses the raw device. The new filesystem may then be mounted and entered into the filesystem configuration file.

You can customize the parameters for a new VxFS filesystem using these options to newfs:

-b size: Filesystem block size in bytes. The default is 1024 for filesystems smaller than 8 GB, 2048 for ones up to 16 GB, 4096 for ones less than 32 GB, and 8192 for larger ones. The value must be a power of 2 from 1024 to 8192 (or to 65536 on 700 series systems using disk striping, which is discussed later in this chapter).
-l: Enable files larger than 2 GB.
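The default block size rule for -b can be written out as a small decision table. This sketch follows the wording above literally; how the exact boundary sizes (8, 16, and 32 GB) are treated is my assumption, and default_vxfs_bsize is a hypothetical helper rather than anything newfs provides.

```shell
# Default block size (bytes) for a filesystem of size_gb gigabytes,
# following the rule quoted in the text. Treatment of the exact
# boundary values is an assumption. Hypothetical helper.
default_vxfs_bsize() {
    size_gb=$1
    if   [ "$size_gb" -lt 8 ];  then echo 1024
    elif [ "$size_gb" -le 16 ]; then echo 2048
    elif [ "$size_gb" -lt 32 ]; then echo 4096
    else                             echo 8192
    fi
}

for gb in 4 12 24 100; do
    echo "$gb GB: $(default_vxfs_bsize $gb)"
done
```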

Other commands that operate on LVM entities are listed below:

vgextend: Add disk to volume group.
vgreduce: Remove disk from volume group.
vgremove: Remove a volume group.
lvextend: Add physical extents or mirrored copies to a logical volume.
lvreduce: Remove physical extents or mirrored copies from a logical volume.
lvremove: Remove a logical volume from a volume group.

10.3.3.5.1 Displaying LVM information

The following commands display information about LVM entities:

pvdisplay disk: Summary information about the disk drive.
pvdisplay -v disk: Mapping of physical extents to logical extents.
vgdisplay vg: Summary information about the volume group.
vgdisplay -v vg: Brief information about all logical volumes within the volume group.
lvdisplay lv: Summary information about the logical volume.
lvdisplay -v lv: Mapping of logical to physical extents for the logical volume.

10.3.3.5.2 Disk striping and mirroring

The LVM is also used to perform disk striping and disk mirroring on HP-UX systems. For example, the following command creates a 200 MB logical volume named cdata with one mirrored copy:

# lvcreate -n cdata -L 200 -m 1 -s g /dev/vg01

The -s g option specifies that the mirrors must be placed into different physical volume groups.

Under HP-UX, disk striping occurs at the logical volume level. The following command creates a 400 MB four-way striped logical volume, using a stripe size of 64 KB:

# lvcreate -n tyger -L 400 -i 4 -I 64 /dev/vg01

The -i option specifies the number of stripes (disks) and can be no larger than the total number of disks in the volume group; -I specifies the stripe size in KB, and its valid range is powers of 2 from 4 to 64.
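The constraints on -i and -I can be checked before running lvcreate. The sketch below encodes only the two rules just stated; check_stripe_args is an invented name, and the number of disks in the volume group is passed in by hand rather than queried from the LVM.

```shell
# Check proposed -i (stripes) and -I (stripe size in KB) values
# against the rules in the text. disks_in_vg must be supplied by the
# caller. check_stripe_args is a hypothetical helper.
check_stripe_args() {
    stripes=$1 stripe_kb=$2 disks_in_vg=$3
    if [ "$stripes" -gt "$disks_in_vg" ]; then
        echo "too many stripes"
        return 1
    fi
    case $stripe_kb in
        4|8|16|32|64) echo "ok" ;;
        *)            echo "bad stripe size"; return 1 ;;
    esac
}

check_stripe_args 4 64 4            # the example above, with four disks
check_stripe_args 4 48 4 || true    # 48 is not a power of 2
check_stripe_args 5 64 4 || true    # more stripes than disks
```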

Most HP-UX versions do not provide software RAID.^[31]

^[31] Software RAID is provided under HP-UX with VxVM (the Veritas Volume Manager, which supports software RAID 5, 0+1, and 1+0). HP began shipping VxVM with HP-UX 11i.

10.3.3.6 Tru64

Tru64 provides two facilities which have many of the characteristics of a logical volume manager:

The Advanced File System (AdvFS), whose name is something of a misnomer, as it is actually both a filesystem type and a simple logical volume manager. The filesystem is included with the operating system, but there is also an add-on product containing additional AdvFS utilities.
The Logical Storage Manager (LSM) facility is an advanced LVM facility. It adds an additional layer to the structure usually found in logical volume managers. This is an add-on product.

We'll consider each of them in separate subsections.

10.3.3.6.1 AdvFS

The AdvFS defines the following entities:

A volume is a logical entity that can correspond to a disk partition, an entire disk, an LSM volume (see below), or even an external storage device such as a hardware RAID array.
A domain is a set of one or more volumes.
A fileset is a directory tree that can be mounted within the filesystem. Domains can contain multiple filesets.

Unlike other LVMs, under the AdvFS, domains and filesets (physical storage and directory trees, respectively) are independent, and either one can be modified without affecting the other (as we'll see).

The AdvFS facility is used by default on Tru64 systems. It defines two domains and several filesets:

# showfdmn root_domain                    Describe this domain.
               Id              Date Created  LogPgs  Version  Domain Name
3a535b22.000c47c0  Wed Jan  3 12:02:26 2001     512        4  root_domain
  Vol   512-Blks        Free  % Used  Cmode  Rblks  Wblks  Vol Name
   1L     524288       95680     82%     on    256    256  /dev/disk/dsk0a
# mountlist -v                            List mounted filesets.
        root_domain#root                  Root filesystem.
        usr_domain#usr                    Mounted at /usr.
        usr_domain#var                    Mounted at /var.
# showfsets usr_domain                    List filesets in a domain.
usr
        Id           : 3a535b27.0005a120.1.8001
        Files        :    43049,  SLim=        0,  HLim=        0
        Blocks (512) :  1983812,  SLim=        0,  HLim=        0
        Quota Status : user=off group=off     Output shortened.
var
        Id           : 3a535b27.0005a120.2.8001
        Files        :     1800,  SLim=        0,  HLim=        0
        Blocks (512) :    34954,  SLim=        0,  HLim=        0
        Quota Status : user=off group=off
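The % Used column in the showfdmn output follows from the block counts shown. As a quick check, this sketch recomputes it from the 512-Blks and Free columns; rounding to the nearest percent is my assumption about how the figure is produced, and pct_used is a hypothetical helper.

```shell
# Percentage of a volume in use, from its total and free 512-byte
# block counts, rounded to the nearest whole percent (the rounding
# rule is an assumption). pct_used is a hypothetical helper.
pct_used() {
    total=$1 free=$2
    echo $(( ( (total - free) * 100 + total / 2 ) / total ))
}

pct_used 524288 95680    # figures for /dev/disk/dsk0a in root_domain
```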

You can create a new domain with the mkfdmn command:

# mkfdmn /dev/disk/dsk1c chem_domain

This command creates the chem_domain domain consisting of the specified volume (here, a disk partition). If you have the AdvFS Utilities installed, you can add volumes to a domain with the addvol command, as in this example, which adds a second disk partition to the chem_domain domain:

# addvol /dev/disk/dsk2c chem_domain
# balance chem_domain

You can similarly remove a volume from a domain with the rmvol command. The balance command is typically run after either one; it has the effect of balancing disk space usage among the various volumes in the domain to improve performance.

Once a domain has been created, you can create filesets within it. This process creates an entity which is effectively a relocatable filesystem; a fileset is ready to accept files as soon as it is created (no mkfs step is required), and its contents can be moved to a different physical disk location in its domain if required.

The following commands create two filesets within our domain, and mount them on two existing directories immediately afterward:

# mkfset chem_domain bronze
# mkfset chem_domain silver
# mount chem_domain#bronze /bronze
# mount chem_domain#silver /silver

The fileset is referred to by appending its name to the domain name, separated by a number sign (#). Note that we don't have to specify any actual disk locations (and indeed we cannot do so). These matters are handled by the AdvFS itself.
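Scripts that handle these names sometimes need the two halves separately. Standard shell parameter expansion is enough for that; the following is a small sketch in which split_fileset is an invented helper, not an AdvFS command.

```shell
# Split a domain#fileset name into its two parts using ordinary
# shell parameter expansion. split_fileset is a hypothetical helper.
split_fileset() {
    spec=$1
    case $spec in
        *'#'*) ;;
        *) echo "not a domain#fileset name: $spec" >&2; return 1 ;;
    esac
    # %%#* strips from the first # onward; #*# strips through the first #.
    echo "domain=${spec%%#*} fileset=${spec#*#}"
}

split_fileset 'chem_domain#bronze'
```

Note that the name must be quoted when it appears at the start of a word in a script, since an unquoted # there would begin a comment.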

The rmfset command may be used to remove a fileset from a domain. The renamefset command may be used to change the name of a fileset, as in this example:

# renamefset chem_domain lead gold

The AdvFS offers some limited disk striping facilities as part of its optional utilities package. A file can be striped by creating it with the stripe command:

# stripe -n 2 sulfur

This command creates the file sulfur as a two-way striped file. The file must be created before any data is placed into it. More complex striping of entire volumes can be done with the Logical Storage Manager described in the next subsection.

10.3.3.6.2 LSM

The Tru64 Logical Storage Manager is designed to support advanced disk features such as disk striping and fault tolerance. It is a layered product which must be added to the basic Tru64 operating system.

Under the LSM, a whole new set of terminology comes into play:

Disk group: A named collection of disks using a common LSM database. This roughly corresponds to a volume group.
Plex: The primary data storage entity. A plex can be concatenated, meaning that the discrete subdisks are simply combined sequentially, or striped, where data is striped across subdisks for higher performance. Software RAID 5 plexes can also be created.
Subdisk: A group of contiguous physical disk blocks. Subdisks are defined to force plexes to specific physical disk locations.
Volume: A collection of one or more plexes, conceptually performing the same function as a logical volume. Filesystems are built upon volumes. The innovation introduced by the LSM is that the component plexes in a mirrored volume need not be identical. For example, one plex might be made up of three subdisks, and another one could be composed of four subdisks of the same total size.

For the most common cases, you need only worry about disk groups and volumes; plexes are taken care of automatically by the LSM. In the remainder of this section, we'll look briefly at some simple examples of LSM configuration. Consult the documentation for full details.

The voldiskadd command is used to create a new disk group.^[32] This command takes the disks to be added to the group as its arguments:

^[32] This discussion assumes that the LSM has been initialized by creating the root disk group. This is done with the volsetup command, which takes two or more disks as its arguments. The vold and voliod daemons should also be running (which happens automatically during a successful LSM installation).

# voldiskadd dsk3 dsk4  .. .

It is an interactive tool which will prompt you for the additional information it needs, including the disk group name (we'll use dg1 in our examples) and the use for each disk (data or spare).

If you later want to place additional disks into a disk group, you use the voldg command, as in this example, which adds several more disks to dg1:

# voldg -g dg1 adddisk dsk9 dsk10 dsk11

Volumes are generally created with the volassist command. For example, the following command creates a volume consisting of a concatenated plex named chemvol, essentially a logical volume comprised of space from multiple disks on which a filesystem can be built:

# volassist -g dg1 make chemvol 2g dsk3 dsk4

The volume is created using the dg1 disk group, using the specified disks (the disk list is optional). Its size is 2 GB.

We'll go on to make this a mirrored plex, using these commands:

# volassist -g dg1 mirror chemvol init=active layout=nolog dsk5 dsk6
# volassist addlog chemvol dsk7

The first command adds a mirror to the chemvol volume (we've again chosen to specify which disks to use). The second command adds the required logging area to the volume.

The same technique could be used to mirror a single disk by using only one disk in each volassist command.

We can create a striped plex in a similar way:

# volassist -g dg1 make stripevol 2g layout=stripe nstripe=2 dsk3 dsk4

This command creates a two-way striped volume named stripevol.^[33]

^[33] For more complex striped and RAID 5 plexes, you may need to define subdisks to force the various stripes to specific disks (e.g., to spread them across multiple controllers) as the default assignments made by the LSM often do not do so.

The following command will create a 3 GB RAID 5 volume:

# volassist -g dg1 make raidvol 3g layout=raid5 nstripe=5  disks

For both striped and RAID 5 volumes, you can also use the stripeunit attribute (following nstripe) to specify the stripe size.

Disk groups containing mirrored or RAID 5 volumes should include designated hot spare disks. The following commands designate dsk9 as a hot spare for our disk group:

# voledit -g dg1 set spare=on dsk9
# volwatch -s lsmadmin@ahania.com

The volwatch command enables automatic hot spare replacement (-s), and its argument is the email address to which to send notifications when these events occur.

Once an LSM volume is created, it can be placed within an AdvFS domain and used for creating filesets.

The following commands are useful in obtaining information about LSM entities:

voldg -g dg free: Display free space in a disk group.
voldisk list: List all component disks used by the LSM.
volprint -v: List all volumes.
volprint -ht volume: Display information about a specific volume.
volprint -pt: List all plexes.
volprint -lp plex: Display information about a specific plex.
volprint -st: List all subdisks.
volprint -l subdisk: Display information about a specific subdisk.

Finally, the volsave command is used to save the LSM metadata to a disk file, which can then be backed up. The default location for these files is /usr/var/lsm/db, but you can specify an alternate location using the command's -d option. The files themselves are given names of the form LSM.n.host, where n is a 14-digit encoding of the date and time. The volrestore command will restore the saved data should it ever be necessary.

10.3.3.7 Solaris

Solaris 9 introduces a logical volume manager as part of the standard operating system. This facility was available as an add-on product with earlier versions of Solaris (although there have been some changes with respect to previous versions; see the documentation for details).

The Solaris Volume Manager supports striping, mirroring, RAID 5, soft partitions (the ability to divide any disk into more than four partitions), and some other features. The Volume Manager must be initialized before its first use, using commands like these:

# metadb -a -f c0t0d0s7                   Create initial state database replicas.
# metadb -a -c 2 c1t3d0s2                 Add replicas on this slice.

We are now ready to create volumes. We will look briefly at some simple examples in the remainder of this section.

The Solaris Volume Manager uses fixed names for volumes of the form dn, where n is an integer from 0 to 127. Thus, the maximum number of volumes is 128. The metainit command does most of the work of creating and configuring volumes.

The following command will create a concatenated volume consisting of three disks:

# metainit d1 3 1 c1t1d0s2 1 c1t2d0s2 1 c1t3d0s2

The parameters are the volume name, the number of components (always greater than one for a concatenated volume), and then three pairs consisting of the number of component disks (always 1 here) followed by desired disk(s). When the command completes, the volume d1 can be treated as if it were a single disk partition.
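The argument pattern is easy to get wrong by hand, so here is a small sketch that assembles it. The function concat_cmd is hypothetical: it only prints the metainit command line rather than running it, and it checks the two rules stated above (volume names run from d0 to d127, and a concatenation has more than one component).

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a metainit command that
# concatenates the given slices into one volume.
# Usage: concat_cmd VOLUME SLICE SLICE...
concat_cmd() {
    vol=$1
    shift
    # Volume names have the form dn, with n from 0 to 127.
    case $vol in
        d[0-9]|d[1-9][0-9]|d1[01][0-9]|d12[0-7]) ;;
        *) echo "bad volume name: $vol" >&2; return 1 ;;
    esac
    # A concatenated volume always has more than one component.
    if [ $# -lt 2 ]; then
        echo "need at least two slices" >&2
        return 1
    fi
    cmd="metainit $vol $#"
    for slice in "$@"; do
        cmd="$cmd 1 $slice"     # each component holds exactly one slice
    done
    printf '%s\n' "$cmd"
}

concat_cmd d1 c1t1d0s2 c1t2d0s2 c1t3d0s2
```

Run as shown, the sketch prints the same command line used in the example above.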

You can expand an existing filesystem using a similar command, as in this example, which expands the /docs filesystem (originally on c0t0d0s6):

# umount /docs
# metainit d10 2 1 c0t0d0s6 1 c2t3d0s2   Add additional disk space.
# vi /etc/vfstab                         Change the filesystem's devices to /dev/md/[r]dsk/d10.
# mount /docs
# growfs -M /docs /dev/md/rdsk/d10       Increase the filesystem size to the volume size.

The following command will create a striped volume:

# metainit d2 1 2 c1t1d0s2 c2t2d0s2 -i 64k

The parameters following the volume name indicate that we are creating a single striped volume with two component disks, using a stripe size (interlace value) of 64 KB (-i).

You can mirror volumes using metainit's -m option, followed by the metattach command, as in this example:

# metainit d20 -f 1 1 c0t3d0s2         Create the volume to be mirrored.
# umount /dev/dsk/c0t3d0s2
# metainit d21 1 1 c2t1d0s2            Create a volume to be used as the mirror.
# metainit d22 -m d20                  Specify the volume to be mirrored.
# vi /etc/vfstab                       Modify entry to point to the mirror volume (d22).
# mount /dev/md/dsk/d22                Remount filesystem.
# metattach d22 d21                    Add a mirror.

In this case, we add a mirror to an existing filesystem. We use the -f option on the first metainit command to force a volume to be created from an existing filesystem.

If we were mirroring the root filesystem, we would run the metaroot command (specifying the mirror volume as its argument) and then reboot the system, rather than unmounting and remounting the filesystem.

Other volume types (concatenated, striped, etc.) can also be mirrored using just the final two commands.

You can specify the read and write policies for mirrored volumes using the metaparam command, as in this example:

# metaparam -r geometric -w parallel d22

The -r option specifies the read policy, one of roundrobin (successive read operations go to each disk in turn, which is the default), first (all reads go to the first disk), and geometric (read operations are divided between the component disks by assigning specific disk regions to each one). The geometric read policy can minimize seek times by confining disk head movement to a subset of the disk, which can produce measurable performance improvements for I/O that is seek time-limited (e.g., randomly accessed data, such as a database).

The -w parameter specifies the write policy, one of parallel (write to all disks at the same time, which is the default) and serial. The latter might be used to improve performance when both mirrors are on the same busy disk controller.

The following command will create a RAID 5 volume:

# metainit d30 -r  disks  -i 96k

This creates a RAID 5 volume using a stripe size of 96 KB. The default stripe size is 16 KB, and it must range from 8 KB to 100 KB.
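As a sketch of how the stripe size limit might be checked before the command is issued, the hypothetical function raid5_cmd below prints (but does not run) a metainit command line. It rejects interlace values outside the 8 KB to 100 KB range quoted above. The three-slice minimum is the general requirement for RAID 5 rather than something stated here, and the slice names are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a metainit RAID 5 command.
# Usage: raid5_cmd VOLUME INTERLACE_KB SLICE SLICE SLICE...
raid5_cmd() {
    vol=$1
    kb=$2
    shift 2
    case $kb in
        ''|*[!0-9]*) echo "interlace must be a whole number of KB" >&2; return 1 ;;
    esac
    # The stripe size must lie between 8 KB and 100 KB.
    if [ "$kb" -lt 8 ] || [ "$kb" -gt 100 ]; then
        echo "interlace ${kb}k is outside 8k-100k" >&2
        return 1
    fi
    # RAID 5 needs at least three components (general RAID 5 rule).
    if [ $# -lt 3 ]; then
        echo "need at least three slices" >&2
        return 1
    fi
    printf 'metainit %s -r %s -i %sk\n' "$vol" "$*" "$kb"
}

# Placeholder slice names, for illustration only.
raid5_cmd d30 96 c1t1d0s2 c2t1d0s2 c3t1d0s2
```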

Don't try to access a RAID 5 volume until it has finished initializing. This can take a while. You can check its status with the metastat command.

You can replace a failed RAID 5 component volume using the metareplace command, as in this example:

# metareplace -e d30 c2t5d0s2

Alternatively, you can define a hot spare pool from which disks can be taken as needed for all RAID 5 devices. For example, these commands create a pool named hsp001^[34] and designate it for use with RAID 5 device d30:

^[34] Hot spare pool names must be of the form hspnnn, where nnn ranges from 000 to 999. Why you would need 1000 hot spare pools for 128 volumes is a good question.

# metainit hsp001 c3t1d0s2 c3t2d0s2
# metaparam -h hsp001 d30

You can modify the disks in a hot spare pool using the metahs command and its -a (add), -r (replace), and -d (delete) options.

The last Volume Manager feature we'll consider is soft partitions. Soft partitions are simply logical partitions (subsets) of a disk. For example, the following command creates a volume consisting of 2 GB from the specified disk:

# metainit d7 -p c2t6d0s2 2g

When used with a new disk, you can add the -e option to the command. This causes the disk to be repartitioned so that all but 4 MB is in slice 0 (the 4 MB is in slice 7 and is used to hold a state database replica). For example, this command performs that repartitioning and then assigns 3 GB of slice 0 to volume d8:

# metainit d8 -p -e c2t5d0s2 3g

Once volumes are created, you can create a UFS filesystem on them using newfs as usual. You can also remove any volume with the metaclear command, which takes the desired volume as its argument. Naturally, any data on the volume will be lost.

The following commands are useful for obtaining information about the Volume Manager and individual volumes:

metadb: List all state database replicas.
metadb -i: Show status of state database replicas.
metastat dn: Show volume status.
metaparam dn: Show volume settings.

10.3.3.8 Linux

Linux systems can use both a logical volume manager and software disk striping and RAID, although the two facilities are separate. They are compatible, however; for example, RAID volumes can be used as components in the logical volume manager.

The Linux Logical Volume Manager (LVM) project has been in existence for several years (its homepage is http://www.sistina.com/products_lvm.htm), and support for the LVM is merged into the Linux 2.4 kernel. Conceptually, the LVM allows you to combine and divide physical disk partitions in a completely flexible manner. The resulting filesystems are dynamically resizable. The current version of the LVM supports up to 99 volume groups and 256 logical volumes. The maximum logical volume size is currently 256 GB.

The logical volume manager is included in some recent Linux distributions (for example, SuSE Linux 6.4 and later). If it is not included in yours, installing it is quite straightforward:

Download the LVM package and the appropriate kernel patch for your system.
Unpack and build the LVM package.
If necessary, patch the kernel source code and build a new kernel, enabling LVM support during the kernel configuration process. One way to do this is to use the make xconfig command. Use the Block Devices button from the main menu.
If you have selected modular support for the LVM, add entries to /etc/modules.conf to enable the modprobe command to load the LVM module at boot time. Here are the needed entries:
```
    alias  block-major-58  lvm-mod
    alias  char-major-109  lvm-mod
```
Install the new kernel into the boot directory, and enable its use with LILO or GRUB.
Modify the system startup and shutdown scripts to activate and deactivate the LVM configuration. Add these commands to the startup scripts:
```
    vgscan                     # Search for volume groups
    vgchange -a y              # Activate all volume groups
```

Add this command to the shutdown script:

    vgchange -a n              # Deactivate all volume groups

Reboot the system using the new kernel.

The LVM package includes a large number of administrative utilities, each of which is designed to create or manipulate a specific type of LVM entity. For example, the commands vgcreate, vgdisplay, vgchange, and vgremove create, display information about, modify the characteristics of, and delete a volume group (respectively). You can also back up and restore the volume group configurations with vgcfgbackup and vgcfgrestore, change the size of a volume group with vgextend (increase its size by adding disk space to it) and vgreduce (decrease its size), divide and combine volume groups (vgsplit and vgmerge), move a volume group between computer systems (vgexport and vgimport), search all local disks for volume groups (vgscan), and rename a volume group (vgrename). (Many of these commands are similar to the HP-UX equivalents.)

There are similar commands for other LVM entities:

Physical volumes: pvcreate, pvdisplay, pvchange, pvmove, and pvscan.
Logical volumes: lvcreate, lvdisplay, lvchange, lvremove, lvreduce, lvextend, lvscan, and lvrename.

Let's look at some of these commands in action as we create a volume group and some logical volumes and then build filesystems on them.

The first step is to set the partition type of the desired disk partitions to 0x8E. We use fdisk for this task; here is the process for the first disk partition:

# fdisk /dev/sdb
Command (m for help): t
Partition number (1-4): 1
Hex code (type L to list codes): 8e
Command (m for help): w

The first time we use the LVM, we need to run vgscan to initialize the facility (among other things, it creates the /etc/lvmtab file). Next, we designate the disk partitions as physical volumes by specifying the desired disk partitions as command arguments to the pvcreate command (/dev/sdc2 is the second partition we will be using in our volume group):

# pvcreate /dev/sdb1 /dev/sdc2
pvcreate -- reinitializing physical volume
pvcreate -- physical volume "/dev/sdb1" successfully created
...

We are now ready to create a volume group, which we will name vg1:

# vgcreate vg1 /dev/sdb1 /dev/sdc2
vgcreate -- INFO: using default physical extent size 4 MB
vgcreate -- INFO: maximum logical volume size is 255.99 Gigabyte
vgcreate -- doing automatic backup of volume group "vg1"
vgcreate -- volume group "vg1" successfully created and activated

This command creates the vg1 volume group using the two specified disk partitions. In doing so, it creates/updates the ASCII configuration file /etc/lvmtab (which holds the names of the system's volume groups) and places a binary configuration file into two subdirectories of /etc: lvmtab.d/vg1 and lvmconf/vg1.conf (the latter directory will also store old binary configuration files for this volume group, reflecting changes to its characteristics and components).

The vgcreate command also creates the special file /dev/vg1/group, which can be used to refer to the volume group as a device.

Now we can create two 800 MB logical volumes:

# lvcreate -L 800M -n chem_lv vg1
lvcreate -- doing automatic backup of "vg1"
lvcreate -- logical volume "/dev/vg1/chem_lv" successfully created
# lvcreate -L 800M -n bio_lv -r 8 -C y vg1
lvcreate -- doing automatic backup of "vg1"
lvcreate -- logical volume "/dev/vg1/bio_lv" successfully created

We set the sizes of both logical volumes via the lvcreate command's -L option. In the case of the second logical volume, bio_lv, we also specify that the read-ahead mode chunk size is 8 sectors via -r (the amount of data returned at a time during sequential access) and specify that a contiguous logical volume be created (via the -C y option).
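Because the LVM hands out space in whole physical extents (4 MB each by default, according to the vgcreate output), a requested size translates into an extent count. The following sketch shows the arithmetic. The function name extents_for is hypothetical, and rounding a partial extent up to a whole one is my assumption about how such a request would be satisfied.

```shell
#!/bin/sh
# Number of extents needed to hold SIZE_MB, rounding up.
# Usage: extents_for SIZE_MB [EXTENT_MB]   (extent size defaults to 4 MB)
extents_for() {
    size_mb=$1
    pe_mb=${2:-4}
    echo $(( (size_mb + pe_mb - 1) / pe_mb ))
}

extents_for 800      # an 800 MB volume in 4 MB extents
extents_for 801      # a partial extent counts as a whole one
```

The two calls print 200 and 201. The same arithmetic suggests where the reported maximum logical volume size may come from: 65534 extents of 4 MB would total 262136 MB, or 255.99 GB. That extent limit is an inference from the message, not something stated in the output.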

Once again, two new special files are created, each named after the corresponding logical volume and located under the volume group directory in /dev (here, /dev/vg1).

We can now create filesystems using the ordinary mke2fs command, specifying the logical volume as the device on which to build the new filesystem. For example, the following command creates an ext3 filesystem on the bio_lv logical volume:

# mke2fs -j /dev/vg1/bio_lv

Once built, this filesystem may be mounted as usual. You can also build a Reiser filesystem on a logical volume.

In addition to the previously mentioned commands, the LVM provides the e2fsadm command, which can be used to increase the size of a logical volume and the ext2 or ext3 filesystem it contains in a single, nondestructive operation. This utility requires the resize2fs utility (originally developed by PowerQuest as part of its PartitionMagic product and now available under the GPL at http://e2fsprogs.sourceforge.net).

Here is an example of its use; the following command adds 100 MB to the bio_lv logical volume and the filesystem that it contains:

# umount /dev/vg1/bio_lv
# e2fsadm /dev/vg1/bio_lv -L+100M
e2fsck 1.18, 11-Nov-1999 for EXT2 FS 0.5b, 95/08/09
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/vg1/bio_lv: 11/51200 files (0.0% non-contiguous), 6476/819200 blocks
lvextend -- extending logical volume "/dev/vg1/bio_lv" to 900 MB
lvextend -- doing automatic backup of volume group "vg1"
lvextend -- logical volume "/dev/vg1/bio_lv" successfully extended
resize2fs 1.19 (13-Jul-2000)
Begin pass 1 (max = 5)
Extending the inode table     XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Begin pass 3 (max = 25)
Scanning inode table          XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
The filesystem on /dev/vg1/bio_lv is now 921600 blocks long.
e2fsadm -- ext2fs in logical volume "/dev/vg1/bio_lv" successfully extended to 900 MB

Note that the filesystem must be unmounted in order to increase its size.

To use the Linux software RAID facility, you must install the component disks, enable RAID support in the kernel and then set up the RAID configuration. You can perform the second task using a utility like make xconfig and selecting the Block Devices category from the main menu. The Multiple devices driver support item is the one that must be enabled to access all of the other RAID-related items. I recommend enabling all of them.

RAID devices use special files of the form /dev/mdn (where n is an integer), and they are defined in the /etc/raidtab configuration file. Once they are defined, you can create them using the mkraid command and start and stop them with the raidstart and raidstop commands. Alternatively, you can define them with the persistent superblock option, which enables automatic detection and mounting/dismounting of RAID devices by the kernel. In my view, the latter is always the best choice.

The best way to understand the /etc/raidtab file is to examine some sample entries. Here is an entry corresponding to a striped disk using two component disks, which I have annotated:

raiddev /dev/md0                      Defines RAID device 0.
raid-level 0                          RAID level.
nr-raid-disks 2                       Number of component disks.
chunk-size 64                         Stripe size (in KB).
persistent-superblock 1               Enable the persistent superblock feature.
device /dev/sdc1                      Specify the first component disk ...
raid-disk 0                               and number it.
device /dev/sdd1                      Same for all remaining component disks.
raid-disk 1

If we had wanted to define a two-way mirror set instead of a stripe set, using the same disks, we would omit the chunk-size parameter and change the raid-level parameter from 0 to 1 in the first section, and the rest of the entry would remain the same.
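Since the stripe and mirror entries differ only in the raid-level and chunk-size lines, they are easy to generate. Here is a minimal sketch. The function raidtab_entry is hypothetical, it writes to standard output rather than to /etc/raidtab, and the indentation it emits is purely cosmetic.

```shell
#!/bin/sh
# Hypothetical helper: print a raidtab entry for a striped (level 0)
# or mirrored (level 1) set.
# Usage: raidtab_entry MD_DEVICE LEVEL CHUNK_KB DEVICE...
raidtab_entry() {
    md=$1
    level=$2
    chunk=$3
    shift 3
    echo "raiddev $md"
    echo "    raid-level $level"
    echo "    nr-raid-disks $#"
    # The mirror entry omits chunk-size, so emit it only for level 0.
    if [ "$level" -eq 0 ]; then
        echo "    chunk-size $chunk"
    fi
    echo "    persistent-superblock 1"
    n=0
    for dev in "$@"; do
        echo "    device $dev"
        echo "    raid-disk $n"     # component disks are numbered from 0
        n=$((n + 1))
    done
}

raidtab_entry /dev/md0 0 64 /dev/sdc1 /dev/sdd1    # stripe set
raidtab_entry /dev/md0 1 64 /dev/sdc1 /dev/sdd1    # mirror; chunk size ignored
```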

We can set up a RAID 0+1 disk, a mirrored striped disk, in this way:

raiddev /dev/md0 ...                   Set up the first striped disk.
raiddev /dev/md1 ...                   Set up the second striped disk.
raiddev /dev/md2
raid-level 1
nr-raid-disks 2
persistent-superblock 1
device /dev/md0                        The component disks are also md devices.
raid-disk 0
device /dev/md1
raid-disk 1

The following entry defines a RAID 5 disk containing 5 component disks, as well as a spare disk to be automatically used should any of the active disks fail:

raiddev /dev/md0
raid-level 5                           Use RAID level 5.
nr-raid-disks 5                        Number of active disks in the device.
persistent-superblock 1
device /dev/sdc1                       Specify the 5 component disks.
raid-disk 0
device /dev/sdd1
raid-disk 1
device /dev/sde1
raid-disk 2
device /dev/sdf1
raid-disk 3
device /dev/sdg1
raid-disk 4
device /dev/sdh1                       Specify a spare disk.
spare-disk 0

You can use multiple spare disks if you want to.

RAID devices can be used with the logical volume manager if desired.

10.3.3.9 FreeBSD

FreeBSD provides the Vinum Volume Manager. It uses somewhat different concepts than other LVMs. Under Vinum, a drive is a physical disk partition. Disk space is allocated from drives in user-specified chunks known as subdisks. Subdisks in turn are used to define plexes, and one or more plexes make up a Vinum volume. Multiple plexes within a volume constitute mirrors.

NOTE


Be prepared to be very patient when learning Vinum. It is quite inflexible in how it wants operations to be performed. Plan to learn the procedures on a safe test system.

In addition, be aware that the facility is still under development. As of this writing, only the most basic functionality is present.

To use a disk partition with Vinum, it must be prepared as follows:

Create one or more slices on it using fdisk.
Create an initial disk label using disklabel or sysinstall. I prefer the latter. If you choose to use sysinstall, create a single swap partition in each slice that you want to use with Vinum. Ignore the messages about it being unable to start the swap partition (you don't want it to anyway).
Modify the disk label using disklabel -e. Only the partition list at the bottom will need to be changed. It must look like this when you are done:
```
    #        size   offset    fstype   [fsize bsize bps/cpg]
      c: 11741184        0    unused              # (Cyl.    0 - 11647)
      e: 11741184        0    vinum
```
If you used sysinstall to create the initial disk label, all you have to do is add the final line.

Partition e is somewhat arbitrary, but it works. Note that partition c cannot be used with Vinum.

Once the drives are prepared, the best way to proceed is to create a description file that defines the Vinum entities that you want to create. Here is a file that defines a volume named big:

drive d1 device /dev/da1s1e            Define drives.
drive d2 device /dev/da2s1e
volume big                             Define volume big.
  plex org concat                      Create a concatenated plex.
    sd length 500m drive d1            First 500 MB subdisk from drive d1.
    sd length 200m drive d2            Second 200 MB subdisk from drive d2.

The file first defines the drives to be used, naming them d1 and d2. Note that this operation needs to be performed only once for a given partition. Future example configurations will omit drive definitions.

The second section of the file defines the volume big as one concatenated plex (org concat). It consists of two subdisks: 500 MB of space from /dev/da1s1e and 200 MB of space from /dev/da2s1e. This disk space will be treated as a single unit.

You can create these entities using the following command:

# vinum create /etc/vinum.big.conf

The final argument specifies the location of the description file.

Once the volume is created, you can create a filesystem on it:

# newfs -v /dev/vinum/big

The device is specified via the file in /dev/vinum named for the volume. The -v option tells newfs not to look for partitions on the specified device. Once newfs completes, the filesystem may be mounted. For it to be detected properly at boot time, however, the following line must be present in /etc/rc.conf:

start_vinum="YES"

This causes the Vinum kernel module to be loaded at boot time.

Here is a description file that defines a striped (RAID 0) volume:

volume fast
  plex org striped 1024k
    sd length 0 drive d1
    sd length 0 drive d2

This stripe set consists of two components. The plex line has an additional entry, the stripe size. This value must be a multiple of 512 bytes. The subdisk definitions specify a length of 0; this corresponds to all available space in the device. The actual volume can be created using the vinum create command as before.
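A description file like this one can be produced by a short script, which also gives us a place to check the stripe size rule. The sketch below is illustrative only: size_bytes and striped_volume are hypothetical names, only the k and m suffixes are handled, and treating a bare number as a byte count is an assumption on my part.

```shell
#!/bin/sh
# Convert a size such as 1024k or 1m to bytes.
# Assumption: a bare number is a byte count.
size_bytes() {
    case $1 in
        *[kK]) echo $(( ${1%?} * 1024 )) ;;
        *[mM]) echo $(( ${1%?} * 1024 * 1024 )) ;;
        *)     echo $(( $1 )) ;;
    esac
}

# Hypothetical helper: print a Vinum description for a striped volume.
# Usage: striped_volume NAME STRIPE_SIZE DRIVE...
striped_volume() {
    name=$1
    stripe=$2
    shift 2
    # The stripe size must be a multiple of 512 bytes.
    if [ $(( $(size_bytes "$stripe") % 512 )) -ne 0 ]; then
        echo "stripe size $stripe is not a multiple of 512 bytes" >&2
        return 1
    fi
    echo "volume $name"
    echo "  plex org striped $stripe"
    for drive in "$@"; do
        echo "    sd length 0 drive $drive"    # length 0: all available space
    done
}

striped_volume fast 1024k d1 d2
```

The output can be redirected to a file and passed to vinum create in the usual way.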

If both of these volumes were created, then different areas of the various disk partitions would be used by each one. Vinum drives can be subdivided among different volumes. You can specify the location with the drive when the subdisk is created (see the vinum(8) manual page for details).

The following configuration file creates a mirrored volume by defining two plexes:

volume mirror
  plex org concat                      First mirror.
    sd length 1000m drive d1
  plex org concat                      Second mirror.
    sd length 1000m drive d2

Creating and activating the mirrored volume requires several vinum commands (the output is not shown):

# vinum create file                     Create the volume.
# vinum init mirror.p1                  Initialize the subdisk. Wait for command to finish.
# vinum start mirror.p1                 Activate the mirror.

When you first create a mirrored volume, the state of the second plex appears in status listings as faulty, and its component subdisk has a status of empty. The vinum init command initializes all of the component subdisks of plex mirror.p1, and the vinum start command regenerates the mirror (actually, creates it for the first time). Both of these commands start background processes to do the actual work, and you must wait for the initialization to finish before running the regeneration. You can check on their status using this command:

# vinum list

Once both of these commands have completed, you can build a filesystem and mount it.

The following description file creates a RAID 5 volume named safe:

volume safe
  plex org raid5 1024k
    sd length 0 drive d1
    sd length 0 drive d2
    sd length 0 drive d3
    sd length 0 drive d4
    sd length 0 drive d5

This volume consists of a single plex containing five subdisks. The following commands can be used to create and activate the volume:

# vinum create file                     Create the volume.
# vinum init safe.p0                    Initialize the subdisks.

Once again, the initialization process runs in the background, and you must wait for it to finish before creating a filesystem.

As a final example, consider this description file:

volume zebra
  plex org striped 1024k
    sd length 200m drive d1
    sd length 200m drive d2
  plex org striped 1024k
    sd length 200m drive d3
    sd length 200m drive d4

This file defines a volume named zebra, which is a striped mirrored volume (RAID 0+1). The volume consists of two striped plexes which become mirrors. The following commands are required to create and activate this volume:

# vinum create file                     Create the volume.
# vinum init zebra.p0 zebra.p1          Initialize subdisks.
# vinum start zebra.p1                  Regenerate the mirror.

The following commands are useful for displaying Vinum information:

vinum list: Display information about all Vinum entities.
vinum ld: List drives, including current free space.
vinum lv: List volumes.
vinum ls: List subdisks.
vinum ls -v: Display subdisk details, including the plex they are part of and their component drives.
vinum lp: List plexes.
vinum lp -v: Display plex details, including the volume they belong to.

You can follow any of these commands with the name of a specific item to limit the display to its characteristics.

Here is an example of the vinum list command:

4 drives:
D d1                State: up       Device /dev/ad1s1e      Avail: 2799/2999 MB (93%)
D d2                State: up       Device /dev/ad1s2e      Avail: 2799/2999 MB (93%)
D d3                State: up       Device /dev/ad1s3e      Avail: 2799/2999 MB (93%)
D d4                State: up       Device /dev/ad1s4e      Avail: 532/732 MB (72%)
1 volumes:
V zebra             State: up       Plexes:       2 Size:        400 MB
2 plexes:
P zebra.p0        S State: up       Subdisks:     2 Size:        400 MB
P zebra.p1        S State: faulty   Subdisks:     2 Size:        400 MB
4 subdisks:
S zebra.p0.s0       State: up       PO:        0  B Size:        200 MB
S zebra.p0.s1       State: up       PO:     1024 kB Size:        200 MB
S zebra.p1.s0       State: R 16%    PO:        0  B Size:        200 MB
S zebra.p1.s1       State: R 16%    PO:     1024 kB Size:        200 MB

This display shows the zebra volume we defined earlier. The subdisk initialization has completed. At this moment, the regeneration operation is 16% complete.
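If you want to watch a regeneration from a script, the subdisk lines can be picked out with awk. This is only a sketch: it assumes that the listing always has the column layout shown in the display above (an S in the first column, and a state of R followed by a percentage for subdisks being regenerated), and the function name reviving_subdisks is hypothetical. The sample lines are copied from that display so the sketch runs without Vinum.

```shell
#!/bin/sh
# Hypothetical helper: read "vinum list" output on standard input and
# print the name and progress of each subdisk still being regenerated.
reviving_subdisks() {
    awk '$1 == "S" && $3 == "State:" && $4 == "R" { print $2, $5 }'
}

# Sample lines copied from the listing, standing in for real output.
reviving_subdisks <<'EOF'
P zebra.p1        S State: faulty   Subdisks:     2 Size:        400 MB
S zebra.p0.s1       State: up       PO:     1024 kB Size:        200 MB
S zebra.p1.s0       State: R 16%    PO:        0  B Size:        200 MB
S zebra.p1.s1       State: R 16%    PO:     1024 kB Size:        200 MB
EOF
```

On a live system, the same function would be fed with vinum list | reviving_subdisks.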

10.3.4 Floppy Disks

On systems with floppy disk drives, Unix filesystems may also be created on floppy disks. (Before they can be used, floppy disks must, of course, be formatted.) But why bother? These days, it is usually much more convenient to use floppy disks in one of the following ways:

Mounted as a DOS-type filesystem whose files can then be accessed with standard utilities like cp and ls.
Using special utilities designed to read and write files to and from DOS disks (we'll look at specific examples in a minute).

10.3.4.1 Floppy disk special files

Floppy disks are accessed using the following special files (the default refers to a 1.44 MB 3.5-inch diskette):

AIX	/dev/fd0
FreeBSD	/dev/fd0
HP-UX	/dev/dsk/c0t1d0 (Normal disk naming convention)
Linux	/dev/fd0
Solaris	/dev/diskette
Tru64	/dev/fd0

Floppy disk special files are only occasionally needed on Solaris systems, because these devices are managed by the media handling daemon (discussed later in this chapter).

10.3.4.2 Using DOS disks on Unix systems

Methods for accessing DOS disks vary widely from system to system. In this section, we'll look at formatting diskettes in DOS format and copying files to and from them on each system.

Under HP-UX, the following commands format a DOS floppy disk:

$ mediainit -v -i2 -f16 /dev/rdsk/c0t1d0
$ newfs -n /dev/rdsk/c0t1d0 ibm1440

The -n option on the newfs command prevents boot information from being written to the diskette.

HP-UX provides a number of utilities to access files on DOS diskettes: doscp, dosdf, doschmod, dosls, dosll, dosmkdir, dosrm, and dosrmdir. Here is an example using doscp:

$ doscp /dev/rdsk/c0t1d0:paper.txt paper.new

This command copies the file paper.txt from the diskette to the current HP-UX directory.

On Linux and FreeBSD systems, a similar process is used. These commands format a DOS floppy and write files to it:

Linux

# fdformat /dev/fd0
# mkfs -t msdos /dev/fd0
# mount /dev/fd0 /mnt
# cp prop2.txt /mnt
# umount /mnt

FreeBSD

# fdformat /dev/fd0
# newfs_msdos /dev/fd0
# mount /dev/fd0 /mnt
# cp prop2.txt /mnt
# umount /mnt

The Mtools utilities are also available on Linux and FreeBSD systems (described in the next section).

AIX also provides several utilities for accessing DOS disks: dosformat, dosread, doswrite, dosdir, and dosdel. However, they provide only minimal functionality (for example, there is no wildcard support), so you'll be much happier and work more efficiently if you use the Mtools utilities.

On Solaris systems, diskettes are controlled by the volume management system and its vold daemon. This facility merges the diskette as transparently as possible within the normal Solaris filesystem.

These commands could be used to format a diskette and create a DOS filesystem on it:

$ volcheck
$ fdformat -d -b g04

The volcheck command tells the volume management system to look for new media in the devices that it controls. The fdformat command formats the diskette, giving it a label of g04.

The following commands illustrate the method for copying files to and from diskette:

$ volcheck
$ cp ~/proposals/prop2.txt /floppy/g96
$ cp /floppy/g96/drug888.dat ./data
$ eject

The diskette is mounted in a subdirectory of /floppy named for its label (or in /floppy/unnamed_floppy if it does not have a label). Configuration of vold is discussed later in this chapter.

Tru64 provides no support for DOS diskettes, so you'll need to use the Mtools utilities, to which we will now turn.

10.3.4.3 The Mtools utilities

The Mtools package is available for all the Unix versions we are considering. It is currently maintained by David Niemi and Alain Knaff (see http://mtools.linux.lu).

The package contains a series of utilities for accessing DOS diskettes and their files, modeled after their similarly named DOS counterparts:

mformat: Format a diskette in DOS format.
mlabel: Label a DOS diskette.
mcd: Change the current directory location on the diskette.
mdir: List the contents of a directory on a DOS diskette.
mtype: Display the contents of a DOS file.
mcopy: Copy files between a DOS diskette and Unix.
mdel: Delete file(s) on a DOS diskette.
mren: Rename a file on a DOS diskette.
mmd: Create a subdirectory on a DOS diskette.
mrd: Remove a subdirectory from a DOS diskette.
mattrib: Change DOS file attributes.

Here are some examples of using the Mtools utilities:

$ mdir
 Volume in drive A is GIAO24
 Directory for A:/
SILVER   DAT        79   1-29-95   9:36p
PROP43_1 TXT      2304   1-29-95   9:33p
REFCARD  DOC     73216   1-13-95   5:28p
       3 File(s)     1381376 bytes free
$ mren prop43_1.txt prop43_1.old
$ mcopy a:refcard.doc .
 Copying REFCARD.DOC
$ mcopy proposal.txt a:
 Copying PROPOSAL.TXT
$ mmd data2
$ mcopy gold* a:data2
 Copying GOLD.DAT
 Copying GOLD1.DAT
$ mcopy "a:\data\*.dat" ./data
 Copying NA.DAT
 Copying HG.DAT
$ mdel silver.dat

As these examples illustrate, the Mtools utilities are designed to make accessing diskettes as painless as possible. For example, they generally assume that files being referred to are on the floppy disk. The only time that you have to refer explicitly to the diskette via the a: construct is with the mcopy command, which makes sense because there is no other way to know in which direction the copy is taking place. Note also that filenames on diskette are not case-sensitive.

10.3.4.4 Stupid DOS partition tricks

On PC-based Unix systems, hard-disk DOS partitions can also be mounted within the Unix filesystem. This allows not only for copying files between Unix and the other operating systems, but also for handling the entire partition using Unix utilities. For example, suppose you decide to change the partitioning scheme on your boot disk, decreasing the size of the DOS partition (without affecting the Unix partitions). The following commands will let you do so without reinstalling DOS, Windows, or any installed software:

# mount -t msdos /dev/hda1 /mnt        Linux is used as an example.
# cd /mnt
# tar -c -f /tmp/dos.tar *
# cd /; umount /mnt
Mess with partitions and/or filesystems.
# mount -t msdos /dev/hda1 /mnt
# cd /mnt
# tar -x -f /tmp/dos.tar
# cd /; umount /mnt

You could restore only some of the files from the tar archive if that is what made sense. Many other operations along these lines are also possible: for example, moving the DOS partition from the first hard drive to the second one, copying a DOS partition between systems or across a network, and so on. There are, of course, other ways of accomplishing these same tasks, but this procedure is often much faster.
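Before trying this on a real partition, it is worth rehearsing the archive-and-restore cycle somewhere harmless. The sketch below does so entirely within a scratch directory: the before and after directories stand in for the DOS partition before and after repartitioning, the file names are made up, and nothing is mounted. The function name rehearse_restore is hypothetical.

```shell
#!/bin/sh
# Rehearse the tar save/restore cycle on scratch directories.
# Nothing here touches a real partition.
rehearse_restore() {
    work=$(mktemp -d) || return 1
    src=$work/before
    dst=$work/after
    mkdir -p "$src/DATA" "$dst"
    echo "hello" > "$src/AUTOEXEC.BAT"
    echo "1 2 3" > "$src/DATA/GOLD.DAT"

    (cd "$src" && tar -c -f "$work/dos.tar" .)   # save the contents
    (cd "$dst" && tar -x -f "$work/dos.tar")     # restore them elsewhere

    # diff -r exits with status 0 only when the two trees match.
    if diff -r "$src" "$dst" >/dev/null; then
        result="restore verified"
    else
        result="restore differs"
    fi
    rm -rf "$work"
    echo "$result"
}

rehearse_restore
```

The sketch archives . rather than *, so that any files whose names begin with a dot are included as well. Comparing the restored tree against the original in this way is a cheap check to run before deleting anything.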

When the partition in question is the Windows boot partition, this procedure works very well with older and simpler Windows versions such as Windows 98 and Windows ME. For Windows NT and later, you may have to alter the Boot.Ini file to get the system to boot.

10.3.5 CD-ROM Devices

CD-ROM drives are also generally treated in a manner similar to disks. The following special files are used to access SCSI CD-ROM devices:

AIX	/dev/cd0
FreeBSD	/dev/cd0c or /dev/acd0c (SCSI or ATAPI)
Linux	/dev/cdrom
Solaris	/dev/dsk/c0tnd0s02 (Normal disk naming conventions)
HP-UX	/dev/dsk/cmtnd0 (Normal disk naming conventions)
Tru64	/dev/disk/cdrom0c

The following example commands all mount a CD on the various systems:

mount -o ro -v cdrfs /dev/cd0 /mnt            AIX
mount -r -t cd9660 /dev/cd0c /mnt             FreeBSD
mount -o ro -F cdfs /dev/dsk/c1t2d0 /mnt      HP-UX
mount -r -t iso9660 /dev/sonycd_31a /mnt      Linux
mount -o ro -F hsfs /dev/dsk/c0t2d0s0 /mnt    Solaris
mount -r -t cdfs /dev/disk/cdrom0c /mnt       Tru64

Entries can also be added to the filesystem configuration file for CD-ROM filesystems.
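
As an illustration for Linux only, the sketch below appends a CD-ROM entry to a scratch file rather than to the real /etc/fstab, then prints it back. The mount point /mnt/cdrom and the ro,noauto option choice are assumptions made for this example; other systems use different configuration files and formats.

```shell
# Hypothetical scratch copy; on a real Linux system the target is /etc/fstab.
fstab=$(mktemp)

# Fields: device, mount point, filesystem type, options, dump flag, fsck pass.
# "noauto" keeps the CD from being mounted at boot; "ro" mounts it read-only.
printf '%s\n' '/dev/cdrom /mnt/cdrom iso9660 ro,noauto 0 0' >> "$fstab"

# Show every ISO 9660 entry in the file.
awk '$3 == "iso9660" { print $1, "->", $2, "(" $4 ")" }' "$fstab"
```

With such an entry in place in the real file, a CD could be mounted by naming only the mount point (mount /mnt/cdrom).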

10.3.5.1 CD-ROM drives under AIX

On AIX systems, if you add a CD-ROM drive to an existing system, you'll need to create a device for it in this manner:

# mkdev -c cdrom -t cdrom1 -s scsi -p scsi0 -w 5,0
cd0 available

This command adds a CD-ROM device using SCSI ID 5.

Individual CDs are usually mounted via predefined mount points. For example, the following commands create a generic CD-ROM filesystem to be mounted on /cdrom:

# mkdir /cdrom
# crfs -v cdrfs -p ro -d cd0 -m /cdrom -A no

This filesystem will be mounted read-only and will not automatically be mounted when the system boots. A CD may now be mounted with the mount /cdrom command.

The lsfs command may be used to list all defined CD-ROM filesystems:

$ lsfs -v cdrfs
Name      Nodename  Mount Pt  VFS    Size  Options  Auto  Acct
/dev/cd0  --        /cdrom    cdrfs  --    ro       no    no

10.3.5.2 The Solaris media-handling daemon

Solaris has a similar media handling facility implemented by the vold daemon. It generally mounts CDs and diskettes in directory trees rooted at /cdrom and /floppy, respectively, creating a subdirectory named for the label on the current media (or unnamed_cdrom and unnamed_floppy for unlabeled ones).

There are two configuration files associated with the volume management facility. /etc/vold.conf specifies the devices that it controls and the filesystem types it supports:

# Volume Daemon Configuration file
#
# Database to use (must be first)
db db_mem.so
# Labels supported
label dos label_dos.so floppy
label cdrom label_cdrom.so cdrom
label sun label_sun.so floppy
# Devices to use
use cdrom drive /dev/dsk/c0t6 dev_cdrom.so cdrom0
use floppy drive /dev/diskette dev_floppy.so floppy0
# Actions
insert /vol*/dev/diskette[0-9]/* user=root /usr/sbin/rmmount
insert /vol*/dev/dsk/* user=root /usr/sbin/rmmount
eject /vol*/dev/diskette[0-9]/* user=root /usr/sbin/rmmount
eject /vol*/dev/dsk/* user=root /usr/sbin/rmmount
notify /vol*/rdsk/* group=tty /usr/lib/vold/volmissing -c
# List of file system types unsafe to eject
unsafe ufs hsfs pcfs

The section labeled Actions indicates commands to be run when various events occur (when media is inserted or removed, for example). The final section lists filesystem types that must be unmounted before being removed and hence will require the user to issue an eject command.

If you want to share mounted CDs via the network, you'll need to add an entry to /etc/rmmount.conf:

# Removable Media Mounter configuration file.
#
# File system identification
ident hsfs ident_hsfs.so cdrom
ident ufs ident_ufs.so cdrom floppy
ident pcfs ident_pcfs.so floppy
# Actions
action -premount floppy action_wabi.so.1
action cdrom action_filemgr.so
action floppy action_filemgr.so
# File System Sharing
share cdrom*
share solaris_2.x* -o ro:phys

File-sharing entries are in the final section of this file. An entry is provided for sharing standard CD-ROM filesystems (mounted at /cdrom/cdrom*). The -o in the second entry in this section passes options to the share command, in this case limiting access. You can modify the provided entry for CD-ROMs if appropriate. Shared CD-ROM filesystems can be mounted by other systems using the mount command and entered into their /etc/vfstab files.

Tru64 also has a vold daemon. However, it is part of its Logical Storage Manager facility and thus performs a completely different function.
