The externfs Filesystem | User Mode Linux

The `externfs` Filesystem

humfs is a special case of a more general filesystem called externfs. The purpose of externfs is to allow any reasonable external data to be imported as a UML filesystem. externfs doesn't import anything by itself; it simply makes it easy to import external data by implementing an interface, defined by externfs, to the Linux filesystem layer. externfs provides the glue between that interface and the Linux kernel VFS interface, allowing the data to appear to be a Linux filesystem.

This will allow you to mount this data as a UML filesystem and use standard utilities and scripts to examine and manipulate it. The filesystem interface hides the specialized interface normally used to access the data. By providing a common way to access the information, data sources that are normally disjointed and isolated from each other can be made to interoperate. Data can be copied from one database to a completely different database merely by copying files.

The sqlfs example, presented in Chapter 6 as a possible humfs metadata format, demonstrates this by allowing you to examine and change a database using normal Linux utilities rather than a SQL monitor. Of course, the SQL interface is still there, but it has been hidden under the Linux filesystem interface by the UML filesystem that imported it.

Essentially any structured data anywhere can be represented somehow as files and directories, and a plugin for externfs that maps the structure onto files and directories will import that data as a UML filesystem.

This is a large universe of possibilities, but which of them will actually prove to be useful? Representing data this way would be useful for any database whose contents are not readily accessible as text. Having the database available as a set of directories and files allows you to use standard utilities such as find and grep on it. It would not be so useful for any database that already uses text, such as any of the ones in /etc (e.g., the password and group files). These can already be easily analyzed and searched with the standard text utilities.

A package database might be a good candidate for this sort of treatment. rpm and dpkg have their own syntaxes for querying their databases. However, having the host's package database, including installed and available packages and the information associated with them, as a set of text files would make it unnecessary to use those syntaxes. Instead, you would use ls, cat, and find to tell you what you need to know.

For example, in order to figure out which package owns a particular file, such as /etc/passwd, you would do something like this:

UML% find /host-packages -name passwd
/host-packages/installed/setup-2.5.46-1/files/etc/passwd

The output tells you that /etc/passwd is a part of the setup-2.5.46-1 package. Similarly, you could find the package's description like this:

UML% cat /host-packages/installed/setup-2.5.46-1/description
The setup package contains a set of important system configuration and setup files, such as passwd, group, and profile.

There's no reason that the package database filesystem would be limited to importing the host's package database. The package databases of other hosts on the network could also be imported into the UML using a network-aware version of this filesystem. Mounting another host's package database would involve communicating with a daemon on the remote side. So, via this daemon, you could have a set of filesystems such as /packages/my-host, /packages/bob-host, /packages/jane-host, and /packages/web-server.

Having the package information for all the hosts on the network in one place would turn the UML into a sort of control center for the network in this regard. Then you could perform some useful operations.

Compare the configurations of different machines:

UML% ls -1 /packages/my-host/installed > /tmp/x
UML% ls -1 /packages/bob-host/installed > /tmp/y
UML% diff /tmp/x /tmp/y

Ensure that all machines on the network have the same versions of their packages installed by comparing the version files of the package subdirectories in the host package filesystems.

Install and delete packages:

UML% rm -rf /packages/my-host/installed/bc-1.06-18
UML% mv firefox-1.0.4-5.i386.rpm /packages/my-host/installed

These two operations would translate into a package removal and a package installation on the host. In the installation example, the firefox RPM file would be copied out to the host and installed. Then a firefox subdirectory would appear in the /packages/my-host/installed directory.
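The version comparison mentioned in the list above could be sketched in shell roughly as follows. The layout is an assumption made for illustration: each package is taken to be a directory named after the package alone, holding a one-line `version` file, and the `compare_versions` helper is hypothetical. The sketch builds a mock tree under a temporary directory, so it runs without any real package filesystem.

```shell
#!/bin/sh
# Mock stand-in for the hypothetical /packages filesystem. Every path and
# file name below is an assumption, not a real externfs layout.
root=$(mktemp -d)
mkdir -p "$root/my-host/installed/bc" "$root/my-host/installed/setup"
mkdir -p "$root/bob-host/installed/bc" "$root/bob-host/installed/setup"
echo 1.06-18  > "$root/my-host/installed/bc/version"
echo 1.06-17  > "$root/bob-host/installed/bc/version"
echo 2.5.46-1 > "$root/my-host/installed/setup/version"
echo 2.5.46-1 > "$root/bob-host/installed/setup/version"

# Report every package on host $1 that is missing from, or has a
# different version on, host $2.
compare_versions() {
    a=$1
    b=$2
    for dir in "$a"/installed/*/; do
        pkg=$(basename "$dir")
        other="$b/installed/$pkg/version"
        if [ ! -f "$other" ]; then
            echo "$pkg: missing on $(basename "$b")"
        elif ! cmp -s "$dir/version" "$other"; then
            echo "$pkg: $(cat "$dir/version") vs $(cat "$other")"
        fi
    done
}

mismatches=$(compare_versions "$root/my-host" "$root/bob-host")
echo "$mismatches"
rm -rf "$root"
```

Because the data is just files, the whole check is a loop over directories plus `cmp`; no package-manager query syntax is involved.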

If you wanted to enforce a policy that all configuration changes to any machine on the network would have to be done from this UML control console, the daemon on each host would maintain a lock on the package database. This would prevent any changes from happening locally. Since these daemons would be controlled from the UML instance, configuration changes to any of the hosts could be done only from the UML instance through this filesystem.

If a number of machines needed to have the same configurations, you could also have them all mounted in the same place in the UML control console. Operations within this filesystem would be multiplexed to all of the hosts. So, installing a new package through this filesystem would result in the package being copied to all of the hosts and installed on all of them. Similarly, removing a package would result in it being removed from all the hosts.

You can consider using a UML as a similar control console for any other system administration database. Using it to manage the host's password or group files is probably not practical, as I mentioned earlier. However, it may be useful to manage the password or group files for a network, if you're not using an existing distributed mechanism, such as NIS, for them.

You could take this control console idea further and use an externfs plugin to front a number of databases on the network, not just one. For example, consider a large organization with several levels of management and an externfs-based filesystem that allows a mirror of this organization to be built in it. So, every manager would be represented by a directory that contains a directory for each person who reports directly to that manager. If some of these reporting people were also managers, there would be another level of directories further down. Hiring a new person would involve creating a directory underneath the hiring manager. The filesystem would see this directory creation and perform the necessary system administration tasks, such as:

Creating login and mail accounts
Adding the new person to the appropriate groups and mailing lists
Updating online organization charts and performing other organization-specific tasks

Similarly, removing a person's directory would result in the reversal of all of these tasks.
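To make the directory layout concrete, here is a small shell sketch of such a hierarchy built in a temporary directory. The names (carol, dave, erin, frank) and the `direct_reports` helper are invented for illustration. Since no externfs plugin is involved, the `mkdir` and `rmdir` calls here only change directories; in the scenario described above they would also trigger the administrative tasks.

```shell
#!/bin/sh
# Mock organization tree; all names are hypothetical.
org=$(mktemp -d)

# carol manages dave, who in turn manages erin.
mkdir -p "$org/carol/dave/erin"

# "Hiring" frank under carol is a directory creation.
mkdir "$org/carol/frank"

# List the immediate subdirectories of a manager's directory.
direct_reports() {
    for d in "$1"/*/; do
        [ -d "$d" ] && basename "$d"
    done
}

reports=$(direct_reports "$org/carol" | paste -sd ' ' -)
echo "carol's direct reports: $reports"

# erin leaving is a directory removal.
rmdir "$org/carol/dave/erin"
if [ -d "$org/carol/dave/erin" ]; then
    erin_present=yes
else
    erin_present=no
fi
echo "erin still listed: $erin_present"

rm -rf "$org"
```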

Performing these tasks would not need to be done by hand, nor would it require a specialized application to manage the whole process. It would be done by changing files and directories in this special filesystem and tying those changes to the necessary actions on the network. I'm not suggesting that someone would be literally running the mkdir and rmdir utilities in a shell whenever someone is hired or leaves, although that would work. There would likely be a graphical interface for doing this, and it would likely be customized for this task, to simplify the input of the required information. However, putting it in a filesystem makes this information available in a standardized way at a low enough level that any sort of application, from a shell script to a customized graphical interface, can be written to manipulate it.

If the filesystem contains sensitive data, such as pay rates or home addresses, Linux file permissions can help prevent unauthorized people from seeing that information. Each piece of data about an employee could potentially be in its own file, with user and group ownership and permissions that restrict access to people who are allowed to view the information.
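A minimal sketch of that per-file permission scheme, using a mock employee directory in a temporary location. The file names (`phone`, `pay_rate`), their contents, and the modes are assumptions chosen for illustration. A real deployment would also set group ownership with `chgrp`, which is omitted here because it depends on the groups that exist on the system.

```shell
#!/bin/sh
# Mock employee directory; the layout is hypothetical.
base=$(mktemp -d)
emp="$base/carol/dave"
mkdir -p "$emp"
echo "555-0100" > "$emp/phone"
echo "52000" > "$emp/pay_rate"

chmod 644 "$emp/phone"      # readable by everyone
chmod 640 "$emp/pay_rate"   # owner read/write, group read, others nothing

# stat -c is the GNU coreutils form; other systems use different flags.
phone_mode=$(stat -c %a "$emp/phone")
pay_mode=$(stat -c %a "$emp/pay_rate")
echo "phone=$phone_mode pay_rate=$pay_mode"

rm -rf "$base"
```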

This example seems to fit a filesystem particularly well. No doubt there are others. UML's externfs allows this sort of information to be plugged into a UML as a filesystem, where it can be viewed and manipulated by any tools that know how to deal with files and directories.

This scenario is not as far out in left field as it may appear. Practically every Linux system in the world is doing something similar by providing a unified interface to a number of disparate databases. A typical Linux system contains the following:

At least one, and often more, disk-based filesystems such as ext2, ext3, reiserfs, or xfs
A number of virtual, kernel-based filesystems such as procfs, sysfs, and devpts
Usually at least one CD or DVD filesystem
Often some devices such as MP3 players or cameras that represent themselves as storage devices with FAT or HFS filesystems

You can think of all of these as being different kinds of databases to which the Linux VFS layer is providing a uniform interface. This lets you transparently move data between these different databases (as with ls -l /proc > /tmp/processes copying data from the kernel to /tmp) and transparently search them. You don't need to be concerned about the underlying representation of the data, which differs greatly from filesystem to filesystem.

What I described above is close to the same thing, except that my example uses the Linux VFS interface to provide the same sort of access to a different class of databases: personnel databases, corporate phone books, and so on. In principle, these are no different from the on-disk databases your files are stored in. I'd like to see access to these be as transparent and unified as access to your disks, devices, and internal kernel information is now.

externfs provides the framework for making this access possible. Each different kind of database that needs to be imported into a UML instance would need an externfs plugin that knows how to access it. With that written, the database can be imported as a Linux filesystem. At that point, the files and directories can be rearranged as necessary with Linux bind mounts. In the example above, the overall directory hierarchy can be imported from the corporate personnel database. Information like phone numbers and office locations may be in another database. Those files can be bind-mounted into the employee hierarchy, so that when you look at the directory for an employee, all of that person's information is present there, even though it's coming from a number of different databases.
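The bind-mount step could look roughly like the following sketch. The mount points (`/org` for the imported personnel hierarchy, `/phonebook` for a second imported database), the employee path, and the helper names are all hypothetical. Because `mount --bind` needs root and real mount points, the script only prints the commands it would run unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Hypothetical mount points: /org is the imported personnel hierarchy,
# /phonebook is a separately imported phone database.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = source file in another database, $2 = target inside /org.
bind_employee_file() {
    # A bind mount needs an existing target, so create an empty file first.
    run touch "$2"
    run mount --bind "$1" "$2"
}

plan=$(bind_employee_file /phonebook/dave/phone /org/carol/dave/phone)
echo "$plan"
```

Whether the target filesystem would accept the placeholder file created by `touch` depends on the plugin behind it, so treat that step as an assumption as well.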

The infrastructure to provide a transparent, unified interface to these different databases already exists. The one thing lacking is the modules needed to turn them into filesystems.