Chapter 11. Simple File Handling | Linux Application Development (paperback) (2nd Edition)

Files are the most ubiquitous resource abstraction used in the Unix world. Resources such as memory, disk space, devices, and interprocess communication (IPC) channels are represented as files. By providing a uniform abstraction of these resources, Unix reduces the number of software interfaces programmers must master. The resources accessed through file operations are as follows:

regular files	The kind of files most computer users think of. They serve as data repositories that can grow arbitrarily large and allow random access. Unix files are byte-oriented; any other logical boundaries are purely application conventions, and the kernel knows nothing about them.
pipes	Unix's simplest IPC mechanism. Usually, one process writes information into the pipe while another reads from it. Pipes are what shells use to provide I/O redirection (for example, `ls -lR | grep notes` or `ls | more`), and many programs use pipes to feed input to programs that run as their subprocesses. There are two types of pipes: unnamed and named. Unnamed pipes are created as they are needed and disappear once both the read and the write ends of the pipe are closed. Unnamed pipes are so called because they do not exist in the file system and, therefore, have no file name.^[1] Named pipes do have file names, and the file name is used to allow two independent processes to communicate through the pipe (similar to the way Unix domain sockets work^[2]). Pipes are also known as FIFOs because the data is ordered in a first-in/first-out manner.
directories	Special files that consist of a list of files they contain. Old Unix implementations allowed programs to read and write them in exactly the same manner as regular files. To allow better abstraction, a special set of system calls was added to provide directory manipulation, although the directories are still opened and closed like regular files. Those functions are presented in Chapter 14.
device files	Most physical devices are represented as files. There are two types of device files: block devices and character devices. Block device files represent hardware devices^[3] that cannot be read a byte at a time; they must be read in multiples of some block size. Under Linux, block devices receive special handling from the kernel^[4] and can contain file systems.^[5] Disk drives, including CD-ROM drives and RAM disks, are the most common block devices. Character devices can be read a single character at a time, and the kernel provides no caching or ordering facilities for them. Modems, terminals, printers, sound cards, and mice are all character devices. Traditionally, special directory entries kept in the /dev directory allow user-space processes to access device resources as files.
symbolic links	A special kind of file that contains the path to another file. When a symbolic link (symlink) is opened, the system recognizes it as a symlink, reads its value, and opens the file it references instead of the symlink itself. When the value stored in the symbolic link is used, the system is said to be following the symlink. Unless otherwise noted, system calls are assumed to follow symlinks that are passed to them.
sockets	Like pipes, sockets provide an IPC channel. They are more flexible than pipes, and can create IPC channels between processes running on different machines. Sockets are discussed in Chapter 17.
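The file types listed above can be told apart from user space by inspecting the mode bits the kernel reports for a file. The following is a minimal sketch, not a definitive implementation. It assumes a Linux system with /dev/null present and uses Python's `os` and `stat` modules as thin wrappers over the underlying stat-family system calls. The helper names `describe` and `survey` are hypothetical, chosen only for this illustration.

```python
import os
import socket
import stat
import tempfile


def describe(mode):
    """Map an st_mode value to one of the file-type names used in the list."""
    if stat.S_ISREG(mode):
        return "regular file"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISFIFO(mode):
        return "pipe"
    if stat.S_ISLNK(mode):
        return "symbolic link"
    if stat.S_ISCHR(mode):
        return "character device"
    if stat.S_ISBLK(mode):
        return "block device"
    if stat.S_ISSOCK(mode):
        return "socket"
    return "unknown"


def survey():
    """Create an example of several file types and report each one's type."""
    found = {}
    with tempfile.TemporaryDirectory() as workdir:
        regular = os.path.join(workdir, "regular")
        with open(regular, "w") as f:
            f.write("data\n")
        fifo = os.path.join(workdir, "fifo")
        os.mkfifo(fifo)              # a named pipe
        link = os.path.join(workdir, "link")
        os.symlink(regular, link)    # a symlink pointing at the regular file

        found["regular"] = describe(os.stat(regular).st_mode)
        found["directory"] = describe(os.stat(workdir).st_mode)
        found["named pipe"] = describe(os.stat(fifo).st_mode)
        # os.stat() follows the symlink; os.lstat() examines the link itself.
        found["link, followed"] = describe(os.stat(link).st_mode)
        found["link, not followed"] = describe(os.lstat(link).st_mode)

    # An unnamed pipe has no path, so it is examined through a descriptor.
    read_end, write_end = os.pipe()
    found["unnamed pipe"] = describe(os.fstat(read_end).st_mode)
    os.close(read_end)
    os.close(write_end)

    left, right = socket.socketpair()
    found["socket"] = describe(os.fstat(left.fileno()).st_mode)
    left.close()
    right.close()

    found["/dev/null"] = describe(os.stat("/dev/null").st_mode)
    return found


if __name__ == "__main__":
    for name, kind in survey().items():
        print(f"{name}: {kind}")
```

The two symlink entries show the distinction drawn in the list: the call that follows the link reports the type of the file it references, while the call that does not follow it reports the link itself.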

^[1] Under Linux, the /proc file system includes information on every file currently open on the system. Although this means that unnamed pipes can be found in a file system, they still do not have permanent file names, because they disappear when the processes using the pipe end.

^[2] See Chapter 17 for more information on Unix domain sockets.

^[3] Not all block devices represent actual hardware. A better description of a block device is an entity on which a file system can reside; Linux's loopback block device maps a regular file to a logical block device that allows that file to contain a complete file system.

^[4] Most notably, they are cached and access to them is ordered.

^[5] This is different from some systems that are capable of mounting file systems on character devices, as well as block devices.

In many operating systems, there is a one-to-one correspondence between files and file names. Every file has a file name in a file system, and every file name maps to a single file. Unix divorces the two concepts, allowing for more flexibility.

The only unique identity a file has is its inode (an abbreviation of information node). A file's inode contains all the information about a file, including the access permissions associated with it, its current size, and how many file names it has (which could be zero, one, twenty, or more). There are two types of inodes. The in-core inode is the only type we normally care about; every open file on the system has one. The kernel keeps in-core inodes in memory, and they are the same for all file-system types. The other type is the on-disk inode. Every file on a file system has an on-disk inode, and its exact structure depends on the type of file system the file is stored on. When a process opens a file on a file system, the on-disk inode is loaded into memory and converted into an in-core inode. When the in-core inode has been modified, it is transformed back into an on-disk inode and stored in the file system.^[6]
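Some of the information an inode holds is visible to ordinary programs. As a rough illustration, the sketch below creates a small file and reads back its inode number, permissions, size, and number of file names. It assumes Python's `os.stat()` as a wrapper over the stat system call; the function names `inode_summary` and `demo` are hypothetical.

```python
import os
import stat
import tempfile


def inode_summary(path):
    """Return a few of the inode fields the kernel exposes for a path."""
    st = os.stat(path)
    return {
        "inode number": st.st_ino,
        "permissions": stat.filemode(st.st_mode),
        "size": st.st_size,
        "file names": st.st_nlink,
    }


def demo():
    """Create a five-byte file with known permissions and summarize its inode."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "example")
        with open(path, "wb") as f:
            f.write(b"12345")
        os.chmod(path, 0o640)
        return inode_summary(path)


if __name__ == "__main__":
    for field, value in demo().items():
        print(f"{field}: {value}")
```

The inode number itself varies from run to run and from file system to file system; only the other three fields are predictable here.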

^[6] Linux has always used the term inode for both types, while other Unix variants have reserved inode for on-disk inodes and call in-core inodes vnodes. While using the vnode terminology is less confusing, we choose to use inode for both types to keep consistent with Linux standards.

On-disk and in-core inodes do not contain exactly the same information. Only the in-core inode, for example, keeps track of how many processes on the system are currently using the file associated with the inode.

Because on-disk and in-core inodes are synchronized by the kernel, most system calls end up updating both inodes. When this is the case, we just refer to updating the inode; it is implied that both the in-core and on-disk inodes are affected. Some files (such as unnamed pipes) do not have any on-disk inode. In these cases, only the in-core inode is updated.

A file name exists only in a directory that relates that file name to the on-disk inode. You can think of the file name as a pointer to the on-disk inode for the file associated with it. The on-disk inode contains the number of file names that refer to that inode, called the link count. When a file name is removed, the link count is decremented; if the link count reaches 0 and no process has the file open, the file's disk space is freed. If any processes still have the file open, the disk space is freed when the last of them closes the file.
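The link count can be watched as file names are added and removed. The sketch below is one way to observe it, assuming a file system that supports hard links and Python's `os.link()` and `os.unlink()` as wrappers over the corresponding system calls. The function name `link_count_history` and the file names `first` and `second` are hypothetical.

```python
import os
import tempfile


def link_count_history():
    """Record the link count of one file as its names come and go."""
    counts = []
    with tempfile.TemporaryDirectory() as workdir:
        first = os.path.join(workdir, "first")
        second = os.path.join(workdir, "second")

        # Keep a descriptor open so the inode can be examined even after
        # every file name that refers to it has been removed.
        fd = os.open(first, os.O_CREAT | os.O_RDWR, 0o600)
        counts.append(os.fstat(fd).st_nlink)   # one name

        os.link(first, second)                 # add a second name
        counts.append(os.fstat(fd).st_nlink)

        os.unlink(first)                       # remove the original name
        counts.append(os.fstat(fd).st_nlink)

        os.unlink(second)                      # remove the last name
        counts.append(os.fstat(fd).st_nlink)

        os.close(fd)                           # now the space can be freed
    return counts


if __name__ == "__main__":
    print(link_count_history())
```

The final reading is taken while the file has no names at all; the inode survives only because the process still holds it open.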

All of this means that it is possible to

Have multiple processes access a file that has never existed in a file system (such as a pipe)
Create a file on the disk, remove its directory entry, and continue to read from and write to the file
Change /tmp/foo and see the changes immediately in /tmp/bar, if both file names refer to the same inode
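Each of the three cases above can be tried directly. The following sketch assumes a Unix-like system where `os.fork()` is available and uses file names inside a temporary directory rather than /tmp/foo and /tmp/bar. All three function names are hypothetical, and the code is meant as an illustration rather than a recommended pattern.

```python
import os
import tempfile


def pipe_between_processes():
    """Two processes share a pipe that never had a file name."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: write a short message into the pipe and exit immediately.
        os.close(read_end)
        os.write(write_end, b"hello from child")
        os._exit(0)
    # Parent: read what the child wrote.
    os.close(write_end)
    data = os.read(read_end, 100)
    os.close(read_end)
    os.waitpid(pid, 0)
    return data


def unlinked_but_usable():
    """Remove a file's only name, then keep reading and writing it."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "scratch")
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
        os.unlink(path)
        name_exists = os.path.exists(path)
        os.write(fd, b"still here")
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 100)
        os.close(fd)
    return name_exists, data


def two_names_one_file():
    """Change a file through one name and read the change through another."""
    with tempfile.TemporaryDirectory() as workdir:
        foo = os.path.join(workdir, "foo")
        bar = os.path.join(workdir, "bar")
        with open(foo, "w") as f:
            f.write("original")
        os.link(foo, bar)              # bar now refers to the same inode
        with open(foo, "w") as f:      # truncates and rewrites that inode
            f.write("changed")
        with open(bar) as f:
            seen = f.read()
        same_inode = os.path.samefile(foo, bar)
    return same_inode, seen


if __name__ == "__main__":
    print(pipe_between_processes())
    print(unlinked_but_usable())
    print(two_names_one_file())
```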

Unix has always worked this way, but these operations can be disconcerting to new users and programmers. As long as you keep in mind that a file name is merely a pointer to a file's on-disk inode, and that the inode is the real resource, you should be fine.