Section 11.2. Basic File Operations | Linux Application Development (paperback) (2nd Edition)

11.2. Basic File Operations

As a large proportion of Linux's system calls manipulate files, we begin by showing you the functions that are most widely used. We discuss the more specialized functions later in this chapter. The functions used to read through directories are presented in Chapter 14 to help keep this chapter a bit more concise.

11.2.1. File Descriptors

When a process gains access to a file (usually called opening the file), the kernel returns a file descriptor that the process uses to perform subsequent operations on the file. File descriptors are small, nonnegative integers, which serve as indices into an array of open files the kernel maintains for each process.

The first three file descriptors for a process (0, 1, and 2) have standard usages. The first, 0, is known as standard input (stdin) and is where programs should take their interactive input from. File descriptor 1 is called standard output (stdout), and most output from the program should be directed there. Errors should be sent to standard error (stderr), which is file descriptor 2. The standard C library follows these rules, so gets() and printf() use stdin and stdout, respectively, and these conventions allow shells to properly redirect a process's input and output.

The <unistd.h> header file provides the STDIN_FILENO, STDOUT_FILENO, and STDERR_FILENO macros, which evaluate to the stdin, stdout, and stderr file descriptors, respectively. Using these symbolic names can make code slightly more readable.
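As a minimal sketch of these macros in use, the following helper sends ordinary output to stdout and diagnostics to stderr. The names put_message() and report() and the message strings are made up for this illustration; write() itself is covered later in this section.

```c
#include <string.h>
#include <unistd.h>

/* Write a NUL-terminated message to the given descriptor.
   Returns 0 if the whole message was written, -1 otherwise. */
int put_message(int fd, const char *msg) {
    size_t len = strlen(msg);

    return write(fd, msg, len) == (ssize_t) len ? 0 : -1;
}

/* Normal output goes to stdout (fd 1), diagnostics to stderr (fd 2). */
int report(void) {
    if (put_message(STDOUT_FILENO, "result: 42\n") < 0)
        return -1;
    return put_message(STDERR_FILENO, "warning: something looks odd\n");
}
```

Because the two streams use different descriptors, a shell can redirect the result to a file while the warning still reaches the terminal.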

Many of the file operations that manipulate a file's inode are available in two forms. The first form takes a file name as an argument. The kernel uses that argument to look up the file's inode and performs the appropriate operation on the inode (this usually includes following symlinks). The second form takes a file descriptor as an argument and performs the operation on the inode it refers to. The two sets of system calls use similar names, with the system calls that expect a file descriptor argument prefixed with the letter f. For example, the chmod() system call changes the access permissions for the file referred to by the passed file name; fchmod() sets the access permissions for the file referred to by the specified file descriptor.

To make the rest of this discussion a bit less verbose, we present both versions of the system calls when they exist but discuss only the first version (which uses a file name).

11.2.2. Closing Files

One of the few operations that is the same for all types of files is closing the file. Here is how to close a file:

    #include <unistd.h>

    int close(int fd);

This is obviously a pretty basic operation. However, there is one important thing to remember about closing files: it could fail. Some systems (most notably, networked file systems such as NFS) do not try to store the final piece of written data in the file system until the file is closed. If that storage operation fails (the remote host may have crashed), then close() returns an error. If your application is writing data but does not use synchronous writes (see the discussion of O_SYNC in the next section), you should always check the results of file closures. If close() fails, the updated file may be corrupted in some unpredictable fashion! Luckily, this happens extremely rarely.
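A minimal sketch of checking the result might look like the following. The helper name close_checked() is an assumption made for this example, not part of any library.

```c
#include <stdio.h>
#include <unistd.h>

/* Close fd and report any failure on stderr, labeled with what.
   Returns 0 on success and -1 on error. The close is not retried:
   on Linux the descriptor is released even when close() fails. */
int close_checked(int fd, const char *what) {
    if (close(fd) < 0) {
        perror(what);
        return -1;
    }
    return 0;
}
```

A program that has just written important data would treat a -1 result the same way it treats a failed write().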

11.2.3. Opening Files in the File System

Although Linux provides many types of files, regular files are by far the most commonly used. Programs, configuration files, and data files all fall under that heading, and most applications do not (explicitly) use any other file type. There are two ways of opening files that have associated file names:

    #include <fcntl.h>
    #include <unistd.h>

    int open(const char * pathname, int flags, mode_t mode);
    int creat(const char * pathname, mode_t mode);

The open() function returns a file descriptor that references the file pathname. If the return value is less than 0, an error has occurred (as always, errno contains the error code). The flags argument describes the type of access the calling process wants, and also controls various attributes of how the file is opened and manipulated. An access mode must always be provided and is one of O_RDONLY, O_RDWR, and O_WRONLY, which request read-only, read-write, and write-only access, respectively. One or more of the following values may be bitwise OR'ed with the access mode to control other file semantics.

`O_CREAT`	If the file does not already exist, create it as a regular file.
`O_EXCL`	This flag should be used only with `O_CREAT`. When it is specified, `open()` fails if the file already exists. This flag allows a simple locking implementation, but it is unreliable across networked file systems like NFS.^[9]
`O_NOCTTY`	The file being opened does not become the process's controlling terminal (see page 136 for more information on controlling terminals). This flag matters only when a process without any controlling terminal is opening a tty device. If it is specified any other time, it is ignored.
`O_TRUNC`	If the file already exists, the contents are discarded and the file size is set to 0.
`O_APPEND`	All writes to the file occur at the end of the file, although random access reads are still permitted.
`O_NONBLOCK`^[10]	The file is opened in nonblocking mode. Operations on normal files always block because they are stored on local hard disks with predictable response times, but operations on certain file types have unpredictable completion times. For example, reading from a pipe that does not have any data in it blocks the reading process until data becomes available. If `O_NONBLOCK` is specified, the read fails immediately with the error `EAGAIN` rather than blocking. Files that may take an indeterminate amount of time to perform an operation are called slow files.
`O_SYNC`	Normally, the kernel caches writes and records them to the hardware when it is convenient to do so. Although this implementation greatly increases performance, it is more likely to allow data loss than is immediately writing the data to disk. If `O_SYNC` is specified when a file is opened, all changes to the file are stored on the disk before the kernel returns control to the writing process. This is very important for some applications, such as database systems, in which write ordering is used to prevent data corruption in case of a system failure.

^[9] For more information on file locking, see Chapter 13.

^[10] O_NDELAY is the original name for O_NONBLOCK, but it is now obsolete.

The mode parameter specifies the access permissions for the file if it is being created, and it is modified by the process's current umask. If O_CREAT is not specified, the mode is ignored.
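The interaction between O_CREAT, O_EXCL, and the umask can be sketched as follows. The helper name create_new() is hypothetical, and the permission values in the note after the code are only an example.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create path only if it does not exist yet. Returns a write-only
   descriptor, or -1 with errno set (EEXIST if the file was already
   there). The permissions actually applied are mode & ~umask. */
int create_new(const char *path, mode_t mode) {
    return open(path, O_CREAT | O_EXCL | O_WRONLY, mode);
}
```

With a umask of 022, calling create_new("data", 0666) yields a file with permissions 0644, and a second call on the same path fails with EEXIST.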

The creat() function is exactly equivalent to

 open(pathname, O_CREAT | O_WRONLY | O_TRUNC, mode)

We do not use creat() in this book because we find open() easier to read and to understand.^[11]

^[11] creat() is misspelled, anyway.

11.2.4. Reading, Writing, and Moving Around

Although there are a few ways to read from and write to files, only the simplest is discussed here.^[12] Reading and writing are nearly identical, so we discuss them simultaneously.

^[12] readv(), writev(), and mmap() are discussed in Chapter 13; sendmsg() and recvmsg() are mentioned in Chapter 17.

    #include <unistd.h>

    ssize_t read(int fd, void * buf, size_t length);
    ssize_t write(int fd, const void * buf, size_t length);

Both functions take a file descriptor fd, a pointer to a buffer buf, and the length of that buffer. read() reads from the file descriptor and places the data read into the passed buffer; write() writes length bytes from the buffer to the file. Both functions return the number of bytes transferred, or -1 on an error (which implies no bytes were read or stored).

Now that we have covered these system calls, here is a simple example that creates the file hw in the current directory and writes Hello World! into it:

    /* hwwrite.c */

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        int fd;

        /* open the file, creating it if it's not there, and removing
           its contents if it is there */
        if ((fd = open("hw", O_TRUNC | O_CREAT | O_WRONLY, 0644)) < 0) {
            perror("open");
            exit(1);
        }

        /* the magic number of 13 is the number of characters which will
           be written */
        if (write(fd, "Hello World!\n", 13) != 13) {
            perror("write");
            exit(1);
        }

        close(fd);

        return 0;
    }

Here is what happens when we run hwwrite:

    $ cat hw
    cat: hw: No such file or directory
    $ ./hwwrite
    $ cat hw
    Hello World!
    $

Changing this function to read from a file is a simple matter of changing the open() to

 open("hw", O_RDONLY);

and changing the write() of a static string to a read() into a buffer.
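A sketch of that reading variant, written as a function rather than a complete program, could look like this. The name read_start() and its NUL-terminating behavior are choices made for this illustration.

```c
#include <fcntl.h>
#include <unistd.h>

/* Read up to size - 1 bytes from the start of path into buf and
   NUL-terminate the result. Returns the number of bytes read,
   or -1 on error. */
ssize_t read_start(const char *path, char *buf, size_t size) {
    int fd;
    ssize_t len;

    if (size == 0)
        return -1;
    if ((fd = open(path, O_RDONLY)) < 0)
        return -1;

    len = read(fd, buf, size - 1);
    if (len >= 0)
        buf[len] = '\0';

    close(fd);
    return len;
}
```

Called on the hw file created above with a buffer of 64 bytes, it should return 13 and leave the greeting in the buffer.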

Unix files can be divided into two categories: seekable and nonseekable.^[13] Nonseekable files are first-in/first-out channels that do not support random reads or writes, and data cannot be reread or overwritten. Seekable files allow the reads and the writes to occur anywhere in the file. Pipes and sockets are nonseekable files; block devices and regular files are seekable.

^[13] Although this division is almost clean, TCP sockets support out-of-band data, which makes it a bit dirtier. Out-of-band data is outside the scope of this book; [Stevens, 2004] provides a complete description.

As FIFOs are nonseekable files, it is obvious where read() reads from (the beginning of the file) and write() writes to (the end of the file). Seekable files, on the other hand, have no obvious place for the operations to occur. Instead, both happen at the "current" location in the file and advance the current location after the operation. When a seekable file is initially opened, the current location is at the beginning of the file, or offset 0. If 10 bytes are read, the current position is then at offset 10, and a write of 5 more bytes overwrites the data, starting with the eleventh byte in the file (which is at offset 10, where the current position was). After such a write, the current position becomes offset 15, immediately after the overwritten data.

If the current position is the end of the file and the process tries to read from the file, read() returns 0 rather than an error. If more data is written at the end of the file, the file grows just enough to accommodate the extra data and the current position becomes the new end of the file. Each file descriptor keeps track of an independent current position^[14] (it is not kept in the file's inode), so if a file is opened multiple times by multiple processes (or by the same process, for that matter), reads and writes through one of the file descriptors do not affect the location of reads and writes made through the other file descriptor. Of course, the multiple writes could corrupt the file in other ways, so some sort of locking may be needed in these situations.
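The independence of current positions can be observed with a small experiment, sketched here under the assumption that the file is opened twice by the same process. The function name opens_are_independent() is made up for this example.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Open path twice and read the first bytes through each descriptor.
   Returns 1 if both reads returned the same data (each descriptor
   started at offset 0), 0 if they differ, and -1 on error. */
int opens_are_independent(const char *path) {
    char first[8], second[8];
    ssize_t n1, n2;
    int fd1, fd2, result = -1;

    if ((fd1 = open(path, O_RDONLY)) < 0)
        return -1;
    if ((fd2 = open(path, O_RDONLY)) < 0) {
        close(fd1);
        return -1;
    }

    n1 = read(fd1, first, sizeof(first));    /* advances only fd1 */
    n2 = read(fd2, second, sizeof(second));  /* fd2 is still at offset 0 */
    if (n1 >= 0 && n2 >= 0)
        result = (n1 == n2 && memcmp(first, second, (size_t) n1) == 0);

    close(fd1);
    close(fd2);
    return result;
}
```

The check is only meaningful for a file whose first 16 bytes do not repeat; if the two descriptors shared one position, the second read would return bytes 8 through 15 instead of the first 8.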

^[14] Almost independent; see the discussion of dup() on page 196 for the exceptions to this.

Files opened with O_APPEND have a slightly different behavior. For such files, the current position is moved to the end of the file before the kernel writes any data. After the write, the current position is moved to the end of the newly written data, as normal. For append-only files, this guarantees that the file's current position is always at the end of the file immediately following a write().

Applications that want to read and write data from random locations in the file need to set the current position before reading and writing data, using lseek():

    #include <unistd.h>

    off_t lseek(int fd, off_t offset, int whence);

The current position for file fd is moved to offset bytes relative to whence, where whence is one of the following:

`SEEK_SET`^[15]	The beginning of the file
`SEEK_CUR`	The current position in the file
`SEEK_END`	The end of the file

^[15] As most systems define SEEK_SET as 0, it is common to see lseek(fd, offset, 0) used instead of lseek(fd, offset, SEEK_SET). This is not as portable (or readable) as SEEK_SET, but it is fairly common in old code.

For both SEEK_CUR and SEEK_END the offset may be negative. In this case, the current position is moved toward the beginning of the file (from whence) rather than toward the end of the file. For example, the following code moves the current position to five bytes from the end of the file:

 lseek(fd, -5, SEEK_END);

The lseek() system call returns the new current position in the file relative to the beginning of the file, or -1 if an error occurred. Thus, lseek(fd, 0, SEEK_END) is a simple way of finding out how large a file is, but make sure you reset the current position before trying to read from fd.

Although the current position is not disturbed by other processes that access the file at the same time,^[16] that does not mean multiple processes can safely write to a file simultaneously. Imagine the following sequence:

^[16] Well, not usually, anyway. If processes share file descriptors (meaning file descriptors that arose from a single open() call), those processes share the same file structure and the same current position. The most common way for this to happen is through fork(), after which the parent and child share their open files, as discussed on page 197. The other way this can happen is if a file descriptor is passed to another process through a Unix domain socket, which is described on pages 424-425.

Process A	Process B
`lseek(fd, 0, SEEK_END);`
	`lseek(fd, 0, SEEK_END);`
	`write(fd, buf, 10);`
`write(fd, buf, 5);`

In this case process A would have overwritten the first five bytes of process B's data, which is probably not what was intended. If multiple processes need to append to a file simultaneously, the O_APPEND flag should be used, which makes the move to the end of the file and the write a single atomic operation.

Under most POSIX systems, processes are allowed to move the current position past the end of the file. When data is then written at that position, the file grows to the appropriate size, and the current position becomes the new end of the file. The only catch is that most systems do not actually allocate any disk space for the portion of the file that was never written to; they change only the logical size of the file.

Portions of files that are "created" in this manner are known as holes. Reading from a hole in a file returns a buffer full of zeros, and writing to them could fail with an out-of-disk-space error. All of this means that lseek() should not be used to reserve disk space for later use because that space may not be allocated. If your application needs to allocate some disk space for later use, you must use write(). Files with holes in them are often used for files that have data sparsely spaced throughout them, such as files that represent hash tables.

For a simple, shell-based demonstration of file holes, look at the following example (note that /dev/zero is a character device that returns as many zeros as a process tries to read from it).

    $ dd if=/dev/zero of=foo bs=1k count=10
    10+0 records in
    10+0 records out
    $ ls -l foo
    -rw-rw-r--   1 ewt      ewt       10240 Feb  6 21:50 foo
    $ du foo
    10      foo
    $ dd if=/dev/zero of=bar bs=1k count=1 seek=9
    1+0 records in
    1+0 records out
    $ ls -l bar
    -rw-rw-r--   1 ewt      ewt       10240 Feb  6 21:50 bar
    $ du bar
    1       bar
    $

Although both foo and bar are 10K in size, bar uses only 1K of disk space because the other 9K were seeked over when the file was created instead of written.

11.2.5. Partial Reads and Writes

Although both read() and write() take a parameter that specifies how many bytes to read or write, neither one is guaranteed to process the requested number of bytes, even if no error has occurred. The simplest example of this is trying to read from a regular file that is already positioned at the end of the file. The system cannot actually read any bytes, but it is not exactly an error condition either. Instead, the read() call returns 0 bytes. In the same vein, if the current position was 10 bytes from the end of the file and an attempt was made to read more than 10 bytes from the file, 10 bytes would be read and the read() call would return the value 10. Again, this is not considered an error condition.

The behavior of read() also depends on whether the file was opened with O_NONBLOCK. On many file types, O_NONBLOCK does not make any difference at all. Files for which the system can guarantee an operation's completion in a reasonable amount of time always block on reads and writes; they are sometimes referred to as fast files. This set of files includes local block devices and regular files. For other file types, such as pipes and such character devices as terminals, the process could be waiting for another process (or a human being) to either provide something for the process to read or free resources for the system to use when processing the write() request. In either case, the system has no way of knowing whether it will ever be able to complete the system call. When these files are opened with O_NONBLOCK, for each operation on the file, the system simply does as much as it is able to do immediately, and then returns to the calling process.
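The following sketch shows what a nonblocking read looks like in practice. It assumes a descriptor that is already open (such as the read end of a pipe) and uses fcntl(), which is not covered in this section, to turn on O_NONBLOCK; the names set_nonblocking(), try_read(), and READ_WOULD_BLOCK are made up for the example.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical return value meaning "nothing to read right now". */
enum { READ_WOULD_BLOCK = -2 };

/* Switch an already-open descriptor to nonblocking mode. */
int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Read without waiting. Returns the number of bytes read, 0 at end
   of file, READ_WOULD_BLOCK if no data is ready yet, or -1 on any
   other error. */
ssize_t try_read(int fd, void *buf, size_t size) {
    ssize_t n = read(fd, buf, size);

    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return READ_WOULD_BLOCK;
    return n;
}
```

On an empty pipe whose write end is still open, try_read() reports READ_WOULD_BLOCK; once the writer closes its end, the same call returns 0.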

Nonblocking I/O is an important topic, and more examples of it are presented in Chapter 13. With the standardization of the poll() system call, however, the need for it (especially for reading) has diminished. If you find yourself using nonblocking I/O extensively, try to rethink your program in terms of poll() to see if you can make it more efficient.

To show a concrete example of reading and writing files, here is a simple reimplementation of cat. It copies stdin to stdout until there is no more input to copy.

    /* cat.c */

    #include <stdio.h>
    #include <unistd.h>

    /* While there is data on standard in (fd 0), copy it to standard
       out (fd 1). Exit once no more data is available. */

    int main(void) {
        char buf[1024];
        ssize_t len;

        /* len will be > 0 while data is available and read() is
           successful */
        while ((len = read(STDIN_FILENO, buf, sizeof(buf))) > 0) {
            if (write(STDOUT_FILENO, buf, len) != len) {
                perror("write");
                return 1;
            }
        }

        /* len was <= 0; if len is 0, no more data is available.
           Otherwise, an error occurred. */
        if (len < 0) {
            perror("read");
            return 1;
        }

        return 0;
    }
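The cat example treats a short write as a failure. An alternative, sketched here under the assumption that the caller wants every byte delivered, is to loop until the whole buffer has been written. The helper name write_all() is hypothetical, and retrying on EINTR (a write interrupted by a signal) is an addition that this section does not discuss.

```c
#include <errno.h>
#include <unistd.h>

/* Write all len bytes of buf to fd, continuing after partial writes
   and after interruption by a signal.
   Returns 0 on success, -1 on error. */
int write_all(int fd, const void *buf, size_t len) {
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted, try again */
            return -1;
        }
        p += n;                     /* skip what was written */
        len -= (size_t) n;
    }
    return 0;
}
```

On a blocking descriptor this loop waits as long as necessary; on a nonblocking one it returns -1 with errno set to EAGAIN when the descriptor cannot accept more data.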

11.2.6. Shortening Files

Although regular files automatically grow when data is written to the end of them, there is no way for the system to automatically shrink files when the data at their end is no longer needed. After all, how would the system know when data becomes extraneous? It is a process's responsibility to notify the system when a file may be truncated at a certain point.

    #include <unistd.h>

    int truncate(const char * pathname, off_t length);
    int ftruncate(int fd, off_t length);

The file's size is set to length, and any data in the file past the new end of the file is lost. If length is larger than the current size of the file, the file is actually grown to the indicated length (using holes if possible), although this behavior is not guaranteed by POSIX and should not be relied on in portable programs.

11.2.7. Synchronizing Files

When a program writes data to the file, the data is normally stored in a kernel cache until it gets written to the physical medium (such as a hard drive), but the kernel returns control to that program as soon as the data is copied into the cache. This provides major performance improvements as it allows the kernel to order writes on the disk and to group multiple writes into a single block operation. In the event of a system failure, however, it has a few drawbacks that could be important. For example, an application that assumes data is stored in a database before the index entry for that data is stored might not handle a failure that results in just the index's getting updated.

There are a few mechanisms applications can use to wait for data to get written to the physical medium. The O_SYNC flag, discussed on page 168, causes all writes to the file to block the calling process until the medium has been updated. While this certainly works, it is not a very neat approach. Normally, applications do not need to have every operation synchronized; more often, they need to make sure a set of operations has completed before beginning another set. The fsync() and fdatasync() system calls provide this semantic:

    #include <unistd.h>

    int fsync(int fd);
    int fdatasync(int fd);

Both system calls suspend the application until all of the data for the file fd has been written. fsync() also waits for the file's inode information, such as the access time, to get updated.^[17] Neither of these system calls can ensure that the data gets written to nonvolatile storage, however. Modern disk drives have large caches, and a power failure could cause some data stored in those caches to get lost.

^[17] The inode information for files is listed in Table 11.3.

11.2.8. Other Operations

Linux's file model does a good job of standardizing most file operations through generic functions such as read() and write() (for example, writing to a pipe is the same as writing to a file on disk). However, some devices have operations that are poorly modeled by this abstraction. For example, terminal devices, represented as character devices, need to provide a method to change the speed of the terminal, and a CD-ROM drive, represented by a block device, needs to know when it should play an audio track to help increase a programmer's productivity.

All of these miscellaneous operations are accessed through a single system call, ioctl() (short for I/O control), which is prototyped like this:

    #include <sys/ioctl.h>

    int ioctl(int fd, int request, ...);

although it is almost always used like this:

 int ioctl(int fd, int request, void * arg);

Whenever ioctl() is used, its first argument is the file being manipulated and the second argument specifies what operation is being requested. The final argument is usually a pointer to something, but what that something is, as well as the exact semantics of the return code, depends on what type of file fd refers to and what type of operation was requested. For some operations, arg is a long value instead of a pointer; in these instances, a typecast is normally used. There are many examples of ioctl() in this book, and you do not need to worry about using ioctl() until you come across them.
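As one small illustration of the pointer form, Linux supports the FIONREAD request on pipes, sockets, and terminals, which stores the number of bytes waiting to be read into an int. This sketch is Linux-specific, and the wrapper name bytes_waiting() is made up for the example.

```c
#include <sys/ioctl.h>
#include <unistd.h>

/* Ask the kernel how many bytes are waiting to be read on fd.
   Returns the count, or -1 if the request fails (for example,
   because fd is not a valid descriptor). */
int bytes_waiting(int fd) {
    int count;

    if (ioctl(fd, FIONREAD, &count) < 0)
        return -1;
    return count;
}
```

Here the third argument is a pointer to the int that receives the answer; other requests expect pointers to entirely different structures, which is why each request has to be looked up individually.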