Section 13.3. File Locking | Linux Application Development (paperback) (2nd Edition)

13.3. File Locking

Although it is common for multiple processes to access a single file, they must do so carefully. Many files include complex data structures, and updating those data structures creates the same race conditions involved in signal handlers and shared memory regions.

There are two types of file locking. The most common, advisory locking, is not enforced by the kernel. It is purely a convention that all processes that access the file must follow. The other type, mandatory locking, is enforced by the kernel. When a process locks a file for writing, other processes that attempt to read or write to the file are suspended until the lock is released. Although this may seem like the more obvious method, mandatory locks force the kernel to check for locks on every read() and write(), substantially decreasing the performance of those system calls.

Linux provides two methods of locking files: lock files and record locking.

13.3.1. Lock Files

Lock files are the simplest method of file locking. Each data file that needs locking is associated with a lock file. When that lock file exists, the data file is considered locked and other processes do not access it. When the lock file does not exist, a process creates the lock file and then accesses the file. As long as the procedure for creating the lock file is atomic, ensuring that only one process at a time can "own" the lock file, this method guarantees that only one process accesses the file at a time.

This idea is pretty simple. When a process wants to access a file, it locks the file as follows:

    fd = open("somefile.lck", O_RDONLY, 0644);
    if (fd >= 0) {
        close(fd);
        printf("the file is already locked");
        return 1;
    } else {
        /* the lock file does not exist, we can lock it and access it */
        fd = open("somefile.lck", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) {
            perror("error creating lock file");
            return 1;
        }
        /* we could write our pid to the file */
        close(fd);
    }

When the process is done with the file, it calls unlink("somefile.lck"); to release the lock.

Although the above code segment may look correct, it would allow multiple processes to lock the same file under some circumstances, which is exactly what locking is supposed to avoid. If a process checks for the lock file's existence, sees that the lock does not exist, and is then interrupted by the kernel to let other processes run, another process could lock the file before the original process creates the lock file! The O_EXCL flag to open() is available to make lock file creation atomic, and hence immune to race conditions. When O_EXCL is specified, open() fails if the file already exists. This simplifies the creation of lock files, which is properly implemented as follows.

    fd = open("somefile.lck", O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0 && errno == EEXIST) {
        printf("the file is already locked");
        return 1;
    } else if (fd < 0) {
        perror("unexpected error checking lock");
        return 1;
    }
    /* we could write our pid to the file */
    close(fd);

Lock files are used to lock a wide variety of standard Linux files, including serial ports and the /etc/passwd file. Although they work well for many applications, they do suffer from a number of serious drawbacks.

1.	Only one process may have the lock at a time, preventing multiple processes from reading a file simultaneously. If the file is updated atomically,^[17] processes that read the file can ignore the locking issue, but atomic updates are difficult to guarantee for complex file structures.
2.	The O_EXCL flag is reliable only on local file systems. None of the network file systems supported by Linux preserve O_EXCL semantics between multiple machines that are locking a common file.^[18]
3.	The locking is only advisory; processes can update the file despite the existence of a lock.
4.	If the process that holds the lock terminates abnormally, the lock file remains. If the pid of the locking process is stored in the lock file, other processes can check for the existence of the locking process and remove the lock if it has terminated. This is, however, a complex procedure that is no help if the pid is being reused by another process when the check is made.

^[17] /etc/passwd is updated only by processes that create a new copy of the file with the modifications and then replace the original through a rename() system call. As this sequence provides an atomic update, processes may read from /etc/passwd at any time.

^[18] The Andrew Filesystem (AFS), which is available for Linux but not included in the standard kernel, does support O_EXCL across a network.
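The pid check described in the last drawback might be sketched as follows. The helper names lockfile_acquire() and lockfile_is_stale() are invented for this illustration, and the sketch assumes the lock file holds nothing but the owner's pid in decimal. It does nothing about the pid-reuse problem, so a "not stale" answer may still be wrong.

```c
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create the lock file atomically and record
   our pid in it. Returns 0 on success, -1 if the lock is already
   held or an error occurs. */
int lockfile_acquire(const char *path) {
    char buf[32];
    int len;
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);

    if (fd < 0)
        return -1;

    len = snprintf(buf, sizeof(buf), "%ld\n", (long) getpid());
    if (write(fd, buf, len) != len) {
        close(fd);
        unlink(path);
        return -1;
    }
    return close(fd);
}

/* Hypothetical helper: returns 1 if the pid stored in the lock file
   does not name an existing process, 0 if it does (or we cannot
   tell), and -1 if the lock file cannot be read or parsed. */
int lockfile_is_stale(const char *path) {
    char buf[32];
    ssize_t n;
    long pid;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    n = read(fd, buf, sizeof(buf) - 1);
    close(fd);
    if (n <= 0)
        return -1;
    buf[n] = '\0';

    pid = strtol(buf, NULL, 10);
    if (pid <= 0)
        return -1;

    /* Signal 0 delivers nothing; kill() only checks that the
       process exists and that we could signal it. */
    if (kill((pid_t) pid, 0) < 0 && errno == ESRCH)
        return 1;
    return 0;
}
```

A caller that finds a stale lock still has to unlink() the file and call lockfile_acquire() again, and two processes doing so at the same moment can race with each other, which is part of why the text calls this procedure complex.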

13.3.2. Record Locking

To overcome the problems inherent with lock files, record locking was added to both System V and BSD 4.3 through the lockf() and flock() system calls, respectively. POSIX defined a third mechanism for record locking that uses the fcntl() system call. Although Linux supports all three interfaces, we discuss only the POSIX interface, as it is now supported by nearly all Unix platforms. The lockf() function is implemented as an interface to fcntl(), however, so the rest of this discussion applies to both techniques.

There are two important distinctions between record locks and lock files. First of all, record locks lock an arbitrary portion of the file. For example, process A may lock bytes 50-200 of a file while another process locks bytes 2,500-3,000, without having the two locks conflict. Fine-grained locking is useful when multiple processes need to update a single file. The other advantage of record locking is that the locks are held by the kernel rather than the file system. When a process is terminated, all the locks it holds are released.

Like lock files, POSIX locks are also advisory. Linux, like System V, provides a mandatory variant of record locking that may be used but is not as portable. File locking may or may not work across networked file systems. Under recent versions of Linux, file locking works across NFS as long as all of the machines participating in the locks are running the NFS locking daemon lockd.

Record locking provides two types of locks: read locks and write locks. Read locks are also known as shared locks, because multiple processes may simultaneously hold read locks over a single region. It is always safe for multiple processes to read a data structure that is not being updated. When a process needs to write to a file, it must get a write lock (or exclusive lock). Only one process may hold a write lock for a record, and no read locks may exist for the record while the write lock is in place. This ensures that a process does not interfere with readers while it is writing to a region.

Multiple locks from a single process never conflict.^[19] If a process has a read lock on bytes 200-250 and tries to get a write lock on the region 200-225, it will succeed. The original lock is moved and becomes a read lock on bytes 226-250, and the new write lock from 200-225 is granted.^[20] This rule prevents a process from forcing itself into a deadlock (although multiple processes can still deadlock).

^[19] This situation is more complicated for threads. Many Linux kernels and libraries treat threads as different processes, which raises the potential of file lock conflicts between threads (which is incompatible with the standard POSIX threads model). Linux is moving toward a more conventional thread model which shares file locks between all of the threads of a single process, but threaded programs should use the POSIX thread locking mechanisms instead of relying on either behavior for file locks.

^[20] This lock manipulation happens atomically; there is no point at which any part of the region is unlocked.
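The splitting of a read lock described above can be observed from a second process. The sketch below is only an illustration: set_region() and probe_from_child() are names made up here, and the probe forks a child because a process never sees conflicts with its own locks.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: set (or clear) a lock of the given type on
   len bytes starting at byte offset start. */
int set_region(int fd, short type, off_t start, off_t len) {
    struct flock fl;

    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    fl.l_pid = 0;
    return fcntl(fd, F_SETLK, &fl);
}

/* Hypothetical helper: fork a child that asks, through F_GETLK,
   what would stop it from write-locking the region. Returns
   F_UNLCK, F_RDLCK, or F_WRLCK as reported to the child, or -1 on
   error. */
int probe_from_child(const char *path, off_t start, off_t len) {
    int status;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0) {
        struct flock fl;
        int fd = open(path, O_RDWR);

        if (fd < 0)
            _exit(3);
        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start = start;
        fl.l_len = len;
        fl.l_pid = 0;
        if (fcntl(fd, F_GETLK, &fl) < 0)
            _exit(3);
        _exit(fl.l_type == F_UNLCK ? 0 : fl.l_type == F_RDLCK ? 1 : 2);
    }
    if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    switch (WEXITSTATUS(status)) {
    case 0: return F_UNLCK;
    case 1: return F_RDLCK;
    case 2: return F_WRLCK;
    default: return -1;
    }
}
```

After set_region(fd, F_RDLCK, 200, 51) followed by set_region(fd, F_WRLCK, 200, 26), a probe of bytes 200-225 should report F_WRLCK and a probe of bytes 226-250 should report F_RDLCK.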

POSIX record locking is done through the fcntl() system call. Recall from Chapter 11 that fcntl() looks like this:

    #include <fcntl.h>

    int fcntl(int fd, int command, long arg);

For all of the locking operations, the third parameter (arg) is a pointer to a struct flock.

    #include <fcntl.h>

    struct flock {
        short l_type;
        short l_whence;
        off_t l_start;
        off_t l_len;
        pid_t l_pid;
    };

The first element, l_type, tells what type of lock is being set. It is one of the following:

`F_RDLCK`	A read (shared) lock is being set.
`F_WRLCK`	A write (exclusive) lock is being set.
`F_UNLCK`	An existing lock is being removed.

The next two elements, l_whence and l_start, specify where the region begins in the same manner file offsets are passed to lseek(). l_whence tells how l_start is to be interpreted and is one of SEEK_SET, SEEK_CUR, and SEEK_END; see page 171 for details on these values. The next entry, l_len, tells how long, in bytes, the lock is. If l_len is 0, the lock is considered to extend to the end of the file. The final entry, l_pid, is used only when locks are being queried. It is set to the pid of the process that owns the queried lock.
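To make the region fields concrete, here are three ways a region might be described. The function names are hypothetical and exist only for this sketch; each returns a struct flock ready to be passed to fcntl().

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Describe bytes first..last (inclusive), counted from the start of
   the file. */
struct flock region_bytes(short type, off_t first, off_t last) {
    struct flock fl = {0};

    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = first;
    fl.l_len = last - first + 1;
    return fl;
}

/* Describe everything from start through the end of the file. An
   l_len of 0 means the region has no fixed end, so it keeps
   covering the end of the file as the file grows. */
struct flock region_to_eof(short type, off_t start) {
    struct flock fl = {0};

    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = 0;
    return fl;
}

/* Describe the final count bytes of the file at its current size:
   the region starts count bytes before the end of the file. */
struct flock region_tail(short type, off_t count) {
    struct flock fl = {0};

    fl.l_type = type;
    fl.l_whence = SEEK_END;
    fl.l_start = -count;
    fl.l_len = count;
    return fl;
}
```

For example, region_bytes(F_RDLCK, 50, 200) yields an l_start of 50 and an l_len of 151, since both end bytes are included.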

There are three fcntl() commands that pertain to locking the file. The operation is passed as the second argument to fcntl(). fcntl() returns -1 on error and 0 otherwise. The command argument should be set to one of the following:

`F_SETLK`	Sets the lock described by `arg`. If the lock cannot be granted because of a conflict with another process's locks, `fcntl()` fails and sets `errno` to `EAGAIN` (POSIX also permits `EACCES`). If the `l_type` is set to `F_UNLCK`, an existing lock is removed.
`F_SETLKW`	Similar to `F_SETLK`, but blocks until the lock is granted. If a signal is caught while the process is blocked, the `fcntl()` call fails with `errno` set to `EINTR`.
`F_GETLK`	Checks to see if the described lock would be granted. If the lock would be granted, the `struct flock` is unchanged except for `l_type`, which is set to `F_UNLCK`. If the lock would not be granted, the `struct flock` is overwritten with a description of a conflicting lock, and `l_pid` is set to the pid of the process that holds it. Success (0) is returned whether or not the lock would be granted.
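One way the two lock-setting commands might be wrapped is sketched below. The names try_lock() and wait_lock() are invented for this example, and both lock the whole file for simplicity.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Fill in a description of the whole file. */
static void whole_file(struct flock *fl, short type) {
    fl->l_type = type;
    fl->l_whence = SEEK_SET;
    fl->l_start = 0;
    fl->l_len = 0;          /* 0 means "through the end of the file" */
    fl->l_pid = 0;
}

/* Hypothetical helper: try to set the lock without blocking.
   Returns 0 on success, 1 if another process holds a conflicting
   lock, and -1 on any other error. */
int try_lock(int fd, short type) {
    struct flock fl;

    whole_file(&fl, type);
    if (fcntl(fd, F_SETLK, &fl) == 0)
        return 0;
    /* POSIX allows either errno value to signal a lock conflict. */
    if (errno == EAGAIN || errno == EACCES)
        return 1;
    return -1;
}

/* Hypothetical helper: block until the lock is granted, trying
   again whenever a signal handler interrupts the wait. Returns 0
   on success and -1 on error. */
int wait_lock(int fd, short type) {
    struct flock fl;

    whole_file(&fl, type);
    while (fcntl(fd, F_SETLKW, &fl) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return 0;
}
```

Passing F_UNLCK as the type to either helper releases the lock. Retrying on EINTR is a design choice for this sketch; a program that uses signals to implement a lock timeout would want to return instead.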

Although F_GETLK allows a process to check whether a lock would be granted, the following code sequence could still fail to get a lock:

    fcntl(fd, F_GETLK, &lockinfo);
    if (lockinfo.l_type != F_UNLCK) {
        fprintf(stderr, "lock conflict\n");
        return 1;
    }
    lockinfo.l_type = F_RDLCK;
    fcntl(fd, F_SETLK, &lockinfo);

Another process could lock the region between the two fcntl() calls, causing the second fcntl() to fail to set the lock.

As a simple example of record locking, here is a program that opens a file, obtains a read lock on the file, frees the read lock, gets a write lock, and then exits. Between each step, the program waits for the user to press return. If it fails to get a lock, it prints the pid of a process that holds a conflicting lock and waits for the user to tell it to try again. Running this sample program in two terminals makes it easy to experiment with POSIX locking rules.

    /* lock.c */

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* displays the message, and waits for the user to press
       return */
    void waitforuser(char * message) {
        char buf[10];

        printf("%s", message);
        fflush(stdout);

        fgets(buf, 9, stdin);
    }

    /* Gets a lock of the indicated type on the fd which is passed.
       The type should be either F_UNLCK, F_RDLCK, or F_WRLCK */
    void getlock(int fd, int type) {
        struct flock lockinfo;
        char message[80];

        /* keep trying until we succeed */
        while (1) {
            /* we'll lock the entire file; F_GETLK overwrites the
               region on a conflict, so describe it again on every
               attempt */
            lockinfo.l_type = type;
            lockinfo.l_whence = SEEK_SET;
            lockinfo.l_start = 0;
            lockinfo.l_len = 0;

            /* if we get the lock, return immediately */
            if (!fcntl(fd, F_SETLK, &lockinfo)) return;

            /* find out who holds the conflicting lock */
            fcntl(fd, F_GETLK, &lockinfo);

            /* there's a chance the lock was freed between the F_SETLK
               and F_GETLK; make sure there's still a conflict before
               complaining about it */
            if (lockinfo.l_type != F_UNLCK) {
                snprintf(message, sizeof(message),
                         "conflict with process %d... press "
                         "<return> to retry:", (int) lockinfo.l_pid);
                waitforuser(message);
            }
        }
    }

    int main(void) {
        int fd;

        /* set up a file to lock */
        fd = open("testlockfile", O_RDWR | O_CREAT, 0666);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        printf("getting read lock\n");
        getlock(fd, F_RDLCK);
        printf("got read lock\n");

        waitforuser("\npress <return> to continue:");

        printf("releasing lock\n");
        getlock(fd, F_UNLCK);

        printf("getting write lock\n");
        getlock(fd, F_WRLCK);
        printf("got write lock\n");

        waitforuser("\npress <return> to exit:");

        /* locks are released when the file is closed */

        return 0;
    }

Locks are treated differently from other file attributes. Locks are associated with a (pid, inode) tuple, unlike most attributes of open files, which are associated with a file descriptor or file structure. This means that if a process

1.	Opens a single file twice, resulting in two different file descriptors
2.	Gets read locks on a single region in both file descriptors
3.	Closes one of the file descriptors

then the file is no longer locked by the process. Only a single read lock was granted because only one (pid, inode) pair was involved (the second lock attempt succeeded because a process's locks can never conflict), and after one of the file descriptors is closed, the process does not have any locks on the file!
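The three steps can be tried out with a sketch like the one below. All of the function names are invented for this example; the check is made from a forked child because only another process can see whether the lock is still in place.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Put a read lock on the whole file through the given descriptor. */
static int read_lock(int fd) {
    struct flock fl;

    fl.l_type = F_RDLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    fl.l_pid = 0;
    return fcntl(fd, F_SETLK, &fl);
}

/* Fork a child and ask whether it would be refused a write lock on
   the file: 1 if some lock is in the way, 0 if not, -1 on error. */
static int locked_for_others(const char *path) {
    int status;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0) {
        struct flock fl;
        int fd = open(path, O_RDWR);

        if (fd < 0)
            _exit(2);
        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start = 0;
        fl.l_len = 0;
        fl.l_pid = 0;
        if (fcntl(fd, F_GETLK, &fl) < 0)
            _exit(2);
        _exit(fl.l_type != F_UNLCK);
    }
    if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 2 ? -1 : WEXITSTATUS(status);
}

/* Hypothetical demonstration: open the file twice, read-lock it
   through both descriptors, then close one of them. *before and
   *after receive what another process sees on either side of the
   close(). Returns 0 if the steps themselves succeeded. */
int close_drops_locks(const char *path, int *before, int *after) {
    int fd1, fd2;

    fd1 = open(path, O_RDONLY | O_CREAT, 0644);
    if (fd1 < 0)
        return -1;
    fd2 = open(path, O_RDONLY);
    if (fd2 < 0) {
        close(fd1);
        return -1;
    }
    if (read_lock(fd1) < 0 || read_lock(fd2) < 0) {
        close(fd1);
        close(fd2);
        return -1;
    }

    *before = locked_for_others(path);
    close(fd2);               /* drops the process's lock on the file */
    *after = locked_for_others(path);

    close(fd1);
    return 0;
}
```

On a system that follows the rule described above, *before should come back as 1 and *after as 0, even though the first descriptor is still open when the second check is made.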

After a fork(), the parent process retains its file locks, but the child process does not. If child processes were to inherit locks, two processes would end up with a write lock on the same region of a file, which file locks are supposed to prevent.
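A small experiment can confirm that the child starts without the parent's locks. In this sketch (the function names are made up for the illustration), the parent takes a write lock and the child then requests the same lock through the descriptor it inherited. Because a process's own locks never conflict, the child would succeed if it had inherited the lock; being refused shows that it did not.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Try, without blocking, to lock the whole file. */
static int lock_whole_file(int fd, short type) {
    struct flock fl;

    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    fl.l_pid = 0;
    return fcntl(fd, F_SETLK, &fl);
}

/* Hypothetical demonstration: returns 1 if the child was refused
   the lock the parent holds, 0 if the child obtained it, and -1 on
   error. */
int child_is_refused_after_fork(int fd) {
    int status;
    pid_t child;

    if (lock_whole_file(fd, F_WRLCK) < 0)
        return -1;

    child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        if (lock_whole_file(fd, F_WRLCK) == 0)
            _exit(0);
        /* a conflict is reported as EAGAIN or EACCES */
        _exit((errno == EAGAIN || errno == EACCES) ? 1 : 2);
    }

    if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 2 ? -1 : WEXITSTATUS(status);
}
```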

File locks are inherited across an exec(), however. While POSIX does not define what happens to locks after an exec(), all variants of Unix preserve them.^[21]

^[21] The effect of fork() and exec() calls on file locks is the biggest difference between POSIX file locking (and hence lockf() file locking) and BSD's flock() file locking.

13.3.3. Mandatory Locks

Both Linux and System V provide mandatory locking, as well as normal locking. Mandatory locks are established and released through the same fcntl() mechanism that is used for advisory record locking. The locks are mandatory if the locked file's setgid bit is set but its group execute bit is not set. If this is not the case, advisory locking is used.
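The permission-bit rule can be written down as a short sketch. The function names here are hypothetical. Setting the bits is only part of the story on Linux, which is an assumption this example does not check: the file system must also be mounted with the mand option, and kernels from 5.15 onward no longer implement mandatory locking at all.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Return mode adjusted so that record locks on the file are
   mandatory: setgid bit on, group execute bit off. */
mode_t mandatory_mode(mode_t mode) {
    return (mode | S_ISGID) & ~(mode_t) S_IXGRP;
}

/* Return nonzero if a file with this mode is marked for mandatory
   locking. */
int is_mandatory_mode(mode_t mode) {
    return (mode & S_ISGID) && !(mode & S_IXGRP);
}

/* Hypothetical helper: mark an open file for mandatory locking.
   Returns 0 on success and -1 on error. */
int enable_mandatory_locking(int fd) {
    struct stat sb;

    if (fstat(fd, &sb) < 0)
        return -1;
    return fchmod(fd, mandatory_mode(sb.st_mode & 07777));
}
```

For a file with mode 0664, mandatory_mode() returns 02664, the same result as running chmod g+s,g-x on the file.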

When mandatory locking is enabled, the read() and write() system calls block when they conflict with locks that have been set. If a process tries to write() to a portion of a file that a different process has a read or write lock on, the process without the lock blocks until the lock is released. Similarly, read() calls block on regions that are included in mandatory write locks.

Mandatory record locking causes a larger performance loss than advisory locking, because every read() and write() call must be checked for conflicts with locks. It is also not as portable as POSIX advisory locks, so we do not recommend using mandatory locking in most applications.

13.3.4. Leasing a File

Both advisory and mandatory locking are designed to prevent a process from accessing a file, or part of a file, that another process is using. When a lock is in place, the process that needs access to the file has to wait for the process owning the lock to finish. This structure is fine for most applications, but occasionally a program would like to use a file until someone else needs it, and is willing to give up exclusive access to the file if necessary. To allow this, Linux provides file leases (other systems call these opportunistic locks, or oplocks).^[22]

^[22] By far the most common user of file leases is the samba file server, which uses file leases to allow clients to cache their writes to increase performance.

Putting a lease on a file allows a process to be notified (via a signal) when that file is accessed by another process. There are two types of leases available: read leases and write leases. A read lease causes a signal to be sent when the file is opened for writing, opened with O_TRUNC, or truncate() is called for that file. A write lease also sends a signal when the file is opened for reading.^[23] File leases work only for modifications made to the file by the same system running the application that owns the lease. If the file is a local file (not a file being accessed across the network), any appropriate file access triggers a signal. If the file is being accessed across the network, only processes on the same machine as the leaseholder cause the signal to be sent; accesses from any other machine proceed as if the lease were not in place.

^[23] If it seems a little strange that a write lease notifies the process of opening the file for reading, think of it from the point of view of the process taking out the lease. It would need to know if another process wanted to read from the file only if it was itself writing to that file.

The fcntl() system call is used to create, release, and inquire about file leases. Leases can be placed only on normal files (they cannot be placed on files such as pipes or directories), and write leases are granted only to the owner of a file. The first argument to fcntl() is the file descriptor we are interested in monitoring, and the second argument, command, specifies what operation to perform.

`F_SETLEASE`	A lease is created or released, depending on the value of the final argument to `fcntl()`; `F_RDLCK` creates a read lease, `F_WRLCK` creates a write lease, and `F_UNLCK` releases any lease that may be in place. If a new lease is requested, the new lease replaces any lease already in place. If an error occurs, a negative value is returned; zero or a positive value indicates success.^[24]
`F_GETLEASE`	The type of lease currently in place for the file is returned (one of `F_RDLCK`, `F_WRLCK`, or `F_UNLCK`).

^[24] Older kernels could return either zero or one on success, while newer ones always return zero on success. In either case, checking for negative or nonnegative works fine.

When one of the monitored events occurs on a leased file, the kernel sends the process holding the lease a signal. By default, SIGIO is sent, but the process can choose which signal is sent for that file by calling fcntl() with the second argument set to F_SETSIG and the final argument set to the signal that should be used instead.

Using F_SETSIG has one other important effect. By default, no siginfo_t is passed to the handler when SIGIO is delivered. If F_SETSIG has been used (even if the signal the kernel is told to deliver is SIGIO) and SA_SIGINFO was specified when the signal handler was registered, the file descriptor whose lease triggered the event is passed to the handler in the si_fd member of its siginfo_t argument. This allows a single signal to be used for leases on multiple files, with si_fd letting the signal handler know which file needs attention.^[25]

^[25] If one signal is used for leases on multiple files, make sure the signal is a real-time signal so that multiple lease events are queued. If a regular signal is used, signals may get lost if the lease events occur close together.

The only two system calls that can cause a signal to be sent for a leased file are open() and truncate(). When they are called by a process on a file that has a lease in place, those system calls block^[26] and the process holding the lease is sent a signal. The open() or truncate() completes after the lease has been removed from the file (or the file is closed by the process holding the lease, which causes the lease to be released). If the process holding the lease does not remove the lease within the number of seconds specified in the file /proc/sys/fs/lease-break-time, the kernel breaks the lease and lets the triggering system call complete.

^[26] Unless O_NONBLOCK was specified as a flag to open(), in which case EWOULDBLOCK would be returned.
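A leaseholder that wants to know how long it has to respond could read that file at startup. The following is a minimal sketch; lease_break_time() is a name invented here, and it assumes the file holds a single decimal number of seconds.

```c
#include <stdio.h>

/* Hypothetical helper: return the number of seconds the kernel
   waits before breaking a lease, or -1 if the value cannot be
   read. */
long lease_break_time(void) {
    FILE *f = fopen("/proc/sys/fs/lease-break-time", "r");
    long secs;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &secs) != 1)
        secs = -1;
    fclose(f);
    return secs;
}
```

A signal handler for lease events should finish flushing its cached data and release the lease well within this interval.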

Here is an example of using file leases to be notified when another process needs to access a file. It takes a list of files from the command line and places a write lease on each of them. When another process wants to access one of the files (even for reading, since a write lease was used), the program releases its lease on that file, allowing the other process to continue. It also prints a message saying which file was released.

    /* leases.c */

    #define _GNU_SOURCE

    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    const char ** fileNames;
    /* modified by the signal handler and read by main() */
    volatile sig_atomic_t numFiles;

    void handler(int sig, siginfo_t * siginfo, void * context) {
        /* When a lease is up, print a message and close the file.
           We assume that the first file we open will get file
           descriptor 3, the next 4, and so on. */

        write(1, "releasing ", 10);
        write(1, fileNames[siginfo->si_fd - 3],
              strlen(fileNames[siginfo->si_fd - 3]));
        write(1, "\n", 1);
        fcntl(siginfo->si_fd, F_SETLEASE, F_UNLCK);
        close(siginfo->si_fd);
        numFiles--;
    }

    int main(int argc, const char ** argv) {
        int fd;
        const char ** file;
        struct sigaction act;

        if (argc < 2) {
            fprintf(stderr, "usage: %s <filename>+\n", argv[0]);
            return 1;
        }

        /* Register the signal handler. Specifying SA_SIGINFO lets
           the handler learn which file descriptor had the lease
           expire. */
        act.sa_sigaction = handler;
        act.sa_flags = SA_SIGINFO;
        sigemptyset(&act.sa_mask);
        sigaction(SIGRTMIN, &act, NULL);

        /* Store the list of filenames in a global variable so that
           the signal handler can access it. */
        fileNames = argv + 1;
        numFiles = argc - 1;

        /* Open the files, set the signal to use, and create the
           lease */
        for (file = fileNames; *file; file++) {
            if ((fd = open(*file, O_RDONLY)) < 0) {
                perror("open");
                return 1;
            }

            /* We have to use F_SETSIG for the siginfo structure to
               get filled in properly */
            if (fcntl(fd, F_SETSIG, SIGRTMIN) < 0) {
                perror("F_SETSIG");
                return 1;
            }

            if (fcntl(fd, F_SETLEASE, F_WRLCK) < 0) {
                perror("F_SETLEASE");
                return 1;
            }
        }

        /* As long as files remain open, wait for signals. */
        while (numFiles)
            pause();

        return 0;
    }