Section 3.2. First Implementation Past the Post | Open Sources 2.0: The Continuing Evolution

3.2. First Implementation Past the Post

Any application program dealing with multiple access to files has to deal with file locking. File locking has several potential strategies, ranging from the "lock this file for my exclusive use" method, to the "lock these 4 bytes at offset 23 as I'm going to be reading from them soon" level of granularity. POSIX implements this kind of functionality via the fcntl() call, a sort of jack-of-all-trades for manipulating files (hence "fcntl," for "file control"). It's not important to know exactly how to program this call. Suffice it to say that a code fragment to set up such a byte range lock looks something like this:

int fd = open("/path/to/file", O_RDWR);

Now, set up the struct flock structure to describe the kind of byte range lock we need:

int ret = fcntl(fd, F_SETLKW, &flock_struct);

If ret is zero, we got the lock. Looks simple, right? The byte range lock we got on the region of the file is advisory. This means that other processes can ignore it and are not restricted in terms of reading or writing the byte range covered by the region (that's a difference from the Win32 way of doing things, in which locks are mandatory; if a lock is in place on a region, no other process can write to that region, even if it doesn't test for locks). An existing lock can be detected by another process doing its own fcntl() call, asking to lock its own region of interest. Another useful feature is that once the file descriptor open on the file (int fd in the previous example) is closed, the lock is silently removed. This is a perfectly acceptable and rational way of specifying a file locking primitive; just what you'd want.
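As a minimal sketch of the two ideas above (taking an advisory byte range lock, and detecting it from another process with its own fcntl() call), the fragment can be filled out as follows. The helper names make_lock, range_locked_by_other, and demo_advisory_lock are hypothetical and chosen only for illustration, as is the 4-byte range at offset 23. The sketch uses the non-blocking F_SETLK rather than F_SETLKW, and it forks a child to play the part of the "other process."

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Describe a lock of the given type over `len` bytes starting at `start`. */
static struct flock make_lock(short type, off_t start, off_t len)
{
    struct flock fl = {0};

    assert(type == F_RDLCK || type == F_WRLCK || type == F_UNLCK);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;   /* l_start is an absolute file offset */
    fl.l_start = start;
    fl.l_len = len;
    return fl;
}

/*
 * Ask, from a separate process, whether a write lock on the range would
 * conflict with an existing lock. Returns 1 if the range is locked by
 * someone else, 0 if it is free, and -1 on error.
 */
static int range_locked_by_other(const char *path, off_t start, off_t len)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: open the file independently and probe with F_GETLK. */
        struct flock probe = make_lock(F_WRLCK, start, len);
        int fd = open(path, O_RDWR);

        if (fd < 0 || fcntl(fd, F_GETLK, &probe) == -1)
            _exit(2);
        /* F_GETLK rewrites l_type to F_UNLCK when nothing conflicts. */
        _exit(probe.l_type == F_UNLCK ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 2 ? -1 : WEXITSTATUS(status);
}

/* Lock 4 bytes at offset 23 and report whether another process sees it. */
static int demo_advisory_lock(const char *path)
{
    struct flock fl = make_lock(F_WRLCK, 23, 4);
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    int seen;

    if (fd < 0)
        return -1;
    if (fcntl(fd, F_SETLK, &fl) == -1) {
        close(fd);
        return -1;
    }
    seen = range_locked_by_other(path, 23, 4);
    close(fd);                /* closing the descriptor drops the lock */
    return seen;
}
```

Calling demo_advisory_lock() on a scratch file should report that the child saw the lock, and a later probe of the same range, made after the descriptor has been closed, should find it free again.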

However, modern Unix processes are not single threaded. They commonly consist of a collection of separate threads of execution, separately scheduled by the kernel. Because the lock primitive has a per-process scope, this means that if separate threads in the same process ask for a lock over the same area, the requests won't conflict. In addition, because the number of lock requests by a single process over the same region is not recorded (according to the spec), you can lock the region 10 times, but you need to unlock it only once. This is sometimes what you want, but not always: consider a library routine that needs to access a region of a file but doesn't know whether the calling process has the file open. Even if an open file descriptor is passed into the library, the library code can't take any locks, because its unlock would also release any lock the caller holds on the same region. It can never know if it is safe to unlock again without race conditions.
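The "lock it 10 times, unlock it once" behavior can be sketched as follows. This is an illustration under stated assumptions, not a definitive test: it repeats the lock request from a single thread rather than from several, uses the non-blocking F_SETLK, and checks the result from a forked child. The helper names make_lock, range_locked_by_other, and relock_then_unlock_once are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Describe a lock of the given type over `len` bytes starting at `start`. */
static struct flock make_lock(short type, off_t start, off_t len)
{
    struct flock fl = {0};

    assert(type == F_RDLCK || type == F_WRLCK || type == F_UNLCK);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    return fl;
}

/* Probe the range from a child process: 1 = locked, 0 = free, -1 = error. */
static int range_locked_by_other(const char *path, off_t start, off_t len)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct flock probe = make_lock(F_WRLCK, start, len);
        int fd = open(path, O_RDWR);

        if (fd < 0 || fcntl(fd, F_GETLK, &probe) == -1)
            _exit(2);
        _exit(probe.l_type == F_UNLCK ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 2 ? -1 : WEXITSTATUS(status);
}

/*
 * Lock bytes 0..3 `times` times from the same process, then unlock once.
 * Stores what another process sees before and after the single unlock.
 * Returns 0 on success, -1 on error.
 */
static int relock_then_unlock_once(const char *path, int times,
                                   int *held_before, int *held_after)
{
    struct flock un = make_lock(F_UNLCK, 0, 4);
    int fd = open(path, O_RDWR | O_CREAT, 0600);

    if (fd < 0)
        return -1;

    /* Every request succeeds: a process never conflicts with itself. */
    for (int i = 0; i < times; i++) {
        struct flock fl = make_lock(F_WRLCK, 0, 4);

        if (fcntl(fd, F_SETLK, &fl) == -1) {
            close(fd);
            return -1;
        }
    }
    *held_before = range_locked_by_other(path, 0, 4);

    /* One unlock releases the region; no request count was kept. */
    if (fcntl(fd, F_SETLK, &un) == -1) {
        close(fd);
        return -1;
    }
    *held_after = range_locked_by_other(path, 0, 4);

    close(fd);
    return 0;
}
```

With times set to 10, the child should see the range as locked before the unlock and free after it, which is exactly why a library cannot safely pair its own lock and unlock calls around a region the caller may also have locked.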

This is an example of a POSIX interface not being future proofed against modern techniques such as threading. A simple amendment to the original primitive allowing a user-defined "locking context" (like a process ID) to be entered in the struct flock structure used to define the lock would have fixed this problem, along with extra flags allowing the number of locks per context to be recorded if needed.

But it gets worse. Consider the following code:

int second_fd;
int ret;
struct flock lock;
int fd = open("/path/to/file", O_RDWR);

/* Set up the "struct flock" structure to describe the
   kind of byte range lock we need. */
lock.l_type = F_WRLCK;
lock.l_whence = SEEK_SET;
lock.l_start = 0;
lock.l_len = 4;
lock.l_pid = getpid();

ret = fcntl(fd, F_SETLKW, &lock);
/* Assume we got the lock above (i.e. ret == 0). */

/* Get a second file descriptor open on the original file.
   Assume this succeeds. */
second_fd = dup(fd);

/* Now immediately close it again. */
ret = close(second_fd);

What do you think the effect of this code on the lock created on the first file descriptor should be (so long as the close() call returns zero)? If you think it should be silently removed when the second file descriptor is closed, congratulations: you have the same warped mind as the people who implemented the POSIX spec. Yes, that's correct. Any successful close() call on any file descriptor referencing a file with locks will drop all the locks on that file, even if they were obtained on another, still-open file descriptor.
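The effect can be observed from outside the process. The following is a minimal sketch, assuming a POSIX system and a forked child acting as the observer; it uses the non-blocking F_SETLK in place of F_SETLKW, and the helper names make_lock, range_locked_by_other, and dup_close_drops_lock are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Describe a lock of the given type over `len` bytes starting at `start`. */
static struct flock make_lock(short type, off_t start, off_t len)
{
    struct flock fl = {0};

    assert(type == F_RDLCK || type == F_WRLCK || type == F_UNLCK);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    return fl;
}

/* Probe the range from a child process: 1 = locked, 0 = free, -1 = error. */
static int range_locked_by_other(const char *path, off_t start, off_t len)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct flock probe = make_lock(F_WRLCK, start, len);
        int fd = open(path, O_RDWR);

        if (fd < 0 || fcntl(fd, F_GETLK, &probe) == -1)
            _exit(2);
        _exit(probe.l_type == F_UNLCK ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 2 ? -1 : WEXITSTATUS(status);
}

/*
 * Lock bytes 0..3 through `fd`, then dup() the descriptor and close the
 * duplicate. Stores what another process sees before and after the close.
 * Returns 0 on success, -1 on error.
 */
static int dup_close_drops_lock(const char *path,
                                int *held_before, int *held_after)
{
    struct flock fl = make_lock(F_WRLCK, 0, 4);
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    int second_fd;

    if (fd < 0)
        return -1;
    if (fcntl(fd, F_SETLK, &fl) == -1) {
        close(fd);
        return -1;
    }
    *held_before = range_locked_by_other(path, 0, 4);

    /* Duplicate the descriptor and immediately close the duplicate. */
    second_fd = dup(fd);
    if (second_fd < 0 || close(second_fd) == -1) {
        close(fd);
        return -1;
    }

    /* fd is still open here, yet the lock taken through it is gone. */
    *held_after = range_locked_by_other(path, 0, 4);

    close(fd);
    return 0;
}
```

On a system that follows the POSIX rule, the observer should see the range locked before the duplicate is closed and free afterward, even though the original descriptor was never touched.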

Let me be clear: this behavior is never what you want. Even experienced programmers are surprised by this behavior, because it makes no sense. When I've described this to Linux kernel hackers, their response has been one of stunned silence, followed by "but why would it do that?"^[2]

^[2] To discover if this functionality was actually correctly used by any application program or if anything really depended on it, Andrew Tridgell, the original author of Samba, once hacked the kernel on his Linux laptop to write a kernel debug message if ever this condition occurred. After a week of continuous use, he found one message logged. When he investigated, it turned out to be a bug in the exportfs NFS file exporting command, whereby a library routine was opening and closing the /etc/exports file that had been opened and locked by the main exportfs code. Obviously, the authors didn't expect it to do that either.

The reason is historical and, in my opinion, reflects a flaw in the POSIX standards process, one that hopefully won't be repeated in the future. By talking to longtime BSD hacker and POSIX standards committee member Kirk McKusick (he of the BSD daemon artwork), I finally tracked down why this insane behavior was standardized by the POSIX committee. As he recalls, AT&T took the current behavior to the standards committee as a proposal for byte range locking, as this was how their current code implementation worked. The committee asked other ISVs if this was how locking should be done. The ISVs who cared about byte range locking were the large (at the time) database vendors, such as Oracle, Sybase, and Informix. All these companies did byte range locking within their own applications, and none of them depended on, or needed, the underlying operating system to provide locking services for them. So their unanimous answer was "we don't care." In the absence of any strong negative feedback on a proposal, the committee added it "as is" and took as the desired behavior the specifics of the first implementation, the brain-dead one from AT&T.

The "first implementation past the post" style of standardization has saddled POSIX systems with one of the most broken locking implementations in computing history. My hope is that eventually Linux will provide a sane superset of this functionality that can be adopted by other Unixes and eventually find its way back into POSIX.

OK, having dumped on POSIX enough, let's look at one of the things that POSIX really got right and that is an example worth following in the future.