Section 9.2. Understanding System Calls | Linux Application Development (paperback) (2nd Edition)

9.2. Understanding System Calls

This book mentions system calls (syscalls, for short) repeatedly because they are fundamental to the programming environment. At first glance, they look just like normal C function calls. That is no accident; they are function calls, just a special variety. To understand the difference, you need to have a basic understanding of the structure of the operating system.

Although there are many pieces of code that make up the Linux operating system (utility programs, applications, programming libraries, device drivers, file systems, memory management, and so on), all those pieces run in one of two contexts: user mode or kernel mode.

When you write a program, the code that you write runs in user mode. Device drivers and file systems, by contrast, run in kernel mode. In user mode, programs are strictly protected from damaging each other or the rest of the system. Code that runs in kernel mode has full access to the machine to do, and break, anything.

For a device driver to manipulate the hardware device it is designed to control, it needs full access to it. The device needs to be protected from arbitrary programs so that programs cannot damage themselves or each other by damaging or confusing the device. The memory it runs in is also protected from the ravages of arbitrary programs.

All this code running in kernel mode exists solely to provide services to code running in user mode. A system call is how application code running in user mode requests protected code running in kernel mode to provide a service.

Take allocating memory, for instance. It is protected, kernel-mode code that must allocate the physical memory for the process, but it is the process itself that must ask for the memory. As another example, take file systems, which need to be protected to maintain coherent data on disk (or over the network), but it is your everyday, run-of-the-mill process that actually needs to read files from the file system.

The ugly details of calling through the user/kernel space barrier are mostly hidden in the C library. Calling through that barrier does not use normal function calls; it uses an ugly interface that is optimized for speed and has significant restrictions. The C library hides most of the interface from you by providing you with normal C functions wrapped around the system calls. However, you will be able to use the functions better if you have some idea what is going on underneath.
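To make the wrapper idea concrete, here is a minimal sketch that asks for the process ID in two ways: through the ordinary getpid() wrapper, and through the C library's generic syscall() function, which takes a system call number. The helper names pid_via_wrapper() and pid_via_raw_syscall() are made up for this illustration, and the sketch assumes a Linux system whose C library provides syscall() and SYS_getpid.

```c
#include <sys/syscall.h>   /* SYS_getpid */
#include <sys/types.h>     /* pid_t */
#include <unistd.h>        /* getpid(), syscall() */

/* Hypothetical helper: use the normal C library wrapper. */
pid_t pid_via_wrapper(void)
{
    return getpid();
}

/* Hypothetical helper: name the system call by number and let
 * syscall() carry the request across the user/kernel barrier. */
pid_t pid_via_raw_syscall(void)
{
    return (pid_t) syscall(SYS_getpid);
}
```

Both helpers should report the same process ID. In ordinary code you would use only the wrapper; the second form is shown only to hint at what the wrapper does for you.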

9.2.1. System Call Limitations

Kernel mode is protected from the rampages of user mode. One of those protections is that the kind of data that can be passed back and forth between kernel mode and user mode is limited to what can be easily verified, and it follows strict conventions.

Each argument that is passed from user mode to kernel mode is the same length, which is almost always the native word size used by the machine for pointers. This size is big enough to pass long integer arguments, as well as pointers. char and short variables will be promoted to a larger type by C before being passed in.

The return type is limited to a signed word. The first few hundred small negative integers are reserved as error codes and have a common meaning across system calls. This means that system calls that return a pointer cannot return the few addresses at the very top of the virtual address space, because those values would be indistinguishable from error codes. Fortunately, those addresses are in reserved space and would never be returned anyway, so the signed words that are returned can be cast to pointers without a problem.

Unlike the C calling convention, in which C structures can be passed by value on the stack, you cannot pass a structure by value from user mode to kernel mode, nor can the kernel return a structure to user mode. You can pass large data items only by reference. Pass pointers to structures, just as you always pass pointers to anything that may be modified.
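As a small illustration of passing a structure by reference, the sketch below calls stat(), which fills in a struct stat that the caller owns. The helper name is_directory() is hypothetical and exists only for this example.

```c
#include <sys/stat.h>   /* stat(), struct stat, S_ISDIR() */

/* Hypothetical helper: returns 1 if path names a directory, 0 if it
 * names something else, and -1 if stat() failed (errno is then set). */
int is_directory(const char *path)
{
    struct stat sb;              /* storage lives in the caller's memory */

    if (stat(path, &sb) < 0)     /* only the pointer crosses the barrier */
        return -1;
    return S_ISDIR(sb.st_mode) ? 1 : 0;
}
```

The kernel writes its answer into the structure through the pointer; the return value itself carries only success or failure.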

9.2.2. System Call Return Codes

The return codes that are reserved across all system calls are universal error return codes, which are all small negative numbers. The C library checks for errors whenever a system call returns. If an error has occurred, the library stuffs the value of the error in the global variable errno.^[1] Most of the time, to check for errors, all you have to do is see if the return code was negative. The error codes are defined in <errno.h>, and you can compare errno to any error number defined there that you want to handle in a special way.

^[1] If you are using threads, the library actually keeps the error where an errno() function that knows what thread is current can get at it, because different threads might have different current error return codes. But you can ignore that because it ends up working the same way as an errno variable.
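The conversion the library performs can be sketched roughly as follows. This is not the C library's actual code. The function name decode_raw_result() and the constant HYPOTHETICAL_MAX_ERRNO are invented for this illustration, and the value 4095 is an assumption based on the range the Linux kernel reserves for error codes.

```c
#include <errno.h>

/* Assumption: raw results from -1 down to -4095 are error codes. */
#define HYPOTHETICAL_MAX_ERRNO 4095L

/* Hypothetical helper: turn a raw signed-word result into the
 * convention the C wrappers present (-1 plus errno on failure). */
long decode_raw_result(long raw)
{
    if (raw < 0 && raw >= -HYPOTHETICAL_MAX_ERRNO) {
        errno = (int) -raw;   /* store the positive error number */
        return -1;            /* what the caller of the wrapper sees */
    }
    return raw;               /* success: pass the value through */
}
```

A raw result outside the reserved range, even a negative one, is passed through unchanged, which is why large pointer-sized values can still be returned safely.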

The errno variable has another use. The C library provides three ways to get at strings designed to describe the error you have just encountered:

 perror()

Prints an error message. Pass it a string with information about what the code in question was trying to do.
    if ((file = open(DB_PATH, O_RDONLY)) < 0) {
        perror("could not open database file");
    }
This causes perror() to print an error describing the error that just occurred, along with the explanation of what it was trying to do, like this:
 could not open database file: No such file or directory 
It is generally a good idea to make your arguments to perror() unique throughout your program so that when you get bug reports with a report from perror(), you know exactly where to start looking. Note that there is no newline character in the string passed to perror(). You are passing it only one part of a line, and it prints the newline itself.

 strerror()

Returns a statically allocated string describing the error passed as the only argument. Use this when building your own version of perror(), for instance. If you want to save a copy of the string, use strdup() to do so; the string returned by strerror() will be overwritten on the next call to strerror().
    if ((file = open(DB_PATH, O_RDONLY)) < 0) {
        fprintf(stderr,
                "could not open database file %s, %s\n",
                DB_PATH, strerror(errno));
    }

 sys_errlist

A poor alternative to strerror(). sys_errlist is an array of sys_nerr pointers to static, read-only character strings that describe errors. An attempt to write to those strings causes a segmentation violation and a core dump.
    if ((file = open(DB_PATH, O_RDONLY)) < 0) {
        if (errno < sys_nerr) {
            fprintf(stderr, "could not open database file %s, %s\n",
                    DB_PATH, sys_errlist[errno]);
        }
    }
This is neither standard nor portable, and it is mentioned here only because you are likely to find code that relies on it. Convert each such instance to use strerror() and you will do the world a service.

If you are not going to use errno immediately after generating the error, you must save a copy. Any library function might reset it to any value, because it may make system calls that you do not know are being made, and some library functions may set errno without making any system calls.
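One way to follow this advice is sketched below: the error is copied into a local variable before any other library function is called, and restored afterward so that the caller still sees the original error. The function name open_logged() is hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: open path read-only, report any failure on
 * stderr, and leave errno holding the error from open(). */
int open_logged(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd < 0) {
        int saved = errno;   /* copy before calling anything else */

        fprintf(stderr, "could not open %s: %s\n", path, strerror(saved));
        errno = saved;       /* fprintf() may have changed errno */
    }
    return fd;
}
```

Restoring errno before returning is a design choice made for this sketch; it lets the caller test for specific errors as if the logging had never happened.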

9.2.3. Using System Calls

The interface that you as a programmer are expected to work with is the set of C library wrappers for the system calls. Therefore, we use system call through the rest of this book to mean the C wrapper function that you call to perform a system call, rather than the ugly interface that the C library kindly hides from you.

Most, but not all, system calls are declared in <unistd.h>. The <unistd.h> file is really a catch-all for system calls that do not seem to fit anywhere else. In order to determine which include files to use, you will generally need to use the system man pages. Although the function descriptions in the man pages are often terse, the man pages do accurately state, right at the top, which include files need to be included to use the function.

There is one snag here that is endemic to Unix systems. The system calls are documented in a separate manual page section from the library functions, and you will be using library functions to access system calls. Where the library functions differ from the system calls, there are separate man pages for the library functions and the system calls. This would not be so bad except that if there are two man pages for a function, you will nearly always want to read the one describing the library function with that name. But the system calls are documented in section 2, and the library functions in section 3, and because man gives lower numbers precedence, you will consistently be shown the wrong function.

You should not simply specify the section number, however. System calls that use the most minimal wrappers in the C library are not documented as part of the C library, so man 3 function will not find them. In order to make sure you have read all the information you need, first look up the man page without specifying the section. If it is a section 2 man page, check to see if there is a section 3 man page by the same name. If, as happens with open(), you get a section 1 man page, look explicitly in sections 2 and 3.

There is, fortunately, another way around this problem. Many versions of the man program, including the one on most Linux systems, allow you to specify an alternate search path for man pages. Read the man page for man itself (run man man) to determine if your version of man supports the MANSECT environment variable and the -S argument to the man command. If so, you can set MANSECT to something like 3:2:1:4:5:6:7:8:tcl:n:l:p:o. Look at your man configuration file (/etc/man.config on most Linux systems) to determine the current setting of MANSECT.

Most system calls return 0 to indicate success, and they return a negative value to indicate an error. Because of this, in many cases, a simple form of error handling is appropriate.

    if (ioctl(fd, FN, data)) {
        /* error handling based on errno */
    }

Also common is the following form.

    if (ioctl(fd, FN, data) < 0) {
        /* error handling based on errno */
    }

For the system calls that return 0 on success, these two cases are identical. In your own code, choose what suits you best. Be aware that you will see all sorts of conventions in others' code.
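The two forms stop being interchangeable for calls that return a meaningful non-negative value on success. A minimal sketch, using open() (which returns a file descriptor) and close() (which returns 0), follows; the function name can_open_readonly() is made up for this example.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if path can be opened for reading
 * and closed again cleanly, 0 otherwise. */
int can_open_readonly(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd < 0)        /* "if (fd)" would treat a valid descriptor as failure */
        return 0;
    if (close(fd))     /* close() returns 0 on success, so either form works */
        return 0;
    return 1;
}
```

Testing for a negative value is the form that stays correct for both kinds of call, which is one reason you will see it so often.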

9.2.4. Common Error Return Codes

You have likely already seen the error messages for many of the commonly occurring error codes. Some of these explanations may seem confusing. Without knowing what you can do on a Linux system, it is hard to understand the errors you might get while you are working on one. Read this list now to get a sense of what errors exist, and then read it again after you have read this whole book, to gain a more thorough understanding.

For many of the error return codes, we give a sample system call or two likely to trigger the error message in common circumstances. This does not mean that those system calls are the only ones that trigger those errors. Consider them examples to elucidate the output of perror(), the brief descriptions in <asm/errno.h>, or man 3 errno.

Use the man pages to determine which errors to expect from a specific system call. In particular, use man 3 errno to get a list of error codes defined by POSIX. However, understand that this sometimes changes, and the man pages may not be completely up to date. If a system call returns an error code that you do not expect, presume that the man page is out of date rather than that the system call is broken. The Linux source code is maintained more carefully than the documentation.

`E2BIG`	The argument list is too long. When trying to `exec()` a new process, there is a limit to the length of the argument list you can give. See Chapter 10.
`EACCES`	Access would be denied. This is returned by the `access()` system call, explained in Chapter 11, and is more an informational return code than a true error condition.
`EAGAIN`	Returned when you attempt to do nonblocking I/O and no data is available. `EWOULDBLOCK` is a synonym for `EAGAIN`. If you had been doing blocking I/O, the system call would have blocked and waited for data.
`EBADF`	Bad file number. You have passed a file number that does not reference an open file to `read()`, `close()`, `ioctl()`, or another system call that takes a file number argument.
`EBUSY`	The `mount()` system call returns this error if you attempt to mount a file system that is already mounted or unmount a file system that is currently in use.
`ECHILD`	No child processes. Returned by the `wait()` family of system calls. See Chapter 10.
`EDOM`	Not a system call error, but an error from the system's C library. `EDOM` is set by math functions if an argument is out of range. (This is `EINVAL` for the function's domain.) For example, the `sqrt()` function does not know about complex numbers and therefore does not approve of a negative argument.
`EEXIST`	Returned by `creat()`, `mknod()`, or `mkdir()` if the file already exists, or by `open()` in the same case if you specified `O_CREAT` and `O_EXCL`.
`EFAULT`	A bad pointer (one that points to inaccessible memory) was passed as an argument to a system call. Accessing the same pointer from within the user-space program that made the system call would result in a segmentation fault.
`EFBIG`	Returned by `write()` if you attempt to write a file longer than the file system can logically handle (this does not include simple physical space restrictions).^[2]
`EINTR`	System call was interrupted. Interruptible system calls are explained in Chapter 12.
`EINVAL`	Returned if the system call received an invalid argument.
`EIO`	I/O error. This is usually generated by a device driver to indicate a hardware error or unrecoverable problem communicating with the device.
`EISDIR`	Returned by system calls that require a file name, such as `unlink()`, if the final pathname component is a directory rather than a file and the operation in question cannot be applied to a directory.
`ELOOP`	Returned by system calls that take a path if too many symbolic links in a row (that is, symbolic links pointing to symbolic links pointing to symbolic links pointing to...) were encountered while parsing the path. The current limit is 16 symbolic links in a row.
`EMFILE`	Returned if no more files can be opened by the calling process.
`EMLINK`	Returned by `link()` if the file being linked to already has the maximum number of links for the file system it is on (32,000 is currently the maximum on the standard Linux file system).
`ENAMETOOLONG`	A pathname was too long, either for the entire system or for the file system you were trying to access.
`ENFILE`	Returned if no more files can be opened by any process on the system.
`ENODEV`	Returned by `mount()` if the requested file system type is not available. Returned by `open()` if you attempt to open a special file for a device that does not have an associated device driver in the kernel.
`ENOENT`	No such file or directory. Returned when you try to access a file or directory that does not exist.
`ENOEXEC`	Executable format error. This might happen if you attempt to run an (obsolete) a.out binary on a system without support for a.out binaries. It will also happen if you attempt to run an ELF binary built for another CPU architecture.
`ENOMEM`	Out of memory. Returned by the `brk()` and `mmap()` functions if they fail to allocate memory.
`ENOSPC`	Returned by `write()` if you attempt to write a file longer than the file system has space for.
`ENOSYS`	System call is not implemented. Usually caused by running a recent binary on an old kernel that does not implement the system call.
`ENOTBLK`	The `mount()` system call returns this error if you attempt to mount as a file system a file that is not a block device special file.
`ENOTDIR`	An intermediate pathname component (that is, a directory name specified as part of a path) exists, but it is not a directory. Returned by any system call that takes a file name.
`ENOTEMPTY`	Returned by `rmdir()` if the directory you are trying to remove is not empty.
`ENOTTY`	Generally occurs when an application that is attempting to do terminal control is run with its input or output set to a pipe, but it can happen whenever you try to perform an operation on the wrong type of device. The standard error message for this, "not a typewriter," is rather misleading.
`ENXIO`	No such device or address. Usually generated by attempting to open a device special file that is associated with a piece of hardware that is not installed or configured.
`EPERM`	The process has insufficient permissions to complete the operation. This error most commonly happens with file operations. See Chapter 11.
`EPIPE`	Returned by `write()` if the reading end of the pipe or socket is closed and `SIGPIPE` is caught or ignored. See Chapter 12.
`ERANGE`	Not a system call error. `ERANGE` is set by math functions if the result is not representable by its return type, and by some functions if they are passed too short a buffer for a string return value. (This is `EINVAL` for the range.)
`EROFS`	Returned by `write()` if you attempt to write to a read-only file system.
`ESPIPE`	Returned by `lseek()` if you attempt to seek on a nonseekable file descriptor (including file descriptors for pipes, named pipes, and sockets). See Chapter 11 and Chapter 17.
`ESRCH`	No such process. See Chapter 10.
`ETXTBSY`	Returned by `open()` if you attempt to open, with write mode enabled, an executable file or shared library that is currently being run, or any other file which has been mapped into memory with the `MAP_DENYWRITE` flag set (see page 270). To work around this, rename the file, then make a new copy with the same name as the old one and work with the new copy. See Chapter 11 and its discussion of inodes for why this happens.
`EXDEV`	Returned by `link()` if the source and destination files are not on the same file system.

^[2] It also occurs if you write a file longer than your soft resource limit for file size and have changed the default disposition of the SIGXFSZ signal. See Table 10.2 and page 221 for more information on file size limits.
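A few of the codes in the table are easy to provoke on purpose, which can help when you are getting a feel for them. The sketch below is only an illustration: the helper names are hypothetical, and the path /no-such-directory-example/file is assumed not to exist on your system.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Each hypothetical helper provokes one error and returns the errno
 * value it observed, or 0 if the call unexpectedly succeeded. */

int errno_from_bad_descriptor(void)      /* expected: EBADF */
{
    errno = 0;
    if (close(-1) < 0)                   /* -1 is never an open file */
        return errno;
    return 0;
}

int errno_from_missing_file(void)        /* expected: ENOENT */
{
    int fd;

    errno = 0;
    fd = open("/no-such-directory-example/file", O_RDONLY);
    if (fd < 0)
        return errno;
    close(fd);
    return 0;
}

int errno_from_seek_on_pipe(void)        /* expected: ESPIPE */
{
    int fds[2];
    int err = 0;

    if (pipe(fds) < 0)
        return errno;
    if (lseek(fds[0], 0, SEEK_SET) < 0)  /* pipes are not seekable */
        err = errno;
    close(fds[0]);
    close(fds[1]);
    return err;
}
```

Comparing the returned values against the names in <errno.h>, rather than against literal numbers, keeps code like this portable across architectures.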

A few other relatively common error return codes happen only in regard to networking; see page 469 for more information.