1.4. System Calls

Armed with our general knowledge of how MINIX 3 deals with processes and files, we can now begin to look at the interface between the operating system and its application programs, that is, the set of system calls. Although this discussion specifically refers to POSIX (International Standard 9945-1), hence also to MINIX 3, UNIX, and Linux, most other modern operating systems have system calls that perform the same functions, even if the details differ. Since the actual mechanics of issuing a system call are highly machine dependent, and often must be expressed in assembly code, a procedure library is provided so that system calls can be made from C programs.

It is useful to keep the following in mind: any single-CPU computer can execute only one instruction at a time. If a process is running a user program in user mode and needs a system service, such as reading data from a file, it has to execute a trap or system call instruction to transfer control to the operating system. The operating system then figures out what the calling process wants by inspecting the parameters. Then it carries out the system call and returns control to the instruction following the system call. In a sense, making a system call is like making a special kind of procedure call, only system calls enter the kernel or other privileged operating system components and procedure calls do not.


To make the system call mechanism clearer, let us take a quick look at read. It has three parameters: the first one specifying the file, the second one specifying the buffer, and the third one specifying the number of bytes to read. A call to read from a C program might look like this:

count = read(fd, buffer, nbytes); 


The system call (and the library procedure) return the number of bytes actually read in count. This value is normally the same as nbytes, but may be smaller, if, for example, end-of-file is encountered while reading.

If the system call cannot be carried out, either due to an invalid parameter or a disk error, count is set to -1, and the error number is put in a global variable, errno. Programs should always check the results of a system call to see if an error occurred.
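The checking described above can be sketched as follows. The helper name checked_read is hypothetical, not part of any library; it simply wraps read, reports a failure using errno, and passes the result back to the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: read up to nbytes from fd and report any error.
 * Returns the number of bytes read (0 means end-of-file) or -1. */
ssize_t checked_read(int fd, void *buffer, size_t nbytes)
{
    assert(buffer != NULL);
    ssize_t count = read(fd, buffer, nbytes);
    if (count == -1) {
        int saved = errno;          /* fprintf may overwrite errno */
        fprintf(stderr, "read failed: %s\n", strerror(saved));
        errno = saved;              /* leave the error number for the caller */
    }
    return count;
}
```

A caller must still compare the result with nbytes, since a short count is not an error.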

MINIX 3 has a total of 53 main system calls. These are listed in Fig. 1-9, grouped for convenience in six categories. A few other calls exist, but they have very specialized uses so we will omit them here. In the following sections we will briefly examine each of the calls of Fig. 1-9 to see what it does. To a large extent, the services offered by these calls determine most of what the operating system has to do, since the resource management on personal computers is minimal (at least compared to big machines with many users).

Figure 1-9. The main MINIX system calls. fd is a file descriptor; n is a byte count.

Process management
  pid = fork()                        Create a child process identical to the parent
  pid = waitpid(pid, &statloc, opts)  Wait for a child to terminate
  s = wait(&status)                   Old version of waitpid
  s = execve(name, argv, envp)        Replace a process core image
  exit(status)                        Terminate process execution and return status
  size = brk(addr)                    Set the size of the data segment
  pid = getpid()                      Return the caller's process id
  pid = getpgrp()                     Return the id of the caller's process group
  pid = setsid()                      Create a new session and return its proc. group id
  l = ptrace(req, pid, addr, data)    Used for debugging

Signals
  s = sigaction(sig, &act, &oldact)   Define action to take on signals
  s = sigreturn(&context)             Return from a signal
  s = sigprocmask(how, &set, &old)    Examine or change the signal mask
  s = sigpending(set)                 Get the set of blocked signals
  s = sigsuspend(sigmask)             Replace the signal mask and suspend the process
  s = kill(pid, sig)                  Send a signal to a process
  residual = alarm(seconds)           Set the alarm clock
  s = pause()                         Suspend the caller until the next signal

File management
  fd = creat(name, mode)              Obsolete way to create a new file
  fd = mknod(name, mode, addr)        Create a regular, special, or directory i-node
  fd = open(file, how, ...)           Open a file for reading, writing or both
  s = close(fd)                       Close an open file
  n = read(fd, buffer, nbytes)        Read data from a file into a buffer
  n = write(fd, buffer, nbytes)       Write data from a buffer into a file
  pos = lseek(fd, offset, whence)     Move the file pointer
  s = stat(name, &buf)                Get a file's status information
  s = fstat(fd, &buf)                 Get a file's status information
  fd = dup(fd)                        Allocate a new file descriptor for an open file
  s = pipe(&fd[0])                    Create a pipe
  s = ioctl(fd, request, argp)        Perform special operations on a file
  s = access(name, amode)             Check a file's accessibility
  s = rename(old, new)                Give a file a new name
  s = fcntl(fd, cmd, ...)             File locking and other operations

Directory and file system management
  s = mkdir(name, mode)               Create a new directory
  s = rmdir(name)                     Remove an empty directory
  s = link(name1, name2)              Create a new entry, name2, pointing to name1
  s = unlink(name)                    Remove a directory entry
  s = mount(special, name, flag)      Mount a file system
  s = umount(special)                 Unmount a file system
  s = sync()                          Flush all cached blocks to the disk
  s = chdir(dirname)                  Change the working directory
  s = chroot(dirname)                 Change the root directory

Protection
  s = chmod(name, mode)               Change a file's protection bits
  uid = getuid()                      Get the caller's uid
  gid = getgid()                      Get the caller's gid
  s = setuid(uid)                     Set the caller's uid
  s = setgid(gid)                     Set the caller's gid
  s = chown(name, owner, group)       Change a file's owner and group
  oldmask = umask(complmode)          Change the mode mask

Time management
  seconds = time(&seconds)            Get the elapsed time since Jan. 1, 1970
  s = stime(tp)                       Set the elapsed time since Jan. 1, 1970
  s = utime(file, timep)              Set a file's "last access" time
  s = times(buffer)                   Get the user and system times used so far

This is a good place to point out that the mapping of POSIX procedure calls onto system calls is not necessarily one-to-one. The POSIX standard specifies a number of procedures that a conformant system must supply, but it does not specify whether they are system calls, library calls, or something else. In some cases, the POSIX procedures are supported as library routines in MINIX 3. In others, several required procedures are only minor variations of one another, and one system call handles all of them.

1.4.1. System Calls for Process Management

The first group of calls in Fig. 1-9 deals with process management. Fork is a good place to start the discussion. Fork is the only way to create a new process in MINIX 3. It creates an exact duplicate of the original process, including all the file descriptors and registers: everything. After the fork, the original process and the copy (the parent and child) go their separate ways. All the variables have identical values at the time of the fork, but since the parent's data are copied to create the child, subsequent changes in one of them do not affect the other one. (The program text, which is unchangeable, is shared between parent and child.) The fork call returns a value, which is zero in the child and equal to the child's process identifier or PID in the parent. Using the returned PID, the two processes can see which one is the parent process and which one is the child process.
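A minimal sketch of these two points (the return value of fork and the separate copies of the data) is given below. The function name is hypothetical and chosen only for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns the parent's copy of `value` after the child
 * has changed its own copy, or -1 if fork or waitpid fails. */
int parent_value_after_child_write(void)
{
    int value = 1;
    pid_t pid = fork();
    if (pid < 0)
        return -1;                /* no child was created */
    if (pid == 0) {               /* fork returned 0: we are the child */
        value = 99;               /* changes only the child's copy */
        _exit(0);
    }
    /* fork returned the child's PID: we are the parent */
    if (waitpid(pid, NULL, 0) != pid)
        return -1;
    return value;                 /* the parent's copy was never touched */
}
```

The child uses _exit rather than exit so that it does not flush any standard I/O buffers it inherited from the parent.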


In most cases, after a fork, the child will need to execute different code from the parent. Consider the shell. It reads a command from the terminal, forks off a child process, waits for the child to execute the command, and then reads the next command when the child terminates. To wait for the child to finish, the parent executes a waitpid system call, which just waits until the child terminates (any child if more than one exists). Waitpid can wait for a specific child, or for any old child by setting the first parameter to -1. When waitpid completes, the address pointed to by the second parameter, statloc, will be set to the child's exit status (normal or abnormal termination and exit value). Various options are also provided, specified by the third parameter. The waitpid call replaces the previous wait call, which is now obsolete but is provided for reasons of backward compatibility.

Now consider how fork is used by the shell. When a command is typed, the shell forks off a new process. This child process must execute the user command. It does this by using the execve system call, which causes its entire core image to be replaced by the file named in its first parameter. (Actually, the system call itself is exec, but several different library procedures call it with different parameters and slightly different names. We will treat these as system calls here.) A highly simplified shell illustrating the use of fork, waitpid, and execve is shown in Fig. 1-10.

Figure 1-10. A stripped-down shell. Throughout this book, TRUE is assumed to be defined as 1.

#define TRUE 1

while (TRUE) {                               /* repeat forever */
     type_prompt();                          /* display prompt on the screen */
     read_command(command, parameters);      /* read input from terminal */

     if (fork() != 0) {                      /* fork off child process */
          /* Parent code. */
          waitpid(-1, &status, 0);           /* wait for child to exit */
     } else {
          /* Child code. */
          execve(command, parameters, 0);    /* execute command */
     }
}

In the most general case, execve has three parameters: the name of the file to be executed, a pointer to the argument array, and a pointer to the environment array. These will be described shortly. Various library routines, including execl, execv, execle, and execve, are provided to allow the parameters to be omitted or specified in various ways. Throughout this book we will use the name exec to represent the system call invoked by all of these.


Let us consider the case of a command such as

cp file1 file2 


used to copy file1 to file2. After the shell has forked, the child process locates and executes the file cp and passes to it the names of the source and target files.

The main program of cp (and main program of most other C programs) contains the declaration

main(argc, argv, envp) 


where argc is a count of the number of items on the command line, including the program name. For the example above, argc is 3.

The second parameter, argv, is a pointer to an array. Element i of that array is a pointer to the i-th string on the command line. In our example, argv[0] would point to the string "cp", argv[1] would point to the string "file1", and argv[2] would point to the string "file2".

The third parameter of main, envp, is a pointer to the environment, an array of strings containing assignments of the form name=value used to pass information such as the terminal type and home directory name to a program. In Fig. 1-10, no environment is passed to the child, so the third parameter of execve is a zero.
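To make the three parameters concrete, here is a sketch of a helper that runs one program and waits for it. The name run_program is hypothetical, and the usage below assumes that /bin/sh exists on the system, which is common but not guaranteed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run the file `path` with the given argument and
 * environment arrays (both NULL-terminated). Returns the program's exit
 * status, or -1 if the child could not be created or ended abnormally. */
int run_program(const char *path, char *const argv[], char *const envp[])
{
    assert(path != NULL && argv != NULL);
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execve(path, argv, envp);   /* on success this never returns */
        _exit(127);                 /* reached only if execve failed */
    }
    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Called with argv holding "sh", "-c", and "exit 3", the program started would see argc equal to 3, and run_program would return 3. The value 127 for a failed execve is only a convention borrowed from shells.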

If exec seems complicated, do not despair; it is (semantically) the most complex of all the POSIX system calls. All the other ones are much simpler. As an example of a simple one, consider exit, which processes should use when they are finished executing. It has one parameter, the exit status (0 to 255), which is returned to the parent via statloc in the waitpid system call. The low-order byte of status contains the termination status, with 0 being normal termination and the other values being various error conditions. The high-order byte contains the child's exit status (0 to 255). For example, if a parent process executes the statement

n = waitpid(-1, &statloc, options); 


it will be suspended until some child process terminates. If the child exits with, say, 4 as the parameter to exit, the parent will be awakened with n set to the child's PID and statloc set to 0x0400 (the C convention of prefixing hexadecimal constants with 0x will be used throughout this book).
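The layout of the status word is an implementation detail, so portable programs usually decode it with the macros from <sys/wait.h> instead of shifting bytes by hand. The sketch below uses a hypothetical function name and waits for one specific child; passing -1 as the first parameter would accept any child instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: fork a child that exits with `code` and return the
 * raw status word that waitpid stored for the parent, or -1 on failure. */
int raw_status_of_child(int code)
{
    assert(code >= 0 && code <= 255);
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);                      /* child: terminate at once */
    int statloc;
    if (waitpid(pid, &statloc, 0) != pid) /* parent: collect the status */
        return -1;
    return statloc;
}
```

For a child that exits with 4, WIFEXITED(statloc) is true and WEXITSTATUS(statloc) yields 4, whatever the raw encoding happens to be.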

Processes in MINIX 3 have their memory divided up into three segments: the text segment (i.e., the program code), the data segment (i.e., the variables), and the stack segment. The data segment grows upward and the stack grows downward, as shown in Fig. 1-11. Between them is a gap of unused address space. The stack grows into the gap automatically, as needed, but expansion of the data segment is done explicitly by using a system call, brk, which specifies the new address where the data segment is to end. This address may be more than the current value (data segment is growing) or less than the current value (data segment is shrinking). The parameter must, of course, be less than the stack pointer or the data and stack segments would overlap, which is forbidden.


Figure 1-11. Processes have three segments: text, data, and stack. In this example, all three are in one address space, but separate instruction and data space is also supported.


As a convenience for programmers, a library routine sbrk is provided that also changes the size of the data segment, only its parameter is the number of bytes to add to the data segment (negative parameters make the data segment smaller). It works by keeping track of the current size of the data segment, which is the value returned by brk, computing the new size, and making a call asking for that number of bytes. The brk and sbrk calls, however, are not defined by the POSIX standard. Programmers are encouraged to use the malloc library procedure for dynamically allocating storage, and the underlying implementation of malloc was not thought to be a suitable subject for standardization since few programmers use it directly.

The next process system call is also the simplest, getpid. It just returns the caller's PID. Remember that in fork, only the parent was given the child's PID. If the child wants to find out its own PID, it must use getpid. The getpgrp call returns the PID of the caller's process group. setsid creates a new session and sets the process group's PID to the caller's. Sessions are related to an optional feature of POSIX, job control, which is not supported by MINIX 3 and which will not concern us further.

The last process management system call, ptrace, is used by debugging programs to control the program being debugged. It allows the debugger to read and write the controlled process' memory and manage it in other ways.

1.4.2. System Calls for Signaling

Although most forms of interprocess communication are planned, situations exist in which unexpected communication is needed. For example, if a user accidentally tells a text editor to list the entire contents of a very long file, and then realizes the error, some way is needed to interrupt the editor. In MINIX 3, the user can hit the CTRL-C key on the keyboard, which sends a signal to the editor. The editor catches the signal and stops the print-out. Signals can also be used to report certain traps detected by the hardware, such as illegal instruction or floating point overflow. Timeouts are also implemented as signals.


When a signal is sent to a process that has not announced its willingness to accept that signal, the process is simply killed without further ado. To avoid this fate, a process can use the sigaction system call to announce that it is prepared to accept some signal type, and to provide the address of the signal handling procedure and a place to store the address of the current one. After a sigaction call, if a signal of the relevant type is generated (e.g., by pressing CTRL-C), the state of the process is pushed onto its own stack, and then the signal handler is called. It may run for as long as it wants to and perform any system calls it wants to. In practice, though, signal handlers are usually fairly short. When the signal handling procedure is done, it calls sigreturn to continue where it left off before the signal. The sigaction call replaces the older signal call, which is still provided as a library procedure for backward compatibility.
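A minimal sketch of installing a handler with sigaction follows. The function and variable names are hypothetical; SIGUSR1 is used because it is reserved for applications, and the process sends the signal to itself with raise to keep the example self-contained.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

static volatile sig_atomic_t got_signal = 0;

static void on_signal(int sig)
{
    (void)sig;
    got_signal = 1;            /* short handler: only record the event */
}

/* Hypothetical demo: catch SIGUSR1 once, then restore the old action.
 * Returns 1 if the handler ran, 0 if it did not, -1 on error. */
int catch_sigusr1_once(void)
{
    struct sigaction act, oldact;

    act.sa_handler = on_signal;           /* address of the handler */
    sigemptyset(&act.sa_mask);            /* block nothing extra while it runs */
    act.sa_flags = 0;
    if (sigaction(SIGUSR1, &act, &oldact) == -1)
        return -1;                        /* oldact receives the previous action */

    got_signal = 0;
    raise(SIGUSR1);                       /* signal ourselves */

    if (sigaction(SIGUSR1, &oldact, NULL) == -1)
        return -1;
    return got_signal;
}
```

The handler never calls sigreturn itself; in typical implementations the library arranges for that call when the handler returns.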

Signals can be blocked in MINIX 3. A blocked signal is held pending until it is unblocked. It is not delivered, but also not lost. The sigprocmask call allows a process to define the set of blocked signals by presenting the kernel with a bitmap. It is also possible for a process to ask for the set of signals currently pending but not allowed to be delivered due to their being blocked. The sigpending call returns this set as a bitmap. Finally, the sigsuspend call allows a process to atomically set the bitmap of blocked signals and suspend itself.
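The interplay of sigprocmask and sigpending can be sketched as below, again with hypothetical names. A do-nothing handler is installed first so that the signal, once unblocked, is delivered harmlessly instead of killing the process.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

static void noop_handler(int sig) { (void)sig; }

/* Hypothetical demo: returns 1 if SIGUSR1 was reported as pending while
 * it was blocked, 0 if it was not, -1 on error. */
int pending_while_blocked(void)
{
    struct sigaction act, oldact;
    sigset_t block, oldmask, pending;

    act.sa_handler = noop_handler;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    if (sigaction(SIGUSR1, &act, &oldact) == -1)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &oldmask) == -1)
        return -1;

    raise(SIGUSR1);                      /* held pending, not delivered */
    if (sigpending(&pending) == -1)
        return -1;
    int was_pending = sigismember(&pending, SIGUSR1);

    sigprocmask(SIG_SETMASK, &oldmask, NULL);  /* unblock: handler runs now */
    sigaction(SIGUSR1, &oldact, NULL);
    return was_pending;
}
```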

Instead of providing a function to catch a signal, the program may also specify the constant SIG_IGN to have all subsequent signals of the specified type ignored, or SIG_DFL to restore the default action of the signal when it occurs. The default action is either to kill the process or ignore the signal, depending upon the signal. As an example of how SIG_IGN is used, consider what happens when the shell forks off a background process as a result of

command & 


It would be undesirable for a SIGINT signal (generated by pressing CTRL-C) to affect the background process, so after the fork but before the exec, the shell does

sigaction(SIGINT, SIG_IGN, NULL); 


and

sigaction(SIGQUIT, SIG_IGN, NULL); 


to disable the SIGINT and SIGQUIT signals. (SIGQUIT is generated by CTRL-\; it is the same as SIGINT generated by CTRL-C except that if it is not caught or ignored it makes a core dump of the process killed.) For foreground processes (no ampersand), these signals are not ignored.
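The calls shown above are shorthand. In POSIX C the second parameter of sigaction is a pointer to a struct sigaction whose sa_handler field holds SIG_IGN, so a compilable version looks roughly like the following sketch (the helper name ignore_signal is hypothetical).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical helper: arrange for `sig` to be ignored, as a shell might
 * do for a background child after the fork but before the exec.
 * Returns 0 on success or -1 on error (e.g., for SIGKILL). */
int ignore_signal(int sig)
{
    struct sigaction act;

    act.sa_handler = SIG_IGN;      /* ignore instead of calling a handler */
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    return sigaction(sig, &act, NULL);
}
```

A child would call ignore_signal(SIGINT) and ignore_signal(SIGQUIT) before its exec; ignored signals stay ignored in the new program.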


Hitting CTRL-C is not the only way to send a signal. The kill system call allows a process to signal another process (provided they have the same UID; unrelated processes cannot signal each other). Getting back to the example of background processes used above, suppose a background process is started up, but later it is decided that the process should be terminated. SIGINT and SIGQUIT have been disabled, so something else is needed. The solution is to use the kill program, which uses the kill system call to send a signal to any process. By sending signal 9 (SIGKILL) to a background process, that process can be killed. SIGKILL cannot be caught or ignored.

For many real-time applications, a process needs to be interrupted after a specific time interval to do something, such as to retransmit a potentially lost packet over an unreliable communication line. To handle this situation, the alarm system call has been provided. The parameter specifies an interval, in seconds, after which a SIGALRM signal is sent to the process. A process may only have one alarm outstanding at any instant. If an alarm call is made with a parameter of 10 seconds, and then 3 seconds later another alarm call is made with a parameter of 20 seconds, only one signal will be generated, 20 seconds after the second call. The first signal is canceled by the second call to alarm. If the parameter to alarm is zero, any pending alarm signal is canceled. If an alarm signal is not caught, the default action is taken and the signaled process is killed.
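The rule that a new alarm replaces the old one can be observed through the value alarm returns, which is the number of seconds that were left on the previous alarm. The sketch below uses a hypothetical function name and, unlike the example in the text, does not wait 3 seconds between the calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical demo: set a 10-second alarm, replace it with a 20-second
 * one, then cancel it. The seconds left on each replaced alarm are
 * stored through the two pointers. No SIGALRM is ever delivered. */
void alarm_replace_demo(unsigned *left_on_first, unsigned *left_on_second)
{
    assert(left_on_first != NULL && left_on_second != NULL);
    alarm(10);                       /* SIGALRM due in 10 seconds */
    *left_on_first = alarm(20);      /* cancels it; one signal, due in 20 */
    *left_on_second = alarm(0);      /* zero cancels the pending alarm */
}
```

Run without delay, the two stored values should be about 10 and about 20, respectively.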

It sometimes occurs that a process has nothing to do until a signal arrives. For example, consider a computer-aided-instruction program that is testing reading speed and comprehension. It displays some text on the screen and then calls alarm to signal it after 30 seconds. While the student is reading the text, the program has nothing to do. It could sit in a tight loop doing nothing, but that would waste CPU time that another process or user might need. A better idea is to use pause, which tells MINIX 3 to suspend the process until the next signal.

1.4.3. System Calls for File Management

Many system calls relate to the file system. In this section we will look at calls that operate on individual files; in the next one we will examine those that involve directories or the file system as a whole. To create a new file, the creat call is used (why the call is creat and not create has been lost in the mists of time). Its parameters provide the name of the file and the protection mode. Thus

fd = creat("abc", 0751); 


creates a file called abc with mode 0751 octal (in C, a leading zero means that a constant is in octal). The low-order 9 bits of 0751 specify the rwx bits for the owner (7 means read-write-execute permission), his group (5 means read-execute), and others (1 means execute only).

Creat not only creates a new file but also opens it for writing, regardless of the file's mode. The file descriptor returned, fd, can be used to write the file. If a creat is done on an existing file, that file is truncated to length 0, provided, of course, that the permissions are all right. The creat call is obsolete, as open can now create new files, but it has been included for backward compatibility.


Special files are created using mknod rather than creat. A typical call is

fd = mknod("/dev/ttyc2", 020744, 0x0402); 


which creates a file named /dev/ttyc2 (the usual name for console 2) and gives it mode 020744 octal (a character special file with protection bits rwxr--r--). The third parameter contains the major device (4) in the high-order byte and the minor device (2) in the low-order byte. The major device could have been anything, but a file named /dev/ttyc2 ought to be minor device 2. Calls to mknod fail unless the caller is the superuser.

To read or write an existing file, the file must first be opened using open. This call specifies the file name to be opened, either as an absolute path name or relative to the working directory, and a code of O_RDONLY, O_WRONLY, or O_RDWR, meaning open for reading, writing, or both. The file descriptor returned can then be used for reading or writing. Afterward, the file can be closed by close, which makes the file descriptor available for reuse on a subsequent creat or open.

The most heavily used calls are undoubtedly read and write. We saw read earlier; write has the same parameters.

Although most programs read and write files sequentially, for some applications programs need to be able to access any part of a file at random. Associated with each file is a pointer that indicates the current position in the file. When reading (writing) sequentially, it normally points to the next byte to be read (written). The lseek call changes the value of the position pointer, so that subsequent calls to read or write can begin anywhere in the file, or even beyond the end.

lseek has three parameters: the first is the file descriptor for the file, the second is a file position, and the third tells whether the file position is relative to the beginning of the file, the current position, or the end of the file. The value returned by lseek is the absolute position in the file after changing the pointer.
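The three choices for the third parameter are named SEEK_SET, SEEK_CUR, and SEEK_END in C. The sketch below exercises each of them on a scratch file; the function names are hypothetical, and the demo assumes that /tmp is a writable directory.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read n bytes starting at absolute position pos. */
ssize_t read_at(int fd, off_t pos, void *buf, size_t n)
{
    assert(buf != NULL);
    if (lseek(fd, pos, SEEK_SET) == (off_t)-1)   /* relative to the start */
        return -1;
    return read(fd, buf, n);
}

/* Demo on a scratch file; returns 0 if every step behaves as described. */
int lseek_demo(void)
{
    char path[] = "/tmp/lseek_demo_XXXXXX";      /* assumed writable */
    char buf[3];
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    int ok = write(fd, "abcdef", 6) == 6
          && read_at(fd, 2, buf, 3) == 3         /* pointer is now at 5 */
          && buf[0] == 'c' && buf[2] == 'e'
          && lseek(fd, -1, SEEK_CUR) == 4        /* one back from current */
          && lseek(fd, 0, SEEK_END) == 6;        /* end of file: its length */
    close(fd);
    unlink(path);
    return ok ? 0 : -1;
}
```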

For each file, MINIX 3 keeps track of the file mode (regular file, special file, directory, and so on), size, time of last modification, and other information. Programs can ask to see this information via the stat and fstat system calls. These differ only in that the former specifies the file by name, whereas the latter takes a file descriptor, making it useful for open files, especially standard input and standard output, whose names may not be known. Both calls provide as the second parameter a pointer to a structure where the information is to be put. The structure is shown in Fig. 1-12.

Figure 1-12. The structure used to return information for the stat and fstat system calls. In the actual code, symbolic names are used for some of the types.

struct stat {
   short st_dev;                      /* device where i-node belongs */
   unsigned short st_ino;             /* i-node number */
   unsigned short st_mode;            /* mode word */
   short st_nlink;                    /* number of links */
   short st_uid;                      /* user id */
   short st_gid;                      /* group id */
   short st_rdev;                     /* major/minor device for special files */
   long st_size;                      /* file size */
   long st_atime;                     /* time of last access */
   long st_mtime;                     /* time of last modification */
   long st_ctime;                     /* time of last change to i-node */
};
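As a sketch of how the structure is used, the two hypothetical helpers below ask for a file's size through fstat and test whether a name refers to a directory through stat. The S_ISDIR macro comes from <sys/stat.h> and inspects the mode word.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper: size in bytes of an open file, or -1 on error. */
off_t size_of_open_file(int fd)
{
    struct stat buf;
    if (fstat(fd, &buf) == -1)      /* file named by descriptor */
        return -1;
    return buf.st_size;
}

/* Hypothetical helper: 1 if path is a directory, 0 if not, -1 on error. */
int is_directory(const char *path)
{
    struct stat buf;
    assert(path != NULL);
    if (stat(path, &buf) == -1)     /* file named by path */
        return -1;
    return S_ISDIR(buf.st_mode) ? 1 : 0;
}
```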

When manipulating file descriptors, the dup call is occasionally helpful. Consider, for example, a program that needs to close standard output (file descriptor 1), substitute another file as standard output, call a function that writes some output onto standard output, and then restore the original situation. Just closing file descriptor 1 and then opening a new file will make the new file standard output (assuming standard input, file descriptor 0, is in use), but it will be impossible to restore the original situation later.


The solution is first to execute the statement

fd = dup(1); 


which uses the dup system call to allocate a new file descriptor, fd, and arrange for it to correspond to the same file as standard output. Then standard output can be closed and a new file opened and used. When it is time to restore the original situation, file descriptor 1 can be closed, and then

n = dup(fd); 


executed to assign the lowest file descriptor, namely, 1, to the same file as fd. Finally, fd can be closed and we are back where we started.
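The whole save, redirect, and restore sequence can be sketched as one function. The name write_via_stdout is hypothetical, and the code checks the assumption made in the text, namely that file descriptor 0 is in use so that 1 is the lowest free descriptor after standard output is closed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes of msg to the file `path` through
 * file descriptor 1, then restore the original standard output.
 * Returns 0 on success, -1 on failure. */
int write_via_stdout(const char *path, const char *msg, size_t len)
{
    assert(path != NULL && msg != NULL);
    if (fcntl(0, F_GETFD) == -1)
        return -1;                   /* the trick needs descriptor 0 in use */
    int saved = dup(1);              /* second handle on the old stdout */
    if (saved == -1)
        return -1;
    close(1);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    int ok = (fd == 1) && write(1, msg, len) == (ssize_t)len;
    if (fd != -1)
        close(fd);                   /* free descriptor 1 again */
    int n = dup(saved);              /* lowest free descriptor: 1 */
    close(saved);
    return (ok && n == 1) ? 0 : -1;
}
```

The sketch calls write directly; a program using printf would also have to flush the standard I/O buffer before each switch.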

The dup call has a variant that allows an arbitrary unassigned file descriptor to be made to refer to a given open file. It is called by

dup2(fd, fd2); 


where fd refers to an open file and fd2 is the unassigned file descriptor that is to be made to refer to the same file as fd. Thus if fd refers to standard input (file descriptor 0) and fd2 is 4, after the call, file descriptors 0 and 4 will both refer to standard input.

Interprocess communication in MINIX 3 uses pipes, as described earlier. When a user types

cat file1 file2 | sort 


the shell creates a pipe and arranges for standard output of the first process to write to the pipe, so standard input of the second process can read from it. The pipe system call creates a pipe and returns two file descriptors, one for writing and one for reading. The call is


pipe(&fd[0]); 


where fd is an array of two integers and fd[0] is the file descriptor for reading and fd[1] is the one for writing. Typically, a fork comes next, and the parent closes the file descriptor for reading and the child closes the file descriptor for writing (or vice versa), so when they are done, one process can read the pipe and the other can write on it.
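A small sketch of this pattern, without any exec, is shown below. Here the child is the writer and the parent is the reader (the "vice versa" case); the function name is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: the child writes len bytes of msg into a pipe and
 * the parent reads them into out. Returns the bytes read, or -1. */
ssize_t pipe_from_child(const char *msg, size_t len, char *out, size_t outsize)
{
    int fd[2];
    assert(msg != NULL && out != NULL);
    if (pipe(&fd[0]) == -1)
        return -1;
    pid_t pid = fork();
    if (pid < 0) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }
    if (pid == 0) {                  /* child: the writer */
        close(fd[0]);                /* does not read from the pipe */
        ssize_t w = write(fd[1], msg, len);
        _exit(w == (ssize_t)len ? 0 : 1);
    }
    close(fd[1]);                    /* parent: the reader; closing the write
                                        end lets read see end-of-file */
    ssize_t total = 0, n;
    while ((size_t)total < outsize &&
           (n = read(fd[0], out + total, outsize - (size_t)total)) > 0)
        total += n;
    close(fd[0]);
    waitpid(pid, NULL, 0);
    return total;
}
```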

Figure 1-13 depicts a skeleton procedure that creates two processes, with the output of the first one piped into the second one. (A more realistic example would do error checking and handle arguments.) First a pipe is created, and then the procedure forks, with the parent eventually becoming the first process in the pipeline and the child process becoming the second one. Since the files to be executed, process1 and process2, do not know that they are part of a pipeline, it is essential that the file descriptors be manipulated so that the first process' standard output is the pipe and the second one's standard input is the pipe. The parent first closes off the file descriptor for reading from the pipe. Then it closes standard output and does a dup call that allows file descriptor 1 to write on the pipe. It is important to realize that dup always returns the lowest available file descriptor, in this case, 1. Then the program closes the other pipe file descriptor.

Figure 1-13. A skeleton for setting up a two-process pipeline.

#define STD_INPUT  0                    /* file descriptor for standard input */
#define STD_OUTPUT 1                    /* file descriptor for standard output */

pipeline(process1, process2)
char *process1, *process2;              /* pointers to program names */
{
  int fd[2];

  pipe(&fd[0]);                         /* create a pipe */
  if (fork() != 0) {
       /* The parent process executes these statements. */
       close(fd[0]);                    /* process 1 does not need to read from pipe */
       close(STD_OUTPUT);               /* prepare for new standard output */
       dup(fd[1]);                      /* set standard output to fd[1] */
       close(fd[1]);                    /* this file descriptor not needed any more */
       execl(process1, process1, 0);
  } else {
       /* The child process executes these statements. */
       close(fd[1]);                    /* process 2 does not need to write to pipe */
       close(STD_INPUT);                /* prepare for new standard input */
       dup(fd[0]);                      /* set standard input to fd[0] */
       close(fd[0]);                    /* this file descriptor not needed any more */
       execl(process2, process2, 0);
  }
}

After the exec call, the process started will have file descriptors 0 and 2 unchanged, and file descriptor 1 set up for writing on the pipe. The child code is analogous. The parameter to execl is repeated because the first one is the file to be executed and the second one is the first argument passed to the program, which most programs expect to be the file name.

The next system call, ioctl, is potentially applicable to all special files. It is, for instance, used by block device drivers like the SCSI driver to control tape and CD-ROM devices. Its main use, however, is with special character files, primarily terminals. POSIX defines a number of functions which the library translates into ioctl calls. The tcgetattr and tcsetattr library functions use ioctl to change the characters used for correcting typing errors on the terminal, changing the terminal mode, and so forth.

Traditionally, there are three terminal modes, cooked, raw, and cbreak. Cooked mode is the normal terminal mode, in which the erase and kill characters work normally, CTRL-S and CTRL-Q can be used for stopping and starting terminal output, CTRL-D means end of file, CTRL-C generates an interrupt signal, and CTRL-\ generates a quit signal to force a core dump.

In raw mode, all of these functions are disabled; consequently, every character is passed directly to the program with no special processing. Furthermore, in raw mode, a read from the terminal will give the program any characters that have been typed, even a partial line, rather than waiting for a complete line to be typed, as in cooked mode. Screen editors often use this mode.


Cbreak mode is in between. The erase and kill characters for editing are disabled, as is CTRL-D, but CTRL-S, CTRL-Q, CTRL-C, and CTRL-\ are enabled. As in raw mode, partial lines can be returned to programs (if intraline editing is turned off there is no need to wait until a whole line has been received; the user cannot change his mind and delete it, as he can in cooked mode).

POSIX does not use the terms cooked, raw, and cbreak. In POSIX terminology canonical mode corresponds to cooked mode. In this mode there are eleven special characters defined, and input is by lines. In noncanonical mode a minimum number of characters to accept and a time, specified in units of 1/10th of a second, determine how a read will be satisfied. Under POSIX there is a great deal of flexibility, and various flags can be set to make noncanonical mode behave like either cbreak or raw mode. The older terms are more descriptive, and we will continue to use them informally.
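As a rough sketch of how a program might ask for cbreak-like behavior through the POSIX interface, the hypothetical helper below turns off canonical input, keeps signal generation on, and sets the minimum count and the timer. It is one possible choice of flags, not the only way to approximate cbreak mode.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <termios.h>
#include <unistd.h>

/* Hypothetical helper: put the terminal on fd into a cbreak-like
 * noncanonical mode, saving the old settings in *saved so the caller
 * can restore them with tcsetattr. Returns 0 on success, -1 on error
 * (including the case where fd is not a terminal). */
int enter_cbreak(int fd, struct termios *saved)
{
    struct termios t;

    assert(saved != NULL);
    if (!isatty(fd))
        return -1;
    if (tcgetattr(fd, saved) == -1)
        return -1;
    t = *saved;
    t.c_lflag &= ~(tcflag_t)ICANON;  /* no line assembly or erase/kill */
    t.c_lflag |= ISIG;               /* CTRL-C and CTRL-\ still signal */
    t.c_cc[VMIN] = 1;                /* a read returns after 1 character */
    t.c_cc[VTIME] = 0;               /* no timer (units of 1/10 second) */
    return tcsetattr(fd, TCSANOW, &t);
}
```

A caller would later pass the saved structure to tcsetattr to return the terminal to its previous mode.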

Ioctl has three parameters. For example, a call to tcsetattr to set terminal parameters will result in

ioctl(fd, TCSETS, &termios); 


The first parameter specifies a file, the second one specifies an operation, and the third one is the address of the POSIX structure that contains flags and the array of control characters. Other operation codes instruct the system to postpone the changes until all output has been sent, cause unread input to be discarded, and return the current values.



The access system call is used to determine whether a certain file access is permitted by the protection system. It is needed because some programs can run using a different user's UID. This SETUID mechanism will be described later.

The rename system call is used to give a file a new name. The parameters specify the old and new names.

Finally, the fcntl call is used to control files, somewhat analogous to ioctl (i.e., both of them are horrible hacks). It has several options, the most important of which is for advisory file locking. Using fcntl, it is possible for a process to lock and unlock parts of files and test part of a file to see if it is locked. The call does not enforce any lock semantics. Programs must do this themselves.
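As an illustration of advisory locking, here is a minimal sketch using the standard F_SETLK and F_GETLK operations of fcntl. The three helper names are our own. Note that F_GETLK reports only locks held by other processes, so a process testing a region it has locked itself will be told the region is free.

```c
#include <fcntl.h>
#include <unistd.h>

/* Fill in a lock description for bytes start .. start+len-1 of a file. */
static struct flock region(short type, off_t start, off_t len)
{
    struct flock fl = {0};

    fl.l_type = type;          /* F_WRLCK, F_RDLCK, or F_UNLCK */
    fl.l_whence = SEEK_SET;    /* l_start is measured from the start of the file */
    fl.l_start = start;
    fl.l_len = len;
    return fl;
}

/* Try to get an exclusive lock without blocking; 0 on success, -1 if refused. */
int lock_region(int fd, off_t start, off_t len)
{
    struct flock fl = region(F_WRLCK, start, len);
    return fcntl(fd, F_SETLK, &fl);
}

/* Release a lock on the region; 0 on success, -1 on error. */
int unlock_region(int fd, off_t start, off_t len)
{
    struct flock fl = region(F_UNLCK, start, len);
    return fcntl(fd, F_SETLK, &fl);
}

/* Returns 1 if another process holds a conflicting lock, 0 if not, -1 on error. */
int locked_by_another_process(int fd, off_t start, off_t len)
{
    struct flock fl = region(F_WRLCK, start, len);

    if (fcntl(fd, F_GETLK, &fl) == -1)
        return -1;
    return fl.l_type != F_UNLCK;
}
```

Because the locks are advisory, a process that never calls lock_region can still read and write the locked bytes. The scheme works only if all cooperating programs follow it.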

1.4.4. System Calls for Directory Management

In this section we will look at some system calls that relate more to directories or the file system as a whole, rather than just to one specific file as in the previous section. The first two calls, mkdir and rmdir, create and remove empty directories, respectively. The next call is link. Its purpose is to allow the same file to appear under two or more names, often in different directories. A typical use is to allow several members of the same programming team to share a common file, with each of them having the file appear in his own directory, possibly under different names. Sharing a file is not the same as giving every team member a private copy, because having a shared file means that changes that any member of the team makes are instantly visible to the other members: there is only one file. When copies are made of a file, subsequent changes made to one copy do not affect the other ones.

To see how link works, consider the situation of Fig. 1-14(a). Here are two users, ast and jim, each having their own directories with some files. If ast now executes a program containing the system call

link("/usr/jim/memo", "/usr/ast/note"); 


the file memo in jim's directory is now entered into ast's directory under the name note. Thereafter, /usr/jim/memo and /usr/ast/note refer to the same file.

Figure 1-14. (a) Two directories before linking /usr/jim/memo to ast's directory. (b) The same directories after linking.


Understanding how link works will probably make it clearer what it does. Every file in UNIX has a unique number, its i-number, that identifies it. This i-number is an index into a table of i-nodes, one per file, telling who owns the file, where its disk blocks are, and so on. A directory is simply a file containing a set of (i-number, ASCII name) pairs. In the first versions of UNIX, each directory entry was 16 bytes: 2 bytes for the i-number and 14 bytes for the name. A more complicated structure is needed to support long file names, but conceptually a directory is still a set of (i-number, ASCII name) pairs. In Fig. 1-14, mail has i-number 16, and so on. What link does is simply create a new directory entry with a (possibly new) name, using the i-number of an existing file. In Fig. 1-14(b), two entries have the same i-number (70) and thus refer to the same file. If either one is later removed, using the unlink system call, the other one remains. If both are removed, UNIX sees that no entries to the file exist (a field in the i-node keeps track of the number of directory entries pointing to the file), so the file is removed from the disk.
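The i-number and the count of directory entries can be observed from a user program with the stat call, which reports them in the st_ino and st_nlink fields. The sketch below is one way to do so; the helper names link_count and same_file are hypothetical.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Number of directory entries pointing at the file's i-node, or -1 on error. */
long link_count(const char *path)
{
    struct stat st;

    if (stat(path, &st) == -1)
        return -1;
    return (long)st.st_nlink;
}

/* Returns 1 if both names lead to the same i-node on the same device,
 * 0 if they do not, and -1 if either name cannot be examined. */
int same_file(const char *a, const char *b)
{
    struct stat sa, sb;

    if (stat(a, &sa) == -1 || stat(b, &sb) == -1)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

After a successful link("/usr/jim/memo", "/usr/ast/note"), same_file on the two names would return 1 and link_count on either name would be one higher than before. After one of the names is unlinked, the count for the remaining name drops by one again.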



As we have mentioned earlier, the mount system call allows two file systems to be merged into one. A common situation is to have the root file system containing the binary (executable) versions of the common commands and other heavily used files, on a hard disk. The user can then insert a CD-ROM with files to be read into the CD-ROM drive.

By executing the mount system call, the CD-ROM file system can be attached to the root file system, as shown in Fig. 1-15. A typical statement in C to perform the mount is

mount("/dev/cdrom0", "/mnt", 0); 


where the first parameter is the name of a block special file for CD-ROM drive 0, the second parameter is the place in the tree where it is to be mounted, and the third one tells whether the file system is to be mounted read-write or read-only.

Figure 1-15. (a) File system before the mount. (b) File system after the mount.


After the mount call, a file on CD-ROM drive 0 can be accessed by just using its path from the root directory or the working directory, without regard to which drive it is on. In fact, second, third, and fourth drives can also be mounted anywhere in the tree. The mount call makes it possible to integrate removable media into a single integrated file hierarchy, without having to worry about which device a file is on. Although this example involves CD-ROMs, hard disks or portions of hard disks (often called partitions or minor devices) can also be mounted this way. When a file system is no longer needed, it can be unmounted with the umount system call.



MINIX 3 maintains a block cache of recently used blocks in main memory to avoid having to read them from the disk if they are used again quickly. If a block in the cache is modified (by a write on a file) and the system crashes before the modified block is written out to disk, the file system will be damaged. To limit the potential damage, it is important to flush the cache periodically, so that the amount of data lost by a crash will be small. The system call sync tells MINIX 3 to write out all the cache blocks that have been modified since being read in. When MINIX 3 is started up, a program called update is started as a background process to do a sync every 30 seconds, to keep flushing the cache.

Two other calls that relate to directories are chdir and chroot. The former changes the working directory and the latter changes the root directory. After the call

chdir("/usr/ast/test"); 


an open on the file xyz will open /usr/ast/test/xyz. chroot works in an analogous way. Once a process has told the system to change its root directory, all absolute path names (path names beginning with a "/") will start at the new root. Why would you want to do that? For security: server programs for protocols such as FTP (File Transfer Protocol) and HTTP (HyperText Transfer Protocol) do this so remote users of these services can access only the portions of a file system below the new root. Only superusers may execute chroot, and even superusers do not do it very often.
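The effect of chdir on relative path names can be shown in a few lines. This is only a sketch, and the helper name open_in_dir is our own. It changes the working directory of the whole process as a side effect, which a real program might not want.

```c
#include <fcntl.h>
#include <unistd.h>

/* Make dir the working directory, then open name relative to it for reading.
 * Returns a file descriptor, or -1 if either step fails. */
int open_in_dir(const char *dir, const char *name)
{
    if (chdir(dir) == -1)          /* e.g., dir does not exist or is not searchable */
        return -1;
    return open(name, O_RDONLY);   /* name is now resolved starting from dir */
}
```

With this helper, open_in_dir("/usr/ast/test", "xyz") opens /usr/ast/test/xyz, just as in the example above.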

1.4.5. System Calls for Protection

In MINIX 3 every file has an 11-bit mode used for protection. Nine of these bits are the read-write-execute bits for the owner, group, and others. The chmod system call makes it possible to change the mode of a file. For example, to make a file read-only by everyone except the owner, one could execute

chmod("file", 0644); 


The other two protection bits, 02000 and 04000, are the SETGID (set-group-id) and SETUID (set-user-id) bits, respectively. When any user executes a program with the SETUID bit on, for the duration of that process the user's effective UID is changed to that of the file's owner. This feature is heavily used to allow users to execute programs that perform superuser-only functions, such as creating directories. Creating a directory uses mknod, which is for the superuser only. By arranging for the mkdir program to be owned by the superuser and have mode 04755, ordinary users can be given the power to execute mknod but in a highly restricted way.
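The mode bits can be set with chmod and read back with stat. The sketch below assumes a POSIX system, where the constant S_ISUID names the 04000 bit; the helper names are hypothetical.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create (or truncate) path and then set its mode to exactly the value given.
 * Returns 0 on success, -1 on error. */
int create_with_mode(const char *path, mode_t mode)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd == -1)
        return -1;
    close(fd);
    return chmod(path, mode);     /* e.g., 0644: owner read-write, others read-only */
}

/* The protection bits of path (rwx bits plus the special bits), or -1 on error. */
long permission_bits(const char *path)
{
    struct stat st;

    if (stat(path, &st) == -1)
        return -1;
    return (long)(st.st_mode & 07777);
}

/* Returns 1 if the SETUID bit (04000) is on in mode, otherwise 0. */
int is_setuid(long mode)
{
    return (mode & S_ISUID) != 0;
}
```

For a mode of 04755, such as the one described for the mkdir program, is_setuid returns 1. For 0755 it returns 0.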



When a process executes a file that has the SETUID or SETGID bit on in its mode, it acquires an effective UID or GID different from its real UID or GID. It is sometimes important for a process to find out what its real and effective UIDs or GIDs are. The system calls getuid and getgid have been provided to supply this information. Each call returns both the real and effective UID or GID, so four library routines are needed to extract the proper information: getuid, getgid, geteuid, and getegid. The first two get the real UID/GID, and the last two the effective ones.
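Using the four library routines, a program can tell whether it is currently running with borrowed privileges. A minimal sketch follows; the function name ids_differ is our own.

```c
#include <unistd.h>

/* Returns 1 if the effective UID or GID differs from the real one, which is
 * the case while a SETUID or SETGID program is running, and 0 otherwise. */
int ids_differ(void)
{
    return getuid() != geteuid() || getgid() != getegid();
}
```

An ordinary program started by an ordinary user would get 0 here.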

Ordinary users cannot change their UID, except by executing programs with the SETUID bit on, but the superuser has another possibility: the setuid system call, which sets both the effective and real UIDs. setgid sets both GIDs. The superuser can also change the owner of a file with the chown system call. In short, the superuser has plenty of opportunity for violating all the protection rules, which explains why so many students devote so much of their time to trying to become superuser.

The last two system calls in this category can be executed by ordinary user processes. The first one, umask, sets an internal bit mask within the system, which is used to mask off mode bits when a file is created. After the call

umask(022); 


the mode supplied by creat and mknod will have the 022 bits masked off before being used. Thus the call

creat("file", 0777); 


will set the mode to 0755 rather than 0777. Since the bit mask is inherited by child processes, if the shell does a umask just after login, none of the user's processes in that session will accidentally create files that other people can write on.
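The masking can be checked with a short experiment. The sketch below, whose helper name mode_after_umask is hypothetical, sets a mask, creates a file, and reports the mode the file actually received. The file must not already exist, because creat leaves the mode of an existing file unchanged.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create path with the requested mode while mask is in effect and return the
 * rwx bits the new file ended up with, or -1 on error. The previous mask is
 * put back before returning. */
long mode_after_umask(const char *path, mode_t mask, mode_t requested)
{
    struct stat st;
    mode_t old = umask(mask);       /* umask returns the previous mask */
    int fd = creat(path, requested);

    umask(old);
    if (fd == -1)
        return -1;
    if (fstat(fd, &st) == -1) {
        close(fd);
        return -1;
    }
    close(fd);
    return (long)(st.st_mode & 0777);
}
```

With a mask of 022 and a requested mode of 0777, the function returns 0755, that is, 0777 with the 022 bits turned off.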

When a program owned by root has the SETUID bit on, it can access any file, because its effective UID is that of the superuser. Frequently it is useful for the program to know if the person who called the program has permission to access a given file. If the program just tries the access, it will always succeed, and thus learn nothing.

What is needed is a way to see if the access is permitted for the real UID. The access system call provides a way to find out. The mode parameter is 4 to check for read access, 2 for write access, and 1 for execute access. Combinations of these values are also allowed. For example, with mode equal to 6, the call returns 0 if both read and write access are allowed for the real ID; otherwise -1 is returned. With mode equal to 0, a check is made to see if the file exists and the directories leading up to it can be searched.
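In C programs the mode values are usually written with the symbolic constants R_OK, W_OK, X_OK, and F_OK from unistd.h, which on typical systems have the values 4, 2, 1, and 0 used above. A minimal sketch, with hypothetical helper names:

```c
#include <unistd.h>

/* Returns 1 if the real UID may both read and write path, otherwise 0. */
int real_user_can_read_write(const char *path)
{
    return access(path, R_OK | W_OK) == 0;   /* the mode 6 check */
}

/* Returns 1 if path exists and the directories leading to it can be searched. */
int file_exists(const char *path)
{
    return access(path, F_OK) == 0;          /* the mode 0 check */
}
```

A SETUID program would call real_user_can_read_write before opening a file on behalf of the user who invoked it.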



Although the protection mechanisms of all UNIX-like operating systems are generally similar, there are some differences and inconsistencies that lead to security vulnerabilities. See Chen et al. (2002) for a discussion.

1.4.6. System Calls for Time Management

MINIX 3 has four system calls that involve the time-of-day clock. Time just returns the current time in seconds, with 0 corresponding to Jan. 1, 1970 at midnight (just as the day was starting, not ending). Of course, the system clock must be set at some point in order to allow it to be read later, so stime has been provided to let the clock be set (by the superuser). The third time call is utime, which allows the owner of a file (or the superuser) to change the time stored in a file's i-node. Application of this system call is fairly limited, but a few programs need it, for example, touch, which sets the file's time to the current time.
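A stripped-down version of touch can be written with utime. The following is only a sketch of the idea, not the actual MINIX 3 source; the names touch_file and modification_time are our own. Passing a null pointer as the second argument of utime asks for both of the file's times to be set to the current time.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <utime.h>

/* Create path if it does not exist, then set its times to the current time.
 * Returns 0 on success, -1 on error. */
int touch_file(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);

    if (fd == -1)
        return -1;
    close(fd);
    return utime(path, NULL);      /* NULL means "use the current time" */
}

/* Time of last modification in seconds since Jan. 1, 1970, or -1 on error. */
long modification_time(const char *path)
{
    struct stat st;

    if (stat(path, &st) == -1)
        return -1;
    return (long)st.st_mtime;
}
```

The value returned by modification_time is on the same scale as the result of the time call, so the two can be compared directly.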

Finally, we have times, which returns the accounting information to a process, so it can see how much CPU time it has used directly, and how much CPU time the system itself has expended on its behalf (handling its system calls). The total user and system times used by all of its children combined are also returned.
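The accounting information comes back in clock ticks, so it is usually converted to seconds before being shown to a person. The sketch below assumes a POSIX system, where sysconf(_SC_CLK_TCK) gives the number of ticks per second; the function names are hypothetical.

```c
#include <sys/times.h>
#include <time.h>
#include <unistd.h>

/* Seconds since Jan. 1, 1970, as reported by the time call. */
long seconds_since_epoch(void)
{
    return (long)time(NULL);
}

/* CPU seconds this process has used in user mode, or -1.0 on error. */
double user_cpu_seconds(void)
{
    struct tms t;
    long ticks = sysconf(_SC_CLK_TCK);     /* clock ticks per second */

    if (ticks <= 0 || times(&t) == (clock_t)-1)
        return -1.0;
    return (double)t.tms_utime / (double)ticks;
}

/* CPU seconds the system has spent on behalf of this process, or -1.0 on error. */
double system_cpu_seconds(void)
{
    struct tms t;
    long ticks = sysconf(_SC_CLK_TCK);

    if (ticks <= 0 || times(&t) == (clock_t)-1)
        return -1.0;
    return (double)t.tms_stime / (double)ticks;
}
```

The totals for terminated children are available in the same structure, in the fields tms_cutime and tms_cstime.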




Operating Systems Design and Implementation (3rd Edition)
ISBN: 0131429388
Year: 2006
Pages: 102
