Systrace | Absolute OpenBSD: Unix for the Practical Paranoid

One of the more exciting features in OpenBSD is systrace(1), a system call access manager. With systrace, a system administrator can say which system calls can be made by which programs, and how those calls can be made. Proper use of systrace can greatly reduce the risks inherent in running poorly written or exploitable programs. Systrace policies can confine users in a manner completely independent of UNIX permissions. You can even define the errors that the system calls return when access is denied, to allow programs to fail in the desired manner. Using systrace requires a practical understanding of system calls, what programs must have to work properly, and how these things interact with security. While these are often considered advanced system administration skills, even junior administrators can learn them.

Systrace has several important pieces: policies, the policy-generation tools, the runtime access management tool, and the sysadmin real-time interface.

System Calls

Sysadmins fling the term "system calls" around a lot, but many of them don't know exactly what it means. A system call is a function that lets you talk to the operating system kernel. If you want to allocate memory, open a TCP/IP port, or perform input/output on the disk, that's a system call. System calls are documented in section 2 of the online manual.

UNIX also supports a wide variety of C library calls, which are often confused with system calls but are actually just standardized routines for things that could be written within a program. You could easily write a function to compute square roots within a program, for example, but you could not write a function to allocate memory without using a system call. If you're in doubt whether a particular function is a system call or a C library function, check the online manual.
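
To make the distinction concrete, here is a small Python sketch. Python is not used elsewhere in this chapter; it is only a convenient way to illustrate the idea. The os functions used here are thin wrappers around the system calls of the same names, while math.sqrt() is an ordinary library routine that never has to ask the kernel for anything. The function names are made up for this example.

```python
import math
import os


def library_call_demo(value):
    # Pure computation: the program can do this without the kernel's help.
    return math.sqrt(value)


def system_call_demo():
    # pipe(2), write(2), read(2), and close(2) all have to go through the kernel.
    read_end, write_end = os.pipe()
    try:
        os.write(write_end, b"hello")
        return os.read(read_end, 5)
    finally:
        os.close(read_end)
        os.close(write_end)


if __name__ == "__main__":
    print(library_call_demo(16.0))
    print(system_call_demo())
```

A systrace policy would have something to say about every call in the second function, and nothing at all to say about the first.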

Some standards define "functions" that must be provided by the operating system. These functions might be system calls, or library calls. How a function is provided is considered an implementation detail, and an operating system can move a function between the system calls and the C library as appropriate.

You may find an occasional system call that is not documented in the online manual, such as break(). You'll need to dig into other resources to identify these calls. (Break() in particular is a very old system call that is used within libc, but not by programmers, so it seems to have escaped being documented in the man pages.)

Systrace Policies

A policy is just a description of system calls that a particular program may execute and how those calls can be made. While this sounds simple, as with many other things, the details can cause sleepless nights and caffeine overdoses.

Systrace(1) describes the complete systrace language. System calls have been used for many different things, and their usage has expanded over the last 30 years, until it would seem impossible to write a permissions file that describes every way a system call can be made. It turns out to be very possible, but admittedly difficult. Fortunately, almost all of the system calls you will ever need can be described by a small subset of the complete syntax.

A policy rule describes a permitted system call and the manner it can be made in. It has the general format shown here:

 abi-syscall: term1 comparison-operator term2 then permit

OpenBSD can run binaries from a variety of operating systems, as discussed in Chapter 13, and systrace can theoretically support all of them. At the time of this writing, only the native and Linux ABIs are supported. Each ABI has its own list of system calls, so every rule begins with the ABI that the system call belongs to.

The syscall is the name of the system call.

While you could simply "permit" or "deny" this system call, it's frequently more useful to allow execution conditionally. You could allow Apache to open a TCP/IP socket, but only if it's trying to open port 80. That's where the comparison comes in. If the first term matches the second term in a particular way, you can decide to permit or deny the system call request. Systrace supports the following commonly used comparison operators:
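
One way to get comfortable with the rule format is to pull a rule apart mechanically. The following Python sketch does that for the simple rule shapes shown in this chapter. The parse_rule() helper and its regular expression are hypothetical, written only for illustration; they are not part of systrace and do not cover the complete policy language.

```python
import re

# Hypothetical pattern covering only the simple rule shapes in this chapter:
#   abi-syscall: permit
#   abi-syscall: term1 operator "term2" then permit
RULE = re.compile(
    r'^(?P<abi>\w+)-(?P<syscall>\w+):\s*'
    r'(?:(?P<term1>\w+)\s+(?P<op>\w+)\s+"(?P<term2>[^"]*)"\s+then\s+)?'
    r'(?P<action>permit|deny)\b'
)


def parse_rule(line):
    """Split one policy rule into its named parts."""
    found = RULE.match(line.strip())
    if found is None:
        raise ValueError("not a rule this sketch understands: %r" % line)
    return found.groupdict()


if __name__ == "__main__":
    print(parse_rule('native-bind: sockaddr match "inet-*:53" then permit'))
    print(parse_rule('native-accept: permit'))
```

An unconditional rule simply leaves the comparison parts empty.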

The "match" operator matches if the two terms are the same as per regular filename globbing. This allows use of the asterisk (*) wildcard in terms.
The "eq" operator matches if the two terms are exactly identical.
The "sub" operator matches if the second term is a substring of the first.
The "re" operator lets you specify a grep(1)-style regular expression as your second term. The rule matches if the first term matches the regex.

Systrace also supports "neq" (not equal to), "nsub" (not a substring), and "inpath" (argument in this path), but these operators are only rarely used.
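
If you think better in code, the comparisons can be approximated in a few lines of Python. This is only a sketch of the semantics described above, not how systrace itself is implemented: fnmatch stands in for filename globbing, Python's re module stands in for grep-style regular expressions (the two dialects differ in places), and the path comparison is left out. The compare() function is a name invented for this example.

```python
import fnmatch
import re


def compare(op, argument, term):
    """Return True if the system call argument satisfies the comparison."""
    if op == "match":
        # Filename-style globbing, so "*" is a wildcard in the term.
        return fnmatch.fnmatchcase(argument, term)
    if op == "eq":
        return argument == term
    if op == "neq":
        return argument != term
    if op == "sub":
        # True when the term appears anywhere inside the argument.
        return term in argument
    if op == "nsub":
        return term not in argument
    if op == "re":
        return re.search(term, argument) is not None
    raise ValueError("unsupported operator: " + op)


if __name__ == "__main__":
    print(compare("match", "inet-[0.0.0.0]:53", "inet-*:53"))
    print(compare("sub", "inet-[10.0.0.1]:515", ":515"))
```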

The simplest way to see how systrace policies work is to look at some sample policy statements. Many of the following samples are taken from the default policy for named(8), kept in /etc/systrace/usr_sbin_named.

Sample Systrace Policy Rules

Before reviewing the named policy, let's review some commonly known facts about the nameserver daemon's system access requirements. Zone transfers occur on TCP port 53, while basic lookup services are provided on UDP port 53. OpenBSD chroots named into /var/named by default and logs everything to /var/log/messages. We might expect the policy to include system calls that allow this access.

Now, let's see how the reality of system calls compares to our expectations.

Permitting System Calls

 native-accept: permit

When named(8) tries to use the accept() system call under the native ABI, it is allowed. What is accept()? Run "man 2 accept", and you'll see that this call accepts connections on an existing socket. A nameserver will obviously have to accept connections on a network socket!

Using Match Comparisons

Here's a rule for bind(), the system call that lets a program listen on a TCP/IP port.

 native-bind: sockaddr match "inet-*:53" then permit

What the heck is a "sockaddr"? Check bind(2), and you'll see that this is a variable used as an argument to the bind() system call. This is where things start to get scary; how do you know what the argument should look like? If you're not a programmer, your best bet is to read existing systrace policies to see how other people use this. In this particular case, the program may bind to port 53, over both TCP and UDP protocols. If an attacker had an exploit to make named(8) attach a command prompt on a high-numbered port, this systrace policy would prevent that exploit from working, without changing a single line of named(8) code!
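
To see where a sockaddr comes from, it can help to make a bind() call yourself. This Python sketch binds a UDP socket on the loopback address and then renders the result as a string. The "inet-[address]:port" layout is an assumption modeled on the pattern in the rule above; check a real policy or the systrace output for the exact form. The describe_bind() function is invented for this example.

```python
import socket


def describe_bind(host="127.0.0.1", port=0):
    # Port 0 asks the kernel to pick any free port, so no privileges are needed.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((host, port))  # bind(2): the call the policy rule inspects
        address, bound_port = sock.getsockname()
    finally:
        sock.close()
    # Assumed string layout, modeled on the "inet-*:53" pattern.
    return "inet-[%s]:%d" % (address, bound_port)


if __name__ == "__main__":
    print(describe_bind())
```

The address and port handed to bind() are exactly the information the rule compares against its pattern.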

Using eq Comparisons

Here, we compare the argument used by a system call to a path on the system. If they match, the system call is permitted.

 native-chdir: filename eq "/" then permit
 native-chdir: filename eq "/namedb" then permit

At first glance, this seems nonsensical. If the program tries to change to the root directory or to the directory "/namedb", systrace will allow it. Why would you possibly want to allow named access to the root directory? Well, on OpenBSD, named(8) runs in a chroot jail, so the root directory it sees is the root of the chroot. The program can certainly access that! This is one example of how system calls can be very confusing.

Using sub Comparisons

The named policy doesn't have any sub comparisons, so here's one taken from the lpd(8) policy.

 native-connect: sockaddr sub ":515" then permit

All this is looking for is a string that contains the characters ":515". We could use a wildcard and a "match" comparison, but the sub comparison takes fewer system resources than the match comparison.

Using re Comparisons

If at all possible, avoid "re" comparisons. They use a lot of system time, especially when "match" or "sub" would do almost as well. Here's a policy that lets a command call execve(2) if the filename ends with the string "/make".

 native-execve: filename re "/make$" then permit

If you don't recognize the regular expression, go read grep(1).
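
If you would rather experiment than read, you can try the same pattern in Python. Python's re module is standing in for grep-style regular expressions here, which is close enough for a pattern this simple. The "$" anchors the pattern to the end of the filename, so only programs whose last path component is exactly "make" qualify. The permitted() helper and the sample paths are made up for this example.

```python
import re

PATTERN = re.compile(r"/make$")


def permitted(filename):
    # search() looks anywhere in the string; "$" pins the match to the end.
    return PATTERN.search(filename) is not None


if __name__ == "__main__":
    for path in ("/usr/bin/make", "/usr/bin/gmake", "/usr/bin/makewhatis"):
        print(path, permitted(path))
```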

Syscall Aliases

Systrace groups certain system calls with very similar functions into aliases. You can disable this functionality with a command-line switch and only use the exact system calls you specify, but in most cases these aliases are quite useful and shrink your policies considerably. The two aliases are "fsread" and "fswrite."

Fsread is an alias for stat(), lstat(), readlink(), and access(), under the native and Linux ABIs. fswrite is an alias for unlink(), mkdir(), and rmdir(), in both the native and Linux ABIs. As open() can be used to either read or write a file, it is aliased by both fsread and fswrite, depending on how it is called.
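
The alias membership just described fits in a small lookup table. In this Python sketch, the way open() is classified (read-only access is fsread, anything that can write is fswrite) is my reading of "depending on how it is called" and should be treated as an assumption. The alias_for() function is invented for this example.

```python
import os

# Alias membership as listed in the text; open() is handled separately.
ALIASES = {
    "fsread": {"stat", "lstat", "readlink", "access"},
    "fswrite": {"unlink", "mkdir", "rmdir"},
}


def alias_for(syscall, open_flags=os.O_RDONLY):
    if syscall == "open":
        # Assumption: only a read-only open counts as fsread.
        access_mode = open_flags & os.O_ACCMODE
        return "fsread" if access_mode == os.O_RDONLY else "fswrite"
    for alias, members in ALIASES.items():
        if syscall in members:
            return alias
    return None  # not covered by either alias


if __name__ == "__main__":
    print(alias_for("stat"))
    print(alias_for("open", os.O_WRONLY | os.O_CREAT))
```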

Optional Arguments

Systrace can log successful system calls, and can also return whichever error you choose when it denies a call.

If you put an error code name in square brackets after the "deny" keyword, that error code will be returned to the program when it tries to access that system call. Programs will behave differently depending on the error that they receive; named will react differently to a "permission denied" error than it will to an "out of memory" error. You can get a complete list of error codes from errno(2). Use the error name, not the error number. For example, here we return an error for nonexistent files:

 filename sub "<non-existent filename>" then deny[enoent]
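
Error names are easier to work with once you can see what a program receives. This Python sketch maps a policy-style lowercase name to its errno number and shows how a failed open() looks from inside the program. The helper names and the sample path are hypothetical.

```python
import errno
import os


def error_number(name):
    # Policies use the lowercase name ("enoent"); Python's errno uses uppercase.
    return getattr(errno, name.upper())


def classify_failure(path):
    """Try to open a file and report the error name the program received."""
    try:
        os.close(os.open(path, os.O_RDONLY))
        return "ok"
    except OSError as exc:
        return errno.errorcode.get(exc.errno, "unknown").lower()


if __name__ == "__main__":
    print(error_number("enoent"), os.strerror(error_number("enoent")))
    print(classify_failure("/no-such-directory-for-this-sketch/file"))
```

A program that checks for "enoent" and quietly moves on will behave very differently from one that sees "eacces" and gives up, which is why choosing the returned error matters.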

If you put the word "log" at the end of your rule, successful system calls will be logged. For example, if we wanted to log each time named(8) attached to port 53, we could edit the policy statement for the bind() call to read:

 native-bind: sockaddr match "inet-*:53" then permit log

Filtering by User and Group

You can also choose to filter rules based on user ID and group ID, using the "uid" and "gid" keywords. You must use the numeric IDs, as given in /etc/passwd and /etc/group respectively. Here, we allow a program to use the setgid() system call, if it's trying to change to the group ID 70.

 native-setgid: gid eq "70" then permit
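
Because the policy wants numbers rather than names, you may find yourself looking them up. This Python sketch uses the standard pwd and grp modules, which read the same account databases, and then formats a rule like the one above. The function names are invented for this example, and a lookup for a name that does not exist simply returns None.

```python
import grp
import pwd


def gid_for(group_name):
    try:
        return grp.getgrnam(group_name).gr_gid
    except KeyError:
        return None  # no such group on this system


def uid_for(user_name):
    try:
        return pwd.getpwnam(user_name).pw_uid
    except KeyError:
        return None  # no such user on this system


def setgid_rule(gid):
    # Same shape as the rule shown above, with the numeric ID filled in.
    return 'native-setgid: gid eq "%d" then permit' % gid


if __name__ == "__main__":
    print(uid_for("root"))
    print(setgid_rule(70))
```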

Privilege Elevation

Finally, systrace allows a program running as a regular user to perform certain behaviors as another user, using the "as userid" keywords. We could run named(8) entirely as a regular user if we changed the bind() call to something like this:

 native-bind: sockaddr match "inet-*:53" then permit as root

While I expect to see this being used more and more in OpenBSD installations in the future, systrace is new enough that it has not yet been made the default. Also, systrace does impose a minor amount of overhead, and policies can vary widely from environment to environment, so it may never be the default.

Making a Systrace Policy File

Each systrace policy is kept in a file named after the full path of the program, replacing slashes with underscores. For example, our policy for /usr/sbin/named is called usr_sbin_named. The policy file starts with a policy and an ABI statement.

 Policy: /usr/sbin/named, Emulation: native

The "Policy" statement gives the full path to the program this policy is for. You can't fool systrace(1) by giving the same name to a program elsewhere on the system. The "Emulation" entry shows which ABI this policy is for.
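
The naming convention is simple enough to express in a couple of lines. This Python sketch derives the policy file name and the opening line from a program's full path; both helper names are made up for this example.

```python
def policy_filename(program_path):
    # Drop the leading slash, then turn the remaining slashes into underscores.
    return program_path.lstrip("/").replace("/", "_")


def policy_header(program_path, abi="native"):
    # First line of the policy file: the program's full path and its ABI.
    return "Policy: %s, Emulation: %s" % (program_path, abi)


if __name__ == "__main__":
    print(policy_filename("/usr/sbin/named"))
    print(policy_header("/usr/sbin/named"))
```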

Syscalls Without Rules

If a program running under systrace(1) tries to make a system call that is not listed in the policy, the system call will be denied. Systrace denies all actions that are not explicitly permitted and logs the rejection to syslog. If a program running under systrace has a problem, check /var/log/messages to discover what system call the program wants and decide whether you want to add it to your policy, reconfigure the program, or live with the error.
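
The default-deny behavior is worth internalizing, so here is a toy model of it in Python. This is a deliberately simplified sketch: the policy is just a mapping from system call names to decisions, with none of the comparisons a real policy would carry, and the function names are invented for this example.

```python
def decide(policy, syscall):
    # Anything missing from the policy falls through to "deny".
    return policy.get(syscall, "deny")


def denied_calls(policy, attempted):
    """Return the attempted calls that would be rejected and logged."""
    return [call for call in attempted if decide(policy, call) != "permit"]


if __name__ == "__main__":
    policy = {"accept": "permit", "bind": "permit"}
    print(denied_calls(policy, ["bind", "accept", "execve"]))
```

The calls this function returns are the ones you would go looking for in the logs.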