Section 5.5. Least Privilege Interfaces | Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture (2nd Edition)

5.5. Least Privilege Interfaces

In this section we describe the details of how the implementation of process privileges is layered in the kernel and the interfaces offered between various subsystems.

5.5.1. The Conspiracy of Bit Sets and Constants

The most convenient data structures for privileges and privilege sets are integers and bit sets.

A few words of memory are treated as an array of bits, with the privilege number serving as the bit index. This is how the implementation in Trusted Solaris works, as do most implementations based on the defunct POSIX draft P1003.2c capabilities.

It was clear to the design team that bit sets with implementation-visible sizes and manifest constants would soon become an impediment to future expansion and interoperability. Either memory is wasted by picking a generous upper bound for privilege set sizes or, more likely, a number that is too small is picked.

For efficiency, the core kernel still operates with fixed-size bit sets and manifest constants. But the constants and set sizes are not visible to user applications or DDI-compliant kernel modules at compile time. The data structures exported by the kernel are all self-describing; they contain a header with a field that denotes the size of the header as well as the size of additional data following the header.
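The self-describing layout described above can be sketched as follows. This is a minimal illustration, not the Solaris ABI: the type info_hdr_t, its fields, and the function info_payload() are hypothetical names introduced here. The point it shows is that a consumer skips the header by the size the header itself reports, so the header can grow without breaking older consumers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header layout, for illustration only; not the Solaris ABI. */
typedef struct info_hdr {
	uint32_t hdr_size;	/* size of this header in bytes */
	uint32_t payload_size;	/* size of the data that follows the header */
} info_hdr_t;

/*
 * Return a pointer to the payload and store its length in *lenp, or
 * return NULL if the buffer is too small for what the header claims.
 */
static const unsigned char *
info_payload(const void *buf, size_t buflen, size_t *lenp)
{
	info_hdr_t hdr;

	assert(buf != NULL && lenp != NULL);
	if (buflen < sizeof (hdr))
		return (NULL);
	memcpy(&hdr, buf, sizeof (hdr));	/* avoid unaligned access */
	if (hdr.hdr_size < sizeof (hdr) || hdr.hdr_size > buflen ||
	    hdr.payload_size > buflen - hdr.hdr_size)
		return (NULL);
	*lenp = hdr.payload_size;
	/* Skip hdr_size bytes, not sizeof (hdr): the header may have grown. */
	return ((const unsigned char *)buf + hdr.hdr_size);
}
```

A consumer written this way never needs a compile-time constant for the set size or the number of sets; it reads both from the data it is handed.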

The kernel publishes all relevant information to user processes in that fashion. The following parameters of the system are not fixed in any of the kernel/userland or kernel/kernel interfaces:

Number of privilege sets
Size of the privilege sets
Names of the privilege sets
Number of privileges
Name-to-number mapping of privileges

These parameters are then used to configure the library interfaces to the privilege system and should not be directly used by user programs.

Applications use privilege names and privilege set names as strings, and the library takes care of name/number mappings in most cases; in general, applications do not need to convert individual privileges to numbers.

The current implementation does fix all these values in the kernel, but they could be made fully dynamic. For example, if we choose to allow kernel modules to allocate preposterous numbers of privileges, we could introduce a kernel parameter that would grow privilege sets to a specific size at boot.

The privilege names are not localized; the conversion of privileges and privilege sets to and from strings is locale neutral.

The library routine priv_gettext(3C), which maps privileges to a descriptive text, is localized. It obtains the textual description from /etc/security/priv_names in the C locale and uses an algorithm similar to the one for magic(4) to find a localized version of the text messages.

Another important outcome of the early design reviews was that privilege sets as bits should not exist in a permanent form anywhere, except when accompanied by information sufficient for interpretation of the bits. We use this in one place, process core dumps.

5.5.2. Privilege Names and Constants

The privileges used in Solaris are described in privileges(5). The privileges are part of the interface specification in several ways. They are available as manifest integer constants for use in the core kernel; they are available as manifest string constants as a public interface.

The names of privileges are looked up in tables maintained by the kernel; there is an obvious mapping between the manifest constants and the actual privilege names. The manifest constants are upper case and prefixed with PRIV_. The strings themselves are lower case without the prefix. The name lookup routines are case insensitive; they also accept the "priv_" prefix, which is stripped before the actual lookup.
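The lookup rules just described can be sketched in a few lines. This is an illustration under stated assumptions, not the kernel's code: the three-entry table and the numbers assigned to its entries are hypothetical, as is the function name lookup_priv(). Only the matching rules (case-insensitive comparison, optional prefix stripped first) come from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>	/* strcasecmp, strncasecmp (POSIX) */

/* Hypothetical table; the real table is maintained by the kernel. */
static const char *const priv_names[] = {
	"file_dac_read",	/* index 0 (number chosen for illustration) */
	"proc_fork",		/* index 1 */
	"sys_config"		/* index 2 */
};

/* Return the privilege number for name, or -1 if it is unknown. */
static int
lookup_priv(const char *name)
{
	size_t i;

	assert(name != NULL);
	/* Accept an optional "priv_" prefix in any case and strip it. */
	if (strncasecmp(name, "priv_", 5) == 0)
		name += 5;
	for (i = 0; i < sizeof (priv_names) / sizeof (priv_names[0]); i++) {
		if (strcasecmp(name, priv_names[i]) == 0)
			return ((int)i);
	}
	return (-1);
}
```

With this sketch, the strings "PRIV_SYS_CONFIG", "priv_sys_config", and "sys_config" all resolve to the same entry.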

Some of the privileges are very specific; we believe they should be classified as stable. Some privileges are evolving because they are too generic (PRIV_SYS_CONFIG) or might be made obsolete (PRIV_SYS_SUSER_COMPAT). The privilege constants are all classified as stable.

Privileges are logically grouped according to the scope of the privilege.

FILE privileges operate on file system objects. The subgroup FILE_DAC overrides discretionary access control on files.
IPC privileges override IPC object access controls.
NET privileges give access to specific network functionality.
PROC privileges allow processes to modify restricted properties of the process itself and give access to features with a process scope, such as high resolution timers, locked memory, etc.
The SYS family gives processes unrestricted access to various system properties.

5.5.3. Kernel Data Structures

The privilege sets are in one place: cred_t. That data structure currently carries the information about process privileges; it is also the data structure available at those locations where we need to test for privileges. In Solaris 10, accessor functions are provided to access and manipulate the cred_t; <sys/cred.h> is essentially reduced to this:

    typedef struct cred cred_t;

    int prochasprocperm(struct proc *, struct proc *, const cred_t *);
    int supgroupmember(gid_t, const cred_t *);
    uint_t crgetref(const cred_t *);
    uid_t crgetuid(const cred_t *);
    uid_t crgetruid(const cred_t *);
    uid_t crgetsuid(const cred_t *);
    gid_t crgetgid(const cred_t *);
    gid_t crgetrgid(const cred_t *);
    gid_t crgetsgid(const cred_t *);
    const gid_t *crgetgroups(const cred_t *);
    int crgetngroups(const cred_t *);
    int crsetresuid(cred_t *, uid_t, uid_t, uid_t);
    int crsetresgid(cred_t *, gid_t, gid_t, gid_t);
    int crsetugid(cred_t *, uid_t, gid_t);
    int crsetgroups(cred_t *, int, gid_t *);

For the most part, these are the obvious accessor functions; they should have a classification of Public, Evolving and should be made part of the Solaris-specific DDI/DKI. The function crgetref(), an implementation artifact, and the crset*() functions are Consolidation Private.

The new function prochasprocperm() behaves like the original hasprocperm() but takes two processes as arguments. This allows the function to check for identical processes, session IDs, and so on.

The function supgroupmember() behaves like groupmember() but checks only the supplemental groups and not the effective group ID. The existing interface did not allow exec(2) to correctly determine whether a process was set-gid.

By making cred_t an incomplete type, we guarantee that all code that declares objects of type cred_t and all code that dereferences cred_t breaks on the first recompile, requiring further investigation by the developer. Note that such uses were suspect already since cred_t is a dynamically sized structure. The intention is that this further investigation leads to the use of the proper interfaces as defined in <sys/cred.h>.
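The incomplete-type technique can be sketched in miniature. The names below (ocred_t and the ocred_*() functions) are hypothetical stand-ins, not the Solaris definitions, and the two halves are shown in one file only for brevity; in practice the first half would live in a public header and the second in an implementation file.

```c
#include <assert.h>
#include <stdlib.h>

/* ---- What a public header would expose (hypothetical names) ---- */
typedef struct ocred ocred_t;		/* incomplete type: size unknown */
ocred_t *ocred_alloc(unsigned int uid);
unsigned int ocred_getuid(const ocred_t *);
void ocred_free(ocred_t *);

/*
 * Code that sees only the declarations above cannot write
 * "ocred_t c;" or "c->oc_uid"; both fail to compile.
 */

/* ---- What only the implementation file would see ---- */
struct ocred {
	unsigned int oc_ref;	/* reference count */
	unsigned int oc_uid;	/* effective user id */
};

ocred_t *
ocred_alloc(unsigned int uid)
{
	ocred_t *cr = calloc(1, sizeof (*cr));

	if (cr != NULL) {
		cr->oc_ref = 1;
		cr->oc_uid = uid;
	}
	return (cr);
}

unsigned int
ocred_getuid(const ocred_t *cr)
{
	assert(cr != NULL);
	return (cr->oc_uid);
}

void
ocred_free(ocred_t *cr)
{
	free(cr);
}
```

Because callers go through accessors, fields can be added to or reordered within the structure without touching them.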

A second reason that cred_t is opaque is that we want the implementation to evolve. Inside the kernel, process privileges are carried around as bit sets for efficiency. Although we don't want to fix privilege numbers, the size of privilege sets, or even the number of privilege sets, we do want to carry them all directly in cred_t. We would also like to enable other Solaris projects to extend the credential with data types of their choosing, and to allow certain data structures that logically belong in cred_t to be moved there.

The full cred_t is defined in <sys/cred_impl.h>.

    struct cred {
            uint_t          cr_ref;         /* reference count */
            uid_t           cr_uid;         /* effective user id */
            gid_t           cr_gid;         /* effective group id */
            uid_t           cr_ruid;        /* real user id */
            gid_t           cr_rgid;        /* real group id */
            uid_t           cr_suid;        /* "saved" user id (from exec) */
            gid_t           cr_sgid;        /* "saved" group id (from exec) */
            uint_t          cr_ngroups;     /* number of groups returned by */
                                            /* crgroups() */
            cred_priv_t     cr_priv;        /* privileges */
            projid_t        cr_projid;      /* project */
            struct zone     *cr_zone;       /* pointer to per-zone structure */
            gid_t           cr_groups[1];   /* cr_groups size not fixed */
                                            /* audit info is defined dynamically */
                                            /* and valid only when audit enabled */
            /* auditinfo_addr_t     cr_auinfo;      audit info */
    };

    extern int ngroups_max;

This new definition is not binary compatible because the cr_groups field was moved down.
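Because the group array sits at the end of the structure and its size is not fixed, a credential has to be allocated with a size computed at run time. The sketch below illustrates that idea only. It uses a C99 flexible array member in place of the cr_groups[1] idiom shown above, and the names vcred_t, vgid_t, vcred_size(), and vcred_alloc() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef unsigned int vgid_t;		/* stand-in for gid_t */

typedef struct vcred {
	unsigned int	vc_ref;		/* reference count */
	unsigned int	vc_ngroups;	/* entries of vc_groups in use */
	vgid_t		vc_groups[];	/* C99 flexible array member */
} vcred_t;

/* Bytes needed for a credential that can hold up to max_groups groups. */
static size_t
vcred_size(size_t max_groups)
{
	/* Guard against overflow in the size computation. */
	assert(max_groups <=
	    (SIZE_MAX - offsetof(vcred_t, vc_groups)) / sizeof (vgid_t));
	return (offsetof(vcred_t, vc_groups) + max_groups * sizeof (vgid_t));
}

/* Allocate a zeroed credential with room for max_groups groups. */
static vcred_t *
vcred_alloc(size_t max_groups)
{
	vcred_t *cr = calloc(1, vcred_size(max_groups));

	if (cr != NULL)
		cr->vc_ref = 1;
	return (cr);
}
```

This is also why code that declares a credential object directly, rather than obtaining one from an allocator, was suspect even before the type became opaque.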

The privilege sets are defined in <sys/priv_impl.h>.

    typedef uint32_t priv_chunk_t;

    /*
     * priv_set_t is a structure holding a set of privileges
     */
    struct priv_set {
            priv_chunk_t pbits[PRIV_SETSIZE];
    };

    typedef struct cred_priv_s {
            priv_set_t      crprivs[PRIV_NSET];       /* Priv sets */
            uint_t          crpriv_flags;             /* Privilege flags */
    } cred_priv_t;

The manifest constants PRIV_SETSIZE and PRIV_NSET are generated at kernel compile time and sized according to the number of actually defined privileges and sets. For all privileges, a manifest constant is generated as well; all privilege manifest constants are Consolidation Private. They are included in the generated header file <sys/priv_const.h>, which is shipped to allow kernel browsers to continue to compile.
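To make the fixed-size bit set concrete, here is a minimal sketch of set operations over an array of 32-bit chunks. It is not the kernel's code: the xpriv_* names and the value chosen for XPRIV_SETSIZE are hypothetical, and the bit numbering within a chunk is an assumption made for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define	XPRIV_SETSIZE		2	/* hypothetical; real value is generated */
#define	XPRIV_CHUNK_BITS	32	/* bits per uint32_t chunk */

typedef uint32_t xpriv_chunk_t;

typedef struct xpriv_set {
	xpriv_chunk_t pbits[XPRIV_SETSIZE];
} xpriv_set_t;

/* Clear every privilege in the set. */
static void
xpriv_emptyset(xpriv_set_t *set)
{
	memset(set, 0, sizeof (*set));
}

/* Add privilege number priv to the set. */
static void
xpriv_addset(xpriv_set_t *set, unsigned int priv)
{
	assert(priv < XPRIV_SETSIZE * XPRIV_CHUNK_BITS);
	set->pbits[priv / XPRIV_CHUNK_BITS] |=
	    (xpriv_chunk_t)1 << (priv % XPRIV_CHUNK_BITS);
}

/* Return 1 if privilege number priv is in the set, 0 otherwise. */
static int
xpriv_ismember(const xpriv_set_t *set, unsigned int priv)
{
	assert(priv < XPRIV_SETSIZE * XPRIV_CHUNK_BITS);
	return ((set->pbits[priv / XPRIV_CHUNK_BITS] >>
	    (priv % XPRIV_CHUNK_BITS)) & 1);
}

/* Return 1 if every privilege in a is also in b, 0 otherwise. */
static int
xpriv_issubset(const xpriv_set_t *a, const xpriv_set_t *b)
{
	int i;

	for (i = 0; i < XPRIV_SETSIZE; i++) {
		if ((a->pbits[i] & ~b->pbits[i]) != 0)
			return (0);
	}
	return (1);
}
```

Membership and subset tests reduce to a handful of word operations, which is the efficiency argument for keeping bit sets inside the core kernel.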

Another existing kernel data structure, the STREAMS data block dblk_t, is changed; the db_uid field is replaced by a db_credp field, allowing us to base security policy decisions on the sender of the message, rather than on the credentials of the process opening the devices. This lets us return to BSD socket semantics and use the privileges at bind(3socket) time rather than the privileges at socket(3socket) time to determine whether a bind() call to a privileged port can succeed. This makes Solaris more compatible with other UNIX socket implementations.

By reclaiming an unused field, dropping db_uid, and slightly rearranging the nonpublic fields of dblk_t, we succeeded in shrinking dblk_t by 8 bytes. Macros were defined to access the field, and new functions were defined to allocate mblk_t with data blocks initialized either with a credential or from a template message block.

5.5.4. Kernel Interfaces

At the heart of the privilege code are the priv_policy* routines. These routines are passed a credential, a privilege to check, and possibly some additional information for debugging. The functions handle auditing, logging, and debugging, and also take care of the antiquated ASU flag for acct(2) accounting. On failure, the missing privilege is recorded in the lwp structure.

    int priv_policy(const cred_t *, int, int, const char *);
    boolean_t priv_policy_only(const cred_t *, int);
    boolean_t priv_policy_choice(const cred_t *, int);

These functions are generally not called directly from kernel modules, because they require inlining privilege constants. Additional functions, named secpolicy_name() after the policy decision they implement, are now used instead of direct calls. Most of these functions map directly onto a single privilege, but the overall relationship between policy functions and privileges is N-to-M rather than one-to-one. The secpolicy_vnode_setattr() function, for example, moves all policy decisions typically found in VOP_SETATTR into a single function. A side effect of these changes is that all file systems using the new interfaces make identical policy decisions. The functions are defined in <sys/policy.h>.

    int secpolicy_acct(const cred_t *);
    int secpolicy_allow_setid(const cred_t *, uid_t, boolean_t);
    int secpolicy_audit_config(const cred_t *);
    int secpolicy_audit_getattr(const cred_t *);
    int secpolicy_audit_modify(const cred_t *);
    int secpolicy_chroot(const cred_t *);
    int secpolicy_clock_highres(const cred_t *);
    int secpolicy_console(const cred_t *);
    int secpolicy_coreadm(const cred_t *);
    int secpolicy_dispadm(const cred_t *);
    int secpolicy_excl_open(const cred_t *);
    int secpolicy_fs_config(const cred_t *);
    int secpolicy_fs_linkdir(const cred_t *);
    int secpolicy_fs_minfree(const cred_t *);
    int secpolicy_fs_mount(const cred_t *, vnode_t *);
    int secpolicy_fs_quota(const cred_t *);
    int secpolicy_ipc_access(const cred_t *, const struct kipc_perm *, mode_t);
    int secpolicy_ipc_config(const cred_t *);
    int secpolicy_ipc_owner(const cred_t *, const struct kipc_perm *);
    int secpolicy_lock_memory(const cred_t *);
    int secpolicy_modctl(const cred_t *, int);
    int secpolicy_net(const cred_t *, int, boolean_t);
    int secpolicy_net_config(const cred_t *, boolean_t);
    int secpolicy_net_privaddr(const cred_t *, in_port_t);
    int secpolicy_net_rawaccess(const cred_t *);
    int secpolicy_newproc(const cred_t *);
    int secpolicy_nfs(const cred_t *);
    int secpolicy_pcfs_modify_bootpartition(const cred_t *);
    int secpolicy_ponline(const cred_t *);
    int secpolicy_power_mgmt(const cred_t *);
    int secpolicy_proc_access(const cred_t *);
    int secpolicy_proc_excl_open(const cred_t *);
    int secpolicy_proc_owner(const cred_t *, const cred_t *, int);
    int secpolicy_pset(const cred_t *);
    int secpolicy_rctlsys(const cred_t *);
    int secpolicy_resource(const cred_t *);
    int secpolicy_rpcmod_open(const cred_t *);
    int secpolicy_rsm_access(const cred_t *, uid_t, mode_t);
    int secpolicy_setpriority(const cred_t *);
    int secpolicy_settime(const cred_t *);
    int secpolicy_spec_open(const cred_t *, struct snode *, int, vtype_t);
    int secpolicy_sti(const cred_t *cr);
    int secpolicy_sys_config(const cred_t *, boolean_t);
    int secpolicy_sys_devices(const cred_t *);
    int secpolicy_tasksys(const cred_t *);
    int secpolicy_vnode_access(const cred_t *, vnode_t *, uid_t, mode_t);
    int secpolicy_vnode_create_gid(const cred_t *);
    int secpolicy_vnode_owner(const cred_t *, uid_t);
    int secpolicy_vnode_remove(const cred_t *);
    int secpolicy_vnode_setdac(const cred_t *);
    int secpolicy_vnode_setid_retain(const cred_t *, boolean_t);
    int secpolicy_vnode_setids_setgids(const cred_t *, gid_t);
    int secpolicy_vnode_stky_modify(const cred_t *cr);
    int secpolicy_basic_exec(const cred_t *);
    int secpolicy_basic_fork(const cred_t *);
    int secpolicy_basic_proc(const cred_t *);
    int secpolicy_basic_link(const cred_t *);
    int secpolicy_vnode_setattr(const cred_t *, struct vnode *,
        struct vattr *, const struct vattr *, int,
        int iaccess(/* void *, int, cred_t * */), void *);

The privilege checks in the kernel are all replaced with calls to the appropriate secpolicy*() functions, to which are passed sufficient arguments, always including the current process credential and often more information, such as a pointer to the object on which the operation is performed. Different security policy functions may map to a check for the presence of the same privilege.

Privilege numbers, privilege set numbers, and set sizes are Consolidation Private. Other components must call the kernel policy functions; if a driver needs the number of a specific privilege, it can look that number up by using priv_getbyname(9F).

The function priv_getbyname(9F) also enables kernel modules to allocate new privileges by specifying PRIV_ALLOC as a flags argument. Names of privileges allocated with this function are limited to PRIVNAME_MAX (32) characters and can contain only alphanumeric characters and underscores. Privilege names are case insensitive but case preserving. The number of slots for allocating new privileges is limited both by the number of unaccounted bits in the bit sets and by the amount of memory reserved for the additional privilege names. While we advise an algorithm to pick unique names, nonunique privilege names will not cause fatal clashes of any kind; the "clashing" privilege allows a process to perform both restricted operations, adding just a little bit to the "Least" in Least Privilege.
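The naming constraints for allocated privileges can be expressed as a small validation routine. This is a sketch of the stated rules only, not the check priv_getbyname() performs: the function priv_name_ok() and the constant X_PRIVNAME_MAX are hypothetical, and treating the limit as 32 characters excluding the terminating NUL, as well as rejecting the empty string, are assumptions.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define	X_PRIVNAME_MAX	32	/* limit quoted in the text */

/* Return 1 if name satisfies the stated naming rules, 0 otherwise. */
static int
priv_name_ok(const char *name)
{
	size_t i, len;

	assert(name != NULL);
	len = strlen(name);
	if (len == 0 || len > X_PRIVNAME_MAX)
		return (0);
	for (i = 0; i < len; i++) {
		/* Only alphanumeric characters and underscores are allowed. */
		if (!isalnum((unsigned char)name[i]) && name[i] != '_')
			return (0);
	}
	return (1);
}
```

A module would apply a check of this kind to a candidate name before asking the kernel to allocate it.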

At this time, the secpolicy functions are a private implementation detail, and the interface is unstable. The priv_policy functions are intended for public consumption.