SunOS 4.x multiprocessor systems | PANIC! UNIX System Crash Dump Analysis Handbook (Bk/CD-ROM)

The SunOS 4.x multiprocessor kernels are a half-step toward a true multiprocessor system. A truly multiprocessing kernel would have required a major rewrite of the kernel code, which was not practical in the time available (that rewrite became "Solaris 2"). Instead, the old single-processor kernel was modified in a straightforward way to make it work with more than one CPU.

The idea was simple: since having more than one processor in the kernel at a time would cause many synchronization problems, don't let that happen. A "lock" was built at the kernel entry points so that only one processor could be executing kernel code at any time. Any other processor attempting to handle a system request and enter the kernel was forced to wait until the first one had finished and released the lock. This was inefficient when several CPUs tried to enter the kernel at once, because only one would get in and the others would just spin, waiting for the lock. For systems running many CPU-intensive user processes, however, it provided effective multiprocessing of user tasks on top of a single-threaded kernel.

Multiprocessor debugging on a 4.x system, then, is quite similar to debugging on a single-processor system. At most one CPU will actually be doing work in the kernel, so if there is a crash, there should be only one processor, one stack, and one set of data to worry about. The other CPUs should be idle, in user mode, or blocked at the kernel lock waiting to get in.

There are only a couple of changes to the kernel you should be aware of. The first is the kernel lock and the code that handles it. The second is a per-CPU state structure. Some information specific to each processor had to be kept around, and a single structure was allocated to hold it. There are multiple copies of this structure in kernel space, each at a different address. To let the existing kernel code work without change, the structure belonging to the active CPU is also mapped in at a fixed kernel address, so that all of its variables are accessible at the same location.

SunOS 4.x lock code

Kernel lock manipulation is done in assembly language routines, that is, in very low-level kernel code. The main functions are:

klock_enter() - Enter/grab the kernel lock. Spin if it's otherwise occupied.
klock_knock() - Try it. Return a true/false indication (we got the lock, or somebody else has it already). Do not just sit there and spin.
klock_exit() - Release the lock.
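
The behavior of these three routines can be illustrated with a minimal user-space sketch in C11. This is not the SunOS assembly code: the names sketch_klock_enter(), sketch_klock_knock(), sketch_klock_exit(), sketch_klock_peek(), and klock_word are hypothetical, and the owner encoding (lock byte 0xff plus 8 + CPU number) is an assumption taken from the klock values listed in this section. The sketch also claims the lock and records the owner in a single atomic step, so it never shows the intermediate "owned but not yet set up" value.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel lock word; 0 means "free". */
static _Atomic uint32_t klock_word = 0;

/* Assumed encoding: lock byte 0xff in the high-order byte,
 * 8 + CPU number in the low-order byte. */
static uint32_t owner_value(unsigned cpu)
{
    return 0xff000000u | (8u + (uint32_t)cpu);
}

/* Try once. Return true if we got the lock, false if somebody
 * else already has it. Never spins. */
bool sketch_klock_knock(unsigned cpu)
{
    uint32_t expected = 0; /* only a free lock may be taken */
    return atomic_compare_exchange_strong(&klock_word, &expected,
                                          owner_value(cpu));
}

/* Grab the lock, spinning for as long as it is occupied. */
void sketch_klock_enter(unsigned cpu)
{
    while (!sketch_klock_knock(cpu)) {
        /* busy-wait: another CPU is in the "kernel" */
    }
}

/* Release the lock so that a waiting CPU can get in. */
void sketch_klock_exit(void)
{
    atomic_store(&klock_word, 0);
}

/* Read the current lock word, as a debugger would. */
uint32_t sketch_klock_peek(void)
{
    return atomic_load(&klock_word);
}
```

The difference between enter and knock is only in what happens on failure: knock reports it to the caller, while enter keeps retrying until it succeeds.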

The kernel lock itself, klock, is a word containing a lock byte (the first, or high-order, byte of the word) and an indication of which CPU owns it. The possible values are:

00000000 - The lock is free
ff000000 - Somebody owns it but may not have finished setting things up
ff000008 - Owned by CPU #0
ff000009 - Owned by CPU #1
ff00000a - Owned by CPU #2
ff00000b - Owned by CPU #3
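
When reading klock out of a crash dump, the table above can be turned into a small decoding helper. The following C sketch is only an illustration: the function decode_klock() and its types are hypothetical, and the rule "owner field = 8 + CPU number" is inferred from the six values listed, not taken from the kernel source. Any value outside the table is reported as unexpected instead of being guessed at.

```c
#include <stdint.h>

enum klock_state {
    KLOCK_FREE,             /* 00000000 */
    KLOCK_CLAIMED_NO_OWNER, /* ff000000: owner not filled in yet */
    KLOCK_OWNED,            /* ff000008 .. ff00000b */
    KLOCK_UNEXPECTED        /* anything not in the table */
};

struct klock_info {
    enum klock_state state;
    int cpu; /* owning CPU number, or -1 if not applicable */
};

/* Decode a klock word as read from the crash dump. */
struct klock_info decode_klock(uint32_t klock)
{
    struct klock_info info = { KLOCK_UNEXPECTED, -1 };
    uint32_t lock_byte = klock >> 24;       /* high-order byte */
    uint32_t owner = klock & 0x00ffffffu;   /* remaining three bytes */

    if (klock == 0) {
        info.state = KLOCK_FREE;
    } else if (lock_byte == 0xff && owner == 0) {
        info.state = KLOCK_CLAIMED_NO_OWNER;
    } else if (lock_byte == 0xff && owner >= 8 && owner <= 11) {
        info.state = KLOCK_OWNED;
        info.cpu = (int)(owner - 8);        /* assumed: 8 + CPU number */
    }
    return info;
}
```

For example, a klock value of ff00000a decodes to "owned" with CPU number 2, which can then be compared against the CPU that took the panic.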

If you are working with a kernel crash on a multiprocessor machine, one of the first things to check is that the lock is owned by the correct CPU, that is, the one that was executing kernel code when the system went down. With more than one processor working in the kernel, all bets are off, and data corruption is not only possible but likely. One symptom is data in a CPU register (a parameter, for instance) that does not match the value in the memory location it came from. This could be caused by hardware, by an improperly masked interrupt service routine, or by more than one CPU being in the kernel at once.

SunOS 4.x CPU structure

The PerCPU structure is defined in /sys/sun4m/OBJ/percpu_str.h (this header file is generated automatically as part of the original kernel build). The file defines a structure that is maintained for each CPU in the system. It identifies the CPU (by ID number) and contains the kernel stack for that processor, a pointer to the currently running process (masterprocp and uunix), and many other data areas holding information specific to that processor. A few fields to note: cpuid, cpu_enable, cpu_exists, and cpu_supv, which give the CPU number, whether the CPU is turned on, whether it exists, and whether it is in supervisor (kernel) mode.

These structures appear in an array, PerCPU, one megabyte apart (starting at address VA_PERCPU), and one of them also shows up at a fixed virtual address (VA_PERCPUME). These addresses are defined in /sys/sun4m/devaddr.h. Unfortunately, there is no macro to dump these structures in a readable form, and they are much too large to scan through blindly. Think of this as "an exercise for the reader" to test your macro construction skills.
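
Since the structures sit one megabyte apart, finding the copy for a given CPU is simple arithmetic. The C sketch below shows the calculation under stated assumptions: percpu_addr() and PERCPU_STRIDE are hypothetical names, addresses are treated as 32-bit values, and the base address is passed in as a parameter because the real value of VA_PERCPU has to be looked up in /sys/sun4m/devaddr.h for the kernel being examined.

```c
#include <stdint.h>

/* Spacing between consecutive per-CPU structures: one megabyte. */
#define PERCPU_STRIDE 0x100000u

/* Return the virtual address of the per-CPU structure for `cpu`,
 * given the base of the array (the value of VA_PERCPU taken from
 * devaddr.h; it is not hard-coded here). */
uint32_t percpu_addr(uint32_t va_percpu, unsigned cpu)
{
    return va_percpu + (uint32_t)cpu * PERCPU_STRIDE;
}
```

With the base address in hand, the result for each CPU number gives the starting point for examining that processor's cpuid, cpu_enable, cpu_exists, and cpu_supv fields in the debugger.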