Looking at instructions in memory | PANIC! UNIX System Crash Dump Analysis Handbook (Bk/CD-ROM)

For our purposes, we're mostly going to be looking at a dead program (or kernel) to see where we were and what the code was when the problem occurred. This means we're going to be looking at the machine language that was being executed and reading it as assembly code.

The adb utility has commands to examine memory and display the contents as instructions, disassembling the binary back into readable mnemonics. The i format code does this one instruction (one word) at a time. It is normally coupled with the a format code, so that each word is displayed with its associated symbolic address. Thus, an adb command to dump out 10 instructions in a row, starting at main, would look like:

 main,10?ai

Now let's look at the actual instructions you are most likely to encounter when looking at a program or a kernel that faulted and died.

How load and store instructions can go wrong

There are two instruction types that are used to transfer data between the registers and main memory. These are the load and store instructions.

Load instructions get a byte, word, or some other data type from memory and place it into a register. Store instructions put the register's data back into memory. Each of these operations requires a source and a destination. For load, the source will be a memory location, and the destination will be a register. For store instructions, the source is a register, and the destination is a memory location.

Since the instruction itself must fit into exactly one 32-bit word, how do an operation code, a destination register, and a 32-bit memory address all fit? They don't.

As we've mentioned, any instruction that references memory must have the address of the desired memory location already stored in a register. The memory access instruction then uses that register indirectly to reference the memory cell. In addition, there must be some information on the data type: whether it is signed or unsigned, short or long, and so on. This is normally encoded in the instruction itself, which means that we have a separate instruction code for loading a byte, as opposed to loading a full 32-bit word.
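
To make the squeeze concrete, here is a small C sketch of how the fields of a SPARC V8 load or store are packed into one word. The struct and the function name decode_fmt3 are our own, for illustration only; the bit positions follow the SPARC V8 "format 3" layout. Treat it as a reading aid, not as a replacement for the disassembler in adb.

```c
#include <stdint.h>

/* Fields of a SPARC V8 format-3 instruction word (hypothetical helper type). */
struct fmt3 {
    unsigned op;      /* bits 31-30: 3 marks a memory (load/store) instruction */
    unsigned rd;      /* bits 29-25: register loaded into, or stored from */
    unsigned op3;     /* bits 24-19: the operation, e.g. 0x00 = ld, 0x04 = st */
    unsigned rs1;     /* bits 18-14: register holding the base address */
    unsigned i;       /* bit 13: 1 = add simm13, 0 = add register rs2 */
    unsigned rs2;     /* bits 4-0: second address register (when i == 0) */
    int32_t  simm13;  /* bits 12-0: signed 13-bit offset (when i == 1) */
};

/* Split one 32-bit instruction word into its format-3 fields. */
struct fmt3 decode_fmt3(uint32_t insn)
{
    struct fmt3 f;
    uint32_t imm = insn & 0x1fff;

    f.op  = (insn >> 30) & 0x3;
    f.rd  = (insn >> 25) & 0x1f;
    f.op3 = (insn >> 19) & 0x3f;
    f.rs1 = (insn >> 14) & 0x1f;
    f.i   = (insn >> 13) & 0x1;
    f.rs2 = insn & 0x1f;
    /* Sign-extend the 13-bit immediate by hand. */
    f.simm13 = (imm & 0x1000) ? (int32_t)imm - 0x2000 : (int32_t)imm;
    return f;
}
```

Decoding the word 0xE6060000 (our own hand encoding of ld [%i0], %l3) gives op = 3, op3 = 0, rd = 19 (%l3), and rs1 = 24 (%i0). Nothing is left over for a full 32-bit address, only register numbers or a 13-bit offset.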

Any time you get a bad memory reference, the actual instruction where the fault occurred is almost guaranteed to be a load or a store instruction, since these are the ones that actually touch memory to get or save data.

For example, let's take a user program that died because of a segmentation violation (a bad memory reference). Looking at the PC (program counter) value will provide the address of the instruction that used the bad address, as shown below:

 Hiya... adb a.out core
 $r
 ... {register output trimmed} ...
 pc=0x1067c

Looking at the instruction at that address with adb, we find:

 0x1067c?i
 0x1067c:       ld [%i0], %l3

This is a common load instruction. It says to take a full 32-bit word from memory and put it into a register. The source is pointed to by the address in register %i0, and the destination is the local register %l3. Because the program faulted, the address used was incorrect; thus, the value in register %i0 refers to a memory location that is probably not mapped in. The location is not a part of the data segment of the executing program, a page fault could not resolve the address, and the program was terminated. These errors are commonly known as data faults.

Some of the load and store instructions you may encounter are:

ld - Load a full word
ldsh - Load a signed half-word (16 bits, sign-extended)
lduh - Load a half-word as an unsigned value (clear the upper bits to zero)
ldsb - Load a signed byte (sign-extended)
ldub - Load an unsigned byte
st - Store a full word
sth - Store a half-word
stb - Store a byte
ldstub - Load/store unsigned byte (used in Solaris 2 lock manipulation)
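
The difference between the signed and unsigned loads in the list above is what happens to the upper bits of the 32-bit register. The following C sketch models that behavior on a byte array standing in for big-endian SPARC memory. The function names are ours, chosen for illustration; they are not library calls.

```c
#include <stdint.h>

/* Signed byte load: bit 7 of the byte is copied into the upper 24 bits. */
int32_t load_signed_byte(const uint8_t *addr)
{
    return (int32_t)(addr[0] ^ 0x80) - 0x80;
}

/* Unsigned byte load: the upper 24 bits are cleared to zero. */
uint32_t load_unsigned_byte(const uint8_t *addr)
{
    return addr[0];
}

/* Unsigned half-word load: two bytes, most significant first, zero-extended. */
uint32_t load_unsigned_half(const uint8_t *addr)
{
    return ((uint32_t)addr[0] << 8) | addr[1];
}

/* Signed half-word load: bit 15 is copied into the upper 16 bits. */
int32_t load_signed_half(const uint8_t *addr)
{
    return (int32_t)(load_unsigned_half(addr) ^ 0x8000) - 0x8000;
}
```

With the bytes 0xFF 0xFE in memory, the signed byte load yields -1 and the unsigned one 255; the signed half-word load yields -2 and the unsigned one 0xFFFE. A register that looks wildly negative in a dump may simply be a small unsigned value that was loaded with the wrong instruction, or the reverse.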

Use of some of these instructions (the full- and half-word loads and stores) may generate another type of fault, or trap, similar to that caused by a completely wild pointer: an alignment error. Again, the SPARC architecture imposes certain restrictions, and one of these is that data must be aligned in memory just as instructions must be.

If the instruction is a full-word load (ld), then the address used must be a multiple of 4, referencing a full word on a 4-byte boundary. Addresses that reference a half-word (a 16-bit quantity) must be aligned properly on an even 2-byte boundary. Bytes, of course, don't require any alignment, so this particular fault will never occur with one of the byte-specific instructions. An instruction using an address that is within bounds, but is improperly aligned, will fail. In this case, the program (or the kernel) will be terminated with an address alignment fault.
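
When you have the faulting instruction and the address from the register in front of you, the alignment rule is easy to check by hand. As a minimal sketch, the same test in C looks like this (is_misaligned is a hypothetical helper name, and size is assumed to be 1, 2, or 4):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if an access of `size` bytes at `addr` would break the
 * alignment rule. `size` must be a power of two: 1 for the byte
 * instructions, 2 for half-word, 4 for full-word loads and stores.
 */
bool is_misaligned(uint32_t addr, unsigned size)
{
    return (addr & (size - 1)) != 0;
}
```

For example, the address 0x1067e is fine for a half-word access but misaligned for a full-word access, and no address is ever misaligned for a byte access.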

How branch instructions can go wrong

Another possible cause of addressing errors is a transfer of control (a jump, a branch, or a call to a function) whose destination does not exist: a jump to code that is not there. Of the three types of instructions for transferring control, only one is likely to result in an error.

The first, a conditional branch, is used normally after a test for some particular situation, such as two numbers being equal or the result of an arithmetic operation being non-zero. It is associated with loops and if statements and normally performs a relatively short branch to different code. Pointers don't enter into it. This branch is generated by the compiler and will only cause problems if the code should be present but somehow isn't. Of course, if the program uses self-modifying code or generates its own instructions, all bets are off!

The second, a call instruction, will transfer control to a function. Like a conditional branch, this uses a compiler-generated offset from the current location and is unlikely to jump to nonexistent code.
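
Both of these instructions carry their destination as a word offset from the current PC, embedded in the instruction itself, which is why no register (and so no bad pointer) is involved. The following C sketch shows the arithmetic, assuming the SPARC V8 encodings: a 30-bit displacement for call and a signed 22-bit displacement for conditional branches. The function names are ours.

```c
#include <stdint.h>

/* call: low 30 bits are a word displacement; target = PC + 4 * disp30. */
uint32_t call_target(uint32_t pc, uint32_t insn)
{
    /* Shifting left by 2 discards the two opcode bits and scales to bytes;
       unsigned wraparound handles backward (negative) displacements. */
    return pc + (insn << 2);
}

/* Conditional branch: low 22 bits are a signed word displacement. */
uint32_t branch_target(uint32_t pc, uint32_t insn)
{
    uint32_t disp22 = insn & 0x3fffff;

    if (disp22 & 0x200000)       /* sign bit of the 22-bit field set? */
        disp22 |= 0xffc00000;    /* extend it through the upper bits */
    return pc + (disp22 << 2);
}
```

Because the displacement is always multiplied by 4, the target of a branch or call can never be misaligned; only a destination taken from a register can be.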

The third instruction is a long jump, or jump-and-link, jmpl. The jmpl instruction obtains the destination address from a register. Of them all, this is the one that will result in an error, either because the address is way out of range or because it is not a multiple of 4. The jmpl instruction is used in two places.

The first place is when calling a function where the address of the function is contained in a variable, as shown below.

 (*func_ptr)(p1, p2, p3);
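
A call like the one above trusts whatever happens to be in the pointer. As a hedged sketch of the defensive alternative, here is a small dispatch table in C that refuses an out-of-range index or an empty slot before jumping. All of the names (handlers, dispatch, double_it, negate) are hypothetical, and note that a check like this catches only NULL and bad indexes; it cannot detect a pointer that has been overwritten with garbage.

```c
#include <stddef.h>

typedef int (*handler_fn)(int);

static int double_it(int x) { return 2 * x; }
static int negate(int x)    { return -x; }

/* Function addresses held in an array; slot 1 is deliberately empty. */
static handler_fn handlers[] = { double_it, NULL, negate };

/*
 * Call handlers[idx](arg) and store the result in *result.
 * Returns 0 on success, -1 if the index is out of range or the slot is
 * empty, instead of jumping through a bad address.
 */
int dispatch(size_t idx, int arg, int *result)
{
    if (idx >= sizeof handlers / sizeof handlers[0] || handlers[idx] == NULL)
        return -1;
    *result = handlers[idx](arg);
    return 0;
}
```

Without the check, dispatch(1, ...) would jump to address zero and dispatch(7, ...) would jump to whatever follows the array in memory.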

The second case, used in a very specific form, is when returning from function calls back to the invoking code. This latter case will generally be displayed by adb as a ret or retl instruction, but it's really the same instruction code. If the faulting instruction appears to be a jump in the middle of a sequence of code, look for a bad function address as a parameter, or in an array, or as a structure element. If the instruction is decoded as a ret or retl (a ret is usually paired with a restore instruction), it is possible that you have stack corruption, because the stack is where the return address is usually obtained. This may be due to random pointers (a very hard thing to track down) or perhaps local arrays or strings that went past the expected maximum without checking and overwrote important stack structures.
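
The "local array that went past its maximum" case is worth a short illustration. The sketch below shows one common way to copy a string into a fixed-size local buffer without running past the end, using the standard snprintf call. The helper name copy_bounded is ours, and this is only one of several reasonable approaches.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Copy src into dst, which holds `cap` bytes (cap must be > 0).
 * The result is always NUL-terminated and never written past dst[cap - 1].
 * Returns 1 if the whole string fit, 0 if it had to be truncated.
 */
int copy_bounded(char *dst, size_t cap, const char *src)
{
    int n = snprintf(dst, cap, "%s", src);

    return n >= 0 && (size_t)n < cap;
}
```

With an 8-byte buffer, copying "much too long" returns 0 and leaves the 7 characters "much to" plus the terminator in the buffer. An unchecked strcpy in the same spot would keep writing into whatever sits above the buffer on the stack, which may include saved registers and the return address.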

These errors are detected at a different point in the execution of an instruction, because they result from an error during the fetch of an instruction code. These may appear as a text fault, an illegal attempt to access text or code.

How other instructions can go wrong

There are, of course, many more operations in the SPARC instruction set. Most of these deal with data manipulation in registers, for example, adding, testing, or multiplying. These operations do not reference memory. Instead, they require that the data be present in a register first, and the resulting value will also be placed into a register.

Register-only instructions will obviously not cause faults due to illegal memory addresses. The only cases where specific data values might cause a fault or a trap to occur would be divide-by-zero, various floating-point errors, or specific illegal or unexpected trap instructions.

You may also have the odd illegal instruction to deal with, but that's normally due to the program attempting to execute data as if it were code. These are pretty unusual, especially in kernel code. The most frequent errors occur when wild pointers are dereferenced and the machine tries to find data that just isn't there.

Finding trouble

These are the most common instructions that generate traps (faults) and abort the program or panic the system. Now you should be able to recognize the bad instruction and identify the address. Unfortunately, this is only a part of the story; many times you will be forced to backtrack in the code to see why the address in that register is wrong.

Reading assembly code is not a trivial exercise, but with this information and a bit of practice, you can often match the assembly to C source code (if you have it available) and identify the area or even the line of code where the problem occurred.