Outline of an Assembly Language Program | Programming from the Ground Up

Take a look at the program we just entered. At the beginning there are lots of lines that begin with hashes (#). These are comments. Comments are not translated by the assembler. They are used only for the programmer to talk to anyone who looks at the code in the future. Most programs you write will be modified by others. Get into the habit of writing comments in your code that will help them understand both why the program exists and how it works. Always include the following in your comments:

The purpose of the code
An overview of the processing involved
Anything strange your program does and why it does it^[3]
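
As a rough sketch of what such a comment header might look like, here is a minimal program with all three items filled in. The PURPOSE, OVERVIEW, and NOTE labels are just one possible convention chosen for illustration, not a requirement of the assembler.

```asm
#PURPOSE:  Exit immediately and hand a status code
#          back to the Linux kernel.
#
#OVERVIEW: Load the exit system call number and the
#          status code into registers, then signal
#          the kernel.
#
#NOTE:     The .data section below is empty. It is
#          kept only for completeness, since this
#          program stores nothing in memory.
#
 .section .data

 .section .text
 .globl _start
_start:
 movl $1, %eax      # system call number for exit
 movl $0, %ebx      # status code returned to the shell
 int $0x80          # ask the kernel to run the system call
```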

After the comments, the next line says

  .section .data

Anything starting with a period isn't directly translated into a machine instruction. Instead, it's an instruction to the assembler itself. These are called assembler directives or pseudo-operations because they are handled by the assembler and are not actually run by the computer. The .section command breaks your program up into sections. This command starts the data section, where you list any memory storage you will need for data. Our program doesn't use any, so we don't need the section. It's just here for completeness. Almost every program you write in the future will have data.
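
For a rough idea of what a data section with something in it could look like, here is a small sketch. The label answer and the value 42 are made up for illustration, and the .long directive (which reserves a four-byte number) is a preview of material not yet covered at this point. If assembled and run as a 32-bit program, echo $? should print 42.

```asm
 .section .data
answer:                 # hypothetical label naming the storage below
 .long 42               # reserve four bytes holding the number 42

 .section .text
 .globl _start
_start:
 movl $1, %eax          # exit system call number
 movl answer, %ebx      # copy the stored number into %ebx as the status
 int $0x80              # ask the kernel to run the system call
```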

Right after this you have

  .section .text

which starts the text section. The text section of a program is where the program instructions live.

The next instruction is

  .globl _start

This instructs the assembler that _start is important to remember. _start is a symbol, which means that it is going to be replaced by something else either during assembly or linking. Symbols are generally used to mark locations of programs or data, so you can refer to them by name instead of by their location number. Imagine if you had to refer to every memory location by its address. First of all, it would be very confusing because you would have to memorize or look up the numeric memory address of every piece of code or data. In addition, every time you had to insert a piece of data or code you would have to change all the addresses in your program! Symbols are used so that the assembler and linker can take care of keeping track of addresses, and you can concentrate on writing your program.

.globl means that the assembler shouldn't discard this symbol after assembly, because the linker will need it. _start is a special symbol that always needs to be marked with .globl because it marks the location of the start of the program. Without marking this location in this way, when the computer loads your program it won't know where to begin running your program.

The next line

 _start:

defines the value of the _start label. A label is a symbol followed by a colon. Labels define a symbol's value. When the assembler is assembling the program, it has to assign each data value and instruction an address. Labels tell the assembler to make the symbol's value be wherever the next instruction or data element will be. This way, if the actual physical location of the data or instruction changes, you don't have to rewrite any references to it - the symbol automatically gets the new value.
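
To see a label standing in for an address, consider the following sketch. The label finish is hypothetical, and the jmp instruction (which continues execution at the named location) is a preview of a later topic. Notice that nothing in the code mentions a numeric address; the assembler works out where finish is. If you run it, echo $? should print 7.

```asm
 .section .text
 .globl _start
_start:
 movl $7, %ebx      # status code we intend to return
 jmp finish         # continue at the address the label stands for
 movl $99, %ebx     # skipped, so the status stays 7
finish:
 movl $1, %eax      # exit system call number
 int $0x80          # ask the kernel to run the system call
```

If you insert more instructions between the jmp and the label, the jmp still works without any changes, because the symbol automatically takes on the new address.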

Now we get into actual computer instructions. The first such instruction is this:

 movl $1, %eax

When the program runs, this instruction transfers the number 1 into the %eax register. In assembly language, many instructions have operands. movl has two operands - the source and the destination. In this case, the source is the literal number 1, and the destination is the %eax register. Operands can be numbers, memory location references, or registers. Different instructions allow different types of operands. See Appendix B for more information on which instructions take which kinds of operands.

On most instructions which have two operands, the first one is the source operand and the second one is the destination. Note that in these cases, the source operand is not modified at all. Other instructions of this type are, for example, addl, subl, and imull. addl adds the source operand to the destination, subl subtracts the source from the destination, and imull multiplies the destination by the source; in each case the result is saved in the destination operand. Other instructions may have an operand hardcoded in. idivl, for example, requires that the dividend be in %eax, and %edx be zero, and the quotient is then transferred to %eax and the remainder to %edx. However, the divisor can be any register or memory location.
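
The following sketch strings these instructions together so you can follow the source and destination roles. The particular numbers and the choice of registers are arbitrary. The result is handed back as the exit status, so echo $? should print 8 after running it.

```asm
 .section .text
 .globl _start
_start:
 movl $6, %ebx      # %ebx = 6
 addl $4, %ebx      # %ebx = 6 + 4 = 10
 subl $3, %ebx      # %ebx = 10 - 3 = 7
 movl $5, %ecx      # %ecx = 5
 imull %ecx, %ebx   # %ebx = 7 * 5 = 35, %ecx is unchanged

 movl %ebx, %eax    # idivl expects the dividend in %eax
 movl $0, %edx      # and %edx set to zero
 movl $4, %ecx      # the divisor can be any register
 idivl %ecx         # 35 / 4: quotient 8 in %eax, remainder 3 in %edx

 movl %eax, %ebx    # use the quotient as the exit status
 movl $1, %eax      # exit system call number
 int $0x80          # ask the kernel to run the system call
```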

On x86 processors, there are several general-purpose registers^[4] (all of which can be used with movl):

%eax
%ebx
%ecx
%edx
%edi
%esi

In addition to these general-purpose registers, there are also several special-purpose registers, including:

%ebp
%esp
%eip
%eflags

We'll discuss these later; for now, just be aware that they exist.^[5] Some of these registers, like %eip and %eflags, can only be accessed through special instructions. The others can be accessed using the same instructions as general-purpose registers, but they have special meanings, special uses, or are simply faster when used in a specific way.

So, the movl instruction moves the number 1 into %eax. The dollar-sign in front of the one indicates that we want to use immediate mode addressing (refer back to the Section called Data Accessing Methods in Chapter 2). Without the dollar-sign it would do direct addressing, loading whatever number is at address 1. We want the actual number 1 loaded in, so we have to use immediate mode.
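
Here is a small sketch contrasting the two modes. The label five_in_memory is hypothetical, and the .long directive that reserves the storage is a preview of later material. After running it, echo $? should print 8.

```asm
 .section .data
five_in_memory:             # hypothetical label for a stored number
 .long 5

 .section .text
 .globl _start
_start:
 movl $3, %ebx              # immediate mode: the number 3 itself
 addl five_in_memory, %ebx  # direct mode: the number stored at that address
                            # %ebx is now 3 + 5 = 8

 # Without the dollar sign, "movl 3, %ebx" would mean "load whatever
 # is stored at memory address 3", which is not what we want here.

 movl $1, %eax              # exit system call number
 int $0x80                  # ask the kernel to run the system call
```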

The reason we are moving the number 1 into %eax is that we are preparing to call the Linux kernel. The number 1 is the number of the exit system call. We will discuss system calls in more depth soon, but basically they are requests for the operating system's help. Normal programs can't do everything. Many operations such as calling other programs, dealing with files, and exiting have to be handled by the operating system through system calls. When you make a system call, which we will do shortly, the system call number has to be loaded into %eax (for a complete listing of system calls and their numbers, see Appendix C). Depending on the system call, other registers may have to have values in them as well. Note that system calls are not the only use or even the main use of registers. They are just the one we are dealing with in this first program. Later programs will use registers for regular computation.

The operating system, however, usually needs more information than just which call to make. For example, when dealing with files, the operating system needs to know which file you are dealing with, what data you want to write, and other details. The extra details, called parameters, are stored in other registers. In the case of the exit system call, the operating system requires a status code to be loaded in %ebx. This value is then returned to the system. This is the value you retrieved when you typed echo $?. So, we load %ebx with 0 by typing the following:

 movl $0, %ebx

Now, loading registers with these numbers doesn't do anything by itself. Registers are used for all sorts of things besides system calls. They are where all program logic such as addition, subtraction, and comparisons takes place. Linux simply requires that certain registers be loaded with certain parameter values before making a system call. %eax is always required to be loaded with the system call number. For the other registers, however, each system call has different requirements. In the exit system call, %ebx is required to be loaded with the exit status. We will discuss different system calls as they are needed. For a list of common system calls and what is required to be in each register, see Appendix C.

The next instruction is the "magic" one. It looks like this:

  int $0x80

The int stands for interrupt. The 0x80 is the interrupt number to use.^[6] An interrupt interrupts the normal program flow, and transfers control from our program to Linux so that it will do a system call.^[7] You can think of it as like signaling Batman (or Larry-Boy^[8], if you prefer). You need something done, you send the signal, and then he comes to the rescue. You don't care how he does his work - it's more or less magic - and when he's done you're back in control. In this case, all we're doing is asking Linux to terminate the program, in which case we won't be back in control. If we didn't signal the interrupt, then no system call would have been performed.

Quick System Call Review: To recap - Operating System features are accessed through system calls. These are invoked by setting up the registers in a special way and issuing the instruction int $0x80. Linux knows which system call we want to access by what we stored in the %eax register. Each system call has other requirements as to what needs to be stored in the other registers. System call number 1 is the exit system call, which requires the status code to be placed in %ebx.
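
The recap can be sketched as a complete program. The build commands in the comments are assumptions: they presume a Linux machine with the GNU assembler and linker, a source file hypothetically named exit.s, and (on a 64-bit system) a kernel that still accepts 32-bit programs. The status code 3 is an arbitrary choice so that echo $? shows something other than 0.

```asm
# Assumed build steps (the --32 and -m elf_i386 options request
# 32-bit output on a 64-bit host):
#   as --32 exit.s -o exit.o
#   ld -m elf_i386 exit.o -o exit
#   ./exit
#   echo $?
 .section .text
 .globl _start
_start:
 movl $1, %eax      # %eax = 1 selects the exit system call
 movl $3, %ebx      # %ebx = status code; 3 is an arbitrary choice
 int $0x80          # hand control to Linux, which ends the program
```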

Now that you've assembled, linked, run, and examined the program, you should make some basic edits. Do things like change the number that is loaded into %ebx, and watch it come out at the end with echo $?. Don't forget to assemble and link it again before running it. Add some comments. Don't worry, the worst thing that can happen is that the program won't assemble or link, or will freeze your screen. That's just part of learning!

^[3]You'll find that many programs end up doing things in strange ways. Usually there is a reason for that, but, unfortunately, programmers often fail to document such things in their comments. So, future programmers either have to learn the reason the hard way by modifying the code and watching it break, or just leave it alone whether it is still needed or not. You should always document any strange behavior your program performs. Unfortunately, figuring out what is strange and what is straightforward comes mostly with experience.

^[4]Note that on x86 processors, even the general-purpose registers have some special purposes, or had them before the architecture went 32-bit. However, these are general-purpose registers for most instructions. Each of them has at least one instruction where it is used in a special way. However, for most of them, those instructions aren't covered in this book.

^[5]You may be wondering, why do all of these registers begin with the letter e? The reason is that early generations of x86 processors were 16 bits rather than 32 bits. Therefore, the registers were only half the length they are now. In later generations of x86 processors, the size of the registers doubled. They kept the old names to refer to the first half of the register, and added an e to refer to the extended versions of the register. Usually you will only use the extended versions. Newer models also offer a 64-bit mode, which doubles the size of these registers yet again and uses an r prefix to indicate the larger registers (for example, %rax is the 64-bit version of %eax). However, 64-bit mode is not covered in this book.

^[6]You may be wondering why it's 0x80 instead of just 80. The reason is that the number is written in hexadecimal. In hexadecimal, a single digit can hold 16 values instead of the normal 10. This is done by utilizing the letters a through f in addition to the regular digits. a represents 10, b represents 11, and so on. 0x10 represents the number 16, and 0x80 represents the number 128. This will be discussed more in depth later, but just be aware that numbers starting with 0x are in hexadecimal. Tacking on an H at the end is also sometimes used instead, but we won't do that in this book. For more information about this, see Chapter 10.
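
If you would like to check the hexadecimal notation for yourself, a small sketch like the following should do it. The status value 0x10 is chosen only for illustration. After running it, echo $? should print 16, since the shell reports the status in ordinary decimal.

```asm
 .section .text
 .globl _start
_start:
 movl $1, %eax      # exit system call number
 movl $0x10, %ebx   # hexadecimal 10: one 16 plus zero ones, decimal 16
 int $0x80          # 0x80: eight 16s plus zero ones, decimal 128
```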

^[7]Actually, the interrupt transfers control to whoever set up an interrupt handler for the interrupt number. In the case of Linux, all of them are set to be handled by the Linux kernel.

^[8]If you don't watch Veggie Tales, you should. Start with Dave and the Giant Pickle.