21.1. Programming with gcc

The C programming language is by far the most often used in Unix software development. Perhaps this is because the Unix system was originally developed in C; it is the native tongue of Unix. Unix C compilers have traditionally defined the interface standards for other languages and tools, such as linkers, debuggers, and so on. Conventions set forth by the original C compilers have remained fairly consistent across the Unix programming board.

gcc is one of the most versatile and advanced compilers around. Unlike other C compilers (such as those shipped with the original AT&T or BSD distributions, or those available from various third-party vendors), gcc supports all the modern C standards currently in use, such as the ANSI C standard, as well as many extensions specific to gcc. Happily, however, gcc provides features to make it compatible with older C compilers and older styles of C programming. There is even a tool called protoize that can help you write function prototypes for old-style C programs.

gcc is also a C++ compiler. For those who prefer the more modern object-oriented environment, C++ is supported with all the bells and whistles, including most of the C++ features introduced when the C++ standard was released, such as member function templates. Complete C++ class libraries are provided as well, such as the Standard Template Library (STL).

For those with a taste for the particularly esoteric, gcc also supports Objective-C, an object-oriented C spinoff that never gained much popularity but may see a second spring due to its usage in Mac OS X. And there is gcj, which compiles Java code to machine code. But the fun doesn't stop there, as we'll see.

In this section, we cover the use of gcc to compile and link programs under Linux. We assume you are familiar with programming in C/C++, but we don't assume you're accustomed to the Unix programming environment. That's what we introduce here.

The latest gcc version at the time of this writing is Version 4.0. However, this is still quite new, sometimes a bit unstable, and, since it is a lot stricter about syntax than previous versions, will not compile some older code. Many developers therefore use either a version of the 3.3 series (with 3.3.5 being the current one at the time of this writing) or Version 3.4. We suggest sticking with either of those unless you know exactly what you are doing.

A word about terminology ahead: Because gcc can these days compile so much more than C (for example, C++, Java, and some other programming languages), it is considered to be the abbreviation for GNU Compiler Collection. But if you speak about just the C compiler, gcc is taken to mean GNU C Compiler.

21.1.1. Quick Overview

Before imparting all the gritty details of gcc, we present a simple example and walk through the steps of compiling a C program on a Unix system.

Let's say you have the following bit of code, an encore of the much overused "Hello, World!" program (not that it bears repeating):

 #include <stdio.h>
 int main(void) {
   (void)printf("Hello, World!\n");
   return 0; /* Just to be nice */
 }

Several steps are required to compile this program into a living, breathing executable. You can accomplish most of these steps through a single gcc command, but we've left the specifics for later in the chapter.

First, the gcc compiler must generate an object file from this source code. The object file is essentially the machine-code equivalent of the C source. It contains code to set up the main( ) calling stack, a call to the printf( ) function, and code to return the value of 0.

The next step is to link the object file to produce an executable. As you might guess, this is done by the linker. The job of the linker is to take object files, merge them with code from libraries, and spit out an executable. The object code from the previous source does not make a complete executable. First and foremost, the code for printf( ) must be linked in. Also, various initialization routines, invisible to the mortal programmer, must be appended to the executable.

Where does the code for printf( ) come from? Answer: the libraries. It is impossible to talk for long about gcc without mentioning them. A library is essentially a collection of many object files, including an index. When searching for the code for printf( ), the linker looks at the index for each library it's been told to link against. It finds the object file containing the printf( ) function and extracts that object file (the entire object file, which may contain much more than just the printf( ) function) and links it to the executable.
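One way to see the linker's job is to look at the references an object file leaves unresolved. The following is only a sketch: the file name greet.c and the function greet( ) are hypothetical, and the nm command mentioned in the comment is the symbol lister from GNU binutils, whose exact output format varies between platforms.

```c
/* greet.c: hypothetical example file, not part of the original program.
 *
 * After "gcc -c greet.c", running "nm greet.o" should list greet as a
 * defined text symbol (T) and puts as undefined (U). The undefined
 * symbol is the reference the linker later resolves from the C library.
 */
#include <stdio.h>

int greet(void)
{
    /* puts() returns a nonnegative value on success. */
    return puts("Hello, World!");
}
```

The object file on its own cannot run: it knows that it needs puts( ), but not where that code lives until the linker supplies it.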

In reality, things are more complicated than this. Linux supports two kinds of libraries: static and shared. What we have described in this example are static libraries: libraries where the actual code for called subroutines is appended to the executable. However, the code for subroutines such as printf( ) can be quite lengthy. Because many programs use common subroutines from the libraries, it doesn't make sense for each executable to contain its own copy of the library code. That's where shared libraries come in.^[*]

^[*] It should be noted that some very knowledgeable programmers consider shared libraries harmful, for reasons too involved to be explained here. They say that we shouldn't need to bother in a time when most computers ship with 80-GB hard disks and at least 256 MB of memory preinstalled.

With shared libraries, all the common subroutine code is contained in a single library "image file" on disk. When a program is linked with a shared library, stub code is appended to the executable, instead of actual subroutine code. This stub code tells the program loader where to find the library code on disk, in the image file, at runtime. Therefore, when our friendly "Hello, World!" program is executed, the program loader notices that the program has been linked against a shared library. It then finds the shared library image and loads code for library routines, such as printf( ), along with the code for the program itself. The stub code tells the loader where to find the code for printf( ) in the image file.

Even this is an oversimplification of what's really going on. Linux shared libraries use jump tables that allow the libraries to be upgraded and their contents to be jumbled around, without requiring the executables using these libraries to be relinked. The stub code in the executable actually looks up another reference in the library itself, in the jump table. In this way, the library contents and the corresponding jump tables can be changed, but the executable stub code can remain the same.

Shared libraries also have another advantage: their ability to be upgraded. When someone fixes a bug in printf( ) (or worse, a security hole), you only need to upgrade the one library. You don't have to relink every single program on your system.

But don't allow yourself to be befuddled by all this abstract information. In time, we'll approach a real-life example and show you how to compile, link, and debug your programs. It's actually very simple; the gcc compiler takes care of most of the details for you. However, it helps to understand what's going on behind the scenes.

21.1.2. gcc Features

gcc has more features than we could possibly enumerate here. The gcc manual page and Info document give an eyeful of interesting information about this compiler. Later in this section, we give you a comprehensive overview of the most useful gcc features to get you started. With this in hand, you should be able to figure out for yourself how to get the many other facilities to work to your advantage.

For starters, gcc supports the standard C syntax currently in use, specified for the most part by the ANSI C standard. The most important feature of this standard is function prototyping. That is, when defining a function foo( ), which returns an int and takes two arguments, a (of type char *) and b (of type double), the function may be defined like this:

 int foo(char *a, double b) {
   /* your code here... */
 }

This contrasts with the older, nonprototype function definition syntax, which looks like this:

 int foo(a, b)
 char *a;
 double b;
 {
   /* your code here... */
 }

and is also supported by gcc. Of course, ANSI C defines many other conventions, but this is the one most obvious to the new programmer. Anyone familiar with C programming style in modern books, such as the second edition of Kernighan and Ritchie's The C Programming Language (Prentice Hall), can program using gcc with no problem.
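To make the benefit of prototypes concrete, here is a minimal sketch. The function names scale_length and demo_prototype are our own invention for illustration. Because a prototype is in scope at the call site, the compiler knows the second parameter is a double and converts the integer argument accordingly; with an old-style declaration no such conversion would take place.

```c
#include <string.h>

/* ANSI prototype: tells the compiler the type of each parameter. */
double scale_length(const char *s, double factor);

/* Hypothetical helper: length of a string multiplied by a factor. */
double scale_length(const char *s, double factor)
{
    return (double)strlen(s) * factor;
}

double demo_prototype(void)
{
    /* The int argument 2 is converted to the double 2.0 because the
     * prototype above is visible here. */
    return scale_length("gcc", 2);
}
```

With the prototype in place, a call that passes the wrong number of arguments is also rejected at compile time instead of misbehaving at runtime.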

The gcc compiler boasts quite an impressive optimizer. Whereas most C compilers allow you to use the single switch -O to specify optimization, gcc supports multiple levels of optimization. When optimizing, gcc pulls tricks out of its sleeve, such as merging identical constants. That is, if the same string, such as Hello, World!, appears in several places in your program, gcc can store a single copy of the string and let every use share the same storage. How clever is that!

Of course, gcc allows you to compile debugging information into object files, which aids a debugger (and hence, the programmer) in tracing through the program. The compiler inserts markers in the object file, allowing the debugger to locate specific lines, variables, and functions in the compiled program. Therefore, when using a debugger such as gdb (which we talk about later in the chapter), you can step through the compiled program and view the original source text simultaneously.

Among the other tricks gcc offers is the ability to generate assembly code with the flick of a switch (literally: the -S switch). Instead of telling gcc to compile your source to machine code, you can ask it to stop at the assembly-language level, which is much easier for humans to comprehend. This happens to be a nice way to learn the intricacies of protected-mode assembly programming under Linux: write some C code, have gcc translate it into assembly language for you, and study that.

The GNU toolchain includes its own assembler, called gas, which can be used independently of gcc, just in case you're wondering how this assembly-language code might get assembled. (On Linux the binary is often just called as, since there cannot be confusion with other assemblers, as there can be on other Unix operating systems such as Solaris.) In fact, you can include inline assembly code in your C source, in case you need to invoke some particularly nasty magic but don't want to write exclusively in assembly.
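As a small experiment, consider the sketch below. The file name sum.c and the function sum_to( ) are hypothetical; the -S switch named in the comment is gcc's standard option for stopping after the assembly stage.

```c
/* sum.c: a tiny function whose generated assembly is easy to read.
 *
 *   gcc -S sum.c        writes sum.s and stops before assembling
 *   gcc -S -O2 sum.c    the same, with optimization enabled
 *
 * Comparing the two versions of sum.s shows what the optimizer did.
 */
int sum_to(int n)
{
    int total = 0;
    int i;

    /* Add up the integers 1 through n. */
    for (i = 1; i <= n; i++)
        total += i;
    return total;
}
```

The resulting sum.s is plain text, so you can read it in any editor.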
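For a taste of inline assembly, the following sketch adds two integers with a single x86 instruction. The function name add_ints is ours, and the example assumes an x86 or x86-64 target with the default AT&T assembler syntax; on any other architecture it quietly falls back to plain C so that it still compiles.

```c
/* Add two ints, using gcc's extended asm syntax where possible. */
int add_ints(int a, int b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* "+r"(a): a is read and written, kept in a register (operand %0).
     * "r"(b):  b is an input, kept in a register (operand %1).
     * "cc":    the instruction modifies the condition codes. */
    __asm__ ("addl %1, %0" : "+r"(a) : "r"(b) : "cc");
    return a;
#else
    /* Portable fallback for other compilers and architectures. */
    return a + b;
#endif
}
```

The operand constraints are what let the compiler choose registers for you, which is the main difference between gcc's inline assembly and a separate assembly source file.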

21.1.3. Basic gcc Usage

By now, you must be itching to know how to invoke all these wonderful features. It is important, especially to novice Unix and C programmers, to know how to use gcc effectively. Using a command-line compiler such as gcc is quite different from, say, using an integrated development environment (IDE) such as Visual Studio or C++ Builder under Windows. Even though the language syntax is similar, the methods used to compile and link programs are not at all the same.

A number of IDEs are available for Linux now. These include the popular open source IDE KDevelop, discussed later in this chapter. For Java, Eclipse (http://www.eclipse.org) is the leading choice among programmers who like IDEs.

Let's return to our innocent-looking "Hello, World!" example. How would you go about compiling and linking this program?

The first step, of course, is to enter the source code. You accomplish this with a text editor, such as Emacs or vi. The would-be programmer should enter the source code and save it in a file named something like hello.c. (As with most C compilers, gcc is picky about the filename extension: that is how it distinguishes C source from assembly source from object files, and so on. Use the .c extension for standard C source.)

To compile and link the program to the executable hello, the programmer would use the command:

 papaya$ gcc -o hello hello.c

and (barring any errors), in one fell swoop, gcc compiles the source into an object file, links against the appropriate libraries, and spits out the executable hello, ready to run. In fact, the wary programmer might want to test it:

 papaya$ ./hello
 Hello, World!
 papaya$

As friendly as can be expected.

Obviously, quite a few things took place behind the scenes when executing this single gcc command. First of all, gcc had to compile your source file, hello.c, into an object file, hello.o. Next, it had to link hello.o against the standard libraries and produce an executable.

By default, gcc assumes that you want not only to compile the source files you specify, but also to have them linked together (with each other and with the standard libraries) to produce an executable. First, gcc compiles any source files into object files. Next, it automatically invokes the linker to glue all the object files and libraries into an executable. (That's right, the linker is a separate program, called ld, not part of gcc itself, although it can be said that gcc and ld are close friends.) gcc also knows about the standard libraries used by most programs and tells ld to link against them. You can, of course, override these defaults in various ways.

You can pass multiple filenames in one gcc command, but on large projects you'll find it more natural to compile a few files at a time and keep the .o object files around. If you want only to compile a source file into an object file and forgo the linking process, use the -c switch with gcc, as in the following example:

 papaya$ gcc -c hello.c

This produces the object file hello.o and nothing else.

By default, the linker produces an executable named, of all things, a.out. This is just a bit of left-over gunk from early implementations of Unix, and nothing to write home about. By using the -o switch with gcc, you can force the resulting executable to be named something different, in this case, hello.

21.1.4. Using Multiple Source Files

The next step on your path to gcc enlightenment is to understand how to compile programs using multiple source files. Let's say you have a program consisting of two source files, foo.c and bar.c. Naturally, you would use one or more header files (such as foo.h) containing function declarations shared between the two source files. In this way, code in foo.c knows about functions in bar.c, and vice versa.

To compile these two source files and link them together (along with the libraries, of course) to produce the executable baz, you'd use the command:

 papaya$ gcc -o baz foo.c bar.c

This is roughly equivalent to the following three commands:

 papaya$ gcc -c foo.c
 papaya$ gcc -c bar.c
 papaya$ gcc -o baz foo.o bar.o

gcc acts as a nice frontend to the linker and other "hidden" utilities invoked during compilation.
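To picture how such a project might be laid out, here is a minimal sketch. The function names are hypothetical, and the three files are shown in one listing only for brevity; in a real project each part would be its own file, and foo.c and bar.c would each begin with #include "foo.h".

```c
/* ---- foo.h: declarations shared by foo.c and bar.c ---- */
#ifndef FOO_H
#define FOO_H
int foo_twice_plus_one(int x);  /* defined in foo.c */
int bar_add_one(int x);         /* defined in bar.c */
#endif /* FOO_H */

/* ---- foo.c ---- */
int foo_twice_plus_one(int x)
{
    /* Calls a function that lives in the other source file; the
     * declaration from foo.h is all the compiler needs to see. */
    return bar_add_one(2 * x);
}

/* ---- bar.c ---- */
int bar_add_one(int x)
{
    return x + 1;
}
```

When foo.c is compiled on its own, the call to bar_add_one( ) is left as an unresolved reference in foo.o, and the final link step is what connects it to the definition in bar.o.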

Of course, compiling a program using multiple source files in one command can be time-consuming. If you had, say, five or more source files in your program, the gcc command in the previous example would recompile each source file in turn before linking the executable. This can be a large waste of time, especially if you only made modifications to a single source file since the last compilation. There would be no reason to recompile the other source files, as their up-to-date object files are still intact.

The answer to this problem is to use a project manager such as make. We talk about make later in the chapter, in "Makefiles."

21.1.5. Optimizing

Telling gcc to optimize your code as it compiles is a simple matter; just use the -O switch on the gcc command line:

 papaya$ gcc -O -o fishsticks fishsticks.c

As we mentioned not long ago, gcc supports different levels of optimization. Using -O2 instead of -O will turn on several "expensive" optimizations that may cause compilation to run more slowly but will (hopefully) greatly enhance performance of your code.

You may notice in your dealings with Linux that a number of programs are compiled using the switch -O6. gcc does not define optimization levels that high: the highest level is -O3, and any larger number is presently treated as -O3. However, -O6 is sometimes used for compatibility with future versions of gcc to ensure that the greatest level of optimization is used.
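If you are ever unsure whether a given file was compiled with optimization, gcc can tell you: it predefines the macro __OPTIMIZE__ in all optimizing compilations. The function name below is our own; the sketch simply reports which way the file was built.

```c
/* Returns 1 if this file was compiled with optimization turned on
 * (for example with -O or -O2), and 0 otherwise. */
int built_with_optimization(void)
{
#ifdef __OPTIMIZE__
    return 1;
#else
    return 0;
#endif
}
```

A check like this is occasionally handy in build diagnostics, but program logic should not normally depend on the optimization level.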

21.1.6. Enabling Debugging Code

The -g switch to gcc turns on debugging code in your compiled object files. That is, extra information is added to the object file, as well as the resulting executable, allowing the program to be traced with a debugger such as gdb. The downside to using debugging code is that it greatly increases the size of the resulting object files. It's usually best to use -g only while developing and testing your programs and to leave it out for the "final" compilation.

Happily, debug-enabled code is not incompatible with code optimization. This means that you can safely use the command:

 papaya$ gcc -O -g -o mumble mumble.c

However, certain optimizations enabled by -O or -O2 may cause the program to appear to behave erratically while under a debugger. It is usually best to use either -O or -g, not both.

21.1.7. More Fun with Libraries

Before we leave the realm of gcc, a few words on linking and libraries are in order. For one thing, it's easy for you to create your own libraries. If you have a set of routines you use often, you may wish to group them into a set of source files, compile each source file into an object file, and then create a library from the object files. This saves you from having to compile these routines individually for each program in which you use them.

Let's say you have a set of source files containing oft-used routines, such as:

 float square(float x) {
   /* Code for square(  )... */
 }
 int factorial(int x, int n) {
   /* Code for factorial(  )... */
 }

and so on (of course, the gcc standard libraries provide analogs to these common routines, so don't be misled by our choice of example). Furthermore, let's say that the code for square( ), which both takes and returns a float, is in the file square.c and that the code for factorial( ) is in factorial.c. Simple enough, right?

To produce a library containing these routines, all you do is compile each source file, as so:

 papaya$ gcc -c square.c factorial.c

which leaves you with square.o and factorial.o. Next, create a library from the object files. As it turns out, a library is just an archive file created using ar (a close counterpart to tar). Let's call our library libstuff.a and create it this way:

 papaya$ ar r libstuff.a square.o factorial.o

When updating a library such as this, you may need to delete the old libstuff.a, if it exists. The last step is to generate an index for the library, which enables the linker to find routines within the library. To do this, use the ranlib command, as so:

 papaya$ ranlib libstuff.a

This command adds information to the library itself; no separate index file is created. You could also combine the two steps of running ar and ranlib by using the s command to ar:

 papaya$ ar rs libstuff.a square.o factorial.o

Now you have libstuff.a, a static library containing your routines. Before you can link programs against it, you'll need to create a header file describing the contents of the library. For example, we could create libstuff.h with the contents:

 /* libstuff.h: routines in libstuff.a */
 extern float square(float);
 extern int factorial(int, int);

Every source file that uses routines from libstuff.a should contain an #include "libstuff.h" line, as you would do with standard header files.
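A slightly more defensive version of the header might look like the following sketch. The include guard name LIBSTUFF_H and the parameter names are our own choices, not part of the example above. The guard keeps the declarations from being processed twice if the header is included more than once, and the extern "C" wrapper only matters if the header is ever included from C++ source.

```c
/* libstuff.h: routines in libstuff.a */
#ifndef LIBSTUFF_H
#define LIBSTUFF_H

#ifdef __cplusplus
extern "C" {
#endif

float square(float x);
int factorial(int x, int n);

#ifdef __cplusplus
}
#endif

#endif /* LIBSTUFF_H */
```

The header contains declarations only; the code itself still comes from libstuff.a at link time.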

Now that we have our library and header file, how do we compile programs to use them? First, we need to put the library and header file someplace where the compiler can find them. Many users place personal libraries in the directory lib in their home directory, and personal include files under include. Assuming we have done so, we can compile the mythical program wibble.c using the following command:

 papaya$ gcc -I../include -L../lib -o wibble wibble.c -lstuff

The -I option tells gcc to add the directory ../include to the include path it uses to search for include files. -L is similar, in that it tells gcc to add the directory ../lib to the library path.

The last argument on the command line is -lstuff, which tells the linker to link against the library libstuff.a (wherever it may be along the library path). The lib at the beginning of the filename is assumed for libraries.

Any time you wish to link against libraries other than the standard ones, you should use the -l switch on the gcc command line. For example, if you wish to use math routines (specified in math.h), you should add -lm to the end of the gcc command, which links against libm. Note, however, that the order of -l options is significant. For example, if our libstuff library used routines found in libm, you must include -lm after -lstuff on the command line:

 papaya$ gcc -I../include -L../lib -o wibble wibble.c -lstuff -lm

This forces the linker to link libm after libstuff, allowing those unresolved references in libstuff to be taken care of.

Where does gcc look for libraries? By default, libraries are searched for in a number of locations, the most important of which is /usr/lib. If you take a glance at the contents of /usr/lib, you'll notice it contains many library files, some of which have filenames ending in .a, others with filenames ending in .so.version. The .a files are static libraries, as is the case with our libstuff.a. The .so files are shared libraries, which contain code to be linked at runtime, as well as the stub code required for the runtime linker (ld.so) to locate the shared library.

At runtime, the program loader looks for shared library images in several places, including /lib. If you look at /lib, you'll see files such as libc.so.6. This is the image file containing the code for the libc shared library (one of the standard libraries, which most programs are linked against).

By default, the linker attempts to link against shared libraries. However, static libraries are used in several cases, for example, when there are no shared libraries with the specified name anywhere in the library search path. You can also specify that static libraries should be linked by using the -static switch with gcc.

21.1.7.1. Creating shared libraries

Now that you know how to create and use static libraries, it's very easy to take the step to shared libraries. Shared libraries have a number of advantages. They reduce memory consumption if used by more than one process, and they reduce the size of the executable. Furthermore, they make developing easier: when you use shared libraries and change some things in a library, you do not need to recompile and relink your application each time. You need to recompile only if you make incompatible changes, such as adding arguments to a call or changing the size of a struct.

Before you start doing all your development work with shared libraries, though, be warned that debugging with them is slightly more difficult than with static libraries because the debugger usually used on Linux, gdb, has some problems with shared libraries.

Code that goes into a shared library needs to be position-independent. This is just a convention for object code that makes it possible to use the code in shared libraries. You make gcc emit position-independent code by passing it one of the command-line switches -fpic or -fPIC. The former is preferred, unless the modules have grown so large that the relocatable code table is simply too small, in which case the compiler will emit an error message and you have to use -fPIC. To repeat our example from the last section:

 papaya$ gcc -c -fpic square.c factorial.c

This being done, it is just a simple step to generate a shared library:^[*]

^[*] In the ancient days of Linux, creating a shared library was a daunting task of which even wizards were afraid. The advent of the ELF object-file format reduced this task to picking the right compiler switch. Things sure have improved!

 papaya$ gcc -shared -o libstuff.so square.o factorial.o

Note the compiler switch -shared. There is no indexing step as with static libraries.

Using our newly created shared library is even simpler. The shared library doesn't require any change to the compile command:

 papaya$ gcc -I../include -L../lib -o wibble wibble.c -lstuff -lm

You might wonder what the linker does if a shared library libstuff.so and a static library libstuff.a are available. In this case, the linker always picks the shared library. To make it use the static one, you will have to name it explicitly on the command line, including its path, because the -L search path applies only to libraries given with -l:

 papaya$ gcc -I../include -L../lib -o wibble wibble.c ../lib/libstuff.a -lm

Another very useful tool for working with shared libraries is ldd. It tells you which shared libraries an executable program uses. Here's an example:

 papaya$ ldd wibble
         linux-gate.so.1 =>  (0xffffe000)
         libstuff.so => libstuff.so (0x400af000)
         libm.so.5 => /lib/libm.so.5 (0x400ba000)
         libc.so.5 => /lib/libc.so.5 (0x400c3000)

The three fields in each line are the name of the library, the full path to the instance of the library that is used, and where in the virtual address space the library is mapped to. The first line is something arcane, part of the Linux loader implementation that you can happily ignore.

If ldd outputs not found for a certain library, you are in trouble and won't be able to run the program in question. You will have to search for a copy of that library. Perhaps it is a library shipped with your distribution that you opted not to install, or it is already on your hard disk but the loader (the part of the system that loads every executable program) cannot find it.

In the latter situation, try locating the libraries yourself and find out whether they're in a nonstandard directory. By default, the loader looks only in /lib and /usr/lib. If you have libraries in another directory, create an environment variable LD_LIBRARY_PATH and add the directories separated by colons. If you believe that everything is set up correctly, and the library in question still cannot be found, run the command ldconfig as root, which refreshes the linker system cache.

21.1.8. Using C++

If you prefer object-oriented programming, gcc provides complete support for C++ as well as Objective-C. There are only a few considerations you need to be aware of when doing C++ programming with gcc.

First, C++ source filenames should end in the extension .cpp (most often used), .C, or .cc. This distinguishes them from regular C source filenames, which end in .c. It is actually possible to tell gcc to compile even files ending in .c as C++ files, by using the command-line parameter -x c++, but that is not recommended, as it is likely to confuse you.

Second, you should use the g++ command in lieu of gcc when compiling C++ code. g++ is simply a front end that invokes the compiler with a number of additional arguments, specifying a link against the C++ standard libraries, for example. g++ takes the same arguments and options as gcc.

If you do not use g++, you'll need to be sure to link against the C++ libraries in order to use any of the basic C++ classes, such as the cout and cin I/O objects. Also be sure you have actually installed the C++ libraries and include files. Some distributions contain only the standard C libraries. gcc will be able to compile your C++ programs fine, but without the C++ libraries, you'll end up with linker errors whenever you attempt to use standard objects.