Understanding Compiled Languages

The SimCom virtual computer is difficult to program. You have two options for specifying an instruction: You can enter a byte value, or you can specify an opcode and an address. You've probably found that specifying an opcode and an address is a much better approach, but it's still not very intuitive.

The language of programming with opcodes is known as assembly language. Every line of an assembly language program roughly corresponds to a computer instruction in memory. Real computers have much more memory than SimCom, and assembly programs that make real machines do something useful can be quite long. Such programs are created using a text editor. The resulting file is known as source code, and it must be translated into the appropriate binary values before it can be executed by a computer.

Conceivably, this translation could be done by people. In fact, in the very earliest days of programming, that's how it was done. However, computers can do a much better job of it. Any program that translates assembly code into binary machine code is called an assembler. Figure 2.1 shows the flow from assembly language source code to executable computer code.

Figure 2.1: Assembly language
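
To make the assembler's job concrete, here is a minimal sketch of one written in Java. Everything specific in it is an assumption made for illustration: the mnemonics (LOAD, STORE, ADD, HALT), their opcode numbers, and the layout of an instruction byte (upper 3 bits for the opcode, lower 5 bits for the address) are invented here and are not SimCom's actual encoding.

```java
import java.util.Map;

public class ToyAssembler {
    // Hypothetical mnemonics and opcode numbers, invented for this sketch.
    static final Map<String, Integer> OPCODES =
            Map.of("LOAD", 0, "STORE", 1, "ADD", 2, "HALT", 7);

    // Assumed layout: upper 3 bits hold the opcode, lower 5 bits hold the address.
    static int assembleLine(String line) {
        String[] parts = line.trim().split("\\s+");
        Integer opcode = OPCODES.get(parts[0].toUpperCase());
        if (opcode == null) {
            throw new IllegalArgumentException("Unknown mnemonic: " + parts[0]);
        }
        // An instruction with no operand (such as HALT) gets address 0.
        int address = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
        if (address < 0 || address > 31) {
            throw new IllegalArgumentException("Address out of range: " + address);
        }
        return (opcode << 5) | address;
    }

    public static void main(String[] args) {
        String[] source = { "LOAD 10", "ADD 11", "STORE 12", "HALT" };
        for (String line : source) {
            int value = assembleLine(line);
            // Pad the binary string to 8 digits so each instruction prints as one byte.
            String bits = String.format("%8s", Integer.toBinaryString(value)).replace(' ', '0');
            System.out.println(line + " -> " + bits);
        }
    }
}
```

Under these assumptions, the line ADD 11 becomes the byte 01001011: opcode 2 in the upper three bits and address 11 in the lower five. A real assembler also handles labels, comments, and a much larger instruction set, but the core idea is the same table lookup and bit packing.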

After a program has been assembled, it must be loaded into memory and the computer must be told to execute it. This is the job of the operating system.

Assembly programming has many shortcomings, all of which result from being too close to the underlying architecture. When you are forced to think in terms of the interrelationships among hardware components, it is difficult to also consider the domain of the problem you are trying to solve. For example, if you want to write a program to model weather patterns, you will be better off thinking about air currents and water vapor, not about opcodes and registers. To do that, you need a compiled language.

Compiled vs. Assembly Languages

A compiled language is like assembly language in the sense that a source program is created using a text editor, and the source must be translated into computer binary. The difference is that, unlike assembly code, a line of source code generally does not correspond to a single instruction. In fact, one great benefit of compiled languages is that you don't need to know anything at all about the underlying hardware.

Figure 2.2 shows the flow from compiled language source code to executable computer code.

Figure 2.2: Compiled language

Each type of computer (Pentium or SPARC, for example) has its own instruction set and architecture, and hence its own assembly language. However, a program written in a compiled language can run on any target machine, provided there's a compiler that can translate it into the target machine's instruction set. For example, there are compilers that translate C++ into Pentium code, and other C++ compilers that produce code for SPARC processors.

Software can be developed much more efficiently with a compiled language than with assembly language. Moreover, in theory a company only needs to develop one version of a software product. When the product is finished, one compiler can be used to produce PC code, another compiler can be used to produce Macintosh code, and so on.

That's the theory. In practice it doesn't work so well. Certainly compiled languages are phenomenally more efficient for development than assembly languages. However, the ideal of developing once and compiling many times is just an ideal. There are differences among target computers that should be negligible, but are in fact significant. Source code that runs flawlessly on one platform may require considerable tweaking to run on a different platform. Multiple versions of source code have to be maintained. The process can get extremely expensive.

The Java Virtual Machine

Java is both compiled and interpreted. The compiler does not generate code that is specific to any particular processor. Instead, it generates code for an imaginary processor: a virtual machine. The compiler does almost all the work. It checks the source for syntactic correctness, analyzes its structure, and breaks it down into elementary units. It does everything except create code that can be run by a computer that exists in the physical world. The Java compiler's output is called bytecode, which is the binary format that is understood by the Java Virtual Machine, or JVM.

The JVM is a program that executes bytecode instructions. Like SimCom in Chapter 1, the JVM is usually implemented in software rather than built from circuit components. The JVM itself runs on physical hardware, so there is one version for Windows platforms, one for SPARC platforms, one for Mac platforms, and so on.
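
The idea of a program that executes instructions for an imaginary processor can be sketched in a few lines of Java. The toy machine below is only an illustration: its four instruction codes (PUSH, ADD, MUL, HALT) are invented for this example and are not real JVM bytecodes, and the real JVM is vastly more elaborate. What the sketch does share with the JVM is the basic loop: fetch the next instruction, decide what it means, carry it out, and repeat.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ToyVm {
    // Hypothetical instruction codes for this sketch; not real JVM opcodes.
    static final int HALT = 0; // stop and return the value on top of the stack
    static final int PUSH = 1; // push the value that follows in the program
    static final int ADD = 2;  // pop two values, push their sum
    static final int MUL = 3;  // pop two values, push their product

    static int run(int[] program) {
        Deque<Integer> stack = new ArrayDeque<>();
        int pc = 0; // program counter: index of the next instruction to fetch
        while (true) {
            int instruction = program[pc++];
            switch (instruction) {
                case PUSH:
                    stack.push(program[pc++]);
                    break;
                case ADD:
                    stack.push(stack.pop() + stack.pop());
                    break;
                case MUL:
                    stack.push(stack.pop() * stack.pop());
                    break;
                case HALT:
                    return stack.pop();
                default:
                    throw new IllegalStateException("Bad instruction: " + instruction);
            }
        }
    }

    public static void main(String[] args) {
        // Computes (2 + 3) * 4 on the toy machine.
        int[] program = { PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT };
        System.out.println(run(program)); // prints 20
    }
}
```

Because ToyVm is itself ordinary software, the same int[] program produces the same answer on any computer that can run ToyVm. That is the property the JVM provides for bytecode.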

When you run a Java application, you are really running the JVM, which in turn loads and executes the bytecode for your application. All JVMs for all platforms execute bytecode in the same way. This means that with Java, you do not have to maintain different versions of source code for different platforms. One of the Java slogans is, "Write once, run anywhere." And it works. With Java, a program has exactly one version of source code. The result of compiling the source—the bytecode—will run on any platform for which a JVM is available.
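
One way to see this for yourself is a small program that reports where it is running. The class name WhereAmI and the method describePlatform are made up for this sketch; the three system properties it reads (os.name, os.arch, and java.vm.name) are standard ones that every JVM supplies.

```java
public class WhereAmI {
    // Builds a one-line description of the platform and JVM running this code.
    static String describePlatform() {
        return System.getProperty("os.name")
                + " / " + System.getProperty("os.arch")
                + " / " + System.getProperty("java.vm.name");
    }

    public static void main(String[] args) {
        System.out.println(describePlatform());
    }
}
```

If you compile this once and copy the resulting class file to machines with different operating systems, the same bytecode should run unchanged on each one, and only the printed description will differ.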

Figure 2.3 shows the evolution of a Java application from source code through execution.

Figure 2.3: Evolution of a Java application

In Figure 2.3, the source code is the file GreatStuff.java. The name of every Java source file must end with .java; otherwise the compiler will reject it. The compiler produces one or more files of bytecode output. The bytecode files, also known as class files, always end with .class. To run a Java program, you type java classname, where classname is the name of the class that contains the starting point of the program. Note that here you omit the .class suffix. The java command starts the JVM, which reads and executes the bytecode in the class file.
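
As an illustration, here is what a very small GreatStuff.java might contain. Only the file name comes from Figure 2.3; the body of the class, including the greeting method and its message, is a made-up example.

```java
// File: GreatStuff.java
public class GreatStuff {
    // Hypothetical helper, included only to give the program something to do.
    static String greeting() {
        return "Great stuff!";
    }

    // The JVM begins execution at main, the starting point of the program.
    public static void main(String[] args) {
        System.out.println(greeting());
    }
}
```

With a standard Java development kit installed, typing javac GreatStuff.java compiles the source and writes GreatStuff.class. Typing java GreatStuff (with no .class suffix) then starts the JVM, which loads that class file and runs it.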

Now that you've seen how the Java compiler and Virtual Machine fit into the big picture, it's time to get acquainted with a fundamental concept of Java programming: data types.