5. | Bug Patterns In Java

Specification as Monolithic Treatise

The traditional method of software engineering is to develop a thorough specification of the system's functionality before entering a single line of code. This specification is made as formal as possible, so as to minimize ambiguities. The programmers then slog through the various details of this specification (often a large book) as they implement the system.

This method of specification was adapted from other engineering disciplines, where it can be extremely costly to make any changes to a specification after deployment begins. Microprocessor design is one of these disciplines. Currently, the specifications of microprocessors are interpreted and analyzed automatically. In fact, many aspects of a microprocessor design can be proven sound by unaided machines. But such techniques would be impossible if the specification weren't formalized.

In the software arena, where changes to a specification after deployment aren't nearly as costly, it's natural to question whether this style of up-front, formal specification is so useful. To consider this question, let's first examine how well that specification style works for a particular type of software artifact: a programming language.

Among software systems, programming languages are most similar to microprocessors in terms of the cost of modifying a specification. The cost of making even minor modifications to a language design after people have begun using it can be especially high. All the existing programs written in that language will have to be modified and recompiled.

As we might expect, the specifications of programming languages, compared with other software systems, are often quite formal, especially in the case of syntax. Virtually all modern programming languages have a formally specified syntax. Most parsers are constructed through the use of automatic parser-generators that read in such grammars and produce full-fledged parsers as output.

Tip

The specifications of virtually all modern programming languages contain a formally specified syntax.

What about language semantics? Let's take a look at the following four languages, all either currently or formerly popular, and examine the relative degree of precision in the semantic specification for each:

C++
Python
ML
Pascal

For each language, let's look at how the degree of precision in the specification has helped determine its effectiveness.

C++

The C++ language specification leaves many parts of the specification implementation dependent, and even declines to define the behavior of many valid C++ programs. Although the designers of C++ would claim that the programs for which C++ semantics is undefined are not valid C++ programs, it is impossible in principle for a machine to determine automatically whether a program is valid under this criteria, implying that many (most?) real world software applications written in C++ are not valid C++ programs.

The result is that many C++ programs don't behave as intended when ported from one platform to another.

Python

The Python Language Reference is an informal language specification that leaves many details implementation dependent. In this case, the decision not to use a formal specification was made deliberately, with full awareness of the formalisms available for language semantics. In the words of Guido van Rossum, Python's inventor:

While I am trying to be as precise as possible, I chose to use English rather than formal specifications for everything except syntax and lexical analysis. This should make the document more understandable to the average reader, but will leave room for ambiguities. Consequently, if you were coming from Mars and tried to re-implement Python from this document alone, you might have to guess things and in fact you would probably end up implementing quite a different language. On the other hand, if you are using Python and wonder what the precise rules about a particular area of the language are, you should definitely be able to find them here.

But the ambiguities of the English language aren't just a problem for Martians. Various implementations of Python, such as JPython and CPython, have faced a formidable challenge in providing compatible behavior across platforms. This problem would be much worse if it weren't for the relative simplicity and elegance of the Python language.

ML

The ML specification formally defines the full operational semantics of the language. Consequently, ML programmers enjoy an unprecedented level of precision and cross-platform standardization. The formal specification of ML has even allowed computer scientists to discover subtle inconsistencies in the ML type system, and correct them. Such inconsistencies no doubt exist in many other languages, but they are difficult to find without a formal specification.

Pascal

Pascal is one language that suffered for quite a while with an inconsistency in its specification: the rules for determining type equivalence were left unspecified. For example, consider the following two Pascal types:

 type complex = record                   left : integer;                   right : integer;                 end; type coordinate = record                       left : integer;                       right : integer;                     end;

Are these two types identical? Clearly, they contain the same types of subcomponents. Depending on how we define the language, a value of type coordinate may be passed to a procedure that takes an argument of type complex, and vice versa. If so, our language would be said to use structural equivalence when identifying types. Alternatively, we might define two types as equivalent if and only if they have the same name. Then our language would use name equivalence.

What choice is made in the Pascal language specification? Originally, no choice was made at all; nobody had yet realized that there was more than one way to define type equivalence! As a result, each implementation team for Pascal had to make this choice on its own (and, often, those teams didn't realize they were making a choice either). But the result of this ambiguity is that Pascal code written for one implementation can behave in a drastically different fashion on others.