Specification as Monolithic TreatiseThe traditional method of software engineering is to develop a thorough specification of the system's functionality before entering a single line of code. This specification is made as formal as possible, so as to minimize ambiguities. The programmers then slog through the various details of this specification (often a large book) as they implement the system. This method of specification was adapted from other engineering disciplines, where it can be extremely costly to make any changes to a specification after deployment begins. Microprocessor design is one of these disciplines. Currently, the specifications of microprocessors are interpreted and analyzed automatically. In fact, many aspects of a microprocessor design can be proven sound by unaided machines. But such techniques would be impossible if the specification weren't formalized. In the software arena, where changes to a specification after deployment aren't nearly as costly, it's natural to question whether this style of up-front, formal specification is so useful. To consider this question, let's first examine how well that specification style works for a particular type of software artifact: a programming language. Among software systems, programming languages are most similar to microprocessors in terms of the cost of modifying a specification. The cost of making even minor modifications to a language design after people have begun using it can be especially high. All the existing programs written in that language will have to be modified and recompiled. As we might expect, the specifications of programming languages, compared with other software systems, are often quite formal, especially in the case of syntax. Virtually all modern programming languages have a formally specified syntax. Most parsers are constructed through the use of automatic parser-generators that read in such grammars and produce full-fledged parsers as output.
What about language semantics? Let's take a look at the following four languages, all either currently or formerly popular, and examine the relative degree of precision in the semantic specification for each:
For each language, let's look at how the degree of precision in the specification has helped determine its effectiveness.
C++The C++ language specification leaves many parts of the specification implementation dependent, and even declines to define the behavior of many valid C++ programs. Although the designers of C++ would claim that the programs for which C++ semantics is undefined are not valid C++ programs, it is impossible in principle for a machine to determine automatically whether a program is valid under this criteria, implying that many (most?) real world software applications written in C++ are not valid C++ programs. The result is that many C++ programs don't behave as intended when ported from one platform to another.
PythonThe Python Language Reference is an informal language specification that leaves many details implementation dependent. In this case, the decision not to use a formal specification was made deliberately, with full awareness of the formalisms available for language semantics. In the words of Guido van Rossum, Python's inventor:
But the ambiguities of the English language aren't just a problem for Martians. Various implementations of Python, such as JPython and CPython, have faced a formidable challenge in providing compatible behavior across platforms. This problem would be much worse if it weren't for the relative simplicity and elegance of the Python language.
MLThe ML specification formally defines the full operational semantics of the language. Consequently, ML programmers enjoy an unprecedented level of precision and cross-platform standardization. The formal specification of ML has even allowed computer scientists to discover subtle inconsistencies in the ML type system, and correct them. Such inconsistencies no doubt exist in many other languages, but they are difficult to find without a formal specification.
PascalPascal is one language that suffered for quite a while with an inconsistency in its specification: the rules for determining type equivalence were left unspecified. For example, consider the following two Pascal types:
type complex = record left : integer; right : integer; end; type coordinate = record left : integer; right : integer; end; Are these two types identical? Clearly, they contain the same types of subcomponents. Depending on how we define the language, a value of type coordinate may be passed to a procedure that takes an argument of type complex, and vice versa. If so, our language would be said to use structural equivalence when identifying types. Alternatively, we might define two types as equivalent if and only if they have the same name. Then our language would use name equivalence. What choice is made in the Pascal language specification? Originally, no choice was made at all; nobody had yet realized that there was more than one way to define type equivalence! As a result, each implementation team for Pascal had to make this choice on its own (and, often, those teams didn't realize they were making a choice either). But the result of this ambiguity is that Pascal code written for one implementation can behave in a drastically different fashion on others. |