60. | Bug Patterns In Java

About This Bug Pattern

Many programs need to intensively access and manipulate internally stored data to perform various complex tasks. This data might be retrieved from a large data structure in memory, from a database, or over a network.

This type of program is highly susceptible to a crash caused by corrupt internal data. I call this bug pattern the Saboteur Data pattern because such data can stay in the system indefinitely, much like a Cold War sleeper spy, causing no trouble until the particularly troublesome bit of data is accessed. The corrupt data then explodes like a bomb.

The Symptoms

A program that stores and manipulates complex input data crashes unexpectedly while performing a task similar to other tasks that cause no problem.

A Syntactic Cause

Suppose we have a JDBC application that stores a database table called Mapping that maps String names to sets of elements. Each element of each set refers to a key stored in another table, Properties, containing various known properties of these elements. (JDBC serves up a common API for connecting to and eliciting services from databases on a variety of platforms. See the Resources chapter for more information on this powerful API.)

Let's say that both Mapping and Properties are initially read from a text file developed by an outside source (outside meaning any data source not generated by our JDBC application itself) in which each line starts with a name and is followed by a representation of the corresponding set, as follows:

Listing 13-1: Data from an Outside Source Text File

 In the Mapping file:
 apples {macintosh, gala, golden-delicious}
 trees  {elm, beech, maple, pine, birch}
 rocks  {quartz, limestone, marble, diamond}
 ...
 In the Properties file:
 macintosh {color: red, taste: sour}
 gala      {color: red, taste: sweet}
 diamond   {color: clear, rigidity: hard, value: high}
 ...

The Mapping and Properties table entries could be parsed and passed to a method that inserts them into a database. But there are potential pitfalls in this approach. For example, let's suppose that we have written a class that handles a JDBC-compliant database. Following the JDBC API, we could define a PreparedStatement object and use it to pass information into the database, like so:

Listing 13-2: Defining a PreparedStatement Object for Passing Data

 // One placeholder per column of the MAPPING table.
 PreparedStatement insertionStmt =
   con.prepareStatement("INSERT INTO MAPPING VALUES(?,?)");
 ...
 // Binds the name and the set text to the placeholders and runs the insert.
 public void insertEntry(String domain, String range)
   throws SQLException {
   insertionStmt.setString(1, domain);
   insertionStmt.setString(2, range);
   insertionStmt.executeUpdate();
 }

Inserting two Strings this way may or may not be all right, depending on how the Strings are obtained from the text file. Suppose, for example, that a simple regular expression-matching tool is used to split each line into two Strings:

One String contains all the characters before the first space.
One String contains all the characters after the first space.

Such a rudimentary parse of the text file would not catch minor corruption in the data. For example, consider what would happen if one of the lines were in the following form:

 trees  {elm, beech, maple, pine birch}

The comma between pine and birch is missing. An error such as this can easily result from a bug in the tool that generates the file or from manual editing of the file.

At any rate, the data would enter the database in its corrupt form, waiting silently to be accessed. If the method used to access data expects entries to be separated by commas and spaces, it will crash when reading this entry.

If the program simply distinguishes the elements of the set by commas alone, an even more serious error can occur. The system could interpret pine birch as a single type of tree (a single entry of data) and propagate the bug further into the computation.
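To make the failure concrete, here is a minimal sketch of the rudimentary split described above, followed by a comma-only reading of the set. The class and method names (NaiveSplitDemo, splitAtFirstSpace, elementsByComma) are hypothetical, invented for illustration; they are not part of the JDBC application.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class: shows how a corrupt line slips through a naive split.
final class NaiveSplitDemo {

    /** Splits a line at the first space: the name before it, the set text after it. */
    static String[] splitAtFirstSpace(String line) {
        return line.split(" ", 2);
    }

    /** Distinguishes set elements by commas alone, trimming surrounding whitespace. */
    static List<String> elementsByComma(String setText) {
        String inner = setText.trim();
        // Assumes the text is wrapped in braces; nothing here verifies that.
        inner = inner.substring(1, inner.length() - 1);
        String[] parts = inner.split(",");
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
        }
        return Arrays.asList(parts);
    }

    public static void main(String[] args) {
        String[] entry = splitAtFirstSpace("trees  {elm, beech, maple, pine birch}");
        // No exception is thrown; the list simply has four elements instead of five.
        System.out.println(elementsByComma(entry[1]));
    }
}
```

Nothing in this code complains about the missing comma. The last element comes back as the single string "pine birch", which is exactly the kind of value that travels quietly into later computation.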

A Semantic Cause

Our example is one in which a simple, syntactic constraint of the data was violated. Of course, that's not the only way in which the data might be corrupted. Semantic-level constraints can be violated as well. In our example, one expectation of the data in the Mapping table is that every element in each set appears as a key in the Properties table. If this invariant were violated, we might end up trying to read a Properties entry that isn't there, causing an exception to be thrown.

In this chapter we use database entries as examples, but a Saboteur Data bug can come at you in a variety of ways—as many ways as there are data-input avenues. When data is read by a program, whether it is from a file, a keyboard, a microphone, a network port, or a digital camera, the potential for a saboteur exists.

Cures and Preventions

The best defense against the Saboteur Data bug is one that is universally employed by compiler and interpreter developers. Because the input data to these programs is so complex, developers have no choice but to perform as thorough an integrity check as possible when first reading the input, rather than upon later access.

Let's look at several elimination methods.

Parsing as an Elimination Method

The very practice of parsing input is a way to eliminate most saboteurs. Unfortunately, programmers who would never think of writing a compiler without a parser fail to write adequate parsing methods for simpler data. The parsing of simpler data is easy, but that's no excuse for not parsing it at all.

Any program that reads data—no matter how simple—should parse it. After all, such a program can be viewed as a compiler or an interpreter over the "language" defined by its set of valid inputs. Take it from someone who has been there. I plead guilty to having manipulated data without proper parsing in my young and reckless days, and I suffered the consequences—rampant saboteurs. I don't recommend the experience.
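As a rough illustration, a parser for one line of the Mapping file can be quite small. The sketch below is only one possible approach, and it rests on a grammar I am assuming for the example: names consist of letters, digits, and hyphens, sets are non-empty, and elements are separated by commas with optional whitespace. The class name MappingLineParser and its parse method are hypothetical.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical strict parser for one line of the Mapping file.
final class MappingLineParser {
    // Assumed grammar: a name is a letter followed by letters, digits, or hyphens.
    private static final String NAME = "[A-Za-z][A-Za-z0-9-]*";

    // name, whitespace, then a brace-delimited, comma-separated, non-empty list of names.
    private static final Pattern LINE = Pattern.compile(
        "\\s*(" + NAME + ")\\s+\\{\\s*(" + NAME + "(?:\\s*,\\s*" + NAME + ")*)\\s*\\}\\s*");

    /** Returns the name and its elements, or throws if the whole line does not match. */
    static Map.Entry<String, List<String>> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Malformed Mapping line: " + line);
        }
        List<String> elements = new ArrayList<>();
        for (String element : m.group(2).split(",")) {
            elements.add(element.trim());
        }
        return new SimpleImmutableEntry<>(m.group(1), elements);
    }

    public static void main(String[] args) {
        System.out.println(parse("trees  {elm, beech, maple, pine, birch}"));
        try {
            parse("trees  {elm, beech, maple, pine birch}");
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Because the whole line must match the pattern, a line with a missing comma is rejected at read time, before it can reach the database.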

Type Checking as an Elimination Method

Another common form of checking performed by compilers for many languages (including the Java language) is type checking. Type checking is an example of a semantic-level check on the integrity of a program.

Provided that the type system is sound (as the Java type system is designed to be, with known exceptions such as covariant arrays and unchecked casts, which are checked at runtime instead), this integrity check guarantees that a large class of errors can never occur at runtime. Like parsing, this lesson from compiler writers can be applied to other programs, which often depend on semantic-level invariants over their input data (as in our example). These invariants are often not explicit, but they can be made explicit by putting in the corresponding checks.
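For our example, the invariant that every set element has an entry in the Properties table could be made explicit along the following lines. This is a sketch under assumptions: the two tables are represented here as in-memory collections rather than JDBC result sets, and the names MappingInvariant and danglingElements are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical check of the invariant: every set element is a key in Properties.
final class MappingInvariant {

    /** Returns every set element that has no corresponding key in the Properties table. */
    static Set<String> danglingElements(Map<String, List<String>> mapping,
                                        Set<String> propertyKeys) {
        Set<String> dangling = new TreeSet<>();
        for (List<String> elements : mapping.values()) {
            for (String element : elements) {
                if (!propertyKeys.contains(element)) {
                    dangling.add(element);
                }
            }
        }
        return dangling;
    }

    public static void main(String[] args) {
        Map<String, List<String>> mapping = new HashMap<>();
        mapping.put("apples", Arrays.asList("macintosh", "gala", "golden-delicious"));
        Set<String> propertyKeys = new HashSet<>(Arrays.asList("macintosh", "gala", "diamond"));
        // Any element reported here would cause a failed lookup later on.
        System.out.println(danglingElements(mapping, propertyKeys));
    }
}
```

Running a check like this when the data is first loaded turns an implicit expectation into a reported error, instead of an exception at some later, unrelated point.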

Iteration as an Elimination Method

Of course, if you suspect an occurrence of this bug pattern with data that has already been read in and stored, it would be prudent to iterate over the data, accessing each bit of data as it would be accessed in the deployed application and ensuring that everything works as expected. In the process, you might be able to correct simple errors as well.
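One way to structure such a pass is sketched below. It assumes the stored entries have been loaded into a map from name to raw set text, and that the access routine used by the deployed application can be passed in as a function that throws a RuntimeException on bad data. The names StoredDataAudit and audit are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical offline audit: exercises every stored entry and records the failures.
final class StoredDataAudit {

    /**
     * Runs the given access routine on every stored value.
     * Returns a map from entry name to failure message for each entry that threw.
     */
    static Map<String, String> audit(Map<String, String> stored, Consumer<String> access) {
        Map<String, String> failures = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : stored.entrySet()) {
            try {
                access.accept(entry.getValue());
            } catch (RuntimeException ex) {
                // Keep going: the goal is a full list of saboteurs, not the first one.
                failures.put(entry.getKey(), String.valueOf(ex.getMessage()));
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        Map<String, String> stored = new LinkedHashMap<>();
        stored.put("rocks", "{quartz, limestone, marble, diamond}");
        stored.put("trees", "{elm, beech, maple, pine birch}");

        // Stand-in for the application's real access code (an assumption of this sketch):
        // it rejects any element that contains a space.
        Consumer<String> access = value -> {
            for (String element : value.replaceAll("[{}]", "").split(",")) {
                if (element.trim().contains(" ")) {
                    throw new IllegalStateException("bad element: " + element.trim());
                }
            }
        };
        System.out.println(audit(stored, access));
    }
}
```

Collecting failures rather than stopping at the first one gives a complete picture of the damage, which also makes it easier to decide which errors are simple enough to correct in place.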

In cases where your data is stored in an immutable database or other immutable finite store, such an offline integrity check can also serve as a performance optimization. If you check over all of the data offline and it all passes, there is no need to check it again when it's used online. You might as well conserve the processor cycles.

But this optimization should be done only when the data is truly immutable, and only when there is no chance that the data will be corrupted while it is being read from storage. If there is even a remote chance that new data will be entered or if the connection to this data is any less reliable than a connection to the local filesystem, check it again while reading it. After all, these integrity checks rarely cause significant performance degradation; the data retrieval process itself will almost always be the bottleneck.

But even a small risk of saboteur data is too much risk. Just one case will easily outweigh any advantages of not doing the checks—both from the perspective of your customer when the software makes a catastrophic nosedive and from your perspective when you try to diagnose what happened. Because the symptoms are far removed from the cause, saboteur data can be a bear to diagnose.

A Caveat on Elimination Methods

I don't mean to imply that it is always possible to perform enough checks to eliminate each piece of saboteur data from a program. If that were the case, this would be a much less problematic bug pattern.

There are many reasons why a saboteur might be undetectable before it starts wreaking havoc:

The data necessary to perform all the checks is not available until after the saboteurs are stored away, and they are not all accessible offline.
The function checking the complete set of constraints is not even computable (as is the case for many programming languages).
The set of constraints is computable, but the resources required to check them are beyond what's available to the program.

In such cases, the best we can do is eliminate as many forms of saboteurs as possible.