3.9. Dealing with Deviations and Errors | Prefactoring: Extreme Abstraction, Extreme Separation, Extreme Readability


Dealing with error conditions is probably the hardest part of the development effort. Errors fall into at least two categories: conditions that arise in the normal operation of the program, and failures in the environment in which the program is operating.

I prefer the term deviation for an error that occurs during normal processing. A deviation is a departure from straightforward processing that can occur during normal program operation. Most use case logic describes the straightforward path: the user does this, and the system responds with that. In the normal course of processing, however, the system sometimes needs to deviate from this straightforward path.

For example, a user might enter a CustomerID that does not match any of the IDs in the set of Customers. This could occur because the CustomerID was entered incorrectly, or because the Customer was deleted after not having rented for several years. If the collection of customers is kept on a server, other possible causes include a network failure or a server failure.

The first set of causes for a CustomerID not being found are deviations that can occur during normal processing. A correction mechanism can be suggested to the user (e.g., reenter the ID), though user action might not solve the problem. ^[*] The second set of causes (network or server failure) are errors, not deviations. They should not occur during normal operation. However, if the server or network were known to be unreliable, they could be handled as deviations.

^[*] For example, the failure could be due to an incorrect conversion from a String into a CustomerID. That failure is a program bug that should have been caught in testing. It is not a deviation.

Deviations should be dealt with at an appropriate level. The methods closest to where the deviation occurs often have the most information regarding what actions the user can take. If opening a nonexistent file signals an error, the caller of the open method usually knows the file's purpose and can add information regarding what might occur in the absence of that file. For example, suppose the file the method was opening was a configuration file. If the configuration file is nonexistent, the method might choose to use default settings. If the configuration file is absolutely required by the program, the method can signal an error.
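The configuration-file case can be sketched in Java as follows. This is a minimal sketch under stated assumptions: settings are assumed to live in a `java.util.Properties` file, and the `ConfigLoader` class, its `required` flag, and the `rental.days` default are hypothetical names chosen for illustration. The caller, which knows what the file is for, decides whether a missing file is a deviation (use defaults) or an error (rethrow).

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

class ConfigLoader {
    // Hypothetical default settings, used when the optional file is absent.
    static Properties defaults() {
        Properties p = new Properties();
        p.setProperty("rental.days", "3");
        return p;
    }

    // The caller knows the file's purpose, so it decides what "missing" means.
    static Properties load(Path file, boolean required) throws IOException {
        try (Reader in = Files.newBufferedReader(file)) {
            Properties p = new Properties(defaults()); // file values override defaults
            p.load(in);
            return p;
        } catch (NoSuchFileException e) {
            if (required) {
                throw e; // the program cannot run without it: signal an error
            }
            return defaults(); // a deviation: carry on with default settings
        }
    }

    public static void main(String[] args) throws IOException {
        Properties p = load(Paths.get("missing-config.properties"), false);
        System.out.println(p.getProperty("rental.days"));
    }
}
```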

Errors also should be dealt with at an appropriate level. There are two types of errors: fatal errors and nonfatal errors. Fatal errors are conditions for which further processing is probably futile. Examples of fatal errors include "out of memory" and "out of disk space." The user level is usually the place to deal with these errors, because the internal code cannot correct them. Nonfatal errors are conditions for which the program can continue operation, albeit in a reduced capacity. An example is the inability to contact a service over a primary network. The methods at the level where this error occurs should attempt contact over a backup network instead of passing the error up to a higher level. If enough nonfatal errors occur, they can add up to a fatal error. For example, if both the primary and backup networks go down, a fatal error should be signaled.
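The primary/backup example might look like this in Java. It is a sketch, not a prescribed design: the `Route` interface, `ServiceClient`, and both exception classes are hypothetical. Each route failure is a nonfatal error handled at the level where it occurs; only when every route has failed is a fatal, unchecked exception raised.

```java
import java.util.Arrays;
import java.util.List;

// A nonfatal error: one route to the service is down.
class RouteUnavailableException extends Exception {
    RouteUnavailableException(String message) { super(message); }
}

// A fatal error: every route is down, so further processing is futile.
class ServiceUnreachableException extends RuntimeException {
    ServiceUnreachableException(String message) { super(message); }
}

// Hypothetical abstraction over one network path to a remote service.
interface Route {
    String send(String request) throws RouteUnavailableException;
}

class ServiceClient {
    private final List<Route> routes; // primary first, then backups

    ServiceClient(List<Route> routes) { this.routes = routes; }

    String call(String request) {
        StringBuilder failures = new StringBuilder();
        for (Route route : routes) {
            try {
                return route.send(request);
            } catch (RouteUnavailableException e) {
                // Handle the nonfatal error here: record it and try the next route.
                failures.append(e.getMessage()).append("; ");
            }
        }
        // Enough nonfatal errors have accumulated to become a fatal one.
        throw new ServiceUnreachableException("All routes failed: " + failures);
    }

    public static void main(String[] args) {
        Route primary = request -> { throw new RouteUnavailableException("primary down"); };
        Route backup = request -> "reply to " + request;
        ServiceClient client = new ServiceClient(Arrays.asList(primary, backup));
        System.out.println(client.call("ping")); // answered over the backup route
    }
}
```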

DECIDE ON A STRATEGY TO DEAL WITH DEVIATIONS AND ERRORS

Determine for your system what deviations and errors are, and how to deal with both.

Whether deviations are signaled using return codes or exceptions is a matter of preference. If they are reported using exceptions, they should be classified into their own hierarchy to differentiate them from exceptions for unexpected conditions. If all possible deviations are coded as just regular exceptions, it becomes difficult to separate the expected from the unexpected.
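A separate deviation hierarchy could be sketched in Java as shown below. The sketch assumes checked exceptions, uses a `String` as a stand-in for the `CustomerID` type, and introduces hypothetical `Deviation`, `CustomerNotFound`, and `CustomerDirectory` classes. Because the handler catches `Deviation` rather than `Exception`, unexpected failures are not swallowed by the code that deals with expected ones.

```java
import java.util.HashMap;
import java.util.Map;

// Root of the deviation hierarchy: expected departures from the main path.
class Deviation extends Exception {
    Deviation(String message) { super(message); }
}

class CustomerNotFound extends Deviation {
    CustomerNotFound(String customerId) {
        super("No customer has ID " + customerId);
    }
}

class CustomerDirectory {
    private final Map<String, String> namesById = new HashMap<>();

    void add(String customerId, String name) { namesById.put(customerId, name); }

    String nameFor(String customerId) throws CustomerNotFound {
        String name = namesById.get(customerId);
        if (name == null) {
            throw new CustomerNotFound(customerId); // mistyped or deleted ID
        }
        return name;
    }

    // Handles only deviations; unexpected exceptions propagate to a higher level.
    String describe(String customerId) {
        try {
            return nameFor(customerId);
        } catch (Deviation d) {
            return d.getMessage() + ". Please reenter the ID.";
        }
    }

    public static void main(String[] args) {
        CustomerDirectory directory = new CustomerDirectory();
        directory.add("C100", "Pat");
        System.out.println(directory.describe("C100"));
        System.out.println(directory.describe("C999"));
    }
}
```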

In some languages, notably Java, exceptions are divided into checked and unchecked exceptions. Checked exceptions are listed in the declaration of the method. The caller of the method must explicitly handle every checked exception, either by catching it or by passing it back to its own caller. Unchecked exceptions are not listed, and the caller might not even be aware that they can be thrown. Unchecked exceptions are typically used for conditions that should cause termination, such as the inability to connect to a database.
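The difference can be seen in a small Java sketch. The class and method names are hypothetical; `IOException` (checked) and `IllegalStateException` (unchecked) are standard Java classes used here only as representatives of each kind.

```java
import java.io.IOException;

class ExceptionKinds {
    // Checked: IOException appears in the declaration, so every caller
    // must catch it or declare it in turn.
    static String readSettings(boolean available) throws IOException {
        if (!available) {
            throw new IOException("settings unavailable");
        }
        return "settings";
    }

    // Passes the checked exception back to its own caller by declaring it.
    static String startUp(boolean available) throws IOException {
        return readSettings(available);
    }

    // Unchecked: IllegalStateException is not declared, so callers
    // may not know it can be thrown.
    static void connectToDatabase(boolean reachable) {
        if (!reachable) {
            throw new IllegalStateException("cannot connect to database");
        }
    }

    public static void main(String[] args) {
        try {
            System.out.println(startUp(true));
        } catch (IOException e) { // required: without this catch, main would not compile
            System.out.println("Could not start: " + e.getMessage());
        }
        connectToDatabase(true); // the compiler does not require a try/catch here
    }
}
```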

3.9.1. Failure Distance

A large distance between where an error occurs and where it is noticed makes the error harder to debug. For example, suppose an object reference is set to a null value. If that reference is later used to access an object, a program exception usually occurs:

 String reference = null;
 // A few lines of code
 reference.length();    // Throws a NullPointerException

If the distance between the setting of the reference and its use is small, the problem is relatively easy to detect. If both occur within a single method, a compiler or static analysis tool can often identify the problem and issue a warning. However, if the reference is set in one method and not used until many methods later, all the intervening methods have to be examined for bugs. The sooner the error is detected, the easier it is to correct.
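One way to shrink the failure distance is to check a value where it is set rather than where it is used. The Java sketch below assumes a hypothetical `Renter` class and uses the standard `Objects.requireNonNull`, so a null name is reported in the constructor instead of several method calls later.

```java
import java.util.Objects;

class Renter {
    private final String name;

    Renter(String name) {
        // Fail here, where the bad value is set, instead of many methods
        // later where it is first used.
        this.name = Objects.requireNonNull(name, "name must not be null");
    }

    int nameLength() {
        return name.length(); // name is known to be non-null here
    }

    public static void main(String[] args) {
        System.out.println(new Renter("Pat").nameLength());
        try {
            new Renter(null);
        } catch (NullPointerException e) {
            System.out.println("Caught at construction: " + e.getMessage());
        }
    }
}
```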

The concept of distance also applies to the development process: the sooner an error is found, the easier it can be to fix. If abstract data types (ADTs) are used extensively, many errors can be detected at compile time in languages that support static type checking. For example, with a method such as:

 get_abbreviation_for_state(String state);

any string can be passed to the method. With:

 enumeration State {Arkansas, Alaska, ...}
 get_abbreviation_for_state(State aState);

the compiler will signal an error if anything other than a State is passed.
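In Java, the enumeration version could be written as shown below. These are illustrative choices, not the book's code: only two states are listed, the constants are uppercased per Java convention, the postal abbreviations are stored in the enum itself as one possible design, and the method is renamed `getAbbreviationForState` to follow Java naming.

```java
enum State {
    ARKANSAS("AR"),
    ALASKA("AK"); // remaining states omitted in this sketch

    private final String abbreviation;

    State(String abbreviation) { this.abbreviation = abbreviation; }

    String abbreviation() { return abbreviation; }
}

class StateLookup {
    static String getAbbreviationForState(State aState) {
        return aState.abbreviation();
    }

    public static void main(String[] args) {
        System.out.println(getAbbreviationForState(State.ALASKA));
        // getAbbreviationForState("Alaska"); // compile-time error: a String is not a State
    }
}
```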

WHAT'S AN ERROR TO YOU?

Deviations and errors are one way of classifying failures. Gary K. Evans, a reviewer, likes the term deviation but has grown accustomed to using exceptions, where an error is a kind of exception, but not conversely. He categorizes identified exceptions as 0 = no exception (no error); 1 = recoverable; 2 = unrecoverable; and 3 = fatal. He notes:

Recoverable exceptions let you return to the main use case path and attain the goal of the use case. For nonrecoverable exceptions, you must abandon the goal. Fatal exceptions are moot. You are going down; you cannot attain the goal; and doing use cases for them is not very worthwhile. It has been interesting to me to group all exceptions according to these categories, and to address their implementation in ascending order. If I address only the no exceptions and recoverable exceptions categories in that order in my iterations, I still get approximately 80% coverage of the total exception space.
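Evans's numbering maps naturally onto an enumeration. The Java sketch below is one hypothetical encoding (the `ExceptionCategory` name and the `goalAttainable` flag are illustrative), recording for each category whether the use case can still reach its goal. Declaring the constants in ascending order means `values()` returns them in the order he addresses them.

```java
// Gary Evans's four categories, declared in the order he addresses them.
enum ExceptionCategory {
    NO_EXCEPTION(0, true),
    RECOVERABLE(1, true),     // return to the main path and attain the goal
    UNRECOVERABLE(2, false),  // the goal must be abandoned
    FATAL(3, false);          // going down; use cases for these are not worthwhile

    final int code;
    final boolean goalAttainable;

    ExceptionCategory(int code, boolean goalAttainable) {
        this.code = code;
        this.goalAttainable = goalAttainable;
    }

    public static void main(String[] args) {
        for (ExceptionCategory c : values()) { // declaration (ascending) order
            System.out.println(c.code + " " + c + ", goal attainable: " + c.goalAttainable);
        }
    }
}
```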

Another way of classifying conditions that occur is to consider whether they fall in the area of business rules or technology. Jim Batterson, a fellow consultant, divides errors into three categories:

Things are not right but have been allowed for in the business rules. For example, what to do when someone enters a key that is not in the file, assuming your rules tell you what to do. They are not errors from a programming point of view, they are just something that takes us down a different path that we have allowed for in the program.
Application problems are occurring that were not allowed for in the business rules, so your program does not know what to do with them, but still we are in the application domain here.
We have problems that are in the technology domain. We cannot open a file or we run out of space on a drive somewhere. These were never mentioned in the requirements, but they are a problem. These problems can be classified as fatal or nonfatal errors.

3.9.2. User Messages

Messages reported to the user from deviations and errors should be meaningful to the user. They should include as much information as possible about how to work around or correct the error. Failures can be categorized by what they mean and by how the user might react to them, and the user message should indicate the category of the failure.

For example, permanent failures imply that the user trying the same operation again will get the same error. Transient failures suggest that the user might attempt the operation again immediately and might complete it successfully. Temporary failures require some undetermined period of time before they are cleared up.
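The three categories can travel with the message. In the Java sketch below, the `FailureKind` enum, the advice wording, and the `UserMessage.format` helper are all hypothetical; the point is that the message tells the user how to react rather than what broke internally.

```java
// The category tells the user how to react, not what broke internally.
enum FailureKind {
    PERMANENT("Trying again will not help."),
    TRANSIENT("Please try again now."),
    TEMPORARY("Please try again later.");

    private final String advice;

    FailureKind(String advice) { this.advice = advice; }

    String advice() { return advice; }
}

class UserMessage {
    // Combines what happened, in the user's terms, with what the user can do.
    static String format(String whatHappened, FailureKind kind) {
        return whatHappened + " " + kind.advice();
    }

    public static void main(String[] args) {
        System.out.println(format("The rental server did not respond.", FailureKind.TEMPORARY));
    }
}
```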

SOONER THAN LATER CAN SAVE MILES OF WALKING

When you are hiking the Appalachian Trail, you try to carry the minimum amount of equipment to save wear and tear on your legs and body. You carry something because you need to use it (with the exception of emergency supplies, such as Band-Aids). So it is essential that you do not leave anything behind.

When moving on, I make it a habit to spend a minute checking the area where I tented to see if I left anything. I started this practice after almost leaving behind the rope I use to keep my food away from bears. Without this rope, bears would have easier pickings at my next stop. Even when I leave a stopping point along the hike, I take another few seconds to check the area to make sure I returned everything to my pack.

If you arrive at a campsite at the end of the day and it turns out you left equipment behind, you cannot call back and ask someone to bring it along. You have a long walk back. I met one hiker who left his stove at a shelter. The shelter had a shelf outside where a stove could be placed. He just walked off without checking around the shelter. It was a 20-mile round trip to get the stove.

It was a long distance between the cause of the failure and its discovery.

REPORT MEANINGFUL USER MESSAGES

Error messages should be reported in the context of what the user can do about the error, instead of in terms of what the underlying error is.

Implementation-related messages (such as a stack trace) should be captured, but not necessarily displayed to the user. You need to provide the means for developers to see the details. Otherwise, you cannot diagnose and fix problems, particularly intermittent problems.
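A minimal Java sketch of separating the two audiences, assuming `java.util.logging` as the log destination (any logging framework would serve) and a hypothetical `ErrorReporter` helper: the stack trace goes to the log for developers, and only the user-level message is returned for display.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

class ErrorReporter {
    private static final Logger LOG = Logger.getLogger(ErrorReporter.class.getName());

    // Logs the full details (including the stack trace) for developers
    // and returns only the user-level message for display.
    static String report(String userMessage, Throwable cause) {
        LOG.log(Level.SEVERE, userMessage, cause);
        return userMessage;
    }

    public static void main(String[] args) {
        String shown = report("Your rental could not be saved. Please try again later.",
                new IOException("write failed: disk quota exceeded"));
        System.out.println(shown); // the IOException details appear only in the log
    }
}
```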

3.9.3. Assertions

Assertions are statements about conditions that must be true while a program is executing. Some developers disable assertions when a program goes into production. But if an assertion should be true during testing, it ought to be true during production as well. Assertions should be disabled only if there is a measured performance penalty.

During testing, a failed assertion usually causes the program to exit. For many applications, that behavior is appropriate; the user can be informed with a friendly terminating error message. For other applications, such as a server, exiting might not be acceptable. In that case, failed assertions should be reported immediately and logged.
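In Java, the built-in `assert` statement is disabled by default and runs only when the JVM is started with `-ea`, which amounts to the production switch-off discussed above. The hypothetical `Check` helper sketched below always evaluates its condition; whether a failure fails loudly or is logged while the program keeps running is controlled by an assumed `serverMode` flag.

```java
import java.util.logging.Logger;

class Check {
    private static final Logger LOG = Logger.getLogger(Check.class.getName());

    // Assumed configuration switch: true for a server that must keep running.
    static boolean serverMode = false;

    // Always evaluated, unlike Java's assert statement; returns the condition.
    static boolean that(boolean condition, String description) {
        if (!condition) {
            if (serverMode) {
                LOG.severe("Assertion failed: " + description); // report, log, keep going
            } else {
                throw new AssertionError(description); // fail loudly, as during testing
            }
        }
        return condition;
    }

    public static void main(String[] args) {
        serverMode = true;
        Check.that(2 + 2 == 5, "arithmetic still works"); // logged; execution continues
        System.out.println("still running");
    }
}
```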

ORDERING PIZZA

You decide you want a pizza. Unless you have a specific pizza place in mind (a difference in implementation quality), you pick a pizza place out of the phone book, call them, and place an order.

"I'd like a large pepperoni with extra cheese," you say.

"Right. Address?"

"1 Elm Street," you reply.

"Thirty minutes," you hear, followed by a hang-up.

You ordered a pizza and got a response: it will be at your place in 30 minutes. If it does not arrive, it is a failure of the implementation.

This protocol illustrates polymorphism. Almost any pizza place you call will require the same information. The speed at which they deliver might be different and the pizza might taste different, but the interface is the same.

What possible failures should you, the Hungry Customer, care about?

Closed: the pizza place does not answer. You really do not know whether it is closed or whether their phone is not working. It really does not matter. You will not get a pizza from them today. This could be a temporary failure or possibly a permanent failure.
No longer make pizza: the pizza shop has decided to concentrate on making calzones or subs. This is a permanent failure. There is no sense retrying.
Unable to make pizza at the current time: you really do not care what the reason is. You have to try another pizza place. This is a temporary failure.
Unable to deliver pizza at the current time: you can get a pizza, you just have to pick it up.

We have given user-meaningful names for these failures. You don't want implementation issues to slip back to the higher level. You don't care why they can't deliver the pizza if there is nothing you can do to correct the situation. Their vehicle might have run out of gas or their driver might have gone on vacation.

The Hungry Customer should not have to deal with an OutofGas error. If the implementation uses bicycles rather than cars, a ChainBroken error might occur. The only thing that callers should deal with is an UnableToDeliver error. The more detailed errors should be handled at a lower level and rolled into the more abstract exception. The detail could be placed in an explanation and displayed to the user. The caller might find it amusing, but all they can do about it is to decide to pick up the pizza or find another implementation.
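The rollup described above can be sketched in Java. `UnableToDeliver`, `OutofGas`, and `ChainBroken` come from the text (spelled as it spells them); the `PizzaPlace` interface, the two pizzeria classes, and the boolean flags that simulate failures are hypothetical. Each implementation catches its own detailed failure and rethrows it as `UnableToDeliver`, keeping the original as the cause so the explanation is still available for display or logging.

```java
// The abstract failure the Hungry Customer deals with.
class UnableToDeliver extends Exception {
    UnableToDeliver(String explanation, Throwable cause) { super(explanation, cause); }
}

// Implementation-level failures; callers never catch these directly.
class OutofGas extends Exception {
    OutofGas() { super("the delivery car ran out of gas"); }
}

class ChainBroken extends Exception {
    ChainBroken() { super("the delivery bicycle's chain broke"); }
}

// Every pizza place offers the same interface.
interface PizzaPlace {
    String order(String pizza, String address) throws UnableToDeliver;
}

class CarPizzeria implements PizzaPlace {
    private final boolean hasGas;
    CarPizzeria(boolean hasGas) { this.hasGas = hasGas; }

    private void drive() throws OutofGas {
        if (!hasGas) throw new OutofGas();
    }

    public String order(String pizza, String address) throws UnableToDeliver {
        try {
            drive();
            return pizza + " on its way to " + address;
        } catch (OutofGas e) {
            // Roll the detail into the abstract failure; keep it as an explanation.
            throw new UnableToDeliver("Unable to deliver: " + e.getMessage(), e);
        }
    }
}

class BicyclePizzeria implements PizzaPlace {
    private final boolean chainIntact;
    BicyclePizzeria(boolean chainIntact) { this.chainIntact = chainIntact; }

    private void pedal() throws ChainBroken {
        if (!chainIntact) throw new ChainBroken();
    }

    public String order(String pizza, String address) throws UnableToDeliver {
        try {
            pedal();
            return pizza + " on its way to " + address;
        } catch (ChainBroken e) {
            throw new UnableToDeliver("Unable to deliver: " + e.getMessage(), e);
        }
    }
}

class HungryCustomer {
    public static void main(String[] args) {
        PizzaPlace[] places = { new CarPizzeria(false), new BicyclePizzeria(true) };
        for (PizzaPlace place : places) {
            try {
                System.out.println(place.order("large pepperoni with extra cheese", "1 Elm Street"));
                return;
            } catch (UnableToDeliver e) {
                // The only choices: pick up the pizza or try another implementation.
                System.out.println(e.getMessage() + "; trying another place");
            }
        }
    }
}
```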
