Debugging Complex Systems | Comprehensive VB .NET Debugging

As systems become larger and move ever further away from the relatively simple client/server model, you'll find that the complexity of any system grows faster than its size. The communication paths and interactions between system components grow far faster than linearly: the number of pairwise paths grows roughly with the square of the number of components, and the number of possible interaction orderings grows factorially. To understand the behavior of large distributed systems, and even their smaller client/server counterparts, you need to understand what causes the complexity and how you can tackle the problems that complexity brings with it.
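To put rough numbers on this growth, the following Python sketch counts the pairwise communication paths between n components, n(n-1)/2, and the number of orders in which n components could each act once, n!. Both formulas are standard combinatorics rather than figures from this chapter, and the function names are purely illustrative.

```python
from math import factorial


def pairwise_paths(n):
    """Number of distinct two-way links between n components."""
    return n * (n - 1) // 2


def interaction_orderings(n):
    """Number of orders in which n components could each act once."""
    return factorial(n)


for n in (2, 5, 10):
    print(n, pairwise_paths(n), interaction_orderings(n))
# 2 1 2
# 5 10 120
# 10 45 3628800
```

Going from 5 components to 10 only doubles the size of the system, but it more than quadruples the number of paths you might have to inspect when something goes wrong.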

The software industry has been moving for a while toward components as a way of managing complexity. A component is a block of software that encapsulates some discrete functionality. It might be a DLL that does option pricing, a Web service that provides stock quotes, or perhaps a stored procedure that gives access to a database table. Alternatively, a component might be a middleware program that allows messaging between components or a third-party control that implements a grid to be displayed on a form.

These components then either interact directly with each other to create the software application or, occasionally, are linked together with "old-fashioned" procedural code. In this way, much of the complexity of the functionality being programmed is hidden away inside each component, and is then given a friendly face through the public interface that the component exposes.

This means that a large part of managing complexity in the VB .NET world is about managing multiple components and their interactions with each other. When you understand how to build components properly and how to debug their collaboration, you will understand how to beat component complexity.

Building by Contract

In a world of distributed components, several problems have to be solved in order for these components to communicate successfully with each other. Some of these issues have been at least partially solved, but several problems remain that will cause havoc with your applications if you don't understand them properly.

The first problem that has been partially solved is the wire protocol, the means by which the information passes from component to component. The most common protocol in use today is HTTP, and that is proving more and more successful as developers continue to build HTTP support directly into many of their applications.

Another problem that has also been partially solved is the communication format, the common language understood by interacting components. Nowadays people are moving toward XML as the standard communication format, although a surprising number of applications still use HTML techniques such as screen scraping. The progress of XML has been helped because it is a format that is both machine readable and human readable.

The other nasty problem that's now well understood is component coupling. When designing a system, it was very common and easy to design tightly coupled components. Tightly coupled components are bad because they rely on knowledge of each other's implementation in order to communicate successfully. This means that when you change the implementation of a component or application, all the other applications talking to it will suddenly fail. Also, when you have an error in one component, the effects of this error are more likely to transmit themselves to their tightly coupled siblings. So good system designers have now learned to build applications that are loosely coupled and that don't fail or crash when component implementation changes. The applications agree on a wire protocol and a communication format, but avoid agreeing on an implementation.
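As a minimal sketch of loose coupling, assume a hypothetical stock-quote component. The client below depends only on the agreed contract (`QuoteService`), so the implementation behind it can be swapped without the client changing. Every name here is invented for illustration, and Python is used only for brevity.

```python
from typing import Protocol


class QuoteService(Protocol):
    """The agreed contract: what clients may rely on, and nothing more."""

    def get_quote(self, symbol: str) -> float: ...


class InMemoryQuotes:
    """One implementation; its dictionary is a private detail."""

    def __init__(self):
        self._prices = {"ACME": 12.5}

    def get_quote(self, symbol: str) -> float:
        return self._prices[symbol]


class FixedQuotes:
    """A completely different implementation of the same contract."""

    def __init__(self, price: float):
        self._price = price

    def get_quote(self, symbol: str) -> float:
        return self._price


def describe(service: QuoteService, symbol: str) -> str:
    # The client uses only the contract, so either implementation works.
    return f"{symbol}: {service.get_quote(symbol):.2f}"


print(describe(InMemoryQuotes(), "ACME"))  # ACME: 12.50
print(describe(FixedQuotes(3.0), "ACME"))  # ACME: 3.00
```

A tightly coupled client would instead reach into `_prices` directly, and would break as soon as the first implementation stopped using a dictionary.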

So far, so good: this is a simplified description of the state of the art in distributed systems at the start of the twenty-first century. Components and applications create contracts that their clients have to understand and follow. But what happens when these contracts don't have a way of expressing certain concepts critical to normal business processes? The next section discusses some of the problems that can arise.

Understanding Communication Issues

Most applications and their individual components are what are called finite state machines. This means that at any one time, an application is in a particular state and can then move to one of several different states. For example, a server component that deals with e-commerce shopping carts will accept a request for a cart, handle requests to add items to that shopping cart, and finally accept a request to pay for the contents of the cart. At any one time, the server component will have a view of what's allowed and what's coming next. The problem is that there's no current machine-readable language that allows a component to express formally its possible states and therefore the contract that it supports.
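The shopping-cart example can be written down as a small state machine. The sketch below is one possible reading of it, not a definitive design: the state names, request names, and the `CartServer` class are all assumptions made for illustration.

```python
from enum import Enum, auto


class CartState(Enum):
    NO_CART = auto()
    OPEN = auto()
    PAID = auto()


# (current state, request) -> next state. Anything not listed is not allowed.
TRANSITIONS = {
    (CartState.NO_CART, "create_cart"): CartState.OPEN,
    (CartState.OPEN, "add_item"): CartState.OPEN,
    (CartState.OPEN, "pay"): CartState.PAID,
}


class SequenceError(Exception):
    """Raised when a request arrives in a state that does not allow it."""


class CartServer:
    def __init__(self):
        self.state = CartState.NO_CART

    def handle(self, request):
        key = (self.state, request)
        if key not in TRANSITIONS:
            raise SequenceError(f"{request!r} not allowed in state {self.state.name}")
        self.state = TRANSITIONS[key]
        return self.state


server = CartServer()
for request in ("create_cart", "add_item", "pay"):
    server.handle(request)
print(server.state.name)  # PAID
```

The `TRANSITIONS` table is the server's private view of what is allowed and what comes next. Nothing in the wire protocol or the message format tells a client what that table contains.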

This leads to several types of communication issues, which I discuss in the next few sections.

Sequencing Bugs

So what happens when your shopping cart receives a request to pay for the contents of a shopping cart without ever having received the initial cart request? In other words, what happens when a client component has got the sequence of its requests wrong, probably because there was a misunderstanding of the contract offered by the server? The result is a bug, and it's the sort of bug that can be very difficult to diagnose and fix because of the likely complexity of the interactions between the two applications.

This is a very common problem, especially when the number of transactions (and therefore states) that a server component supports is large. Persuading two different applications, each of which is a complex finite state machine, to agree on the precise sequencing of every transaction between them is very hard. Because the sequencing is not expressed in a machine-readable way, it is up to the developers of both applications to agree on every step in the conversation sequence, and this process is very prone to error. Even if the original transactions are agreed on properly, keeping the two state machines in step as the transactions evolve over time is still very difficult.
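One way to catch this kind of drift early is to have each team write down its understanding of the conversation as a transition table and then compare the two. The Python sketch below assumes that both teams are willing to do this. The tables and the `contract_mismatches` function are hypothetical.

```python
# Each table maps (state, request) -> next state, as that team understands it.
SERVER_VIEW = {
    ("no_cart", "create_cart"): "open",
    ("open", "add_item"): "open",
    ("open", "pay"): "paid",
}

# The client team believes a cart is created implicitly by the first add_item.
CLIENT_VIEW = {
    ("no_cart", "add_item"): "open",
    ("open", "add_item"): "open",
    ("open", "pay"): "paid",
}


def contract_mismatches(client, server):
    """List every transition on which the two views disagree."""
    problems = []
    for key in sorted(set(client) | set(server)):
        state, request = key
        if key not in server:
            problems.append(f"client sends {request!r} in {state!r}; server rejects it")
        elif key not in client:
            problems.append(f"server accepts {request!r} in {state!r}; client never sends it")
        elif client[key] != server[key]:
            problems.append(f"{request!r} in {state!r} leads to different states")
    return problems


for problem in contract_mismatches(CLIENT_VIEW, SERVER_VIEW):
    print(problem)
```

Running the comparison as part of integration testing turns a misunderstanding that would otherwise surface as a runtime sequencing bug into a short, readable list of disagreements.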

Latency Bugs

The second cause of bugs is latency. What should happen if a request to pay for the shopping cart's contents is not answered for 5 minutes? Should the payment request be abandoned and the shopping expedition ended? If the client does abandon it, the payment request might still eventually be accepted and the shopper's account debited! Or should the client code wait longer (how long?) for the request to be answered? The problem here is that, once again, there is no common machine-readable contract that the components can use to agree on timing issues.
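The danger of abandoning a slow payment request can be reproduced in a few lines. The following sketch scales the 5 minutes down to fractions of a second and uses a thread to stand in for the remote server. The function names and the `ledger` list (standing in for the shopper's account) are assumptions for illustration only.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

ledger = []  # stands in for the shopper's account


def slow_payment_server(amount):
    """Hypothetical server that answers late but still takes the money."""
    time.sleep(0.3)
    ledger.append(amount)
    return "accepted"


def pay_with_timeout(amount, timeout_s):
    """Client that gives up waiting after timeout_s seconds."""
    # Leaving the with-block waits for the server thread to finish, so the
    # ledger can be inspected as soon as this function returns.
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_payment_server, amount)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            # The client stops waiting, but the request is already in flight.
            return "abandoned"


outcome = pay_with_timeout(9.99, timeout_s=0.05)
print(outcome, ledger)  # abandoned [9.99]
```

The client believes the purchase was abandoned, yet the account has been debited. Neither side has a bug in isolation; the defect lives in the timing assumption that the two sides never agreed on.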

The result is that both applications stubbornly stick to their own understanding of the contract, and therefore bugs caused by latency issues flourish. These types of defects are very common, especially when applications try to work together in real time.

Sequencing and Latency Interactions

In the real world, these problems are likely to be combined. During a conversation between multiple applications, you might send requests to several different applications and then receive the replies in a very fluid order. This means potentially keeping track of the state of multiple conversations and trying to coordinate the results into a coherent and sensible whole.
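A common way to keep track of several conversations at once is to tag each request with a correlation identifier and match replies against it, whatever order they arrive in. The sketch below assumes such identifiers are available. The `Conversation` class and the component names are invented for illustration.

```python
class Conversation:
    """Tracks requests sent to several components and their replies."""

    def __init__(self):
        self._next_id = 0
        self.pending = {}  # correlation id -> target component
        self.replies = {}  # correlation id -> reply payload

    def send(self, target):
        self._next_id += 1
        self.pending[self._next_id] = target
        return self._next_id

    def receive(self, correlation_id, payload):
        if correlation_id not in self.pending:
            raise KeyError(f"unexpected reply {correlation_id}")
        del self.pending[correlation_id]
        self.replies[correlation_id] = payload

    def complete(self):
        return not self.pending

    def results_in_request_order(self):
        return [self.replies[i] for i in sorted(self.replies)]


talk = Conversation()
price_id = talk.send("pricing")
stock_id = talk.send("stock")
talk.receive(stock_id, "in stock")  # replies arrive out of order
talk.receive(price_id, "12.50")
print(talk.complete(), talk.results_in_request_order())  # True ['12.50', 'in stock']
```

When a conversation stalls, the `pending` dictionary also tells you exactly which component has not yet answered, which is often the first thing you need to know when debugging.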

Semantic Bugs

When I make a verbal or written contract with you to perform a service, such as architecting your new house or performing some programming, we try to come to a common agreement about the services and payments involved. Software components have a similar problem in that they need to agree on every detail of the common contracts between them. If an XML field states itself to be the instrument price, is that net or gross? Does an XML field called "earnings" represent net earnings, earnings before interest, or earnings before interest, tax, and depreciation? The problem is that common messaging formats such as XML only push the problem of semantic meaning up by one level. The scope for communication misunderstandings is still as large as ever. If anything, the problems can be even worse because common formats such as XML appear to solve many communication issues. In fact, the problem of meaning is just suppressed and will appear at a later and potentially more expensive stage of the development cycle.
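One partial defense is to make the receiving code refuse any value whose meaning is not stated explicitly. The sketch below assumes a hypothetical `price` element that carries a `basis` attribute. Neither name comes from a real schema, and both teams must still agree on what "net" and "gross" actually mean.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

ALLOWED_BASES = {"net", "gross"}


def read_price(xml_text):
    """Return (amount, basis), refusing prices whose meaning is not stated."""
    element = ET.fromstring(xml_text)
    basis = element.get("basis")
    if basis not in ALLOWED_BASES:
        raise ValueError("price must declare basis='net' or basis='gross'")
    return Decimal((element.text or "").strip()), basis


print(read_price('<price basis="net">101.25</price>'))

try:
    read_price("<price>101.25</price>")  # well-formed XML, ambiguous meaning
except ValueError as error:
    print("rejected:", error)
```

The second message is perfectly valid XML, which is exactly the trap: the format check passes while the question of meaning goes unanswered. Failing loudly at the boundary at least surfaces the ambiguity during integration testing rather than later in the development cycle.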

Dealing with Failure

The defining difference between local and distributed applications is the idea of failure. Communication calls between local application components normally just work, whereas the same calls made between remote components can fail in many ingenious ways. Diagnosing the failure of remote components is also much more difficult than diagnosing local failures. You need to distinguish between partial failure and complete failure if you want to ensure that your application is robust.
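To make the distinction between partial and complete failure concrete, the sketch below calls several components and classifies the overall outcome. It is a simplified illustration under my own assumptions: the component names are invented, and each remote call is represented by an ordinary Python function.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    PARTIAL_FAILURE = "partial failure"
    COMPLETE_FAILURE = "complete failure"


def call_all(components):
    """Call every component, collecting results and errors separately."""
    results, errors = {}, {}
    for name, call in components.items():
        try:
            results[name] = call()
        except Exception as exc:  # a sketch: real code would catch narrower types
            errors[name] = exc
    if not errors:
        outcome = Outcome.SUCCESS
    elif results:
        outcome = Outcome.PARTIAL_FAILURE
    else:
        outcome = Outcome.COMPLETE_FAILURE
    return outcome, results, errors


def pricing():
    return 12.5


def stock():
    raise ConnectionError("stock service unreachable")


outcome, results, errors = call_all({"pricing": pricing, "stock": stock})
print(outcome.value, sorted(results), sorted(errors))
# partial failure ['pricing'] ['stock']
```

Keeping the errors alongside the successful results, rather than stopping at the first exception, lets the caller decide whether a partial answer is still usable.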

For a detailed treatment of this issue and some possible solutions, please see Chapter 15, which deals with the debugging of distributed applications.

Possible Solutions

Recognition and understanding of these problems is the first step toward solving them. A further step is to have good documentation for each component that expresses every transaction and every state properly, including sequencing and timing information. This documentation needs to be available to both server and client developers, and kept up to date as the code and the contracts change. One of the most useful, and most frequently overlooked, items in this documentation is a complete list of every known exception (error) that can be generated by the component, what each exception means, and the circumstances under which each exception is raised. This list is invaluable for a developer who needs to understand the behavior of an application component and how to use it safely.
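One way to stop such an exception list from drifting out of date is to generate it from the code. The sketch below assumes that every error a component raises derives from a single base class and documents itself in its docstring. The class names are hypothetical.

```python
class CartError(Exception):
    """Base class for every error the cart component can raise."""


class CartNotFound(CartError):
    """The cart id is unknown.

    Raised when a request names a cart that was never created.
    """


class PaymentDeclined(CartError):
    """The payment provider refused the charge.

    Raised by a pay request when the provider returns a refusal.
    """


def exception_catalogue(base):
    """Build the documented list of exceptions from the class hierarchy."""
    rows = []
    for cls in base.__subclasses__():
        rows.append((cls.__name__, (cls.__doc__ or "UNDOCUMENTED").strip()))
        rows.extend(exception_catalogue(cls))
    return sorted(rows)


for name, meaning in exception_catalogue(CartError):
    print(name, "-", meaning.splitlines()[0])
# CartNotFound - The cart id is unknown.
# PaymentDeclined - The payment provider refused the charge.
```

Any exception class added without a docstring shows up in the list as UNDOCUMENTED, which makes the gap easy to spot in a review.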

One solution for communication issues is to create a machine-readable and human-readable document that properly expresses the contract offered by a component. This still requires that all components subscribing to the contract agree on a common contract schema, but if the schema is able to express concepts such as sequencing and timing, both integration testing and fault diagnosis become much easier. It's even possible to automate some of these processes, and human error can at least partially be removed from the enforcement of contracts between components.
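As a rough sketch of what such a document might look like, the example below expresses the shopping-cart contract as JSON, including sequencing and a maximum reply time for each request, and then checks a recorded conversation against it. The schema, its field names (`start`, `transitions`, `max_reply_s`), and the time limits are all invented for illustration. No existing standard is implied.

```python
import json

CONTRACT_JSON = """
{
  "start": "no_cart",
  "transitions": [
    {"from": "no_cart", "request": "create_cart", "to": "open", "max_reply_s": 2},
    {"from": "open", "request": "add_item", "to": "open", "max_reply_s": 2},
    {"from": "open", "request": "pay", "to": "paid", "max_reply_s": 30}
  ]
}
"""


def check_trace(contract, trace):
    """Check a recorded conversation against the contract.

    trace is a list of (request, reply_seconds) pairs in the order sent.
    Returns a list of human-readable violations (empty if there are none).
    """
    rules = {(t["from"], t["request"]): t for t in contract["transitions"]}
    state, violations = contract["start"], []
    for request, reply_seconds in trace:
        rule = rules.get((state, request))
        if rule is None:
            violations.append(f"sequencing: {request!r} not allowed in {state!r}")
            continue
        if reply_seconds > rule["max_reply_s"]:
            limit = rule["max_reply_s"]
            violations.append(f"timing: {request!r} took {reply_seconds}s, limit {limit}s")
        state = rule["to"]
    return violations


contract = json.loads(CONTRACT_JSON)
print(check_trace(contract, [("create_cart", 0.1), ("add_item", 0.2), ("pay", 5)]))  # []
print(check_trace(contract, [("pay", 1)]))
```

Because the same document can be read by people and by a checker like this one, it can drive integration tests and be applied to logged conversations during fault diagnosis.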