Section 10.1. Motivation for Reliable Messaging | Web Services Platform Architecture: SOAP, WSDL, WS-Policy, WS-Addressing, WS-BP[. .. ] More

10.1. Motivation for Reliable Messaging

L. Peter Deutsch, a noted computer scientist, is credited with publishing what has become known in software engineering circles as the "Eight Fallacies of Distributed Computing." He first presented them in a talk to the researchers and engineers at Sun Microsystems Labs in 1991. At that time there were only seven fallacies; he added the eighth sometime later. The eight fallacies are as follows:

The network is reliable.
Latency is zero.
Bandwidth is infinite.
The network is secure.
Topology doesn't change.
There is one administrator.
Transport cost is zero.
The network is homogeneous.

Web services are, at their essence, distributed applications. When designing any Web service, consider these fallacies carefully, and ask yourself whether any of your design decisions inadvertently relies on one of these false assumptions.

The next sections dive deeper into a few of these fallacies and discuss their relevance to Web services.

10.1.1. The Network Is Reliable

The first of these fallacies, "The network is reliable," is one trap into which many software engineering projects fall. Given that most Web services deployed today use Transmission Control Protocol (TCP), a highly reliable connection-oriented host-to-host network protocol, you might think that it's unimportant to concern yourself with the inherent unreliability of the network. However, TCP is reliable only to the extent that the sending TCP stack can be certain that a message has been delivered to the TCP stack at the receiving host. Likewise, the receiving host can be certain only of whether or not it has received a message. Neither guarantee extends to the Web service itself, which resides far above the TCP/IP interface, so things can still go wrong from its perspective.

First, consider that the reliability of TCP is limited in scope to the two communicating TCP stacks and everything in between. Although the receiving TCP stack assumes responsibility for passing received messages up to the application layer, the receiving application's process could terminate before the stack has been able to hand them over. Messages that have been successfully received and acknowledged at the TCP layer could therefore be lost from the perspective of the application.

Second, the sending process might terminate before the sending application learns that the receiving TCP stack has received and acknowledged its message, let alone that the receiving application has processed it.

Either of these two failure modes can leave a Web service consumer or provider in an inconsistent state with respect to its counterpart. Although this might not present a problem for certain stateless and/or idempotent operations such as an HTTP GET, it can present quite a serious problem for others that are not idempotent.
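As a rough illustration of why non-idempotent operations are the hard case, the following Python sketch simulates a consumer that retransmits a deposit until it receives an application-level acknowledgment, and a provider that remembers message identifiers so that a retransmission is not applied twice. The names `Provider`, `deposit`, `acks_to_drop`, and `send_until_acknowledged` are hypothetical and exist only for this example; no real network or Web services stack is involved.

```python
import uuid


class Provider:
    """Hypothetical receiver that applies each message ID at most once."""

    def __init__(self, acks_to_drop=0):
        self.balance = 0
        self._results = {}                 # message_id -> result of first processing
        self._acks_to_drop = acks_to_drop  # simulates acknowledgments lost in transit

    def deposit(self, message_id, amount):
        # Apply the deposit only the first time this message ID is seen.
        if message_id not in self._results:
            self.balance += amount
            self._results[message_id] = self.balance
        # Simulate a lost acknowledgment: the work was done, but the
        # sender never hears about it.
        if self._acks_to_drop > 0:
            self._acks_to_drop -= 1
            return None
        return self._results[message_id]


def send_until_acknowledged(provider, amount, max_attempts=5):
    """Retransmit with the same message ID until the provider acknowledges."""
    message_id = str(uuid.uuid4())  # reused on every retransmission
    for attempt in range(1, max_attempts + 1):
        ack = provider.deposit(message_id, amount)
        if ack is not None:
            return attempt, ack
    raise RuntimeError("no application-level acknowledgment received")


provider = Provider(acks_to_drop=2)
attempts, balance = send_until_acknowledged(provider, 100)
print(attempts, balance)  # three attempts, but the deposit is applied once
```

Without the message identifier check, each retransmission would add another 100 to the balance, which is exactly the kind of inconsistency described above.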

If you are going to provide for reliable messaging in the context of Web services, you need to keep these issues in mind.

10.1.2. Latency Is Zero

Whether dealing with distributed components of an application on an intranet or over the Internet, latency between the distributed components impacts reliability. In the time it takes a message to be transmitted from sender to receiver, all manner of things can go wrong. The network could become partitioned due to a router failure or a severed or disconnected network cable. The destination host could crash. The process in which the receiving component is running could terminate.

When considering latency in terms of a round-trip request and response (or stimulus and response), latency becomes an even greater concern. If the service provider is overwhelmed with requests to process, you can often count latency in seconds, if not minutes. Processing a request can involve significant computational resources, or it might depend on another distributed component. Processing a request might even require manual intervention in some cases. The longer the latency, the greater the potential for something to go wrong, leaving the distributed application in an inconsistent state.
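One practical consequence is that a consumer usually cannot wait indefinitely for a response, and when its deadline expires it does not know whether the request was processed. The sketch below, using only the Python standard library, shows this "unknown outcome" case. The function `slow_service` is a hypothetical stand-in for a provider, and `call_with_deadline` is an illustrative helper, not part of any Web services toolkit.

```python
import concurrent.futures
import time


def slow_service(delay_seconds):
    """Hypothetical stand-in for a provider that takes a while to respond."""
    time.sleep(delay_seconds)
    return "response"


def call_with_deadline(func, timeout_seconds, *args):
    """Return ('ok', value) or ('unknown', None) if the deadline passes."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(func, *args)
        try:
            return "ok", future.result(timeout=timeout_seconds)
        except concurrent.futures.TimeoutError:
            # The request may still complete on the provider side; the
            # consumer simply cannot tell. (In this simulation the pool
            # also waits for the worker thread to finish on exit.)
            return "unknown", None


fast = call_with_deadline(slow_service, 1.0, 0.01)
slow = call_with_deadline(slow_service, 0.05, 0.3)
print(fast)
print(slow)
```

The second call is the interesting one: the work finishes on the "provider" side, yet the consumer has already given up and must decide whether to retry, query for status, or report a failure.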

10.1.3. There Is One Administrator

Even in an intranet context, this fallacy often rears its ugly head. Although your IT department might assign a single group to be responsible for the network, in the context of a Web service, many administrators typically exist. In most cases, these administrators have rather parochial interests. There might be one administrator for each database, one for each of the application servers that host the Web service components, one for the demilitarized zone (DMZ) and firewall complexes, one for the server room, and so on. Administrators might not always coordinate their activities with your Web service's needs. All of this can lead to circumstances in which certain components of a Web service implementation become unavailable (during an upgrade or routine maintenance, for example), often at critical and unexpected times.

Expanding the scope of Web services to the context of the Internet, things get even more interesting and complicated. You can no more expect to coordinate activities related to the components of a Web service when the administrator(s) of those components are employed by your business partners than you can expect to win the lottery!

Therefore, you need to design your Web service so that it can recover when a distributed component becomes unavailable, whether it was brought down for routine maintenance or by a failure.
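One common recovery technique, shown here as a minimal sketch rather than a recommended implementation, is to retry a failed call with an increasing delay so that a component undergoing brief maintenance has time to return. The exception `ServiceUnavailable`, the helper `call_with_retry`, and the `flaky` operation are all hypothetical names introduced for this example.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical error raised when a component cannot be reached."""


def call_with_retry(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call operation(), retrying with exponential backoff on unavailability."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ServiceUnavailable:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)  # 0.5, 1.0, 2.0, ...


# Simulated component that is down for the first two calls.
outcomes = iter([ServiceUnavailable(), ServiceUnavailable(), "done"])


def flaky():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


delays = []  # record the delays instead of actually sleeping
result = call_with_retry(flaky, sleep=delays.append)
print(result, delays)
```

Retrying blindly is safe only for idempotent operations; for anything else, the provider must be able to recognize a repeated request.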