6.2 Making Applications More Robust | The OReilly Java Authors


The first set of practices in this chapter dealt with how to encode objects that will be sent to another process. This section is very different; it contains practices for making an application more robust. The practices in this section aren't really about "design" in the classic sense. Nothing in these practices will help you determine the object composition for your business logic or write a better remote interface. Instead, they mostly deal with two related topics: connection maintenance and failure handling.

Connection maintenance refers to practices that make sure the programs in a distributed application can connect with each other and send method calls. Failure handling refers to practices that enable an application to recover as gracefully as possible from a connection failure. You might think there's not a lot you can do when an application crashes (or starts performing badly), but there are a few simple practices that will help you diagnose problems and provide end users with a better experience.

6.2.1 Include Logic for Retrying Remote Calls

Everyone knows that networks fail. Failures can range from small-scale and transient to massive and persistent. Obviously, in the case of massive and persistent network failures, a distributed application will not work. But your application can be built to survive small-scale and transient network failures.

One of the best things you can do to make your application more robust is implement a retry strategy. That is, whenever you make a remote call, wrap it in a loop based on catching RemoteException, as in the following code snippet:

 public void wrapRemoteCallInRetryLoop() {
     int numberOfTries = 0;
     while (numberOfTries < MAXIMUM_NUMBER_OF_TRIES) {
         numberOfTries++;
         try {
             doActualWork();
             break; // success: leave the retry loop
         } catch (RemoteException exceptionThrownByRMIInfrastructure) {
             reportRemoteException(exceptionThrownByRMIInfrastructure);
             try {
                 // pause before the next attempt
                 Thread.sleep(REMOTE_CALL_RETRY_DELAY);
             } catch (InterruptedException ignored) {}
         }
     }
 }

This method is simply an encapsulation of a loop. It relies on two other methods, doActualWork( ) and reportRemoteException( ), to make the remote method call and to report failures in communicating with the remote server, respectively. This code also assumes that RemoteException indicates a network failure, and that retrying the method call a small (and fixed) number of times is a reasonable strategy when the RMI infrastructure throws an instance of RemoteException.

Note that in some cases this is not the correct behavior. For example, if the failure is a timeout, this could lead to a very bad user experience: no user wants to wait through a single timeout, much less three consecutive timeouts. And there are exceptions, such as NoSuchObjectException, which subclass RemoteException and for which retrying the method call is usually pointless. (NoSuchObjectException usually indicates that the client has a stub for a server that no longer exists, in which case using the same stub and trying the call again makes no sense.) I'll address all these objections in later practices.
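One detail the retry loop above leaves open is what happens when every attempt fails: the loop simply ends and the caller never finds out. The following is a minimal sketch of one way to tighten that up, not a definitive implementation. The RemoteCall interface and the RetryingCaller class are hypothetical names introduced for this illustration; only RemoteException and NoSuchObjectException come from java.rmi.

```java
import java.rmi.NoSuchObjectException;
import java.rmi.RemoteException;

/** Hypothetical callback for "one remote call"; not part of the RMI API. */
interface RemoteCall<T> {
    T invoke() throws RemoteException;
}

/** Hypothetical helper; a sketch of a retry loop that reports final failure. */
final class RetryingCaller {
    private RetryingCaller() {}

    /**
     * Tries the call up to maxTries times, sleeping delayMillis between
     * attempts, and rethrows the last RemoteException if all attempts fail.
     */
    static <T> T callWithRetries(RemoteCall<T> call, int maxTries, long delayMillis)
            throws RemoteException {
        if (maxTries < 1) {
            throw new IllegalArgumentException("maxTries must be at least 1");
        }
        RemoteException lastFailure = null;
        for (int attempt = 1; attempt <= maxTries; attempt++) {
            try {
                return call.invoke();
            } catch (NoSuchObjectException staleStub) {
                // The stub refers to a server object that no longer exists,
                // so retrying with the same stub cannot succeed.
                throw staleStub;
            } catch (RemoteException e) {
                lastFailure = e;
                if (attempt == maxTries) {
                    break; // no point sleeping after the final attempt
                }
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException interrupted) {
                    // Keep the interrupt visible to the caller and stop retrying.
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        throw lastFailure;
    }
}
```

A caller would write something like RetryingCaller.callWithRetries(() -> server.performAction(), 3, 500L), where server is a stub. Whether three tries and a half-second delay are sensible depends entirely on the application.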

6.2.2 Associate Unique Identifiers with Requests

Once you decide to implement retry logic, you need to worry about partial failures. For example, consider the following scenario:

The client makes a remote method call.
The server receives the call, handles it, and returns the appropriate answer.
The network hiccups.
The client gets a RemoteException and tries again.

Sometimes this is harmless. For example, clients frequently fetch information from a server to display to a user. As long as the request doesn't change any state on the server, it's OK to simply make the request twice. If, however, the request changes state on the server (for example, depositing money to an account), it's usually important that the request not be processed twice. That is, the client needs to make the request a second time, and the server needs to return the correct answer. But the server shouldn't actually perform the requested action twice.

In complicated cases, the client application might need to use a distributed transaction manager to actually make sure that a set of related calls to the server will succeed or fail atomically. ^[5] But in many cases, it simply suffices to associate a unique identifier with the method call. For example, the following code uses the VMID class (from the java.rmi.dgc package) to define a RequestIdentifier class:

^[5] Although, in such cases, you should use a server-side proxy to handle the transaction.

 import java.io.Serializable;
 import java.rmi.dgc.VMID;
 public final class RequestIdentifier implements Serializable {
     // The counter and the VMID are per-JVM; together they identify a request.
     private static int REQUEST_NUMBER_COUNTER;
     private static final VMID THE_VM = new VMID();
     private final int _requestNumber;
     private final VMID _sourceVM;
     // Synchronized so that concurrent callers never share a request number.
     public static synchronized RequestIdentifier getRequestIdentifier() {
         return new RequestIdentifier();
     }
     private RequestIdentifier() {
         _requestNumber = REQUEST_NUMBER_COUNTER++;
         _sourceVM = THE_VM;
     }
     public boolean equals(Object object) {
         if (!(object instanceof RequestIdentifier)) {
             return false;
         }
         RequestIdentifier otherRequestIdentifier = (RequestIdentifier) object;
         if (_requestNumber != otherRequestIdentifier._requestNumber) {
             return false;
         }
         return _sourceVM.equals(otherRequestIdentifier._sourceVM);
     }
     public int hashCode() {
         return _sourceVM.hashCode() * 31 + _requestNumber;
     }
 }

If the remote calls include an instance of RequestIdentifier as an additional argument, the retry loop is much safer: the server can simply check whether it has already handled this request and respond appropriately.

Performance costs are associated with the use of request identifiers. Some are obvious: for example, the instance of RequestIdentifier must be created on the client side and sent over the wire to the server. But some are more subtle. The server probably can't store a hashtable of all requests it has ever handled in memory. And the performance costs associated with checking against a database for each remote method call are probably intolerable (especially given that retries will be rare).

The usual strategy, one which is more than good enough for most cases, is for the server to track recent requests and assume that if a request identifier isn't stored in the recent requests, it hasn't been handled yet. For example, servers can usually store the last 30 minutes of request identifiers in an in-memory data structure. ^[6]

^[6] I recommend using a hashbelt. See the data expiration articles I wrote for www.onjava.com for more details.
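To make the "track recent requests" strategy concrete, here is a minimal sketch under stated assumptions. It is not a hashbelt: a plain insertion-ordered map stands in for one, the RecentRequests class is a hypothetical name, and the caller passes in the current time (which also assumes time never runs backward between calls). Stored results must be non-null so that a null return can mean "not handled recently."

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical server-side record of recently handled requests. */
final class RecentRequests<K, V> {
    /** A stored result plus the time the request was handled. */
    private static final class Handled<V> {
        final V result;
        final long handledAtMillis;

        Handled(V result, long handledAtMillis) {
            this.result = result;
            this.handledAtMillis = handledAtMillis;
        }
    }

    private final long windowMillis;
    // Insertion order doubles as age order, so expiry only scans the front.
    private final Map<K, Handled<V>> handled = new LinkedHashMap<>();

    RecentRequests(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Returns the saved result if the request was handled recently, else null. */
    synchronized V lookup(K requestId, long nowMillis) {
        expire(nowMillis);
        Handled<V> entry = handled.get(requestId);
        return entry == null ? null : entry.result;
    }

    /** Remembers the result of a request that has just been handled. */
    synchronized void record(K requestId, V result, long nowMillis) {
        Objects.requireNonNull(result, "result");
        expire(nowMillis);
        handled.remove(requestId); // re-insert so the entry moves to the back
        handled.put(requestId, new Handled<>(result, nowMillis));
    }

    /** Drops entries that are at least windowMillis old. */
    private void expire(long nowMillis) {
        Iterator<Handled<V>> oldestFirst = handled.values().iterator();
        while (oldestFirst.hasNext()) {
            if (nowMillis - oldestFirst.next().handledAtMillis < windowMillis) {
                break; // everything after this entry is newer still
            }
            oldestFirst.remove();
        }
    }
}
```

A remote method would call lookup( ) with the incoming RequestIdentifier first, return the saved result if there is one, and otherwise do the work and call record( ). A 30-minute window corresponds to a windowMillis of 1,800,000.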

If you find you need more assurances than tracking recent requests provides, you probably should be using a message queueing system and not RMI.

6.2.3 Distinguish Between Network Lag Time and Server Load

When servers are busy, requests take a longer time to handle. Most of the time, the client simply waits longer. But sometimes, when servers are very busy, a request will simply time out, and an instance of RemoteException will be thrown. In this latter case, retry logic turns out to be fairly painful: if the server is too busy and cannot handle additional requests, the last thing in the world the client should do is send the request again, especially if the request isn't very important, or can wait awhile.

One way to deal with this is to use what I call the bouncer pattern. The idea is to define a new subclass of Exception, called ServerIsBusy, add it to all the remote methods, and then throw instances of ServerIsBusy whenever the server is too busy to handle additional requests.

In the simplest implementation, the server simply keeps track of the number of pending requests and throws an instance of ServerIsBusy whenever there are too many pending requests, as in the following implementation of the Bouncer class:

 public class Bouncer {
     private static final int MAX_NUMBER_OF_REQUESTS = 73;
     private static int CURRENT_NUMBER_OF_REQUESTS;
     // One preallocated instance avoids building an exception per rejection.
     private static final ServerIsBusy REUSABLE_EXCEPTION = new ServerIsBusy();
     public static synchronized void checkNumberOfRequestsLimit() throws ServerIsBusy {
         if (CURRENT_NUMBER_OF_REQUESTS >= MAX_NUMBER_OF_REQUESTS) {
             throw REUSABLE_EXCEPTION;
         }
         CURRENT_NUMBER_OF_REQUESTS++;
     }
     public static synchronized void decrementNumberOfActiveRequests() {
         CURRENT_NUMBER_OF_REQUESTS--;
     }
 }

Once you've defined a bouncer class, you need to implement the check in all your remote methods. The code transformation is simple. A method such as:

 public foo(arguments) throws exception-list {
     method body
 }

is rewritten as:

 public foo(arguments) throws exception-list, ServerIsBusy {
     Bouncer.checkNumberOfRequestsLimit();
     try {
         method body
     } finally {
         Bouncer.decrementNumberOfActiveRequests();
     }
 }

Adding this check to your server code has two main benefits. The first is that it enables the client application to distinguish between network failures and when the server is simply too busy. And the second is that it enables you to implement much friendlier client applications. In the simplest case, putting up a dialog box saying "The server is very busy right now, and as a result, this application won't perform very well" will save users a fair amount of frustration. More complicated clients might switch to a secondary server.
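On the client side, the payoff is a catch block that can tell the two situations apart. The following is a rough sketch, not a prescribed design. ReportService, ReportClient, and the message strings are hypothetical names made up for the illustration, and ServerIsBusy is declared here only so that the example compiles on its own.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;

/** Declared here only to keep the sketch self-contained. */
class ServerIsBusy extends Exception {
    private static final long serialVersionUID = 1L;
}

/** Hypothetical remote interface whose method can be "bounced". */
interface ReportService extends Remote {
    String fetchReport() throws RemoteException, ServerIsBusy;
}

/** Hypothetical client code that reacts differently to each kind of failure. */
final class ReportClient {
    static final String BUSY_MESSAGE =
        "The server is very busy right now. Please try again in a few minutes.";
    static final String UNREACHABLE_MESSAGE =
        "The server could not be reached. Please check your network connection.";

    private ReportClient() {}

    static String fetchFrom(ReportService primary, ReportService secondary) {
        try {
            return primary.fetchReport();
        } catch (ServerIsBusy primaryIsBusy) {
            // The primary is reachable but overloaded. Sending the same
            // request again would only add load, so try the secondary once.
            try {
                return secondary.fetchReport();
            } catch (ServerIsBusy | RemoteException secondaryFailed) {
                return BUSY_MESSAGE;
            }
        } catch (RemoteException networkFailure) {
            // Something went wrong between stub and skeleton. This is where
            // retry logic, if any, would belong.
            return UNREACHABLE_MESSAGE;
        }
    }
}
```

Returning a message string keeps the sketch short; a real client would more likely update its user interface or hand the failure to whatever reporting mechanism it already uses.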

It might seem tedious to implement this logic inside every single method that can be called remotely. That's because it is tedious. It's also error-prone. The best solution to this problem is to use aspects to insert this code at the appropriate places. To learn more about aspects, see the AspectJ web site at http://www.aspectj.org.

6.2.4 Wrap RMI Calls in Command Objects

Suppose you're wrapping each remote method call in a retry loop, distinguishing the different types of remote exceptions, and stamping remote requests with identifiers. Then simple remote method invocations such as server.performAction( ), in which server is a stub associated with some remote object, balloon to 20 or 30 lines of code, most of which simply deal with the complexities of failure handling. This is bad for two reasons. The first is that a simple and easy-to-read line of business logic has become cluttered with extraneous things. And the second is that a lot of code is being written over and over again (the failure-handling code is boilerplate code).

The solution to both of these problems is to encapsulate all the code you've been adding inside a single class. For example, you could define a new class called SpecificCallToServer which encapsulates all this code. And then server.performAction( ) becomes:

 (new SpecificCallToServer( . . . )).makeRemoteCall()

This is a little less readable than the original code, but it's still very readable. And all the logic dealing with the network infrastructure has been neatly encapsulated into a single class, SpecificCallToServer. If SpecificCallToServer simply extends an abstract base class (named something like RemoteMethodCall), you've made the client application more readable, and only written the code that deals with the complexities of making the remote method call once.
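As a rough illustration of the shape such a base class might take, consider the sketch below. It is an assumption-laden outline rather than a full command object framework: AccountServer and GetBalanceCall are hypothetical, the retry policy is deliberately bare (no delay, no request identifiers, no exception classification), and only the names RemoteMethodCall and makeRemoteCall( ) come from the text.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;

/** Hypothetical remote interface, used only for illustration. */
interface AccountServer extends Remote {
    long getBalance(String accountId) throws RemoteException;
}

/** Base class that owns the failure handling, written exactly once. */
abstract class RemoteMethodCall<T> {
    private static final int MAXIMUM_NUMBER_OF_TRIES = 3; // assumed value

    /** Subclasses make exactly one remote call here and nothing else. */
    protected abstract T performRemoteCall() throws RemoteException;

    /** Runs the call, retrying on RemoteException, and rethrows the last failure. */
    public final T makeRemoteCall() throws RemoteException {
        RemoteException lastFailure = null;
        for (int tries = 0; tries < MAXIMUM_NUMBER_OF_TRIES; tries++) {
            try {
                return performRemoteCall();
            } catch (RemoteException e) {
                lastFailure = e;
            }
        }
        throw lastFailure;
    }
}

/** One specific call to the server: all it adds is the business logic. */
final class GetBalanceCall extends RemoteMethodCall<Long> {
    private final AccountServer server;
    private final String accountId;

    GetBalanceCall(AccountServer server, String accountId) {
        this.server = server;
        this.accountId = accountId;
    }

    @Override
    protected Long performRemoteCall() throws RemoteException {
        return server.getBalance(accountId);
    }
}
```

Client code then reads new GetBalanceCall(server, "acct-1").makeRemoteCall( ), and any improvement to the failure handling in the base class reaches every call at once.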

For more information on how to design and implement a command object framework, see the series of command object articles I wrote for onjava.com.

Wrapping remote calls in command objects also facilitates many of the other practices in this chapter. For example, using command objects makes it easier for the client to use a remote stub cache.

6.2.5 Consider Using a Naming Service

A naming service, such as the RMI registry or a JNDI service provider, provides a very simple piece of functionality: it lets a client application pass in a logical name (such as "BankAccountServer") and get back a stub to the requested server.

This level of indirection is incredibly useful. It makes writing the client code much simpler, it means that you don't have to figure out another way to get stubs to the servers (which isn't so hard: RemoteStub does implement Serializable), and it allows you to easily move servers to different machines.

In short, using a naming service makes it much easier to write and deploy applications.

Using a naming service also makes it possible to use the Unreferenced interface reliably. We'll talk more about this later in the chapter.

6.2.6 Don't Throw RemoteException in Your Server Code

The javadocs for RemoteException say the following:

A RemoteException is the common superclass for a number of communication-related exceptions that may occur during the execution of a remote method call. Each method of a remote interface, an interface that extends java.rmi.Remote, must list RemoteException in its throws clause.

This might make it seem like it's OK, and maybe even a good thing, for your server-side code to throw instances of RemoteException. It's certainly easy, if you're working on a server and discover a new exceptional condition, to add a line of code such as the following:

 throw new RemoteException("You can't deposit a negative amount of money");

It might even seem like good programming practice; after all, the client code already catches RemoteException. But it's a very bad idea to use RemoteException in this way.

To understand why, you need to understand what RemoteException really means. The real meaning of RemoteException is that something has gone wrong between your client code and server code. That is, your client made a method call on a stub. Your server code is expecting to receive a method invocation via its skeleton. If something goes wrong between that call to the stub and the resulting invocation made by the skeleton, it will be signalled by an instance of RemoteException. Exceptions that happen within the server should be signalled by instances of some other exception class that doesn't extend RemoteException. There are two reasons for this. The practical one is that it's too easy for a client to misunderstand a RemoteException. For example, the retry loop shown earlier would try to invoke the remote method again. And the more abstract reason is that you should really be declaring the types of exceptions the server is throwing so that the client can react appropriately. Throwing generic exceptions is almost always a bad idea.
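For the negative-deposit case, the alternative might look something like the following sketch. The names NegativeAmountException, Account, and AccountImpl are hypothetical, and the implementation class is left unexported so that the example stays self-contained; a real server would export it through the RMI runtime.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;

/** Application-level failure; deliberately NOT a subclass of RemoteException. */
class NegativeAmountException extends Exception {
    private static final long serialVersionUID = 1L;

    NegativeAmountException(String message) {
        super(message);
    }
}

/** Hypothetical remote interface that declares its business exception. */
interface Account extends Remote {
    void deposit(long amountInCents) throws RemoteException, NegativeAmountException;

    long getBalanceInCents() throws RemoteException;
}

/** Server-side logic: it never creates a RemoteException itself. */
final class AccountImpl implements Account {
    private long balanceInCents;

    @Override
    public synchronized void deposit(long amountInCents) throws NegativeAmountException {
        if (amountInCents < 0) {
            throw new NegativeAmountException(
                "You can't deposit a negative amount of money");
        }
        balanceInCents += amountInCents;
    }

    @Override
    public synchronized long getBalanceInCents() {
        return balanceInCents;
    }
}
```

A client that catches NegativeAmountException knows the call reached the server and was rejected, so it can report the problem to the user instead of retrying.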

6.2.7 Distinguish Between Different Types of Remote Exceptions

What it means

The client and the server have different, and incompatible, versions of the codebase.

You shouldn't panic when you look at this list. It's not that complicated, and once you actually start thinking about the different types of RemoteExceptions, most of the information here will become second nature to you. The important point here is that these nine exceptions cover about 95% of the instances of RemoteException thrown in practice. And they are all thrown at different times, for very different reasons. If you write code that simply catches instances of RemoteException, you might be missing an opportunity to make your code more robust, better at reporting urgent problems to someone who can fix them, and more user-friendly.

Note that other exceptions are also thrown during the course of RMI calls. For example, java.net.BindException is sometimes thrown on the server side (if a specified port is already in use), and java.lang.ClassNotFoundException can be thrown on either the client or the server (it's usually thrown on the client side, when the stub classes haven't been deployed correctly).
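In code, distinguishing the subclasses usually comes down to a chain of instanceof tests. The sketch below shows the idea for a handful of the standard java.rmi exception classes. The FailureKind categories, the RemoteFailures class, and above all the mapping from exception to category are assumptions made for this illustration, not an authoritative classification; for instance, a MarshalException can also be caused by a broken connection.

```java
import java.rmi.ConnectException;
import java.rmi.ConnectIOException;
import java.rmi.MarshalException;
import java.rmi.NoSuchObjectException;
import java.rmi.RemoteException;
import java.rmi.ServerError;
import java.rmi.ServerException;
import java.rmi.UnknownHostException;
import java.rmi.UnmarshalException;

/** Hypothetical categories a client might care about. */
enum FailureKind {
    STALE_STUB,            // fetch a fresh stub before trying again
    NETWORK_UNAVAILABLE,   // retrying after a delay might help
    SERIALIZATION_PROBLEM, // often a deployment issue; report it
    SERVER_SIDE_FAILURE,   // the call arrived, then failed on the server
    UNCLASSIFIED
}

/** Hypothetical helper that sorts a RemoteException into a category. */
final class RemoteFailures {
    private RemoteFailures() {}

    static FailureKind classify(RemoteException e) {
        if (e instanceof NoSuchObjectException) {
            return FailureKind.STALE_STUB;
        }
        if (e instanceof ConnectException
                || e instanceof ConnectIOException
                || e instanceof UnknownHostException) {
            return FailureKind.NETWORK_UNAVAILABLE;
        }
        if (e instanceof MarshalException || e instanceof UnmarshalException) {
            return FailureKind.SERIALIZATION_PROBLEM;
        }
        if (e instanceof ServerException || e instanceof ServerError) {
            return FailureKind.SERVER_SIDE_FAILURE;
        }
        return FailureKind.UNCLASSIFIED;
    }
}
```

A retry loop could consult classify( ) and retry only NETWORK_UNAVAILABLE failures, while the other categories are logged or reported right away.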

6.2.8 Use the Unreferenced Interface to Clean Up Allocated Server State

The distributed garbage collector is a wonderful piece of code. It works in a very straightforward manner: a client gets a lease on a particular server object. The lease has a specific duration, and the client is responsible for renewing the lease before it expires. If the lease expires, and the client hasn't renewed the lease, the server JVM is allowed to garbage-collect the server object (as long as no other clients have leases against that particular object).

If a server implements the Unreferenced interface (which contains a single method, unreferenced( )), the server will be notified via a call to unreferenced( ) that there are no valid leases against the server.

It's important to note that any active instance of a stub, in any JVM, will automatically try to connect to the server and maintain a lease. This means that, for example, if the server is bound into the RMI registry, the registry will keep the server alive. (The RMI registry basically stores the stub in a hashtable. The stub keeps renewing its lease.)

In turn, this means that if you're using a naming service to get instances of stubs, no other process can actually get a stub to a server if unreferenced has been called (unreferenced will be called only if the server is no longer bound into any naming services).

All of this makes the unreferenced method an ideal place to release server-side resources and shut down the server object gracefully.
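A server object that cleans up after itself might be sketched as follows. ShoppingCart and ShoppingCartImpl are hypothetical names, the AutoCloseable stands in for whatever resource the server actually holds (a database connection, a file, a cache entry), and the export step is omitted to keep the example self-contained.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.server.Unreferenced;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical remote interface for a per-client server object. */
interface ShoppingCart extends Remote {
    void addItem(String sku) throws RemoteException;
}

/** Releases its server-side resource when no client holds a lease. */
final class ShoppingCartImpl implements ShoppingCart, Unreferenced {
    private final AutoCloseable backingResource;
    private final List<String> items = new ArrayList<>();
    private boolean released;

    ShoppingCartImpl(AutoCloseable backingResource) {
        this.backingResource = backingResource;
    }

    @Override
    public synchronized void addItem(String sku) {
        if (released) {
            throw new IllegalStateException("cart has already been released");
        }
        items.add(sku);
    }

    /** Called by the RMI runtime once there are no valid leases left. */
    @Override
    public synchronized void unreferenced() {
        if (released) {
            return; // tolerate being called more than once
        }
        released = true;
        try {
            backingResource.close();
        } catch (Exception e) {
            // No client is left to tell, so record the problem locally.
            System.err.println("Cleanup failed: " + e);
        }
    }

    synchronized boolean isReleased() {
        return released;
    }
}
```

Guarding against a second call is a defensive choice in this sketch, not something the interface demands.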

6.2.9 Always Configure the Distributed Garbage Collector

By default, a lease should last 10 minutes, and clients should renew every 5 minutes (clients attempt to renew when a lease is halfway expired). The problem is that, in a wide variety of production scenarios, the default values don't work very well. Using JDK 1.3, I've experienced intermittent distributed garbage-collection failures (in which a client has a stub to a server that's been garbage-collected) when the network is congested or starts losing packets.

Fortunately, you can change the duration of a lease by setting the value of java.rmi.dgc.leaseValue. This parameter, which is set on the server, specifies the duration of a typical lease. The trade-off is simple: smaller values for java.rmi.dgc.leaseValue mean shorter lease durations, and hence quicker notification when a server becomes unreferenced.

But smaller values also mean a greater chance of a false positive: if a client has trouble renewing a lease, giving the client a larger window in which to renew the lease (that is, before the client's lease is expired and unreferenced is called) is often helpful. In particular, larger values of java.rmi.dgc.leaseValue will make your system more robust when the network is flaky. I tend to use at least 30 minutes for java.rmi.dgc.leaseValue.
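The property is expressed in milliseconds, so a 30-minute lease is 1,800,000. It is usually set on the server's command line, as in java -Djava.rmi.dgc.leaseValue=1800000 followed by the usual arguments. It can also be set programmatically, as in the sketch below. DgcSettings is a hypothetical helper, and the sketch assumes it runs at the very start of the server's main( ), before any remote object is exported, because the RMI runtime may read the property only once.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for configuring the distributed garbage collector. */
final class DgcSettings {
    static final String LEASE_PROPERTY = "java.rmi.dgc.leaseValue";

    private DgcSettings() {}

    /**
     * Sets the lease duration. Call this before exporting any remote object;
     * changing the property later is not expected to have any effect.
     */
    static void useLeaseMinutes(int minutes) {
        if (minutes <= 0) {
            throw new IllegalArgumentException("minutes must be positive");
        }
        long leaseMillis = TimeUnit.MINUTES.toMillis(minutes);
        System.setProperty(LEASE_PROPERTY, Long.toString(leaseMillis));
    }
}
```

Calling DgcSettings.useLeaseMinutes(30) as the first statement of the server's main( ) has the same intent as the command-line flag; the flag is the safer choice when you can't be sure which code runs first.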

You might also think that longer leases result in less network traffic (because there are fewer renewals). This is true, but the amount of bandwidth you save is so small that it's really not worth thinking about.
