Section 9.7. Playback Failures | Programming WCF Services

9.7. Playback Failures

Even after successful delivery, a message may still fail during playback to the service. Such a failure typically aborts the playback transaction, which returns the message to the service queue. WCF then detects the message in the queue and retries. If the next call fails too, the message goes back to the queue again, and so on. Retrying continuously this way is often unacceptable. If the motivation for the queued service in the first place was load leveling, the automatic retries will put considerable stress on the service. You need a smart failure-handling scheme that deals with the case when the call never succeeds (and, of course, defines "never" in practical terms). The failure handling determines after how many attempts to give up, after how long to give up, and even how often to try. Different systems need different retry strategies, and differ in their sensitivity to the additional thrashing and in their probability of success. For example, retrying 10 times at one-hour intervals is not the same strategy as retrying 10 times 1 minute apart, or as retrying 5 times, each time with a batch of 2 successive attempts, with the batches separated by a day. In addition, once you have given up on retries, what should you do with the failed message, and what should you acknowledge to its sender?

9.7.1. Poison Messages

Transactional messaging systems are inherently susceptible to repeated failure: every aborted playback returns the message to the queue for another attempt, and the resulting retries can bring the system to its knees. Messages that continuously fail playback are referred to as poison messages, because they poison the system with futile retries. Transactional messaging systems must actively detect and eliminate poison messages. Since there is no telling whether just one more retry will actually succeed, you can use the following simple heuristic: the more often the message has failed, the higher the likelihood that it will fail again. For example, if the message has failed just once, then retrying seems reasonable. If the message has already failed 1,000 times, it is very likely to fail the 1,001st time, and so it is pointless to try again. What exactly constitutes pointless (or just wasteful) is obviously application-specific, but it is a configurable decision. MsmqBindingBase offers a number of properties governing the handling of playback failures:

public abstract class MsmqBindingBase : Binding,...
{
   //Poison message handling
   public int ReceiveRetryCount
   {get;set;}
   public int MaxRetryCycles
   {get;set;}
   public TimeSpan RetryCycleDelay
   {get;set;}
   public ReceiveErrorHandling ReceiveErrorHandling
   {get;set;}
   //More members
}

9.7.2. Poison Message Handling in MSMQ 4.0

With MSMQ 4.0 (available on Windows Vista and Windows Server 2008), WCF retries playing back a failed message in a series of batches. WCF provides each queued endpoint with a retry queue and a poison messages queue. After all the calls in a batch have failed, the message does not return to the endpoint queue. Instead, it goes to the retry queue. Once the message is deemed poisonous, you may have WCF move it to the poison queue.

9.7.2.1. Retry batches

In each batch, WCF immediately retries up to ReceiveRetryCount times after the first call failure. ReceiveRetryCount defaults to five retries, for a total of six attempts including the first. After a batch has failed, the message goes to the retry queue. After a delay of RetryCycleDelay, the message is moved from the retry queue back to the endpoint queue for another retry batch. RetryCycleDelay is a TimeSpan and defaults to 30 minutes. If that batch fails as well, the message goes back to the retry queue, where it waits for the delay to expire before it is tried again. Obviously, this cannot go on indefinitely. The MaxRetryCycles property controls the maximum number of cycles to try, and it defaults to only two cycles. After MaxRetryCycles retry batches, the message is considered a poison message. When configuring nondefault values for MaxRetryCycles, I recommend setting its value in direct proportion to RetryCycleDelay: the longer the delay, the more tolerant your system will be of additional retry batches, because the overall stress is spread over a longer period of time. With a short RetryCycleDelay you should minimize the number of allowed batches, because otherwise the retries start to approximate continuous thrashing.
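To make the arithmetic of retry batches concrete, here is a minimal sketch that computes the worst-case number of playback attempts and the time a message spends waiting in the retry queue. The RetryMath class and its method names are hypothetical helpers, not part of WCF. The sketch assumes that the initial batch is followed by MaxRetryCycles further batches; if you count the initial batch as one of the cycles, drop the + 1 in TotalAttempts().

```csharp
using System;

// Hypothetical helper for reasoning about retry settings; not a WCF type.
static class RetryMath
{
   // One batch is the first call plus ReceiveRetryCount immediate retries.
   public static int AttemptsPerBatch(int receiveRetryCount)
   {
      return receiveRetryCount + 1;
   }

   // Assumption: the initial batch is followed by maxRetryCycles retry batches.
   public static int TotalAttempts(int receiveRetryCount, int maxRetryCycles)
   {
      return AttemptsPerBatch(receiveRetryCount) * (maxRetryCycles + 1);
   }

   // The message waits RetryCycleDelay in the retry queue before each retry batch.
   public static TimeSpan TotalDelay(int maxRetryCycles, TimeSpan retryCycleDelay)
   {
      return TimeSpan.FromTicks(retryCycleDelay.Ticks * maxRetryCycles);
   }
}
```

Under that assumption, the default settings (ReceiveRetryCount of 5, MaxRetryCycles of 2, RetryCycleDelay of 30 minutes) give TotalAttempts(5, 2) = 18 attempts, with TotalDelay(2, TimeSpan.FromMinutes(30)) = one hour spent in the retry queue, before the message is deemed poisonous.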

Finally, the ReceiveErrorHandling property governs what to do after the last retry fails and the message is deemed poisonous. The property is of the enum type ReceiveErrorHandling, defined as:

public enum ReceiveErrorHandling
{
   Fault,
   Drop,
   Reject,
   Move
}

9.7.2.2. ReceiveErrorHandling.Fault

The Fault value considers the poison message as a catastrophic failure. It actively faults the MSMQ channel and the service host. By doing so, the service will not be able to process any other messages, be they from a queued client or a regular connected client. The poison message will remain in the endpoint queue and must be removed from it explicitly by the administrator or by some compensating logic. In order to continue processing client calls of any sort, you must either recycle the hosting process, or open a new host (after you have removed the poison message from the queue). While you could install an error-handling extension (as discussed in Chapter 6) and do some of that work, in practice there is no avoiding involving the application administrator. ReceiveErrorHandling.Fault is the default value of the ReceiveErrorHandling property. No acknowledgement of any sort is sent to the sender.

9.7.2.3. ReceiveErrorHandling.Drop

The Drop value, as its name implies, silently ignores the poison message by dropping it and letting the service keep processing other messages. You should configure ReceiveErrorHandling.Drop if you have a high tolerance for both errors and retries. If the message is not crucial, or if the service is merely nice to have, then dropping and continuing is acceptable. Dropping the message does allow for retries, but conceptually you should not configure too many of them: if you care enough about the message to retry it many times, you should not simply drop it after the last failure. Configuring ReceiveErrorHandling.Drop also sends an ACK to the sender, so from the sender's perspective the message was delivered and processed successfully.

9.7.2.4. ReceiveErrorHandling.Reject

The Reject value actively rejects the poison message and refuses to have anything to do with it. Similar to ReceiveErrorHandling.Drop, it drops the message, but it also sends a NACK to the sender, thus signaling ultimate delivery and processing failure. The sender responds by moving the message to the sender's dead-letter queue.

9.7.2.5. ReceiveErrorHandling.Move

The Move value is probably the best and most practical value of them all. It moves the message to the dedicated poison messages queue, and it does not send back an ACK or a NACK. The message is acknowledged only after it has been processed out of the poison messages queue.

9.7.2.6. Configuration sample

Example 9-19 shows a configuration section from the host config file, setting poison message handling on MSMQ 4, and Figure 9-9 illustrates graphically the resulting behavior in the case of a poison message.

Example 9-19. Poison message handling on MSMQ 4

<bindings>
   <netMsmqBinding>
      <binding name = "PoisonMessageHandling"
         receiveRetryCount    = "2"
         retryCycleDelay      = "00:05:00"
         maxRetryCycles       = "3"
         receiveErrorHandling = "Move"
      />
   </netMsmqBinding>
</bindings>

Figure 9-9. Poison message handling of Example 9-19
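The same settings can also be applied programmatically. The following is a minimal sketch of a rough equivalent of Example 9-19, using the MsmqBindingBase properties listed earlier as exposed by NetMsmqBinding. The PoisonHandlingSetup class and the CreateBinding() method are hypothetical names, and the values simply mirror the config file rather than recommend anything.

```csharp
using System;
using System.ServiceModel;

// Hypothetical helper; only the binding properties come from WCF.
static class PoisonHandlingSetup
{
   public static NetMsmqBinding CreateBinding()
   {
      NetMsmqBinding binding = new NetMsmqBinding();

      // Two immediate retries after the first failure in each batch
      binding.ReceiveRetryCount    = 2;
      // Wait five minutes in the retry queue between batches
      binding.RetryCycleDelay      = TimeSpan.FromMinutes(5);
      // Give up after three retry cycles
      binding.MaxRetryCycles       = 3;
      // Move the poison message to the poison messages queue (MSMQ 4.0)
      binding.ReceiveErrorHandling = ReceiveErrorHandling.Move;

      return binding;
   }
}
```

You would then pass the returned binding to the host when adding the queued endpoint, instead of referencing a named binding section in the config file.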

9.7.2.7. Poison message service

Your service can provide a dedicated poison message-handling service to handle messages posted to its poison messages queue when the binding is configured with ReceiveErrorHandling.Move. The poison message service must be polymorphic with the service's queued endpoint contract. WCF will retrieve the poison message from the poison queue and play it to the poison service. It is therefore important that the poison service does not throw unhandled exceptions or abort the playback transaction (configuring it to ignore the playback transaction as in Example 9-9 or to use a new transaction as in Example 9-10 is a good idea). Such a poison message service typically engages in some kind of compensating work associated with the failed message, such as refunding a customer for an item missing from the inventory. Alternatively, a poison service could do any number of things, including notifying the administrator, logging the error, or just ignoring the message altogether by simply returning. The poison message service is developed and configured like any other queued service. The only difference is that the endpoint address must be the same as the original endpoint address, suffixed by ;poison. Example 9-20 demonstrates the required configuration of a service and its poison message service. In Example 9-20 the service and its poison message service share the same host process, but that is certainly optional.

Example 9-20. Configuring a poison message service

<system.serviceModel>
   <services>
      <service name = "MyService">
         <endpoint
            address  = "net.msmq://localhost/private/MyServiceQueue"
            binding  = "netMsmqBinding"
            bindingConfiguration = "PoisonMessageSettings"
            contract = "IMyContract"
         />
      </service>
      <service name = "MyPoisonServiceMessageHandler">
         <endpoint
            address  = "net.msmq://localhost/private/MyServiceQueue;poison"
            binding  = "netMsmqBinding"
            contract = "IMyContract"
         />
      </service>
   </services>
   <bindings>
      <netMsmqBinding>
         <binding name = "PoisonMessageSettings"
            receiveRetryCount    = "2"
            retryCycleDelay      = "00:05:00"
            maxRetryCycles       = "3"
            receiveErrorHandling = "Move"
         />
      </netMsmqBinding>
   </bindings>
</system.serviceModel>
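On the code side, the poison message service is an ordinary class implementing the same contract. The following is a minimal sketch only: the text does not show the operations of IMyContract, so the one-way MyMethod() operation is an assumption made for illustration, and the tracing calls stand in for whatever compensating work your application needs.

```csharp
using System;
using System.Diagnostics;
using System.ServiceModel;

// Hypothetical contract definition; queued contracts have one-way operations only.
[ServiceContract]
interface IMyContract
{
   [OperationContract(IsOneWay = true)]
   void MyMethod();
}

class MyPoisonServiceMessageHandler : IMyContract
{
   // Stay out of the playback transaction, so that a failure here
   // cannot abort it and send the message back for more retries.
   [OperationBehavior(TransactionScopeRequired = false)]
   public void MyMethod()
   {
      try
      {
         // Compensating work goes here (notify, log, refund, and so on).
         Trace.TraceWarning("Poison message received for MyMethod()");
      }
      catch(Exception exception)
      {
         // Never let an exception propagate out of the poison service.
         Trace.TraceError(exception.ToString());
      }
   }
}
```

The class name matches the service name used in Example 9-20, so the host resolves the ;poison endpoint to this handler.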

9.7.3. Poison Message Handling on MSMQ 3.0

With MSMQ 3.0 (available on Windows XP and Windows Server 2003), there is neither a retry queue nor a dedicated automatic poison queue. As a result, WCF supports at most a single retry batch out of the original endpoint queue. After the last failure of the first batch, the message is considered poisonous. WCF therefore behaves as if MaxRetryCycles is always set to 1, and the value of RetryCycleDelay is ignored. The only values available for the ReceiveErrorHandling property are ReceiveErrorHandling.Fault and ReceiveErrorHandling.Drop. Configuring any other value throws an InvalidOperationException at service load time.
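If the same host code has to run on both platforms, one way to avoid that exception is to pick the error-handling value at runtime. The following is a minimal sketch under a loud assumption: it uses the Windows version as a proxy for the MSMQ version (6.0 and later, meaning Windows Vista onward, taken to imply MSMQ 4.0 or later), which is a heuristic and not a WCF-provided check. ErrorHandlingSelector and its members are hypothetical names.

```csharp
using System;
using System.ServiceModel;

// Hypothetical helper; the OS-version test is a heuristic, not a WCF API.
static class ErrorHandlingSelector
{
   // Returns Move where a poison queue is assumed to exist, otherwise the
   // fallback supplied by the caller (Fault or Drop on MSMQ 3.0).
   public static ReceiveErrorHandling Choose(Version osVersion,
                                             ReceiveErrorHandling fallback)
   {
      // Assumption: Windows 6.0 (Vista) or later ships MSMQ 4.0 or later.
      bool hasPoisonQueue = osVersion.Major >= 6;
      return hasPoisonQueue ? ReceiveErrorHandling.Move : fallback;
   }

   public static NetMsmqBinding CreateBinding()
   {
      NetMsmqBinding binding = new NetMsmqBinding();
      binding.ReceiveErrorHandling = Choose(Environment.OSVersion.Version,
                                            ReceiveErrorHandling.Fault);
      return binding;
   }
}
```

The fallback is left to the caller because the choice between faulting the host and dropping the message is application-specific.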

Neither ReceiveErrorHandling.Fault nor ReceiveErrorHandling.Drop is an attractive option. On MSMQ 3.0, the best way of dealing with a playback failure on the service side (that is, a failure that stems directly from the service business logic as opposed to some communication issue) is to use a response service, as discussed later on.