Section 7.1. The Recovery Challenge



Proper error handling and recovery is the Achilles' heel of many applications. Once an application fails to perform a particular operation, you should recover from it and restore the system (that is, the collection of interacting services and clients) to a consistent state, usually the state the system was in before the operation that caused the error took place. Typically, any operation that can fail consists of multiple, potentially concurrent, smaller steps. Some of those steps can fail while the others succeed. The problem with recovery is the sheer number of partial-success and partial-failure permutations that you have to code against. For example, an operation comprising 10 smaller, concurrent steps has more than three million recovery scenarios, because for the recovery logic, the order in which the steps fail matters as well, and the factorial of 10 is 3,628,800.
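To make the arithmetic concrete, here is a minimal Python sketch of how fast the number of failure orderings grows with the number of steps. The function name failure_orderings is made up for illustration, and the sketch counts only the orderings in which every step takes part, matching the factorial figure quoted above.

```python
import math


def failure_orderings(step_count: int) -> int:
    """Count the distinct orders in which `step_count` steps can fail.

    Hypothetical helper: it models the text's claim that order matters,
    so the count is simply step_count factorial.
    """
    return math.factorial(step_count)


# Show how quickly the number of scenarios grows.
for steps in (3, 5, 10):
    print(steps, failure_orderings(steps))
```

For 10 steps the function returns 3,628,800, which is why covering every ordering with handwritten recovery code is impractical.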

Trying to handcraft recovery code in a decent-size application is often futile. It results in fragile code that is very susceptible to any change in the application execution or the business use case, and it incurs both productivity and performance penalties. The productivity penalty results from the sheer effort of handcrafting the recovery logic. The performance penalty is inherent in such an approach because you need to execute huge amounts of code after every operation to verify that all is well. In reality, developers tend to deal only with the easy recovery cases; that is, the cases that they are both aware of and know how to deal with. More insidious error scenarios, such as intermittent network failures or disk crashes, go unaddressed. In addition, because recovery is all about restoring the system to a consistent state (typically the state before the operations), the real problem is the operations that succeeded, rather than those that failed. The reason is that the failed operations failed to affect the system. The challenge here is the need to undo successful steps, such as deleting a row from a table, removing a node from a linked list, or calling a remote service. The scenarios involved could be very complex, and your manual recovery logic is almost bound to miss a few successful suboperations.
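The point about undoing successful steps can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the steps run sequentially rather than concurrently, a failed step is assumed to leave no partial effect, and every name here (run_with_manual_recovery, delete_row, call_service, and so on) is hypothetical rather than taken from any real API.

```python
from typing import Callable, List, Tuple

# Each step is (name, do, undo); the undo must reverse a successful do.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_with_manual_recovery(steps: List[Step]) -> bool:
    """Run steps in order; on failure, undo the ones that succeeded."""
    completed: List[Step] = []
    for step in steps:
        _, do, _ = step
        try:
            do()
        except Exception:
            # The failed step changed nothing (by assumption), so the
            # real work is undoing earlier successes, newest first.
            for _, _, undo_done in reversed(completed):
                undo_done()  # if this raises, the system stays inconsistent
            return False
        completed.append(step)
    return True


# Toy state standing in for a table and a linked list.
rows = ["a", "b"]
nodes = [1, 2]


def delete_row() -> None:
    rows.remove("a")


def restore_row() -> None:
    rows.insert(0, "a")


def delete_node() -> None:
    nodes.remove(2)


def restore_node() -> None:
    nodes.append(2)


def call_service() -> None:
    raise ConnectionError("remote service unreachable")


def no_op() -> None:
    pass


ok = run_with_manual_recovery([
    ("delete row", delete_row, restore_row),
    ("delete node", delete_node, restore_node),
    ("call service", call_service, no_op),
])
print(ok, rows, nodes)
```

Even in this toy version, every step needs a matching undo that someone must write and keep correct, and nothing protects the system if an undo itself fails. Those are the weaknesses of manual recovery that this section describes.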

The more complex the recovery logic becomes, the more error-prone the recovery logic itself becomes. If you have an error in the recovery, how would you recover the recovery? How do developers go about designing, testing, and debugging complex recovery logic? How do they simulate the endless number of errors and failures possible? Not only that, but what if, before the operation failed, as it was progressing along executing steps successfully, some other party accessed your application and acted upon the state of the system, the very state that you are going to roll back during the recovery? That other party is now acting on inconsistent information and, by definition, is in error too. Moreover, your operation may be just a step in some other, much wider operation that spans multiple services from multiple vendors on multiple machines. How would you recover the system as a whole in such a case? Even if you have a miraculous way of recovering your service, how would that recovery logic plug into the cross-service recovery?




Programming WCF Services
ISBN: 0596526997
Year: 2004
Pages: 148
Authors: Juval Lowy
