8.2. Technical Concepts and Solutions

You can choose from a wide range of solutions to implement process integrity, ranging from simple technical measures such as distributed logging and tracing to advanced transaction concepts. BPM systems make it possible to address process integrity on a less technical and more business-oriented level. They are used to model and execute business processes, as we discussed in Chapter 7, "SOA and Business Process Management," and they enable us not only to model special cases explicitly but also to define the appropriate countermeasures for exceptions. We look at each of these solutions in turn, followed by recommendations for their application in an enterprise SOA.

8.2.1. LOGGING AND TRACING

Logging and tracing, at different levels of sophistication, is probably still the most commonly used approach for providing at least a rudimentary level of process integrity on an ad-hoc basis.

Log traces are commonly used for debugging but are also used in production systems to identify and solve problems in day-to-day operations. Particularly in complex systems that integrate large numbers of non-transactional legacy systems, logging is often the only viable approach to providing a minimum level of process integrity, especially if processes or workflows are implemented implicitly (i.e., there is no dedicated BPM system). Often, the operators of these types of systems employ administrators who manually fix problems based on the analysis of log files. If the log file provides a complete trace of the steps executed in a particular process up to the point of failure, the administrator has some chance of fixing the problem, even if this often requires going directly to the database level to undo previous updates or to repair data so that the process can continue.

A key problem with logging-based problem analysis and repair is the lack of correlation between the different log entries that relate to a particular logical process instance, especially for processes that are implemented implicitly. If a process fails due to a technical fault, someone must identify what has happened so far in order to complete or undo the process. To do this, you must find the log entries that relate to the process, possibly across different log files from different systems. Ideally, some kind of correlation ID (related to process instances) should exist for each log entry because this helps with log consolidation (as depicted in Figure 8-1). Often, however, the only way to correlate events is by comparing timestamps, which is a difficult task, especially with systems that use distributed log files. For example, how is it possible to relate a JDBC exception written into a local log by an EJB application server to a database deadlock event that was written to a database log (both log entries are potentially relevant for identifying and fixing the problem in the application)? Frequently, the only available hint is that both exceptions occurred at roughly the same time. The problem becomes even more difficult if processes are long-lived and you are trying to find out which customer account was modified earlier by a process that failed later, such as when updating a shipping order.

Figure 8-1. Consolidated logs can help system administrators deal with error situations in distributed, long-lived processes. Log consolidation can be performed across systems (e.g., providing the ability to relate updates in one database to updates in another), as well as within systems (e.g., relating database logs to application server logs).
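To illustrate the correlation-ID approach, the following minimal sketch attaches a process-instance ID to log entries using SLF4J's Mapped Diagnostic Context (MDC); the BookingService class and the processId key are purely illustrative assumptions, and the log layout must be configured (e.g., %X{processId} in a Logback pattern) for the ID to appear in the output:

    import java.util.UUID;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.slf4j.MDC;

    public class BookingService {

        private static final Logger log = LoggerFactory.getLogger(BookingService.class);

        public void executeStep(String processInstanceId) {
            // Attach the process-instance ID to every log entry written by this
            // thread; a consolidation tool can later group entries by this key,
            // even across log files from different systems.
            MDC.put("processId", processInstanceId);
            try {
                log.info("debiting customer account");
                log.info("updating shipping order");
            } finally {
                MDC.remove("processId"); // do not leak the ID into other requests
            }
        }

        public static void main(String[] args) {
            new BookingService().executeStep(UUID.randomUUID().toString());
        }
    }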


Nevertheless, logging and tracing remain important in many operational systems and are often the only possible way to achieve at least a minimum of process integrity. Many projects have therefore invested heavily in building a sophisticated infrastructure for distributed logging and log analysis. Often, this infrastructure is built on or tightly integrated with systems management platforms such as IBM Tivoli or HP OpenView.

Chapter 9, "Infrastructure of a Service Bus," provides a detailed overview of how this can be achieved in an SOA environment. Notice that distributed log consolidation can only partly address technical failures and does not address process inconsistencies that relate to business-level problems at all.

8.2.2. ACID TRANSACTIONS

Online Transaction Processing (OLTP) has been a key element of commercial computing for several decades. OLTP systems enable large numbers of users to manipulate shared data concurrently. For example, in an online flight reservation system, sales agents around the world share access to flight booking information.

In order to support the shared manipulation of data, OLTP systems are based on the concept of transactions. Traditionally, the term transaction has been used to describe a unit of work in an OLTP system (or database) that transforms data from one state to another, such as booking a seat on a particular flight. The term ACID (atomicity, consistency, isolation, durability) was coined to describe the ideal characteristics of concurrently executed and possibly distributed transactions, which ensure the highest level of data integrity; it is defined in ISO/IEC 10026-1:1992, Section 4.

We will use the simplified example of a money transfer with a debit and a credit update to illustrate the properties of ACID transactions (a code sketch follows the list):

Atomicity: ACID transactions are atomic "all or nothing" units of work. If any one part of a transaction fails, the entire transaction is rolled back. If the debit update works but the credit update fails, the original debit update must be rolled back.

Consistency: ACID transactions transform data from one consistent state to another. In our example, the sum of all account balances must be the same as before the transaction. Of particular importance is the stipulation that a transaction ensures referential integrity: if a customer account is deleted, orphan address records originally related to the customer must also be removed (see the discussion on data integrity in the previous section).

Isolation: The internal state of a running transaction is never visible to any other transaction. This is usually achieved through locking. In our example, there is a window of time between the debit and the credit, during which the sum of all accounts will not add up. However, nobody outside of the transaction can see this inconsistency.

Durability: Committed updates of a transaction are permanent. This ensures that the consistent, up-to-date state of the system can be recovered after a system failure during the execution of the transaction. If we have a crash after the debit but before the credit, we can still recover the state before the start of the transaction.
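To make these properties concrete, the following JDBC sketch executes the money transfer as a single transaction; the accounts table and its columns are assumptions made for illustration:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class MoneyTransfer {

        // Executes the debit and the credit as one atomic unit of work.
        public static void transfer(Connection con, long from, long to, long amount)
                throws SQLException {
            con.setAutoCommit(false); // group both updates into a single transaction
            try (PreparedStatement debit = con.prepareStatement(
                     "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = con.prepareStatement(
                     "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                debit.setLong(1, amount);
                debit.setLong(2, from);
                debit.executeUpdate();

                credit.setLong(1, amount);
                credit.setLong(2, to);
                credit.executeUpdate();

                con.commit();   // durability: both updates become permanent together
            } catch (SQLException e) {
                con.rollback(); // atomicity: a failed credit undoes the debit as well
                throw e;
            }
        }
    }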

Almost all commercial DBMS products support the concept of transactions for enabling concurrent access to a database, albeit with varying degrees of "ACIDity." The DBMS must provide an appropriate concurrency control mechanism to deal with the concurrent execution of transactions. Transaction performance and the behavior in case of access conflicts depend strongly on the choice of optimistic versus pessimistic concurrency control strategies, the choice of locking (exclusive or shared locks) versus timestamps versus versioning, and the locking/versioning granularity (tables, pages, tuples, objects). Most DBMSs can be programmed or configured to use different concurrency control policies (isolation levels) in order to enable different types of applications or transactions to choose the appropriate mix of performance versus transactional integrity.
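For example, an application might select a different isolation level per connection through the standard JDBC API; which internal mechanism a given level maps to (cursor stability, locking granularity, and so on) depends entirely on the DBMS:

    import java.sql.Connection;
    import java.sql.SQLException;

    public class IsolationConfig {

        // A reporting transaction may tolerate weaker isolation in exchange for
        // better throughput, while a booking transaction should not.
        static void configure(Connection reporting, Connection booking)
                throws SQLException {
            reporting.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED);
            booking.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
        }
    }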

8.2.3. TRANSACTION MONITORS AND DISTRIBUTED 2PC

Building applications with ACID properties becomes more difficult once you operate outside the domain of a single system. Transaction Processing Monitors (TPMs) can be used to ensure the ACID properties of a transaction that spans multiple databases or other transactional resources (as depicted in Figure 8-2). These resources are typically managed by resource managers, such as DBMSs and transactional queue managers.

Figure 8-2. Example for 2PC: A client begins a new transaction (1), executes two updates (2 & 3), and attempts to commit the changes (4). The transaction coordinator sends prepare requests to both resource managers (5). The final step of the transaction (6) either commits or aborts all updates. If both resource managers agree to commit the transaction, the transaction coordinator sends a commit to all participants. If one participant wants to abort, the transaction coordinator asks all participants to abort.


Most commonly, the so-called Two-Phase Commit protocol (2PC) is used to ensure ACID properties for transactions that span more than a single resource manager. Transactions are coordinated among the different resource managers by a transaction coordinator, which is part of the transaction monitor. At the end of each distributed transaction, the transaction coordinator drives the commitment of the transaction across the participating resource managers (a simplified coordinator sketch follows the list):

  • In the first phase (prepare), all participating resource managers must ensure that all relevant locks have been acquired and that the "before" and "after" state of data that has been modified in the context of the transaction has been persistently captured.

  • Depending on the outcome of the first phase ("voting"), the transaction coordinator informs all participating resource managers whether to commit or roll back the changes.

  • A single "abort" vote will cause the entire transaction to be rolled back, helping to ensure the atomicity property of the transaction. Only if all participants vote to "commit" will they also be asked to make the changes permanent and visible.
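The following deliberately simplified Java sketch captures this voting logic; a real coordinator must additionally log its commit decision durably and handle participant timeouts, recovery, and heuristic outcomes:

    import java.util.List;

    // The contract each participant must fulfill during the two phases.
    interface ResourceManager {
        boolean prepare();  // phase 1: acquire locks, persist before/after images, vote
        void commit();      // phase 2: make the changes permanent and visible
        void rollback();    // phase 2: undo the changes and release all locks
    }

    class TransactionCoordinator {

        boolean commitTransaction(List<ResourceManager> participants) {
            // Phase 1 ("prepare"): collect the votes of all participants.
            boolean allPrepared = true;
            for (ResourceManager rm : participants) {
                if (!rm.prepare()) {   // a single "abort" vote dooms the transaction
                    allPrepared = false;
                    break;
                }
            }
            // Phase 2: commit everywhere, or roll back everywhere.
            for (ResourceManager rm : participants) {
                if (allPrepared) {
                    rm.commit();
                } else {
                    rm.rollback();
                }
            }
            return allPrepared;
        }
    }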

The 2PC protocol assumes a very tight coupling between all the components involved in the execution of the transaction. All participants must "play by the rules" in order to ensure that the outcome of each transaction is consistent. If a participant does not abide by the rules, the result will be a so-called "heuristic" outcome, that is, a transaction whose final state is undefined. Furthermore, deadlocks become an issue that must be handled with great care.

The most important standard in the area of transaction monitors and 2PC is the X/Open standard for Distributed Transaction Processing (X/Open DTP). Among other protocols and APIs, the X/Open standard defines the so-called XA interface, which transaction monitors use to interact with a resource manager, for example to execute the "prepare" and "commit" calls. Most commercial RDBMSs and queue managers support the XA interface, enabling them to participate in distributed transactions. Well-established transaction monitors include CICS, IMS, and Encina (IBM), and Tuxedo (BEA). In the Java world, JTS (Java Transaction Service) is widely established, and its counterpart in the Microsoft world is MTS (Microsoft Transaction Server).
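From the application programmer's perspective, the transaction monitor hides most of this machinery. As a usage sketch, the following code spans two XA-capable databases under JTA; the JNDI names and SQL statements are purely illustrative assumptions:

    import java.sql.Connection;
    import javax.naming.InitialContext;
    import javax.sql.DataSource;
    import javax.transaction.UserTransaction;

    public class DistributedUpdate {

        public void run() throws Exception {
            InitialContext ctx = new InitialContext();
            UserTransaction utx =
                (UserTransaction) ctx.lookup("java:comp/UserTransaction");
            DataSource orders  = (DataSource) ctx.lookup("jdbc/OrdersXA");  // hypothetical
            DataSource billing = (DataSource) ctx.lookup("jdbc/BillingXA"); // hypothetical

            utx.begin();
            try {
                try (Connection c1 = orders.getConnection();
                     Connection c2 = billing.getConnection()) {
                    c1.createStatement().executeUpdate(
                        "UPDATE orders SET status = 'BILLED' WHERE id = 42");
                    c2.createStatement().executeUpdate(
                        "INSERT INTO invoices (order_id, amount) VALUES (42, 100)");
                }
                utx.commit();   // drives the XA prepare/commit on both databases
            } catch (Exception e) {
                utx.rollback();
                throw e;
            }
        }
    }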

8.2.4. PROBLEMS WITH 2PC AND TIGHTLY COUPLED ACID TRANSACTIONS

Although ACID transactions are a good theoretical concept for ensuring the integrity of a single database or even a distributed system, in many cases they are impractical in real-world applications. We will now cover the key limitations of tightly coupled ACID transactions in more detail.

8.2.4.1 Performance

Even in a non-distributed system, ensuring the isolation property of a concurrently executed transaction can be difficult. The problem is that the higher the isolation level, the poorer the performance. Therefore, most commercial DBMSs offer different isolation levels, such as cursor stability, repeatable read, read stability, and uncommitted read.

In distributed systems, the execution of a distributed 2PC can have an even more negative impact on the performance of the system, not least because of the overhead of the required out-of-band coordination going on behind the scenes. Finding the right tradeoff between performance on the one hand and a high degree of concurrency, consistency, and robustness on the other is a delicate process.

8.2.4.2 Lack of Support for Long-Lived Transactions

Another big problem with systems based on transaction monitors and ACID properties is that most database management systems are designed for the execution of short-lived transactions, whereas many real-world processes tend to be long-lived, particularly if they involve interactions with end users. Most OLTP systems (and transaction monitors and underlying databases) use pessimistic locking to ensure isolation for concurrent transactions. While a lock is held, other users cannot access the resource in question, and because some RDBMS-based applications still prefer page-level locking to row-level locking, a lock can also block resources that reside within the same page but are not directly involved in the transaction. The longer the lock is held, the worse the problem becomes.

A typical solution to the problem with pessimistic locking is the application of an optimistic, timestamp-based approach. However, this must often be implemented at the application level (e.g., by adding a timestamp column to the relevant tables) due to a lack of out-of-the-box support for timestamp-based versioning in many DBMSs. If the application is responsible for ensuring consistency, we now have a problem with our transaction monitor: the transaction monitor assumes that all these issues are dealt with directly between the transaction monitor and the DBMS during the 2PC. The XA prepare and commit calls implemented by the resource manager are expected to manage the transition from the "before" to the "after" state and the locking/unlocking of the involved resources. Effectively, this means that we must incorporate customized logic into the 2PC that deals with these issues at the application level by checking for timestamp conflicts before applying changes. Even if some transaction monitors support application-level locking through callbacks that can be inserted into the two-phase commit, this approach severely limits an "off-the-shelf" approach because work that should originally have been split between the transaction monitor and the DBMS must now be performed, at least partially, at the application level.
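For illustration, here is a minimal sketch of such an application-level optimistic check, using a version counter column rather than a literal timestamp; the accounts table and its version column are assumptions:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class OptimisticUpdate {

        // The update succeeds only if the row still carries the version that was
        // read at the beginning of the long-lived business transaction.
        static void updateBalance(Connection con, long accountId, long newBalance,
                                  long expectedVersion) throws SQLException {
            String sql = "UPDATE accounts SET balance = ?, version = version + 1 "
                       + "WHERE id = ? AND version = ?";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setLong(1, newBalance);
                ps.setLong(2, accountId);
                ps.setLong(3, expectedVersion);
                if (ps.executeUpdate() == 0) {
                    // Another transaction has modified the row in the meantime.
                    throw new SQLException("Optimistic conflict on account " + accountId);
                }
            }
        }
    }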

8.2.4.3 Problems with the Integration of Legacy Systems and Packaged Applications

Perhaps the biggest problem with transaction monitors and 2PC is the lack of support for two-phase commit via an XA interface in legacy systems and packaged applications, such as ERP or CRM systems. Even if an ERP system such as SAP internally uses an XA-capable database such as Oracle or DB2, this does not mean that SAP can participate in a 2PC: all access to the SAP modules must go through an SAP API such as BAPI, which is not transactional.

This lack of support for 2PC in many legacy systems and packaged applications severely limits the application scope of distributed transactions and TP monitors because we normally need to integrate applications instead of databases into complex workflows.

8.2.4.4 Organizational Challenges

Where it has been used at all, 2PC has traditionally been limited to tightly coupled, well-controlled intra-enterprise environments or single application systems with short-lived transactions. Transaction coordinators are the control center for the orchestration of the 2PC amongst resource managers. Not only is tight coupling required at the protocol level, but successful orchestration also requires that every participant "plays by the rules" in order to avoid frequent heuristic outcomes that leave transactions in an ill-defined state. Such tight control over databases, applications, transaction frameworks, and network availability is usually limited to intra-enterprise environments, if it can be achieved at all.

Conducting inter-organizational transactions between autonomous business partners is a completely different situation. In an inter-organizational environment, we cannot assume total control over all aspects of transaction processing. Relationships between trading partners are typically loosely coupled, and the nature of inter-business transactions reflects this loose coupling. The technical implementations of these transactions must deal with this loose coupling and also with the fact that the level of trust between the transaction participants differs from that of a tightly coupled, well-controlled environment. For example, security and inventory control issues prevent hard locking of local databases: imagine a denial-of-service attack from a rogue partner that results in locks being taken out on all available items in your inventory database, preventing you from doing business while the locks remain in place.[1]

[1] This risk exists even with optimistic concurrency control: during the normally short period of transaction resolution (prepare/commit), all participants must lock the updated data, so a rogue participant could still block access during this window.

8.2.4.5 2PC Is Not Suited for Discontinuous Networks

When integrating B2B systems over the Internet, the discontinuous nature of the Internet, with its lack of QoS (quality of service) guarantees, must be taken into account. This applies not only to the execution of the actual business logic of a transaction across the Internet but also to any out-of-band interactions between transaction participants that are part of the transaction coordination, because these are also executed over the Internet. Relying on a potentially discontinuous medium such as the Internet for the execution of the two-phase commit could increase the time between the prepare and commit calls in an unacceptable way, with the potential for many heuristic outcomes.

8.2.5. NESTED AND MULTILEVEL TRANSACTIONS

The complexity of many business transactions and the fact that business transactions are potentially long-running have led to the development of advanced transaction concepts such as multilevel and nested transactions.

A multilevel transaction T is represented by a set of sub-transactions T = {t1, t2, t3, ..., tn}. A sub-transaction ti in T can abort without forcing the entire transaction T to abort. T can, for example, choose to rerun ti, or it can attempt to find an alternative means of completion. If ti commits, the changes should only be visible to the top-level transaction T. If T is aborted, then so is ti. Multilevel and nested transactions are slightly different in the way in which they deal with releasing locks on the completion of a sub-transaction.

Although some commercial transaction monitor implementations support the concept of nested or multilevel transactions, the problem with their adoption lies in the lack of support from resource managers (database and queue managers). Very few commercial resource managers provide sufficient support for nested transactions.[2] Unfortunately, this lack of support from the major commercial resource managers makes these good theoretical concepts somewhat unusable in the real world.

[2] More commonly, RDBMSs offer support for savepoints or similar concepts, which define a point in time to which a partial rollback can occur.
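As the footnote indicates, savepoints are the closest widely available approximation of a sub-transaction. The following JDBC sketch (table names are illustrative) rolls back an optional step on its own without aborting the enclosing transaction:

    import java.sql.Connection;
    import java.sql.Savepoint;
    import java.sql.SQLException;

    public class SavepointDemo {

        static void run(Connection con) throws SQLException {
            con.setAutoCommit(false);
            con.createStatement().executeUpdate(
                "UPDATE accounts SET balance = balance - 100 WHERE id = 1");

            // The "sub-transaction" may fail without aborting the outer work.
            Savepoint optionalStep = con.setSavepoint("optionalStep");
            try {
                con.createStatement().executeUpdate(
                    "INSERT INTO bonus_log (account_id) VALUES (1)");
            } catch (SQLException e) {
                con.rollback(optionalStep); // partial rollback to the savepoint
            }
            con.commit(); // commits the debit (plus the bonus entry, if it succeeded)
        }
    }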

8.2.6. PERSISTENT QUEUES AND TRANSACTIONAL STEPS

Persistent queues can increase the reliability of complex processes, and in combination with transactions, they can even be used to create transactional steps within a more complex process, guaranteeing consistency for the individual steps of a process or workflow. An application can de-queue messages in the context of a transaction; if the transaction aborts, the de-queue operation is undone, and the message is returned to the queue. Abort count limits and error queues are commonly used to deal with messages that repeatedly lead to an aborted transaction.

Figure 8-3 shows an example of a transactional process step with persistent queues. Notice that if the queue manager is not part of the database system (i.e., is an independent resource manager), the transaction effectively becomes a distributed transaction and requires a transaction monitor that can coordinate the two-phase commit between the database and queue managers.

Figure 8-3. A transactional process step with persistent queues.


Leverage the Concept of Transactional Steps

A transactional step is a set of closely related activities, executed in the context of a single transaction. At the beginning of the transaction, an input token or message is read from an input queue. The result of the related activities is written into an output queue. Transactional steps are a key concept for ensuring process integrity because they facilitate the decomposition of complex, long-lived business processes into individual steps with short execution times and high transactional integrity. Transactional steps dramatically increase the robustness and flexibility of distributed systems.
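As an illustration of a single step, the following JMS sketch de-queues an input message, performs its work, and en-queues the result within one transacted session; the queue names are illustrative, and if a database is involved as a separate resource manager, a 2PC coordinated by a transaction monitor is required instead of the queue-local transaction shown here:

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.Message;
    import javax.jms.MessageConsumer;
    import javax.jms.MessageProducer;
    import javax.jms.Session;

    public class TransactionalStep {

        public void processOneMessage(ConnectionFactory factory) throws Exception {
            Connection connection = factory.createConnection();
            try {
                connection.start();
                Session session =
                    connection.createSession(true, Session.SESSION_TRANSACTED);
                MessageConsumer in =
                    session.createConsumer(session.createQueue("step.in"));
                MessageProducer out =
                    session.createProducer(session.createQueue("step.out"));

                Message request = in.receive(5000); // the input token for this step
                if (request == null) {
                    return; // nothing to process
                }
                try {
                    // ... perform the activities of this step here ...
                    out.send(session.createTextMessage("step completed"));
                    session.commit();   // de-queue and en-queue succeed together
                } catch (Exception e) {
                    session.rollback(); // the request returns to the input queue
                    throw e;
                }
            } finally {
                connection.close();
            }
        }
    }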


8.2.7. TRANSACTION CHAINS AND COMPENSATION

The concept of transactional steps provides a good way of ensuring the integrity of individual process steps. You can create complex workflows or processes by chaining or linking together individual steps ("transaction chains"). However, the remaining issue is to ensure the integrity of the overall process or workflow. Although abort count limits and error queues are a good means of ensuring that no information is lost when an individual step fails, it is still necessary to fix these failures after detecting them.

One possible way to deal with failures in individual process steps is to identify compensating transactions that logically undo a previous transaction. For example, the compensation for a debit transaction would be a transaction that credits the same amount. An implementation of a distributed money transfer between two banks could be split into different steps that debit an account A at bank X and pass a message to the receiving bank Y using a transactional queue. If a problem occurs at bank Y (e.g., the account does not exist), bank X would have to execute a compensating transaction, such as crediting the appropriate account A (admittedly, this is a simplified example, ignoring the complexity of intermediary clearinghouses, end-of-day settlements, etc.).

Notice that, unlike nested or multilevel transactions, chained transactions effectively relax the isolation property of ACID transactions because the results of each link in a chain of transactions are made visible to the outside world. For instance, assume that we have credited an account A in the first step of a transaction chain. When attempting to execute the corresponding debit on another account B in the next step, we discover that the target account B does not exist. We could now launch a compensating transaction to debit account A and undo the previous changes. However, a time interval now exists between the credit and the subsequent debit of account A, during which the funds resulting from the credit could have been withdrawn by another transaction, leading to the failure of the compensating transaction because the funds are no longer available.[3]

[3] Of course, in this example, the problem could be easily solved by reversing the order of the debit and the credit (starting with the debit first). However, the example still illustrates the problem.

In order to apply compensating transactions in a workflow, we need to log the input and/or output of the individual steps of a workflow because we might need this data as input for compensating transactions.

Combine Transaction Chains with Compensating Transactions

ACID properties are usually too strong for complex workflows. Instead, chained transactions with compensating transactions offer a viable means of dealing with process integrity. Transaction chains combine individual transactional steps into more complex workflows. Compensating transactions undo previously executed steps if a problem is encountered during the execution of a particular step in a transaction chain.


8.2.8. SAGAS

SAGAs are formal workflow models that build on the concept of chained transactions (steps). A SAGA describes a workflow wherein each step is associated with a compensating transaction. If a workflow stops making progress, we can run compensating transactions for all previously committed steps in reverse order. Although formal SAGAs are still at the research stage, the concept of using a compensating transaction for dealing with specific failure situations in a complex workflow is valid.
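A minimal sketch of this execution model, using illustrative interfaces: each step carries its compensating transaction, and upon failure, the executor compensates all previously completed steps in reverse order:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.List;

    // One link in the chain, paired with the transaction that logically undoes it.
    interface SagaStep {
        void execute();
        void compensate();
    }

    class SagaExecutor {

        void run(List<SagaStep> steps) {
            Deque<SagaStep> completed = new ArrayDeque<>();
            for (SagaStep step : steps) {
                try {
                    step.execute();
                    completed.push(step); // remember it for potential compensation
                } catch (RuntimeException failure) {
                    // The workflow cannot make progress: compensate all previously
                    // committed steps in reverse order.
                    while (!completed.isEmpty()) {
                        completed.pop().compensate();
                    }
                    throw failure;
                }
            }
        }
    }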

A number of problems exist with SAGAs and compensations in complex workflows, particularly with the complexity of workflow graphs and the corresponding compensation graphs. Typically, the complexity of compensation graphs increases exponentially with the complexity of the actual workflow graph. Thus, it is generally impossible to define complete compensation graphs. In addition, the need to deal with failures during the execution of compensating transaction chains adds even more complexity.

Even if formal SAGAs, with compensations for each possible combination of failures, are unattainable, it often makes sense to apply the concept of compensating transactions to individual failure situations that have been specifically identified as fitting this approach, such as failures that we expect to occur frequently, perhaps because they are caused by business conditions such as "out of funds."

8.2.9. BPM AND PROCESS INTEGRITY

As introduced in Chapter 7, Business Process Management platforms provide features that enable business analysts and developers to design, execute, and manage instances of complex business processes. Many BPM platforms are in fact based on, or at least incorporate, some of the features described previously, such as chains of transactional steps. In the following discussion, we examine how the formal introduction of a BPM platform helps improve the process integrity of an enterprise application.

Firstly, explicitly modeling workflows and processes and separating them from low-level technical code clarifies what process integrity actually means for a particular process, because the BPM approach provides a comprehensible separation between "technical integrity" and "process integrity."

Secondly, if the BPM engine provides a mechanism for monitoring and managing process instances at runtime, we gain a powerful mechanism for controlling the integrity of individual process instances at no extra cost.

Finally, some BPM engines support the concept of compensating transactions, which is a key concept for managing the rollback of partially executed processes after a failure situation. A key question is whether compensating transactions are sufficient to ensure process integrity, given that the compensation-based model is less strict than, for example, the ACID properties of a distributed transaction. In particular, it is useful to examine how to handle failures of compensating transactions. Do we actually introduce compensations for failed compensations? Although in some rare cases this might be required (some transaction processing systems in financial institutions have sophisticated meta-error handling facilities that cope with failures in failure handling layers), this will not be the case in most systems. Thus, the support for compensating transactions as offered by some BPMs often presents a sound and flexible alternative to platforms that require very tight coupling with applications and resource managers, such as transaction monitors.

8.2.10. RELATED WEB SERVICE STANDARDS

This book takes the position that Web services are only one possible technical platform for SOAs (see Chapter 9, "Infrastructure of a Service Bus"). However, we want to take a quick look at the standards that relate to Web services and process integrity, because this subject is likely to become very important as Web services become more pervasive. Unfortunately, the area of standards for Web service-based business transactions is still very much in flux (see Figure 8-4).

Figure 8-4. A number of different industry alliances and standardization bodies are currently working on different Web services-based transaction protocols. However, much of this work is still in the early stages, and no dominant standard has emerged yet.


It is important to realize that most of this work is still at an early stage. Simply defining a new transaction standard is usually insufficient for solving process integrity problems. Even if a standard emerges that is supported by a number of commercial transaction managers or coordination engines, this alone is not enough: the real problem is not to implement a transaction or coordination engine that is compliant with a specific protocol, but rather to find widespread support for such a new coordination protocol among resource managers. Without the support of commercial databases, queue managers, and off-the-shelf enterprise application packages, a new transaction standard is not worth much to most people. Recall that it took the X/Open standard for Distributed Transaction Processing almost ten years from its initial specification to its adoption by the most widely used commercial database products such as Oracle, IBM DB2, and SQL Server. Even today, the support found in these products for the critical XA interface (which is the database side of the X/Open standard) is often weak.

Some of the more recent (pre-Web services, post-X/Open) transaction standards, such as the CORBA Object Transaction Service (OTS) and the Java Transaction Service (JTS), were designed to fit into the widely adopted X/Open framework, as were some of the earlier Web services-based transaction standards. However, as we discussed in the first part of this chapter, X/Open-based 2PC transactions are not suited to the long-lived business transactions that you are likely to encounter in most Web services scenarios, which is why the need for a new transaction or coordination protocol exists. It remains to be seen which of the contenders will eventually become the dominant standard for loosely coupled transaction coordination, and when commercial resource managers will support it. Until then, we must cope with ad-hoc solutions.


