9.2. Logging and Auditing

In this book, we provide technical and organizational tools to build a system that is not prone to failures. Yet it is of course virtually impossible to rule out all failures. A lot of the measures described elsewhere in this book are potentially costly, and for this reason, they are often omitted in practice. For example, in real-world situations, the budgets for infrastructure investments to set up a failsafe database cluster or a redundant network setup simply might not be approved. Hence, you sometimes must rely on measures such as logging. Chapter 8, "Managing Process Integrity," has already given an in-depth discussion on process integrity for which logging is a crucial building block.

Another reason for operational disruption might be that one of the services that the application depends on must undergo unplanned maintenance. Consider an airline booking application that, among other services, relies on a pricing service to determine the price for specific flights on a given date. Now assume a serious bug in the module that performs the flight price calculation. No company wants something like this to happen, especially if the calculated prices are far too low, and the corresponding service will often be shut down immediately to prevent further damage.

The bottom line is that even if you have planned for the system to handle a lot of error conditions automatically, unplanned and unrecoverable errors can still happen. We will call such errors failures in the remainder of this chapter. As depicted in Figure 9-9, failures require activities at different levels, including user interaction, logging, and systems management.

Figure 9-9. An error must be reported to the user, to a log file or database, and to a systems management system.


The concepts and techniques for coping with failures discussed in this chapter are not relevant only to SOAs. Most of them are simply good coding practices or design patterns. However, in a distributed and loosely coupled architecture, they are of much greater importance than in a standalone monolithic application. The distributed nature of the architecture, and the fact that the source code or core dumps of the actual services are not available for debugging, make explicit failure-handling measures essential.

In case of such failures, it is usually necessary to perform certain manual activities to return things to normal. In lucky circumstances, this might be as easy as resetting a power switch. On the other hand, it might result in a number of employees browsing millions of rows in multiple database tables. It is not hard to see that resolving problems is a lot easier if we know where and when the error occurred and which users were involved in it. It is, therefore, mandatory that every SOA has a reliable logging and auditing infrastructure. Although logging and auditing are quite similar from a technical point of view, they differ considerably at the requirements level.

Usually, runtime output from a system is mapped to different log levels. These levels are used, among other things, to distinguish auditing, logging, tracing, and debugging output. They are usually identified by a number or a text label, such as "DEBUG," "TRACE," "INFO," "WARN," "ERROR," or "AUDIT."
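
As a minimal illustration, the following Java sketch maps output to such levels using the java.util.logging API that is mentioned later in this section. The logger name and the AUDIT level are our own assumptions; the JDK defines no audit level, so one is added here as a custom level ranked above SEVERE.

    import java.util.logging.Level;
    import java.util.logging.Logger;

    public class LogLevels {
        // java.util.logging has no built-in AUDIT level; this anonymous
        // subclass adds one above SEVERE so audit output is never filtered.
        static final Level AUDIT =
            new Level("AUDIT", Level.SEVERE.intValue() + 100) {};

        public static void main(String[] args) {
            Logger log = Logger.getLogger("com.example.booking");
            log.setLevel(Level.INFO); // suppress trace/debug-style output

            log.finest("trace: entering priceFlight()");     // suppressed
            log.fine("debug: fare table loaded, 42 rows");   // suppressed
            log.info("flight LH400 priced at 320 EUR");      // written
            log.warning("fare table is older than 24 hours");// written
            log.severe("pricing service unreachable");       // written
            log.log(AUDIT, "credit card charged: 320 EUR");  // written
        }
    }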

Auditing is normally put into place to satisfy some legal requirements, such as documenting that a credit card was actually charged because the client ordered flight tickets, or to document that the ticket actually did get printed and that it was sent to the address given by the client.

Auditing creates a new subsystem that itself impacts system operation. When normal logging fails, for example, because the log file or disk partition is full, you can usually carry on simply by suspending logging. However, if auditing fails, this must be considered a system failure, and the system must stop its operation. After all, continuing to run without auditing in place might violate a legal obligation, and no company wants that to happen.
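
The following sketch illustrates this asymmetry in Java. The file names and the choice of an unchecked exception to halt operation are illustrative assumptions, not a prescribed design:

    /** Sketch: logging failures are tolerated, auditing failures are not. */
    public class AuditingSketch {
        void log(String message) {
            try {
                append("app.log", message);
            } catch (java.io.IOException e) {
                // Logging is best-effort; the system carries on without it.
            }
        }

        void audit(String message) {
            try {
                append("audit.log", message);
            } catch (java.io.IOException e) {
                // A broken audit trail may violate legal obligations:
                // escalate so that the system stops accepting work.
                throw new IllegalStateException("audit trail unavailable", e);
            }
        }

        private void append(String file, String msg) throws java.io.IOException {
            java.io.FileWriter w = new java.io.FileWriter(file, true);
            try { w.write(msg + System.getProperty("line.separator")); }
            finally { w.close(); }
        }
    }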

Tracing is usually disabled while a system is in production and is only enabled in case of a major problem because it is often so detailed that it significantly degrades system performance. Tracing is usually switched on explicitly to track down a particular error. In the case of intermittent errors, this results in degraded performance for a potentially lengthy period until the error can be identified.

Finally, debugging consists of statements that are only significant to application developers. Therefore, debugging code is often excluded from the production code altogether. In the rest of this chapter, we will focus on logging and auditing. Both are treated in a largely similar fashion.

9.2.1. ERROR REPORTING

One of the most common issues with error reporting is that failures can go unnoticed. Unfortunately, it is all too easy to build a system in which failures are not reliably detected. A common mistake is to catch all exceptions during development and discard them silently. As the project moves on, usually toward an overly optimistic deadline, developers turn to their next tasks, and the RAS (Reliability/Availability/Serviceability) features are never completed. Of course, cases like this should be avoided in any kind of software development by employing proper coding standards. In an SOA, they become an even greater problem because of the loosely coupled nature of the application.
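
A minimal Java sketch of the anti-pattern and its remedy, with invented names, might look as follows:

    import java.util.logging.Level;
    import java.util.logging.Logger;

    /** Sketch of the "swallowed exception" anti-pattern and its fix. */
    public class ErrorReportingSketch {
        static final Logger LOG =
            Logger.getLogger(ErrorReportingSketch.class.getName());

        // Anti-pattern: the failure disappears without a trace.
        void printTicketSilently(Runnable ticketPrinter) {
            try {
                ticketPrinter.run();
            } catch (RuntimeException e) {
                // TODO handle later -- in practice, never revisited
            }
        }

        // Better: record the failure and report it to the caller.
        void printTicket(Runnable ticketPrinter) {
            try {
                ticketPrinter.run();
            } catch (RuntimeException e) {
                LOG.log(Level.SEVERE, "ticket could not be printed", e);
                throw e; // callers must learn about the failure
            }
        }
    }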

Similarly, an error that is logged but not reported to the customer can cause a lot of confusion. For example, if the airline ticket printing service is temporarily unavailable, an error might not be reported to the customer. Instead, it might be discovered and fixed during a routine log screening at the end of the week. By that point, the customer is likely to have called in and opened a customer service case, causing expenses that might otherwise have been avoided.

It is crucial for the business owners to clearly define to both developers and software designers what their requirements are for logging, auditing, and reporting. Likewise, it is mandatory to report an error each and every time it occurs. When using distributed services, this can be achieved using the technologies of the underlying platform. For example, when using SOAP, you can utilize the SOAP fault mechanism. Similarly, if you are using a distributed object technology such as EJB, you can use remote exceptions to report an error.
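
As a hypothetical illustration of the EJB case, a remote interface might declare its exceptions as follows; TicketService, printTicket, and TicketingException are invented names:

    import java.rmi.RemoteException;
    import javax.ejb.EJBObject;

    /** Sketch of an EJB 2.x-style remote interface: transport and
     *  infrastructure failures surface as RemoteException, business
     *  failures as a dedicated application exception. Either way,
     *  the caller is told that something went wrong. */
    public interface TicketService extends EJBObject {
        void printTicket(String bookingId)
            throws RemoteException, TicketingException;
    }

    /** Checked application exception reported across the service boundary. */
    class TicketingException extends Exception {
        public TicketingException(String message) { super(message); }
    }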

9.2.2. DISTRIBUTED LOGGING

Using a framework of potentially distributed services does little to make logging easier. In fact, using distributed services for logging is rarely appropriate. Usually, the requirements for logging are that it must be both reliable and lightweight.

Additionally, it should be easy to use in order to encourage programmers to log whenever they think there is something worth logging. Given these requirements, it is astonishing how often one comes across a central logging service. Granted, the idea of setting up logging as a service in a distributed environment is very tempting, but such an approach is clearly not lightweight once you consider the network load and latency involved. If logging is implemented using object-oriented technologies, the cost of marshalling and unmarshalling log record objects adds to this overhead. Nor is it reliable, because many things can go wrong when storing the individual log records, starting with network failures and ending with storing the actual log records in a file or database table. Finally, it is out of the question to use distributed transactions to make entries in a log facility, because this is probably as complex a process as one could encounter.

Log Locally but View Globally

Local logging is essential due to the need for a logging facility to be lightweight and reliable. Global viewing of logs is required for the analysis of errors in distributed processes.


To ensure that logging is both reliable and lightweight, the best approach is to log locally and consolidate the logs, as illustrated in Figure 9-10. Whether log data is written to a file or a database does not really matter, nor does the format of the log records themselves. What matters is that each and every log entry carries some common information to enable log consolidation. This information includes the timestamp, the origin of the log entry (method or procedure), and the name of the user who made the call (if legally possible). Additionally, each service call should include a unique token that can be used during log file consolidation to build a distributed stack trace.
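
A minimal sketch of such a log entry in Java, with illustrative field and class names, might look like this:

    /** Sketch of the common fields every log entry should carry so that
     *  local logs can later be consolidated. */
    public class LogRecordSketch {
        final long timestamp = System.currentTimeMillis(); // when
        final String origin;    // which service/method wrote the entry
        final String user;      // calling user, if legally permitted
        final String requestId; // token passed along with every service call
        final String message;

        LogRecordSketch(String origin, String user,
                        String requestId, String message) {
            this.origin = origin; this.user = user;
            this.requestId = requestId; this.message = message;
        }

        /** One line per entry keeps local logging cheap; the requestId
         *  lets consolidation rebuild a distributed stack trace later. */
        public String toLogLine() {
            return timestamp + "|" + origin + "|" + user + "|"
                + requestId + "|" + message;
        }
    }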

Figure 9-10. Structure of a distributed logging environment. The individual services write to dedicated logs of arbitrary format that are consolidated to form a common storage. The log services that are available at the various locations should ideally have identical interfaces. An audit trail service can be used to query the consolidated storage and request consolidation from the various log services.


Session and Transaction Tokens

A good service-oriented logging facility needs to provide tokens, such as session tokens and transaction tokens, that can be used to consolidate the log files as they are constructed and to search them after an exceptional event.


Consider the example of purchasing an airline ticket, shown in Figure 9-11. The airline ticket service itself logs to an XML file. When called, it generates a unique request ID that is logged with all entries to that file and passed on to the other services involved. The billing service logs to an RDBMS, while the flight reservation service uses a record-based log file. All three log sources are then accessible through a log service for that particular location. Ticketing takes place at a different location. It logs to a line-oriented file, again using the unique request ID. The contents of the logs for that location are also made available using a log service. If engineered properly, the interfaces of the two log services should be identical. The log services store the consolidated log information in a centralized data store. This operation does not need to be overly reliable because it is intrinsically idempotent. The common data store can then be queried using an audit trail service.
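
The idempotency argument can be sketched in a few lines of Java: if each entry is stored under a key derived from its unique tokens, delivering the same entry twice leaves the consolidated store unchanged, so a consolidation run that fails halfway can simply be repeated. The in-memory map below merely stands in for the real data store:

    import java.util.HashMap;
    import java.util.Map;

    /** Sketch: consolidation keyed by unique tokens is idempotent. */
    public class ConsolidatedLogStore {
        private final Map<String, String> entries = new HashMap<String, String>();

        /** Re-delivering the same entry twice leaves the store unchanged. */
        public void put(String requestId, String origin, long seqNo, String line) {
            entries.put(requestId + "/" + origin + "/" + seqNo, line);
        }
    }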

Figure 9-11. The airline ticket service uses basic services distributed over two locations. Logging is performed by the individual services using an RDBMS and XML-based, record-based, and line-oriented files. The local log services consolidate the local data, on request or periodically, into a common storage. A common audit trail service presents an overall picture of the application.


9.2.3. LOGGING AND TRANSACTION BOUNDARIES

As we mentioned previously, logging is a lightweight activity, and as such, log data should not be written to the same database as the transaction being logged. The reason is obvious: in case of an error, not only will the transaction be rolled back, but the log entries will be rolled back with it. It can then be very hard to determine the cause and precise circumstances of the failure, especially when using an environment that provides sophisticated container-managed transactions. For example, when using an EJB container, the user must be aware of the restrictions the container imposes. If, for some reason, it is infeasible to use the logging facilities provided, logging to a file usually does the trick. If you need to log to a transactional resource, such as a message queue or an RDBMS, it is necessary to manage that resource manually, outside the ongoing transaction.
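
As a sketch of such manually managed, non-transactional logging, the following Java class appends log entries to a plain file so that no transaction monitor can ever roll them back. In an EJB 2.x environment, a similar effect can be achieved by deploying the logging method with the NotSupported transaction attribute:

    import java.io.FileWriter;
    import java.io.IOException;

    /** Sketch: write log entries outside any ongoing transaction, so a
     *  later rollback of the business transaction cannot erase them. */
    public class NonTransactionalLog {
        public synchronized void write(String line) {
            try {
                FileWriter w = new FileWriter("service.log", true);
                try { w.write(line); w.write('\n'); }
                finally { w.close(); }
            } catch (IOException e) {
                // best effort -- never let logging break the business call
            }
        }
    }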

Never Log Under the Control of a Transaction Monitor

Never log under the control of a transaction monitor. Ensure that each completed call to a logging system yields an entry in the appropriate log data store, and prevent the rollback mechanisms of a transaction monitor from erasing log entries.


Logging to a file can also show some unwanted behavior. File systems such as Linux ext3 or NTFS use a write buffer, which means that data is not written persistently until the operating system synchronizes the file system. You might therefore need to synchronize the file system manually after each log call.
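
In Java, for example, FileDescriptor.sync() requests exactly this kind of synchronization; the following sketch forces every log entry to disk before returning (file name is illustrative):

    import java.io.FileOutputStream;
    import java.io.IOException;

    /** Sketch: push each log entry through the OS write buffer to disk. */
    public class SyncedLog {
        public void write(String line) throws IOException {
            FileOutputStream out = new FileOutputStream("service.log", true);
            try {
                out.write((line + "\n").getBytes());
                out.getFD().sync(); // do not return before the entry is persistent
            } finally {
                out.close();
            }
        }
    }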

In the case where there are multiple explicit updates using a transactional resource, logging should be performed before and after each of these activities, as shown in Figure 9-12 when making a flight reservation with two flight legs. This situation occurs commonly when EJB Entity Beans are used as the persistence mechanism of the service.

Figure 9-12. Transaction log.
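
A sketch of this pattern in Java, with Dao and Log as stand-ins for the persistence mechanism and the local logging facility, might look as follows:

    /** Sketch: log before and after each transactional update, so that a
     *  rollback cannot erase the evidence of which update failed. */
    public class LegReservation {
        interface Dao { void insert(String leg); } // Entity Beans or JDBC
        interface Log { void write(String line); } // non-transactional log

        private final Dao dao;
        private final Log log;

        LegReservation(Dao dao, Log log) { this.dao = dao; this.log = log; }

        void reserve(String leg1, String leg2) {
            log.write("reserving leg 1: " + leg1);
            dao.insert(leg1);            // update on the transactional resource
            log.write("leg 1 reserved");

            log.write("reserving leg 2: " + leg2);
            dao.insert(leg2);
            log.write("leg 2 reserved"); // survives any rollback
        }
    }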


Naturally, one of the occasions where logging is mandatory is before calling a remote service and after that call returns. This makes sense not only because making a remote call is more error-prone, but also because an error can often be fixed by running the remaining services in the call chain again. Consider the example in Figure 9-13, where the billing and flight reservation calls succeed but the ticketing call never returns. After we determine that no ticket was printed, all we must do is rerun the ticketing service.

Figure 9-13. Call chain.
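
The following Java sketch, with invented names, shows logging placed around each remote call so that the log reveals which step in the chain must be rerun:

    /** Sketch: if "ticketing" never returns, the log shows that billing
     *  and reservation succeeded, so only ticketing needs to be rerun. */
    public class CallChain {
        interface Service { void call(String requestId); } // remote call stand-in

        void run(String requestId, Service billing,
                 Service reservation, Service ticketing) {
            invoke(requestId, "billing", billing);
            invoke(requestId, "reservation", reservation);
            invoke(requestId, "ticketing", ticketing);
        }

        private void invoke(String requestId, String name, Service s) {
            System.out.println(requestId + " calling " + name);
            s.call(requestId);
            System.out.println(requestId + " returned from " + name);
        }
    }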


9.2.4. LOGGING FRAMEWORKS AND CONFIGURATION

A common activity is to build the consolidation part of the system and to provide services for the remote retrieval of log entries. However, a normal project should not be concerned with building the part of the system that writes the actual log records in the first place. Normally, the execution and development environment will provide some form of logging services. Many containers provide logging in their own right; EJB or CORBA containers often include log consolidation facilities. A development environment might be able to weave logging into the source code based on well-defined rules. Furthermore, a runtime environment might provide a high degree of logging in any case, based on the automatic interception of certain activities, such as making an HTTP request or obtaining a database connection from a pool.
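
As an illustration of such interception, a servlet filter can log every HTTP request without touching the service code itself; the filter below is a generic sketch, not tied to any particular container:

    import java.io.IOException;
    import javax.servlet.*;

    /** Sketch of interception-based logging: every HTTP request passing
     *  through this filter is logged, with no change to the service code. */
    public class RequestLoggingFilter implements Filter {
        public void init(FilterConfig config) {}
        public void destroy() {}

        public void doFilter(ServletRequest req, ServletResponse res,
                             FilterChain chain)
                throws IOException, ServletException {
            long start = System.currentTimeMillis();
            try {
                chain.doFilter(req, res); // run the actual service
            } finally {
                System.out.println("request handled in "
                    + (System.currentTimeMillis() - start) + " ms");
            }
        }
    }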

Finally, there are software products and libraries available for the sole purpose of logging. The Apache log4j framework is probably the most prominent example for the Java programming language. Since the arrival of Java 1.4, the JDK itself has contained a similar logging API [GS2003].

That being said, there is no way to avoid the fact that logging is ultimately the programmer's responsibility. Too much business logic is coded by hand instead of being automatically generated. The input and output for this business logic usually provides the most valuable insight into the cause of an error. With the advent of Model Driven Architecture (MDA) and highly sophisticated code generation, this might be automated in the future. Of course, it is very unlikely that even generated code is completely bug-free. Thus, an important part of code generation is sensibly placed log and audit statements.

Traditional midrange and mainframe applications usually provide a robust built-in logging infrastructure. For example, IBM's CICS offers logging using the CICS log manager. This infrastructure is used for all logging purposes, from basic user-level logs up to CICS transaction logging and recovery. This logging facility connects to the system-level MVS log streams. One of its most powerful features besides its robustness is its ability to consolidate log messages from different physical and logical machines that use IBM's Sysplex technology. Furthermore, logs can be automatically placed into different archive stages so that accessing and searching the most recent logs becomes easy, while at the same time, the full historical information is preserved.

Logging should ideally be configurable in a fine-grained manner. For example, frameworks such as log4j enable users to define how a particular log event is handled based on its origin and severity. Usually, this is used to log into different target files depending on the origin, to discard events of low severity, or to turn logging off completely for a particular origin. This is sensible from a performance perspective, but it can have the opposite effect in certain scenarios. Assume that the flight reservation service has the nasty behavior that, just sometimes, flight legs are not reserved properly. Fortunately, the programmers have included a high degree of logging in their code, and given that data, it would be quite easy to determine why this happens. Unfortunately, the flight reservation service is currently configured to log only events of severity "error" and above, while information at the "trace" level is needed to really figure out what is happening, and the system does not allow for runtime configuration. This means that the entire flight reservation service must be taken offline, the configuration must be changed, and the system must be brought back online. This can cause an unwanted disruption of service because (a) it is the holiday season and (b) it is 5 P.M., and customers, a lot of customers, just want to use this system.
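
For completeness, log4j itself does offer runtime reconfiguration; the following sketch shows a programmatic level change and the configureAndWatch mechanism, which rereads the configuration file periodically (logger and file names are illustrative):

    import org.apache.log4j.Level;
    import org.apache.log4j.Logger;
    import org.apache.log4j.PropertyConfigurator;

    /** Sketch: change log levels at runtime, so a service never has to be
     *  taken offline just to turn on trace output. */
    public class RuntimeLogConfig {
        public static void main(String[] args) {
            // Option 1: change a level programmatically, e.g., from an
            // administration console.
            Logger.getLogger("com.example.flightreservation").setLevel(Level.TRACE);

            // Option 2: re-read log4j.properties every 60 seconds, so that
            // editing the file reconfigures the running system.
            PropertyConfigurator.configureAndWatch("log4j.properties", 60 * 1000L);
        }
    }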

Runtime Configuration

If there is a requirement for fine-grained log configuration, take care to ensure that the settings can be changed at runtime. Otherwise, everything should be logged.



