Even after you build an application integration environment and it is running successfully, your work does not stop there. The environment will need to be monitored and maintained over time. This section discusses the various operational management considerations involved. Implementing the capabilities listed in this section should help to ensure that your architecture continues to function as smoothly as possible.
Operational management is a large topic that covers a broad range of technical and process-oriented issues. Many books, software products, and Web resources discuss the various aspects of operational management. In this guide, the discussion is restricted to the technically oriented aspects of operational management, specifically those that are relevant to application integration.
This guide does not discuss process-oriented considerations; however, effective operations require technology to provide system information, correct interpretation of that information, and processes to ensure adherence to an overall plan. A deficiency in any of these areas leads to poor operational management.
As with security, an important part of effective operations is to have a predefined operational management policy, which defines how operations occur throughout your organization. If you currently have a policy for operational management, you should make sure that your application integration environment meets the requirements specified there. If you do not have such a policy, designing your application integration environment represents an excellent opportunity to institute a more organized approach to operations.
Your operational management policy, at the minimum, should provide the following information:
Clearly defined terminologies and target metrics
Methodology and/or formulas for measuring the metrics to ensure that results are consistent
Service-level prioritization to ensure that the most important policies are followed first
You should make sure that any policy and service-level metrics clearly define terminology. For example, if you simply state that a system needs to be available 99.999 percent of the time, you have probably not defined your requirements precisely enough. When you examine the situation in more detail, you often find that the business unit requires very high availability only during business hours.
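The difference between the two definitions can be made concrete with a little arithmetic. The following is a minimal sketch, not a prescribed formula: the function name and the business-hours window (10 hours a day, 5 days a week, 52 weeks a year) are assumptions chosen only for illustration.

```python
def allowed_downtime_minutes(availability_percent: float, service_hours: float) -> float:
    """Return the downtime budget, in minutes, for a measurement window."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return service_hours * 60.0 * unavailable_fraction


# Window 1: every hour of a 365-day year.
HOURS_ALL_YEAR = 365 * 24        # 8760 hours
# Window 2 (hypothetical): 10 business hours a day, 5 days a week, 52 weeks.
HOURS_BUSINESS = 10 * 5 * 52     # 2600 hours

for label, hours in (("24x7", HOURS_ALL_YEAR), ("business hours", HOURS_BUSINESS)):
    budget = allowed_downtime_minutes(99.999, hours)
    print(f"{label}: {budget:.2f} minutes of downtime per year")
```

Under these assumptions the budget is about 5.26 minutes a year for the 24x7 window and about 1.56 minutes a year for the business-hours window. In the second case, any downtime outside business hours, such as overnight maintenance, does not count against the target at all, which is why the policy has to state the measurement window explicitly.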
A number of basic capabilities are usually required to ensure effective operational management in an application integration environment. Table 3.2 shows these capabilities.
Business Activity Management. Monitors, manages, and analyzes business transactions processed by the integration system.
Event Handling. Receives events and acts on them.
Configuration Management. Tracks hardware and software configuration.
Directory. Maintains information on applications, subscriptions, and services.
Change Management. Manages change within the application integration environment.
System Monitoring. Determines whether the hardware, operating system, and software applications are functioning as expected, and within the desired operating parameters or agreed-upon service levels.
For more information about each of these capabilities, see Appendix A, "Application Integration Capabilities."
As with security requirements, the precise operational management requirements of your application integration environment depend on a number of factors, including the following:
The operational management requirements of your organization
The business requirements for application integration
The technical requirements for application integration
The capabilities of the applications you are integrating
The platforms on which applications are running
The following paragraphs discuss operational management considerations that are common to many application integration environments.
Your application integration environment typically involves multiple applications running on multiple computer systems. To keep your environment running successfully, you must monitor the system to ensure that each element of the application integration environment is functioning properly and is meeting its performance goals.
One challenging aspect of system monitoring in an application integration environment is that information often is spread across multiple systems in different geographic locations. It is therefore particularly useful to have a system that can aggregate the performance data in one centralized location where it can be analyzed. Because data aggregation is one of the goals of application integration itself, you can often use the data capabilities of your application integration environment to support your System Monitoring capability.
Your system monitoring should focus on the following areas:
System and application health. Tracking the health status of the system and the application. A healthy system or application can perform its operations within expected parameters or agreed service levels.
System and application performance monitoring. Tracking the system and application response times for service requests.
Security monitoring. Monitoring security-related events and audit trails.
Service-level monitoring. Monitoring the system and application adherence to agreed or predefined service levels.
The following paragraphs discuss each type of system monitoring in more detail.
In an application integration environment, you need to ensure that the operating system, the applications, and the capabilities that facilitate application integration are all functioning properly. In some cases, you may need to design a special module within the integration logic to perform basic diagnostics of all the components it uses, or create a special test case using specific data (called from a probe) that returns a well-known result if the system is functioning properly.
In some cases, your applications may be instrumented and therefore may provide useful information about their health. However, in situations where an application is not generating that data, either because it is not configured to or because it is unable to, you can get further information by issuing a probing test from an external system or from another application within the system. By using regular probes, you can detect unavailable systems and raise an alert. Very frequent probes may affect system or application performance, but infrequent probes mean that you cannot quickly detect unavailable systems. Base the probe interval on the maximum time you are willing to let a failure go undetected. For example, if you set probes to occur 10 minutes apart, in the worst-case scenario the alert will not be generated until 10 minutes after the system becomes unavailable.
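The trade-off between probe frequency and detection delay can be sketched in a few lines. This is an illustration under stated assumptions, not a monitoring product: run_probes, worst_case_detection_delay, and their parameters are hypothetical names, the probe itself is any callable you supply, and the probe timeout added to the worst case is an assumption that goes slightly beyond the 10-minute example above.

```python
import time
from typing import Callable


def run_probes(probe: Callable[[], bool],
               alert: Callable[[str], None],
               interval_seconds: float,
               max_probes: int,
               sleep: Callable[[float], None] = time.sleep) -> int:
    """Call probe() every interval_seconds and raise an alert on each failure.

    Returns the number of failed probes. max_probes bounds the loop so that
    this sketch terminates; a real monitor would run until it is stopped.
    """
    failures = 0
    for attempt in range(max_probes):
        try:
            healthy = probe()
        except Exception:
            # A probe that raises (timeout, refused connection) counts as a failure.
            healthy = False
        if not healthy:
            failures += 1
            alert(f"probe {attempt + 1} failed")
        if attempt < max_probes - 1:
            sleep(interval_seconds)
    return failures


def worst_case_detection_delay(interval_seconds: float,
                               probe_timeout_seconds: float) -> float:
    """A failure that occurs just after a successful probe is noticed one
    interval later, plus the time the failing probe takes to time out."""
    return interval_seconds + probe_timeout_seconds
```

The probe callable could issue the special test case described earlier and compare the response with the well-known result. Passing sleep as a parameter keeps the loop easy to exercise without waiting for real time to pass.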
Often system health is thought of simply in terms of whether the system is up or down. In reality, however, health is not just a binary value. Just as human health is not just measured according to whether the person is dead or alive, a computer can be unhealthy and yet still function, albeit at reduced capacity.
For example, if your application is designed and tested to handle 100 concurrent users and respond within 2 seconds for requests, you can consider it healthy if it meets this criterion. If it takes 30 seconds to respond to requests, you will probably consider it to be unhealthy.
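A health check that reports more than up or down might look like the following sketch. The three states and the classify_health function are assumptions made for illustration; the 2-second target is taken from the example above, and a missing response is treated as an outage.

```python
from enum import Enum
from typing import Optional


class Health(Enum):
    HEALTHY = "healthy"      # responding within the agreed target
    DEGRADED = "degraded"    # responding, but slower than the target
    DOWN = "down"            # not responding at all


def classify_health(response_seconds: Optional[float],
                    target_seconds: float = 2.0) -> Health:
    """Classify one probe result; None means the probe got no response."""
    if response_seconds is None:
        return Health.DOWN
    if response_seconds <= target_seconds:
        return Health.HEALTHY
    return Health.DEGRADED
```

With these definitions, a 30-second response is reported as degraded rather than healthy, which gives operators a chance to act before the application fails outright.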
By detecting and diagnosing unhealthy applications early, you can prevent a system or application outage. There are numerous possibilities for systems to become unhealthy during operations, including:
Virus or malicious attack. A virus usually consumes system resources or alters system behavior. The impact of viruses can range from having no noticeable effect to rendering the system useless. Malicious attacks come in two forms: internal intrusion and external attacks. Internal intrusion is when the attacker tries to gain control and penetrate the system. This form of attack is usually difficult to detect because the attacker intends to be discreet. A problem may not be apparent from a system and application health perspective, even though a security breach has occurred. External attacks usually come in the form of denial of service (DoS). The main purpose of these attacks is to prevent the system from providing services. A system that is hit by a DoS attack can be considered as sick or dead, depending on whether it is still able to process requests. Installing antivirus software and ensuring that it is updated is a good start to battling viruses. However, you should also monitor system requests and probe your applications to ensure that your service levels are being met. If they are not, this is a potential indication that you are undergoing some form of malicious attack.
Unplanned increase in usage. A sudden and unforeseen increase in usage generally affects externally facing systems that provide services to anonymous service requesters. Such usage increases can render a system sick, because it was never designed to handle the increased load and still maintain the agreed-upon service level. If you provide artificial limits or queuing, you can allow additional loads to be handled in a predictable manner. Systems that create new threads to handle requests can be very susceptible to spikes (or storms). If a spike is large enough to cause the system to create an enormous number of threads, it can overload the operating system, which becomes so busy managing the threads that it cannot allocate resources to handle the actual requests.
Failure in resilient systems. To provide fault tolerance in your environment, you may use resilient systems, such as Web farms. If one of the servers fails in this type of environment, the system can still function at reduced capacity. It is extremely important that you detect failed servers and fix them as soon as possible so that the system can return to full capacity. You should make sure that you know exactly how many types and levels of failure your resilient systems can handle before the system itself will fail. The majority of Web farms implemented today use ping-level packets to check their servers, which means that an application can be unavailable and still receive requests. In these cases, failures can be very difficult to detect because the service to the end user does not always suffer in a predictable manner.
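The artificial limits and queuing mentioned above for unplanned increases in usage can be sketched as a fixed pool of worker threads fed by a bounded queue. This is a minimal illustration, assuming a hypothetical BoundedRequestPool class; it is not taken from any particular product. Instead of creating a thread per request, the pool rejects work once the queue is full, so the load on the operating system stays predictable during a spike.

```python
import queue
import threading


class BoundedRequestPool:
    """Fixed number of worker threads fed by a queue of limited size."""

    def __init__(self, workers: int, max_pending: int, handler):
        self._pending = queue.Queue(maxsize=max_pending)
        self._handler = handler
        # The thread count is fixed up front; a spike never creates more.
        self._threads = [
            threading.Thread(target=self._work, daemon=True)
            for _ in range(workers)
        ]
        for thread in self._threads:
            thread.start()

    def submit(self, request) -> bool:
        """Queue a request; return False if the queue is already full."""
        try:
            self._pending.put_nowait(request)
            return True
        except queue.Full:
            return False

    def _work(self):
        while True:
            request = self._pending.get()
            try:
                self._handler(request)
            finally:
                self._pending.task_done()

    def wait_until_idle(self):
        """Block until every accepted request has been handled."""
        self._pending.join()
```

A caller that receives False from submit can return a "busy, try again later" response immediately. Rejecting early in this way is one design choice; blocking the caller for a bounded time is another, and which one fits depends on your service levels.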
Integration applications generally rely quite heavily on message-based communications. This type of asynchronous communication provides good scalability, but it can also make it difficult for you to detect failures in communication paths. Most message-based products use dead letter queues to inform the application if any messages fail to arrive or fail to be consumed by the receiving system.
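The dead letter queue idea can be illustrated with an in-memory sketch. Real messaging products implement dead letter queues themselves, with their own configuration and semantics; the Message class, the consume function, and the max_attempts limit below are assumptions used only to show the pattern.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Message:
    body: str
    attempts: int = 0


def consume(inbox: deque, dead_letters: deque, handler, max_attempts: int = 3) -> int:
    """Drain the inbox and return the number of messages handled successfully.

    A message whose handler keeps failing is retried up to max_attempts
    times and is then moved to the dead letter queue instead of being lost.
    """
    handled = 0
    while inbox:
        message = inbox.popleft()
        message.attempts += 1
        try:
            handler(message.body)
            handled += 1
        except Exception:
            if message.attempts >= max_attempts:
                dead_letters.append(message)
            else:
                inbox.append(message)  # retry later, behind the other messages
    return handled
```

From a monitoring point of view, the useful signal is the depth of the dead letter queue: a queue that is no longer empty indicates a communication path or a receiving system that needs attention.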
You should use performance monitoring to ensure that application integration is adhering to agreed-upon or predefined service levels. Performance monitoring also gives you an early indication of potential failures or future capacity issues. For integration applications, the two main areas of performance monitoring are:
System response time measurements
Resource usage measurements
By measuring system responses for all aspects of a request, you can help to ensure early detection of potential bottlenecks. It is fairly easy to measure performance at a general level, such as the average number of requests per second and the average amount of time it takes to process a request. However, one of the more useful items you can track through performance monitoring is delayed responses to requests as they go through the system. Doing so allows you to determine whether a delay was the result of a particular request type or of other factors. For example, in a purchase order handling system that relies on a number of back-end systems, a purchase order with more than 100 items that calls back-end system A may result in an unusually slow response from that system. Without adequate monitoring and tracking, you would find it very difficult to trace such problems through pattern analysis.
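Tracking a request as it moves through the system can be sketched by timing each step separately. The RequestTrace class, the slow_steps function, and the step names are hypothetical; the clock is passed in so that the sketch can be exercised with fixed values.

```python
import time
from contextlib import contextmanager


class RequestTrace:
    """Records how long one request spends in each named step."""

    def __init__(self, request_id, clock=time.perf_counter):
        self.request_id = request_id
        self.step_seconds = {}
        self._clock = clock

    @contextmanager
    def step(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the step raised an exception.
            self.step_seconds[name] = self._clock() - start


def slow_steps(traces, threshold_seconds):
    """Return (request_id, step, seconds) for each step slower than the threshold."""
    return [
        (trace.request_id, name, seconds)
        for trace in traces
        for name, seconds in trace.step_seconds.items()
        if seconds > threshold_seconds
    ]
```

A typical use wraps each call in a step, for example `with trace.step("backend_a"): ...`. If you also store request attributes such as the number of items alongside each trace, you can look for the kind of pattern described above, where only large orders sent to one back-end system are slow.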
Resource usage measurements help you to determine whether your systems have adequate resources to run at their full potential. Inadequate resources can lead to resource contention, where the lack of resources causes degradation to system performance. If you keep historical information on resource usage, you can track and predict increases in resource usage.
You can also use longer-term observations of performance and resource usage to establish usable baselines. This technique is particularly useful if you need to increase performance or reduce resource usage.
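Predicting resource usage from historical data can be as simple as fitting a straight line to past measurements. The sketch below uses an ordinary least-squares fit; the function names and the sample figures are assumptions, and real usage is rarely perfectly linear, so treat the result as a rough planning aid.

```python
from statistics import mean


def linear_trend(samples):
    """Least-squares slope and intercept for a list of (x, y) samples."""
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    x_bar, y_bar = mean(xs), mean(ys)
    spread = sum((x - x_bar) ** 2 for x in xs)
    if spread == 0:
        raise ValueError("need samples taken at two or more different times")
    slope = sum((x - x_bar) * (y - y_bar) for x, y in samples) / spread
    return slope, y_bar - slope * x_bar


def time_limit_reached(samples, limit):
    """Estimate the x value at which usage reaches limit, or None if usage is not growing."""
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    return (limit - intercept) / slope


# Hypothetical history: (week number, disk usage in percent).
history = [(0, 40), (1, 42), (2, 44), (3, 46)]
print(time_limit_reached(history, 80))
```

For this hypothetical history, usage grows by 2 percentage points a week from a baseline of 40 percent, so the 80 percent limit is reached around week 20.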
Your application integration environment generally requires secure communications between applications within your organization. As mentioned earlier in this chapter, security is vital in all integration application design. However, the security of your systems depends not only on the design and implementation of the software, but also on appropriate monitoring and auditing capabilities. As an example, imagine an operating system that does not provide the ability to track failed logons. It would be very difficult to determine whether someone is trying to attack a particular account, because system operators would have no way of detecting such attacks.
At the minimum, your security monitoring should provide the following capabilities:
Logon audit log. For tracking logon information. You may want to track all logons, but you should certainly track all unsuccessful logons. The integrity of the security audit log is paramount. The log content should be kept secured with the data available only in read-only mode and accessible to the minimum number of people.
Data access log. For tracking all access to the data repository. This capability is very important if the original requester identity is used to access the data. As mentioned earlier in this chapter, some systems provide the ability to perform single sign-on or impersonation. The value of the data access log diminishes if the system consolidates the various requester identities into a single identity used to access the data.
Security policy modification log. For tracking all changes made to the security policy. This capability is important in detecting changes that relax the security policy, including changes due to human error and changes by hackers or disgruntled operators.
Alert mechanism. An active alerting mechanism is needed to flag suspicious or unusual occurrences. Simply relying on logging is not adequate because the system can generate a large volume of log information. You should provide a rule-based alerting mechanism that allows critical or important analysis to be done automatically.
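A rule-based alert of the kind described in the last item can be sketched as a sliding-window count of failed logons per account. The FailedLogonRule class and its thresholds are hypothetical; a real rule engine would evaluate many such rules against the audit logs.

```python
from collections import defaultdict, deque


class FailedLogonRule:
    """Alert when an account has too many failed logons within a time window."""

    def __init__(self, max_failures: int, window_seconds: float):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = defaultdict(deque)  # account -> recent failure times

    def record_failure(self, account: str, timestamp: float) -> bool:
        """Record a failed logon; return True when an alert should be raised."""
        recent = self._failures[account]
        recent.append(timestamp)
        # Drop failures that have fallen out of the window.
        while recent and timestamp - recent[0] > self.window_seconds:
            recent.popleft()
        return len(recent) >= self.max_failures
```

For example, `FailedLogonRule(max_failures=3, window_seconds=60)` stays silent for isolated mistyped passwords but flags three failures against the same account within one minute.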
One very important part of security monitoring is ensuring that the security logs themselves are secure. You must ensure that security logs are accessible only to authorized personnel and that the information captured cannot be modified. Solutions for protecting the log information can involve storing the information on a read-only device and also providing signatures, as a secondary measure, to allow verification of information integrity.
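One way to let a reviewer verify the integrity of log information is to chain a keyed hash through the entries. The sketch below uses HMAC-SHA256 from the Python standard library as a stand-in for the signatures mentioned above; it is not a full digital signature scheme, the function names are hypothetical, and it assumes that the key is stored separately from the log.

```python
import hashlib
import hmac


def sign_entry(key: bytes, previous_signature: str, entry: str) -> str:
    """Sign an entry together with the signature of the entry before it."""
    message = previous_signature.encode("utf-8") + b"\n" + entry.encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def sign_log(key: bytes, entries):
    """Return a list of (entry, signature) pairs forming a chain."""
    signed, previous = [], ""
    for entry in entries:
        previous = sign_entry(key, previous, entry)
        signed.append((entry, previous))
    return signed


def verify_log(key: bytes, signed) -> bool:
    """Return True only if no entry was altered, removed, or reordered."""
    previous = ""
    for entry, signature in signed:
        expected = sign_entry(key, previous, entry)
        if not hmac.compare_digest(expected, signature):
            return False
        previous = signature
    return True
```

Because each signature covers the previous one, changing or deleting an entry in the middle of the log invalidates every signature after it. Truncation at the end of the log is not detected by this sketch and would need a separate record of the latest signature.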
Business Activity Management probably offers the greatest potential to clearly demonstrate return on investment to the business owners. Nonetheless, it is one of the areas most often left out of integration environments. The short development time for many IT projects often causes the monitoring aspects of a system to be designed last or never designed at all, because they are viewed as an additional benefit. To provide efficient and valuable Business Activity Management capabilities, you should ensure that the correct information is captured during design and development.
To provide you with added value in the longer term, your Business Activity Management should at the minimum provide the following capabilities:
Business transaction exception handling. This capability allows you to handle transactions that generate business-level exceptions. For example, suppose your system has paused the processing of a loan application because the credit rating of the applicant was borderline, but after checking manually, your loan officer has decided to approve the loan. Rather than rejecting the application and forcing the applicant to start again from the beginning, your system should provide the capability for the loan officer to reroute the transaction.
Contextual monitoring. You should be able to track the progress of any business transaction through the process chain. Providing response times at each level of the business step (whether it was processed by a system or by a person) allows you to determine any cause of delays in the process.
Rules-based alerting. This capability allows you to generate alerts due to business events. This means that you can detect anomalies and potential delays in processing. Early detection of potential delays gives you the opportunity to contact the party that originated the transaction and inform them of the problem in advance, or to fix the problem before it affects them.
Historical data mining. This capability allows you to capture useful business process information, such as the time it took to process each process step, the data sources for the business process, and the next step in the process. You can analyze the information and then modify your business processes if needed. Generally, the more information you can capture the better, because having more information means that there are more ways to take apart and analyze the data. It is also helpful because you cannot be sure now what information will be useful in the future.
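Contextual monitoring and rules-based alerting can be combined in a small sketch that records when a business transaction enters each step and flags steps that exceed a time limit. The BusinessTransaction class, the delayed function, the step names, and the limits are all hypothetical and are used only to illustrate the capabilities listed above.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessTransaction:
    transaction_id: str
    # (step name, time entered in seconds), in the order the steps were entered
    history: list = field(default_factory=list)

    def enter_step(self, step: str, at: float) -> None:
        self.history.append((step, at))

    def step_durations(self, now: float):
        """Seconds spent in each step; the current step is measured up to now."""
        durations = []
        for index, (step, entered) in enumerate(self.history):
            is_last = index + 1 == len(self.history)
            left = now if is_last else self.history[index + 1][1]
            durations.append((step, left - entered))
        return durations


def delayed(transactions, limits, now: float):
    """Return (transaction_id, step, seconds) for every step over its limit."""
    return [
        (txn.transaction_id, step, seconds)
        for txn in transactions
        for step, seconds in txn.step_durations(now)
        if step in limits and seconds > limits[step]
    ]
```

Because each step is timed whether a system or a person performs it, the same data answers both "where is this transaction now?" and "which step is causing the delay?". Keeping the recorded history also gives you raw material for later analysis.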
In an application integration environment, if one application raises an event, it often leads to actions in other applications. An unexpected event or exception in one application may lead to the failure of another application, for example. To ensure that your application integration environment is stable, you should have the capabilities to receive system events from your applications and take actions to ensure that other applications and systems react appropriately to maintain service.
In many cases, an exception at the system level leads to particular failures at the business process level. You should therefore have capabilities for dealing with events at the business process level as well. However, it is often useful for the events at the system level to be passed up to the business process level, because a system event may well be the first indication of a potential failure at the business process level.
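Passing system events up to the business process level requires some record of which business processes depend on which systems. The following sketch assumes such a mapping exists; the affected_processes function, the event layout, and the process and component names are hypothetical.

```python
def affected_processes(system_event: dict, dependencies: dict) -> list:
    """Return the business processes that depend on the component in the event.

    dependencies maps a business process name to the set of system
    components that it relies on.
    """
    component = system_event["component"]
    return sorted(
        process
        for process, components in dependencies.items()
        if component in components
    )


# Hypothetical dependency map and system event.
dependencies = {
    "loan approval": {"credit-db", "message-broker"},
    "order handling": {"inventory-db", "message-broker"},
    "payroll": {"hr-db"},
}
event = {"component": "message-broker", "type": "queue depth warning"}
print(affected_processes(event, dependencies))
```

Here a single warning from the message broker is reported against both the loan approval and the order handling processes, before either has actually failed.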
Application integration environments are notoriously difficult to manage, because they tend to involve increasing numbers of applications communicating on multiple disparate systems. You should therefore consider the benefits that good change and configuration management bring to your application integration environment.
Implementing change and configuration management effectively is a major project in itself, which can generate significant initial costs. Fortunately, however, you will complete some of the significant work required for effective change and configuration management as you define your application integration environment—for example, developing an understanding of how applications communicate with each other and which systems they run on. If you can define the requirements of your change and configuration management system prior to your work on defining your application integration requirements, you can significantly reduce the costs of implementing change and configuration management.
Your application integration environment may contain multiple directories that contain information about identities, profiles, subscriptions (used in a publish/subscribe scenario), application configuration, and capabilities. Alternatively, this information may all be located in separate parts of a single directory. As you define your application integration environment, you should determine which of this information you need to store and where you will store it.
Most modern operating systems are based on an extensible directory that can be used to store the information required by application integration. Such a directory can be particularly useful in situations where your organization uses one operating system. However, in cases where multiple operating systems are used, it is possible to synchronize each directory into a meta-directory. Alternatively, you may want to store your directory information separate from the operating system and use the capabilities of your application integration environment itself to facilitate replication.