Chapter 6. Real-Time Operations

The effectiveness and accuracy of real-time operations directly affect compliance with Service Level Agreements (SLAs). The time taken to detect a problem, determine its cause, and take corrective action is the time during which service quality is at risk and SLA violations may occur.

The demand for higher service quality shrinks the time allowance for responding to the actual variations in service behavior. Every SLA has time-based metrics. For example, availability metrics are all about time: the available uptime each month; the total of outages (downtime); the time between outages; and the duration of each outage. Transaction completion times are another example of a time metric. The emerging metrics for measuring provider Quality of Service (QoS), such as service activation time or responses to trouble tickets, are also time-based.
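The availability metrics listed above can all be derived from a record of outage start and end times. The following is a minimal sketch under stated assumptions: the function name `availability_metrics`, the dictionary keys, and the sample outage times are hypothetical and chosen only for illustration, and outages are assumed not to overlap and to fall entirely inside the reporting period.

```python
from datetime import datetime, timedelta


def availability_metrics(outages, period_start, period_end):
    """Summarize time-based availability metrics for one reporting period.

    outages: list of (start, end) datetime pairs. Assumed non-overlapping
    and fully contained in the reporting period (illustrative assumption).
    """
    outages = sorted(outages)
    # Duration of each outage
    durations = [end - start for start, end in outages]
    # Total downtime is the sum of all outage durations
    downtime = sum(durations, timedelta())
    period = period_end - period_start
    uptime = period - downtime
    # Time between outages: end of one outage to the start of the next
    gaps = [outages[i + 1][0] - outages[i][1] for i in range(len(outages) - 1)]
    return {
        "uptime": uptime,
        "downtime": downtime,
        "availability_pct": 100.0 * (uptime / period),
        "outage_durations": durations,
        "time_between_outages": gaps,
    }


# Hypothetical month with two outages (30 minutes and 90 minutes)
sample_outages = [
    (datetime(2003, 4, 5, 2, 0), datetime(2003, 4, 5, 2, 30)),
    (datetime(2003, 4, 20, 14, 0), datetime(2003, 4, 20, 15, 30)),
]
metrics = availability_metrics(
    sample_outages, datetime(2003, 4, 1), datetime(2003, 5, 1)
)
print(metrics["downtime"], round(metrics["availability_pct"], 2))
```

In this sample, two hours of downtime in a 720-hour month gives an availability of about 99.72 percent, which shows how little outage time a demanding availability target leaves for detection and repair.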

Real-time operations management covers time-sensitive tasks such as monitoring, analyzing, and responding to potential service disruptions. Real-time operations management tools are a core part of most commercial system management consoles.

To illustrate real-time operations management, consider Figure 6-1. As shown, the real-time operations manager receives alert input from the real-time event management module and time-sliced measurement input from the SLA statistics module. The real-time operations manager typically contains the sophisticated analysis tools that evaluate the incoming alerts and SLA measurements, identifying problems and proposing solutions.

Figure 6-1. Real-Time Operations Architecture


NOTE

Alerts indicate failures, conditions in which SLA compliance is compromised, or situations where compliance may be compromised in the future. Alerts are often unpredictable; they respond to dynamic behavior that is influenced by many factors.


Time-sliced, periodic measurements are used for managing and reporting on service quality. The real-time operations manager must continuously update its assessment of system behavior and take further actions as needed. It may generate internal alerts if it detects that the time-sliced measurements are straying over predetermined thresholds.
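The threshold check described above might look something like the following sketch. It is not taken from any particular product: the `Measurement` and `InternalAlert` types, the `check_thresholds` function, the metric names, and the threshold values are all hypothetical, and a simple "value exceeds upper limit" rule is assumed.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    metric: str        # hypothetical metric name, e.g. "response_time_ms"
    interval_end: str  # label identifying the time slice
    value: float


@dataclass
class InternalAlert:
    metric: str
    interval_end: str
    value: float
    threshold: float


def check_thresholds(measurements, thresholds):
    """Return one internal alert per measurement that exceeds its threshold.

    thresholds maps a metric name to its upper limit; metrics without a
    configured limit are ignored (an assumption made for this sketch).
    """
    alerts = []
    for m in measurements:
        limit = thresholds.get(m.metric)
        if limit is not None and m.value > limit:
            alerts.append(InternalAlert(m.metric, m.interval_end, m.value, limit))
    return alerts


# Hypothetical time-sliced input from an SLA statistics module
samples = [
    Measurement("response_time_ms", "10:00", 180.0),
    Measurement("response_time_ms", "10:05", 420.0),
    Measurement("error_rate_pct", "10:05", 0.2),
]
limits = {"response_time_ms": 300.0, "error_rate_pct": 1.0}

for alert in check_thresholds(samples, limits):
    print(f"{alert.metric} at {alert.interval_end}: {alert.value} > {alert.threshold}")
```

A real operations manager would usually apply more than a single static limit, for example by requiring several consecutive violations before raising an alert, but the basic flow of measurement in, comparison, alert out is the same.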

Automated responses can be activated directly by alerts or after other functions, such as root-cause analysis, have performed their tasks. Automated responses reduce both the time needed to handle a situation and the potential for errors and misjudgments. Many routine issues can be handled by automation; where issues cannot be mitigated automatically, automated analysis, even if only partial, can assist the human troubleshooters.
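One way to picture this division of labor is a dispatcher that runs an automated action when one is registered for the alert type and otherwise hands the alert, together with whatever analysis is available, to staff. This is only an illustrative sketch: `handle_alert`, `restart_service`, `AUTOMATED_RESPONSES`, the alert fields, and the alert type names are all hypothetical.

```python
def restart_service(alert):
    # Placeholder for a real corrective action (hypothetical)
    return f"restarted {alert['service']}"


# Hypothetical registry mapping alert types to automated actions
AUTOMATED_RESPONSES = {"process_down": restart_service}


def handle_alert(alert, analyze=None):
    """Route an alert to an automated response or to human staff.

    analyze: optional callable standing in for a root-cause analysis step;
    its findings accompany the alert whether or not automation applies.
    """
    findings = analyze(alert) if analyze else None
    action = AUTOMATED_RESPONSES.get(alert["type"])
    if action is not None:
        # Routine issue: handle it without waiting for a person
        return {"handled_by": "automation", "result": action(alert), "findings": findings}
    # No automated mitigation: escalate, passing along any partial analysis
    return {"handled_by": "staff", "result": None, "findings": findings}


def simple_analysis(alert):
    # Stand-in for partial automated analysis (hypothetical)
    return f"suspect component: {alert['service']}"


print(handle_alert({"type": "process_down", "service": "web-01"}))
print(handle_alert({"type": "slow_response", "service": "db-02"}, analyze=simple_analysis))
```

The second call shows the case where no automated action exists: the alert is escalated, but the partial findings still travel with it so the troubleshooter does not start from nothing.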

NOTE

Another way of compensating for relatively slow human troubleshooting speeds is to increase redundancy and capacity to handle failures and congestion while the analysis proceeds.


The function of real-time operations management is to help staff reduce Mean Time To Repair (MTTR) when incidents occur and to increase Mean Time Between Failures (MTBF) whenever possible through proactive prediction of difficulties. This chapter discusses three basic methods for achieving those goals:

  • Reactive management

  • Proactive management

  • Automated responses

These methods are discussed in order in this chapter, followed by illustrative descriptions of some major commercial real-time operations managers, including response managers for denial-of-service attacks.




Practical Service Level Management: Delivering High-Quality Web-Based Services
ISBN: 158705079X
Year: 2003
Pages: 128