The Need for Policies

Different factors are driving service providers and enterprises to consider implementing policy-based management systems. These factors include the following:

Staffing costs: Economic pressures to show strong bottom line results work against the need to hire expensive expert staff.
Growing complexity: The managed environment itself is becoming more complex as more technologies and organizations are blended into online business processes. Complexity soon outstrips the ability of staff to understand the situation and take the appropriate action within the time limits imposed by an SLA.
Growing awareness: Service failures tend to draw media attention, risk customer loss, and make corporate management unhappy because service delivery is more and more tightly coupled to the bottom line.
The need for a knowledge repository: Policies capture experience and expertise. This knowledge persists in the face of staff turnover and lack of immediate availability.

Table 7-2 illustrates the major differences between the service and element policy domains. For example, high levels of redundancy can absorb element failures without substantially degrading service quality. More sophisticated policies are needed for managing a complex and dynamic service environment.

Table 7-2. Summary of the Two Policy Domains
Element-Centric	Service-Centric
Applied to single elements	Applied to services
Applied within one infrastructure	Applied across infrastructures
Relatively simple	Relatively complex

The next two sections discuss management policies for elements and for entire services.

Management Policies for Elements

Early management policies were associated with managing elements within the various infrastructures. They were vendor-specific for the most part and dealt with relatively simple situations. For example, a network switch could have a policy that says, "If any port has a utilization level greater than this threshold, send an alert to the element manager." A more complex policy might add local actions, such as, "If the broadcast traffic on any port exceeds the threshold, disable the port and send an alert."
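The two switch policies quoted above can be sketched in a few lines of Python. This is only an illustration, not any vendor's policy language: the Port class, the threshold values, and the function name are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Port:
    """Hypothetical view of one switch port (not a real vendor API)."""
    name: str
    utilization: float      # fraction of link capacity in use, 0.0 to 1.0
    broadcast_rate: float   # broadcast frames per second
    enabled: bool = True


# Assumed thresholds; real values depend on the device and the site.
UTILIZATION_THRESHOLD = 0.80
BROADCAST_THRESHOLD = 1000.0


def apply_port_policies(ports):
    """Evaluate both example policies and return the alerts to send."""
    alerts = []
    for port in ports:
        # Simple policy: alert only.
        if port.utilization > UTILIZATION_THRESHOLD:
            alerts.append(
                f"{port.name}: utilization {port.utilization:.0%} over threshold"
            )
        # More complex policy: take a local action, then alert.
        if port.broadcast_rate > BROADCAST_THRESHOLD:
            port.enabled = False
            alerts.append(f"{port.name}: broadcast threshold exceeded, port disabled")
    return alerts
```

In this sketch the alerts are returned as a list so that the caller decides how to deliver them to the element manager.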

Simple policies are not exclusively applied to network elements. A policy applied to servers, for example, could specify that if a process dies, the management system should send an alert and restart the process. If that fails, the management system should try three more times and then reboot the server while sending another alert.
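The server policy just described amounts to a bounded retry loop with an escalation step. A minimal sketch follows, assuming the caller supplies three hypothetical callbacks (restart, reboot, and send_alert); none of these names comes from a real management product.

```python
def handle_process_death(restart, reboot, send_alert, extra_attempts=3):
    """Apply the restart policy and return 'restarted' or 'rebooted'.

    restart() is assumed to return True when the process comes back up.
    """
    send_alert("process died; attempting restart")
    # One initial attempt plus the additional retries named in the policy.
    for _ in range(1 + extra_attempts):
        if restart():
            return "restarted"
    # Every attempt failed: escalate with a second alert and a reboot.
    send_alert("restart failed; rebooting server")
    reboot()
    return "rebooted"
```

Passing the actions in as callbacks keeps the policy logic separate from the platform-specific commands that actually restart a process or reboot a server.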

Management policies for elements are good for speeding up many management responses and for preventing staff mistakes. Management staff involvement is needed only if the policy actions fail to restore service levels.

Most element management policies are configuration-centric because they define specific configuration information for each element to satisfy higher-level rules. Each vendor has its own way of setting operational parameters, which makes the job even harder if staff members are forced to remember all the vendor-specific details. Some companies, such as MetaSolv Software, have created management products that work with equipment from a range of vendors.

Policies give large environments that have many devices, sites, and users a consistent way to handle element configuration. This approach scales gracefully as the environment grows. In addition, staff are freed from element-specific details and are involved only if a policy fails.

While freeing administrators from a plethora of low-level decisions and reducing the likelihood of error are attractive, it is important to remember that the best results are obtained when policy management has unambiguous input. Elements that have very clear management instrumentation and a limited set of configuration options are the best candidates for applying automated policies.

Conversely, elements such as high-end operating systems, application servers, and other parts of the service delivery architecture don't always expose their management information clearly. This situation makes automated decisions less clear-cut. The multiple layers of complexity inside some elements, such as servers, also make tuning them a challenge. The pressure is on policy designers to incorporate those subtleties to get the most from a policy-based approach.

Service-Centric Policies

This policy category deals with service-quality issues rather than element behavior. Such policies are inherently more complex, and they can span several infrastructures. Most importantly, service-centric policies are targeted as much toward achieving business aims as toward maintaining technical performance. For example, such policies can focus on minimizing penalties or on determining how the affected customers are treated.

Let's look at an example to clarify the differences between element- and service-centric policies. Consider a provider using a tiered server farm to speed transaction flows. The redundancy of the farm means that a single server failure does not immediately impact service availability, but the failure begins to expose the site to performance problems if the remaining servers are approaching their loading limits. A service-centric policy for this situation focuses on maintaining adequate server capacity rather than on responding in detail to the failure of any one server in the farm.

The policy actions taken when a server fails can include the following:

Check the other servers on the tier: Is their load after the failure still under the defined threshold? For example, four servers each running at a 25 percent load become three servers each running at about a 33 percent load.
Check the load: If the load is acceptable for now, send an alert and wait for the staff to take further action. If the load is too high, increase the severity of the alert and page the server manager.
Check the pool of standby servers: Determine the appropriate candidate to replace the failed server, based on matching resources, required software, and proximity to the failed server.
Check the number of servers in the standby pool: If the pool is depleted below a threshold value, send a high-priority alert to the event management system.
Provide detailed reports: The reports should cover the steps taken and warn of imminent problems. The policy should also generate a problem ticket for repair of the failed server.
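The policy actions listed above can be sketched as a single handler. The code below is a simplified illustration under stated assumptions: the Tier class, the threshold values, and the action strings are hypothetical, and the standby candidate is simply the first one in the pool rather than the best match on resources, software, and proximity.

```python
from dataclasses import dataclass, field


@dataclass
class Tier:
    """Hypothetical model of one server tier."""
    active: int              # servers in service before the failure
    load_per_server: float   # average load fraction on each active server
    standby: list = field(default_factory=list)  # names of standby servers


LOAD_THRESHOLD = 0.70   # assumed maximum acceptable load per server
MIN_STANDBY = 2         # assumed minimum size of the standby pool


def load_after_failure(tier, failed=1):
    """Spread the tier's total load over the surviving servers."""
    remaining = tier.active - failed
    if remaining <= 0:
        return float("inf")  # no servers left to carry the load
    return tier.active * tier.load_per_server / remaining


def on_server_failure(tier):
    """Return the ordered list of actions the policy would take."""
    actions = []
    if load_after_failure(tier) <= LOAD_THRESHOLD:
        actions.append("alert: server failed, load acceptable for now")
    else:
        actions.append("page server manager: load above threshold")
    if tier.standby:
        # Simplification: take the first candidate in the pool.
        actions.append(f"activate standby {tier.standby.pop(0)}")
    if len(tier.standby) < MIN_STANDBY:
        actions.append("high-priority alert: standby pool depleted")
    actions.append("open problem ticket and report the steps taken")
    return actions
```

With four servers at a 25 percent load, load_after_failure returns one third, which matches the 33 percent figure in the first step.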

Other information can be used to increase the intelligence of the response. For instance, there can be a check to see if there are imminent load changes. The alert could then provide more information, such as whether the remaining servers in the tier are operating under threshold now and whether the afternoon traffic surge is 30 minutes away. This gives the staff better information and indicates that attention is needed to avoid compounding the problems.
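One way to picture this enrichment is a small function that appends the threshold status and any upcoming surge to the alert text. This is a sketch only: the function name, the one-hour lookahead window, and the idea that the surge time comes from a known traffic schedule are all assumptions.

```python
from datetime import datetime, timedelta


def enrich_alert(message, now, next_surge, under_threshold,
                 lookahead=timedelta(hours=1)):
    """Add load status and any imminent traffic surge to an alert message."""
    parts = [message]
    if under_threshold:
        parts.append("remaining servers under threshold")
    else:
        parts.append("remaining servers over threshold")
    until_surge = next_surge - now
    # Mention the surge only if it falls inside the lookahead window.
    if timedelta(0) <= until_surge <= lookahead:
        minutes = int(until_surge.total_seconds() // 60)
        parts.append(f"traffic surge expected in {minutes} minutes")
    return "; ".join(parts)
```

For example, an alert raised at 13:30 with a surge scheduled for 14:00 would carry the note that the surge is 30 minutes away.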
