One of the goals of any Information Technology (IT) operations team is to maximize the availability of business-critical IT services, and one of the keys to maximizing availability is to ensure that the network and the infrastructure are reliable. I have had the opportunity to help many different companies over the years, and all Chief Information Officers and IT managers eventually ask about infrastructure reliability, that is, how to reduce unplanned downtime. Unfortunately, these managers sometimes attempt to apply a solution without fully understanding the problem. For example, they may want to implement clusters or Microsoft Windows Datacenter in the belief that this alone will solve their reliability problems. However, hardware reliability is just one cause of unplanned downtime, and it is not really the most significant one.
Various organizations have conducted surveys to gather information about the causes of unexpected downtime. For example, Figure 2.1 shows results collected by Ontrack Data International Incorporated (http://www.ontrack.com/).
Figure 2.1: Ontrack Data International Incorporated survey
Hardware or system malfunctions accounted for 44% of downtime. These include incidents such as electrical failures, head crashes, and controller failures.
Human error accounted for 32% of downtime. This includes accidental deletion of critical files, inadvertent drive formatting, dropping disk drives, poor network architectures, and sloppy data center procedures.
Software corruption accounted for 14% of downtime. This includes corruption caused by improper use of diagnostic or repair tools, failed backups, and overly complex configurations.
Computer viruses accounted for 7% of downtime.
Natural disasters accounted for 3% of downtime.
Figure 2.2 shows results collected for a Gartner Group survey. This survey used slightly different categories and drew slightly different conclusions than the Ontrack Data International Incorporated survey.
Figure 2.2: Gartner Group survey
Application failure accounted for 40% of downtime. This includes the use of untested applications, poor change management, overloaded systems, and weak problem detection.
Operator errors accounted for another 40% of downtime. These are caused by a lack of procedures, operator forgetfulness, backup errors, and security leaks.
The remaining 20% of downtime was caused by factors such as hardware problems, network problems, power loss, and natural disasters.
One area of consistency in all surveys is the high amount of downtime caused by operator error, that is, sloppy data center procedures that make business-critical IT services unavailable for hours. Because these problems directly affect the productivity and availability of the IT system, it is imperative to implement some mechanism for handling them. There is no hardware solution to this problem; only solid operational procedures will help. Good operational procedures enable you to maximize the availability and reliability of your network and your infrastructure.
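As a quick cross-check of the two surveys, the percentages quoted above can be tabulated side by side. This is only an illustrative sketch: the dictionary names and the helper function are hypothetical, and treating Ontrack's "Human error" and Gartner's "Operator error" as the same category is an assumption made for comparison, not something either survey states.

```python
# Percentages of downtime as quoted in the text for each survey.
ONTRACK = {
    "Hardware or system malfunction": 44,
    "Human error": 32,
    "Software corruption": 14,
    "Computer viruses": 7,
    "Natural disasters": 3,
}
GARTNER = {
    "Application failure": 40,
    "Operator error": 40,
    "Other (hardware, network, power, natural disasters)": 20,
}


def share(survey: dict[str, int], category: str) -> int:
    """Return the percentage of downtime attributed to one category."""
    return survey[category]


# Each survey should account for all of the downtime it reports.
ontrack_total = sum(ONTRACK.values())
gartner_total = sum(GARTNER.values())

# Assumption: these two categories describe roughly the same cause.
people_related = {
    "Ontrack": share(ONTRACK, "Human error"),
    "Gartner": share(GARTNER, "Operator error"),
}

for name, pct in people_related.items():
    print(f"{name}: {pct}% of downtime attributed to human or operator error")
```

In both surveys, roughly a third or more of the reported downtime falls into the category that only operational procedures can address.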
Companies depend on their IT infrastructures to support mission-critical business operations. Effective IT operations are the key to supplying a reliable, high-quality IT infrastructure, and companies invest significant time and money to deliver the service levels required to meet their business obligations. Because business requirements and technology constantly change, these companies also invest heavily and plan carefully so that the environment can evolve without disrupting the production cycle.
This careful planning and heavy investment are common in most traditional mainframe data centers, where the business applications are run according to best-practice standards. Most companies manage their corporate business data using documented processes, strict security, automated procedures, and documented Service Level Agreements (SLAs).
Unfortunately, the same IT departments that are so careful with their mainframe production environment are not quite so careful with their own Microsoft-based environment. For the Microsoft infrastructure, they still rely on stand-alone administration that is disconnected from the rigorous discipline used for the production environment.
However, several changes are causing IT departments to reconsider the way they manage their Microsoft infrastructure. The two primary reasons for this change in attitude are:
Companies are increasing their use of Windows-based servers as platforms for mission-critical applications.
Server consolidation increases the number of users affected if a server fails. This increases the need for stability and reliability.
To begin to understand operations frameworks, it helps to examine how enterprises manage their mainframe production environments. What is different in the mainframe production environment? Primarily, it is discipline.
The Chief Information Officer has a holistic view of the mainframe production environment. It does not matter how well individual servers are working if the network is down or if key applications are not working. The primary measurement is whether the mission-critical applications are available to meet the needs of the business units. These companies measure their success on the basis of meeting SLAs that they have negotiated with the business units. The company reviews and measures the IT department against these SLAs on a daily basis. As you might suspect, relationships with the business unit managers are critical.
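To make the idea of a daily SLA review concrete, here is a minimal sketch of how availability for one day could be compared with a negotiated target. Everything in it is an assumption for illustration: the `DailyReport` class, the field names, the `order-entry` service name, and the 99.5% target are hypothetical and are not taken from any particular SLA or from MOF.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass
class DailyReport:
    """One day's availability record for a single business service (hypothetical)."""

    service: str
    downtime_minutes: float  # unplanned downtime observed during the day
    sla_target_pct: float    # availability negotiated with the business unit

    @property
    def availability_pct(self) -> float:
        """Percentage of the day during which the service was available."""
        up_minutes = MINUTES_PER_DAY - self.downtime_minutes
        return 100.0 * up_minutes / MINUTES_PER_DAY

    @property
    def meets_sla(self) -> bool:
        """True when measured availability reaches the negotiated target."""
        return self.availability_pct >= self.sla_target_pct


def allowed_downtime_minutes(sla_target_pct: float) -> float:
    """Downtime budget per day implied by an availability target."""
    return MINUTES_PER_DAY * (100.0 - sla_target_pct) / 100.0


# Example: an assumed 99.5% daily target and 10 minutes of unplanned downtime.
report = DailyReport(service="order-entry", downtime_minutes=10, sla_target_pct=99.5)
print(f"{report.service}: {report.availability_pct:.2f}% available, "
      f"SLA met: {report.meets_sla}")
print(f"Daily downtime budget: {allowed_downtime_minutes(99.5):.1f} minutes")
```

The measurement is per service rather than per server, which matches the holistic view described above: what counts is whether the application was available to the business unit, not whether each individual machine was running.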
Companies carefully plan their mainframe production environment. They build and maintain it based on an enterprise architecture. Operational management is a key component of the architecture. Any changes to the environment go through a rigorous Change Management process. These companies strictly enforce risk management so that they know the impact of changes in advance. They also formally approve and audit all releases to operations.
Operational management of the production environment is proactive and is based on documented processes and policies for common activities, such as monitoring, incident and problem management, alerting, and problem resolution.
Finally, companies view management of the production environment as a team effort that includes the business units, the users, and even key vendors. Constant communication and collaboration among these groups are important.
The Information Technology Infrastructure Library (ITIL) documents current industry best practices for IT Service Management. ITIL is technology-neutral and is designed to be adapted and enhanced, which is exactly what Microsoft chose to do. The Microsoft Operations Framework (MOF) combines the ideas in ITIL with specific guidelines for using Microsoft technologies.