What is change management, and how are SAN/NAS changes integrated into an existing data center change management system?
When was the last time someone made a change to the configuration you were supporting? Don't know? Well, probably about five minutes ago, if you're operating in an unregulated environment. The chief creator of problems is change. Something changed, and finding out what changed constitutes the bulk of problem analysis. Finding who changed what becomes more of a people management issue than a technical one. The discipline of change management within computer systems came about for just these reasons. Who changed what, when, and where is the mantra of change management.
If tracking and prioritizing changes is already an established discipline within the configurations you support, then you'll understand the value of integrating an unwieldy infrastructure like storage networking into change management activities. The central value of managing change is that, for young technology solutions like SANs, almost 90 percent of problems can be eliminated just by understanding the configurations and implementing reasonable changes.
The chameleon-like nature of NAS and the ease and attraction of Plug-and-Play provide a tremendous force for circumventing the change discipline. Well, you say, it's just some network storage and I can put it on the network in less than five minutes. Yes, you can, and you can also bring the network to its knees, lose all the data on the NAS, and corrupt the data on the redirected server. That's a lot of potential problems for Plug-and-Play. Regardless of whether we're accustomed to this type of service, it doesn't have to happen.
Forget the formalization of change committees and change windows for SOHO environments. All you need is some common sense to discipline your changes. A change management discipline can work just as well for small environments as for large ones. No one appreciates a defragmentation process that runs during the peak of the server's day, especially when it crashes and all the sales transactions have to be reentered because the mirror on the NAS device didn't work. Consequently, even if you're not familiar with change management, starting with storage networks can be extremely beneficial, all in the name of availability.
An ongoing program of regular monitoring and analysis of the storage network configuration ultimately reveals changes that must be made to enhance performance or provide sufficient reliability to meet service levels. Whether this comes through changes in software or hardware activities, the ability to schedule, track, and implement these changes effectively, and in a non-disruptive manner, requires the disciplines of change management.
In many cases, however, there will be changes to the configuration that require a timely response, so there must be a prioritization within the change management system to handle these issues. In terms of SAN serviceability, an inoperative switch that impacts the service levels of a production application system must be analyzed given its relationships to the entire configuration and its effects on other components and systems supported through the SAN configuration. This may require a quick response, but without sufficient and accurate knowledge of the existing configuration, identifying and resolving the problem becomes difficult: either the change is made without knowledge of its impact, or a lengthy analysis to document the configuration prolongs the outage or degrades the service.
Consequently, configuration management drives changes by documenting storage network hot-spots and quickly referencing the relationships within the configuration. In addition, an effective change management program can leverage configuration management diagrams to facilitate the problem resolution process in conjunction with change prioritization. Helping to establish change prioritization is an added benefit of configuration management that drives change management in a positive direction.
You can't fix a problem unless you change the configuration. You don't have problems unless there is a change to the system. This is the yin and yang of change management. Although we have yet to discuss problem management, we have described the effect of changes within the configuration being driven by problems that occur. Because storage networks are implemented using diverse components, they are exposed to a disparate set of external stimuli. They become critical elements of production systems, however, and as such they are the prime focus when diagnosing severity level one problems. In other words, storage is a key element to evaluate during system outages.
Consequently, managing change can drive the balance of change within the overall data center activities. After all, it's easy to blame everything on the new kid (say, the storage network configuration) if it can't defend itself. Enrolling storage network configurations into change management activities allows many things, including visibility of changes within the configuration, visibility into external changes that affect the configuration, and the ability to maintain a credible defense during problem identification activities.
One of the major factors that disrupt storage configurations is upgrades. Whether they're extending capacities or enhancing performance, these changes should affect existing service levels in a positive way; however, in many cases, capacity upgrades produce unforeseen results, with performance generally moving in the wrong direction. These situations evolve from unscheduled or unmanaged change plans that do not allow adequate time to complete or consider the effects on related systems.
This is evident in the installation of additional storage capacity using NAS devices. Without sufficient consideration for the related network capacity and existing traffic, NAS devices can solve the storage capacity problem but create a network problem as a result. The other issue with NAS upgrades is the related file protocol services that need to be connected or reconnected after an installation. Installing or upgrading NAS device capacity may be quick in terms of configuring and identifying the device on the network; reestablishing the drive mappings and device mounts for the new NAS devices is a sometimes overlooked detail.
SAN configurations present a more complex upgrade activity. Due largely to the diversity of components and their interrelationships, SAN upgrades and installations require a significantly longer planning and analysis cycle than other storage configurations. For example, upgrading storage capacity has a ricochet effect within the configuration: adding a significant amount of capacity is usually driven by increased access, the increased access may require additional server nodes to be attached, and this ultimately places more traffic within the switch configurations.
If the extenuating situations mentioned in the preceding example are not addressed, the SAN configuration can become compromised in the name of increasing service. Addressing these situations requires a discipline that is supported through a comprehensive change management system. The articulation of changes within the storage infrastructure drives the discipline to analyze the following: what's being changed, why, the desired result, and the plan for reestablishing service. Managing changes highlights the division of labor between software and hardware changes. It also highlights the division of responsibility in large configurations where the network part and the storage part are separated and supported by different groups, each having its own hardware and software expertise. Integrating change management into the storage configuration management process provides a forum to highlight all the required activities for servicing the SAN or NAS configurations.
Tracking changes within a storage network is similar to tracking changes within the data center: a separation of hardware and software changes. Another view is to divide them into physical and logical categories. This mediates the traditional separation of hardware versus software, thereby integrating component responsibility into combined or shared responsibilities. For example, the typical enhancement of storage capacity within a storage network is the installation of a new and larger storage array. This could be performed in an off-shift time period, or on a weekend, and no one would be the wiser.
However, it could also be scheduled within a change management system where the information regarding the upgrade, proposed activity, and scheduling is communicated to other areas that may be affected. This allows others to either consider the effects of the change and plan accordingly, postpone the change until issues are addressed, or, at a minimum, be aware that problems could arise after the change date.
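As a minimal sketch of the kind of record such a system carries, the following shows a change entry with the information called out above: what is changing, when, and which areas must be notified. The field names and team names are illustrative assumptions, not a specific product's schema.

```python
# Sketch of a change management record; all names here are hypothetical.
from dataclasses import dataclass, field


@dataclass
class ChangeRecord:
    summary: str                      # what is being changed
    scheduled: str                    # when the change window occurs
    affected_areas: list = field(default_factory=list)

    def notifications(self):
        """Areas that should review or acknowledge before the change window."""
        return [f"notify {area}: {self.summary}" for area in self.affected_areas]


change = ChangeRecord(
    summary="install larger storage array",
    scheduled="weekend off-shift window",
    affected_areas=["database team", "network team"],
)
print(change.notifications())
```

Even this small structure forces the question "who else is affected?" to be answered before the change is scheduled, which is the point of the discipline.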
Because both NAS and SAN configurations are closely associated with hardware and software functions (see Parts III and IV for more details on the major components of these storage systems), the tracking of physical changes provides a method that encompasses both the physical aspect of the component and the bundled software configuration parameters that need to be addressed. For example, recabling a SAN switch to enable the installation of a new switch requires both physical changes and subsequent fabric changes. Viewing and administering changes in this way forces the change process to encompass both hardware and software skills while ensuring the changes are analyzed in an integrated fashion.
The tracking of logical changes performs the same function within the change process, except that logical configurations are generally driven by changes to software components within the storage network. These may be enhancements to various features and functions, or maintenance of existing software elements such as the fabric OS, the NAS micro-kernel, and storage and adapter drivers and firmware.
An example is the upgrade to the NAS micro-kernel software. This change, viewed from a logical perspective, should require an analysis of configuration management information regarding which relevant NAS devices are affected by an upgrade. The NAS target configurations would then be evaluated for subsequent firmware changes within the storage array controllers that may be required to support the new NAS kernel software. In addition, the micro-kernel change may have different levels of IP network support and will require the evaluation of the potential network components. Consequently, the change to the logical configuration brings in all of the NAS components, as well as the external elements that may be affected. This also forces the necessary storage and networking hardware skills to participate in the scheduling and implementation of the change.
The following guidelines can be used to provide the beginning of a change management program for storage networking. Even though a formal change management function may exist in your data center, the establishment and integration of storage networking as a valid change management entity is key to participation at a data-center level.
Establish a link between storage configuration and change management. This forces the evaluation of each change to be analyzed for the entire configuration, as well as other systems in the data center.
Develop a relationship matrix between the key components of the storage network configurations. For example, the relationship between the FC switch and the storage nodes is a fundamental and interdependent one: if the switch becomes inoperative, the storage is inaccessible, an impact potentially magnified across the several storage arrays the switch supports. The matrix also provides a roadmap of Inter-Switch Link (ISL) dependencies and the other nodes participating in the fabric.
Establish an internal (within the storage network) change prioritization scheme and an external (outside the storage network) change effects scheme based on a storage network availability matrix. The storage change prioritization scheme will define the levels of change based upon their potential effects on the storage configuration. The external change effects scheme will define the levels of how the storage configuration is affected by outside configurations of servers, networks, and other storage configurations.
Establish a link between storage change management and problem management. Additional discussion on problem management follows; however, it's important to mention the critical link between problems and change, given their synergistic relationship. For example, consider a storage configuration in which an FC switch is attached to five storage arrays and two other switches; this information drives the relationship matrix that depicts the switch's interdependencies. These component interdependencies drive the change prioritization scheme to reflect this component as a critical availability element within the configuration. The external change effects scheme, in turn, reflects the risk to external systems supported by these components. Consequently, this drives the level of change prioritization if and when a problem arises within this component.
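The relationship matrix described in these guidelines can be sketched as a simple dependency map that answers the key serviceability question: if this component fails, what else is affected? The component names below (switch_a, isl_ab, and so on) are hypothetical examples, not from the text.

```python
# Sketch of a SAN component relationship matrix; names are hypothetical.
from collections import deque

# Each key maps to the components that depend directly on it.
dependencies = {
    "switch_a": ["array_1", "array_2", "isl_ab"],
    "isl_ab": ["switch_b"],          # ISL dependency: losing it isolates switch_b
    "switch_b": ["array_3"],
}


def impact_of_failure(component, deps):
    """Return every component affected, directly or transitively, by a failure."""
    affected, queue = set(), deque(deps.get(component, []))
    while queue:
        item = queue.popleft()
        if item not in affected:
            affected.add(item)
            queue.extend(deps.get(item, []))
    return affected


# A switch outage cascades through the ISL to the downstream switch and arrays.
print(sorted(impact_of_failure("switch_a", dependencies)))
```

The transitive walk is what makes the matrix useful for prioritization: the five-array switch in the example above surfaces immediately as a critical availability element.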
Storage problems within the data center can ricochet through production applications faster than any other component. Obviously, this is a reflection of their critical position within the diversity of components that support production processing. When problems occur, there generally is some type of trouble ticket system that logs problems for assignment and potential resolution. These systems provide an invaluable service when it comes to documenting and tracking problems, but they do little to facilitate the entire scope of problem management. Consequently, applying a trouble ticket system or help desk software to storage problems addresses only a small portion of problem management.
Problems, especially within the storage infrastructure, are resolved in a number of ways, depending on the data-center facility, administrator preferences and management styles, and even the established service levels. Regardless of data center-specific attributes and management styles, there are fundamental activities that reflect problem management disciplines, as well as critical activities that must be performed to respond to storage problems effectively. The problem management disciplines are problem prioritization, problem identification, and problem resolution. Overlaid with problem documentation and tracking, they form the fundamental parts of problem management. How these are implemented and integrated with storage networks is critical to the success and reliability of the storage infrastructure.
There is no debate more dynamic within the data center than that surrounding problem prioritization. Someone must decide what constitutes a severity level one problem, as opposed to severity level two, three, and so on. This discussion becomes even more vigorous when dealing with storage, since storage contains the data, and ownership is an attribute of that data. If the data becomes unavailable, then the application that uses the data is unavailable, and the level of impact is driven by the effect this situation has on the activities of the company. Consequently, end users drive the establishment of severity definitions based on data availability, vis-à-vis storage availability. However, most are ill-equipped to understand the level of complexity within the storage infrastructure well enough to provide these definitions effectively.
These discussions become either very complex or very simple, depending on the level of data ownership taken by the end user. If end users take data ownership seriously, then discussions regarding the definitions of data, usage, and management will become involved. However, if end users simply see the data center as the service supplying the data, the discussion can be reduced to a simple on or off. In other words, either the data is available, or it isn't.
Regardless of these two perspectives, and their highly charged political engagement with end users, the data center must view problem prioritization in a more utilitarian fashion. Simply put, the complexities of providing data delivery and availability infrastructures must be managed within a matrix of interconnecting systems that are beyond the level of knowledge or interest of end users. Moving end users toward a utilitarian approach allows the data center to deal effectively with storage ownership and management of the infrastructure. Defining problems from an end-user perspective within this increasingly complex set of infrastructures allows data centers and underlying areas to define the prioritization required for the data utility. Key to establishing prioritization mechanics is storage networking.
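One way to make the utilitarian approach concrete is to reduce prioritization to a small, repeatable rule rather than a per-incident negotiation. The sketch below is an illustrative assumption, not a standard: it maps the number of affected production systems and the presence of a redundant path to a severity level.

```python
# Illustrative prioritization rule; thresholds and inputs are assumptions.
def change_priority(affected_systems, has_redundant_path):
    """Map the potential availability impact of a change or problem to a level."""
    if affected_systems == 0:
        return 3  # no production impact: handle in a routine change window
    if has_redundant_path:
        return 2  # impact mitigated by redundancy: scheduled priority
    return 1      # touches a single point of failure: highest priority


# A switch feeding five production systems with no redundant path is urgent.
print(change_priority(5, False))
```

The value of a rule like this is not its sophistication but its consistency: end users see on or off, while the data center sees a defensible, repeatable severity assignment.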
The relationship between data availability and storage networking problem prioritization increases the complexity of this activity. Given the characteristics of data delivery and the evolution of decoupled storage infrastructures such as storage networks, the chances that problems will arise, given the level of interrelationships within the storage area, increase dramatically. The result is an environment with increasingly complex storage problems inside what is meant to be a simple utility data delivery system.
Investigating and identifying problems within storage networking configurations is an ongoing activity for three reasons. First, these technologies are still new and continue to be temperamental. Second, storage networks consist of multiple components that must interact both internally among themselves and externally with other components and systems within the data center. And third, the industry has not yet produced the tools necessary for effective operational problem evaluation and identification.
However, there are some key items to consider when looking at problem identification in these areas. Some activities dont require tools so much as basic problem-solving practices. Although this is beyond the scope of this book, the necessary activities should be mentioned in context with storage networks.
Accurate and Current Information Configuration management for installed storage networks should be accurate and readily available.
Diagnostic Services A minimum level of standard operational procedures should document the entry-level activities for problem investigation and resolution. This should include the dispatching of vendor service personnel in order to facilitate problem resolution.
Service Information Accurate and current information should be available to contact vendors and other external servicing personnel.
Tool Access Access to consoles and other logical tools should be made available to support and operations personnel. This includes current physical location, effective tool usage, diagnostic processes, and authorized access through hardware consoles and network-accessible tools.
In SAN environments, an effective guideline is to investigate from the outside in. Essentially, this requires the evaluation of problems that affect all devices versus problems that affect only specific devices within the storage network.
For problems that affect all devices, it's necessary to view components that are central to the network. Specific to SANs is access to the availability of the fabric operating services. The directly connected console is the most effective tool if network access is not available, while fabric configuration routines provide status information regarding port operations and assignments. There may be additional in-band management routines as well, which can be used for extra information.
For problems that affect only some of the devices, it becomes necessary to first view components that are common to the failing devices. If the total SAN configuration is not completely down, then out-of-band management routines can be accessed to provide a port analysis, port connections, and performance checks. In addition, the ability to telnet or peek into the attached node devices and node servers verifies the basic operations of the switch fabric.
Specific to the node devices are the mostly out-of-band tools that evaluate and verify the operation of storage devices, server HBA ports, and server operating systems. However, access through the fabric can also verify connectivity, logins, and assignments. Key to this area are continued network drops of node devices from the fabric, an occurrence that can be traced more effectively through fabric access using configuration and diagnostic tools.
In NAS environments, the same approach can be used to identify failing components; however, NAS configurations are more accessible to typical performance tools. Therefore, the most effective approach is to view problems from the outside in by focusing on access points common to all NAS devices, for example, the network. This is followed by the investigation of individual NAS devices and related components that are failing.
The centralized approach can be used with existing network tools to monitor and respond to NAS network drops, where access within a specific area may be the key to the problem (for instance, a network problem versus a NAS problem). This demonstrates how NAS devices require less overhead to identify problems, because of their bundled, self-contained design and their integration into existing networks as IP-addressable devices.
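The outside-in decision just described can be sketched as a small triage routine: if no device on the network responds, suspect the common access point (the network); if only some fail, suspect those devices. The probe is injected as a function so that a real reachability check (ICMP, or a TCP connect to the NFS or CIFS port) could be substituted; the device names below are hypothetical.

```python
# Sketch of outside-in NAS triage; device names and probe are illustrative.
def triage(devices, probe):
    """Classify an outage as a network problem or a device problem."""
    down = [d for d in devices if not probe(d)]
    if not down:
        return "all reachable"
    if len(down) == len(devices):
        return "suspect network"      # the common access point has failed
    return "suspect devices: " + ", ".join(sorted(down))


# Simulated probe results standing in for real reachability checks.
status = {"nas1": True, "nas2": False, "nas3": True}
print(triage(list(status), status.get))
```

In practice the probe would be wired into existing network monitoring, which is exactly why NAS triage carries less overhead than SAN triage.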
If problem identification focuses on an individual device, the failing component may be simple to find, given the bundled nature of the NAS solution and the tools available within the NAS microkernel operating system. All NAS vendors provide a layer of diagnostic tools that allow investigation of the internal operations of the device. Isolating the failing server or storage component then becomes straightforward, given the simple nature of the NAS device.
NAS devices are less expensive and easier to integrate into existing problem identification activities. Given their simple configuration and mode of operation as IP-addressable network components, they can be identified quickly through network tools. In addition, vendors' internal diagnostic tools provide performance statistics and notification of failures.
Once a problem has been reported, the ability to document its particulars is critical to the identification process. Most data centers will have an existing problem documentation system associated with their help desks. Integration with these systems is important for setting the foundation for problem tracking within the storage network. This doesn't automatically mean that all problems will be documented through the help desk, however. Many storage network problems will come from internal sources, where they're tracked and reported by data-center administrators, database administrators, and network administrators, to name a few.
Even though these positions may report the bulk of problems, it does not guarantee those problems will be reported through the same trouble ticket system used by help desk and operations personnel. Thus, the foundation is laid for problem duplication and multiple reporting. This demonstrates just one of the challenges regarding implementation of the storage network into the problem tracking system. The same can be said for other key areas of the problem reporting process when it comes to critical and accurate reporting of problems.
Storage network problem documentation is made more effective by following a few general guidelines.
Consistent Reporting Process Ensure that problem reporting uses the same system for all external and internal users. This requires a certain level of internal discipline to get all administrators, in both systems and applications support, to utilize the same system. This is important because a problem reported externally may be linked to a problem reported internally, which will have more technical detail. An example is a network-shared drive that continues to be inaccessible, as reported by external users. An internal problem report indicates that the network hub associated with that device has experienced intermittent failures and is scheduled to be replaced. Together, these two reports provide a quick level of diagnostic information that should lead the storage administrator to verify the NAS device internally and take steps toward a resolution, which may require a different path to the storage.
Historical Reporting and Trending One of the most valuable assets problem tracking systems provide is a database for viewing the history of a component or device, its past problems, and their resolutions. This is extremely important as both SAN and NAS components are tracked and analyzed for consistent failures or weaknesses. For example, historical trends regarding complex switch configurations can be very helpful in identifying weaker components or trends toward failure as port counts grow and interswitch linking becomes more sophisticated. This proves a valuable tool for anticipating potential problems and scheduling changes prior to a failure, offering a proactive answer to problem management and the subsequent downtime statistics.
Experience and Expertise Tackling problems is one of the quickest ways to learn a system or technology. Using problem tracking information and configuration management information provides effective training for IT personnel new to storage networks.
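The historical trending guideline above can be sketched against a ticket history: count failures per component and flag any whose monthly failure rate is not declining. The ticket fields and component names below are illustrative assumptions, not a specific help desk product's schema.

```python
# Sketch of failure trending over problem-ticket history; data is hypothetical.
from collections import Counter

tickets = [
    {"component": "switch_a", "month": 1},
    {"component": "switch_a", "month": 2},
    {"component": "switch_a", "month": 3},
    {"component": "array_1", "month": 2},
]


def failure_counts(tickets):
    """Total failures logged per component."""
    return Counter(t["component"] for t in tickets)


def trending_up(tickets, component):
    """True if monthly failure counts are non-decreasing: a weakening component."""
    months = sorted(t["month"] for t in tickets if t["component"] == component)
    per_month = Counter(months)
    rates = [per_month[m] for m in sorted(per_month)]
    return len(rates) >= 2 and all(b >= a for a, b in zip(rates, rates[1:]))


print(failure_counts(tickets))
print(trending_up(tickets, "switch_a"))
```

A component that trips this check becomes a candidate for a scheduled, prioritized change before it fails outright, which is the proactive posture the guideline describes.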