Change management is another important cog in the high availability works. Change management means different things to different people, but for the purposes of this book it is the process that allows changes to applications and systems to happen in a predictable fashion with minimal or no interruption in service. Change management applies to all phases, from development to production support.
More Info
More information on change management can be found in the chapter "Managing Database Change" in the Microsoft SQL Server 2000 Resource Kit (available from Microsoft Press, ISBN 1-7356-1709-0), and in the chapter "Change and Configuration Management" in the SQL Server 2000 Operations Guide on http://www.microsoft.com.
A lack of change management is a major cause of data center failures. From a database perspective, it must become a priority to document, test, and manage the deployment of all database changes. Database changes include stored procedures, functions, physical database structures (such as indexes and constraints), the data, the storage components of the database, or even the server itself. Managing these changes helps you avoid new errors and minimize the impact on the service level.
Besides having input into the design and planning stages, including the data access code, database objects, server architecture, and its configuration, the DBA must also be involved in evaluating and analyzing the proposed changes to any database system. The final stage in the change process is managing the implementation process, so that only known and tested factors are introduced into the system. The implementation process includes documented plans. A postimplementation review, which includes the DBA team, allows everyone to learn from the change before the cycle begins again.
Database work necessarily overlaps development and testing (Microsoft Solutions Framework, abbreviated MSF) with operations (Microsoft Operations Framework, abbreviated MOF). Change can be application driven or it can be operations driven. The DBA might begin in the design phase of the MSF cycle and end up going through MSF and part of MOF to implement the change. This same cyclical approach also applies to the way that a DBA maintains the database. By constantly evaluating the value of potential changes to either the database servers or the processes by which they are supported, the DBA is essentially continuously moving full circle through the Operations Framework. A formal process of quality assurance is essential to a stable production environment.
Keep in mind that the approaches to change management might differ depending on whether you are rolling out a packaged product or a custom-built application.
When you are making configuration and upgrade changes to your database environment, there are a number of tasks you should perform related to planning, user communication and coordination tasks, disaster recovery, testing, application and vendor issues, and many other considerations. Some of these issues are discussed in greater detail in Chapter 13, Highly Available Upgrades.
The optimal configuration for any environment implementing change management includes completely separate testing and staging environments that simulate or reproduce the production environment in the most faithful way possible. Why separate environments and not just one server, SQL Server instance, and so on, for all functionality? The problem is that they all have different purposes, and some environments are more transient than others. A development environment is chaotic by nature: changes are happening quickly. A testing environment is a bit more stable than a development environment; it is usually set up to test a particular version of the software or specific conditions and reconfigured accordingly for subsequent tests.
A staging environment, although similar to a testing environment, has a completely different use and purpose. Sure, it is for testing, but it is either a joint venture between the IT staff and product development or maintained by the IT staff only to test what will be rolled out in production. Staging, more often than not, is an exact copy of what is in production. You do not want to affect one environment at the expense of others.
Here are some sample questions to ask yourself.
If development is always changing code on a server, does that invalidate any testing?
What happens if a testing engineer accidentally deletes data needed by developers on a shared development and testing database?
What happens if a developer or tester reboots a server in the middle of others testing or developing against it?
Although these issues can still occur when the environments are separate, they are far less likely.
From a cost perspective, having three dedicated environments might not be possible, which is yet another barrier to availability. However, you have to ask yourself if not simulating your production environment in development, testing, and staging to save a few pennies will ultimately hurt more than it will help. Go back to the arguments posed in Chapter 1, "Preparing for High Availability": What is the cost of downtime versus any up-front investments, especially if you could have found an issue before releasing the software? As an example, production cluster systems and the disks that support them can be expensive, so duplicating them might not be in the budget for one or two additional environments. If you do not have the budget, one option is to look for alternatives, such as spare hardware that can be reused in certain scenarios, or software like VMware Workstation or Connectix Virtual PC that might help you simulate your environment.
The bottom line is that you should do the best you can to reduce the risk of failure and increase availability in your production environment by developing and testing environments properly before you need to support them on a daily basis. At the very least, try to maintain development and staging environments.
Maintaining separate environments also means more administrative work for everyone involved. How will you ensure a change to the production environment (such as a security patch) is reflected in development, testing, and staging so that when a new change is developed, it will not break production? Similarly, how will development be notified of any necessary changes to development or testing systems? These are only two possible scenarios. Processes must be devised and implemented to have changes flow from development to production and back. This requires a great deal of communication and synergy. It is potentially a huge pain to maintain these separate environments, but the effort will pay for itself with the first system failure that is caught in the test environment.
Using a change request form is one way to manage the change process from development to production, or to make changes to the production environment. A change request form serves as an official document that is signed by the relevant parties prior to implementation. Without those signatures, the implementation cannot occur. This document should be stored, and it can also be used to help measure success or failure, as well as satisfaction with the tasks performed. A change request form ensures accountability.
On the CD
A sample change request form can be found on the CD-ROM with the title Change_Request.doc. The document can be used and altered for your environment.
If you are developing custom applications or modifying and extending third-party or packaged applications, you must provide a change infrastructure to ensure availability by the time the application hits production. Change management will also encompass excellent high availability application development practices in the development space.
Contrary to what many developers would like to think, high availability is not just an IT problem to be dealt with later. As mentioned briefly in the section "Where Does Availability Start?" in Chapter 1, decisions you make during the development life cycle absolutely impact not only how the solution will be deployed, but also how it will be maintained. Most important, these decisions impact the availability of the solution. Bad application decisions often lead to low availability applications even with a $10,000,000 back end. If you adhere to the MSF process, assessing risk will already be done on a regular basis. Developers should develop code with deployment and the eventual back end firmly in mind. If you are a DBA or an IT person, make your development counterparts aware of what they need to do from an application perspective. Remember, you are the one who will be supporting it in production!
More Info
Specific tips about how you should think about coding applications for particular technologies are listed in some of the technology-specific chapters (such as Chapter 6, "Microsoft SQL Server 2000 Failover Clustering," and Chapter 7, "Log Shipping").
One topic not covered specifically in subsequent chapters is the development of the installation process for custom applications. This obviously affects the availability of any production rollout. There might be times when rebooting is completely unavoidable, but rebooting should be kept to a minimum. For example, the developer should think about making all changes and installing all files before requiring a reboot. Doing a reboot more than once could prove costly.
Equally important is the ability to uninstall the application or back out from the installation process (that is, cancel it in the middle of the installation) without adversely affecting the system, and to make sure it returns to its state prior to installation. Consider the following example. A contractor developed an updated version of an existing application. The application was supposedly tested and found to be as error-free as possible (meaning no "showstoppers") in the test environment, and it was installed during a maintenance window. All of a sudden, the process failed. When asked about the uninstall method, the developer shrugged and said, "What uninstall? This is a permanent change." In an IT shop, such a change would prove to be painful.
Version control is the practice of being able to store and ultimately deploy multiple versions of an application, if necessary. Version control applies to application development and production environments.
From a development perspective, version control is absolutely critical to the success of the environment. It allows developers to check out and check in code so that no one developer can overwrite what someone else has done without creating a record of it. Version control also allows a development environment to trace the history and have code regression in case a bug is found in a current version. You do not want to ship a completed and compiled version of the application to customers without a way of rebuilding it exactly, should you need to. This situation does happen. Clearly identify the released version and make sure that it is made secure so that no one can modify or delete it.
Consider the uninstall example from the previous section. If that occurred, you would be relying on your backups and disaster recovery plans to get the system back into a usable state. Without a version control tool or some other method, you might not have access to builds of a previous version of the application to restore the system to a working state, if that is part of your disaster recovery plan. This would not increase your availability. In fact, without access to the proper bits, the process could be painful.
If there is no budget for a proper version control tool, the easiest way to start storing code is to put it in a protected, secure folder and use a simple directory structure with standardized file names to ensure uniqueness between version builds. However, you should use some form of version control software. This has some immediate advantages over directory storage, but it requires another process that needs to be managed and available. A great option for software is Visual SourceSafe. One limitation of SQL Server is that when database structures, functions, stored procedures, and so on are created or updated, they overwrite what was already there (that is, an existing object of the same type and name) without version control. You can mitigate this by maintaining the different versions of the database structures and update scripts in a tool like Visual SourceSafe. Then give the proper script to the person executing the task to update or add the structure or stored procedure. Whatever method you use, be sure that everyone is trained in its use and committed to using it properly.
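The directory-based fallback described above might look like the following minimal sketch. The layout (object type, then object name, then a zero-padded version file name), the root folder `db_scripts`, and the procedure name `usp_GetCustomer` are all hypothetical choices made for illustration; any convention works as long as every build maps to exactly one path.

```python
import re
from pathlib import PurePosixPath

# Hypothetical convention: <root>/<object_type>/<object_name>/v<major>.<minor>.<build>.sql
NAME_PATTERN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def versioned_script_path(root, object_type, object_name, version):
    """Return a unique, predictable path for one version of a database script."""
    if not NAME_PATTERN.match(object_name):
        raise ValueError(f"unexpected object name: {object_name!r}")
    major, minor, build = version
    # Zero-padding keeps a plain directory listing in version order.
    filename = f"v{major:02d}.{minor:02d}.{build:04d}.sql"
    return PurePosixPath(root) / object_type / object_name / filename


path = versioned_script_path("db_scripts", "procedures", "usp_GetCustomer", (1, 2, 37))
print(path)  # db_scripts/procedures/usp_GetCustomer/v01.02.0037.sql
```

A scheme like this only guarantees unique names. It does not provide check-in/check-out locking or history, which is why a real version control tool remains the better choice.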
It is also useful to store related documents for application design, implementations, or server configurations in version control. Store e-mails outlining varying opinions, any design documents (whether they are used or not), and approvals from quality assurance. For example, if, in six months, someone asks you what happened to the AddressLine7 column where customer comments were stored, you will be able to get back to them with an intelligent and responsible answer. If later it becomes apparent that the design could have been better, you will have adequate documentation on any alternate designs that were suggested and the reasons (if any) that they were discarded. This can save valuable analysis time, which allows you to provide a timely answer even when you are busy with other tasks.
A great deal of testing is required before rolling out a change or update in a production environment. There are different types of testing: unit testing, regression testing, white box testing, black box testing, acceptance testing, and so on. This chapter is not going to redefine standard testing terminology. All types of tests are important. When it comes to testing for high availability, you must get beyond features and functions tested only in isolation. Three main questions must be addressed:
How does the application perform in isolation with the technologies that will be deployed? What is crucial to document from testing for IT is how the application (by itself) behaves with respect to a specific technology (if it is not widely known). Packaged third-party applications should, in theory, be documenting this for you already.
How does the application perform as part of the entire solution with respect to the failure of individual components? The problem really comes in when you start mixing software and technologies: how do they function together? Developers and testers can make IT's life difficult if IT has no idea what to expect when handed the package to install and support. IT can make proper disaster recovery plans if everything is documented properly at this stage. Similarly, any findings that are discovered during the implementation process or in the support of the application or solution should flow back to development so test plans can be updated. Faulty, outdated, or unrealistic test plans should be considered unacceptable.
What are the implementation pitfalls? Unfortunately, development does not always have the time to document how to properly or optimally set up an environment. This can lead to longer installation times as IT finds things out or, if something goes wrong, longer troubleshooting times. Developers and testers need to document, document, document!
Test plans with their specific test cases should take these points into account.
On the CD
For a blank test plan form to fill in your test cases, use the document Blank_Test_Case.xls. See other chapters for sample test plans, based on this template, relating to technologies such as failover clustering.
Now that you have addressed your development environment, it is time to tackle the change management process in a production environment.
Prior to rolling out a change or update in a production environment, a great deal of planning and work must be done. For any database-related change, whether to the system itself or to objects and data in the database, the database system engineer (DSE) or DBA should ensure that the changes developed will work well in the production environment. The DBA or DSE should develop or help develop, review, and test implementation scripts, rollback scripts, and testing scripts, as well as create or maintain related database documentation.
Make sure that your change plan is as modular as possible. The ability to restart the entire process, stop at any point, and then continue, identifying exactly which changes have been made and which have not, is the best way to go. A good way to keep track of what you have done is to put auditing code in all of your implementation scripts, thereby recording success or failure to a database table.
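As a minimal sketch of the auditing idea, the following runs a change plan one named step at a time and records each outcome in an audit table, so that a restarted run skips steps that already succeeded. It uses Python's built-in `sqlite3` module purely as a stand-in for SQL Server; the table name `change_audit`, the step names, and the sample schema are assumptions made for this illustration, and a real implementation would more likely live in the Transact-SQL scripts themselves.

```python
import sqlite3
from datetime import datetime, timezone


def run_change_plan(conn, steps):
    """Run named steps in order, skipping any already recorded as successful."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS change_audit ("
        "step_name TEXT, status TEXT, detail TEXT, logged_at TEXT)"
    )
    conn.commit()
    for name, sql in steps:
        done = conn.execute(
            "SELECT 1 FROM change_audit WHERE step_name = ? AND status = 'success'",
            (name,),
        ).fetchone()
        if done:
            continue  # already applied in an earlier run, so a restart is safe
        try:
            conn.execute(sql)
            status, detail = "success", ""
        except sqlite3.Error as exc:
            conn.rollback()
            status, detail = "failure", str(exc)
        # Record the outcome of this step, whether it succeeded or failed.
        conn.execute(
            "INSERT INTO change_audit VALUES (?, ?, ?, ?)",
            (name, status, detail, datetime.now(timezone.utc).isoformat()),
        )
        conn.commit()
        if status == "failure":
            return False  # stop here; the audit table shows where the plan halted
    return True


conn = sqlite3.connect(":memory:")
steps = [
    ("01_create_customer", "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)"),
    ("02_add_email", "ALTER TABLE customer ADD COLUMN email TEXT"),
]
ok = run_change_plan(conn, steps)
for row in conn.execute("SELECT step_name, status FROM change_audit ORDER BY rowid"):
    print(row)
```

Because every step is logged by name, the audit table answers the question of exactly which changes have been made and which have not, whether the run completed, stopped on a failure, or was interrupted.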
Although each method of SQL code propagation serves a purpose, a script-based installation (if possible) best meets change management objectives. Scripts allow for repeated, controlled, and highly automated installations that can easily be rolled back at any stage. Nonscript, graphical user interface (GUI)-based installations should cleanly uninstall if they are canceled at any point during the process, and they should be tested to ensure the system will be left in its previous state.
When making changes, even if they have been tested, always have a contingency plan ready. Never make a change without one, and raise a red flag if someone else attempts to make changes without one. Plan what to do if something goes wrong in the middle of the script sequence, or if the database scripts work but the application team encounters errors and has to remove the changes. Always plan for a minimum of two possibilities: complete removal of all changes (which could require rollback scripts, or a restore, if the application is not designed to support multiple versions running in the same database), or an alteration of your plan due to either predictable or unforeseen circumstances.
A highly available system should rarely, if ever, be altered without testing. Untested, nonstandard production changes made to avert a crisis situation should be handled only by highly experienced DBAs or under the guidance of Microsoft Product Support Services (PSS).
During the planning stages, you should be thinking ahead to what could go wrong during this implementation. This risk analysis is vital to the success of your process, and might change what is implemented, or how it is done. If you can predict problems by thinking through the possibilities or recalling problems you have encountered in the past, make a chart showing these risks and what you could do to correct the problem. No matter what kind of risks you identify, even if you cannot identify any, you must have a rollback strategy for your production implementation. An experienced DBA can contribute a wealth of knowledge and guidance in this phase, even if he or she will not be doing the actual work. IT shops should not ignore DBA input for rollouts that, on the surface, might not seem to affect the database systems.
Your rollback strategy could be as simple as a script for how to undo every change you have made and restore everything to the way it was before you started. A script in this situation means a series of documented steps that you will perform to roll back your change. This will mostly consist of Transact-SQL scripts you will run. The strategy could also be much more complex, as in the case of a SQL Server service pack installation, because you cannot uninstall it; a service pack is a permanent change. Whatever the plan, it must be tested thoroughly.
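For changes that are reversible, the simple form of this strategy can be sketched as a list of paired "do" and "undo" statements: if any step fails, the undo statements for the steps already applied run in reverse order. This is an illustration only, using Python's built-in `sqlite3` module as a stand-in for SQL Server, with a hypothetical `orders` table and index. It does not cover permanent changes such as a service pack, which need a restore-based plan instead.

```python
import sqlite3


def apply_with_rollback(conn, changes):
    """Apply (do_sql, undo_sql) pairs; on failure, run the undo scripts in reverse order."""
    applied_undos = []
    for do_sql, undo_sql in changes:
        try:
            conn.execute(do_sql)
            conn.commit()
            applied_undos.append(undo_sql)
        except sqlite3.Error:
            conn.rollback()
            # Undo the most recent successful change first.
            for undo in reversed(applied_undos):
                conn.execute(undo)
                conn.commit()
            return False
    return True


conn = sqlite3.connect(":memory:")
changes = [
    ("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)", "DROP TABLE orders"),
    ("CREATE INDEX ix_orders_total ON orders (total)", "DROP INDEX ix_orders_total"),
    # Deliberately failing step: the target table does not exist.
    ("ALTER TABLE no_such_table ADD COLUMN note TEXT", "SELECT 1"),
]
succeeded = apply_with_rollback(conn, changes)
leftovers = conn.execute(
    "SELECT name FROM sqlite_master WHERE name IN ('orders', 'ix_orders_total')"
).fetchall()
print(succeeded, leftovers)  # False []
```

Writing the undo statement at the same time as the change itself makes it harder to ship a change with no way back, and it gives you something concrete to test before the maintenance window.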
Even if you have planned everything correctly, the need to roll back can still occur. You might have identified risks that could cause a catastrophic problem, or something might occur that you did not foresee. Developers or testers might even find some problem during the postimplementation test that prevents them from approving the installation. When the call comes to roll it back, you will have a very short amount of time to undo all the changes.
If you have additional servers or instances involved in the implementation that are not all going to be upgraded simultaneously, you need to plan and script how these will also be changed. For example, assume you are making a major change to your database (say, you are merging several databases into one), and you want to delay implementation on your offsite standby, which is normally 24 hours behind the primary server. Once you have completed your production implementation, the offsite standby is merely serving as an easy way to restore to the point in time prior to the change. If you have a failure of the main site, however, and need to continue with the new system at the alternate site, then you have a large task to plan and immediately implement. The best practice is to think through all the possibilities and be as prepared as possible.
The team deploying the change should have a good understanding of both the system in question and the change being made. The leader must have the authority to take drastic corrective action without undue delay, and he or she must know how to reach all the appropriate people in operations who can help with operational aspects that are outside his or her realm.
Although minor changes to the system can be made without taking the system down, deploying a major change without incurring any outage takes planning and can involve the cooperation of more than one team. If this is a 24/7 system, your project team will have to negotiate an implementation window that is acceptable to all groups affected by the change. In a truly mission-critical system, even the shortest downtime might be unacceptable, so if your change is invasive you might need to provide another system or database for alternative access.
One option is to use a read-only database if that is the only thing needed. This clearly will not work in an e-commerce environment. In the case of a read-only database, interim support of the read-only system must also be provided, and the users and help desk personnel should be kept informed.
If, for any reason, you cannot keep the main system online during the change, you might instead have a manual switch to a standby system, especially for a read/write database. This ensures that the availability SLAs are met, but it might pose an administrative challenge in resynchronizing the data after the change is applied to the production server. If, for example, you are making extensive changes to the database schema, you cannot allow users to enter or change data on the standby system unless you have developed and tested a plan for capturing those changes and importing them into the new structure.
You should compile an implementation plan and distribute it to everyone in the IT department who will be involved in or affected by the implementation. The plan is simply a list of the steps, who is responsible for each step, the times at which everything is expected to occur, and who to contact to initiate the next step. Be sure to include contact numbers for everyone on site or on call during that time frame.
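One way to keep such a plan consistent is to hold it as structured data and generate the distributed list from it. The sketch below is only an illustration: the field names, owners, times, and contact numbers are invented, and a spreadsheet serves the same purpose equally well.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PlanStep:
    description: str
    owner: str
    contact_number: str
    start: datetime
    expected_minutes: int


def render_plan(steps):
    """Return one line per step: time window, owner, and who to notify next."""
    lines = []
    for i, step in enumerate(steps):
        end = step.start + timedelta(minutes=step.expected_minutes)
        next_owner = steps[i + 1].owner if i + 1 < len(steps) else "nobody (final step)"
        lines.append(
            f"{i + 1}. {step.start:%H:%M}-{end:%H:%M} {step.description} | "
            f"owner: {step.owner} ({step.contact_number}) | then notify: {next_owner}"
        )
    return lines


# Hypothetical plan for a Saturday-night maintenance window.
plan = [
    PlanStep("Back up all databases", "DBA on call", "555-0100",
             datetime(2024, 1, 13, 22, 0), 30),
    PlanStep("Run implementation scripts", "Deployment lead", "555-0101",
             datetime(2024, 1, 13, 22, 30), 45),
]
for line in render_plan(plan):
    print(line)
```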
Try to imagine likely events, and make a secondary plan that accommodates these variables. Once that is done, think of the improbable things that might occur, and accommodate those if possible. Experience with your application and infrastructure will help you gauge the level of detail required. Note any situations ("showstoppers") that would cancel the entire implementation and invoke the rollback plan.
As a group, the implementation team should create a backup plan. This is a little different from a simple rollback strategy for the database. If anyone's section fails, the group should have an overall plan for evaluating the situation, making a decision, and then proceeding with or canceling the implementation.
No implementation should skip the testing stage. Important insights about the process come from testing, not to mention the confidence that the change plan works. A proper test environment is crucial. To make it useful, you must have a good method of load testing that simulates the conditions that will be experienced. Therefore, you should create this load test before making any changes. Run the plan against the test system and record your selected measurements so that you have something to compare to your changes during the postimplementation testing. By monitoring a system with the same counters during the performance of the original production system test script, and then monitoring the same system after a change is made, you can judge whether the changes you are examining pose a detectable threat to the production environment.
This task takes a considerable amount of analysis and can be time consuming. There are tools available on the Microsoft Developer Network (MSDN) (http://msdn.microsoft.com) that allow configuration of database test scripts, or you can adapt a good Profiler trace instead. You can also use a third-party tool for testing, rather than using a Transact-SQL-based process.
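Once the same counters have been recorded before and after a change, the comparison itself is simple. The following sketch flags any counter that rose by more than a chosen tolerance relative to the baseline. The counter names, the sample values, and the 10 percent tolerance are assumptions for illustration, and the sketch assumes that a higher value is worse for every counter, which is not true of all measurements (throughput, for example).

```python
def compare_to_baseline(baseline, after, tolerance=0.10):
    """Return counters whose value rose by more than `tolerance` relative to baseline."""
    regressions = {}
    for counter, before in baseline.items():
        if counter not in after or before == 0:
            continue  # cannot compute a relative change for this counter
        change = (after[counter] - before) / before
        if change > tolerance:
            regressions[counter] = round(change, 3)
    return regressions


# Hypothetical measurements from the same load test, run before and after a change.
baseline = {"avg_query_ms": 120.0, "cpu_percent": 45.0, "disk_queue_length": 2.0}
after_change = {"avg_query_ms": 150.0, "cpu_percent": 46.0, "disk_queue_length": 2.0}

print(compare_to_baseline(baseline, after_change))  # {'avg_query_ms': 0.25}
```

The tolerance matters because no two runs of a load test produce identical numbers; set it from the variation you see between repeated runs of the unchanged system.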
The operations group's ownership of the system begins with a release readiness review, also called a go/no-go meeting, which determines whether the project is ready for implementation. Although operations staff should be involved throughout the project, this is the first meeting that is run by them, rather than by the development or quality assurance teams.
The purpose of this meeting is to allow members from each team to indicate their final approval (or disapproval) and raise any issues they wish to discuss. This is the last chance to alter the implementation schedule, barring any unforeseen circumstances. Any no-go votes should be seriously considered; if no resolution can be reached by those present, the meeting should be adjourned with a no-go status until resolution is reached or the objection is overridden by senior IT staff. In any case, the objection, the reasons for it, and any risks that are brought up should be recorded for future reference.
The DBA or DSE might occasionally need to postpone or reject changes that have been requested for business or technical reasons. In either case, when this happens, the business impact of the decision (to change or not to change the system) should be evaluated and documented.
The agenda for the go/no-go meeting should include the following:
The readiness of the release itself
Alterations to the physical environment
The preparedness of the operations staff and processes
The installation plan
The contingency plan
Potential impacts on other systems
Staffing and availability
Once the processes for implementation have been decided on and the risks mitigated, a change request form should be filled out as described earlier in this chapter and signed by the proper people.
After planning to implement a change and testing the change thoroughly in a test or staging environment, it is time to execute the well-tested plan. Be sure you have at hand the contact numbers of the on-call server room technicians, network administrators, or security administrators whose help you might need during the implementation.
All users of the system have an SLA (in some cases, an implied SLA) and must be notified in advance of the work that will occur on the system. This process should be clearly documented in the SLA to avoid confusion or omission of important groups.
The first step of any implementation must always be a backup of the current system, from the system down to the databases. Once this is complete and verified, you are ready to begin. Because the DBA group is rarely the lead on the implementation, the person leading the deployment effort should coordinate with the DBA group to either do the backups using some other method or have the DBAs themselves do the backups of the SQL Server databases. The person doing the deployment can then take the process from there.
Except on the very smallest of teams, there should always be at least two DBAs on hand for an implementation. In a large implementation, you might have several DBAs involved in the deployment of the system, but you should still have a standby in the event that someone becomes ill, or in case there are related or unrelated system failures. Remember, even if another production server goes down during an implementation, every effort should be made to continue the deployment as planned; there should be processes in place to handle that failure.
Despite the best-executed plans, something truly unforeseen could occur during a production implementation. Annotate your implementation script with whatever action you had to perform, documenting the difference. Do not wait until the crisis has passed to start making notes, as they could be lost or forgotten. The sequence of events and steps taken might be needed for future use, including subsequent support calls, analysis, or future implementations. Remember that meticulous accuracy is more important than anything else at this stage. Due diligence requires that even if you make a mistake, you record it. Your thoroughness now might save someone else from making the same mistake later on, and it will help the group learn as a whole.
During the deployment process, status e-mails should be sent letting the powers that be know what has and has not happened, and whether the deployment is proceeding according to plan and on schedule. There should also be a no-go point if the maintenance window cannot be achieved due to unforeseen problems. If this decision is made, the rollback plan must be put into action. Although this is rare, it can happen. The same applies for a failure: if a severe failure occurs, the rollback plan must be started. Without the rollback plan as your insurance policy, you might be faced with implementing a full disaster recovery plan, which is not a fun proposition. That could mean hours, or possibly days, of reconfiguration. For a failure or a no-go, document the reasons; these are needed for a postmortem. The problem might not be related to the database at all.
If the implementation is seemingly successful (that is, everything on the surface went smoothly), test it with the test scripts and plans used in the planning and testing phases. This is a crucial step for ensuring that the entire operation was a success. The results should be the same as those found in the testing phase. If they are not, there is a chance the change might have failed.
Whether the implementation is a failure (determined during testing or even earlier) or a success, the users who might have been inconvenienced by the downtime should be notified. If the communication is about a success, include contact information in case users encounter problems.
The deployment process is not just about technical issues, but teamwork as well. Good teamwork is a vital part of the success of an implementation team, which crosses many departments in IT. It relies on the irreproachable accuracy of the information provided by team members.
Whether the deployment process was successful or not, a few things should happen when it is complete.
If the deployment had many sequential steps, and yours was only one, notify the next people when you are complete so that they can start their portion. Do not go home; you might still be needed.
Send an e-mail to the implementation team and all relevant parties with a final status update, including all work done, any observations, and so on.
If there was a failure, include the details of the failure in that e-mail.
Once sufficient testing has been done, notify all users that the system is now ready for requests.
A postmortem meeting should occur, so all parties involved can learn from the process, improve the process for future production deployments (similar or not), and analyze what went wrong if there was a failure. Everything, including the project plan, the design, the arguments, the solutions, the crises, and the final outcome should be examined. Everyone involved in the process from planning to execution should be invited to the meeting. The point of this process is to learn from the events that occurred. If you have things to add, take your documentation with you to share. For the sake of the team, remember to share your positive remarks, not just the negative ones.