Availability once meant that an application would be available during the week, from 9 to 5, regardless of whether customers needed anything. Batch processing took over the evenings and nights, and most people didn’t care because they were at home asleep or out having fun.
But the world has changed! It’s now common to offer extended service hours in which a customer can call for help with a bill, inquiry, or complaint. Even if a live human being isn’t available to help, many enterprise applications are Web-enabled so that customers can access their accounts in the middle of the night while sitting at home in their pajamas.
Increased availability is good, except for one fact: Many systems programmers, DBAs, and other mainframe experts are maturing. It takes a lot of care and feeding to keep applications ready for work, and the people who have maintained these environments for so long have other things they want to do. Many are starting to shift their sights toward that retirement community in Florida that they’ve heard so much about. Most of the bright youngsters who are graduating from college this term haven’t had much exposure to mainframe concepts in their course work, much less any meaningful grasp of the day-to-day requirements for keeping mainframe systems running.
The complex systems that have evolved over the past 30 years must be monitored, managed, controlled, and optimized. Batch windows are shrinking down to almost nothing. Back-ups often take place while an application is running.
Application changes take place on the fly, under the watchful eye of the change-control police.
If an outage occurs, the company stands to lose tens of thousands of dollars an hour. In today’s gloomy economy, stockholders don’t want to hear that their favorite investment is having system availability problems.
Certainly, hardware failures were once more common than they are today. Disk storage is more reliable than ever, but failures are still possible.
More likely to occur, though, is a simple mistake made by an application programmer, system programmer, or operations person. Logic errors in programs or applying the wrong update at the wrong time can result in a system crash or, worse, an undetected error in the database—undetected, that is, until minutes, hours, or days later when a customer calls, a reconciliation fails, or some other checking mechanism points out the integrity exposure.
Finally, disasters do sometimes strike, and most often they occur without warning. Flooding doesn’t always occur when it’s convenient; tornadoes never do. Hurricanes and earthquakes are big-ticket events that ruin everyone’s day. When they strike your data center, wipe out your processing power, or even destroy your basement-level back-up power supply, you have a lot of recovering to do.
Does anyone need a reminder that budgets are tight? You have fewer resources (people, processing power, time, and money) to do more work than ever before, and you must keep your expenses under control. Shrinking expertise and growing complexity cry out for tools to make systems management more manageable, but the tools that can save resources (by making the most of the ones you have) also cost you resources to obtain, implement, and operate.
Businesses today simply cannot tolerate availability problems, no matter what the source of the problem. Systems must remain available to make money and serve customers. Downtime is much too expensive to be tolerated. You must balance your data-management budget against the cost of downtime.
One of the most critical data-management tasks involves recovering data in the event of a problem. For this reason, installations around the world spend many hours each week preparing their environments for the possibility of having to recover. These preparations include backing-up data, accumulating changes, and keeping track of all the needed resources. You must evaluate your preparations, make sure that all resources are available in usable condition, automate processes as much as possible, and make sure you have the right kind of resources.
Often the procedures that organizations use to prepare for recovery were designed many years ago. They may or may not have had care and feeding through the years to ensure that preparations are still sufficient to allow for recovery in the manner required today.
Here is a simple example: Say that an organization has always taken weekly image copies on the weekend and has performed change accumulations at mid-week. Will this approach continue to satisfy their recovery requirements? Perhaps. If all of the resources (image copies, change accumulations, and logs) are available at recovery time, these preparations certainly allow for a standard recovery. However, if hundreds of logs must be applied, the time required for the recovery may be many hours—often unacceptable when the cost of downtime is taken into account.
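The effect of log volume in this example can be put into rough numbers. The sketch below is only an illustration: the function name and every timing figure (restore time, change accumulation time, minutes per log) are assumptions chosen for the arithmetic, not measurements from any real system.

```python
def estimate_recovery_minutes(image_copy_restore_min,
                              change_accum_apply_min,
                              log_count,
                              minutes_per_log):
    """Rough elapsed-time estimate for a standard recovery.

    Assumes the steps run one after another: restore the image copy,
    apply the change accumulation, then apply each log in turn.
    """
    return (image_copy_restore_min
            + change_accum_apply_min
            + log_count * minutes_per_log)


# Illustrative figures only (assumed, not measured).
few_logs = estimate_recovery_minutes(30, 20, log_count=10, minutes_per_log=2)
many_logs = estimate_recovery_minutes(30, 20, log_count=300, minutes_per_log=2)

print(f"10 logs:  {few_logs} minutes")
print(f"300 logs: {many_logs} minutes ({many_logs / 60:.1f} hours)")
```

Under these assumed figures, the image copy and change accumulation steps stay constant while the log-apply step dominates as the number of logs grows, which is why the same preparations can yield very different recovery times.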
This example illustrates the principle that, although your recovery strategy was certainly adequate when it was designed, it may be dangerously obsolete given today’s requirements for increased availability. And what if a required resource is damaged or missing? How will you find out? When will you find out? Finding out at recovery time that some critical resource is missing can be disastrous!
The previous example was unrealistically simplistic. Many organizations use combinations of batch and on-line image copies of various groups of databases, as well as change accumulations, all staggered throughout the week. In a complex environment, how do you check to make sure that every database is being backed-up? How do you find out whether you are taking image copies (either batch or on-line) as frequently as you planned? How do you determine whether your change accumulations are taken as often as you wanted? What if media errors occur? Identifying these types of conditions is critical to ensuring a successful recovery.
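The kind of check these questions call for can be sketched in a few lines. The example below is a minimal illustration, assuming you can already extract a list of database names and the timestamp of each database's most recent image copy. The function name, the database names, and the seven-day threshold are all hypothetical.

```python
from datetime import datetime, timedelta


def find_backup_gaps(databases, last_image_copy, max_age, now):
    """Return (never_copied, stale) lists of database names.

    databases       -- every database that should be backed up
    last_image_copy -- dict mapping database name to the time of its
                       most recent image copy (missing = never copied)
    max_age         -- oldest acceptable image copy, as a timedelta
    now             -- the time at which the check is made
    """
    never_copied = sorted(db for db in databases if db not in last_image_copy)
    stale = sorted(
        db for db in databases
        if db in last_image_copy and now - last_image_copy[db] > max_age
    )
    return never_copied, stale


# Hypothetical inventory and backup history.
now = datetime(2024, 1, 15, 8, 0)
databases = ["CUSTDB", "ORDERDB", "BILLDB"]
last_image_copy = {
    "CUSTDB": datetime(2024, 1, 14, 2, 0),   # copied yesterday
    "ORDERDB": datetime(2024, 1, 2, 2, 0),   # copied almost two weeks ago
}

never_copied, stale = find_backup_gaps(
    databases, last_image_copy, max_age=timedelta(days=7), now=now
)
print("Never copied:", never_copied)
print("Stale:", stale)
```

The same pattern could be extended to change accumulations or media-error flags by adding another dictionary per resource type. The point is that the check runs on a schedule, well before anyone needs to recover.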
Having people with the required expertise available to perform recoveries is a major consideration, particularly in disaster situations. For example, if the only person who understands your IBM Information Management System (IMS) hierarchical database systems and can recover them has moved far away, you’re in trouble. However, if your recovery processes are planned and automated so that less-experienced personnel can aid in or manage the recovery process, then you’re able to maximize all your resources and reduce the risk to your business.
Automation takes some of the human error factor and “think time” out of the recovery equation, and makes the complexity of the environment less of a concern. Creating an automated and easy-to-use system requires the right tools and some planning for the inevitable; but, compared to the possible loss of the entire business, it is worth the investment. With proper planning and automation, recovery is made possible, reliance on specific personnel is reduced, and the human-error factor is nearly eliminated.
Creating the recovery Job Control Language (JCL) for your IMS systems is not as simple as modifying existing JCL to change the appropriate names. In the event of a disaster, the IMS Recovery Control (RECON) data sets must be modified in preparation for the recovery. RECON back-ups are usually taken while IMS is up, which leaves the RECONs in need of many clean-up activities before they can be used to perform a recovery: deleting OLDS, closing LOGS, deleting SUBSYS records, and so on. This process often takes hours to perform manually, with the system down, which equates to lost money. Planning for RECON clean-up is an important but often-overlooked step of the preparation process; discovering a deficiency in this area at disaster-recovery time is too late.
Planning for efficient recoveries is also critical. Multithreading the recovery tasks shortens the recovery process. Recovering multiple databases with one pass through your log data saves time. Taking image copies, rebuilding indexes, and validating pointers concurrently with the recovery process further reduces downtime. Where downtime is costly, time saved is money in the bank. Any measures you can take to perform recovery and related tasks more quickly and efficiently allow your business to resume faster and save money.
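The saving from a single pass through the log can be illustrated with a toy model. The sketch below is not how any IMS utility works internally. It assumes a simplified log in which each record is a (database, key, value) tuple, and it only counts how many log records each approach has to read. All names are hypothetical.

```python
def recover_one_db_per_pass(log, db_names):
    """Naive approach: scan the entire log once for each database."""
    recovered = {db: {} for db in db_names}
    records_read = 0
    for db in db_names:
        for rec_db, key, value in log:
            records_read += 1
            if rec_db == db:
                recovered[db][key] = value  # later records overwrite earlier ones
    return recovered, records_read


def recover_single_pass(log, db_names):
    """Single pass: read each log record once and route it to its database."""
    recovered = {db: {} for db in db_names}
    records_read = 0
    for rec_db, key, value in log:
        records_read += 1
        if rec_db in recovered:
            recovered[rec_db][key] = value
    return recovered, records_read


# Toy log with interleaved updates for two databases.
log = [
    ("CUSTDB", "c1", "Ann"),
    ("ORDERDB", "o1", "open"),
    ("CUSTDB", "c1", "Anne"),
    ("ORDERDB", "o1", "shipped"),
]
dbs = ["CUSTDB", "ORDERDB"]

naive_state, naive_reads = recover_one_db_per_pass(log, dbs)
single_state, single_reads = recover_single_pass(log, dbs)

print("Same result:", naive_state == single_state)
print("Records read:", naive_reads, "vs", single_reads)
```

Both approaches rebuild the same final state, but the naive one reads the log once per database. With many databases and logs that live on tape, that multiplier is where the elapsed time goes.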
So after you’ve thought about and planned for your recoveries, it’s time to think about executing your plan. Clearly the first step to a successful recovery is the back-up of your data.
Your goal in backing-up data is to do so quickly, efficiently, and usually with minimal impact to your customers. If you have a large window where systems aren’t available, standard image copies are your best option. These clean copies are good recovery points and are easy to manage. If, however, you need to take back-ups while systems are active, you may need some help. You can take advantage of recent technological changes in various ways. You might need only very brief outages to take instant copies of your data, or you might have intelligent storage devices that allow you to take a snapshot of your data. Both methods call for tools to assist in the management of resources.
If all of these challenges and preparations sound like a lot to keep up with, you’re right. A lot of things can go wrong, and many of them can lead to major difficulties if you ever need to recover your data. Fortunately (and not surprisingly), vendors offer tools to help with it all. BMC Software, for example, offers the Back-up and Recovery Solution (BRS) for the IMS product.
To assist you with your back-up needs, BRS contains an Image Copy component to help manage your image copy process. Image copies can be taken in a variety of ways, including methods that allow you to take the copy without taking the database off-line. Whether you elect to take batch, on-line (fuzzy), or incremental image copies; Snapshot copies; or Instant Snapshot copies, BRS can do it all. The Image Copy component of BRS is faster and easier to use than the IMS Database Image Copy utility and offers a variety of powerful features: dynamic allocation of all input and output data sets, stacking of output data sets, high-performance access methods (faster I/O), copying by volume, compression of output image copies, and database group processing—all while interfacing with DBRC and processing asynchronously. These features and more are available to help manage your image copy process more efficiently.
BRS also incorporates a Change Accumulation component, which replaces the functions of the IMS Database Change Accumulation utility. The BRS Change Accumulation component takes advantage of the multiple engines, large virtual storage resources, and high-speed channels and controllers that are available in many environments. Use of multiple task control block (TCB) structures enables overlapping of as much processing as possible, reducing both elapsed and CPU time. Use of state-of-the-art techniques reduces sort overhead and I/O, all while providing full DBRC support. Dynamic allocation makes it easy to use.
The tools required both to perform the actual recoveries and to automate the recovery process are also included in BRS. The BRS Recovery component, which functionally replaces the IMS Database Recovery utility for full-function (DL/I) databases and data-entry databases (DEDBs), allows recovery of multiple databases with one pass of the log and change accumulation data sets while dynamically allocating all data sets required for recovery. And it is faster and easier to use than the native IMS utility. Multiple log readers can be used to maximize resource utilization and minimize elapsed recovery time. BRS also recovers multiple databases to any point in time, not just to times when the database is deallocated, eliminating the need for scheduled log switches and saving on the overall number of tapes that you must create, process, and manage. BRS can even determine the best choice for a Point-In-Time (PIT) recovery. Creating image copies concurrently with the database-recovery process, as well as concurrently checking database pointers, saves an additional job step and the time associated with it. You can use DBRC-registered secondary image copies and logs as input. Full DBRC support is included, of course.
The Recovery Manager component of BRS further automates a number of key functions that are associated with recovery. In a problem scenario, the time required to perform the recovery is a major issue and cost. Because downtime is expensive, minimizing recovery time lets business resume sooner and lessens the financial impact of a problem. By creating meaningful groups of related databases and creating optimized JCL to perform the recovery of these groups, the Recovery Manager component lets you automate and synchronize recoveries across applications and databases.
Automatically issuing IMS commands and monitoring their results is also easier with the IMS Command utility of the Recovery Manager component, which provides a positive response for the IMS commands that are used to deallocate and start your databases. This utility is especially helpful in coordinating recoveries in a data-sharing environment.
The Recovery Manager component fully automates the process of cleaning the RECON data sets for restart following a disaster recovery. It captures allocation information about your IMS database data sets and builds IDCAMS delete/define control statements. The Recovery Manager component also allows you to test your recovery strategy. You can ensure that all assets that are necessary for a recovery exist, analyze logs, practice recoveries by creating sample recovery JCL, and create alternate databases for testing purposes. The Recovery Manager component also notifies you when media errors have jeopardized your recovery resources. All of these functions help you predict the impact that your recovery strategy may have on your databases, applications, users, and, ultimately, your business.
As part of your image copy or recovery process, wouldn’t you like to verify the validity of database pointers as you go? BRS offers this capability through the Concurrent Pointer Checking function for both full-function databases and Fast Path data-entry databases (DEDBs). Because the function uses the same I/O operation that reads the database records, you save elapsed time and I/O. Reduced recovery time helps save resources, including money.
The Index Rebuild function of BRS is another way to save time and effort. Rather than maintaining image copies of your primary and secondary indexes, you can use this function to rebuild the indexes if they are ever damaged or lost.
Last but not least, BRS contains the Recovery Advisor component that you can use to keep an eye on your recovery resources. The Recovery Advisor component allows you to monitor the frequency of your image copies and change accumulations. It helps you determine whether all of your databases are being backed-up. Finding out about a problem with your recovery resources, at the time you need to use those resources to solve a critical problem, is not a good thing; by then, it’s too late. By identifying potential problems early, you can take steps to correct the situation before you need to recover. The tool can even take corrective action for you, if you like.
In summary, today’s environment is complex and intolerant of unavailability. Many experts are leaving the field. It’s becoming more difficult to manage this environment. Don’t let your availability woes make the news. Plan for the worst, hope for the best, and make sure you can keep your business up and running. By using any number of back-up and recovery tools available, you can better manage your world and be ready to recover!
Finally, let’s look at some disk and tape data-recovery case studies.