In every high availability solution, as alluded to in Chapter 1, "Preparing for High Availability," you must adhere to some fundamentals; otherwise, achieving the availability you need will be difficult. Even if you are in an IT shop or department that takes care of many of the basics listed in this chapter, you should still find a few nuggets of information here; at worst, you might be bored. Either way, it would be inappropriate for a book on high availability to ignore the fundamentals.
This chapter addresses matters that relate to both the human and technological factors of the high availability equation. It includes such topics as data center best practices, staffing, service level agreements (SLAs), and change management.
This section starts with a few funny (and yet not so funny) stories that are not only based on real events, but are more common than you would think.
A company was experiencing an availability problem intermittently. At approximately 2 A.M. a few nights a month, the main accounting database server went down and then automatically came back about an hour later. In some cases, at least one of the running databases was marked suspect in Enterprise Manager and had to be rebuilt. One occurrence would be an isolated incident, but this was clearly more than that. The CIO, CFO, and CTO launched a joint investigation. In the end, the culprit was not a denial of service attack but a janitor attack: the janitor was unplugging the system to plug in cleaning equipment.
SQL Server was experiencing intermittent failures with no apparent root cause or predictable timing. As in the first example, a joint investigation was launched to determine what was going on. The problem turned out to be that telephone maintenance workers walking past the loose, tangled cabling were shifting cable plugs a bit. To add insult to injury, in the main wiring room that contained the core networking and telephony wires, the telephone workers occasionally touched some of the network wiring in addition to the telephony wiring, causing some network outages.
A junior database administrator (DBA), who was new to the company and did not know the systems, their purpose, or everything about how the company administered servers, started a backup process. The DBA did something incorrectly, such as backing up to the wrong disk volume and filling up a disk subsystem critical for another SQL Server, and caused a serious availability problem until the situation was corrected.
Have any of these events, or something similar, happened where you currently work or have worked in the past? In most companies, at least the first example is preventable by using qualified, professional, trained cleaning staff. The others might not be as easily solved, especially the third one if, say, all the other DBAs are unavailable and the junior DBA is left at the helm. Proper training would not prevent the improper backup itself, but it would eliminate the other problems (not knowing the systems, and so on).
Chances are, you inherently know the best practices that relate to your production environment; many books and sessions at conferences talk about them. But how many do you adhere to, and how many are barriers to your availability? Each topic in this section could be more detailed, but the ideas that follow will serve as guidelines.
As a database system engineer or administrator, you might not have complete authority over how your SQL Server is eventually hosted or managed, but you should be able to observe and document discrepancies you see. If problems persist, take your documentation, along with records of downtimes, to management to influence and change how the servers and systems you are responsible for are hosted and managed.
Even if you do not run the data center, have no direct involvement or access, or have no input on its potential construction, it is important to let the proper people know about the things listed next. Whether you are using a third-party hosting company or keeping everything in-house, consider location; security; cabling, power, communication systems, and networks; third-party hosting; support agreements; and the "under the desk" syndrome.
The location of your data center will contribute to its availability. Do not locate or use a data center that is under plumbing lines, sewer lines, or anything similar: basically, anything that could cause problems. What would happen if a pipe burst, sending water through the ceiling of the data center and flooding the server room? Think about what businesses are above and below your offices. If, for example, your offices are located under a kitchen, not only is the potential for fire increased, but leaks could also seep through the kitchen floor. Is it wise to put your data center near something that could go up in smoke at any minute?
If none of this can be avoided, consider the options listed throughout this section and see if they help mitigate risk to the data center. If they do not, or if some of them are not feasible due to high costs or other reasons, the risk will have to be documented and taken into account. Some situations are unavoidable, even in the best of IT shops.
Proper climate control of systems is crucial for availability. There is a big difference between hosting one server and hosting 1000 servers in the same location. Computers do not work well when they are overheating, and the wrong climate can also cause anomalies such as condensation if the conditions in the room are just right. Make sure you have a properly installed heating, ventilation, and air conditioning (HVAC) system, with the emphasis on properly installed. This means that there will be sufficient air circulation to keep the entire room at a uniform, acceptable temperature. Conversely, if your data center is in an extremely cold geographic climate, you might also need a heating system because systems do not work well in extreme cold, either.
Make sure that the entire data center's airflow is even; otherwise there could be hot and cold spots. If certain areas are too cold, an administrator working in the room might adjust the air conditioning for personal comfort, raising the temperature in areas other than his or her own and causing system outages in the now-hot areas. Maintaining systems at the proper temperature extends their mean time between failures (MTBF) and the life of the hardware; in extreme temperatures some hardware might not function at all.
Electrical lines must be properly grounded. Failing to have this done can potentially result in some bizarre behaviors in electrical equipment. Newer buildings should, for the most part, be grounded, but some older buildings might not be. Computer components, especially small ones like CPUs and memory, do not react well when sudden power surges occur. Power conditioners might be needed to guarantee that there is a consistent, even amount of power flowing to all things electric.
Servers should be neatly and logically organized in racks (see Figure 2-1) where all equipment can be reached. The location of each server should be documented so that in the event of an emergency, an affected server can be easily located. Racks also help protect the equipment from damage, protect against most accidental situations (such as a person accidentally knocking a wire loose), and foster air circulation to assist with proper climate control.
Figure 2-1: A good example of servers arranged neatly in racks.
To enable the equipment to be rolled into the data center on racks, you might need specific clearance for both height and width; otherwise, you might not be able to get the equipment into the data center. Similarly, if you need to move the equipment out, you must account for that as well (for example, if you have custom racks built into the data center instead of brought in). Also, make sure that ramps can hold the weight of the equipment being rolled in and out, as well as accommodate people in wheelchairs. The ramps should not be too narrow.
Plan for growth on many levels. Much like capacity planning for your systems themselves, invest the time to assess how much physical space is currently needed and how much will be needed over the next few years. What is the planned life of the data center? One specific example of something small and seemingly insignificant is making sure there are enough electrical outlets in the room to allow for additional equipment over the years. In no case do you want to have to resort to power strips to expand the power, risking overload and outages. You must also take into account the increased power consumption and its effect on such things as the HVAC system.
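As a rough illustration of this kind of capacity arithmetic, the sketch below (using entirely hypothetical server counts, wattages, and circuit ratings, not figures from any real facility) estimates how many electrical circuits a planned rack layout would need once growth headroom is factored in:

```python
from math import ceil

# Hypothetical inventory: (watts per server, planned count) for each role.
servers = {"database": (450, 4), "web": (300, 10), "backup": (600, 1)}

# A 30 A circuit at 208 V, derated to 80% of capacity as a safety margin.
circuit_watts = 208 * 30 * 0.8

current_draw = sum(watts * count for watts, count in servers.values())
planned_draw = current_draw * 1.5  # allow 50% headroom for future equipment

circuits_needed = ceil(planned_draw / circuit_watts)
```

Real planning would use vendor power figures and local electrical code, but even a back-of-the-envelope calculation like this catches the "not enough outlets" problem before the racks arrive.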
A fire suppression system should be installed in the data center. If at all possible, make sure that it will not further damage the computer systems or endanger human life. It's imperative that anyone who works in or near the server room reacts properly to a fire alarm and adheres to all building rules, including taking part in fire drills. If the fire alarm cannot be heard in the data center, there should be some other visual indicator to alert the staff. One concern about some fire systems is that they are just as dangerous to human beings as they are to computer systems. Halon depletes the ozone layer, and some fire suppression systems can turn artificial ceiling materials into a deadly weapon.
Make sure that there is enough room above and below the server racks to allow the cables and climate control system to be installed properly (see Figure 2-2). Also, make sure the racks will fit into the room: do not prepurchase racks only to find out that they will not fit, and allow for the clearance and airflow needed.
Figure 2-2: The proper amount of clearance above the racks as well as room for the cabling.
Climate and geographic location are important factors in the placement of a data center. If you work in a place susceptible to volcanic eruptions, earthquakes, hurricanes, tsunamis, or other natural disasters, take that into account in your planning. Because of these limitations imposed by nature (you cannot stop a hurricane from flooding or, in a worst-case scenario, destroying your data center), you might need to consider a second location in a separate geographical area to act as a secondary data center, or cold site.
Security is no joke. Securing both the computer systems and the data center itself will not only protect the systems, but also increase their availability. If it is not done, availability will decrease as a result of security breaches, viruses, or other attacks. According to two separate polls reported in eWeek ( http://www.eweek.com/article2/0,3959,930,00.asp and http://www.eweek.com/article2/0,3959,537304,00.asp ), many IT organizations were still struggling with securing their environments or thought they were doing just fine with their security plans, a shock in and of itself. Although planning and implementing the security of your systems is a good thing, much like high availability, no matter how much you plan for security, there is always something you could miss. Although security is a book unto itself, its concepts are addressed throughout this book.
No unauthorized users should be allowed physical access to the data center. To prevent this, access of all persons who enter the server room, and even those who enter the operations center, should be logged. For tighter security, only people with special entry cards should have access. At a bare minimum, invest in a clipboard, some paper, and a pen and make sure people walking in and out of the data center log their times.
Install video monitoring equipment so that intruders can be identified in the event of unauthorized access. Store those tapes in a separate, secured location for the period of time specified by your corporate security policies.
Along with allowing only authorized people to have access to the server room, make sure that no one needs to go through the server room or behind the racks to get to other equipment or areas, such as the telephone wiring system. The design of the data center should prevent such a situation.
Locks, whether standard or electronic, should be installed where appropriate, starting with the doors, and possibly even down to the level of individual servers. This enforces authorized-only access. If your data center relies on a clipboard, or even a live guard, for signing people in and out (as previously mentioned), that might not be enough protection. Signing in does not stop someone from powering down a mission-critical server.
After the server is moved into the data center, you should not need to physically touch the console, with rare exceptions, such as if a piece of hardware fails or software must be installed and installation cannot be done through some other method. Leave the server alone; forget it is there (well, do not forget, but only worry about it in other ways, such as backup and monitoring). Use a remote administration or access tool, such as SQL Server's Enterprise Manager or Windows Terminal Server, to manage and access the servers. Ensure that proper training is given to staff to prevent accidental shutdowns or reboots. If there is not a problem on the system, do not look for one.
System auditing, both of logons to the operating system and of events on Microsoft SQL Server, should be enabled if possible. This makes unauthorized access easier to track should it occur.
Very basic considerations in a data center, power and networking, are just as important as security.
If your data center is located in an area that is prone to natural events such as tsunamis, earthquakes, tornados, and so on, or if you are worried about power interruptions, a diesel-fueled power generator should be connected to the data center to ensure a continuous power supply. These generators were very common in planning for Y2K efforts. Test the switchover to the generator and, if necessary, have a backup for the backup!
Even if a power generator is not necessary, uninterruptible power supplies (UPSs) should, at a bare minimum, be attached to each system in the data center.
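UPS sizing is largely arithmetic: usable battery capacity divided by the attached load gives a rough runtime estimate. The figures below are purely hypothetical; real units publish runtime curves that account for battery age and load profile:

```python
# Hypothetical UPS and load figures for a single rack.
battery_watt_hours = 1500   # usable battery capacity in Wh
inverter_efficiency = 0.9   # fraction of battery energy delivered to the load
load_watts = 900            # total draw of the attached servers

# Estimated runtime on battery, in minutes.
runtime_minutes = battery_watt_hours * inverter_efficiency / load_watts * 60
```

The point of such an estimate is to verify that the UPS can carry the load long enough for a clean shutdown or a generator switchover, not to ride out a long outage.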
Systems should not all be placed on the same power grid, especially if a primary server and its secondary or redundant server for high availability are in the same data center. This would be considered a single point of failure.
All data and voice networks should have redundant paths into and out of the data center. Ask the data and voice provider to place the redundant lines on a different pathway, if possible. That ensures that if one line is accidentally cut or goes down due to weather, you can still be up and running.
The cabling infrastructure should be extremely neat (see Figure 2-3), labeled, and possibly even color-coded (for example, red for 10-Mb networking, blue for crossover cables, and so forth). In the event of an emergency or a troubleshooting session, it will be much easier to work in a well-thought-out, standardized environment. Throughout the entire infrastructure, Category 5 cabling should be used.
Figure 2-3: A good example of neat cabling.
Telephones with outside lines should be placed at regular intervals in the data center. If possible, some of the phones should be placed on analog lines (see Figure 2-4). In the event of a power or connectivity problem, analog lines and normal phones can usually dial out even if there is a problem with a switchboard. They also let a vendor's consultant dial into his or her company's systems to get a patch if it cannot be downloaded directly from the system.
Figure 2-4: An analog phone connection box in a data center, clearly labeled.
Although this is more of a networking issue, if the company is geographically dispersed and depends on systems in multiple locations, plan for contingencies if the wide area network (WAN) goes down. Redundant networks in and out of the data center will not cover this situation if the link at the other end is disabled.
Remember to plan for proper growth to support the systems that will be placed in the data center in the future. If you run out of network switches, ports, cabling, and so on, you will have other availability-related problems.
If you use a third-party hosting company, all of the preceding points apply, and you should also consider the following, along with other ideas you might have:
What kind of access will your employees have if they need to visit the hosting company? Is there a limit to how many of your employees can be in the server room at once, and if there is, does it meet your needs? Can you have access only at certain hours of the day? How long does it take to get to the location in good and bad times (for example, during rush hour), especially if it is not within driving distance?
What is the turnaround time for a problem if the hosting company is doing the normal day-to-day administration of the servers? Do you have direct access to qualified engineers, or is there a gated process that slows the time to resolution? This would be addressed by a proper SLA, which is discussed later in this chapter.
How secure is your equipment compared to that of another company? If someone is performing maintenance on another company's servers that are located next to yours, can the administrator accidentally dislodge a cable in your racks?
How will you monitor that the hosting company is actually doing the work it is required to (for example, swapping tapes in the backup digital linear tape [DLT] device)? How will you secure your backups? How will you access them quickly, if needed?
Make sure that the third-party vendor's support agreements are in place and meet your level of availability.
In conjunction with the previous point, check to see that the third-party hosting company itself is adhering to sound high availability and disaster recovery practices.
Purchase support contracts for all software and hardware that will be hosted in the data center. This includes the operating system, application software, disk subsystem, network cards, switches, hubs, and routers: anything that is or is not nailed down! Support contracts are forms of SLAs. There is nothing worse than encountering the example described in Chapter 1, in which the system was down for a long time due to the lack of a proper support contract. If your availability requirement is three nines, make sure the hardware vendor or third-party hosting company and their vendors can meet that standard. Support contracts are insurance policies that any environment serious about high availability cannot be without.
Make sure you understand what type of agreements you have with each vendor your company has a contract with. Learn how to use their support services. Record their contact information in several safe but accessible places. Include your account number and any other information you need to open a case. Keep records of this information in your operations guide, and keep a copy of the operations guide offsite. It is critical that members of the team are well aware of the terms of these agreements, because many of them have specific restrictions on what you or your team can do to the system.
For example, many hardware vendor SLAs have clauses that permit only support personnel from the vendor or specific, certified persons from your team to open the server casing to replace defective components. Failure to comply could result in a violation of the SLA and potential nullification of any vendor warranties or liabilities. Something as simple as turning a screw on the case might violate the SLA, so be careful!
Also consider a product's support life cycle. This will not only help you determine the support contract that will need to be purchased, but will also help the planners decide the life cycle of the solution without having to do a major upgrade. Microsoft publishes support life cycle phases, policies, and costs at http://support.microsoft.com/lifecycle . The Microsoft Support Life Cycle is a three-phase support approach:
Most products will have a minimum of five years of mainstream support from the date they are generally available.
Customers might have the ability to pay for two years of extended support for certain products.
Online self-help for most products will be available for a minimum of eight years.
For any changes to policies in the program, please consult the Microsoft Web site just listed.
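The three phases above can be turned into concrete planning dates. This sketch, using a hypothetical general-availability date rather than any real product's, computes when each phase would end under the stated minimums:

```python
from datetime import date

def add_years(d, n):
    """Shift a date forward n years, clamping Feb 29 to Feb 28."""
    try:
        return d.replace(year=d.year + n)
    except ValueError:
        return d.replace(year=d.year + n, day=28)

ga_date = date(2003, 4, 24)                  # hypothetical general availability
mainstream_end = add_years(ga_date, 5)       # minimum five years of mainstream support
extended_end = add_years(mainstream_end, 2)  # optional two years of extended support
self_help_end = add_years(ga_date, 8)        # minimum eight years of online self-help
```

Laying out these dates next to a solution's planned life makes it obvious whether a major upgrade will fall inside or outside the supported window.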
Last but not least, the production servers should not be under the DBA's or system administrator's desk. Sadly, this phenomenon still exists in some companies. Think about the way some applications make their way into production: someone releases a beta or prototype, people like it and start using it, and it becomes a production or mission-critical application. Most applications are planned for and do not get into production this way, but in companies small and large there are exceptions to the planning rule.
If a production server is under the desk, the server does not have any of the protection (including monitoring and backups) it needs to be highly available, leaving it exposed to anyone who walks by. Another problem is that because the application was not initially intended for production use or was a beta version, there is a good chance it was not fully tested and, more specifically, not load tested to see if it can handle thousands of enterprise users hitting the server. The computer itself might have the right specifications to run the mission-critical application (the processor and memory might be more than you currently need, and it might work well on a desktop operating system), but once it becomes a mission-critical server that must be online, will it meet the needs and growth of the enterprise?
Keeping a server in the data center does more than protect it from casual passersby or environmentally related accidents; it means that factors like needs and growth were accounted for. If there is no data center, or if its operations are not optimal enough to provide the protection and availability the server needs, other options should be explored, such as a hosting provider.