Achieving high availability is not as simple as installing a piece of software or a new piece of hardware, or using a /highavailability command-line switch when starting up a program. If it were that easy, you would not be reading this
book. Constructing and testing an end-to-end solution that
encompasses people, process, and technology is the only
tried-and-true method of preparing for a potential disaster.
Redundant technology for availability only provides the end
physical manifestation of a larger, agreed-on goal that also takes
into account security and performance. Trade-offs can be made to
achieve the end result, but the end result should
In every high availability solution, as alluded to in Chapter 1, Preparing for High Availability, you must
This chapter addresses matters that relate to both the human and technological factors of the high availability equation. It includes such topics as data center best practices, staffing, service level agreements (SLAs), and change management.
This section starts with a few funny, and yet not so
A company was experiencing an availability problem intermittently. At approximately 2 A.M. a few nights a month, the main accounting database server went down and then automatically came back about an
SQL Server was experiencing intermittent failures with no apparent root cause or predictable timing. As in the first example, a joint investigation was launched to determine what was going on. The problem turned out to be that telephone maintenance workers were walking past the cabling, which was loose and tangled, shifting cable plugs a bit. To add insult to
A junior database administrator (DBA), who was new to the company and did not know the systems, their purpose, or everything about how the company administered servers, started a backup process. The DBA did something incorrectly (such as backing up to the wrong disk volume and filling up a disk subsystem critical for another SQL Server) and caused a serious availability problem until the situation was corrected.
Have any of these events, or something similar,
Chances are, you
As a database system engineer or administrator, you might not have complete authority over how your SQL Server is eventually hosted or managed, but you should be able to observe and document discrepancies you see. If problems persist, take your documentation, along with records of downtimes, to management to influence and change how the servers and systems you are responsible for are hosted and managed.
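Records of downtime are more persuasive when they are summarized per server. The following is a minimal sketch of one way to do that, assuming the outages have been written down as start and end times; the server names, dates, and the `summarize_downtime` helper are all hypothetical and only illustrate the idea.

```python
from datetime import datetime, timedelta

# Hypothetical downtime log: (server name, outage start, outage end).
# Names and timestamps are illustrative, not taken from a real system.
DOWNTIME_RECORDS = [
    ("ACCT-SQL01", datetime(2003, 3, 4, 2, 0), datetime(2003, 3, 4, 3, 0)),
    ("ACCT-SQL01", datetime(2003, 3, 18, 2, 0), datetime(2003, 3, 18, 2, 45)),
    ("WEB-SQL02", datetime(2003, 3, 9, 14, 30), datetime(2003, 3, 9, 14, 40)),
]


def summarize_downtime(records, period_start, period_end):
    """Return {server: (outage_count, total_downtime, availability_percent)}."""
    period = period_end - period_start
    totals = {}
    # Accumulate the number of outages and the total time down per server.
    for server, start, end in records:
        count, downtime = totals.get(server, (0, timedelta()))
        totals[server] = (count + 1, downtime + (end - start))
    summary = {}
    for server, (count, downtime) in totals.items():
        # Availability is the share of the reporting period the server was up.
        availability = 100.0 * (1 - downtime / period)
        summary[server] = (count, downtime, round(availability, 3))
    return summary


if __name__ == "__main__":
    report = summarize_downtime(
        DOWNTIME_RECORDS, datetime(2003, 3, 1), datetime(2003, 4, 1)
    )
    for server, (count, downtime, availability) in sorted(report.items()):
        print(f"{server}: {count} outage(s), {downtime} down, {availability}% available")
```

A summary like this, kept alongside the written observations, gives management a per-server figure to weigh against the cost of changing how the systems are hosted.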
Even if you do not run the data center, have no direct involvement or access, or have no input on its potential construction, it is important to let the proper people know about the things listed
The location of your data center will contribute to its availability. Do not build or use a data center that is located under plumbing lines, sewer lines, or anything similar: basically, anything that could cause problems. What would happen if a pipe burst, sending water through the ceiling of the data center and flooding the server room? Think about what businesses are above and below your offices. If, for example, your offices are located under a kitchen, not only is the potential for fire increased, but leaks could also seep through the kitchen floor. Is it wise to put your data center near something that could go up in smoke at any minute?
If none of this can be avoided, consider the options listed throughout this section and see if they help mitigate risk to the data center. If they do not, or if some of them are not
Proper climate control of systems is crucial for availability. There is a big difference between hosting one server and hosting 1000 servers in the same location. Computers do not work well if they are
Make sure that airflow throughout the data center is even; otherwise, there could be hot and cold spots. If certain areas are too cold, there is a risk that an administrator who is cold while working in the room will adjust the air conditioning, raising the temperature in other areas and causing system outages in the now-hot areas. Maintaining systems at the proper temperature extends their mean time between failures (MTBF) and the life of the hardware; in extreme
Electrical lines must be properly grounded. Failing to have this done can
Servers should be neatly and logically organized in racks (see Figure 2-1) where all equipment can be reached. The location of each server should be documented so that in the event of an emergency, an affected server can be easily located. Racks also help protect the equipment from damage, protect against most accidental situations (such as a person
Figure 2-1: A good example of servers arranged neatly in racks.
To enable the equipment to be rolled into the data center on racks, you might need specific clearance for both height and width; otherwise, you might not be able to get the equipment into the data center. Conversely, if you need to move the equipment out, you must account for that as well (for example, if you have custom racks built into the data center instead of brought in). Also, make sure that ramps can hold the weight of the equipment being rolled in and out, as well as accommodate people in wheelchairs. The ramps should not be too narrow.
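The clearance and ramp checks described above can be written down before any equipment is ordered. The following is a minimal sketch, assuming you have measured the tightest point on the route into the room; the `Equipment` and `AccessPath` types, the safety margin, and all figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Equipment:
    name: str
    height_cm: float
    width_cm: float
    weight_kg: float


@dataclass
class AccessPath:
    """Tightest point on the route into the data center (hypothetical figures)."""
    min_height_cm: float
    min_width_cm: float
    ramp_capacity_kg: float


def clearance_problems(item, path, margin_cm=5.0):
    """Return a list of reasons the item cannot be rolled in (empty if it fits)."""
    problems = []
    # The margin is an assumed allowance for casters, handles, and maneuvering.
    if item.height_cm + margin_cm > path.min_height_cm:
        problems.append("too tall")
    if item.width_cm + margin_cm > path.min_width_cm:
        problems.append("too wide")
    if item.weight_kg > path.ramp_capacity_kg:
        problems.append("too heavy for ramp")
    return problems


if __name__ == "__main__":
    rack = Equipment("loaded rack", height_cm=200, width_cm=60, weight_kg=450)
    route = AccessPath(min_height_cm=203, min_width_cm=90, ramp_capacity_kg=400)
    print(clearance_problems(rack, route))
```

The same check applies in reverse when equipment has to be moved out of the data center.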
Plan for growth on many levels. Much like capacity planning for your systems
A fire suppression system should be installed in the data center. If at all possible, make sure that it will not further damage the computer systems or endanger human life. It's imperative that
Make sure that there is enough room above and below the server racks to allow the cables and climate control system to be installed properly (see Figure 2-2). Also, make sure the racks will fit into the room while still allowing for the clearance and airflow needed; do not prepurchase racks only to find out that they will not fit.
Figure 2-2: The proper amount of clearance above the racks as well as room for the cabling.
Climate and geographic location are important factors in the placement of a data center. If you work in a place susceptible to volcanic eruptions, earthquakes, hurricanes, tsunamis, or other natural disasters, take that into account in your planning. Because of these limitations imposed by nature (you cannot stop a
Security is no joke. Securing both the computer systems and the data center itself will not only protect the systems, but also increase your systems' availability. If it is not done, availability will decrease as a result of security breaches, viruses, or other attacks. According to two separate
No unauthorized users should be allowed physical access to the data center. To enforce this, log every person who enters the server room, and even those who enter the operations center. For tighter security, only people with special entry cards should have access. At a bare minimum, invest in a clipboard, some paper, and a pen, and make sure people walking in and out of the data center log their times.
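Whether the log comes from card readers or is typed in from a clipboard, it is only useful if someone reviews it. The following is a minimal sketch of such a review, assuming each entry records a time, a person, and whether they signed in or out; the names, the `AUTHORIZED` set, and the `review_access_log` helper are hypothetical.

```python
from datetime import datetime

# Hypothetical list of people allowed into the server room.
AUTHORIZED = {"asmith", "bjones"}

# Hypothetical sign-in sheet: (timestamp, person, action).
ACCESS_LOG = [
    (datetime(2003, 5, 1, 9, 0), "asmith", "in"),
    (datetime(2003, 5, 1, 9, 30), "vendor1", "in"),
    (datetime(2003, 5, 1, 10, 15), "asmith", "out"),
    (datetime(2003, 5, 1, 11, 0), "bjones", "in"),
]


def review_access_log(log, authorized):
    """Return (people still inside with entry time, entries by unauthorized people)."""
    inside = {}
    unauthorized = []
    # Process entries in time order so sign-outs cancel the matching sign-ins.
    for timestamp, person, action in sorted(log):
        if action == "in":
            inside[person] = timestamp
            if person not in authorized:
                unauthorized.append((timestamp, person))
        elif action == "out":
            inside.pop(person, None)
    return inside, unauthorized


if __name__ == "__main__":
    still_inside, flagged = review_access_log(ACCESS_LOG, AUTHORIZED)
    print("Still inside:", sorted(still_inside))
    print("Unauthorized entries:", flagged)
```

Anyone who appears as still inside long after signing in, or who is flagged as unauthorized, is a prompt for follow-up rather than proof of a breach.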
Install video monitoring equipment so that intruders can be identified in the event of unauthorized access. Store those tapes in a separate, secured location for the period of time specified by your corporate security policies.
Along with allowing only authorized people to have access to the server room, make sure that no one needs to go through the server room or behind the racks to get to other equipment or areas, such as the telephone wiring system. The design of the data center should prevent such a situation.
Locks, whether standard or electronic, should be installed where appropriate, starting with the doors, and possibly even down to the level of individual servers. This supports authorized access only. If your data center has a clipboard, or even a live guard, to allow people to sign in or out (as previously mentioned), that might not be enough protection. Signing in does not stop someone from powering down a mission-critical server.
After the server is moved into the data center, you should not need to physically touch the console, with rare exceptions, such as if a piece of hardware fails or software must be installed and installation cannot be done through some other method. Leave the server alone; forget it is there (well, do not forget, but only worry about it in other ways, such as backup and monitoring). Use a remote administration or access tool, such as SQL Server's Enterprise Manager or Windows Terminal Server, to manage and access the servers. Ensure that proper training is given to staff to prevent accidental shutdowns or reboots. If there is not a problem on the system, do not look for one.
System auditing, both for physical logon to the operating system as well as auditing of events on Microsoft SQL Server, should be enabled if possible. This makes unauthorized
Very basic considerations in a data center (power and networking) are just as important as security.
If your data center is located in an area that is prone to natural events such as tsunamis, earthquakes, tornados, and so on, or if you are worried about power interruptions, a power generator fueled by
Even if a power generator is not necessary, uninterruptible power
Systems should not all be placed on the same power grid, especially if a primary server and its secondary or redundant server for high availability are in the same data center. This would be
All data and voice networks should have redundant paths into and out of the data center. Ask the data and voice provider to place the redundant lines on a different
The cabling infrastructure should be extremely neat (see Figure 2-3), labeled, and possibly even
Figure 2-3: A good example of neat cabling.
Telephones with outside lines should be placed at regular intervals in the data center. If possible, some of the phones should be placed on analog lines (see Figure 2-4). In the event of a power or connectivity problem, analog lines and normal phones can usually dial out even if there is a problem with a switchboard. They also let a vendor's consultant dial into his or her company's systems to get a patch if it cannot be downloaded directly from the system.
Figure 2-4: An analog phone connection box in a data center, clearly labeled.
Although this is more of a networking issue, if the company is
Remember to plan for proper growth to support the systems that will be placed in the data center in the future. If you run out of network switches, ports, cabling, and so on, you will have other availability-
If you use a third-party hosting company, all of the
What kind of access will your employees have if they need to visit the hosting company? Is there a limit to how many of your
What is the
How secure is your equipment compared to that of another company? If someone is performing maintenance on another company's servers that are located next to yours, can the administrator accidentally dislodge a cable in your racks?
How will you monitor that the hosting company is actually doing the work it is required to (for example, swapping tapes in the backup digital linear tape [DLT] device)? How will you secure your
Make sure that the third-party vendor's support agreements are in place and meet your required level of availability.
In conjunction with the previous point, check to see that the third-party hosting company itself is adhering to sound high availability and disaster recovery practices.
Purchase support contracts for all software and hardware that will be hosted in the data center. This includes the operating system, application software, disk subsystem, network cards, switches, hubs, and routers: anything that is or is not nailed down! Support contracts are forms of SLAs. There is nothing
Make sure you understand what type of agreements you have with each vendor your company has a contract with. Learn how to use their support services. Record their contact information in several safe but accessible places. Include your account number and any other information you need to
For example, many hardware vendor SLAs have clauses that permit only support personnel from the vendor or specific, certified persons from your team to open the server
Also consider a product's support life cycle. This will not only help you determine the support contract that will need to be purchased, but will also help the planners decide how long the solution can run without a major upgrade. Microsoft publishes support life cycle, policies, and costs at http://support.microsoft.com/lifecycle. The Microsoft Support Life Cycle is a three-phase support approach:
Most products will have a minimum of five years of mainstream support from the date they are
Customers might have the ability to pay for two years of extended support for certain products.
Online
For any changes to policies in the program,
Last but not least, the production servers should not be under the DBA or system administrator's desk. Sadly, this
If a production server is under the desk, the server does not have any of the protection (including monitoring and backups) it needs to be highly available, leaving it exposed to anyone who walks by. Another problem is that there is a good chance that because the application was not initially intended for production use or was a beta version, it might not have been fully tested and, more
Keeping it in the data center does more than protect it from casual passersby or environmentally related accidents; it means that factors like needs and growth were accounted for. If there is no data center or its operations are not optimal to provide the protection and availability the server needs, other options should be explored, such as a hosting provider.