Case Study: Microsoft.com | Microsoft SQL Server 2000 High Availability

Microsoft.com is one of the world's most visited Web sites. Because it is the company's portal to the world, Microsoft maintains multiple versions of the Web site in different languages, all of which are available 24 hours a day, 7 days a week. This is the goal for most companies, but how does Microsoft achieve it?

Background Information

The operations side of Microsoft.com employs many of the principles and best practices in this book, with obvious tweaks for their own environment. To give you a better understanding of the topology that Microsoft.com employs, as of the writing of this book they have 930 servers (of all types: SQL Server, domain controllers, IIS servers, and more), and might have as many as 1500 in the next 18 months. The SQL Servers used by Microsoft.com currently house approximately 1100 databases. Microsoft.com has a dedicated monitoring group for all servers (including SQL Server) that is staffed around the clock to ensure that any problems that occur can be detected quickly. Microsoft.com also has redundant data centers in the United States that host all versions of Microsoft.com, including localized versions for international customers. Downloads, such as service packs for Microsoft products, are the only things hosted outside of these data centers. Using global caching vendors is more cost-effective, and because the files are closer to you, the download is as quick as possible, no matter where you are.

With an enterprise of this magnitude, you might imagine that a small army deploys, maintains, administers, and monitors these servers. That could not be further from the truth. As of the writing of this book, Microsoft.com employs 53 full-time employees, only nine of whom are DBAs. Having nine DBAs translates to an average of slightly more than 120 databases per person. For most, that seems like a massive amount of responsibility. Microsoft.com achieves this streamlining through effective team cross-training and coordination. No one, including the Group Operations Manager for Microsoft.com, carries a pager. This makes a bold statement.
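The per-DBA average follows directly from the figures quoted above. A quick check, with variable names chosen here purely for illustration:

```python
# Figures quoted in the text, as of the writing of the book.
databases = 1100  # approximate number of databases hosted
dbas = 9          # full-time DBAs on the operations staff

databases_per_dba = databases / dbas
print(f"{databases_per_dba:.1f} databases per DBA")  # 122.2 databases per DBA
```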

Each component hosted on Microsoft.com, including all of the databases, has an owner with defined responsibilities. DBAs, as advocated in this book, are involved early on in the process, including system design, capacity planning for hardware purchasing, database design, code reviews, and availability architecture. If something goes wrong, the owner is called directly, no matter what time of day or night. There is no middle escalation point. It is therefore in the best interest of the database or application owner to get it right the first time, every time. The person responsible for each section must write a troubleshooting guide that is provided to the 24/7 monitoring staff. The goal is to design systems that are reliable, available, scalable, and easily maintainable out of the gate.

Microsoft.com stresses that no matter how good your monitoring system is, you must monitor the monitor. How do you know the monitor is available? If it goes down, how do you recover quickly so that you are up and running to respond to other alerts? Microsoft.com uses an in-house, custom monitoring tool that consolidates everything under one interface.
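One common way to "monitor the monitor" is a heartbeat watchdog: the primary monitor reports in after each check cycle, and a separate process raises an alarm if those reports stop. The text does not describe how Microsoft.com's in-house tool handles this, so the following is only a minimal sketch under that assumption. The class name `HeartbeatWatchdog`, its methods, and the 60-second timeout are all hypothetical.

```python
import time
from typing import Callable, Optional


class HeartbeatWatchdog:
    """Tracks heartbeats from a monitoring service and flags it when stale.

    Hypothetical helper for illustration; not Microsoft.com's in-house tool.
    """

    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_seconds = timeout_seconds
        self._clock = clock  # injectable so the logic can be tested without waiting
        self._last_beat: Optional[float] = None

    def beat(self) -> None:
        """Called by the primary monitor each time it completes a check cycle."""
        self._last_beat = self._clock()

    def monitor_is_alive(self) -> bool:
        """Return False if no heartbeat has arrived within the timeout."""
        if self._last_beat is None:
            return False
        return (self._clock() - self._last_beat) <= self.timeout_seconds


# Usage with a fake clock (assumed values) so the example runs instantly.
now = [0.0]
watchdog = HeartbeatWatchdog(timeout_seconds=60, clock=lambda: now[0])
watchdog.beat()
now[0] = 30.0
print(watchdog.monitor_is_alive())  # True: last heartbeat was 30 seconds ago
now[0] = 120.0
print(watchdog.monitor_is_alive())  # False: heartbeat is older than the timeout
```

The watchdog itself should run on separate infrastructure from the monitor it watches; otherwise a single failure can take both down at once.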

Microsoft.com also uses prerelease and beta versions of Microsoft software in production. This is a proving ground for many Microsoft technologies because it is a high-volume Web site that needs to be scalable and available. This is one of the keys to success: although you are supporting current and past platforms, you also need to look forward. The experiments might not always work out as you had hoped, but the lessons learned can always be applied elsewhere. For Microsoft.com, employing prerelease software has not affected its overall availability.

In terms of technology, Microsoft.com uses failover clustering for some systems because clustering offers failover to prevent downtime when performing some routine maintenance. They employ other technologies (for example, Network Load Balancing coupled with log shipping) where the cost of implementing a cluster did not make sense or was overkill. They also use high-quality components and hardware designs including fibre, redundant power supplies, and error-correcting memory. They then match that hardware quality with an equally high-quality software architecture design.

For the workers at Microsoft.com, availability goals are a point of personal pride in addition to the 24/7 SLA. The goal is to be the most highly available site on the Web as well as one of the largest. That presents particular challenges; for example, how do you maintain availability when you need to take the back-end database down for routine maintenance, upgrades, and so on?

Planning and Development

One of the most important policies implemented at Microsoft.com is that each project has not only its owners but also a process that covers the lifecycle of the project. The group operations manager and his counterpart have formal project reviews once a week. These must be polished presentations, not unlike a business review. If they feel that the team is not ready, they do not let them proceed. Microsoft.com is too critical to risk, and application designs must be virtually bulletproof when they are released.

Another important aspect of the availability of Microsoft.com is the development of troubleshooting guides for the monitoring staff to use prior to production rollout. These troubleshooting guides must be well written, because if the problem cannot be solved by using the guide, the person responsible for it will have a call escalated to them. If the proper procedures are clearly defined, the phone does not ring. Additionally, operations guides, maintenance plans, and monitoring solutions are designed at this stage. The application itself is designed with a monitoring method. Either it maps to existing tools or you build a new component that interfaces with the existing tools. Many developers historically do not design with implementation in mind, including monitoring, but at Microsoft.com it is a way of life.

On the CD

A sample project checklist in outline format used by Microsoft.com is in the file MSCOM_Project_Checklist.doc. The checklist includes all phases of the lifecycle and the development, deployment, and testing efforts as well as management involvement through project or program managers.

How Microsoft.com Achieves High Availability in Production

There are two basic rules that allow Microsoft.com to achieve high availability in their production environment:

Once the application is deployed and monitoring is in place, leave it alone as much as possible.
Once the application is in production, it is all about process.

These are two very simple tenets, but they are very hard for most companies to achieve. Achieving scalable, reliable, available systems is a combination of good people, good processes, and good architecture. Once an application makes it into production, it is no longer subject to change on a frequent basis (due to schema changes, new code updates, new versions every month, and so on). The goal is stability. To achieve stability, Microsoft.com restricts access to the servers. If you do not need to physically log into the server, you are not permitted to do so. Monitoring and other administration tools are there for a good reason.

Microsoft.com's Barriers to Availability

Microsoft.com deals with the classic barriers that everyone needs to contend with: network configuration, hardware configuration, software configuration, and service availability. Running SQL Server adds three main considerations to the mix:

Determining the availability of the database to the application. If the database is down, how does the application respond?
General database design issues that could lead to availability issues.
Challenges presented by highly utilized OLTP systems. You want to ensure that they are not a single point of failure, so redundancy is built in. But how do you synchronize standby servers to ensure no loss of transactional consistency?
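The first consideration, how the application responds when the database is down, is often handled by retrying briefly and then degrading gracefully instead of failing the whole request. The sketch below illustrates that general pattern only; it is not taken from Microsoft.com's applications. The names `DatabaseUnavailable` and `query_with_fallback`, the retry count, and the delays are all assumptions made for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class DatabaseUnavailable(Exception):
    """Raised by the (hypothetical) data-access layer when the database is down."""


def query_with_fallback(
    run_query: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try the database a few times, then serve a degraded response."""
    for attempt in range(attempts):
        try:
            return run_query()
        except DatabaseUnavailable:
            # Back off before retrying, but do not wait after the final attempt.
            if attempt < attempts - 1:
                sleep(delay_seconds * (2 ** attempt))
    return fallback()


# Usage: a query that always fails falls back to cached content.
def broken_query() -> str:
    raise DatabaseUnavailable("primary is offline")


result = query_with_fallback(broken_query, lambda: "cached page",
                             sleep=lambda seconds: None)
print(result)  # cached page
```

Whether a cached or read-only response is acceptable depends on the application; for an OLTP system that must not lose transactions, the fallback might instead queue the request or return a clear error.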

In terms of what would be considered normal problems related to SQL Server, the top three issues that Microsoft.com encounters, listed here, will probably sound familiar to most of you.

Hotfixes and service packs. Hotfixes and service packs are an obvious availability concern. When you cycle a server, you cause an availability outage. Microsoft.com is lucky enough to have a dedicated test lab to determine what each hotfix or service pack will do to the behavior of their applications. The advantage of being Microsoft.com is that they have a unique relationship with the SQL Server development team. As soon as a hotfix or service pack is in development, the Web group works with the SQL Server development team and has access to new builds. Although this might seem like an advantage (and it is), it is only marginal. They still need to roll out the fixes and patches like everyone else, taking into account their availability.
Service account password changing. Any time a service account is changed, you might need to stop and start your SQL Servers (including clustered servers).
Ensuring code that will be deployed is written properly. Developers, by nature, are not usually production DBAs. They do not always understand that what they write has a direct impact on everything: disk I/O, memory, locking and blocking, and so on. Microsoft.com does not have any dedicated developers on their operations staff, but for Microsoft.com as a whole, there is another dedicated team of more than 60 developers to augment the operations group. Because of thorough reviews throughout the project lifecycle, DBAs and other people responsible for the operational side can ensure that the code they are ultimately going to roll out in production meets their standards. It is not meant to be an adversarial relationship; if someone, for example, uses recursive cursors, it is not enough to tell a developer "Do not do that." The DBAs at Microsoft.com document the reasons why it is not optimal and communicate that to the developer. In the end, it is about synergy.

Outside of the usual realm of these problems, Microsoft.com has a problem not everyone faces: because they represent Microsoft and Microsoft.com is on the Internet 24 hours a day, they are frequently the target of attacks. Remember, one of the most important aspects of availability for a public system is security. Microsoft.com has a dedicated person for SQL Server security as well as a counterpart for Internet Information Services. Although this is not possible in every environment, if security is important to you, you should assess the need to dedicate a resource to this important task. One of the reasons Microsoft.com is so successful is that their monitoring group knows what to look for and can detect problems.

Microsoft.com is a good example of how you can achieve high availability purely with Microsoft-based technologies.