Microsoft.com is one of the world's most visited Web sites. Because it is the company's portal to the world, Microsoft maintains multiple versions of the Web site in different languages, all of which are available 24 hours a day, 7 days a week. This is the goal for most companies, but how does Microsoft achieve it?
The operations side of Microsoft.com employs many of the principles and best practices in this book, with obvious tweaks for their own environment. To give you a better understanding of the topology that Microsoft.com employs, as of the writing of this book they have 930 servers (of all types: SQL Server, domain controllers, IIS servers, and more), and might have as many as 1500 in the
With an enterprise of this magnitude, you imagine that a small army deploys, maintains, administers, and
Each component hosted on Microsoft.com, including all of the databases, has ownership with responsibilities attached to it. DBAs, as advocated in this book, are involved early on in the process, including system design, capacity planning for hardware purchasing, database design, code reviews, and availability architecture. If something goes wrong, they are called directly. There is no middle escalation point. It is therefore in the best interest of the database or application owner to get it right the first time, every time, because if there is a problem, the owner is called directly, no matter what time of day or night. The person responsible for each section must write a troubleshooting guide that is provided to the 24/7 monitoring staff. The goal is to design systems that are reliable, available, scalable, and easily
Microsoft.com stresses that no matter how good your monitoring system, you must monitor the monitor. How do you know the monitor is available? If it goes down, how do you recover quickly so that you are up and running to respond to other alerts? Microsoft.com uses an in-house, custom monitoring tool that consolidates everything under one interface.
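The idea of monitoring the monitor can be illustrated with a heartbeat check. The following is a minimal sketch, not a description of the in-house Microsoft.com tool: the class name `HeartbeatWatchdog`, its methods, and the 60-second threshold are all hypothetical. It assumes the monitoring system periodically reports a heartbeat to an independent watchdog, which treats prolonged silence as a sign that the monitor itself is down.

```python
import time
from typing import Optional


class HeartbeatWatchdog:
    """Hypothetical watchdog that tracks heartbeats sent by a monitoring system."""

    def __init__(self, max_silence_seconds: float) -> None:
        # Longest gap between heartbeats before the monitor is presumed down.
        self.max_silence_seconds = max_silence_seconds
        self.last_beat: Optional[float] = None

    def beat(self, now: Optional[float] = None) -> None:
        """Record a heartbeat; `now` can be supplied to make testing deterministic."""
        self.last_beat = time.time() if now is None else now

    def is_monitor_alive(self, now: Optional[float] = None) -> bool:
        """Return True only if a heartbeat arrived within the allowed window."""
        if self.last_beat is None:
            # No heartbeat has ever been seen, so the monitor is not known to be up.
            return False
        current = time.time() if now is None else now
        return (current - self.last_beat) <= self.max_silence_seconds


if __name__ == "__main__":
    watchdog = HeartbeatWatchdog(max_silence_seconds=60)  # assumed threshold
    watchdog.beat()
    print("monitor alive:", watchdog.is_monitor_alive())
```

In practice the watchdog would run on separate hardware from the monitor, so that a single failure cannot take down both.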
Microsoft.com also uses prerelease and beta versions of Microsoft software in production. This is a proving ground for many Microsoft technologies because it is a high-volume Web site that needs to be scalable and available. This is one of the keys to success: although you are supporting current and past platforms, you also need to look forward. The experiments might not always work out as you had hoped, but the lessons learned can always be applied elsewhere. For Microsoft.com,
In terms of technology, Microsoft.com uses failover clustering for some systems because clustering offers failover to prevent downtime when performing some routine maintenance. They
For the workers at Microsoft.com, availability goals are a point of personal
One of the most important policies implemented at Microsoft.com is that each project has not only its
Another important aspect of the availability of Microsoft.com is the development of troubleshooting guides for the monitoring staff to use prior to production rollout. These troubleshooting guides must be well written, because if the problem cannot be
On the CD
A sample project checklist in outline format used by Microsoft.com is in the file MSCOM_Project_Checklist.doc. The checklist includes all phases of the lifecycle and the development, deployment, and testing efforts as well as management involvement through project or program managers.
There are two basic rules that allow Microsoft.com to achieve high availability in their production environment:
Once the application is deployed and monitoring is in place, leave it alone as much as possible.
Once the application is in production, it is all about process.
These are two very simple tenets, but they are very hard for most companies to achieve. Achieving scalable, reliable, available systems is a combination of good people, good processes, and good architecture. Once an application makes it into production, it is no longer subject to frequent change (schema changes, new code updates, new versions every month, and so on). The goal is stability. To achieve stability, Microsoft.com restricts access to the servers. If you do not need to physically log in to the server, you are not permitted to do so. Monitoring and other administration tools are there for a good reason.
Microsoft.com deals with the classic barriers that everyone needs to contend with: network configuration, hardware configuration, software configuration, and service availability. Running SQL Server adds three main considerations to the mix:
Determining the availability of the database to the application. If the database is down, how does the application respond?
General database design issues that could lead to availability issues.
Challenges presented by highly utilized OLTP systems. You want to ensure that they are not a single point of failure, so redundancy is built in. But how do you synchronize standby servers to ensure no loss of transactional consistency?
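The first consideration above, how the application responds when the database is down, can be sketched as a retry followed by a degraded response. This is an illustrative sketch under stated assumptions, not the approach Microsoft.com uses: the function `call_with_fallback`, its parameters, and the use of `ConnectionError` to signal an unavailable database are all hypothetical, and the caller supplies both the query and the fallback (for example, serving cached content).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(
    query: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `query` up to `attempts` times, then return `fallback()` instead.

    ConnectionError stands in for whatever exception the real database
    driver raises when the server is unreachable (an assumption).
    """
    for attempt in range(attempts):
        try:
            return query()
        except ConnectionError:
            # Wait with exponential backoff before every retry except the last.
            if attempt < attempts - 1:
                sleep(delay_seconds * (2 ** attempt))
    # Every attempt failed: degrade gracefully rather than surface an error.
    return fallback()


if __name__ == "__main__":
    def unavailable_database() -> str:
        raise ConnectionError("database is down")

    print(call_with_fallback(unavailable_database, lambda: "cached page",
                             delay_seconds=0.01))
```

The backoff keeps a struggling server from being flooded with retries, and the fallback lets the application stay up even when the database is not.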
In terms of what would be
Hotfixes and Service Packs
Hotfixes and service packs are an obvious availability concern. When you cycle a server, you cause an availability
Service account password changing
Any time a service account password is changed, you might need to stop and start your SQL Servers (including clustered servers).
Ensuring code that will be deployed is written properly
Developers, by nature, are not usually production DBAs. They do not always understand that what they write has a direct impact on everything: disk I/O, memory, locking and blocking, and so on. Microsoft.com does not have any dedicated developers on their operations staff, but for Microsoft.com as a whole, there is another dedicated team of more than 60 developers to augment the operations group. Because of thorough reviews throughout the project lifecycle, DBAs and other people responsible for the operational side can ensure that the code they are ultimately going to roll out in production meets their standards. It is not
Outside of the usual realm of these problems, Microsoft.com has a problem not everyone faces: because they represent Microsoft and Microsoft.com is on the Internet 24 hours a day, they are frequently subject to attacks and hackers. Remember, one of the most important aspects of availability for a public system is security. Microsoft.com has a dedicated person for SQL Server security as well as a counterpart for Internet Information Services. Although this is not possible in every environment, if security is important to you, you should assess the need to
Microsoft.com is a good example of how you can achieve high availability purely with Microsoft-based technologies.