The dependence of sites on networks and uninterrupted services puts considerable pressure on operations staff to ensure ongoing service availability, health, and performance. A well-designed management system is critical to the successful management and operation of large business sites. Great deployment tools allow smooth and rapid growth of Web sites. Great monitoring and troubleshooting tools allow operations staff to quickly deal with problems of components and services before business is affected. The management system itself must be highly available to ensure continuous operations.
Many large sites are geographically remote from operations staff, and hosted in a managed data center close to high-capacity Internet bandwidth. To reduce costs associated with staffing and travel to remote locations, the management network must offer remote capabilities to deploy, provision, monitor, and troubleshoot geographically dispersed sites.
Managing and operating site systems is a complex and challenging task. Operations staff face significant challenges in deploying, fine-tuning, and operating Web site systems. Microsoft, as well as many third-party vendors, offers a wide range of products for managing and administering Windows NT systems. In addition, the suite of Microsoft development tools allows operations staff to customize the management system to better operate a site.
Management of a site should incorporate both system and network management. Microsoft Systems Management Server (SMS) can be used for system management tasks such as planning, deployment, and change and configuration management. A suite of Microsoft tools and services, such as Performance Monitor, SNMP Services, WMI, event logging, and backup tools, is used for event management, performance management, and storage management (operations management).
Large sites often outsource network management to the Web hosting companies that provide the network infrastructure, services, and facilities (for example, troubleshooting and fixing minor problems as well as around-the-clock monitoring of servers, backbone paths, routers, power systems, or anything else that could affect the delivery of a site's content anywhere in the world). This section concentrates on system and operation management aspects, leaving network management to the Web-hosting providers, and explains how reliable and powerful management systems can be built using Microsoft products and technologies.
The key points addressed in this section are:
Management Network
Management and operations can either share the back-end network or exist on a separate LAN. Management using the back-end network (sometimes called in-band management) is less costly and easier to operate. However, it may be unsuitable for managing around-the-clock services for the following reasons:
Therefore, large Web site developers should build a separate management network for scalability, availability, and security.
Management System Components
Normally a management system consists of management consoles, management servers, and management agents.
Figure A.13 is a simple diagram illustrating the core management system components and communication between them.
Figure A.13 Management system components
Management consoles
Management systems interface with users through management consoles. Management consoles are responsible for:
Many current solutions implement management consoles and management servers in a single tier to save cost and simplify use, although the two are best thought of as separate logical tiers. It is sometimes desirable or necessary to decouple them for availability, for scalability, or because operations centers are remote from the managed networks.
Management servers
Management servers are the workhorses of the management system. Management servers communicate with managed nodes (servers running Windows NT, Cisco routers, and other network equipment) through proprietary or standard protocols. Management servers are responsible for:
Because management servers collect a lot of information (up to gigabytes of data a day), this information is often stored on separate machines, the back-end servers.
Management agents
Management agents are programs that reside within a managed node. In order to be managed, each device—whether a server running Windows NT or a simple network hub—must have a management agent. Management agents perform primary management functions such as:
Some nodes may have only one agent, such as an SNMP-managed network router. Others, such as Windows NT Server, are more complex and include multiple agents using different protocols. Agents and servers communicate using standard and proprietary protocols.
Scaling Management Infrastructure
To preserve initial investment, a management system must be able to start small and grow along with the site it manages. As a site expands and new equipment and services are added, the management system has to scale adequately.
A small Web site can be managed with a very simple management system that typically uses the back-end network. The simplest management system is a centralized system: a small number of machines installed with management server and console software, each capable of managing the entire site. We describe the centralized management system first. Because scaling such a system requires distributing it, we then describe the steps necessary to distribute the management system. Finally, we describe an example distributed management system.
Centralized management system
A management system can be centralized or distributed. A centralized management system is characterized by a single central managing entity that controls all management systems. Centralized management is implemented with one or more powerful machines that allow access to all components of the site system, monitor all devices, and accept alarms and notifications from all managed elements. Central management is often done using the main service network.
Because of its simplicity, low cost, and easy administration, a centralized management system may be desirable in small environments such as a start-up site with just a few servers. Microsoft offers a rich set of tools and applications for centralized management such as SMS, PerfMon, Event Log, Robocopy, and scripting tools. Other applications and tools are available from third-party vendors.
Distributed management systems
With rapid growth of a Web site, a centralized management system may prove inefficient. A centralized management system, concentrated on one or two machines, has significant problems: It lacks scalability, creates performance bottlenecks, and has a single point of failure. These issues make centralized management systems unsuitable for managing very large, rapidly expanding, and highly available sites. To address scalability and availability problems, management systems should be distributed in the following ways:
Example management system
Our example site uses a distributed management system implemented on a separate LAN.
Figure A.14 depicts the management system used in our example site. Because the focus of this chapter is on the management system, we show the managed system—the site itself—as a cloud. Refer to Figure A.3 for details on example site architecture.
In this example management system, different line styles, thickness, and annotations show the management LAN, remote access, and applications installed on the management system components. In particular:
Figure A.14 Example management system
Management consoles
In this example management network, management consoles are decoupled from management servers and thus can be concentrated in the (highly secured) Network Operation Center(s). Management tools and applications must be carefully chosen to provide almost all management capabilities remotely.
Management consoles can run Windows NT Server, Windows NT Workstation, or Windows 2000 Professional. They would normally have a number of applications installed: Systems Management Server Administrator Console, Terminal Server (TS) Client, Telnet, Internet Explorer, and SNMP MIB browsers. These tools all provide remote management capabilities and therefore can be used in roaming environments by traveling technicians.
Management servers
In a distributed system, each management server serves only managed nodes in its jurisdiction—such as a farm or partition, a floor, a building, a campus, or a city. For example, a local management server can be running in each of the offices in Europe and North America, managing events and networks locally. Distributing management servers and partitioning them to manage only a limited number of nodes allows one to:
A management server normally does not interact with managed nodes in other areas, although nothing precludes it from doing so.
We recommend that management servers run either Windows NT Server or Windows 2000 Server editions to ensure better stability of the system and to provide additional services that are available only in server editions. Management servers host the management applications that provide the system and network management capabilities required for a site. Services and applications provided by Microsoft (Performance Monitor, Systems Management Server (SMS), and Event Logging) should be installed on the management servers. SNMP trap managers or trap receivers should also be installed on the management servers.
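The partitioning described above can be sketched in a few lines of Python. This is a minimal illustration only: the node names, jurisdiction labels, and management server names are hypothetical, and a real site would draw them from its own inventory.

```python
from collections import defaultdict

# Hypothetical inventory: managed node -> jurisdiction (farm, building, city, ...).
NODES = {
    "web01": "farm-a",
    "web02": "farm-a",
    "sql01": "farm-b",
    "router1": "farm-b",
}

# Hypothetical mapping: jurisdiction -> management server responsible for it.
SERVERS = {"farm-a": "mgmt-a", "farm-b": "mgmt-b"}


def partition_nodes(nodes, servers):
    """Group managed nodes under the management server of their jurisdiction."""
    assignment = defaultdict(list)
    for node, jurisdiction in sorted(nodes.items()):
        assignment[servers[jurisdiction]].append(node)
    return dict(assignment)
```

Calling partition_nodes(NODES, SERVERS) returns one node list per management server, so each server polls only the nodes in its own jurisdiction.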
Back-end servers (BES)
Back-end servers are machines with large storage disks that are used for persistent storage of information collected by management servers. It is not necessary to use separate machines to store the management data. However, most large business Web sites log gigabytes of data each day (for later data mining and exploitation) and use separate machines to store this information. Large databases are often used to store events logged by the managed nodes, performance counters, and statistical data. SMS databases can be located on the BESs as well. Back-end servers can also host utilities and tools that manipulate the data stored in the databases: harvesters, parsers, etc. This allows many customers to use their own highly customized or legacy tools.
Distributed vs. Centralized
Distributed management systems have several key advantages over a centralized management system. They offer better scalability and availability, and reduce or eliminate performance bottlenecks and a single point of failure. However, distributed management systems introduce some deficiencies, such as higher costs (associated with adding more equipment and administration) and growing complexity. When designing a management system for a site, carefully weigh the pros and cons of taking a centralized or distributed management approach.
Deployment and Installation
To successfully deploy new services and equipment, the management system must provide tools for deployment and installation. Deployment includes installing and configuring new equipment and replicating Web site content and data on new machines. The following tools and techniques are most often used to deploy new services and machines.
Unattended/automated server installation
To deploy new servers, use scripts to build a golden (or ideal) version of the server. Then, capture an image copy of the golden server's system disk using a tool such as Norton Ghost and Ghost Walker (http://www.ghost.com/), and use that golden image to build new servers.
SysPrep (Windows 2000)
SysPrep is a tool (available in the Windows 2000 Resource Kit) designed to deploy fully installed Windows 2000 installations on multiple machines. After performing the initial setup steps on a single system, administrators can run SysPrep to prepare the sample machine for duplication. Web servers of a site farm are normally based on the same image with minor configuration differences such as name and IP addresses. The combination of SysPrep and a winnt.sif answer file provides the means of making the minor configuration changes necessary for each machine.
Content replication
Content Replication Service and Robocopy are most often used for content replication. Content Replication Service is part of the Microsoft Site Server product line (http://www.microsoft.com/siteserver/site/). Robocopy is a 32-bit Windows command-line application that simplifies the task of maintaining an identical copy of a folder tree in multiple locations. Robocopy is available in the Windows NT Resource Kit and the Windows 2000 Resource Kit.
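To make the idea of folder-tree replication concrete, the following Python sketch copies new or changed files from a source tree to a target tree. It is an illustration of the general technique, not a description of how Robocopy or Content Replication Service work internally; the function name mirror_tree is hypothetical, and unlike a full mirror this sketch does not delete files that exist only in the target.

```python
import filecmp
import shutil
from pathlib import Path


def mirror_tree(source: Path, target: Path) -> list:
    """Copy new or changed files from source to target.

    Returns the relative paths (POSIX style) of the files that were copied.
    Files present only in the target are left alone in this sketch.
    """
    copied = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source)
        dst = target / rel
        # Skip files whose content is already identical in the target.
        if dst.exists() and filecmp.cmp(src, dst, shallow=False):
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves timestamps
        copied.append(rel.as_posix())
    return copied
```

Running the function a second time without changing the source copies nothing, which is the property that makes repeated scheduled replication cheap.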
Change and Configuration
Systems Management Server (SMS) provides the means necessary for change and configuration management of site servers. SMS automates many change and configuration management tasks, such as hardware and software inventory, product compliance, software distribution and installation, and software metering.
More information on Microsoft Systems Management Server is located at http://www.microsoft.com/smsmgmt/. Other tools available from third-party vendors are listed at http://www.microsoft.com/ntserver/management/exec/vendor/ThrdPrty.asp.
Performance Monitoring
Continuous monitoring is essential for operating a site's 24x7 services. Many sites use extensive logging and counter-based monitoring, along with as much remote administration as possible, both to ensure continuous availability and to provide the data with which to improve their infrastructure. Tools used to monitor performance of site servers include Performance Monitor, SNMP MIB browsers, and HTTPMon.
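Counter-based monitoring usually comes down to comparing sampled values against limits. The Python sketch below shows that comparison in its simplest form. The counter names and threshold values are hypothetical, and the samples are assumed to have been collected already by a tool such as Performance Monitor.

```python
from statistics import mean

# Hypothetical thresholds; real limits depend on the site and its hardware.
THRESHOLDS = {"cpu_percent": 85.0, "disk_queue_length": 2.0}


def find_breaches(samples, thresholds):
    """Return {counter: average} for counters whose average exceeds its limit.

    `samples` maps a counter name to a list of sampled values.
    Counters without a configured threshold or without samples are ignored.
    """
    breaches = {}
    for counter, values in samples.items():
        limit = thresholds.get(counter)
        if limit is None or not values:
            continue
        average = mean(values)
        if average > limit:
            breaches[counter] = average
    return breaches
```

Averaging over a window of samples, rather than alerting on a single reading, is one simple way to avoid raising alarms on short spikes.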
Event Management
Event management entails monitoring the health and status of site systems (usually in real time), alerting administrators to problems, and consolidating the event logs in a single place for ease of administration. The event monitoring tools may track individual servers or network components, or they may focus on application services such as e-mail, transaction processing, or Web service. Event filtering, alerting, and visualization tools are an absolute necessity for sites with hundreds of machines, where important events must be separated from background noise. Tools such as Event Log, SNMP agents, and SMS (for event-to-SNMP-trap conversion) can be used for event management.
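The consolidation and filtering steps can be illustrated with a short Python sketch. The Event record, the three severity names, and the consolidate function are assumptions made for this example; they do not reproduce the Windows NT event log format or any product API.

```python
from dataclasses import dataclass

# Assumed severity levels for this sketch, lowest to highest.
SEVERITY_ORDER = {"information": 0, "warning": 1, "error": 2}


@dataclass(frozen=True)
class Event:
    node: str
    source: str
    severity: str
    message: str


def consolidate(per_node_logs, minimum="warning"):
    """Merge per-node event lists into one list.

    Keeps only events at or above `minimum` severity and returns them
    most severe first, then ordered by node name.
    """
    floor = SEVERITY_ORDER[minimum]
    merged = [
        event
        for log in per_node_logs.values()
        for event in log
        if SEVERITY_ORDER[event.severity] >= floor
    ]
    return sorted(merged, key=lambda e: (-SEVERITY_ORDER[e.severity], e.node))
```

With the default minimum of "warning", informational events are dropped, which is the background noise an operator at a large site typically does not want to see.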
Out-of-Band Emergency Recovery
Repairing failed nodes when the management network itself is down presents a difficult manageability problem. When in-band intervention is impossible, out-of-band (OOB) management comes to the rescue.
OOB management refers to products that give technicians access to managed nodes over dial-up telephone lines or a serial cable rather than over the management network. Therefore, a serial port must be available on every managed node for out-of-band access. Use OOB management to bring a failed service or node back online so that it can be repaired in-band, to analyze the reasons for failure, and so on.
OOB requirements
OOB management should provide all or some of the following capabilities:
Operating System and Service Control
BIOS and Boot Device Control
OOB solutions
Many solutions are available for performing the tasks just described. Table A.2 summarizes the most widely used solutions from Microsoft and third-party vendors.
Table A.2 Out-of-Band Solutions
Capability | Name | Vendor |
---|---|---|
Terminal Server | Available with Windows NT 4.0 Terminal Server Edition or Windows 2000 TermServ | Microsoft Seattle Labs |
Setup and installation | Unattended OS and Post OS shell scripts; Ghost; IC3; Remote Installation Service (Windows 2000 only) | Microsoft; Norton; ImageCast; Microsoft |
BIOS configuration and hardware diagnostics | Integrated Remote Console (IRC); Remote Insight Board (RIB); Emerge Remote Server Access | Compaq; Compaq; Apex |
Hardware power management | Integrated Remote Console (IRC); Remote Insight Board (RIB); Remote Power Control | Compaq; Compaq; Baytech |
OOB security
Dialing into the console port exposes the network to unauthorized access. Prevent this by securing OOB operations. At a minimum, strong authentication of administrative staff should be required, usually with one-time (challenge-response) passwords provided by security tokens. Administrators are provided with either hardware-based or software-based tokens, which negotiate with an access server at the target site. This opens a connection to a terminal server, which in turn provides serial port access to a specific host. Ideally, employ link encryption as well to prevent snooping or possible compromise by an intruder. An increasingly popular solution is to use public-key based VPNs (Virtual Private Networks) to provide both strong authentication and encryption.
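The challenge-response idea behind one-time passwords can be sketched as follows. This is a generic illustration that assumes a secret shared between the token and the access server and uses HMAC-SHA256 as the response function; it does not describe the protocol of any particular token product, and the function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def new_challenge() -> str:
    """Access server side: issue a fresh random challenge for each login."""
    return secrets.token_hex(16)


def token_response(shared_secret: bytes, challenge: str) -> str:
    """Token side: derive the one-time response from the secret and challenge."""
    return hmac.new(shared_secret, challenge.encode("ascii"), hashlib.sha256).hexdigest()


def verify(shared_secret: bytes, challenge: str, response: str) -> bool:
    """Access server side: recompute the expected response and compare.

    compare_digest performs a constant-time comparison.
    """
    expected = token_response(shared_secret, challenge)
    return hmac.compare_digest(expected, response)
```

Because each login uses a new challenge, a response captured on the dial-up line cannot simply be replayed later. A production access server would also expire unused challenges and limit retries, which this sketch omits.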
Automation of Management Tasks
Management system design should allow implementation of automated actions such as stopping or starting a service or entire node, running a script or a batch file when certain events occur, or attempting an out-of-band recovery if the management network is unavailable. Well-designed systems will automatically notify IT technicians of events or problems using e-mail, telephone, pagers, or cell phones.
Many tools and applications can automate management tasks:
Many third-party solutions that allow automation are also available, as listed on: http://www.microsoft.com/ntserver/management/exec/vendor/ThrdPrty.asp.
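The automated actions described above amount to a table of rules that maps event types to actions. The Python sketch below shows one way to express such a table. The event types, the action functions, and the W3SVC service name in the usage note are hypothetical, and the actions only return a description of what they would do rather than touching a real system.

```python
def restart_service(event):
    """Hypothetical action: describe a service restart on the affected node."""
    return f"restart {event['service']} on {event['node']}"


def page_operator(event):
    """Hypothetical action: describe a page sent to the on-call technician."""
    return f"page on-call: {event['node']} {event['type']}"


# Assumed rule table: event type -> ordered list of actions to run.
RULES = {
    "service_stopped": [restart_service, page_operator],
    "node_unreachable": [page_operator],
}


def handle_event(event, rules=RULES):
    """Run every action registered for the event type; unknown types do nothing."""
    return [action(event) for action in rules.get(event["type"], [])]
```

For example, handle_event({"type": "service_stopped", "node": "web01", "service": "W3SVC"}) first attempts the restart and then notifies a technician. Keeping the rules in a table means operations staff can add or reorder actions without changing the dispatch logic.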
Security
Security of the management infrastructure is of paramount importance, because compromise of this subsystem can lead to compromise of every other component of a site. All of the elements of security discussed in the preceding section on security architecture apply here.
SNMP, one of the most widely used management protocols, is weak from a security standpoint. The SNMP community string is a very weak password: although it does not permit a user to log on, it does permit someone to take control of a node. Carefully choose and tightly control management protocols for each site.