Management and Operations

Introduction

The dependence of sites on networks and uninterrupted services puts considerable pressure on operations staff to ensure ongoing service availability, health, and performance. A well-designed management system is critical to the successful management and operation of large business sites. Great deployment tools allow smooth and rapid growth of Web sites. Great monitoring and troubleshooting tools allow operations staff to quickly deal with problems of components and services before business is affected. The management system itself must be highly available to ensure continuous operations.

Many large sites are geographically remote from operations staff, and hosted in a managed data center close to high-capacity Internet bandwidth. To reduce costs associated with staffing and travel to remote locations, the management network must offer remote capabilities to deploy, provision, monitor, and troubleshoot geographically dispersed sites.

Managing and operating site systems is a complex and challenging task. The operations staff faces significant challenges in deploying, fine-tuning, and operating Web site systems. Microsoft, as well as many third-party vendors, offers a wide range of products for managing and administering Windows NT systems. In addition, the suite of Microsoft development tools allows operations staff to customize the management system to better operate a site.

Management of a site should incorporate both system and network management. Microsoft Systems Management Server (SMS) can be used for system management tasks such as planning, deployment, and change and configuration management. A suite of Microsoft tools and services, such as Performance Monitor, SNMP Services, WMI, event logging, and backup tools, is used for operations management: event management, performance management, and storage management.

Large sites often outsource network management to the Web hosting companies that provide the network infrastructure, services, and facilities (for example, troubleshooting and fixing minor problems as well as around-the-clock monitoring of servers, backbone paths, routers, power systems, or anything else that could affect the delivery of a site's content anywhere in the world). This section concentrates on system and operation management aspects, leaving network management to the Web-hosting providers, and explains how reliable and powerful management systems can be built using Microsoft products and technologies.

The key points addressed in this section are:

  • Separating the management network from service networks for high availability and increased security.
  • Distributing management network components to:
    • Eliminate or reduce performance bottlenecks
    • Eliminate single point of failure
    • Allow independent scaling
    • Increase availability of the management system
  • Employing Microsoft tools and products where possible to achieve greater performance due to tight integration with the underlying platform.
  • Automating tasks where possible.
  • Monitoring everything to improve infrastructure and identify problems before they occur.

Management Infrastructure

Management Network

Management and operations can either share the back-end network or exist on a separate LAN. Management using the back-end network (sometimes called in-band management) is less costly and easier to operate. However, it may be unsuitable for managing around-the-clock services for the following reasons:

  • In-band management hampers performance of the service network. Management-related notifications such as SNMP traps can flood the network, creating and/or amplifying performance bottlenecks.
  • Failure notification is impossible when the service network is down.
  • Management traffic is exposed to the same security threats as the service network it shares.

Therefore, large Web site developers should build a separate management network for scalability, availability, and security.

Management System Components

Normally a management system consists of management consoles, management servers, and management agents.

Figure A.13 is a simple diagram illustrating the core management system components and communication between them.

Figure A.13 Management system components

Management consoles

Management systems interface with users through management consoles. Management consoles are responsible for:

  • Logging in and authenticating users (network operators, administrators).
  • Providing access to all management servers: once a management server is accessed, the user can view the status of all managed nodes in that server's jurisdiction and issue commands for software and configuration updates on those nodes.
  • Providing response to the user-issued commands.

Many current solutions implement management consoles and management servers, which should be thought of as two logical tiers, in a single tier for reasons of cost savings and ease of use. However, it is sometimes desirable or necessary to decouple the two for availability, scalability, and remoteness of operations centers from managed networks.

Management servers

Management servers are the workhorses of the management system. Management servers communicate with managed nodes (servers running Windows NT, Cisco routers, and other network equipment) through proprietary or standard protocols. Management servers are responsible for:

  • Accepting, filtering, and correlating events from managed nodes in their jurisdiction.
  • Gathering, storing, and analyzing performance information.
  • Distributing and installing software on the managed nodes.
  • Updating configuration parameters on managed nodes.

Because management servers collect a lot of information (up to gigabytes of data a day), this information is often stored on separate machines, the back-end servers.
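
The filtering and correlation duties listed above can be illustrated with a minimal Python sketch. It is not taken from any Microsoft product: the Event fields, the numeric severity scale, and the 60-second correlation window are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    node: str         # managed node that raised the event
    source: str       # component or service on that node
    severity: int     # hypothetical scale: 0 = info, 1 = warning, 2 = error
    timestamp: float  # seconds since an arbitrary starting point


def filter_events(events, min_severity=1):
    """Keep only events at or above the given severity."""
    return [e for e in events if e.severity >= min_severity]


def correlate(events, window=60.0):
    """Group events that share a node and source and occur within
    `window` seconds of the first event in their group."""
    groups = []
    open_groups = {}  # (node, source) -> most recent group for that key
    for event in sorted(events, key=lambda e: e.timestamp):
        key = (event.node, event.source)
        group = open_groups.get(key)
        if group is not None and event.timestamp - group[0].timestamp <= window:
            group.append(event)
        else:
            group = [event]
            open_groups[key] = group
            groups.append(group)
    return groups
```

A management server built along these lines would report one correlated group per incident to the console instead of forwarding every raw event.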

Management agents

Management agents are programs that reside within a managed node. In order to be managed, each device—whether a server running Windows NT or a simple network hub—must have a management agent. Management agents perform primary management functions such as:

  • Monitoring the resources of the managed device and receiving unsolicited notifications and events from those resources.
  • Providing the means of configuring and fine-tuning the resources of the managed device.
  • Querying on demand the resources of the managed device for their current configuration, state, and performance data.

Some nodes may have only one agent, such as an SNMP-managed network router. Others, such as Windows NT Server, are more complex and include multiple agents using different protocols. Agents and servers communicate using standard and proprietary protocols.
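
The three agent functions described above (receiving unsolicited notifications, configuring resources, and answering queries) can be sketched as a small Python class. The class name, method names, and resource layout are hypothetical; a real agent would speak SNMP, WMI, or a proprietary protocol rather than expose Python methods.

```python
class ManagementAgent:
    """Toy in-process agent for one managed node (illustrative only)."""

    def __init__(self, node_name, resources):
        self.node_name = node_name
        # Copy each resource's settings so callers cannot mutate them directly.
        self._resources = {name: dict(state) for name, state in resources.items()}
        self._pending = []  # unsolicited notifications not yet collected

    def notify(self, resource, message):
        """Called by a resource to report an unsolicited event."""
        self._pending.append((resource, message))

    def drain_events(self):
        """Hand all pending notifications to the management server."""
        events, self._pending = self._pending, []
        return events

    def configure(self, resource, **settings):
        """Update configuration parameters of a managed resource."""
        self._resources[resource].update(settings)

    def query(self, resource):
        """Return a snapshot of the resource's configuration and state."""
        return dict(self._resources[resource])
```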

Scaling Management Infrastructure

To preserve initial investment, a management system must be able to start small and grow along with the site it manages. As a site expands and new equipment and services are added, the management system has to scale adequately.

A small Web site can be managed with a very simple management system that typically uses the back-end network. The simplest management system is a centralized system: a small number of machines installed with management server and console software, each capable of managing the entire site. To scale such a management system, the developers must distribute it. The following sections describe the centralized management system, the steps necessary to distribute it, and an example distributed management system.

Centralized management system

A management system can be centralized or distributed. A centralized management system is characterized by a single central managing entity that controls all management systems. It is implemented with one or more powerful machines that allow access to all components of the site system, monitor all devices, and accept alarms and notifications from all managed elements. Centralized management is often done over the main service network.

Because of its simplicity, low cost, and easy administration, a centralized management system may be desirable in small environments such as a start-up site with just a few servers. Microsoft offers a rich set of tools and applications for centralized management such as SMS, PerfMon, Event Log, Robocopy, and scripting tools. Other applications and tools are available from third-party vendors.

Distributed management systems

With rapid growth of a Web site, a centralized management system may prove inefficient. A centralized management system, concentrated on one or two machines, has significant problems: It lacks scalability, creates performance bottlenecks, and has a single point of failure. These issues make centralized management systems unsuitable for managing very large, rapidly expanding, and highly available sites. To address scalability and availability problems, management systems should be distributed in the following ways:

  • Decouple management consoles from management servers.
  • Add more servers so that each manages smaller numbers of nodes.
  • Add more consoles to allow access to more administrators and technicians.
  • Partition workload between management servers geographically or by management functionality.
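
As one possible illustration of geographic partitioning, the following Python sketch assigns each managed node to a management server in its own region, spreading nodes round-robin when a region has several servers. The function name, region labels, and server names are hypothetical.

```python
from collections import defaultdict


def partition_by_region(nodes, servers_by_region):
    """Assign each (node_name, region) pair to a management server in the
    same region, rotating through that region's servers in order."""
    assignment = defaultdict(list)
    next_index = defaultdict(int)  # per-region round-robin position
    for node_name, region in nodes:
        servers = servers_by_region[region]
        server = servers[next_index[region] % len(servers)]
        next_index[region] += 1
        assignment[server].append(node_name)
    return dict(assignment)
```

Adding a management server to a region then only requires extending that region's list; nodes in other regions keep their existing assignments.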

Example management system

Our example site uses a distributed management system implemented on a separate LAN.

Figure A.14 depicts the management system used in our example site. Because the focus of this chapter is on the management system, we show the managed system—the site itself—as a cloud. Refer to Figure A.3 for details on example site architecture.

In this example management system, different line styles, thickness, and annotations show the management LAN, remote access, and applications installed on the management system components. In particular:

  • Management network (thick solid).
  • RAS dial-in into the management network (thin solid).

Figure A.14 Example management system

Management consoles

In this example management network, management consoles are decoupled from management servers and thus can be concentrated in the (highly secured) Network Operation Center(s). Management tools and applications must be carefully chosen to provide almost all management capabilities remotely.

Management consoles can run Windows NT Server, Windows NT Workstation, or Windows 2000 Professional. They would normally have a number of applications installed: Systems Management Server Administrator Console, Terminal Server (TS) Client, Telnet, Internet Explorer, and SNMP MIB browsers. These tools all provide remote management capabilities and therefore can be used in roaming environments by traveling technicians.

Management servers

In a distributed system, each management server serves only managed nodes in its jurisdiction—such as a farm or partition, a floor, a building, a campus, or a city. For example, a local management server can be running in each of the offices in Europe and North America, managing events and networks locally. Distributing management servers and partitioning them to manage only a limited number of nodes allows one to:

  • Lock the management servers in secure cabinets.
  • Reduce or eliminate long-distance network traffic (for example, servers running Windows NT in Asia are upgraded by the management server in Tokyo, not the one in New Jersey).
  • Eliminate single point of failure.

A management server does not normally interact with managed nodes in other areas, although it is not precluded from doing so.

We recommend that management servers run either Windows NT Server or Windows 2000 Server editions to ensure better stability of the system and to provide additional services that are available only in server editions. Management servers host management applications that provide the system and network management capabilities required for a site. Services and applications provided by Microsoft, such as Performance Monitor, Systems Management Server (SMS), and Event Logging, should be installed on the management servers. SNMP trap managers or trap receivers should also be installed on the management servers.

Back-end servers (BES)

Back-end servers are machines with large storage disks that are used for persistent storage of information collected by management servers. It is not necessary to use separate machines to store the management data. However, most large business Web sites log gigabytes of data each day (for later data mining and exploitation) and use separate machines to store this information. Large databases are often used to store events logged by the managed nodes, performance counters, and statistical data. SMS databases can be located on the BESs as well. Back-end servers can also host utilities and tools that manipulate the data stored in the databases: harvesters, parsers, etc. This allows many customers to use their own highly customized or legacy tools.

Distributed vs. Centralized

Distributed management systems have several key advantages over a centralized management system. They offer better scalability and availability, and reduce or eliminate performance bottlenecks and a single point of failure. However, distributed management systems introduce some deficiencies, such as higher costs (associated with adding more equipment and administration) and growing complexity. When designing a management system for a site, carefully weigh the pros and cons of taking a centralized or distributed management approach.

Management System Requirements

Deployment and Installation

To successfully deploy new services and equipment, the management system must provide tools for deployment and installation. Deployment includes installing and configuring new equipment and replicating Web site content and data on new machines. The following tools and techniques are most often used to deploy new services and machines.

Unattended/automated server installation

To deploy new servers, use scripts to build a golden (or ideal) version of the server. Then, capture an image copy of the golden server's system disk using a tool such as Norton Ghost and Ghost Walker (http://www.ghost.com/), and use that golden image to build new servers.

SysPrep (Windows 2000)

SysPrep is a tool (available in the Windows 2000 Resource Kit) designed to deploy fully installed Windows 2000 installations on multiple machines. After performing the initial setup steps on a single system, administrators can run SysPrep to prepare the sample machine for duplication. Web servers of a site farm are normally based on the same image, with minor configuration differences such as machine name and IP address. The combination of SysPrep and a winnt.sif answer file provides the means of making the minor configuration changes necessary for each machine.

Content replication

Content Replication Service and Robocopy are most often used for content replication. Content Replication Service is part of the Microsoft Site Server product line (http://www.microsoft.com/siteserver/site/). Robocopy is a 32-bit Windows command-line application that simplifies the task of maintaining an identical copy of a folder tree in multiple locations. Robocopy is available in the Windows NT Resource Kit and the Windows 2000 Resource Kit.
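
The idea of keeping an identical copy of a folder tree can be sketched in Python as follows. This is a rough approximation of the concept only, not Robocopy's algorithm or options: the mirror_tree function is hypothetical, it compares file contents rather than timestamps, and it ignores symbolic links, permissions, and retries.

```python
import filecmp
import shutil
from pathlib import Path


def mirror_tree(source, destination):
    """Make `destination` an identical copy of `source`.
    Returns (copied, removed) lists of relative paths."""
    source, destination = Path(source), Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    copied, removed = [], []

    # Pass 1: delete destination entries with no counterpart in the source.
    # Reverse-sorted order visits children before their parent directories.
    for dst_path in sorted(destination.rglob("*"), reverse=True):
        src_path = source / dst_path.relative_to(destination)
        if src_path.exists() and src_path.is_dir() == dst_path.is_dir():
            continue
        if dst_path.is_dir():
            shutil.rmtree(dst_path)
        else:
            dst_path.unlink()
        removed.append(dst_path.relative_to(destination).as_posix())

    # Pass 2: copy files that are new or whose contents differ.
    for src_path in sorted(source.rglob("*")):
        dst_path = destination / src_path.relative_to(source)
        if src_path.is_dir():
            dst_path.mkdir(parents=True, exist_ok=True)
        elif not dst_path.exists() or not filecmp.cmp(src_path, dst_path, shallow=False):
            dst_path.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src_path, dst_path)
            copied.append(src_path.relative_to(source).as_posix())
    return copied, removed
```

Running the function a second time against an unchanged source copies and removes nothing, which is the property that makes repeated scheduled replication cheap.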

Change and Configuration

Systems Management Server (SMS) provides the means necessary for change and configuration management of site servers. SMS automates many change and configuration management tasks, such as hardware and software inventory, product compliance, software distribution and installation, and software metering.

More information on Microsoft Systems Management Server is located at http://www.microsoft.com/smsmgmt/. Other tools available from third-party vendors are listed at http://www.microsoft.com/ntserver/management/exec/vendor/ThrdPrty.asp.

Performance Monitoring

Continuous monitoring is essential for operating a site's 24x7 services. Many sites use extensive logging and counter-based monitoring, along with as much remote administration as possible, to both ensure continuous availability and to provide the data with which to improve their infrastructure. Tools used to monitor performance of site servers include Performance Monitor, SNMP MIB Browsers, and HTTPMon.

Event Management

Event management entails monitoring the health and status of site systems (usually in real time), alerting administrators to problems, and consolidating the event logs in a single place for ease of administration. The event monitoring tools may track individual servers or network components, or they may focus on application services such as e-mail, transaction processing, or Web service. Event filtering, alerting, and visualization tools are an absolute necessity for sites with hundreds of machines in order to separate important events from background noise. Tools such as Event Log, SNMP Agents, and SMS (for event-to-SNMP-trap conversion) can be used for event management.
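
Consolidating per-node logs in one place and then picking out the entries that deserve an alert can be sketched in a few lines of Python. The tuple layout (timestamp, node, level, message) and the level names are assumptions for illustration; they are not the Windows NT event log format.

```python
import heapq


def consolidate(*node_logs):
    """Merge per-node logs, each already sorted by timestamp,
    into one time-ordered list."""
    return list(heapq.merge(*node_logs, key=lambda entry: entry[0]))


def alerts(entries, alert_levels=("error",)):
    """Select the entries whose level should be brought to an
    administrator's attention."""
    return [entry for entry in entries if entry[2] in alert_levels]
```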

Out-of-Band Emergency Recovery

Repairing failed nodes when the management network itself is down presents a difficult manageability problem. When in-band intervention is impossible, out-of-band (OOB) management comes to the rescue.

OOB management refers to products that give technicians access to managed nodes over dial-up telephone lines or a serial cable rather than over the management network. Therefore, a serial port must be available on every managed node for out-of-band access. Use OOB management to bring a failed service or node back online so that it can be repaired in-band, to analyze the reasons for failure, and so on.

OOB requirements

OOB management should provide all or some of the following capabilities:

Operating System and Service Control

  • Restart the failed service or node.
  • Take the failed service or node offline. (This is important because the failed node can flood the network with notifications of failure.)
  • Set up and control firmware.
  • Change firmware configuration.
  • Set up OS and service.

BIOS and Boot Device Control

  • Configure hardware power management.
  • Automate BIOS configuration and hardware diagnostics.
  • Enable remote console input and output.

OOB solutions

Many solutions are available for performing the tasks just described. Table A.2 summarizes the most widely used solutions from Microsoft and third-party vendors.

Table A.2 Out-of-Band Solutions

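Capability | Name | Vendor
-----------|------|-------
Terminal Server | Available with Windows NT 4.0 Terminal Server Edition or Windows 2000 | Microsoft
Terminal Server | TermServ | Seattle Labs
Setup and installation | Unattended OS and Post OS shell scripts | Microsoft
Setup and installation | Ghost | Norton
Setup and installation | IC3 | ImageCast
Setup and installation | Remote Installation Service (Windows 2000 only) | Microsoft
BIOS configuration and hardware diagnostics | Integrated Remote Console (IRC) | Compaq
BIOS configuration and hardware diagnostics | Remote Insight Board (RIB) | Compaq
BIOS configuration and hardware diagnostics | Emerge Remote Server Access | Apex
Hardware power management | Integrated Remote Console (IRC) | Compaq
Hardware power management | Remote Insight Board (RIB) | Compaq
Hardware power management | Remote Power Control | Baytech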

OOB security

Dialing into the console port exposes the network to unauthorized access. Prevent this by securing OOB operations. At a minimum, strong authentication of administrative staff should be required, usually with one-time (challenge-response) passwords provided by security tokens. Administrators are provided with either hardware-based or software-based tokens, which negotiate with an access server at the target site. A successful negotiation opens a connection to a terminal server, which in turn provides serial port access to a specific host. Ideally, employ link encryption as well in order to prevent snooping or possible compromise by an intruder. An increasingly popular solution is to use public-key-based VPNs (Virtual Private Networks) to provide both strong authentication and encryption.
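
The challenge-response exchange between a token and an access server can be illustrated with a minimal Python sketch. It assumes a shared secret per administrator and uses HMAC-SHA256 as the response function; this is a stand-in chosen for illustration, not the algorithm of any particular token product, and the AccessServer class is hypothetical.

```python
import hashlib
import hmac
import secrets


def token_response(shared_secret: bytes, challenge: str) -> str:
    """What the administrator's token computes from the server's challenge."""
    return hmac.new(shared_secret, challenge.encode("ascii"), hashlib.sha256).hexdigest()


class AccessServer:
    """Issues one-time challenges and verifies the responses."""

    def __init__(self, secrets_by_user):
        self._secrets = dict(secrets_by_user)
        self._outstanding = {}  # user -> challenge awaiting a response

    def issue_challenge(self, user):
        challenge = secrets.token_hex(16)  # fresh random value per attempt
        self._outstanding[user] = challenge
        return challenge

    def verify(self, user, response):
        # pop() makes each challenge usable once, so a captured
        # response cannot be replayed later.
        challenge = self._outstanding.pop(user, None)
        if challenge is None or user not in self._secrets:
            return False
        expected = token_response(self._secrets[user], challenge)
        return hmac.compare_digest(expected, response)
```

Because the response depends on a fresh challenge each time, an eavesdropper on the dial-up line learns nothing reusable, although the session itself still needs link encryption to stay private.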

Automation of Management Tasks

Management system design should allow implementation of automated actions such as stopping or starting a service or entire node, running a script or a batch file when certain events occur, or attempting an out-of-band recovery if the management network is unavailable. Well-designed systems will automatically notify IT technicians of events or problems using e-mail, telephone, pagers, or cell phones.

Many tools and applications can automate management tasks:

  • Set an alert on a counter in the Alert View of Performance Monitor, thereby triggering a message to be sent, a program to be run, or a log to be started when the selected counter's value equals, exceeds, or falls below a specified setting.
  • SNMP Managers provide automation tools to generate notifications and start a program, batch, or script when certain traps are received.
  • BackOffice components such as Microsoft Exchange and SQL Server can raise alerts when service-specific events occur (for example, when a remote mail server does not respond to messages within a predefined interval). Microsoft Exchange Server, for example, can send e-mail, display on-screen alerts, or route notifications to an external application.
  • Use the Windows Scripting Host (WSH) or any other scripting mechanism to write flexible scripts, which monitor the system and generate messages or trigger automated jobs when needed.
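
The counter-alert pattern described in the first item of this list, where an action runs when a counter's value equals, exceeds, or falls below a setting, can be sketched in Python. The CounterAlert class, the condition names, and the sample dictionary format are hypothetical; this is not the Performance Monitor API.

```python
import operator

# Hypothetical condition names mapped to comparison functions.
COMPARISONS = {"over": operator.gt, "under": operator.lt, "equal": operator.eq}


class CounterAlert:
    """Runs an action when a sampled counter meets a condition."""

    def __init__(self, counter, condition, threshold, action):
        self.counter = counter
        self.compare = COMPARISONS[condition]
        self.threshold = threshold
        self.action = action  # callable taking (counter_name, value)

    def check(self, sample):
        """Evaluate one sample (a dict of counter name -> value).
        Returns True if the action was triggered."""
        value = sample.get(self.counter)
        if value is not None and self.compare(value, self.threshold):
            self.action(self.counter, value)
            return True
        return False
```

The action could send a message, start a program, or begin logging; keeping it as a plain callable lets the same alert definition drive any of those responses.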

Many third-party solutions that allow automation are also available, as listed on: http://www.microsoft.com/ntserver/management/exec/vendor/ThrdPrty.asp.

Security

Security of the management infrastructure is of paramount importance, because compromise of this subsystem can lead to compromise of every other component of a site. All of the elements of security discussed in the preceding section on security architecture apply here.

SNMP, one of the most widely used management protocols, is weak from a security standpoint. The SNMP community string is effectively a very weak password: although it does not permit a user to log on, it can permit someone to take control of a node. Carefully choose and tightly control the management protocols used at each site.



Microsoft Application Center 2000 Resource Kit 2001