Chapter 9: Network Management

OVERVIEW

In the old days of small workgroup LANs, it was relatively easy for a system administrator to keep tabs on the status of desktop PCs, servers, and the network simply by looking at the lights on the front of the equipment. As these networks grew in complexity and scope, knowing the status of all parts of the network at all times became more than any person, or group of people, could manage. This problem provided the challenge for the first network management system (NMS). The early NMS software was little more than a log reader, similar to the Event Viewer in Windows Server 2003 today. Next, the ability to read status and alert messages in a standard format was added. This standard format became the Simple Network Management Protocol (SNMP). Manufacturers quickly added the ability to format and send SNMP messages to all of their equipment. Today, virtually all network infrastructure devices such as routers, switches, bridges, and CSU/DSUs, as well as servers and operating systems, can report their status using SNMP. It is this capability that makes modern NMS packages such as Microsoft Systems Management Server (SMS), Microsoft Operations Manager (MOM), Citrix Resource Manager (RM), and HP OpenView possible. The ability to receive and collate SNMP messages is only the tip of the iceberg of what an NMS can do and what your organization should use it for.

Although on-demand access is, by nature, more centralized and architecturally simpler than distributed computing, this does not mitigate the need for a strong system management environment (SME). It is even more critical to establish service level agreements for services delivered and to use tools, such as an NMS, to manage them. This chapter discusses general SME messaging standards, SME characteristics including monitoring and reporting for on-demand access, and concepts for SME implementation using tools from Microsoft, Citrix, Hewlett-Packard, and others.

PEOPLE, PROCESSES, AND PRODUCT

Utilizing an NMS is only part of an organization's overall SME. An SME consists of the people, processes, and product (the "three Ps") within an organization that effectively manage the computing resources of that organization. "Product" is more accurately "technology," but "two Ps and a T" doesn't have the same punch as "three Ps." We find the simplest way to think of the interrelationship between the three Ps is in terms of service level agreements (SLAs).

SERVICE LEVEL AGREEMENTS

An SLA in this context is an agreement between the IT staff and the user community about the services being provided, the manner in which they are delivered, the responsibilities of the IT support staff, and the responsibilities of the users. An SLA serves many important functions, including setting the expectations of the users about the scope of services being delivered and providing accountability and a baseline of measurement for the IT staff. The established SLAs in your organization also provide the framework for the SME. After all, if you don't first figure out what you are managing and how you will manage it, what good will a tool do you? In addition to incorporating the three Ps, a service level agreement should address the following three areas of responsibility:

  • Availability: This section should explain when the services are provided, the frequency (if appropriate), and the nature of the services.

  • Performance: This section describes how the service is to be performed and any underlying processes related to the delivery of the service.

  • Usability: This section should show how to measure whether the service is being used effectively. For example, a measure of success could be infrequent help desk calls.

Table 9-1 shows a sample SLA for an enterprise backup service.

Table 9-1: SLA for Enterprise Backup

Volumes To Back Up

  • Palo Alto Data Center

    • Network Appliance Filer cluster (400GB)

    • HP 9000 Oracle database (120GB)

    • Backup Device: Spectra Logic Library, 8 AIT tape drives

  • Denver Data Center

    • Network Appliance Filer cluster (800GB)

    • HP 9000 Oracle database (220GB)

    • Backup Device: Spectra Logic Library, 8 AIT tape drives

Availability

  • Daily: Incremental backups

  • Weekly: Full backups

  • Monthly: Full backups

  • Quarterly: Full backups

  • Tape Rotation: Three months of daily tapes are used, then rotated.

  • Online backup: A snapshot is taken every 4 hours for the Network Appliances. The last 12 snapshots are available (48 hours).

  • Archive/grooming backup: Every two weeks

Performance

  • Backups are scheduled and designed not to affect production system performance.

  • Five weeks of tapes per month are used.

  • Daily log reports are generated noting which tapes are in what backup set.

  • Full backups are taken offsite the following Wednesday and are returned according to a three-month cycle.

  • Sample files are restored and verified three times per week.

  • Archive/grooming backup: files not touched in 14 months are written to tape every two weeks and are deleted from production storage after three backups.

Usability

  • Problem response according to standard help desk SLA.

  • Turnaround for nonpriority restoration and archive requests is three days.

  • Service performance reports are published weekly to users via an intranet site.
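Several entries in Table 9-1 are simple rules that can be checked mechanically. The following is a minimal sketch in Python, not part of any backup product: the function and constant names are hypothetical, the numbers are taken from the table, and treating a month as 30 days is an assumption, since the SLA does not say how months are counted.

```python
from datetime import date, timedelta

# Values taken from the sample SLA in Table 9-1.
SNAPSHOT_INTERVAL_HOURS = 4   # one online snapshot every 4 hours
SNAPSHOTS_RETAINED = 12       # the last 12 snapshots are kept
GROOMING_AGE_MONTHS = 14      # files untouched for 14 months are archived
BACKUPS_BEFORE_DELETE = 3     # files are deleted after three archive backups


def snapshot_window_hours(interval_hours: int, retained: int) -> int:
    """Return how far back the online snapshots reach, in hours."""
    return interval_hours * retained


def is_grooming_candidate(last_touched: date, today: date,
                          age_months: int = GROOMING_AGE_MONTHS) -> bool:
    """Return True if the file has been untouched for at least age_months.

    Assumption: one month is approximated as 30 days.
    """
    return (today - last_touched) >= timedelta(days=30 * age_months)


def may_delete(archive_copies: int,
               required: int = BACKUPS_BEFORE_DELETE) -> bool:
    """Return True once the file is on tape the required number of times."""
    return archive_copies >= required


if __name__ == "__main__":
    window = snapshot_window_hours(SNAPSHOT_INTERVAL_HOURS, SNAPSHOTS_RETAINED)
    print(f"Online snapshots cover the last {window} hours")
```

Writing the rules down this way makes it easy to confirm that the stated numbers agree with each other, for example that 12 snapshots at 4-hour intervals give the 48-hour window the SLA promises.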

Ideally, the SLA is an extension of the overall business goals. Defining a group of SLAs for an organization that has never used them can be a daunting task. The following tips will help you with the effort:

  • Start by deciding which parts of your infrastructure go directly to supporting your business goals, and define exactly how that happens.

  • Do not define an SLA in terms of your current support capability. Think "outside the box" regarding how a particular service should be delivered. The result will be your goal for the SLA. Now work backward and figure out what has to be done to reach the ideal SLA.

  • Rather than starting at the ground level with individual SLAs for particular services, try laying down some universal rules for a so-called Master SLA. After all, some things will apply to nearly every service you deliver. A good place to start is with the help desk, where all user calls are taken. Decide how the help desk will handle, prioritize, and assign calls. The problem response time, for example, will be a standard time for all nonpriority calls. Once that is established, you can think about whether different services may need different handling for priority calls. Decide what the overall mission and goals of the IT staff are and how they support the business. Work backward from that to how the service management function must be defined to align with those goals.

Establishing a viable SLA for the user community, whether corporate users or fee-for-service (ASP) users, mandates equivalent SLAs with your providers. For example, most WAN providers (Qwest, Sprint, AT&T) will guarantee various parameters (availability, bandwidth, latency) that impact your ability to deliver service to users. Ensure that your internal SLAs do not promise more stringent quality and reliability guarantees than your external SLAs provide.
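The rule that an internal SLA should never promise more than a provider guarantees can be expressed as a simple comparison. The following Python sketch is illustrative only: the SlaTerms class, its field names, and the sample figures are hypothetical, and a real comparison would cover whatever parameters your provider contracts actually specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SlaTerms:
    """Hypothetical subset of SLA parameters for one service."""
    availability_pct: float     # higher is stricter
    min_bandwidth_mbps: float   # higher is stricter
    max_latency_ms: float       # lower is stricter


def stricter_than_provider(internal: SlaTerms, external: SlaTerms) -> List[str]:
    """List the parameters where the internal SLA promises more than
    the provider's (external) SLA guarantees."""
    problems = []
    if internal.availability_pct > external.availability_pct:
        problems.append("availability")
    if internal.min_bandwidth_mbps > external.min_bandwidth_mbps:
        problems.append("bandwidth")
    if internal.max_latency_ms < external.max_latency_ms:
        problems.append("latency")
    return problems


if __name__ == "__main__":
    # Made-up figures for illustration only.
    promised_to_users = SlaTerms(99.9, 10.0, 100.0)
    wan_provider = SlaTerms(99.5, 10.0, 80.0)
    print(stricter_than_provider(promised_to_users, wan_provider))
```

An empty result means every internal promise is backed by an equal or stronger provider guarantee; any name in the list marks a promise that the provider contract cannot support.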

The subject of defining and working with SLAs is adequate material for a book all its own. Our intention here is to get you started in framing your network management services in terms of SLAs. You will find them to be not only a great help in sorting through the "noise" of information collected, but also an invaluable communication tool for users, IT staff, and management alike.