Chapter 9:
Network Management
OVERVIEW
In the old days of small work group LANs, it was relatively easy
for a system administrator to keep tabs on the status of desktop
PCs, servers, and the network simply by looking at the lights on
the front of the equipment. As these networks grew in complexity
and scope, it became more than any person, or group of people,
could do to know the status of all parts of the network at all
times. This problem provided the challenge for the first network
management system (NMS). The early NMS software was little more
than a log reader, similar to the Event Viewer in Windows Server
2003 today. Next, the ability to read status and alert messages in
a standard format was added. This standard format became the Simple
Network Management Protocol (SNMP). Manufacturers quickly added the
ability to format and send SNMP messages to all of their equipment.
Today, virtually all network infrastructure devices such as
routers, switches, bridges, and CSU/DSUs, as well as servers and
operating systems, can report their status using SNMP. It is this
capability that makes modern NMS packages like Microsoft Systems
Management Server (SMS) and Microsoft Operations Manager (MOM),
Citrix Resource Manager (RM), and HP OpenView possible. The ability
to receive and collate SNMP messages is only the tip of the iceberg
of what an NMS can do and what your organization should use it
for.
Although on-demand access is, by nature, more centralized and
architecturally simpler than distributed computing, this does not
mitigate the need for a strong system management environment (SME).
It is even more critical to establish service level agreements for
services delivered and to use tools, such as an NMS, to manage
them. This chapter discusses general SME messaging standards, SME
characteristics including monitoring and reporting for on-demand
access, and concepts for SME implementation using tools from
Microsoft, Citrix, Hewlett-Packard, and others.
PEOPLE, PROCESSES, AND PRODUCT
Utilizing an NMS is only part of an organization's overall SME.
An SME consists of the people, processes, and product ("three Ps")
within an organization that effectively manage the computing
resources of that organization. "Product" is more accurately
"technology," but "two Ps and a T" doesn't have the same punch as
"three Ps." We find the simplest way to think of the
interrelationship between the three Ps is in terms of service level
agreements (SLAs).
SERVICE LEVEL AGREEMENTS
An SLA in this context is an agreement between the IT staff and the user community about the services being provided, the manner in which they are delivered, the responsibilities of the IT support staff, and the responsibilities of the users. An SLA serves many important functions, including setting the expectations of the users about the scope of services being delivered and providing accountability and a baseline of measurement for the IT staff. The established SLAs in your organization also provide the framework for the SME. After all, if you don't first figure out what you are managing and how you will manage it, what good will a tool do you? In addition to incorporating the three Ps, a service level agreement should address the following three areas of responsibility:
- Availability: This section should explain when the services are provided, the frequency (if appropriate), and the nature of the services.
- Performance: This section describes how the service is to be performed and any underlying processes related to the delivery of the service.
- Usability: This section should show how to measure whether the service is being used effectively. For example, a measure of success could be infrequent help desk calls.
Table 9-1 shows a sample SLA for an enterprise backup service.
Table 9-1: SLA for Enterprise Backup

Volumes To Back Up
- Palo Alto Data Center
  - Network Appliance Filer cluster (400GB)
  - HP 9000 Oracle database (120GB)
  - Backup Device: Spectra Logic Library, 8 AIT tape drives
- Denver Data Center
  - Network Appliance Filer cluster (800GB)
  - HP 9000 Oracle database (220GB)
  - Backup Device: Spectra Logic Library, 8 AIT tape drives

Availability
- Daily: Incremental backups
- Weekly: Full backups
- Monthly: Full backups
- Quarterly: Full backups
- Tape Rotation: Three months of daily tapes are used, then rotated.
- Online backup: A snapshot is taken every 4 hours for the Network Appliances. The last 12 snapshots are available (48 hours).
- Archive/grooming backup: Every two weeks

Performance
- Backups are scheduled and designed to minimize the effect on production system performance.
- Five weeks of tapes per month are used.
- Daily log reports are generated noting which tapes are in which backup set.
- Full backups are taken offsite the following Wednesday and are returned according to a three-month cycle.
- Sample files are restored and verified three times per week.
- Archive/grooming backup: Files not touched in 14 months are written to tape every two weeks and are deleted from production storage after three backups.

Usability
- Problem response according to standard help desk SLA.
- Turnaround for nonpriority restoration and archive requests is three days.
- Service performance reports are published weekly to users via an intranet site.
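Writing an SLA's terms down as data makes the arithmetic behind them easy to check. The following is a minimal Python sketch, not part of any NMS product: it records the snapshot terms and volume sizes from Table 9-1 and recomputes the 48-hour online window. The class and field names (SnapshotPolicy, DataCenter, and so on) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotPolicy:
    """Online backup terms from the Availability row of Table 9-1."""
    interval_hours: int  # a snapshot is taken every N hours
    retained: int        # number of most recent snapshots kept available

    def coverage_hours(self) -> int:
        """Hours of history reachable from the retained snapshots."""
        return self.interval_hours * self.retained


@dataclass(frozen=True)
class DataCenter:
    """Volumes listed for one site in Table 9-1 (sizes in GB)."""
    name: str
    filer_gb: int
    oracle_gb: int

    def total_gb(self) -> int:
        """Total capacity that a full backup of this site must cover."""
        return self.filer_gb + self.oracle_gb


policy = SnapshotPolicy(interval_hours=4, retained=12)
sites = [
    DataCenter("Palo Alto", filer_gb=400, oracle_gb=120),
    DataCenter("Denver", filer_gb=800, oracle_gb=220),
]

print(policy.coverage_hours())           # 48
print(sum(s.total_gb() for s in sites))  # 1540
```

A record like this can be compared against what the backup system actually reports, which is the kind of measurement baseline an SLA is meant to provide.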
Ideally, the SLA is an extension of the overall business goals. Defining a group of SLAs for an organization that has never used them can be a daunting task. The following tips will help you with the effort:
- Start by deciding which parts of your infrastructure go directly to supporting your business goals, and define exactly how that happens.
- Do not define an SLA in terms of your current support capability. Think "outside the box" regarding how a particular service should be delivered. The result will be your goal for the SLA. Now work backward and figure out what has to be done to reach the ideal SLA.
- Rather than starting at the ground level with individual SLAs for particular services, try laying down some universal rules for a so-called Master SLA. After all, some things will apply to nearly every service you deliver. A good place to start is with the help desk, where all user calls are taken. Decide how the help desk will handle, prioritize, and assign calls. The problem response time, for example, will be a standard time for all nonpriority calls. Once that is established, you can think about whether different services may need different handling for priority calls. Decide what the mission and goals are of the IT staff overall and how they support the business. Work backward from that to how the service management function must be defined to align with those goals.
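The Master SLA idea in the last tip can be pictured as a set of defaults plus per-service exceptions. The sketch below is one possible Python layout, offered only as an illustration: the response times, the service name enterprise_backup, and the function response_hours are all hypothetical and do not come from any particular help desk product.

```python
# Hypothetical Master SLA defaults: response time in hours by call priority.
MASTER_SLA = {"nonpriority": 8.0, "priority": 2.0}

# Hypothetical per-service overrides. Only priority calls are overridden,
# so the nonpriority response time stays standard across every service.
SERVICE_OVERRIDES = {
    "enterprise_backup": {"priority": 1.0},
}


def response_hours(service: str, priority: str) -> float:
    """Return the response-time target, falling back to the Master SLA."""
    if priority not in MASTER_SLA:
        raise ValueError(f"unknown priority: {priority!r}")
    override = SERVICE_OVERRIDES.get(service, {})
    return override.get(priority, MASTER_SLA[priority])


print(response_hours("enterprise_backup", "priority"))     # 1.0
print(response_hours("enterprise_backup", "nonpriority"))  # 8.0
```

Keeping the defaults in one place means an individual service SLA only has to state where it differs from the Master SLA.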
Establishing a viable SLA for the user community (whether corporate users or fee-for-service (ASP) users) mandates equivalent SLAs with your providers. For example, most WAN providers (Qwest, Sprint, AT&T) will guarantee various parameters (availability, bandwidth, latency) that impact your ability to deliver service to users. Ensure that your internal SLAs do not promise more stringent quality and reliability guarantees than your external SLAs provide.
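The rule about internal and external guarantees can be checked mechanically once both sets of figures are written down. The Python sketch below assumes the three parameters named above; the key names (availability_pct, bandwidth_mbps, latency_ms), the function overcommitted, and every figure are hypothetical.

```python
# Direction of "more stringent" for each parameter named in the text:
# higher availability and bandwidth are stricter, lower latency is stricter.
HIGHER_IS_STRICTER = {
    "availability_pct": True,
    "bandwidth_mbps": True,
    "latency_ms": False,
}


def overcommitted(internal: dict, external: dict) -> list:
    """List parameters where the internal SLA promises more than the provider guarantees."""
    problems = []
    for name, higher in HIGHER_IS_STRICTER.items():
        if name not in internal:
            continue  # nothing promised internally for this parameter
        if name not in external:
            problems.append(name)  # promised with no provider guarantee behind it
            continue
        if higher:
            stricter = internal[name] > external[name]
        else:
            stricter = internal[name] < external[name]
        if stricter:
            problems.append(name)
    return problems


# Hypothetical figures for illustration only.
provider = {"availability_pct": 99.9, "bandwidth_mbps": 45.0, "latency_ms": 80.0}
internal = {"availability_pct": 99.95, "bandwidth_mbps": 10.0, "latency_ms": 100.0}

print(overcommitted(internal, provider))  # ['availability_pct']
```

This comparison looks at a single provider link. A service that depends on several components would need its end-to-end figures worked out before making the same check.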
The subject of defining and working with SLAs is adequate material for a book all its own. Our intention here is to get you started in framing your network management services in terms of SLAs. You will find them to be not only a great help in sorting through the "noise" of information collected, but also an invaluable communication tool for users, IT staff, and management alike.