Using Microsoft Operations Manager for Advanced Automation | Microsoft Windows Server 2003 Insider Solutions

Microsoft has made a huge push into the arena of enterprise-wide system monitoring. Microsoft released Microsoft Operations Manager, or MOM, as a tool that integrates tightly into the monitoring and alerting process of Microsoft's technologies. The result is a monitoring application that supports hardware-level, port-level, service-level, and application-level monitoring. By having access to Microsoft's full source code for all Microsoft applications, the MOM developers were able to create management packs that could gather every last iota of useful information from an application and allow rules and thresholds to determine when to generate alerts or log useful information.

Understanding MOM

MOM, as a comprehensive monitoring and alerting package, consists of data providers, event correlation, filters, rules, knowledge packs, and knowledge base integration that work together not only to monitor a system, but also to link the administrator to solutions for problems. Rather than just identify a problem, MOM is able to link the user to the Microsoft Knowledge Base to suggest solutions to the issue that has arisen. Similarly, MOM enables you to store problem resolutions in a local knowledge base so that other site administrators can learn from the past experiences of their coworkers. Rather than having administrators reinvent the wheel each time, MOM helps companies pool together the "islands of knowledge" that exist at any company. Putting these resources into a central location makes it easier for administrators to draw from them.

Benefits of MOM

MOM is oriented around three primary goals: managing, tuning, and securing Windows and Windows-based applications.

In the area of managing, MOM offers full-time monitoring of all aspects of the Windows server-based environment. It provides proactive alerting and responses by using built-in filters and logic to recognize events and conditions that can lead to failure in the future.

Like most monitoring applications, MOM collects long-term trending data about the performance of a system. MOM takes this concept one step further by providing suggestions for improving performance and enabling you to compare the results of performance adjustments to historical information. This addresses one of the fundamental issues with performance tuning: having a valid historical benchmark against which to check whether changes to the system are actually improving its performance. MOM provides the empirical data needed to measure the effect of system tuning.
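The idea of measuring a tuning change against a historical baseline can be illustrated with a short Python sketch. This is not how MOM itself stores or reports data; the function name and the sample values are hypothetical and serve only to show the comparison.

```python
from statistics import mean


def percent_change(baseline, current):
    """Return the percent change of the current average relative to the baseline average."""
    base_avg = mean(baseline)
    return (mean(current) - base_avg) / base_avg * 100.0


# Hypothetical CPU utilization samples (percent) collected before and after a tuning change.
before_tuning = [72.0, 68.0, 75.0, 70.0, 65.0]
after_tuning = [58.0, 61.0, 55.0, 60.0, 56.0]

change = percent_change(before_tuning, after_tuning)
print(f"Average CPU utilization changed by {change:.1f}%")
```

A negative result means the tuned system is using less of the resource than the baseline period did. Without the stored baseline there would be nothing to compare the new samples against.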

Windows 2000 and Windows 2003 provide excellent auditing capabilities. The problem is that auditing can produce an incredible amount of data that must be reviewed regularly by the system administrator. The sheer volume of data limits the amount of attention an administrator can give to it, which makes it nearly impossible to review the security logs closely enough to catch subtle security problems. The natural tendency of the system administrator is to reduce the number of items being audited. Although this frees up time for the administrator, it reduces the amount of valuable data entering the system. Unlike an administrator, MOM will tirelessly monitor the logs on every server around the clock, correlating individual events to identify potential hacking attempts or security breaches. MOM can be an administrator's best friend because it is able to take on the tedious task of reviewing the event logs on all servers in the enterprise to determine if the conditions for a failure are present.
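To make the notion of correlating individual events more concrete, the following Python sketch flags accounts that accumulate many failed logons within a short window, counted across all servers rather than per server. It is only an illustration under assumed inputs: the event tuples, server names, account names, threshold, and window are hypothetical, and MOM's own rules are defined in its management packs, not in code like this.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_suspect_accounts(events, threshold=5, window=timedelta(minutes=10)):
    """Return accounts with at least `threshold` failed logons inside any `window`.

    `events` is an iterable of (server, account, timestamp) tuples describing
    failed logons. Failures are pooled across servers, so an attack spread over
    several machines is still counted as one pattern.
    """
    by_account = defaultdict(list)
    for _server, account, timestamp in events:
        by_account[account].append(timestamp)

    suspects = set()
    for account, times in by_account.items():
        times.sort()
        start = 0
        for end in range(len(times)):
            # Shrink the window from the left until it spans at most `window`.
            while times[end] - times[start] > window:
                start += 1
            if end - start + 1 >= threshold:
                suspects.add(account)
                break
    return sorted(suspects)


# Hypothetical failed-logon events gathered from three servers.
base = datetime(2004, 1, 15, 2, 0)
events = [
    ("DC01", "jsmith", base),
    ("FILE01", "jsmith", base + timedelta(minutes=1)),
    ("EXCH01", "jsmith", base + timedelta(minutes=2)),
    ("DC01", "jsmith", base + timedelta(minutes=3)),
    ("FILE01", "jsmith", base + timedelta(minutes=4)),
    ("DC01", "adavis", base),
    ("DC01", "adavis", base + timedelta(hours=6)),
]

print(find_suspect_accounts(events))
```

No single server in this example sees more than two failures for the flagged account, which is why a per-server review of the logs would be likely to miss the pattern.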

Statistics suggest that 40% of system outages are caused by application failure, including software bugs, application-level errors, and interoperability problems. Another 40% of outages are attributed to operator errors, including configuration errors, entering data incorrectly, and failure to monitor. The other 20% are attributed to hardware failures, power failures, natural disasters, and so on.

As you can see, application failures and operator errors together account for 80% of system outages. As such, the greatest return on investment for system uptime comes from focusing on these two areas. Although end users are very good at spotting and reporting system outages, it is far preferable to predict potential outages and fix them proactively.

In large companies, administrators tend to work in groups with other administrators who are knowledgeable in a specific area. By putting these specialists together in teams, systems can be effectively managed by these experts. The downside to this philosophy is that a company ends up creating isolated containers of knowledge. Groups that specialize in managing a specific application might not be knowledgeable about the operating system that it runs on. Similarly, applications that are dependent on other applications are usually managed by administrators who only understand their own application, not the applications upon which they are dependent. The result is that information outside a group's area of expertise is not well utilized. An Exchange support group might be getting error messages in the event log that reference data about the connection to a SAN. Without SAN knowledge, the Exchange group can't know if the log entries are problems or simply informative messages. This can make it very easy to ignore potential problems.

MOM attempts to combat this type of issue by providing its own expertise and knowledge. MOM can correlate events with other events and predict the actual outcome. MOM draws information from each of the separate systems in the network and places it in a single location. Equally important, MOM stores this information long term. A busy administrator can easily miss key event log entries because they are overwritten by other events. MOM, on the other hand, reads each and every log diligently and reacts to events based on filters and logic. Because these key events are stored centrally over a long period of time, administrators are able to go back and look at historic events on a server. By having access to all the data centrally, MOM is able to act on the big picture rather than only react to individual system problems.

Similarly, by having access to all the data and seeing the big picture, MOM is able to filter out false positives by understanding which errors are actually results of a "lowest common denominator" error. For example, if MOM knows that the local router interface is down, it knows not to report all objects known to be on the far side of that router as down as well. It knows that the service checks or application parameter checks on those systems are failing because the systems are unreachable. This drastically reduces the number of false positives and reduces the load on the system.
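The router example can be sketched in a few lines of Python. The sketch assumes a simple map from each monitored object to the single upstream object it is reached through; the host names and the function names are hypothetical, and this is an illustration of dependency-based suppression in general, not MOM's actual logic.

```python
def has_failed_upstream(node, failed, depends_on):
    """Return True if any object upstream of `node` is itself reported as failed."""
    seen = {node}
    parent = depends_on.get(node)
    # Walk up the dependency chain; `seen` guards against accidental cycles.
    while parent is not None and parent not in seen:
        if parent in failed:
            return True
        seen.add(parent)
        parent = depends_on.get(parent)
    return False


def root_cause_alerts(failed, depends_on):
    """Keep only the failures that are not explained by a failed upstream object."""
    failed = set(failed)
    return [n for n in sorted(failed) if not has_failed_upstream(n, failed, depends_on)]


# Hypothetical topology: each object maps to the object it is reached through.
depends_on = {
    "branch-router": "core-router",
    "branch-file01": "branch-router",
    "branch-exch01": "branch-router",
    "hq-sql01": "core-router",
}

failed = ["branch-router", "branch-file01", "branch-exch01"]
print(root_cause_alerts(failed, depends_on))
```

Only the router is reported, because the two servers behind it are unreachable rather than independently broken. A failure with no failed object upstream, such as hq-sql01 going down on its own, would still be reported.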

The other area in which MOM really shines is in helping to secure the servers in the enterprise. MOM is able to monitor remote servers for the presence of security patches and hotfixes. Because MOM is tied in with the Microsoft Knowledge Base, it is able to determine what patches should be on a system based on the services it sees the system running.

Having a centralized view of a distributed environment makes managing the security of the environment much easier. By being able to monitor such a large number of events on a server and having access to a centralized knowledge base, MOM is able to perform a basic level of intrusion detection as well. MOM will recognize patterns in traffic and events on a server that most administrators will miss.

Third-Party Monitoring and Alerting

There are many third-party monitoring solutions on the market that provide various levels of monitoring, reporting, alerting, and trend analysis. Some of the more popular ones are:

HP OpenView
Unicenter TNG
Servers Alive
WhatsUp Gold
BMC Patrol
SiteScope
MRTG

Aside from HP OpenView and BMC Patrol, most of these applications are meant for smaller networks and do not provide the depth of monitoring options that an administrator would get from something like MOM. For small environments, these applications do a good job of alerting administrators when monitored parameters surpass a particular threshold. But they do not support knowledge base links, local knowledge bases, or the correlation of events with other events to assess the overall situation.

Improving Monitoring Via SMS

Most administrators view Systems Management Server (SMS) as purely a tool for distributing software. Although it is very good at this task, a clever administrator can leverage the capabilities of SMS to further enhance the monitoring environment. Most monitoring packages deal exclusively with servers and network hardware. SMS, on the other hand, is focused mostly on desktops. Because licensing a monitoring package to monitor desktops is usually prohibitively expensive, SMS is a logical choice because it is already gathering information on all the desktops.

SMS software inventory reports are a great source of data to mine for potential intrusions into the network. Monitoring systems for unexpected software packages is a great way to catch viruses or Trojan horses that install themselves on a system. After all, one of the key points of monitoring is to improve network security. As any administrator will tell you, the greatest threat to a network's security is its end users. SMS is a great tool for keeping tabs on end users' computers.
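Checking an inventory for unexpected software amounts to comparing each machine's installed packages against an approved list. The Python sketch below assumes the inventory has already been exported into a simple mapping of machine name to package names; that data shape, the machine names, and the package names are all hypothetical, and the sketch does not use any SMS interface.

```python
def unexpected_software(inventory, approved):
    """Map each machine to the sorted list of installed packages not on the approved list."""
    approved_normalized = {name.lower() for name in approved}
    findings = {}
    for machine, packages in inventory.items():
        # Compare case-insensitively so "WinZip" and "winzip" count as the same package.
        extras = sorted(p for p in packages if p.lower() not in approved_normalized)
        if extras:
            findings[machine] = extras
    return findings


# Hypothetical approved list and exported inventory data.
approved = ["Microsoft Office", "Adobe Reader", "WinZip"]
inventory = {
    "DESKTOP-014": ["Microsoft Office", "Adobe Reader"],
    "DESKTOP-022": ["Microsoft Office", "winzip", "FreeScreenSaverPack"],
    "DESKTOP-031": ["RemoteAdminTool", "Adobe Reader"],
}

for machine, extras in unexpected_software(inventory, approved).items():
    print(f"{machine}: {', '.join(extras)}")
```

Machines whose packages are all approved are left out of the result, so the report contains only the desktops that merit a closer look.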