Before we discuss a powerful solution that provides the underlying mechanics to solve the previously proposed problems, we will visit a few of the traditional logging approaches. We will discuss their strong and weak points, and attempt to demonstrate their inadequacies in large mission-critical environments.
Periodic "Batch" Aggregation
Perhaps the oldest trick in the book when managing the logging of a multinode cluster is the simple aggregation of their local logs.
This configuration of log storage and eventual aggregation (shown in Figure 9.1) is probably the most commonly deployed architecture. In this configuration, web servers (or other log-producing application servers) write their logs to disk in a standard way. Periodically, a dedicated log management server connects to each log-producing server and "pulls" the logs. This typically happens on multihour or daily intervals. Although this configuration is simple and well tested, it leaves a bit to be desired.
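In practice, the "pull" step is often nothing more than a scheduled rsync job on the log management server. The following is a minimal sketch; the hostnames, paths, and schedule are hypothetical:

```
# Hypothetical crontab on the log management server: at 02:00 each day,
# pull the accumulated web server logs into per-host directories.
0 2 * * * rsync -az web01:/var/log/httpd/ /logs/web01/
0 2 * * * rsync -az web02:/var/log/httpd/ /logs/web02/
```

Each web server writes locally as usual; the aggregation host simply copies whatever has accumulated since the last run, which is exactly why the data is always hours stale.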
Figure 9.1. Classic periodic log aggregation.
The first major problem is that when a server crashes, a large segment of logs (those journaled between the last fetch and the time of the crash) may be unavailable for analysis and incorporation. When the server returns to service, a decision must be made: dispose of the logs, archive them, or go so far as to pull them and retroactively incorporate the data into the statistics for the time period in question. That last option has serious implications: the data is important, and incorporating it into historic data alters previously calculated statistics and summarizations. It isn't intuitive (and I argue it is entirely unacceptable) for statistics covering a past period of time to change.
Another serious shortcoming of this methodology is the lack of real-time exposure to the data. It is impossible to do passive, real-time analysis on logs because the aggregated data on which you can perform analysis is stale (and possibly incomplete). Real-time analysis is often not considered in architectures because the configuration simply does not afford the opportunity to perform it. As we will see later, real-time access to log data can offer tremendous benefits to systems engineering tasks, as well as enhance business knowledge.
Real-time Unicast Aggregation
The main disadvantages of batch aggregation revolve around its significant latency. This can be addressed simply by pushing logs from the web servers in real time. One such "push" implementation is logging via Syslog.
If you reconfigure your applications to write logs by the Syslog API, you can configure Syslog to use a foreign log host. In this configuration (shown in Figure 9.2), logs are effectively written in real-time to the log aggregation host.
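With a traditional syslogd, forwarding to a foreign log host is a one-line change on each web server (the log host name here is hypothetical):

```
# /etc/syslog.conf fragment: forward all facilities and priorities
# to the central log host over the standard syslog UDP port (514).
*.*    @loghost.example.com
```

Every message the application writes through the Syslog API is then relayed to the aggregation host the moment it is logged.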
Figure 9.2. Real-time unicast "push" log aggregation.
One serious shortcoming of a naive Syslog implementation is that Syslog uses UDP as its transport protocol. UDP is unreliable, and logging should be reliable, so we have a fundamental mismatch. Conveniently, real-time unicast logging is a paradigm not tied to a specific implementation. Syslog-NG is a Syslog replacement that can use TCP to log reliably to a remote log host. mod_log_mysql is an Apache module that inserts Apache logs directly into a MySQL database. Both of these approaches follow the same basic paradigm.
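As a sketch of the TCP variant, a Syslog-NG destination might look like the following (the source and host names are hypothetical, and the exact syntax varies by Syslog-NG version):

```
# syslog-ng.conf fragment: forward everything from the local source
# to the log host over TCP, which provides reliable, ordered delivery.
destination d_loghost { tcp("loghost.example.com" port(514)); };
log { source(s_local); destination(d_loghost); };
```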
This technique eliminates the major disadvantage of periodic aggregation. We can now look at the log files on our aggregation host and watch them grow in real time. This also means that by processing the new data added to log files, we can calculate real-time metrics and monitor for anomalies; this is powerful!
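To make this concrete, here is a minimal Python sketch of the kind of processing a continuously growing aggregate log affords: extracting HTTP status codes from Apache common-log-format lines as they arrive and keeping running counts that could feed real-time metrics or anomaly detection. The regular expression and sample line are illustrative assumptions, not part of the original text:

```python
import re
from collections import Counter

# Match the request portion of a common-log-format line, e.g.
# "GET / HTTP/1.0" 200 1234, and capture the 3-digit status code.
LINE_RE = re.compile(r'"\S+ \S+ \S+" (\d{3}) ')

def update_metrics(line, counts):
    """Update running status-code counts from one log line."""
    m = LINE_RE.search(line)
    if m:
        counts[m.group(1)] += 1
    return counts

counts = Counter()
sample = '10.0.0.1 - - [12/Oct/2006:10:00:00 -0400] "GET / HTTP/1.0" 200 1234'
update_metrics(sample, counts)
print(counts["200"])  # -> 1
```

In a real deployment, the loop driving `update_metrics` would read new lines as they are appended to the aggregated log (in the style of `tail -f`), so the counts are always current.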
Additionally, if we want to add a second logging host for redundancy, it will require reconfiguration of each log-publishing server; because this is a "push" technology, the servers must know where to push. This is a legitimate approach, but because more advanced and flexible techniques exist, we won't dig into implementing it.
Passive "Sniffing" Log Aggregation
Due to the inherent difficulty of configuring both batch aggregation and push-style unicast aggregation, the concept of passive sniffing aggregation was conceived. Using this approach, we "monitor" the network to see what web traffic (or other protocol) is actually transiting, and we manufacture logs based on what we see. This technique eliminates the need to reconfigure web servers when the logging host changes and allows a secondary failover instance to run with no significant modification of network topology or reconfiguration of servers.
The beauty of this implementation is that you will manufacture logs for all the traffic you see, not just the traffic that hits the web servers you intended. In other words, it is more "Big Brother" in its nature. It uses the same tried-and-true techniques that exist in network intrusion detection systems. In a simple sense, the technique is similar to the traditional promiscuous network sniffing provided by tools such as tcpdump, snoop, ethereal, and snort. The log aggregator is designed to passively reconstruct the protocol streams (which can be rather expensive in high-throughput environments) so that it can introspect each stream and generate a standard-looking log for traffic seen on that protocol (for example, an Apache common log format file). I welcome you to pause at this point to ponder why this solution is imperfect.
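The capture itself is the same mechanism those tools use. For instance, a raw capture of web traffic for offline stream reconstruction might look like the following (the interface name and output path are examples); a passive aggregator performs the equivalent continuously and reconstructs sessions on the fly:

```
# Capture full packets (-s 0) on TCP port 80 to a file; this is the
# raw input from which a passive aggregator reconstructs HTTP sessions.
tcpdump -i eth0 -s 0 -w /var/tmp/http.pcap 'tcp port 80'
```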
In Figure 9.3, we see that the passive log aggregation unit is placed in front of the architecture so that it can witness all client-originating traffic. Although it can be deployed in other parts of the network, this is the best position to attempt to reproduce the logs we would expect from our batched aggregation and/or unicast push aggregation.
Figure 9.3. Passive "sniffing" log aggregation.
This approach is inadequate in several ways, however. First, it requires a network topology in which a single point can monitor all incoming network traffic; effectively, one egress point. The larger your network, the less likely your topology will be conducive to this restriction. Not only must your topology lend itself to this approach, but implementing it also adds inflexibility to the architecture. For instance, referring back to Chapter 6, we have several web serving clusters serving traffic from around the world, and, as such, there is no single egress point we could monitor to "catch" all the web requests for logging. This becomes a matter of the technology fitting well into your architecture, and, often, these issues aren't showstoppers.
What else is wrong? Many protocols are specifically designed to prevent passive listening. HTTP over Secure Sockets Layer (SSL) is one such protocol. SSL traffic is designed to be secure from eavesdropping and injection (man-in-the-middle attacks). As such, there is no good way for a passive log aggregator to determine the data payload of HTTP sessions performed over SSL. A security expert could argue that if the passive listener had access to the private key used by the web server (which would be simple to arrange in this case), it could effectively eavesdrop. However, there is no way to eavesdrop efficiently, and we are talking about scalable systems here. This shortcoming is a showstopping obstacle in almost every production architecture.
Let's think about what would happen if we tried to passively decrypt SSL sessions, reconstruct the payload, and log the traffic. What happens when we don't have enough horsepower to get the job done? In any networked system, when too much is going on, you start dropping packets. Because these packets are not destined for the aggregator, it cannot request a retransmission (doing so would break the TCP/IP session). If packets are dropped, there is no way to reconstruct the session; you end up discarding the incomplete session and, with it, the logs that should have been generated. There will also be countless "legitimate" TCP/IP sessions that are not complete enough to log (for example, when a client disconnects or there is a sustained routing issue). As such, these passive aggregators typically log no errors, or, if they do, the errors caused by resource-saturation-induced packet loss are mixed in with the countless "errors" from those legitimate incomplete sessions. Basically, you'll be losing logs without knowing it. Although this is a near certainty if you were to attempt to tackle SSL traffic, you could also be logging plain HTTP, somehow saturate your resources, and lose logs without really knowing it.
The bottom line here is that if your logs are vital to your business, this type of solution is prone to problems and generally a bad idea.