Logging Done Right | Scalable Internet Architectures

We want a few simple things to make our logs more reliable and the data inside them work for us on an operational level. When something goes wrong, we all turn to logs to attempt to backtrack from the symptom to the cause. If the cause is in the logs, why should we have to look for it? Why not simply be told about it?

From a reliability standpoint, we want our logs, all of them. If a server crashes, we need to know what transpired up to the point of failure. It is much less useful to bring the machine back online (if possible) at a later point and attempt to retrieve and integrate the old, as-yet-unseen logs. Additionally, if our logs are important, we may want to write them to more than one location at a time: redundant logging servers.

Making logs work for your architecture on an operational level is something you may never have thought of, but you should. Logs hold invaluable data about events that have just transpired, and, in a simple way, they can provide a great deal of evidence that something has gone wrong. By aggregating logs in real time, we can analyze them to determine whether they meet our expectations. Some common expectations that can be transformed into monitors are an upper bound on 500 errors from the servers (perhaps 0), an upper bound on 404 errors, and an expected number of requests served per second, expressed either as a raw number or as a reasonable proximity to some historical trend. The goal is to construct business-level monitors that can trigger in real time, rather than taking the typical approach of looking at 24-hour statistical roll-ups and realizing that problems exist only after they have already had a significant impact on your business.
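As an illustration of turning such expectations into a monitor, here is a minimal sketch in Python. It assumes log lines arrive one at a time in Common Log Format and that each line comes with a numeric timestamp in seconds; the class name `StatusMonitor`, the window length, and the limits are all hypothetical choices made for this example, not part of any particular logging system.

```python
import re
from collections import deque

# Matches the HTTP status code that follows the quoted request
# in a Common Log Format line, e.g. ... "GET / HTTP/1.0" 200 2326
STATUS_RE = re.compile(r'" (\d{3}) ')


class StatusMonitor:
    """Hypothetical sliding-window monitor for upper bounds on status codes."""

    def __init__(self, window_seconds, limits):
        self.window = window_seconds   # length of the sliding window
        self.limits = limits           # status code -> max count per window
        self.events = deque()          # (timestamp, status), oldest first
        self.counts = {}               # status code -> count inside window

    def observe(self, timestamp, line):
        """Feed one log line; return alerts for any limit now exceeded."""
        match = STATUS_RE.search(line)
        if match is None:
            return []                  # not a line we know how to parse
        status = int(match.group(1))
        self.events.append((timestamp, status))
        self.counts[status] = self.counts.get(status, 0) + 1

        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] <= timestamp - self.window:
            _, old_status = self.events.popleft()
            self.counts[old_status] -= 1

        alerts = []
        for code, limit in self.limits.items():
            seen = self.counts.get(code, 0)
            if seen > limit:
                alerts.append(f"{code} count {seen} exceeds limit {limit}")
        return alerts
```

A monitor built as `StatusMonitor(60, {500: 0, 404: 50})` would raise an alert on the first 500 error seen in any 60-second window and on the fifty-first 404. A requests-per-second check against a historical trend could be added in the same way, by comparing the number of events in the window with an expected range.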

Although the real-time delivery of logs is a critical requirement not found in other logging methodologies, there are other advantages to be had. In particular, an efficient substrate for multiple log subscribers is key to the overall usability of the solution. If we are going to start using our logs in an operational sense and have fault-tolerant log journaling in place, we will have many different processes subscribing to the log stream. This may not sound like a big deal, but logging incurs a not-insignificant cost. Writing logs to multiple locations can increase that cost linearly unless the right approach is taken. Here we want reliable multicast.

We need to be able to allow many subscribers to the log streams without the potential of bringing the system to its knees. A reliable protocol that uses IP multicast will allow us to scale up the number of subscribers without scaling up the network infrastructure to support it. Spread is one such tool that provides reliable messaging over IP multicast.
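To make the scaling argument concrete, the following sketch compares how many bytes a publishing server would send under per-subscriber delivery and under multicast. It is a simplified model, not a measurement: the function names are hypothetical, and it deliberately ignores protocol overhead, group membership traffic, and the retransmissions a reliable multicast protocol needs.

```python
def publisher_bytes_unicast(message_bytes, messages, subscribers):
    """Bytes sent when every message is written once per subscriber."""
    return message_bytes * messages * subscribers


def publisher_bytes_multicast(message_bytes, messages, subscribers):
    """Bytes sent when every message is written once to a multicast group.

    The subscriber count is accepted only to keep the two signatures
    parallel; in this idealized model it does not affect the result.
    """
    return message_bytes * messages


def compare(message_bytes, messages, subscriber_counts):
    """Return (subscribers, unicast_bytes, multicast_bytes) rows."""
    return [
        (
            n,
            publisher_bytes_unicast(message_bytes, messages, n),
            publisher_bytes_multicast(message_bytes, messages, n),
        )
        for n in subscriber_counts
    ]
```

Under these assumptions, `compare(200, 1000, [1, 2, 8])` shows the per-subscriber cost growing from 200,000 to 1,600,000 bytes while the multicast cost stays at 200,000 bytes, which is the property that lets subscribers be added without scaling up the network to match.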