Why Is Logging a Challenge? | Scalable Internet Architectures

Logging is a challenge for two reasons: volume and aggregation. Let's take a large website as an example. In Chapter 6, "Static Content Serving for Speed and Glory," we had a goal to serve 11,750 requests per second at 70% capacity. To handle 11,750 requests per second at 70% we need an actual capacity of 16,786 requests per second, so we will round up and assume that our logging system must handle journaling and analyzing 17,000 log lines per second. Although this is peak traffic during our peak time (constituting 15% of the day's traffic), this still winds up being 360 million log events per day. That is quite a few log messages to write. Additionally, no single source is generating these events; they are streaming in from at least 20 different web servers.

If you compare the logs from your web server, database server, switch, workstation, and even your mobile phone, you will find one thing in common: time stamps. Every logging event occurred at a particular point in time, and knowing when it occurred is fundamental to its usefulness. This may seem obvious, but the importance of it is multi-faceted, and the classic "solution" to the problem may not be sufficient to solve the larger problem at hand.

Knowing when a logging event occurred allows us to quantize information for general consumption. A typical example of this is a report detailing the quantity of web request services broken out by HTTP code on 1-hour intervals, 30-minute intervals, and 5-minute intervals. This type of report is common because it is useful for determining overall traffic trends. Although this type of reporting requires all log events to have time stamps, the relative accuracy of those time stamps does not affect the outcome of the report. In other words, a time skew of 5, 10, or even 30 seconds across the web servers would not dramatically skew the results of such a report where the time granularity of the report is significantly larger than the time skew on machines.

Let's look at a misconfiguration in this situation. Perhaps a single machine is not configured to observe daylight savings time, but the others are. This would clearly damage the report, but how much? From a signal analysis perspective it would be as if the signal had 5% loss and an echo with one-hour latency with around 5.3% signal strength. This type of mild signal "damage" is easy for a human to tune out and otherwise compensate for. It does not give a perfect representation of what is going on at any given time, but the ability to understand overall trends is not compromised.

This may lead us to believe that, although time stamps are crucial, their accuracy is highly overrated. This might be true in the specific case of quantized reports; however, there are other uses for log events where the relative accuracy of time stamps is paramountenter the land of distributed system troubleshooting.

By definition, troubleshooting occurs only when there is trouble. Trouble in a deterministic system is simple cause and effect. The cause is your problem, and the effect is the trouble. On a single system, troubleshooting is at its most basic. You are looking for an event on the system that induced failure directly or destabilized the environment in a fashion that allowed a failure to be noticed. All the components playing a part are running on a single system, with a single clock, and thus all logs generated are written "in order" and allude to a cause and effect stream of occurrences.

Because this book is all about scalable systems, we don't deal much with how to solve problems that involve a single machine. When troubleshooting problems whose ill effects are the product of a set of actions that occur across a cluster of machines, we have a troubleshooting environment that is far from "basic."

For example, let's look at a flow through our website that results in the purchase of a paid service subscription. If we are alerted to the fact that the arrival-to-conversion rate has dropped by some non-nominal amount, we must look for a cause for that effect. To effectively troubleshoot this problem, we must have a view into each purchasing "session" and all the events that happen during that session. We must have them all in the order they occurred to be relevant to our troubleshooting task. In this situation, it is imperative that we log all events that happenclusterwidein an order that represents their relative transactional timing. In other words, events that comprise a "transaction" must be logged in the order they were executed. This is a difficult problem.