Chapter 2. Principles for Avoiding Failure



Avoiding failure is a goal shared by every operations, business, and development team. Although I am more of a manager and software engineer now, my roots are in systems administration. During the dot-com bubble, I managed the systems and network administration for several large dot-com infrastructures.

I may be biased (though I think I'm right) in saying that avoiding failure is more fundamental to systems administration than to many other disciplines. Why is this so? Because systems administrators are the ones who are awake at 3 a.m. when disaster strikes and who spend the next 36 consecutive bleary-eyed hours trying to fix the problem.

People on the business side want things desperately, but the want is incomplete because they do not have the skills to execute those wants personally. Instead, they collaborate with two teams of people: the development group and the operations group. Because the development group works so closely with the business group to satisfy its demands, both groups tend to lose the valuable insights offered by the operations group. At the end of the day, things have to run. The sole purpose of an operations group is to make sure that things run smoothly. Having the operations group available and participating regularly in the goal-oriented business meetings and technical development meetings can be enormously valuable.

Systems administrators like things automated. Ideally, everything would run itself, and all our time would be spent reading our favorite websites and RSS feeds. In some environments this is possible. Large web architectures have a tendency to deviate from this utopia. Many sites still have breakneck development schedules and are constantly adapting to business needs that demand newly developed technology to drive them. This means that the business demands that the development team relentlessly launch new code into production.

In my experience, developers have no qualms about pushing code live to satisfy urgent business needs, without regard to the fact that it may capsize an entire production environment at the most inopportune time. This happens repeatedly. My assumption is that a developer feels that refusing to meet a demand from the business side is more likely to result in termination than the huge finger-pointing that will ensue post-launch.

It is not, however, a lost cause. By understanding the dynamics of the business you can formulate policies and procedures to prevent these things from happening. Regardless of whether formal procedures are even feasible in your company or whether fully written procedures for developing and writing procedures already exist, a set of principles for avoiding failure can be adopted. First and foremost is the education of the development team and the operations team regarding the ramifications of failure.



Working in Production Environments

There are so many different levels of production environments that it is difficult to speak to them all. The principles in this chapter can be applied to any important computing environment. Because this book is about Internet architectures, we won't address anything but the infrastructure used to directly service customers over the Internet.

Because most large architectures are run by multidisciplinary teams, there tends to be more than one set of guidelines for avoiding failure.

In the end it comes down to "don't be stupid." Although this is a simple and intuitive expectation, what is clearly "stupid" on a small architecture is often subtle on large architectures run by multiple teams. There are three reasons that teams are used to manage projects:

  • The work to be done on large systems exceeds what is humanly possible for a single individual to accomplish.

  • It is less expensive and easier to find individuals with a deep, focused expertise in one of the technologies that powers the system than to find individuals with deep and broad expertise across every technology in the system.

  • The "key man problem": if your key man is hit by a bus (or simply quits), you must have business continuity.

All are obvious reasons, all are important, and all contribute a bit of chaos to the overall system. Through the application of good practices and the use of established tools, the chaos can be kept in check. Without the appropriate approach, working in a fast-moving production architecture is like working on a construction site without OSHA compliance or hard hats: stupid.

Although it may not seem an obvious application at first, in the end it boils down to Murphy's Law, or more specifically, how to avoid it. Finagle's Law (a more generalized version of Murphy's Law) says: "Anything that can go wrong, will." More formally, however, Finagle's Law looks something like this:

((U+C+I) x (10-S))/20 x A x 1/(1-sin(F/10))

urgency (U): 1 <= U <= 9,

complexity (C): 1 <= C <= 9,

importance (I): 1 <= I <= 9,

skill (S): 1 <= S <= 9,

frequency (F): 1 <= F <= 9

This equation expresses Finagle's Law (a.k.a. Sod's Law) as a score built from the factors listed above. Although the equation, commissioned by British Gas, may seem a bit contrived even though it regresses well against British Gas's dataset, it still provides excellent insight into the nature of the problems architects face.
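To get a feel for how the factors interact, the formula can be evaluated directly. The following is only a minimal sketch: the function name `sods_law_score` is hypothetical, the sine is assumed to take its argument in radians, and the text does not define A. Press coverage of the British Gas formula generally describes A as "aggravation" with a fixed value of 0.7, so that value is used here as a labeled assumption rather than as something the text states.

```python
import math

# ASSUMPTION: the text never defines A. Press coverage of the British Gas
# formula reports it as "aggravation" with a value of 0.7; treat it as a guess.
ASSUMED_AGGRAVATION = 0.7


def sods_law_score(urgency, complexity, importance, skill, frequency,
                   aggravation=ASSUMED_AGGRAVATION):
    """Evaluate ((U+C+I) x (10-S))/20 x A x 1/(1-sin(F/10)).

    Each of U, C, I, S, and F must lie between 1 and 9, as in the text.
    The sine is evaluated in radians (an assumption; the text does not say).
    """
    factors = {
        "urgency": urgency,
        "complexity": complexity,
        "importance": importance,
        "skill": skill,
        "frequency": frequency,
    }
    for name, value in factors.items():
        if not 1 <= value <= 9:
            raise ValueError(f"{name} must be between 1 and 9, got {value}")

    # (U + C + I) x (10 - S) / 20: more at stake and less skill raise the score.
    load = (urgency + complexity + importance) * (10 - skill) / 20
    # 1 / (1 - sin(F / 10)): doing the task more often raises the score.
    recurrence = 1 / (1 - math.sin(frequency / 10))
    return load * aggravation * recurrence


if __name__ == "__main__":
    # An urgent, complex, important change made by a less skilled operator...
    rushed = sods_law_score(urgency=9, complexity=8, importance=9,
                            skill=3, frequency=7)
    # ...versus the same change made by a highly skilled one.
    careful = sods_law_score(urgency=9, complexity=8, importance=9,
                             skill=9, frequency=7)
    print(f"rushed: {rushed:.2f}, careful: {careful:.2f}")
```

Whatever value A takes, the shape of the formula makes the chapter's point: urgency, complexity, and importance add up, frequency amplifies them, and skill is the only term in the list that pulls the score down.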