Chapter 2. Principles for Avoiding Failure | Scalable Internet Architectures

2. Principles for Avoiding Failure

Avoiding failure is a goal shared by every operations, business, and development team. Although I am more of a manager and software engineer now, my roots are in systems administration. During the dot-com bubble, I managed the systems and network administration for several large dot-com infrastructures.

I may be biased (though I think I'm right) in saying that avoiding failure is more fundamental to systems administration than to many other disciplines. Why is this so? Because systems administrators are the ones who are awake at 3 a.m. when disaster strikes and who spend the next 36 consecutive bleary-eyed hours trying to fix the problem.

People on the business side want things desperately, but wanting alone is not enough because they lack the skills to execute on those wants personally. Instead, they collaborate with two teams of people: the development group and the operations group. Because the development group works so closely with the business group to satisfy its demands, both groups tend to lose the valuable insights offered by the operations group. At the end of the day, things have to run. The sole purpose of an operations group is to make sure that things run smoothly. Having the operations group available and participating regularly in the goal-oriented business meetings and technical development meetings can be enormously valuable.

Systems administrators like things automated. Ideally, everything would run itself, and all our time would be spent reading our favorite websites and RSS feeds. In some environments this is possible. Large web architectures have a tendency to deviate from this utopia. Many sites still have break-neck development schedules and are constantly adapting to business needs that demand newly developed technology to drive them. This means that the business demands that the development team relentlessly launch new code into production.

In my experience, developers have no qualms about pushing code live to satisfy urgent business needs, without regard for the fact that it may capsize an entire production environment at the most inopportune time. This happens repeatedly. My assumption is that a developer feels that refusing to meet a demand from the business side is more likely to result in termination than is the huge round of finger-pointing that will ensue post-launch.

It is not, however, a lost cause. By understanding the dynamics of the business, you can formulate policies and procedures to prevent these things from happening. Regardless of whether formal procedures are even feasible in your company, or whether fully written procedures for developing and writing procedures already exist, a set of principles for avoiding failure can be adopted. First and foremost is the education of the development team and the operations team regarding the ramifications of failure.