Coping with Release Cycles | Scalable Internet Architectures

Most architectures, even the small and simple ones, are much more complicated than they first appear. For truly mission-critical applications, every piece must be thoroughly tested and retested before it is deployed in production. Even the simplest of architectures has hardware, operating systems, and server software. More complicated architectures include core switches and routers (which have software upgrade requirements), databases, load balancers, firewalls, email servers, and so on.

Managing the release cycles of external software, operating systems, and hardware is a challenge in mission-critical environments. Flawless upgrades are a testament to a good operations group.

Managing internal release cycles for all custom-built applications and the application infrastructure that powers a large website is a slightly different beast because the burden is no longer solely on the operations group. Development teams must have established practices and procedures, and, more importantly, they must follow them.

Internal Release Cycles

The typical environment behind a production service, mission-critical or not, has three vital components: development, staging, and production.

Development

Development is where things break regularly, and experiments take place. New architectural strategies are developed and tested here, as well as all application implementation.

In particularly large and well-funded organizations, research and development are split into two entities. In this scenario, things do not regularly break in development, and no experimentation takes place. Development is for the creation of new code to implement new business requirements.

The research architecture is truly a playground for implementing new ideas. If a substantial amount of experimentation takes place, splitting these architectures is important. After all, having a team of developers sitting by idly watching others clean up the mess of an experiment "gone wrong" is not a good financial investment.

Why research at all? If your business isn't technology, there is a good argument not to do any experimentation. However, staying ahead of competitors often means trying new things and adopting different ideas before they do. This applies equally to technology and business. A close relationship with vendors sometimes satisfies this, but ultimately, the people who live and breathe the business (your team) are likely to have a more successful hand in creating innovative solutions that address your needs.

Staging

Applications and services are built in development and are tested there as part of their construction. Yet staging is the real infrastructure for testing. The question in staging is not whether the code works at all, because that was answered in development. The question is whether it still works when it runs in an environment that mirrors production.

This environment should be as close to the production environment as possible (ideally an exact replica) down to the hardware and software versions. Why? Complex systems are, by their very definition, complex. This means that things can and will go wrong in entirely unexpected ways, and any difference between staging and production is a place where such a failure can hide until the release is live.

The other big advantage that comes with an identical staging and production environment is that new releases need not be pushed (moved from staging to production). Because the environments are identical, when a new release has been staged, tested, and approved, the production traffic is pointed at the staging environment, and the two environments switch roles.
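The role switch can be sketched in a few lines of Python. This is only an illustration: the `Environment` and `TrafficDirector` names, the cluster names, and the release labels are hypothetical, and a real switch would be carried out by a load balancer or DNS change rather than by swapping two variables.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    name: str      # hypothetical cluster name
    release: str   # release currently installed on that cluster


class TrafficDirector:
    """Hypothetical stand-in for whatever points production traffic at a cluster."""

    def __init__(self, production: Environment, staging: Environment):
        self.production = production
        self.staging = staging

    def promote(self, approved: bool) -> None:
        # Refuse to switch unless the staged release was tested and approved.
        if not approved:
            raise RuntimeError("staged release has not been approved")
        # Nothing is copied between clusters; the two simply trade roles.
        self.production, self.staging = self.staging, self.production


director = TrafficDirector(
    production=Environment("cluster-a", "1.0"),
    staging=Environment("cluster-b", "1.1"),
)
director.promote(approved=True)
print(director.production)  # cluster-b now serves production traffic
print(director.staging)     # cluster-a becomes the next staging environment
```

One consequence of this design is that the previous release stays installed on the cluster that has just become staging, so switching back is the same operation in reverse.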

Staging new releases of internal (and external) components provides a proving ground where true production loads can be tested. The interaction of changed pieces and the vast number of other components can be witnessed, debugged, and optimized. Often, the problems that arise in staging result in destaging and redeveloping.

The architecture must allow operations and development teams to watch things break, spiral out of control, and otherwise croak. Watching these things happen leads to understanding the cause and in turn leads to solutions.

Most architectures are forced to cope with two different types of internal releases. The first is the obvious next feature release of the application. This contains all the business requirements specified, built, tested, and integrated since the last release. The other type of internal release is the bug fix. These are incremental and necessary fixes to the current release running in production.

Bug fixes are usually staged in an environment that is much smaller than the current production environment. Because they are minor changes, the likelihood that they will cause an unexpected impact on another part of the architecture is small. The true mission-critical environments have three identical production environments: one for production, one for staging revisions, and another for staging releases.

Production

Production is where it all happens. But in reality, it is where nothing should happen from the perspective of developers and administrators. Things should be quiet, uneventful, and routine in a production environment. Money and time are invested in development environments and staging environments to ensure this peace of mind.

A Small Dose of Reality

Few businesses can afford to invest in both a complete development environment and a complete staging environment. This is not necessarily a horrible thing. Business, like economics, is based on the principle of cost versus benefit, and success relies on making good decisions based on cost-benefit information to increase return on investment. The introduction of technology into a business does not necessarily change this. This is perhaps one of the most difficult lessons for a technically oriented person to learn: The best solution technically is not always the right solution for the business.

Over the years, I have consulted for many a client who wanted to avoid the infrastructure costs of a solid development and staging environment. Ultimately, this is a decision that every business must make. Because it is impossible to say what will happen if you don't have an adequate staging environment, I'll place some numbers from my experience on the potential costs of not having good procedures and policies and maintaining the appropriate infrastructure to support them.

I worked on an architecture that had about a million dollars invested in hardware and software for the production environment, but the owner was only willing to invest $10,000 in the development and staging environment combined. With resources limited that way, proper staging and thorough developmental testing were impossible. Given that, about 1 in 5 pushes into production had a mild negative impact due to unexpected bugs, and about 1 in 500 pushes failed catastrophically. Before we judge this to be an ideological error, understand that all these decisions simply come down to business sense.

The mild mistakes were fixed either by reverting the bad fragments or with a second push of corrected code, and the catastrophic errors were handled by reverting to a previous known-good copy of the production code. As it turned out, these failures generally cost the business nothing directly and resulted in only marginal unrealized profits.
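Reverting to a known-good copy presumes that someone records which pushes were verified in production. A minimal sketch of such a record in Python follows; the `ReleaseLog` class, its methods, and the release identifiers are hypothetical and do not describe the tooling used on the architecture above.

```python
class ReleaseLog:
    """Hypothetical record of production pushes and which ones were verified."""

    def __init__(self):
        self._pushes = []  # oldest first; each entry is {"id": ..., "known_good": ...}

    def push(self, release_id: str) -> None:
        self._pushes.append({"id": release_id, "known_good": False})

    def mark_known_good(self, release_id: str) -> None:
        for entry in self._pushes:
            if entry["id"] == release_id:
                entry["known_good"] = True
                return
        raise KeyError(release_id)

    def current(self) -> str:
        return self._pushes[-1]["id"]

    def rollback(self) -> str:
        """Discard every push after the most recent known-good one and return its id."""
        for index in range(len(self._pushes) - 1, -1, -1):
            if self._pushes[index]["known_good"]:
                del self._pushes[index + 1:]
                return self._pushes[index]["id"]
        raise LookupError("no known-good release to revert to")


log = ReleaseLog()
log.push("r1")
log.mark_known_good("r1")
log.push("r2")
log.mark_known_good("r2")
log.push("r3")               # this push fails catastrophically in production
restored = log.rollback()    # the log falls back to the last verified release
print(restored, log.current())
```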

A fully fledged staging and development environment could have cost an additional two or three million dollars. The cost of the regular small mistakes and the rare catastrophic error was found to be less than the initial investment in, and maintenance of, an architecture that could reduce the likelihood of such mistakes.
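This comparison is plain expected-cost arithmetic. The Python sketch below uses the failure rates quoted above (1 in 5 mild, 1 in 500 catastrophic) and the two-million-dollar low end of the estimate. The number of pushes and the cost per incident are invented purely for illustration and are not figures from the engagement described; substitute your own.

```python
def expected_failure_cost(pushes, mild_rate, mild_cost,
                          catastrophic_rate, catastrophic_cost):
    """Expected total cost of failed pushes over a given number of pushes."""
    per_push = mild_rate * mild_cost + catastrophic_rate * catastrophic_cost
    return pushes * per_push


# Rates taken from the text.
MILD_RATE = 1 / 5
CATASTROPHIC_RATE = 1 / 500

# Assumptions for illustration only (not from the text).
pushes_over_lifetime = 1000
cost_per_mild_failure = 500            # dollars
cost_per_catastrophic_failure = 50_000  # dollars

# Low end of the "two or three million dollars" estimate from the text.
staging_investment = 2_000_000

failure_cost = expected_failure_cost(
    pushes_over_lifetime,
    MILD_RATE, cost_per_mild_failure,
    CATASTROPHIC_RATE, cost_per_catastrophic_failure,
)
worth_investing = failure_cost > staging_investment
print(f"expected failure cost: ${failure_cost:,.0f}")
print(f"staging investment:    ${staging_investment:,.0f}")
print("build the full staging environment" if worth_investing
      else "accept the occasional failure")
```

Under these assumed incident costs the expected losses stay well below the investment, which matches the conclusion reached for that business. A bank would plug in a very different cost per catastrophic failure and reach the opposite answer.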

But not all businesses are the same. If a bank took this approach...well, I wouldn't have an account there.

External Release Cycles

Managing external release cycles is the art of upgrading the software and hardware products deployed throughout an architecture that are not maintained internally. These products typically constitute 99% of an architecture and include machinery, operating systems, databases, and web server software, just for starters.

External releases are typically easier to handle on an individual basis because they come as neatly wrapped packages from the vendor (even the open-source products). However, because 99% of the architecture consists of external products from different vendors, each with its own release cycle, the problem is compounded into an almost unmanageable mess.

On top of the difficulty of rolling many unrelated external releases into one controlled release to be performed on the architecture, emergency releases complicate matters further.

Emergency releases are releases that must be applied with extreme haste to solve an issue that could expose the architecture from a security standpoint or to resolve a sudden and acute issue (related to performance or function) that is crippling the business.

Examples of emergency releases are abundant in the real world:

An exploit in the OpenSSL library is found, which sits under mod_ssl, which sits inside Apache to provide secure web access to your customers. Simply put, all your servers running that code are vulnerable to being compromised, and it is essential that you upgrade them as quickly as is safely possible.
A bug is found in the version of Oracle used in your architecture. Some much-needed performance tuning was done, and the bug is manifesting itself acutely. You open a ticket with Oracle, and they quickly track down the bug and provide you with a patch to apply. That patch must be applied immediately because the problem is crippling the applications that are using Oracle.

The preceding examples are just two of the countless real-life emergency rollouts that I have encountered.
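In the first kind of emergency, the immediate question is which machines are running the affected code. A minimal Python sketch of that check follows, assuming an inventory that maps each host to its installed library version. The host names and version numbers are hypothetical, and the comparison assumes purely numeric dotted versions; real version strings often need a more careful parser.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted numeric version such as '1.0.3' into (1, 0, 3)."""
    return tuple(int(part) for part in text.split("."))


def hosts_needing_upgrade(inventory: dict, fixed_version: str) -> list:
    """Return the hosts whose installed version is older than the fixed one."""
    fixed = parse_version(fixed_version)
    return sorted(
        host for host, installed in inventory.items()
        if parse_version(installed) < fixed
    )


# Hypothetical inventory: host name -> installed version of the affected library.
inventory = {
    "web1": "1.0.2",
    "web2": "1.0.4",
    "web3": "1.0.3",
}

vulnerable = hosts_needing_upgrade(inventory, fixed_version="1.0.4")
print(vulnerable)
```

Comparing parsed tuples rather than raw strings matters here: as strings, "1.0.10" sorts before "1.0.9", which would misreport a patched host as vulnerable.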

The truth is that managing external releases is the core responsibility of the operations group. It is not a simple task, and an entire book (or series of books) could be written explaining best practices for this management feat.