Stability and Control | Scalable Internet Architectures

Stability and control sound like obvious requirements in a mission-critical system, and indeed they are. However, it should be no surprise that the demands that business places on information technology infrastructure can easily destabilize a large production system. Data systems, web servers, mail servers, and the like are all tools to accomplish some greater business goal. It is absolutely crucial to remember that, and by doing so you can avoid pursuing the wrong technology challenges and protect the infrastructure from uncontrolled change.

Uncontrolled change is perhaps the most fearsome monster in the dungeons of technology. It is a monster that takes many shapes: feature creep, milestone hopping (pulling features from future product milestones into the current), sloppy version control, and even premature implementation. Because it comes in many forms, uncontrolled change can be difficult to recognize and as such often goes unnoticed, until... there is an infestation of uncontrolled change in a variety of shapes and forms in various components and the architecture simply implodes. I've seen such glorious failures, and they carry heavy casualties, often entire businesses.

There is no final ultimate solution to stability and control that will please everyone. However, it is important to satisfy each business unit's needs as completely as possible. (From a pessimist's standpoint, this means not dissatisfying anyone to the point of rebellion.) Effectively, the business side should want technical innovation and deployment "on-demand." Although horsepower is pretty much available in that fashion, architectural design, business logic adaptation, and the overall application development process are not. On the flip side, the technology side of the house wants complete specifications for projects so that the design investment occurs once, and maintenance is truly maintenance (bug fixes and minor revisions). However, the truth of business is that business changes and with that comes the need to reinvent the business processes and technologies that power it. The two organizational units are at odds.

You can spend a lifetime and invest a fortune devising and implementing a strategy where everyone wins. Honestly, I don't think you'll succeed (if you do, somebody could have been pushing harder and accomplishing more). I have found, however, that by exercising the proper amount of resistance with respect to uncontrolled change, you can ensure no losers, and that's a darn good outcome. This "resistance" is simply using the right tools and techniques to manage and maintain your technology.

Rapid Development

Although this may not apply to everyone, I'll wager it applies to the vast majority of readers. Where I work, rapid development is about 95% of what we do. For many of the systems I touch, project deadlines are measured in hours or tens of hours. In the rapid development model, the luxuries of clearly delineated development, testing, and quality assurance phases are replaced by an amorphous development cycle that splits testing responsibilities across all participants. Although traditional development and testing cycles measure in weeks or months, these cycles spin so fast I can only sit back and say "Oh, neat. A pinwheel." If this sounds familiar, stability may seem like something beyond the horizon, but control is certainly possible.

How do you control such a beast? There isn't one solution, no silver bullet. But like any self-respecting engineer, we can grab the low-hanging fruit. The heaviest fruit on this tree is version control. Version control is simple, safe, reliable, and, quite frankly, you would be out of your mind not to use it. Simply sticking all your code in version control is the baseline, but by truly leveraging its power you can put in place a peace of mind that is invaluable.

The next step in controlling the beast is planning: good planning. What is a plan? Simply put, a plan is the steps taken to perform a change from configuration A to configuration B. Realistically, those steps are planned before you perform them. A change in configuration is anything that affects the production environment: code pushes, database schema changes, vendor software fixes and upgrades, OS patches and upgrades, hardware changes, network changes, and so on. Where I work, all these things are given a single name: push. A push is a controlled and planned action that changes the production architecture.

So, if you think up what you want to do and then do it, you had a plan, right? Yes, but this lack of diligence leads to catastrophic failures. Planning is time consuming and challenging, but it isn't rocket science. A successful push plan has four parts: a plan to get from A to B, a plan to get from B to A, a plan to restore A from bare metal, and a successful test of the first two. (Testing a bare metal restore for every push would be suicide, or at least leave you constantly contemplating it.) Although it is always a good idea to have a tested plan for reverting a change (backing out a push), in a rapid development environment it is fundamental. Rapid development often leads to less testing and quality assurance, which means that changes are pushed, and bad things happen.
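The four parts of a push plan can be written down as a small data structure. The following is only a minimal sketch, assuming a hypothetical `PushPlan` class and `execute` helper (neither comes from any real tool): the push is refused unless the forward and backout steps have both been tested and a bare-metal procedure is on file, and the backout runs automatically if verification fails.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PushPlan:
    """Hypothetical model of a push plan; all names are illustrative."""
    name: str
    forward: Callable[[], None]    # plan to get from A to B
    backout: Callable[[], None]    # plan to get from B to A
    bare_metal_doc: str            # where the restore-A-from-bare-metal steps are written down
    forward_tested: bool = False   # has the A-to-B plan been rehearsed?
    backout_tested: bool = False   # has the B-to-A plan been rehearsed?

    def missing_parts(self) -> List[str]:
        """List whatever still prevents this plan from being pushed."""
        missing = []
        if not self.forward_tested:
            missing.append("tested forward plan")
        if not self.backout_tested:
            missing.append("tested backout plan")
        if not self.bare_metal_doc:
            missing.append("bare-metal restore procedure")
        return missing

    def is_ready(self) -> bool:
        return not self.missing_parts()


def execute(plan: PushPlan, verify: Callable[[], bool]) -> str:
    """Run the push; back it out if the verification check fails."""
    if not plan.is_ready():
        raise ValueError(f"{plan.name} is incomplete: {plan.missing_parts()}")
    plan.forward()
    if verify():
        return "pushed"
    plan.backout()
    return "backed out"
```

The bare-metal restore is recorded but deliberately not exercised here, matching the point that only the first two plans are tested for every push.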

In a general sense, it just makes sense. From the technical side, however, there is more than just the safety of the IT infrastructure. With a thorough, documented plan, you can look a peer, boss, or customer in the eye and explain that pushes can and do go wrong, but you were prepared. You had 100% confidence that you could recover, a reasonable ETA for rectifying the situation (because you had tested it), and the unexpected downtime was kept to a minimum because the recovery efforts were preplanned and executed with confidence.

Although there are many a fruit on this tree of solutions, the last one I'll mention is unit testing. Rapid development suffers from rapid changes. Rapid changes can lead to breakage that goes unnoticed. Unit testing is one solution to that problem, and, although it is not nearly as "low hanging" as version control and proper planning, it is so powerful that I would be remiss not to give it a chance to tell its story.

Unit Testing

When a topic is presented, you can tell whether someone likes the concept. If the person presents the disadvantages last, it allows them to elucidate on them as they see fit, and it leaves you with the disadvantages fresh on your mind. On the other hand, if the presenter goes over the disadvantages first, he is merely qualifying them and then moves on to the merits.

Unit testing is difficult and has some serious disadvantages, but it can prevent disaster. Although many of the customers, clients, colleagues, and open projects I have worked with did not subscribe to the unit testing philosophy, I will pointedly discuss its disadvantages first so that the power and purpose of unit testing is the message that concludes my mere lip service to this philosophy.

Unit testing plays on the concept that most information systems are built in a modular manner. Although these modules aren't entirely autonomous (they often sport a vast number of interconnects and dependencies), they are self-contained from the standpoint of code base and maintenance.

The reason unit testing is so difficult is that its challenges are deceptive. As with any test, you control the inputs, so the test is limited to the input sets you provide. In no way will any form of testing ever ensure that your architecture will work under every possible condition. Period. With that said, testing is designed to ensure that a system will arrive at an expected outcome given a prescribed input. This is immensely valuable, but if you can understand that this is all that testing provides, you can manage your (and others') expectations.

Writing a full, start-to-finish business testing infrastructure can be (and usually is) much more work than writing the systems that will be tested. This is why the vast majority of systems out there do not have business-cycle testing.

Unit testing, on the other hand, is designed to keep tests small, easy to manufacture, and easy to maintain. An example of business-cycle testing is a test that attempts to simulate a web client to access the pages and post the form data, walk through the site, and place an order resulting in an account creation, credit card data insertion, and successful billing. An example of unit testing is a test that takes the credit card 4444333322221111, performs a mod10 validation test, and arrives at the expected result. Pretty different.
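To make the credit card example concrete, here is a minimal sketch of a mod10 (Luhn) check and a unit test for it. The function name `luhn_valid` and the test class are hypothetical, and Python's standard `unittest` module stands in for whatever test framework your environment uses.

```python
import unittest


def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the mod10 (Luhn) check."""
    cleaned = number.replace(" ", "").replace("-", "")
    if not (cleaned.isascii() and cleaned.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for position, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


class LuhnTest(unittest.TestCase):
    def test_known_good_number(self):
        self.assertTrue(luhn_valid("4444333322221111"))

    def test_single_digit_change_is_rejected(self):
        self.assertFalse(luhn_valid("4444333322221112"))

    def test_non_digits_are_rejected(self):
        self.assertFalse(luhn_valid("4444-3333-2222-111x"))
```

Saved in a module, the tests can be run with `python -m unittest`. Note how little the test knows about the rest of the system: it needs no web client, no database, and no billing gateway.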

Unit testing does not ensure that the various infrastructure modules will be used correctly, but it does ensure that if they are used correctly that they function as they should.

The tests are small and can easily be written by the same party who writes the code to be tested; so why is this hard? Well, for starters, it takes a bit of practice and experience to learn to write good, thorough unit tests. Bad unit tests don't end up testing much and are a complete waste of time.

Next, it is dramatically more useful if you unit test everything (over unit testing only bits and pieces). Having 10 unit tests across two modules in a system with 500 modules doesn't buy you much at all, and I argue that it gives you a false sense of confidence. Because the unit tests are not actually required for the corresponding module to function, you can be delinquent and not write unit tests as thoroughly as you should, or even omit them entirely. When you slip, you slide, and your unit testing framework is compromised. Unit testing is a religion: You need to live it, breathe it, preach it, and evangelize it. Often, if you want it to succeed in your enterprise you might need to pull an old religion trick and just simply force it on everyone else.

I am not trying to say that it is bad to subscribe to unit testing for a specific module when other modules in your architecture do not. That is good, but the value add is dramatically less than a complete unit testing picture. This, of course, makes unit testing difficult to adopt completely in systems with large legacy code bases.

Because unit testing development must go hand-in-hand with existing development, it means that projects will take longer (marginally). This change can be difficult to swallow in work environments that have hectic schedules and do not want to compromise productivity. And although I can tell you and provide case studies that demonstrate that unit testing will save you both development time and downtime in the future, that didn't seem to work well for extreme programming, a programming technique that is powerful but not widely adopted.

Now that you know why unit testing is difficult, let's see what it buys you. This is best done by example, and no example serves like a real-world example.

In one of the systems I regularly work on, there is a database layer. The system talks to Oracle, and we manage that through an abstracted database layer so that we can control how queries are made and log how often queries are made, what bind values are most commonly used, and how they perform (wall clock time during query preparation, execution, and data retrieval). Generally, it is useful, and every other part of the system that interacts with the database leverages this module to do so.
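The idea of such a layer can be sketched in a few lines. The following is not the system described above (which was written in Perl against Oracle); it is a minimal illustration that assumes a hypothetical `InstrumentedDB` wrapper over Python's built-in `sqlite3` module. It also lumps preparation, execution, and retrieval into a single timing, where the real layer measured them separately.

```python
import sqlite3
import time
from collections import Counter, defaultdict


class InstrumentedDB:
    """Hypothetical abstraction layer: every query goes through query()."""

    def __init__(self, connection: sqlite3.Connection):
        self._conn = connection
        self.query_counts = Counter()        # how often each statement runs
        self.bind_counts = Counter()         # which bind values are most common
        self.elapsed = defaultdict(float)    # total wall-clock seconds per statement

    def query(self, sql: str, binds: tuple = ()):
        started = time.perf_counter()
        rows = self._conn.execute(sql, binds).fetchall()
        self.elapsed[sql] += time.perf_counter() - started
        self.query_counts[sql] += 1
        self.bind_counts[(sql, tuple(binds))] += 1
        return rows


if __name__ == "__main__":
    db = InstrumentedDB(sqlite3.connect(":memory:"))
    db.query("CREATE TABLE orders (id INTEGER)")
    db.query("INSERT INTO orders VALUES (?)", (1,))
    db.query("INSERT INTO orders VALUES (?)", (1,))
    print(db.query_counts.most_common(1))
```

Because every caller goes through one choke point, a change made there (good or bad) reaches the whole system at once, which is what makes the story that follows possible.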

One might assume the story goes: We changed the database abstraction layer, and the unit test failed; therefore the unit test was successful. That would be a fairly lame test case example. In fact, the change set and problem set were wildly obtuse and complicated. We needed to take advantage of cached queries using Perl DBI/DBD::Oracle's prepare_cached method. This was all well and good and solved the acute issue we were having (with a frequently invoked data insertion query). However, this change was made on a global level, and all queries now used prepare_cached. This did not cause the database abstraction unit test to fail, nor did it cause the transaction processing system unit test to fail. Instead, an administrative data manipulation module had its unit test fail. It turns out that there was a problem with the use of some stale cursors that would be triggered if the database connection was lost and reestablished. This error was subtle with a rather obtuse outcome in that certain processes here or there would suddenly begin to malfunction seemingly out of the blue.
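The general class of problem, a cached statement handle outliving the connection it was prepared on, can be modeled without a database at all. The sketch below is a toy model only: `FakeConnection`, `FakeStatement`, and `Database` are hypothetical stand-ins, not DBI or DBD::Oracle, and clearing the cache on reconnect is an assumed remedy for the toy, not a description of how the real bug was fixed.

```python
class FakeConnection:
    """Stand-in for a database connection; not a real driver."""

    def __init__(self):
        self.alive = True

    def prepare(self, sql):
        return FakeStatement(self, sql)


class FakeStatement:
    """A prepared statement bound to the connection that created it."""

    def __init__(self, connection, sql):
        self.connection = connection
        self.sql = sql

    def execute(self):
        if not self.connection.alive:
            raise RuntimeError("stale statement: its connection is gone")
        return "ok"


class Database:
    """Toy abstraction layer with a global statement cache."""

    def __init__(self):
        self.connection = FakeConnection()
        self._cache = {}

    def prepare_cached(self, sql):
        # Reuse a previously prepared statement when one exists.
        if sql not in self._cache:
            self._cache[sql] = self.connection.prepare(sql)
        return self._cache[sql]

    def reconnect(self, clear_cache):
        # Simulate losing the connection and establishing a new one.
        self.connection.alive = False
        self.connection = FakeConnection()
        if clear_cache:
            self._cache.clear()
```

A unit test that prepares a statement, forces a reconnect, and executes the same SQL again fails when the cache is kept and passes when it is cleared. Only a module whose tests happen to exercise the reconnect path would notice, which mirrors how the failure surfaced in an unrelated module.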

Because I regularly troubleshoot obtuse bugs, I'd have to say that the subtle ones are the worst. Things malfunction infrequently and even self-correct. Nevertheless, each malfunction could cause a user-visible error that could cause the user not to retry a failed financial transaction or could lower the user's confidence level in such a fashion that you could lose a customer forever. These kinds of errors are simply unacceptable. Although unit testing is not a silver bullet, any reasonable means of reducing the amount of failures and increasing the overall product quality should be seriously considered for inclusion in an organization's standard operating practices.

Version Control

As mentioned previously, version control is likely the heaviest fruit on the tree of stability and control. Version control systems (VCSs) are designed to manage bits of data, tracking when, why, and by whom the changes to those bits were made. It sounds simple enough, and any good development team will tell you they already use version control, but using it well is the key.

Several version control systems are available today. Arguing over which is better and which is worse gets you nowhere. However, each has a set of highly overlapping features, and by ranking the importance of those features in your architecture, you can quickly arrive at the right tool for the job.

One of the first mistakes made by organizations choosing version control solutions is vendor bias. Many of the more popular systems are used by the open source community. This places an interesting angle on vendor bias.

I submit that the concept of version control is simply not that complicated. That is not to say that building a reliable, robust VCS is a trivial exercise, but it is well within the realm of open source technologies. In fact, commercial version control systems simply don't offer that many advantages in the realm of systems and development. (They do have some good advantages still with respect to managing version control on files with proprietary formats such as Microsoft Word documents, but this simply doesn't apply to us.)

Online debates between world-renowned open source developers show that developers tend to harp on features such as distributed, disconnected operation, managing vast numbers of changesets, and the suitability of the system to allow the easy application of vast numbers of external patches. I argue that this is biased toward a genre of development that is dramatically different from the needs of the typical lone Internet architecture.

The typical commercial architecture places tremendous value on the code and configuration that drives it. There is a desire to keep it unified, authoritative, protected, and private: keep it secret, keep it safe. This means that the importance of good distributed and disconnected operation is mostly irrelevant. Instead, availability is the remaining feature requirement in that realm.

Additionally, individual components of the architecture typically have small teams that work closely together to handle development and maintenance. This means that the management of changesets and patches is vital, but vast quantities of outstanding changesets and patches are unlikely to occur.

So, now that I've disqualified the typical concerns of the average open source developer, what is important?

Stability, durability, consistency, and restorability: This system will hold all that is dear to the architecture. You must be able to have complete confidence that it will work as documented. In VCS systems, the term check in has been replaced by commit over time. As in database systems, if a commit succeeds, it must completely succeed, and if it fails, it must be as if it never happened. There is no room for error. It is also vital that there are clean, effective ways of consistently backing up the data to allow for complete and safe disaster recovery.
Branching and tagging: It is vital that the system be capable of managing several concurrent copies of the intellectual assets: development branches, production branches, personal branches, product version branches, and so on. Tagging, or at least the ability to understand and repeat a "snapshot" of the tree, is vital for production change control.
Usability: A VCS system is utterly useless if its supposed users are not educated on how to use it best. Users must be comfortable enough with the system that it is a productivity tool and not a hindrance. Additionally, it is crucial that the tool run seamlessly everywhere in your architecture. It must run on every hardware platform, architecture, and operating system you run. If it does not run on some of your production systems, it is highly likely that things pertaining to that system will not embrace version control as extensively as everything else. (This is a great vote for open source systems.)
Changesets: Although integrated changeset support is not fundamental in these systems, the capability to understand that the changes to several indirectly related bits can be correlated together as a single high-level change is important. This makes it much easier to determine what changes were necessary across the architecture to effect a desired change. It also dramatically simplifies reverting the change or applying just that change to another branch.

Each of these features is fundamental. Given the size of your architecture, how much you will manage in the VCS, and the number of users interacting with it, you can weight the preceding features in each VCS product and choose the one right for your architecture.

In our environment, we have between 10 and 20 people regularly touching the VCS. The code and configuration information therein is deployed on approximately 150 machines. Our developers are most familiar with Concurrent Versions System (CVS). Our managed assets come in rather small at about two gigabytes.

Originally we used CVS for all asset management. However, over time, the need to branch and merge efficiently on large repositories and the desire to have changesets began to surpass the importance of developer usability. The inability to tag large trees often (an inherent problem in the implementation of CVS) started to make the tool a hindrance instead of a productivity booster. After a few large commits failing halfway through and leaving the CVS trees in an inconsistent state, we decided to reevaluate which features were most important.

As we are a consultancy, we are paid to be productive. The idea of retraining our entire CVS user base on a new technology (let alone a new VCS paradigm) was something we wanted to avoid. Capitalizing on existing knowledge and introducing a minimal interruption to overall productivity was paramount. In the end, we did a vast amount of experimentation and evaluation of different systems and found that the features provided by Subversion and its low barrier to entry made it the best choice for us.

At the end of the day, the actual VCS you implement is of no consequence so long as it sports the features that are important in your architecture and is adopted religiously by all the parties involved. One crucial step of selecting a VCS is to engage its user community and ask for positive and negative experiences on architectures that are most similar to yours. If you can't find a user of a specific VCS with an architecture that resembles yours, ask yourself if you want to be the first.

So, you have a robust, reliable version control system. What do you do with it? How does it aid the overall management of your architecture?

Version Control in Action

Version control allows you to understand how your code changes, by whom, and for what reason. As such, almost all development work that takes place today is done in a VCS. However, more often than not, that VCS is not leveraged for deployment.

Let's poke around in a Subversion repository a bit and demonstrate how you can use Subversion to manage deployment.

Because this book really isn't about installing and administrating Subversion, we'll assume that you have a working Subversion install hosting your "superapp" repository at https://svn/superapp/. You can find more information about Subversion at http://subversion.tigris.org/.

In Subversion you lay out your repository in three main sections:

/trunk/ is used for mainline development. All new features start here.
/production/ holds only production-ready code. Code placed in this location has been developed and tested in /trunk/, and basic quality assurance has been performed. Only severe, critical bugs are fixed directly in this section and from there moved back into /trunk/.
/tags/ holds copies of the /production/ branch that are suitable for a push.

The concepts of software management are well out of the scope of this book, so we won't delve into the policies and procedures for committing code in /trunk/ or moving that to /production/. However, we can use the /tags/ section for launching code. Typically in a software engineering environment, the /tags/ section of a Subversion repository is for product releases. For example, version 1.3.2 of your superapp would be placed on /tags/1.3.2/ and from https://svn/superapp/tags/1.3.2/ you would roll your "release."
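Creating a tag in Subversion is a server-side copy of /production/ into /tags/. The sketch below only builds the `svn copy` command rather than running it. The `push-YYYYMMDD-NN` naming scheme and the helper names are assumptions made for illustration; the repository URL is the "superapp" example used in this section.

```python
from datetime import date

REPO_URL = "https://svn/superapp"  # example repository from the text


def push_tag_name(day: date, sequence: int) -> str:
    """Hypothetical naming scheme: one tag per push, several per day."""
    return f"push-{day:%Y%m%d}-{sequence:02d}"


def tag_command(tag: str, message: str) -> list:
    """Build the svn command that copies /production/ to /tags/<tag>/."""
    return [
        "svn", "copy",
        f"{REPO_URL}/production",
        f"{REPO_URL}/tags/{tag}",
        "-m", message,
    ]


if __name__ == "__main__":
    tag = push_tag_name(date.today(), 1)
    print(" ".join(tag_command(tag, "tag production for push")))
```

A date-plus-sequence name sorts chronologically as plain text, which makes "the tag before this one" easy to find later. A version-number scheme such as 1.3.2 works equally well when releases are less frequent.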

In a fast-moving web architecture, it is not uncommon to have more than one production release per day. Additionally, the "application" tends not to be a shrink-wrapped software product; it isn't installed but rather patched. The "application" is an entire architecture with a huge number of subproducts, configurations, and dependencies, and often has multimachine deployments that are not identical. We actually have one client who has performed 10 production code pushes in a single day.

One of the most fundamental rules of science is that you must have controls in your experiments. In many ways a production push is like an experiment. You expect everything to go as planned; you expect your hypothesis (success) to be met. Occasionally, however, the experiment yields an unexpected outcome. When things go wrong, we immediately look at our controls to ensure that the environment for the experiment was the environment we expected. With a daily production push schedule, the environment you push in was not the environment that the application was developed in.

Subtle bugs in systems sometimes take days or weeks to manifest effects large enough or acute enough to notice. This, in itself, poses a quandary. What caused the problem?

In a production troubleshooting situation, religious use of version control is a life-saver. It allows for both systems administrators and developers to review the concise logs about what has changed, when it changed, and why it changed. Additionally, by pushing tags into production, reversion to a previous "system state" is easier. "Easier" is not to be confused with "easy." There are still many things to take into account.

New application code is often accompanied by changes to database schema, scheduled maintenance jobs, and systems and network configuration changes. Reverting the system to a previous state is not always as easy as simply checking out that previous tag and pushing it into production. However, as discussed already, it is critical to accompany all substantial production changes with a plan, and that plan must include a reversion plan that has been tested.

What a good VCS can provide is a simple, consistent implementation of a plan for reverting the simpler production pushes. A vast majority of the production pushes I see on a daily basis are minor feature or maintenance related. As such, they can all be reverted by checking out a previous "stable" tag. Beyond that, most are unrelated changes and fixes, so instead of reverting to an older tag, we can just back out the changeset that caused the problem.
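Both styles of reversion map onto ordinary Subversion commands. The following sketch builds the commands without running them and rests on several assumptions: the deployed tree is a Subversion working copy, tag names sort chronologically as text, and `svn switch` is used to move the working copy to an earlier tag (one possible way to "check out a previous tag"). The helper names are hypothetical.

```python
REPO_URL = "https://svn/superapp"  # example repository from the text


def previous_tag(tags, current):
    """Return the tag that sorts immediately before the current one."""
    ordered = sorted(tags)
    index = ordered.index(current)
    if index == 0:
        raise ValueError("no earlier tag to revert to")
    return ordered[index - 1]


def switch_to_tag_command(tag, working_copy):
    """Point the deployed working copy at an earlier tag."""
    return ["svn", "switch", f"{REPO_URL}/tags/{tag}", working_copy]


def back_out_changeset_command(revision, working_copy):
    """Reverse-merge a single revision from /production/ into the working copy."""
    return [
        "svn", "merge",
        "-r", f"{revision}:{revision - 1}",
        f"{REPO_URL}/production",
        working_copy,
    ]
```

The reverse merge only changes the working copy; the result still has to be reviewed, committed, tagged, and pushed like any other change, so the backout itself leaves a record in the history.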

When used right, a good VCS can allow you to "roll back the clock" in your production and development environments. This sort of flexibility and immediately available change history is invaluable.

A Different Approach to Disaster Recovery

Nothing will ever replace the need for a bare-metal recovery. When a machine dies and must be replaced and reinstalled, it is a relatively simple step-by-step approach: bootstrap, restore configuration from tape, apply configuration, restore data from tape, test, make live. However, there are a variety of other failures that, although disastrous, do not require a bare-metal recovery.

Imagine a reboot where your interfaces have the wrong configuration (wrong IPs), or a missing software configuration file. A web server restarts and some of the virtual hosts are "missing" or otherwise malfunctioning. Perhaps it is brought to your attention that some critical recurrent job hasn't been running and though it should be in the crontab on the machine it isn't.

Although mysterious things tend to happen less in strict production environments, they happen in development often. And the more aggressive your production push schedule is, the more likely oddities will arise in the production environment. One of the most coveted skill sets in any large, fast-paced architecture is keen production troubleshooting skills. Things are changed during the investigation and problem-solving process, and the solutions to these acute problems often lead to online production reconfiguration. This can make managing the overall production configuration challenging to say the least.

How do you solve this challenge induced by untracked, poorly documented, "emergency response" style configuration changes? Well, simply put, track it and document it, using your VCS.

On your version control server, set up a process that will back up the important files from all your production servers. This includes configuration files, custom kernels, and package applications installed after the full OS install (such as your installations of Apache, Oracle, Postgres, MySQL, and so on). The bulk of this data should range between 10 megabytes and 500 megabytes, entirely reasonable for a version control system. From these backups you will commit changes. If you synchronize these data sources with a protocol such as rsync, they can be replicated inexpensively, allowing for short sync/commit cycles.

Alternatively, the approach can be embraced more completely by placing important files on a system directly under version control and making each system responsible for applying changes directly to the VCS. This eliminates a step in the process but requires a bit more finesse in the configuration to ensure that all appropriate files are backed up. In our systems, we rsync /etc/ and other important configuration directories to a central control box and apply the changes to the repository; when new files are added on a production machine, they are automatically placed under revision control. It has been a life-saver.
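The "apply the changes to the repository" step can be sketched as follows. This assumes the central box holds a Subversion working copy into which rsync has already mirrored a server's files (the rsync step itself is not shown), and that the text output of `svn status` is available. It relies on the status code being the first character of each line and assumes paths have no leading or trailing spaces. The function names are hypothetical.

```python
def plan_commit(status_output):
    """Split `svn status` output into files to add and files to delete."""
    to_add, to_delete = [], []
    for line in status_output.splitlines():
        if not line:
            continue
        code, path = line[0], line[1:].strip()
        if code == "?":      # present on disk, not yet under version control
            to_add.append(path)
        elif code == "!":    # under version control, missing on disk
            to_delete.append(path)
    return to_add, to_delete


def commit_commands(to_add, to_delete, message):
    """Build the svn commands that record the synchronized state."""
    commands = []
    if to_add:
        commands.append(["svn", "add", *to_add])
    if to_delete:
        commands.append(["svn", "delete", *to_delete])
    commands.append(["svn", "commit", "-m", message])
    return commands
```

Modified files (status `M`) need no special handling because `svn commit` picks them up on its own. The explicit add and delete steps are what put new files under revision control automatically and record removed ones.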

The nice thing about this approach is that by looking only at core configuration information and static applications, the file sets are unlikely to change with any frequency. This also means that the changeset notification messages (the emails sent to the team on every commit) are infrequent and useful for keeping the whole operations team both up-to-date on intended changes and aware of unintentional reconfigurations, such as those that occur during hectic troubleshooting sessions.

The advantage of backups cannot be replaced. However, the ability to restore the configuration of a server to a specific point in time is much less valuable than the ability to understand how it has changed over that time. It will give you far better insight into the cause and effect of changes.