2.3. Distributed Development

Distributed development is more than just a fad or even a trend. Organizations and companies large and small are using diverse, globally distributed teams to develop their software. The free software development movement showed the world how to develop internationally. Well before SourceForge.net became a site that every programmer had heard of, projects working together over the Internet or far-flung connected corporate networks developed much of the software that we use today.

In fact, the tools they developed to do that are now considered the baseline standard for developers everywhere. What company in its right mind doesn't mandate that its programmers use some form of version control and bug tracking? I ask this rhetorically, but for a long time in the software business, you couldn't make this assumption. Small development shops would back up their data, for sure, but a backup is not version control.

Distributed development is about more than just version control. It's also about communications, bug tracking, and distribution of the end result: the software itself.

2.3.1. Understanding Version Control

Programming is an inherently incremental process. Code, then build, then test. Repeat. Do not fold, spindle, or mutilate.[5] Each step requires the developer to save the program and run it through a compiler or interpreter. After enough of these cycles, the program can do a new thing or an old thing better, and the developer checks the code into a repository, preferably not on his machine. Then the repository can be backed up or saved on a hierarchical storage system. Then, should a developer's workstation crash, the worst case is that the only work lost is that done since the last check-in.

[5] This sentence is famous for being printed on punch cards, an early way of providing computers with data. If the cards were folded, spindled, or mutilated, they jammed the readers, which makes one wonder what the punch card programmer used for version control. The answer is right there in front of you: as a deck went through revisions, programmers swapped in new cards and retained the old, original ones.

What is actually stored from check-in to check-in is the difference from one version to the next. Consider a 100-line program in which three lines read:

    for (i=1; i < 100; i++) {
        printf("Hello World\n");
    }

and one line needs to be changed to:

    for (i=1; i < 100; i++) {
        printf("Hello to a vast collection of worlds!\n");
    }

which would then be checked back in. The system would note that only one line had changed and store only the difference between the two files. This way, we avoid wasting storage on what is mostly the same data.
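You can see this storage scheme with the standard Unix diff tool, whose unified output format is roughly what systems like CVS and Subversion record between check-ins. In this sketch, the file names and paths are invented for illustration:

```shell
# Write the two versions of the fragment to temporary files
# (names and paths are illustrative).
cd /tmp
printf '%s\n' 'for (i=1; i < 100; i++) {' \
              '    printf("Hello World\n");' \
              '}' > hello.c.v1

printf '%s\n' 'for (i=1; i < 100; i++) {' \
              '    printf("Hello to a vast collection of worlds!\n");' \
              '}' > hello.c.v2

# A unified diff records only the changed line plus a little context --
# roughly what the repository stores from one check-in to the next.
diff -u hello.c.v1 hello.c.v2 > hello.c.patch || true
cat hello.c.patch
```

The patch file contains the one removed line (prefixed with -) and the one added line (prefixed with +), not a second full copy of the program.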

The value of having these iterations can't be overstated. Having a previously known, working (or even broken) copy can help in the event of an editing problem, or when you're trying to track down a bug that simply wasn't there a revision ago. In desperate cases, you can revert to a previous version and start from there. This is like the undo option in your favorite word processor, but one that persists from day to day.

Version control isn't used just in development. I know of IT shops and people who keep entire configuration directories (/etc) in version control to protect against editing typos and to help with the rapid setup of new systems. Some people like to keep their home directory in a version control system for the ultimate in document protection. There is even a wiki project that sits on top of the Subversion version control system.

Additionally, good version control systems allow for branching, say, for a development branch and a release branch. The version control system most open source projects use is CVS.

2.3.1.1. CVS

CVS, the Concurrent Versions System, allows developers all over the world to work on a local copy of a codebase, going through the familiar "code, build, test" cycle, and check in the differences. CVS is the old standby of version control, much in the same way RCS/SCCS was before it. There are clients for every development environment, and it is a rare professional developer who hasn't been exposed to it.

Since it is easy to use and install and it enjoys wide vendor support, CVS continues to be used all over the world and is the dominant version control platform.
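The daily CVS cycle looks something like the following sketch. The repository path and module name are invented for illustration; the commands (init, import, checkout, commit) are the standard CVS ones, and the script skips itself on machines without a cvs client:

```shell
# Skip gracefully on machines without the cvs client installed.
command -v cvs >/dev/null 2>&1 || { echo "cvs not installed; skipping"; exit 0; }

# Create a throwaway repository (path is illustrative).
export CVSROOT=/tmp/demo-cvsroot
rm -rf "$CVSROOT" /tmp/demo-work /tmp/demo-checkout
cvs -d "$CVSROOT" init

# Import an initial version of a module called "hello".
mkdir -p /tmp/demo-work/hello && cd /tmp/demo-work/hello
printf 'int main(void) { return 0; }\n' > hello.c
cvs -Q import -m "initial import" hello vendor start

# The daily cycle: check out a working copy, code/build/test, check in.
mkdir -p /tmp/demo-checkout && cd /tmp/demo-checkout
cvs -Q checkout hello
cd hello
printf '/* one more cycle */\n' >> hello.c
cvs -Q commit -m "check in after the code/build/test cycle"
```

Only the differences travel back to the repository at commit time, which is what makes the model workable over slow, far-flung network links.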

2.3.1.2. Subversion

Only the rise of Subversion has brought real competition to the free version control space. With a much more advanced data store than CVS and with clients available for all platforms, Subversion (SVN) is also very good at dealing with binary data and branching, two things CVS handles poorly. SVN is also efficient to use remotely, which CVS is not; CVS was designed for local users, and remote use was tacked on later. Additionally, SVN supports a variety of access control methods: because Subversion is an Apache project, it supports any authentication scheme Apache does, including LDAP, SMB, or any scheme the developers wish to roll themselves.
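As a sketch of what that Apache integration looks like, a mod_dav_svn location block in httpd.conf can hand authentication off to LDAP via Apache's LDAP authentication module. All paths, hostnames, and the realm name below are hypothetical:

```apache
# Hypothetical httpd.conf fragment: serve the repository at
# /var/svn/project under the URL path /repos, authenticating
# users against an LDAP directory.
<Location /repos>
    DAV svn
    SVNPath /var/svn/project

    AuthType Basic
    AuthName "Project Subversion Repository"
    AuthBasicProvider ldap
    AuthLDAPURL "ldap://ldap.example.com/ou=people,dc=example,dc=com?uid"
    Require valid-user
</Location>
```

Swapping the AuthBasicProvider and its companion directives is all it takes to change authentication schemes; the Subversion side of the configuration stays the same.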

2.3.1.3. What About SourceSafe?

SourceSafe isn't really version control. Local version control, whether CVS or SourceSafe, is just backup, requiring a level of hardware reliability that simply doesn't exist on a desktop. Since SourceSafe is not designed to be used remotely, you take the life of your codebase in your hands when you use it. There are some SourceSafe remoting programs out there if you must use SourceSafe, but I can't recommend them so long as decent, free SVN and CVS plug-ins exist for Visual Studio.

2.3.1.4. The Special Case of BitKeeper

BitKeeper, written by Larry McVoy, was chosen by Linus Torvalds for version control of the Linux kernel. It was a very good choice, given the kinds of problems that arise in kernel development. Written for distributed development, BitKeeper is very good at managing multiple repositories and multiple incoming patch streams.

Why is this important? With most version control systems, all your repositories are slaves of one master and resolving differences between different slaves and masters can be very difficult.

The only "problem" with the kernel team's use of BitKeeper was that BitKeeper was not a free software program, although it was available for the use of free software developers at no charge. I say was because Larry McVoy recently decided to pull the free version, thus making it impossible for Linux kernel developers to use the program without paying a large fee.[6]

[6] The kernel team is in the process of moving off of BitKeeper as of this writing.

A great number of developers lamented the use of a proprietary tool for free software development, and the movement off BitKeeper, while disruptive, is a welcome change.

BitKeeper is a tool designed with the open source software model in mind. It has found success among large proprietary development houses specifically because the problems that faced the kernel team in 2001 are the same ones that increasingly face proprietary development shops. All of these teams, not just those working on open source development projects, now face multiple, far-flung teams that are engaged in collaborative development and struggle to do it effectively.



Open Sources 2.0: The Continuing Evolution
ISBN: 0596008023
Year: 2004
Pages: 217