Section 1.3. The Elements of Version Control | Subversion Version Control. Using The Subversion Version Control System in Development Projects

1.3. The Elements of Version Control

Version control is, in essence, exactly what its name suggests: the tracking, controlling, and merging of different versions (called revisions) of a project over time. In practice, as with almost anything, this is not nearly as simple as it sounds. Version control systems are complex software tools whose features vary widely from system to system. Conceptually, though, they are fairly simple, and most version control systems can be grasped with an understanding of a few basic concepts.

1.3.1. The Repository and Working Directory

Most version control systems store versioned projects in a central repository. The repository may simply be a structured directory on a server with each versioned file stored separately, or it may be a database containing entries for the various files in a project. It may even be a complex distributed system that redundantly stores the versioned project all over the world.

Regardless of what the repository looks like, the one commonality among version control systems is that developers do not work directly on the files in the repository. Instead, they have some sort of working directory accessible from their development machine, where they can make local modifications.

Working directories generally allow individual developers to work locally, adding and testing changes as necessary, during the development process. Once a change or set of changes is deemed complete, the developer is able to commit the changes back into the repository, where they become a part of the project.

Once a change has been committed to the repository, the other developers working on the project are able to update their working copies to include the latest versions of the project's files. This allows the other developers to test the new changes with the uncommitted local modifications in their own working directories, and fix things as necessary, or demand that the developers responsible for the modifications fix their changes, as appropriate.
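The checkout, commit, and update cycle described above can be sketched in a few lines of Python. This is only a toy model, not how any real version control system is implemented: the `Repository` class, its method names, and the `locally_modified` parameter are all invented for illustration, and the repository here keeps only the latest content of each file.

```python
class Repository:
    """A toy central repository that stores only the latest content of each file."""

    def __init__(self):
        self.files = {}  # path -> content

    def checkout(self):
        # A working copy starts life as a private copy of the repository's files.
        return dict(self.files)

    def commit(self, working_copy, paths):
        # Publish the named files from a working copy into the repository.
        for path in paths:
            self.files[path] = working_copy[path]

    def update(self, working_copy, locally_modified=()):
        # Bring in everyone else's committed changes, but leave the files
        # the developer is still editing untouched.
        for path, content in self.files.items():
            if path not in locally_modified:
                working_copy[path] = content


repo = Repository()
alice = repo.checkout()
bob = repo.checkout()

# Alice finishes a change and commits it.
alice["foo.c"] = "int foo(void) { return 1; }\n"
repo.commit(alice, ["foo.c"])

# Bob has an uncommitted local change; updating brings in Alice's work
# so he can test the two together before committing anything himself.
bob["bar.c"] = "int bar(void) { return 2; }\n"
repo.update(bob, locally_modified=["bar.c"])
```

After the update, Bob's working copy holds Alice's committed `foo.c` alongside his own uncommitted `bar.c`, while the repository itself still knows nothing about `bar.c`.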

1.3.2. Revisions

Version control systems don't just store the most recent state of a project. Instead, they store a history of changes to the project over time. Whenever a developer commits changes to a project, those changes are stored in a revision. Depending on the version control system, revisions will either be global points that refer to the state of the entire repository at a given point, or they will exist as file-level revisions that refer to the state of an individual file.

In a file-level revision system, each file has a revision history independent of the rest of the project. For example, let's say a repository consists of foo.c and bar.c. If foo.c has had ten different modifications committed, the version control system may give its revision as ten, but list bar.c as only being at revision five (if that is the number of modifications committed for bar.c). This sort of revision scheme tends to be unwieldy, and can make it difficult to keep track of the overall state of the project at a particular point in time. For instance, being required to know that the repository consisted of revision 5 of bar.c, revision 10 of foo.c, and revision 8 of ReadMe.txt at 3:26 PM last Tuesday would be a nightmare. To make the tracking of project states easier, most version control systems allow you to refer to revisions by the state in which they existed at a given date and time, or by a specific tag. Tags are essentially snapshots of the repository at a specific point, which can be used as a reference (I talk more about tags in a little while).

Other version control systems (such as Subversion) have revisions that are global across the entire repository. In this type of system, a modification committed to a single file would increment the revisions of all files in the repository. Thus, "revision 10" would refer to a snapshot of the whole system at the time of the tenth commit. Figure 1.1 shows an example of a repository that uses global revision numbers. You will notice how the revision number increases by one each time a commit occurs, even though the files are not the same for each commit. This gives a huge advantage over file-level revisions, because you no longer have a need to keep track of the relationships between different revisions of different files. Consequently, explicit tags tend to become less necessary. For instance, with a file-level system, you might make a tag before every significant feature, to allow you to roll back the whole repository to a consistent pre-change point. When you have global revisions, though, each revision is essentially a tag itself. This makes it much easier to move the entire repository between revisions, or to compare revisions.

Figure 1.1. A repository with global revision numbers.
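To make the difference between the two numbering schemes concrete, here is a minimal sketch that replays the same series of commits under both. The function names and the sample commit list are made up for this example; each commit is represented simply as the list of files it touched.

```python
def file_level_revisions(commits):
    """Each file keeps its own counter, bumped only when that file changes."""
    revisions = {}
    for changed_files in commits:
        for path in changed_files:
            revisions[path] = revisions.get(path, 0) + 1
    return revisions


def global_revision(commits):
    """One counter for the whole repository, bumped once per commit."""
    return len(commits)


commits = [
    ["foo.c", "bar.c"],  # commit 1 touches both files
    ["foo.c"],           # commit 2 touches only foo.c
    ["ReadMe.txt"],      # commit 3 touches only ReadMe.txt
    ["foo.c"],           # commit 4 touches only foo.c
]

print(file_level_revisions(commits))  # {'foo.c': 3, 'bar.c': 1, 'ReadMe.txt': 1}
print(global_revision(commits))       # 4
```

With file-level numbering you need three separate numbers to describe the final state of the project. With global numbering, "revision 4" says it all.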

The differences between two revisions of a repository are often referred to as changesets. In addition to allowing a developer to retrieve specific revisions from a repository, most VCSs allow the retrieval of changesets. The changesets can usually consist of either changes to the entire repository between two revisions, or changes to a specific subset of files in the repository. Some VCS implementations will also allow one or more revisions to be grouped into a changeset that can be used later to roll back changes to a repository or apply those changes to a different repository.
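As a rough illustration, a changeset between two global revisions can be computed by comparing the two snapshots file by file. The sketch below assumes each revision is stored as a complete snapshot (a dictionary of path to content), which real systems generally avoid for space reasons; `changeset` and its `paths` parameter are hypothetical names.

```python
def changeset(snapshots, old_rev, new_rev, paths=None):
    """Return {path: (old_content, new_content)} for every file that differs.

    A content of None means the file did not exist in that revision.
    If paths is given, only that subset of files is considered.
    """
    old, new = snapshots[old_rev], snapshots[new_rev]
    candidates = set(old) | set(new)
    if paths is not None:
        candidates &= set(paths)
    return {
        path: (old.get(path), new.get(path))
        for path in sorted(candidates)
        if old.get(path) != new.get(path)
    }


# Three revisions of a tiny repository, each stored as a full snapshot.
snapshots = {
    1: {"foo.c": "a\n", "bar.c": "x\n"},
    2: {"foo.c": "a\nb\n", "bar.c": "x\n"},
    3: {"foo.c": "a\nb\n", "bar.c": "y\n", "ReadMe.txt": "hello\n"},
}

whole_repository = changeset(snapshots, 1, 3)
only_bar = changeset(snapshots, 1, 3, paths=["bar.c"])
```

Here `whole_repository` covers all three files that changed between revisions 1 and 3, while `only_bar` restricts the same comparison to a single file.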

1.3.3. Logs

Keeping track of code changes is important. However, to truly keep an organized development process going, it isn't sufficient. It is useful to know that three lines of code were added to a source file, but what you really want to know is why those lines of code were added, and what logical change that addition makes to the project. That might not be too hard to discern if the change is small, as with three lines, but if ten thousand lines are added, using a diff to figure out what changed may be practically impossible.

One way to keep track of logical changes would be to keep notes on what is happening in source code comment blocks. This keeps the information close to the source, but it quickly becomes unwieldy. Comments can be hard to find, if someone simply wishes to know what changed from one version to another. Additionally, logical changes that require many small changes scattered throughout the project are especially difficult to document concisely in source code comments.

Moreover, keeping track of logical changes can be invaluable when debugging a project. In any actively developed project, it is inevitable that mysterious bugs will appear in places that worked fine just a few days ago. When that happens, the first thing you'll want to do is figure out what changed since the point where the project last worked. If you don't have a good set of logs showing the logical changes made in that period of time, figuring out where the offending modification lies may be an arduous job.

It would also be nice to be able to compile a change log, showing changes that have occurred from one project release to another. Keeping track of this in source comments would also be extremely difficult, and would probably require some sort of special tagging to allow a script to extract changes. Another option would be to keep track of the changes that occur in a separate file. Such a file would be hard to maintain, though, and could easily become out of sync and inaccurate. A much better solution to tracking logical project changes is to include a log along with each repository commit. The log would allow developers to enter the reason and substance of the change they are committing, in plain English.

Not surprisingly, any version control system worth using keeps logs that can easily be used for exactly the purposes described above. When a developer commits a set of changes to a repository, the VCS will either read the log entry to attach to the commit from an external source like a text file or command line parameter, or it will present the developer with a text editor so that she can enter the log entry right then. If the log entry is well structured, it can be used down the road to do things like automatically create a changelog or list of fixed bugs.
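The following sketch shows why well-structured log entries pay off. It assumes a project convention, invented here purely for illustration, that the first line of a log entry is a one-line summary and that bug-fix commits begin with "Fix #" followed by a bug number. The `Commit` record and both helper functions are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Commit:
    revision: int
    author: str
    message: str  # the log entry attached to the commit


def changelog(commits, first_rev, last_rev):
    """Summarize the commits made after first_rev, up to and including last_rev."""
    lines = []
    for c in commits:
        if first_rev < c.revision <= last_rev:
            summary = c.message.splitlines()[0]  # first line is the summary
            lines.append(f"r{c.revision} ({c.author}): {summary}")
    return lines


def fixed_bugs(commits):
    """Collect bug numbers from log entries that follow the 'Fix #N' convention."""
    bugs = []
    for c in commits:
        match = re.match(r"Fix #(\d+)", c.message)
        if match:
            bugs.append(int(match.group(1)))
    return bugs


history = [
    Commit(1, "alice", "Add parser skeleton"),
    Commit(2, "bob", "Fix #17: handle empty input\n\nThe parser crashed on zero-length files."),
    Commit(3, "alice", "Fix #21: off-by-one in line counter"),
]

for line in changelog(history, 1, 3):
    print(line)
print(fixed_bugs(history))  # [17, 21]
```

Neither report would be possible to generate reliably if the same information were scattered through source code comments.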

1.3.4. Tagging

As I discussed earlier, most version control systems provide a means by which you can tag revisions, so that they can be referred back to at a later date. This frees the developer from reliance on references with poor contextual relation, such as revision numbers or dates. Tags can be placed at development milestones, to allow development in a project to continue, without hurting the ability for someone to later go back and see a snapshot of the project's source at that milestone. For example, if a tag is placed at each project release, it is easy to go back at a later date to search out the cause of a bug that has been discovered, even if the current code of the project has diverged and no longer contains the portion of the project where the bug occurred.

1.3.5. Branching

Tags are useful, but what happens when the code that has been tagged needs to change, thereby diverging from the main development trunk, such as in the preceding bug fix example? It's not sufficient to simply find the cause of a bug that has been discovered in a previous release. In most cases, you will also want to fix that bug and release a patch to your users. Typically, version control systems will support this type of divergence by allowing you to create parallel paths of development for the repository (or a subset of it). These parallel paths are usually referred to as branches.

At any given revision, a branch can be created, with development continuing from that point in two parallel paths. Each branch can be worked on independently, and changes committed to one branch will not affect any other branches of the project. Later, if the developers decide that a change made on one branch would be useful on one or more of the project's other branches, it is usually possible to merge all or part of two branches together.

You can see an example of a repository that has branched in Figure 1.2. In this case, the version 1.0 tag was created on the main branch when the product was released, and a branch was then created from that point. The branch allows the 1.x line of the project to continue making releases (1.1, 1.2, and so on), while the main project branch moves on to develop release 2.0.

Figure 1.2. A repository with branches and tags.
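One simple way to picture tags and branches is as names attached to a chain of revisions: a tag is a fixed name for one revision, and a branch is a name that moves forward each time someone commits to it. The sketch below models a scenario like the one described above under that assumption. The `Repo` class and its methods are invented for illustration and do not reflect how any particular system stores its history.

```python
class Repo:
    """A toy history: every revision remembers the revision it was built on."""

    def __init__(self):
        self.parents = {}            # revision -> parent revision (None for the first)
        self.heads = {"main": None}  # branch name -> newest revision on that branch
        self.tags = {}               # tag name -> revision (never moves)
        self.next_rev = 1

    def commit(self, branch):
        rev = self.next_rev
        self.next_rev += 1
        self.parents[rev] = self.heads[branch]
        self.heads[branch] = rev     # the branch name moves forward
        return rev

    def tag(self, name, revision):
        self.tags[name] = revision

    def branch(self, name, from_revision):
        # A new branch starts out pointing at an existing revision.
        self.heads[name] = from_revision

    def history(self, branch):
        # Walk parent links back to the first revision, then reverse.
        rev, revisions = self.heads[branch], []
        while rev is not None:
            revisions.append(rev)
            rev = self.parents[rev]
        return revisions[::-1]


repo = Repo()
repo.commit("main")                   # r1
release = repo.commit("main")         # r2, released as version 1.0
repo.tag("1.0", release)
repo.branch("1.x", repo.tags["1.0"])  # maintenance branch starts at the tag
repo.commit("1.x")                    # r3, a bug fix for release 1.1
repo.commit("main")                   # r4, work toward release 2.0
repo.commit("1.x")                    # r5, a bug fix for release 1.2

print(repo.history("main"))  # [1, 2, 4]
print(repo.history("1.x"))   # [1, 2, 3, 5]
```

The two branches share revisions 1 and 2 and then go their separate ways, and the 1.0 tag continues to name revision 2 no matter how far either branch advances.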

1.3.6. Locking versus Merging

Up to this point, the discussion has focused mostly on the work of a single developer when using a version control system. In a real development environment, though, a project can easily involve dozens of developers working on the same general area of a system. Good practice dictates that in the best of all worlds, no two developers will ever be modifying exactly the same part of a system at the same time. In reality, however, most of us don't actually live in the best of all worlds, and a multitude of practical reasons can lead to two developers needing to make changes on the same source file at the same time.

When the division of work on a project collides at one spot, there needs to be a way to arbitrate who can modify the offending file or section. There are two primary ways in which version control systems tend to handle these collisions: file locking and merging. Most version control systems will use one of those two methods most of the time, but many use a hybrid system that allows some combination of both methods to be used. Because of the inherent limitations of locking (as discussed shortly), most modern VCSs will support merging. There are many locking-only systems, but they are almost universally older and obsolete (with the notable exception of Microsoft's Visual SourceSafe, which is still used by many development shops).

In a file locking system, the developer locks a versioned file when beginning to make changes. While the file is locked, no other developer will be allowed to make any changes. Then, when the developer has finished, the changed file can be committed back into the repository and the file can be unlocked.

The upside to file locking is that it enforces very organized division of work, in order to minimize the number of times where two developers need to modify the same source file or section simultaneously. It can also work better than merging when working with files that are not easily merged automatically, such as graphics, or files in proprietary binary formats. On the other hand, locking can disrupt development if progress is blocked by two developers competing to work on the same file. The problem can be magnified further if one of those developers forgets to unlock the file when finished. In general, locking scales very poorly and is unworkable for broad use by projects involving more than a few people.
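The lock, modify, unlock cycle can be sketched as follows. This is a minimal model that assumes one exclusive lock per file and that a file must be locked before it can be committed; `LockingRepository` and `LockError` are hypothetical names, and real locking systems differ in the details.

```python
class LockError(Exception):
    """Raised when a developer tries to act on a file someone else has locked."""


class LockingRepository:
    def __init__(self):
        self.files = {}  # path -> content
        self.locks = {}  # path -> user currently holding the lock

    def lock(self, path, user):
        holder = self.locks.get(path)
        if holder is not None and holder != user:
            raise LockError(f"{path} is locked by {holder}")
        self.locks[path] = user

    def commit(self, path, content, user):
        # Only the lock holder may change the file.
        if self.locks.get(path) != user:
            raise LockError(f"{user} must lock {path} before committing")
        self.files[path] = content

    def unlock(self, path, user):
        if self.locks.get(path) == user:
            del self.locks[path]


repo = LockingRepository()
repo.lock("logo.png", "alice")

try:
    repo.lock("logo.png", "bob")  # Bob is blocked until Alice is done
except LockError as err:
    print(err)                    # logo.png is locked by alice

repo.commit("logo.png", b"new artwork", "alice")
repo.unlock("logo.png", "alice")
repo.lock("logo.png", "bob")      # now Bob can take his turn
```

If Alice had gone home without calling `unlock`, Bob would have stayed blocked indefinitely, which is exactly the scaling problem described above.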

The second method for handling collision avoidance/resolution is the use of automatic or semi-automatic file merges. In this model, developers can modify files in their working copies of the repository without regard for what others are doing.^[1] Then, once the developers have finished making their changes, they can commit those changes, and the version control system will check to see if there are any collisions caused by two people simultaneously editing a file. In many cases, the system will automatically merge two files together; but if it can't, most systems will provide the developer making the commit with information about the conflicted area, to allow the developer to merge the changes by hand.

^[1] Can and should are two different issues completely. Developers should always communicate with their fellow developers to know what they are working on.

The main advantage of a merging system is that it frees developers to work independently of other developers. This is especially advantageous in projects where multiple developers are likely to be working in different areas of the same file simultaneously (since most version control locking is at the file level). It also scales better than locking, but it can cause problems if communication between developers is poor, since frequent collisions requiring hand-merging can lead to delays and wasted/duplicated effort.
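The core idea behind an automatic merge is a three-way comparison: each developer's version is compared against the common version both started from. The sketch below is deliberately simplified. It assumes all three versions have the same number of lines, so it handles only in-place edits and not inserted or deleted lines, which real merge tools must also deal with. The function name `merge3` and the conflict marker format are invented for this example.

```python
def merge3(base, mine, theirs):
    """Three-way merge of equal-length lists of lines.

    Returns (merged_lines, conflicts), where conflicts lists the indexes of
    lines that both sides changed in different ways.
    """
    if not (len(base) == len(mine) == len(theirs)):
        raise ValueError("this sketch only handles in-place line edits")

    merged, conflicts = [], []
    for i, (b, m, t) in enumerate(zip(base, mine, theirs)):
        if m == t:        # both sides agree (or neither changed the line)
            merged.append(m)
        elif m == b:      # only the other developer changed this line
            merged.append(t)
        elif t == b:      # only I changed this line
            merged.append(m)
        else:             # both changed it differently: needs a human
            conflicts.append(i)
            merged.append(f"<<< {m} ||| {t} >>>")
    return merged, conflicts


base = ["int a = 1;", "int b = 2;", "int c = 3;"]
mine = ["int a = 10;", "int b = 2;", "int c = 3;"]
theirs = ["int a = 1;", "int b = 2;", "int c = 30;"]

# Edits to different lines merge automatically.
print(merge3(base, mine, theirs))
# (['int a = 10;', 'int b = 2;', 'int c = 30;'], [])

# Edits to the same line are reported for hand-merging.
theirs_conflicting = ["int a = 99;", "int b = 2;", "int c = 3;"]
print(merge3(base, mine, theirs_conflicting)[1])  # [0]
```

Without the common base version, the tool could not tell which side had changed a given line, which is why merging systems keep track of the revision each working copy was based on.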

Of course, neither locking nor merging is a perfect approach to the integration of multiple developers. File locking doesn't provide any assurance that other sections of the project won't change in a way that is incompatible with the changes made to the locked section. Similarly, file merging only catches conflicts that occur in the exact same location of a given file. In light of these deficiencies, every version control system also relies on the developers themselves to use the tools given to them alongside good practices that will help ensure smooth integration of multiple developers. For instance, in most version control systems, the recommended best practice is to always update your local working copy and test it with the changes from other developers before committing a new revision to the repository.
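The following contrived example shows how two changes can merge without any textual conflict and still break the project, which is why testing after an update matters. The tiny "source file" is a list of Python lines so that the result can actually be run. The merge is a one-line simplification that is valid only because no line was changed by both developers, and all names here are made up for illustration.

```python
base = [
    "def area(w, h):",
    "    return w * h",
    "first = area(2, 3)",
    "second = 0",
]

# Developer A renames the function and updates the only call site that exists.
mine = [
    "def rect_area(w, h):",
    "    return w * h",
    "first = rect_area(2, 3)",
    "second = 0",
]

# Developer B, working from the same base, adds a call under the old name.
theirs = [
    "def area(w, h):",
    "    return w * h",
    "first = area(2, 3)",
    "second = area(4, 5)",
]

# No line was changed by both developers, so a line-based merge sees no conflict.
textual_conflicts = [
    i for i, (b, m, t) in enumerate(zip(base, mine, theirs))
    if m != b and t != b and m != t
]

# With no conflicts, merging means taking whichever side changed each line.
merged = [m if t == b else t for b, m, t in zip(base, mine, theirs)]


def runs_cleanly(lines):
    """Run the lines as a small program and report whether it got through."""
    try:
        exec("\n".join(lines), {})
        return True
    except NameError:
        return False


print(textual_conflicts)     # []
print(runs_cleanly(mine))    # True
print(runs_cleanly(theirs))  # True
print(runs_cleanly(merged))  # False: the merged file still calls area()
```

Each developer's version works on its own, and the merge tool reports nothing wrong, yet the combined result fails. Only updating and testing before the commit would have caught it.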