4.5. SCM Annoyances
This section describes some of the common problems that people run into when they use SCM tools with a project. Some problems, such as merging, are hard work due to the basic nature of the problem, but all the problems can be tamed with a little forethought.

4.5.1. Branches and Tags
To recap, a tag is a name for all the versions of a group of files at one moment in time, just as though you had made a copy of all the files as they were at that moment. A branch does the same thing, but allows SCM-controlled changes to the files later on. Figure 4-3 shows an example of this.

Figure 4-3. Changing a file on a branch

Branches are vital because they allow you to make changes to an older version of the product: for example, when you need to fix a bug in a file belonging to the last release of a product. At the same time, you can make changes for the next release to a different version of the same file. If you don't use branches but instead only fix bugs in future releases, this can put pressure on the project to create premature releases.
However, you should try to minimize the number of active branches in your project. Branches make things more complicated because there are now more changes to manage. Imagine three versions of a product: the oldest one is the one that is being maintained, the middle one is the one that is being made available to customers right now, and the newest one is next year's "yup, that bug's fixed in the next release" version. A set of changes to fix some problem has to be created for one of the three versions, tested there, then ported to the other two versions and then tested there too. Even if it is straightforward to port the changes to the other versions, the amount of testing work for one bug has just been tripled. Tracking the same bug in multiple releases is also a hard thing to do well with most bug tracking tools (see Section 7.3.3).
To really see why the number of branches in a product should be minimized, look at Figure 4-4. Each of the source files is named on the vertical axis, and each different version of each source file is a solid circle in the horizontal direction. Every branch that is created is a (logical) copy of all the files into the third axis, the one labeled Branches. Just the copies of File 1 and File 2 are shown, and there have been three changed versions of File 1 on Branch 1. Now this third dimension has an odd characteristic compared with the other two: it's very easy to move in one direction (creating a branch), but it's always much more work to move in the other direction (merging). The more branches of a project that you keep active, the more time you will spend building, testing, and documenting the changes to the project. For the sake of simplicity, I recommend keeping the number of active branches small: two or three at most for a medium-sized commercial product.

Figure 4-4. Branches are in a different dimension

To inexperienced project managers, the concept of branching may seem like an easy answer to many of a project's growing pains. Got a new product? Just put it on a branch. Developing for a new hardware platform? Put it on a branch. Don't like that developer's coding style? Put him on a branch. Some SCM tools even encourage you to think like this. My advice is simple: avoid it! You should use just enough branches for your project and no more. The next section discusses what to do when you do have to create a branch.

4.5.2. When to Branch? When to Tag?
The previous section was pretty emphatic about why you want to minimize the number of active branches in a project. So when is creating a branch appropriate? There are just two common cases:
These two cases can be summarized as "branch on incompatible policies." That is, create a branch when the guidelines for committing files are different. For example, the rules about who can commit to a release branch are usually different from the more open nature of the main development branch. Since the two sets of rules are different for the same source files, a branch is probably necessary. A useful article that expands this idea is "High-Level Best Practices in Software Configuration Management," from http://www.perforce.com/perforce/bestpractices.html. (There are other articles that encourage each developer to have his own branch for his work, or even a branch per changeset, but these approaches assume effortless merging abilities from your SCM tool, which is rarely the case in practice.)
When is it a good idea to tag a project? Good practice is to create a tag whenever anything happens to the project that you might want to reproduce. Examples are creating a release, giving an internal demo, reaching a point in time that you might want to branch from one day, or just getting a build to work again. Since tags are just a way to name a set of particular versions of files, they don't involve the dreaded third dimension of Figure 4-4. Consequently, they require much less effort to work with: there are no merge headaches to deal with later on. However, depending on the SCM tool and the size of the project, tagging may take hours rather than minutes or require locking the repository to stop the files being changed during this time.

4.5.3. Naming Branches and Tags
The naming of branches and tags has surprisingly wide effects on a project. Tag names become associated with builds, test results, and eventually releases, so they appear in many of the related tools such as bug tracking systems. A document with the name of each branch, the branch point tag, and the intended purpose of the branch can help to reduce confusion about how to use different branches. Since there are generally many more tags than branches, it's easier to simply make the tag and branch names meaningful. Section 3.5 describes the idea of build labels, which are a good basis for tag names.
Before you settle on a naming scheme for your branches and tags, note that some SCM tools have nonintuitive quirks about what a name can look like. In CVS, for example, names must start with a letter, not a numeral, so 2_1_release is not permitted. Periods and spaces are also not allowed, so release 2.1 won't work, but hyphens and underscores are permitted (though underscores tend to disappear when the name is used as part of an HTML link). Branch and tag names also have to be unique within a file in CVS; that is, you can't tag two different versions of a file with, say, ALPHA_RELEASE, even if the versions are on different branches. CVS also makes no distinction between tag names and branch names, and working out whether a name is a tag or branch after the fact can be tedious. Create a document that describes the chosen naming scheme for your project's tags and branches, and try to make sure that the naming scheme follows the release numbering scheme (see Section 9.2.3) as closely as possible. If you can enforce the chosen naming scheme using the SCM tool itself, so much the better. Restrict who is allowed to create branches, make sure they know what is expected for branch and tag names, and make sure that they have some good sense about when to create a branch. Once you know who can create branches, automate the process as much as possible for them. A simple naming scheme that has been used successfully with CVS is as follows:
Some examples of tags and branch names using this scheme are:
Dates can be troublesome in branch and tag names, especially if the project has people from different countries reading the dates. Some people like to have the name of the tag that was used as the branch point (or root) of a branch included in the branch name. This seems to make the branch name overly long, in my opinion, and you should be able to use the SCM tool itself to tell you where the branch came from.

4.5.4. Merge Madness
Merging is taking the changes that were made to files on one branch and making the same changes to another branch. Perhaps the branch was where some experimental changes were developed, and now they're ready for everyone else to use. Perhaps a bug was fixed on a branch for one series of releases, and the same bug needs to be fixed in a different series of releases.
Branching is so tempting, so easy: just copy all those files and make your changes to the copies. Merging is so much harder, and only gets harder as the original and the copies diverge over time. Indeed, there are people who make a whole career out of merging different versions of classical texts back together, word by painful word, but you probably don't want to spend your career merging files. Even with the merge tools that are mentioned next, merges still take time, usually because some human intervention is necessary when the tools can't figure out what to do. Large merges inevitably destabilize the branch they are merged into, so extra testing effort is needed after the merge is complete.
In most SCM tools, automated merging uses the diff and patch tools in some manner. diff uses an algorithmic equivalent of finding the shortest path between two points to create the minimum number of hunks, which are groups of lines that could be removed or added to one file to transform it into the other file.
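To make the idea of hunks concrete, here is a minimal sketch that creates two small files, asks diff for the hunks that separate them, and then replays those hunks with patch. The file names and contents are invented for the demonstration, and it assumes that the GNU (or compatible) diff, patch, and mktemp commands are installed.

```shell
#!/bin/bash
# Demonstration only: the file names and contents below are made up.
workdir=$(mktemp -d)
cd "${workdir}" || exit 1

printf 'alpha\nbeta\ngamma\n' > old.txt
printf 'alpha\nBETA\ngamma\ndelta\n' > new.txt

# Create the hunks in unified format. diff exits with status 1 when
# the files differ, which is the expected result here.
diff -u old.txt new.txt > changes.patch || [ $? -eq 1 ]
cat changes.patch

# Apply the hunks to a copy of the old file. The result should be
# identical to new.txt.
cp old.txt merged.txt
patch merged.txt < changes.patch
cmp merged.txt new.txt && echo "merged.txt now matches new.txt"
```

The lines in changes.patch that start with a single - or + are the removed and added lines of each hunk; the unmarked lines around them are context.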
patch takes these hunks and applies them to one file to create the other file, along with some smart attempts to cope with changes to where the hunks should be applied within the file. Many SCM tools help you only with merges between branched versions of the same file, not between separate files. For more information about diff and patch, see "Comparing and Merging Files" at http://www.gnu.org/software/diffutils/manual.
So what makes an automated merge fail? Generally, if two files have a common ancestor and both files have had the same lines changed, it is unclear which changes are the correct ones to use. In this case, the changes are conflicts, and someone has to resolve them by choosing one or another of the changes. Luckily for SCM and branches, developers tend not to modify the same lines of code at the same time as other developers. You may be pleasantly surprised by how few conflicts there are when merging changes from one branch to another. Some SCM tools (including CVSNT, Arch, Perforce, and BitKeeper) automatically keep track of when files were merged.
If you have a large number of files to merge and they have many conflicts, then graphical merge tools may be useful. Some of the better-known standalone merge tools are the commercial Araxis Merge (Windows only) and Guiffy (all platforms), and the open source WinMerge (Windows only) and xxdiff (for Unix). One good way to organize larger merges is to designate a small number of people as "mergemeisters" and let them perform the merge and resolve as many conflicts as possible. Then have the mergemeisters call in the appropriate people for each group of files that still need to be merged by hand.

4.5.5. Security
Some other important aspects of SCM to consider are those related to security. The source code is the heart of your project, where all your intentions, shortcuts, and errors are plain to see.
Several large companies including Microsoft and Cisco have been the targets of successful exploits aimed at acquiring their source code. Even the repository of the source to the CVS tool has itself been cracked. An SCM tool must make sure that only authorized people can read and change files, and it must keep a record of such actions for audits. It must also be able to protect its own files from accidental or malicious corruption, and it should not be vulnerable to denial-of-service attacks. Some practical suggestions for securing your SCM tool, and CVS in particular, include:
An excellent source of further information about this topic is the paper "Software Configuration Management (SCM) Security," by David A. Wheeler, which is available from http://www.dwheeler.com/essays/scm-security.html.

4.5.6. Access Wars
The development of a software product is often broken up into functional groups, such as networking, GUI developers, testers, technical writers, and toolsmiths. Not surprisingly, the way that a product's source code is stored in an SCM tool tends to reflect how the groups are divided. Disagreements about who gets to make changes ("commit rights") in each group's files are a common source of irritation in a project.
In many projects, it is considered polite to mention proposed changes in another group's files to that group before you make them; you can also send diffs by email to the group. Otherwise, someone in the affected group always seems to take offense, whether at the changes themselves, or because they were surprised by who made the changes, or because "you might do it again, and it might break something in the future!" There's not much you can do to argue with that, so you might as well coordinate changes in other groups' files with them beforehand: egoless programming only goes so far when it's a whole group's ego.
Even more far-reaching than these seemingly petty territorial conflicts are the effects on a project when different groups start to deny others read access to their files. These aren't the files containing the name of the next CEO of the company or telling where the last project leader was buried. These are cases such as one group of developers allowing only compiled versions of their libraries to be used by other groups, or the Technical Publications group wanting people to use copies of only those documents that they have personally issued. This kind of information restriction hinders effective software development. Still, looking at the issue from a different angle, preventing your salespeople from promising features in the next release based on a single comment they saw committed to the source code a few weeks ago can actually make software development more coherent. As with all information, it's what you expect the owner to do with it that matters most.
The beauty of SCM tools is that if someone else makes changes that you don't like to your group's files, you can not only talk to him but also back out his changes.

4.5.7. Filenames to Avoid
All filesystems have their quirks about what characters are valid in filenames and how long filenames can be. SCM tools have their own set of restrictions on the names of files. First, a little history.
Filenames with spaces in them were most uncommon in older Unix filesystems. Windows 95 began to make them more popular, but Windows also dragged along "8.3" (pronounced "eight dot three") filename restrictions from its DOS ancestry, where the filename could be at most eight characters long, with an extension of up to three characters. Other characters in filenames that have been known to break cross-platform compatibility, or even corrupt the files stored in SCM tools, are /, \, and newline characters. Just to be safe, these characters are all still worth avoiding in filenames. For example, since CVS was originally developed on Unix, filenames longer than 8.3 were just fine, but support for spaces came later. Unfortunately, the format originally chosen for passing the names of files and their versions to the CVS info scripts, which are part of customizing a CVS server for your site, did not really support spaces in the filenames until more recently, around Version 1.12.6. Windows filesystems are set up by default to be insensitive to the case of filenames. So three files named FileWriter.java, Filewriter.java, and filewriter.java (which differ only in the case of one or two characters) would all be treated as the same file in a Windows filesystem. On Unix, and most other operating systems, they would be three different files. This becomes a problem when a Windows user tries to extract these files from a Unix server; it's not clear which file the Windows user will finally see, since the three filenames may be identical in their local filesystem. It should be noted that the same problem occurs with tools such as FTP and with shared filesystems such as NFS. The most obvious solution is to use names that are unique on case-insensitive filesystems. In general, avoid using the name or abbreviated name of the SCM tool as a filename or directory name. 
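Name clashes on case-insensitive filesystems can be looked for mechanically. The following is a minimal sketch, not part of any SCM tool: case_collisions is a hypothetical helper name, and it assumes one path per line on standard input and no newline characters inside the names.

```shell
#!/bin/bash
# Read path names on standard input, one per line, and print (in lower
# case) each name that occurs more than once when case is ignored.
# case_collisions is a made-up name used only for this illustration.
case_collisions() {
    tr '[:upper:]' '[:lower:]' | sort | uniq -d
}

# The three names from the text collide; README.txt does not.
printf '%s\n' FileWriter.java Filewriter.java filewriter.java README.txt |
    case_collisions
# prints: filewriter.java
```

In a checked-out copy of the project on a Unix machine, the same function could be fed with find . -print to list every path that would clash when extracted on Windows.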
A particularly unpleasant problem can occur if you are working in Unix and are using CVS to store information about CVS (for example, some documents about how you configured CVS for your environment). You won't be permitted to create a subdirectory named CVS, because one already exists as part of how CVS works. However, you can create a subdirectory named cvs, because cvs is a different directory name from CVS in the Unix filesystem. Unpleasant surprises are now in store for anyone who tries to check out the subdirectory to a Windows system. The cvs directory will interfere with the CVS directory that is used by CVS. My suggestion here is to call the subdirectory scm.
Some more general advice about the naming of files and directories in a project:

4.5.8. Backups and SCM
SCM tools behave like backups for their users' files, but it is good to remember that unless the SCM tool's own data is properly backed up, the users' files are no better protected than if the users had just copied their files over to another machine. Backups of an SCM tool's data serve at least three purposes:
Standard server backup practices can usually be followed for SCM servers. If necessary, quiesce or shut down the server, export the data from the database or copy the files, compress, encrypt, and uniquely identify the backup files, and archive them off site on permanent media. As with any backup strategy, all this effort is wasted if you don't periodically test that the SCM server can be recreated using a recent backup. Keeping one or more identical SCM servers on standby is useful both for testing recovery of backups and for periodic maintenance. Personally, I like to make my own nightly backups to CD and DVD for all the SCM data that I am responsible for, and then have an IT department also back up the SCM machines. One place to read more about basic backup and recovery best practices is Chapter 11 of Essential System Administration, by Æleen Frisch (O'Reilly).
The backup files' size can vary quite erratically due to compression artifacts, but the total size of the files always grows every few days, since version control systems can't discard information if they are to reconstruct the past correctly. Large unexpected changes in the size of consecutive backups can occur and are worth investigating, usually by comparing the contents of the different backups.
What happens in the worst case, if you lose all your SCM data? If you're lucky, someone will have a recent copy of the files on her local machine. You can recreate the recent state of the project by adding these files back into the SCM tool. For this reason, it's a good idea to regularly check out the entire contents of the repository onto at least one machine. Automated builds have to do this regularly anyway.

4.5.8.1. Backing up CVS
Example 4-1 shows an example script that can be used on a locked repository to create a gzip'd tarball of the repository. The backup file should be copied to another machine after it has been created.
On a Unix server, this kind of script is typically set up to run nightly, using a cron job. Scripts used to back up CVS repositories should expect to encounter filenames with spaces in them.

Example 4-1. A shell script for backing up a CVS repository

    #!/bin/bash
    #
    # Backup a CVS repository to a gzipped tarball. Also generate output
    # describing what has changed since the last backup.
    #

    # The root of the local CVS repository, the one to be backed up
    CVSROOT=/usr/local/cvs

    # The uniquely-identified backup filename
    backup_home=/backups
    backup_file=${backup_home}/cvs_backup_`date +"%m%d%Y.tgz"`

    # Record what has changed between each consecutive backup.
    # Stop if the repository directory cannot be entered.
    cd ${CVSROOT} || exit 1
    if [ -f ${backup_home}/du.today ]
    then
        mv ${backup_home}/du.today ${backup_home}/du.yesterday
    fi
    # Sort on the directory name, the second field of the du output
    # (sort -k 2 replaces the obsolete "sort +1" syntax).
    du -k [A-Za-z0-9]* | sort -k 2 > ${backup_home}/du.today
    diff -N ${backup_home}/du.yesterday ${backup_home}/du.today

    # Create a list of all the files in the repository. Note that only
    # files whose _full_ name starts with [A-Za-z0-9] are matched. Make
    # sure that empty directories and soft links are handled correctly
    # (find -type f loses both of these).
    repos_filelist=/tmp/all_files.$$
    find [A-Za-z0-9]* -not -type l -print > ${repos_filelist}

    # You could also use grep -v here to select portions of the
    # repository, and you may want to add this script to the list of files
    # that are backed up.
    # --no-recursion comes before --files-from because recent versions
    # of GNU tar apply it only to the names that follow it.
    tar --no-recursion --files-from ${repos_filelist} -czf ${backup_file}
    chmod ogu-w ${backup_file}

    # Clean up
    rm -f ${repos_filelist}

    # And copy the backup file to another machine ...

The source to CVS contains a useful script in the contrib directory named validate_repo.pl, also known as check_cvs in earlier versions. This script can be run nightly to confirm that the repository has not been corrupted in any obvious way.
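A cheap first check on the backup file itself is to confirm that the tarball can be read from start to finish. The sketch below is only an assumption about how such a check might look: verify_backup is a hypothetical helper name, the sample files are created just for the demonstration, and a readable tarball is no substitute for the periodic test of recreating the SCM server from a backup.

```shell
#!/bin/bash
# Return success if the gzipped tarball named by $1 can be listed in
# full. verify_backup is a made-up name used only for this illustration.
verify_backup() {
    tar -tzf "$1" > /dev/null 2>&1
}

# Demonstration with one good tarball and one deliberately bad file.
workdir=$(mktemp -d)
echo "sample" > "${workdir}/file.txt"
tar -czf "${workdir}/good.tgz" -C "${workdir}" file.txt
echo "not a tarball" > "${workdir}/bad.tgz"

verify_backup "${workdir}/good.tgz" && echo "good.tgz: readable"
verify_backup "${workdir}/bad.tgz" || echo "bad.tgz: damaged"
```

A nightly job could call the same function on the newly written backup file and send a warning when it fails, before the file is copied to another machine.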