Prevention




Preventing the simplest form of cut-and-paste is relatively easy: avoid using the cut-and-paste feature of your editor whenever possible. However, to avoid all of the problems that the cut-and-paste epidemic implies, we must be cognizant of other forms of duplication as well. The complete goal of preventing cut-and-paste programming can be summed up in two rules:

  • There should be one and only one copy of any human editable information.

  • Any human editable information should not be derivable from other sources.

These rules encompass not only avoiding cut-and-paste itself, but also preventing the creation of circumstances that force the use of cut-and-paste to make changes. This is an important risk management investment, as it greatly reduces the chances for needless human error. Now let us examine ways that we can accomplish this task beyond the simple avoidance of cut-and-paste.

Know Your Code

The first step to reducing the amount of duplicated code within a project is to know what other code is written or being written for the project. Code reuse is essential to avoiding duplicate code. An important piece of achieving this goal is proper communication with other programmers. Understanding the general area that everyone is working on will allow you to direct queries about existing functionality to the correct programmer quickly and easily. On small teams, communication is probably the most efficient method for discovering existing functionality within the project’s code base.

Extreme Programming attempts to maximize the level of communication using a technique called pair programming, where two programmers work on the same code together, with one writing the code and the other offering advice. With pairs of programmers switching partners on a regular basis, understanding of the code base disseminates quickly through the team. Extreme Programming also suggests regular code refactoring, which we will see later is a cure for the problems of cut-and-paste. Even if you do not want to engage in pair programming, keeping close to the other programmers you are working with will greatly increase everyone’s productivity.

On larger projects, particularly projects where programmers are not centrally located, there is a need for tools to facilitate communication about what code already exists. Even though these tools are not essential to smaller projects, they can still be beneficial.

The first and most important tool is automated documentation, which extracts the documentation directly from the current code and comments. Later when we talk about docuphobia we will go into detail about the best practices for proper documentation, but here we will take a closer look at how automated documentation in particular can benefit code understanding and reuse. When we talk about automated documentation tools, we are discussing tools such as Doxygen and Doc-O-Matic that extract code documentation from the structure of the code and the comments associated with the code. These produce documentation that can be easily browsed or searched to find code that you might need. More information on these tools is available in the First Aid Kit for Docuphobia. Once found, you can reuse the code directly and avoid duplicating both the work and the code yourself.

Automated documentation is best generated as part of a daily build of the code. This ensures that the documentation matches a currently working build; if any errors occur within the build, they can be resolved before the documentation is updated. There are also tools for checking that the documentation matches the basics of the code with which it is associated. Although this cannot correct errors in description, it can point out missing or syntactically incorrect documentation. More information on one such tool, iDoc, can be found in the First Aid Kit for Docuphobia. These documentation errors should be included with the build report and marked for timely resolution. Preferably, this documentation is then made available on a local HTTP server for easy access by all team members.
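
As a minimal sketch of what these tools consume (the function and parameter names here are hypothetical), a comment block written in a form that a tool such as Doxygen can extract might look like this:

/// \brief Compute the damage dealt by an attacker to a defender.
///
/// \param i_attackPower   Attack power of the attacking object.
/// \param i_defensePower  Defensive power of the defending object.
/// \return Damage points to subtract from the defender's hitpoints.
///
/// Tools such as Doxygen read this block directly from the source, so the
/// generated reference always matches the code that was actually built.
unsigned int ComputeDamage(unsigned int i_attackPower,
   unsigned int i_defensePower);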

Another useful set of tools for improving your understanding of the code base you are working with is code visualization tools. These provide visual representations of class hierarchies, relations, and other important aspects that are normally difficult to see from the code alone (Figure 2.1). They also commonly provide advanced search tools for quickly locating code with various syntactic properties. Although this cannot take the place of well-documented code, it can be a useful addition to existing documentation, or a fallback when you are stuck with poor to no documentation.

Figure 2.1: The Structure window in IntelliJ’s IDEA allows the user to visualize the structure of a class, including where an overridden method originates and what accessor methods are present.

Know Your Libraries

One sure way to avoid writing duplicate code is not to write the code at all. However, what does that really mean? If you need the functionality, how can you avoid writing the code that accomplishes that functionality? The solution is extremely simple, but often missed: use third-party code or libraries. If the code to solve your problem has already been written, then reusing that code will avoid the possibility that you will be writing duplicate code since you didn’t write any code at all.

In addition, libraries might provide new and clever methods for avoiding code duplication that you have never seen before. To illustrate this, we will look at a C++ library called boost that provides advanced functionality and techniques supported by experts in the C++ language community. Although not a part of the standard library, many proposals for the next revision of the standard library are derived from work done on boost. First, let us look at a common goal when creating a certain type of object in C++. The basic idea is to prevent the object from being copied or assigned, because this operation does not make semantic sense for this object. To accomplish this, we declare, but do not define, the copy constructor and assignment operator as private members of the class. This prevents access to this functionality except from within the class, and by not defining the copy and assignment functions a link error will occur if they are accessed within the class. An example of such a class is as follows:

class t_DoNotCopy
{
   // ...
private:
   t_DoNotCopy(const t_DoNotCopy &);
   const t_DoNotCopy &operator=(const t_DoNotCopy &);
};

This solution is elegant, but it must be duplicated in every class for which copy is to be disabled. The boost library provides a more concise solution to this by defining the following class:

//  boost utility.hpp header file
//  (C) Copyright boost.org 1999. Permission to copy,
//  use, modify, sell and distribute this software
//  is granted provided this copyright notice
//  appears in all copies. This software is
//  provided "as is" without express or implied
//  warranty, and with no claim as to its
//  suitability for any purpose.
//  See http://www.boost.org for most recent
//  version including documentation.

//  class noncopyable

//  Private copy constructor and copy assignment
//  ensure classes derived from class noncopyable
//  cannot be copied.

//  Contributed by Dave Abrahams

class noncopyable
{
protected:
   noncopyable(){}
   ~noncopyable(){}
private:
   // emphasize the following members are private
   noncopyable( const noncopyable& );
   const noncopyable& operator=( const noncopyable& );
}; // noncopyable

By deriving privately from this class, the amount of code that must be duplicated for each class is greatly reduced. The new version of the class given earlier now shows the savings that this library introduced:

class t_DoNotCopy : boost::noncopyable
{
   // ...
};
Note 

There is always a caveat. Actually, there are several that we will look into in detail when we talk about NIH (Not-Invented-Here) Syndrome, which deals directly with the reasons for code reuse. However, one of these caveats deserves particular mention here as it relates directly to the duplication of code. Do not use library functions or third-party code from two different sources that accomplish the same task. This will make your code less readable and require more work in maintaining it when any of the libraries change. Be sure to evaluate and decide which libraries to use before writing the code so that the libraries can be used consistently across the application.

High-Level Languages

At the dawn of computers, programs were entered one byte at a time directly into a form the machine could process. There was no choice but to write every bit of code that was needed, even if that meant duplication. Time has changed all this as computer languages continue to evolve in a direction that reduces the effort needed to create applications. Many of these advances allow code to be written in a more generic fashion so that code duplication, and hence cut-and-paste, can be reduced or eliminated altogether. The following sections examine the most common language features that reduce code duplication. Most of them are present in several different modern computer languages, but the examples will be drawn from only one language at a time for clarity.

Functions

One of the earliest and most fundamental concepts to be introduced to higher-level programming languages was the function. Functions provide a simple means for reducing code duplication by allowing the parameterization of functionality for easy reuse. Functions exist in some form in almost every programming language, and it is a certainty that every programmer understands the basics of their use. Despite this, functions are often not used to full advantage in many applications. Therefore, let us take a deeper look at how functions prevent cut-and-paste and why they should be used more often.

One common problem with the way many programmers write functions is the granularity of those functions. Functions often become stuffed with several responsibilities, hence growing large and unreadable. There are several reasons to avoid this and write small functions with clear responsibilities. The functions become immediately more readable and their purpose is easier to grasp. Documenting the function is easier and therefore likely to be more accurate. With better documentation, it becomes easier for both you and other programmers to find and reuse the function. This chain of reasoning should make it clear that breaking code into functions with well-defined responsibilities encourages code reuse and avoids duplication. So, what are the arguments against it? The most common argument is performance, as most language implementations of functions have some overhead associated with each function call. However, in Chapter 1, “Premature Optimization,” the reasons that this should not be a concern during development are discussed in detail. It is preferable to write code that is easier to read, maintain, and reuse until optimizations are necessary.

The other common reason for not breaking code down to the proper number of functions is laziness. Programmers often do not want to spend the time creating new function definitions. This is a misplaced sense of laziness, however, since in reality this creates more work later. In addition, modern editors contain many tools to help ease the amount of work involved in writing code. What this really means is trading a little more time initially to save time later, both in maintenance and because it becomes easier to avoid writing duplicate code.

 CD-ROM  To make this more concrete, let us look at a sample function that is written poorly and then see how it should have been written. For this example, which can also be found on the companion CD-ROM in Source/Examples/Chapter2/function.cpp, we will need the following definition of an object that can be damaged and some properties for that object:

/// Information on an object that can be damaged.
/// Most objects will require two copies of this
/// object to exist, one storing the base
/// values and one storing the current values.
struct DamagableObject
{
   /// Current hitpoints.
   unsigned int hitpoints;
   /// Current attack power.
   unsigned int attackPower;
   /// Current defensive power.
   unsigned int defensivePower;
};

For simplicity it is represented as a structure upon which our function will act, but later we will describe how proper function use is important in object-oriented programming as well. Next, we look at the function that performs damage allocation to an object:

/**  Apply effects from damage to an object.
 *   @param   io_object - current object state
 *   @param   i_base - original object state
 *   @param   i_damage - number of damage points to apply
 *   @return  true if damage destroys object, false otherwise
 *   @notes   Single monolithic version of function.
 */
bool AllocateDamage_BeforeRefactoring(DamagableObject &io_object,
   const DamagableObject &i_base, unsigned int i_damage)
{
   // Check for destruction.
   if(i_damage > io_object.hitpoints) {
      io_object.hitpoints = 0;
      io_object.attackPower = 0;
      io_object.defensivePower = 0;
      return(true);
   }
   // Adjust hitpoints.
   io_object.hitpoints -= i_damage;
   // Update stats.
   io_object.attackPower =
      (i_base.attackPower * io_object.hitpoints) / i_base.hitpoints;
   io_object.defensivePower =
      (i_base.defensivePower * io_object.hitpoints) / i_base.hitpoints;
   // Apply damage visuals.
   if(io_object.hitpoints < 1) {
      cout << "Damage Effect: Explosion" << endl;
   } else if(io_object.hitpoints < (i_base.hitpoints / 4)) {
      cout << "Damage Effect: Heavy Smoke and Fire" << endl;
   } else if(io_object.hitpoints < (i_base.hitpoints / 2)) {
      cout << "Damage Effect: Heavy Smoke" << endl;
   } else if(io_object.hitpoints < ((i_base.hitpoints * 3) / 4)) {
      cout << "Damage Effect: Moderate Smoke" << endl;
   } else if(io_object.hitpoints < i_base.hitpoints) {
      cout << "Damage Effect: Light Smoke" << endl;
   }
   // Object not destroyed.
   return(false);
}

Notice the use of comments within the function. These are necessary for understanding how the function works, and even so, it requires a close read to determine everything the function is doing. This is because the function has assumed several responsibilities by itself: adjusting hit points, adjusting attack and defensive power, applying visual effects, and determining whether the object is destroyed. We can instead write each of these responsibilities as a separate function:

/**  Modify the hitpoints of an object based on damage points.
 *   @param   io_object - current object state
 *   @param   i_damage - number of damage points to apply
 */
void AdjustHitpoints(DamagableObject &io_object, unsigned int i_damage)
{
   if(i_damage > io_object.hitpoints) {
      io_object.hitpoints = 0;
   } else {
      io_object.hitpoints -= i_damage;
   }
}

/**  Update the attack and defensive powers of an object
 *   based on its current state.
 *   @param   io_object - current object state
 *   @param   i_base - original object state
 */
void UpdatePower(DamagableObject &io_object, const DamagableObject &i_base)
{
   io_object.attackPower =
      (i_base.attackPower * io_object.hitpoints) / i_base.hitpoints;
   io_object.defensivePower =
      (i_base.defensivePower * io_object.hitpoints) / i_base.hitpoints;
}

/**  Update the visual appearance of the object based on current
 *   hitpoint level relative to original hitpoint level.
 *   @param   i_object - current object state
 *   @param   i_base - original object state
 *   @notes   Prints damage effect to standard out as a test.
 */
void UpdateVisualDamage(const DamagableObject &i_object,
   const DamagableObject &i_base)
{
   if(i_object.hitpoints < 1) {
      cout << "Damage Effect: Explosion" << endl;
   } else if(i_object.hitpoints < (i_base.hitpoints / 4)) {
      cout << "Damage Effect: Heavy Smoke and Fire" << endl;
   } else if(i_object.hitpoints < (i_base.hitpoints / 2)) {
      cout << "Damage Effect: Heavy Smoke" << endl;
   } else if(i_object.hitpoints < ((i_base.hitpoints * 3) / 4)) {
      cout << "Damage Effect: Moderate Smoke" << endl;
   } else if(i_object.hitpoints < i_base.hitpoints) {
      cout << "Damage Effect: Light Smoke" << endl;
   }
}

/**  Check to see if an object is destroyed.
 *   @param   i_object - current object state
 *   @return  true if object is destroyed, false otherwise
 */
bool IsDestroyed(const DamagableObject &i_object)
{
   return(i_object.hitpoints < 1);
}

Notice the improved clarity provided by the function comments that were not possible before. The damage allocation function did not and should not include this level of detail in its comments because its functionality could change. Now we can rewrite the original function in a much clearer form:

/**  Apply effects from damage to an object.
 *   @param   io_object - current object state
 *   @param   i_base - original object state
 *   @param   i_damage - number of damage points to apply
 *   @return  true if object is destroyed, false otherwise
 *   @notes   Refactored function uses several smaller functions to
 *            accomplish its task with greater readability and
 *            reusability.
 */
bool AllocateDamage_AfterRefactoring(DamagableObject &io_object,
   const DamagableObject &i_base, unsigned int i_damage)
{
   AdjustHitpoints(io_object, i_damage);
   UpdatePower(io_object, i_base);
   UpdateVisualDamage(io_object, i_base);
   return(IsDestroyed(io_object));
}

Most important, other functions now have access to the separate functions that were originally performed internally by the damage allocation function. In particular, the destruction test could be useful in several places, as the sketch following this paragraph illustrates. Functions like this are generally the lowest level to which functionality is broken down for reuse, but a higher-level structure for reuse is needed to avoid bogging down large projects in a myriad of functions. One such language feature is the object provided by object-oriented languages, discussed next.
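
For instance, a hypothetical targeting routine elsewhere in the code can now reuse the destruction test instead of repeating the hitpoint comparison inline:

/**  Choose whether an object is still worth targeting.
 *   @param   i_object - current object state
 *   @return  true if the object can still be attacked, false otherwise
 *   @notes   Hypothetical caller; reuses IsDestroyed() rather than
 *            repeating the hitpoint test.
 */
bool IsValidTarget(const DamagableObject &i_object)
{
   return(!IsDestroyed(i_object));
}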

Objects

Functional programming languages are useful for particular types of projects, but many modern applications benefit from what are known as object-oriented programming languages. These languages offer support for enforcing encapsulation and abstraction of data to varying degrees. Used properly, this allows a more intuitive mapping from the problem domain to source code, thus facilitating better organization for reuse. As with functions, one of the benefits of reuse is less duplication of code.

However, because objects generally hide data and implementation details, there is a danger that this could hinder some forms of code reuse. There are several steps to preventing this problem, the first of which is to document the internal or private implementation of an object in addition to the externals. This documentation should be kept separate from the documentation of the public implementation to prevent confusion. The public documentation is used under normal circumstances, but if the desired functionality cannot be found, then the private documentation can be searched as well. If a solution is found within the private implementation, refactoring is generally required to make that functionality properly accessible. Do not fall into the temptation of cut-and-paste, as it will lead to the many problems we discussed earlier. Additionally, do not attempt to hack access to the functionality. Both of these are shortsighted methods that will lead to trouble later.

 CD-ROM  To make this concept clearer, let us look at an example. The code for the classes presented in this example can also be found on the companion CD-ROM in Source/Examples/Chapter2/object.cpp. Here is the public part of a terrain class that uses a hermite spline algorithm internally to interpolate the terrain height:

class t_Terrain
{
public:
   /**  Construct a new terrain from a height map.
    *   @param   i_width - number of elements in the height map
    *            along the x dimension
    *   @param   i_length - number of elements in the height map
    *            along the z dimension
    *   @param   i_heightMap - array of height values of size
    *            i_width * i_length where the y value of the terrain
    *            at (x,z) is located in the array at
    *            [(z * i_width) + x]. The array is copied and is not
    *            required after construction.
    */
   t_Terrain(unsigned int i_width, unsigned int i_length,
      const double *i_heightMap);

   /** Clean up terrain resources.
    */
   ~t_Terrain();

   /**  Get the height (y) value at (x,z).
    *   @param   i_x - value in the direction of the terrain width
    *   @param   i_z - value in the direction of the terrain length
    *   @return  Terrain height.
    *   @pre     0 <= i_x < i_width and 0 <= i_z < i_length where
    *            i_width and i_length are the values provided to the
    *            terrain constructor.
    */
   double m_GetHeight(double i_x, double i_z) const;
};

From the public interface, it is not obvious and should not be obvious that the terrain class uses a hermite spline algorithm internally. The reason for this is straightforward: we do not want users of the class to rely on the implementation details. However, if we were looking for a hermite spline algorithm, it would be useful to see the internal implementation documented as well:

class t_Terrain
{
public:
   t_Terrain(unsigned int i_width, unsigned int i_length,
      const double *i_heightMap) :
      m_width(i_width),
      m_length(i_length),
      m_heightMap(new double[m_width * m_length])
   {
      std::copy(i_heightMap, i_heightMap + (m_width * m_length),
         m_heightMap);
   }

   ~t_Terrain()
   {
      delete [] m_heightMap;
   }

   double m_GetHeight(double i_x, double i_z) const
   {
      int l_xIndex = static_cast<int>(i_x);
      int l_zIndex = static_cast<int>(i_z);
      double l_xDelta = i_x - static_cast<double>(l_xIndex);
      double l_zDelta = i_z - static_cast<double>(l_zIndex);
      return(m_HermiteSpline(
         l_xDelta,
         m_HermiteSpline(
            l_zDelta,
            m_GetHeight(l_xIndex - 1, l_zIndex - 1),
            m_GetHeight(l_xIndex - 1, l_zIndex),
            m_GetHeight(l_xIndex - 1, l_zIndex + 1),
            m_GetHeight(l_xIndex - 1, l_zIndex + 2)
         ),
         m_HermiteSpline(
            l_zDelta,
            m_GetHeight(l_xIndex, l_zIndex - 1),
            m_GetHeight(l_xIndex, l_zIndex),
            m_GetHeight(l_xIndex, l_zIndex + 1),
            m_GetHeight(l_xIndex, l_zIndex + 2)
         ),
         m_HermiteSpline(
            l_zDelta,
            m_GetHeight(l_xIndex + 1, l_zIndex - 1),
            m_GetHeight(l_xIndex + 1, l_zIndex),
            m_GetHeight(l_xIndex + 1, l_zIndex + 1),
            m_GetHeight(l_xIndex + 1, l_zIndex + 2)
         ),
         m_HermiteSpline(
            l_zDelta,
            m_GetHeight(l_xIndex + 2, l_zIndex - 1),
            m_GetHeight(l_xIndex + 2, l_zIndex),
            m_GetHeight(l_xIndex + 2, l_zIndex + 1),
            m_GetHeight(l_xIndex + 2, l_zIndex + 2)
         )));
   }

private:
   /**  Interpolate value using hermite spline algorithm and four values.
    *   @param   i_t - interpolate (i_t, i_h) given (-1, i_s0) ->
    *            (0, i_h0) -> (1, i_h1) -> (2, i_s1)
    *   @param   i_s0 - value at -1
    *   @param   i_h0 - value at 0
    *   @param   i_h1 - value at 1
    *   @param   i_s1 - value at 2
    *   @return  Value at i_t.
    */
   double m_HermiteSpline(double i_t,
      double i_s0, double i_h0, double i_h1, double i_s1) const
   {
      double l_t2 = i_t * i_t;
      double l_t3 = l_t2 * i_t;
      double l_r0 = i_h0 - i_s0;
      double l_r1 = i_s1 - i_h1;
      return(
         ((2*l_t3 - 3*l_t2 + 1)   * i_h0) +
         ((-2*l_t3 + 3*l_t2)      * i_h1) +
         ((l_t3 - 2*l_t2 + i_t)   * l_r0) +
         ((l_t3 - l_t2)           * l_r1)
         );
   }

   /**  Get the height (y) value at (x,z) grid coordinate.
    *   @param   i_x - value in the direction of the terrain width
    *   @param   i_z - value in the direction of the terrain length
    *   @return  Terrain height.
    *   @pre     0 <= i_x < i_width and 0 <= i_z < i_length where
    *            i_width and i_length are the values provided to the
    *            terrain constructor.
    */
   double m_GetHeight(int i_x, int i_z) const
   {
      return(m_heightMap[(m_ClampIndex(i_z, k_LENGTH) * m_width) +
         m_ClampIndex(i_x, k_WIDTH)]);
   }

   /// Grid index clamp types.
   enum t_ClampType { k_WIDTH, k_LENGTH };

   /**  Clamp an index based on width or length.
    *   @param   i_index - index to clamp
    *   @param   i_type - clamp to width(k_WIDTH) or length(k_LENGTH)
    *   @return  Clamped index.
    */
   int m_ClampIndex(int i_index, t_ClampType i_type) const
   {
      switch(i_type) {
         case k_WIDTH:
            if(i_index < 0) {
               return(0);
            } else if(i_index >= static_cast<int>(m_width)) {
               return(m_width - 1);
            }
            return(i_index);
         case k_LENGTH:
            if(i_index < 0) {
               return(0);
            } else if(i_index >= static_cast<int>(m_length)) {
               return(m_length - 1);
            }
            return(i_index);
      }
      return(0);
   }

   /// Terrain dimension.
   unsigned int m_width, m_length;
   /// Internal copy of height map.
   double *m_heightMap;
};

To understand this, imagine someone else wrote the terrain class and now you are writing an animation class:

class t_AnimationChannel
{
public:
   /**  Construct a new animation channel from a set of keys.
    *   @param   i_frames - Number of frames in the animation
    *            at one per second.
    *   @param   i_keys - Animation value at each frame.
    */
   t_AnimationChannel(unsigned int i_frames, const double *i_keys);

   /** Clean up animation channel resources.
    */
   ~t_AnimationChannel();

   /**  Get the value at a particular time.
    *   @param   i_time - Time in seconds from animation start.
    *   @return  Value at i_time.
    *   @pre     0 <= i_time < i_frames where i_frames is the value
    *            provided to the constructor.
    */
   double m_GetValue(double i_time) const;
};

Further, you decide to use a hermite spline algorithm to implement it. Searching the public interface documentation for your project, you find nothing. However, extending the search to private implementations, you come across the already existing piece of code. You can then extract it from the terrain class:

/**  Interpolate value using hermite spline algorithm and four values.
 *   @param   i_t - interpolate (i_t, i_h) given (-1, i_s0) ->
 *            (0, i_h0) -> (1, i_h1) -> (2, i_s1)
 *   @param   i_s0 - value at -1
 *   @param   i_h0 - value at 0
 *   @param   i_h1 - value at 1
 *   @param   i_s1 - value at 2
 *   @return  Value at i_t.
 */
double g_HermiteSpline(double i_t,
   double i_s0, double i_h0, double i_h1, double i_s1)
{
   double l_t2 = i_t * i_t;
   double l_t3 = l_t2 * i_t;
   double l_r0 = i_h0 - i_s0;
   double l_r1 = i_s1 - i_h1;
   return(
      ((2*l_t3 - 3*l_t2 + 1)   * i_h0) +
      ((-2*l_t3 + 3*l_t2)      * i_h1) +
      ((l_t3 - 2*l_t2 + i_t)   * l_r0) +
      ((l_t3 - l_t2)           * l_r1)
      );
}

After extracting this function and rewriting the terrain class to use the new publicly available algorithm, you can write your class:

class t_AnimationChannel
{
public:
   t_AnimationChannel(unsigned int i_frames, const double *i_keys) :
      m_frames(i_frames),
      m_keys(new double[i_frames])
   {
      std::copy(i_keys, i_keys + m_frames, m_keys);
   }

   ~t_AnimationChannel()
   {
      delete [] m_keys;
   }

   double m_GetValue(double i_time) const
   {
      int l_timeIndex = static_cast<int>(i_time);
      double l_timeDelta = i_time - static_cast<double>(l_timeIndex);
      // Interpolate between the key values surrounding the requested time.
      return(g_HermiteSpline(l_timeDelta,
         m_keys[m_ClampIndex(l_timeIndex - 1)],
         m_keys[m_ClampIndex(l_timeIndex)],
         m_keys[m_ClampIndex(l_timeIndex + 1)],
         m_keys[m_ClampIndex(l_timeIndex + 2)]));
   }

private:
   /** Clamp index to number of frames.
    */
   int m_ClampIndex(int i_index) const
   {
      if(i_index < 0) {
         return(0);
      } else if(i_index >= static_cast<int>(m_frames)) {
         return(m_frames - 1);
      }
      return(i_index);
   }

   /// Number of frames.
   unsigned int m_frames;
   /// Values at each frame.
   double *m_keys;
};

Another step toward assisting reuse is based on the previous section on functions. Objects generally act upon their data through some form of functions. This means that the lessons about proper granularity and level of responsibility apply equally well to the functions associated with objects. The concept of limiting the responsibilities of a function to make its role clear applies equally to objects themselves. Clear object roles lead to a more reusable set of objects.
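
As a small sketch of what this means in practice (the class names are hypothetical), a terrain class that both stored height data and rendered it would be harder to reuse than two classes with clearly separated roles:

// Hypothetical split: one class owns the height data, which can be reused
// by AI, physics, and rendering code alike...
class t_HeightField
{
public:
   double m_GetHeight(double i_x, double i_z) const;
   // ...
};

// ...while a separate class carries only the rendering responsibility.
class t_TerrainRenderer
{
public:
   explicit t_TerrainRenderer(const t_HeightField &i_field);
   void m_Draw() const;
   // ...
};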

Templates

So far, everything we have been talking about has been language independent, but now we will take a moment to look at a language feature particular to C++: templates. Do not take this to mean that other languages, such as Eiffel, do not support similar features, just that we will be focusing on the C++ implementation in this section. If you do not know C++, some explanation will be provided to help, but you can skip to the next section if you desire. With that aside, let us look at how templates can reduce the duplication of code in addition to providing protection from errors.

First, imagine that C++ did not have templates. If you wanted to be able to sum an array of integers, an array of doubles, and an array of strings, you would need to write the following three functions:

int accumulate_int(const int *i_integers, unsigned int i_size)
{
   int l_result = i_integers[0];
   for(unsigned int l_index = 1; l_index < i_size; ++l_index) {
      l_result += i_integers[l_index];
   }
   return(l_result);
}

double accumulate_double(const double *i_numbers, unsigned int i_size)
{
   double l_result = i_numbers[0];
   for(unsigned int l_index = 1; l_index < i_size; ++l_index) {
      l_result += i_numbers[l_index];
   }
   return(l_result);
}

string accumulate_string(const string *i_strings, unsigned int i_size)
{
   string l_result = i_strings[0];
   for(unsigned int l_index = 1; l_index < i_size; ++l_index) {
      l_result += i_strings[l_index];
   }
   return(l_result);
}

Additionally, if you later need to sum another type, then another function would be required. The major problem here is that each of these functions contains code with only minor differences. Each of these introduces potential for an error, and if the functions were even slightly more complex, an error would be extremely likely. Now here is the solution with templates:

template <typename t_ValueType>
t_ValueType accumulate_generic(const t_ValueType *i_array, unsigned int i_size)
{
   t_ValueType l_result = i_array[0];
   for(unsigned int l_index = 1; l_index < i_size; ++l_index) {
      l_result += i_array[l_index];
   }
   return(l_result);
}

This not only encapsulates the functionality of the three separate functions into one instance of the code, it also accounts for future functions that are similar. Templates provide a method to write generic functions within the strict type system of C++. This supports the concept of generic programming, which is primarily aimed at reducing the duplication of code for common algorithms while maintaining type safety, thus allowing errors to be caught at compile time. Java handles this differently by requiring that all objects be derived from a single base class, and performs type checking at run time. Many modern languages support this concept of generic programming, but tend to do it in their own unique way.
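
A brief usage sketch makes the compile time checking visible; it assumes the accumulate_generic template above is in scope, and the t_NoPlus type is hypothetical:

#include <string>

// Each call instantiates the template for the element type; a type without
// a suitable += operator is rejected by the compiler.
struct t_NoPlus { int value; };

int main()
{
   const int         l_ints[]    = { 1, 2, 3 };
   const double      l_doubles[] = { 1.5, 2.5 };
   const std::string l_strings[] = { "cut", "-and-", "paste" };

   int         l_intSum    = accumulate_generic(l_ints, 3);
   double      l_doubleSum = accumulate_generic(l_doubles, 2);
   std::string l_joined    = accumulate_generic(l_strings, 3);

   // const t_NoPlus l_items[] = { {1}, {2} };
   // accumulate_generic(l_items, 2); // compile-time error: no += for t_NoPlus

   return(0);
}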

For those who know C or C++, you might be thinking of other ways that this could be solved without using templates and without duplicating code. There are several solutions fitting this criterion, but templates offer two critical advantages: safety and clarity. Common errors can be caught at compile time with templates that other solutions would only be able to handle at run time. Since templates are an integral part of the C++ language, they are supported by editors and debuggers, which allows clearer and easier to maintain code to be written. To make this clear, here is such a solution:

#define accumulate_macro(o_result, i_array, i_size) \
   { \
      o_result = i_array[0]; \
      for(unsigned int l_index = 1; l_index < i_size; ++l_index) { \
         o_result += i_array[l_index]; \
      } \
   }

This solution suffers from several problems. It cannot return a value, so its use must be different from the previous functions. It does not show up as a function even though a call to it will look like a function call. Moreover, and perhaps most serious, debuggers treat it as a single line of code, which makes it impossible to place a breakpoint, or debug stopping point, in the middle of the function. None of these problems exist for the template solution.
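
The difference is visible even in a trivial caller; this sketch assumes both the accumulate_generic template and the accumulate_macro macro above are in scope:

#include <iostream>

void PrintTotals(const double *i_values, unsigned int i_size)
{
   // Template version: an ordinary call that can be used as an expression.
   std::cout << accumulate_generic(i_values, i_size) << std::endl;

   // Macro version: needs a separate result variable and statement, and a
   // breakpoint cannot be placed inside its expansion.
   double l_total;
   accumulate_macro(l_total, i_values, i_size);
   std::cout << l_total << std::endl;
}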

The point of this is to emphasize that although there might be several solutions to a problem that avoid code duplication, it is important to consider other factors, such as compile time versus run time error checking, before making a final decision. Knowing the advanced language features and when to use them is important to reducing code duplication; just be careful to fully comment the result to aid programmers who have less experience with the feature.

Generic Programming

C++ templates are an example of a language feature that supports generic programming. The language Eiffel also provides similar support for generic programming. Each of these allows a type safe approach to development that still provides the necessary tools to reduce code duplication.

To understand the importance of generic programming, you must understand the importance of type safety and how it relates to the development process. Type safety allows errors involving the use of invalid types to be caught at compile time rather than at run time, which is the earliest that languages lacking type safety can catch these errors. In fact, in some languages that lack type safety, these errors are not caught at all.

In either scenario lacking type safety, the errors occur at run time, placing the burden of finding them on the testing phase of development. This can be a serious disadvantage because the errors might not be caught at all. The major difference is that the compiler catches all errors related to compiling the source code, whereas testing requires the programmer to write the proper test case in order to discover an error. If the test case is missed, the error will also be missed.
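
As a minimal illustration of the difference, the commented-out call in the following sketch is rejected by the compiler, so the mistake can never slip past an incomplete test suite; in a dynamically checked language the equivalent error would surface only if a test happened to execute that code path:

#include <string>

double HalveHitpoints(unsigned int i_hitpoints)
{
   return(i_hitpoints / 2.0);
}

void Example()
{
   std::string l_name = "turret";
   HalveHitpoints(250u);        // fine
   // HalveHitpoints(l_name);   // compile-time error: no conversion from
   //                           // std::string to unsigned int
}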

You should therefore always take advantage of language features that allow generic programming. Additionally, you should consider this language feature when deciding which language to use for your project. Support for generic programming can greatly reduce the risk of uncaught errors when used properly.

Preprocessor

What do you do if you are already using a language that lacks the proper functionality to help you prevent certain forms of code duplication? Not all hope is lost. Since most language source code is written in plain text, it is generally easy to apply text processing to the source code before passing it to the language compiler or interpreter. In fact, C and C++ programmers generally assume the existence of a standard preprocessor for handling certain forms of duplication that the compiler does not.
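
A familiar example is a diagnostic macro: only the preprocessor can capture the file name and line number of each call site, so without it that location information would have to be typed, and kept up to date, by hand at every use. A minimal sketch, with the macro and function names purely illustrative:

#include <cstdio>

// __FILE__ and __LINE__ are substituted by the preprocessor at each point
// of use, so the location information is never duplicated by hand.
#define TRACE_HERE(msg) \
   std::printf("%s(%d): %s\n", __FILE__, __LINE__, (msg))

void LoadAssets()
{
   TRACE_HERE("loading assets");   // expands with this file name and line
   // ...
}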

Before you get too excited, and start writing preprocessors to handle everything, you must also consider the disadvantages of them. The primary shortcoming of preprocessors is their lack of integration with the language, including tools such as compilers and debuggers. Even the standard C preprocessor often creates problems with most debuggers by making it more difficult to resolve the location of errors. Non-standard preprocessors also make it more difficult to share code and require more training for new programmers. What this means is that you should prefer solutions inherent in the language to those that use a preprocessor. However, since the solution is not always possible using the language, it becomes necessary to use preprocessors to avoid worse problems such as code duplication.

Earlier in the template section, you saw an example of the C++ preprocessor at work when we presented an alternative solution not using templates. In this case, it was better to use templates because they are more integrated into the language. However, in the Java language there is currently no support for generic programming; all type checking and conversion must be accomplished at run time. In most cases, this reasonable approach makes programming in Java relatively easy to understand. Nevertheless, situations do occur when compile time type checking is necessary, for either optimization or for critical systems that cannot afford to throw an exception. In this case, writing a custom preprocessor is the only approach until the Java language is extended to support true generic programming.

start sidebar
Practical Preprocessing

One project found a situation where the C++ language, even with templates, did not provide the syntax needed to avoid considerable duplication. The basic system was to evaluate a collection of conditionals representing rules and execute the associated code if the conditional was true. The problem was that the number of rules was large, so a more efficient method than evaluating all conditionals on every pass was required. The solution was to provide object types for the conditionals that tracked their state and indicated when a change was made. By storing a list of the conditionals an object was associated with, the conditionals could be reevaluated only if the object had changed. This provided a considerable performance improvement, but there was a problem.

The code for associating the conditionals with their corresponding objects was repetitive and syntactically cumbersome. It was easy to omit objects and make other minor errors. These errors were difficult to debug and often did not appear except under particular conditions.

To solve this problem, a preprocessor was written that would take a file containing a list of rules made up of conditionals matched with code blocks. This preprocessor would parse each conditional to determine what objects were contained within it. Then, a new file was automatically generated containing all the necessary code for associating the objects and conditionals, plus matching each code block with the conditional to be run when that conditional evaluated to true. With this system in place, the code could be written with greatly reduced risk of error. As an added bonus, tracing and debugging information could be added to the automatically generated code to ease the debugging process. Without automation, this code would not have been written because of the tedium involved. In this case, the benefits outweighed the disadvantages and the preprocessor prevented many errors.

end sidebar

One major disadvantage that often arises when writing custom preprocessors is their interaction with the compiler and debugger. Because the syntax of the code written for the custom preprocessor is not understood by the compiler or debugger, these tools can only generate information that refers to the files generated by the preprocessor. This problem can be reduced by providing a utility to translate locations in the generated code back to the location in the original code provided to the preprocessor. This can then be run by hand to locate errors and warnings in the original code. The usefulness of this utility can be further increased if the compiler and debugger are extensible, allowing the utility to be directly integrated into these tools and removing any intermediate steps.
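
When the generated language is C or C++, the standard #line directive offers a complementary mechanism: the custom preprocessor can emit directives that make the compiler and debugger attribute locations to the original input file rather than to the generated file. A hedged sketch of what generated output might look like, with the rule file name and function names purely illustrative:

// Hypothetical output of a custom rule preprocessor. The #line directives
// tell the compiler to report errors and debug locations against the
// original rules file instead of this generated file.
#line 12 "turret.rules"
bool Rule_LowHealth_Condition(const DamagableObject &i_object,
   const DamagableObject &i_base)
{
   return(i_object.hitpoints < (i_base.hitpoints / 4));
}

#line 13 "turret.rules"
void Rule_LowHealth_Action()
{
   // Code block copied verbatim from the rules file by the generator.
}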

Aspect-Oriented Programming

A relatively new language feature, aspect-oriented programming, is emerging that can assist in reducing code duplication, particularly in debugging and profiling code. Aspects of production code will also benefit from this new language feature. You will likely start using it for debugging purposes, so you might wonder why worrying about code duplication in throwaway code is important. It is just as easy to create a bug in debugging code as it is in production code, but it can be even harder to track down that bug since you are likely to look in the production code first. So, how can aspect-oriented programming reduce code duplication? Let us start with a little explanation of exactly what it is and then present an example.

Aspect-oriented programming encapsulates development concerns that cut across the normal structure of the programming language being used. In other words, aspects are meant to encapsulate common code that does not easily fit within the normal structure of the language. In the case of object-oriented languages, this means that the common code aspect must be inserted, or woven, into multiple places across several objects. These insertions are each similar in format, with only minor changes for each new location. Because of this similarity, it is common to use cut-and-paste techniques to perform these insertions, which leads to code that is difficult if not impossible to maintain. Aspect-oriented programming, however, removes this code into a single location for clarity and maintainability, and the aspect code is automatically woven back into the object code at compile time (Figure 2.2A, B, and C).

Figure 2.2: A) Code with tracing before using aspects; the programmer would have to edit each instance of the tracing code. B) Tracing code has been removed and collected into a single aspect; this is now what the programmer will view and edit. C) Tracing code is woven back into the object code; this would not normally be seen by the programmer.

This encapsulation therefore provides two major advantages: separation of the concern into an entity that can be maintained on its own, and the removal of code duplication caused by the concern’s extent. Thus, aspects provide a mechanism for reducing cut-and-paste that is not present in the target language.

Defining aspects is generally accomplished by providing a method of describing where and how in the normal program structure to insert the aspect code. By abstracting the nature of these descriptions, aspects can be reused across multiple projects. The abstraction also reduces the maintenance of the aspect, requiring no updates to the aspect when only minor changes are made to the code into which the aspect is woven.

For example, calls to trace programming execution are often spread across multiple functions and objects within an application. The goal of aspect-oriented software development is to allow the trace concern to be encapsulated in an aspect for easy updating and removal from the final production code. There is one minor disadvantage to this scheme in that it separates part of the functionality of a function or object from the unit itself. However, in the case of crosscutting concerns, the bond between the aspect and the object is generally not as strong as the bond with the rest of the aspect code.

 CD-ROM  Now, to make this more tangible, let us take a class that must have an initialization method called before any other methods. The code for this example can also be found on the companion CD-ROM in Source/JavaExamples/com/crm/ppt/examples/chapter2/WithoutAspectExObject.java, Source/JavaExamples/com/crm/ppt/examples/chapter2/AspectExampleObject.java, and Source/JavaExamples/com/crm/ppt/examples/chapter2/AspectExampleAspect.java. Although it is preferable to avoid this type of object, sometimes it is necessary. Here is a first attempt at this class:

/**
 * Example object that requires initialization.
 */
public final class AspectExampleObject {

    /** True when object is initialized and hence usable. */
    boolean isInitialized;

    /** Creates a new instance */
    public AspectExampleObject() {
    }

    /** Initialize object. */
    void init() {
        isInitialized = true;
    }

    /** Do something that requires the object be initialized. */
    void doSomething() {
        if(isInitialized) {
            System.out.println(
                "AspectExampleObject doing something...");
        } else {
            System.out.println(
                "AspectExampleObject tried to do");
            System.out.println(
                "something, but failed catastrophically");
            System.out.println(
                "because it was not initialized.");
        }
    }

    /** More methods that require an initialized object... */

    /** Uninitialize object. */
    void done() {
        isInitialized = false;
    }
}

The major problem with this object is the possibility that its methods can be called before it is initialized. This could cause all kinds of unpredictable results and mayhem. Therefore, to protect against that, we can add a test at the beginning of each method that throws an exception if the object is not initialized:

/**
 * Throw exception on uninitialized objects
 * upon method call.
 */
public final class AspectExampleObject {

    /** ... */

    /** Do something that requires the object be initialized. */
    void doSomething() {
        if(!isInitialized) {
            throw new java.lang.IllegalStateException();
        }
        if(isInitialized) {
            System.out.println(
                "AspectExampleObject doing something...");
        } else {
            System.out.println(
                "AspectExampleObject tried to do");
            System.out.println(
                "something, but failed catastrophically");
            System.out.println(
                "because it was not initialized.");
        }
    }

    /** More methods that require an initialized object... */

    /** Uninitialize object. */
    void done() {
        if(!isInitialized) {
            throw new java.lang.IllegalStateException();
        }
        isInitialized = false;
    }
}

Notice that the code had to be duplicated in the majority of the object’s methods. This is the type of code duplication to avoid. What we really want is a way to place a single piece of code at the beginning of the functions without duplicating it. This can be accomplished succinctly with the following aspect:

/**
 * Throw exception on uninitialized
 * objects upon method call.
 */
aspect AspectExampleAspect {

    /**
     * All methods on AspectExampleObject that cannot be called
     * until the object is initialized.
     */
    pointcut afterInitMethods(AspectExampleObject aeo) :
        target(aeo) &&
        call(* AspectExampleObject.*(..)) &&
        !call(void AspectExampleObject.init());

    /**
     * Throw an exception at afterInitMethods pointcuts if the
     * object is not initialized.
     */
    before(AspectExampleObject aeo) : afterInitMethods(aeo) {
        if(!aeo.isInitialized) {
            throw new java.lang.IllegalStateException(
                "AspectExampleObject uninitialized [" +
                thisJoinPoint + "]");
        }
    }
}

Another advantage to this particular aspect is that it could be reused with other objects that follow the necessary conventions. This further reduces the unwanted duplication of code. Current support for aspect-oriented programming is limited and, even for the more complete systems such as AspectJ, it is still at the preprocessor stage. As this and other advanced language features become integrated into the core languages, it would be wise to take advantage of the new capabilities, such as reduction of code duplication, that they provide.

Automation

Sometimes it is necessary for data to exist in more than one location at one time. It is essential to remove human participation from this process. To accomplish this, you can use one of the many tools available for this purpose. One such tool is the code preprocessor that we mentioned earlier. However, code is not the only part of development that is prone to cut-and-paste problems.

Almost all modern applications have a number of files, or assets, that are not created by compiling code. As the number of assets grows, they become more difficult to manage. If information requires updating in more than one location, it is almost guaranteed that errors will occur. Let us examine some general tools that can be used to automatically duplicate information when necessary and discuss how programmers should use them.

start sidebar
Dangers of Manual Duplication

One project discovered firsthand the dangers of requiring information to be updated in multiple locations. The user interface for the application under construction used a large number of images that were to be placed in different locations. Unfortunately, the user interface configuration files required that the size of a control be provided. This meant that every time an image was resized, the corresponding control configuration also required an update. This led to many problems, one particularly troublesome because it initially looked as though the art had not been updated. After the art was recreated three times, the real error was finally found, but not before considerable time had been wasted by several team members.

end sidebar

The most flexible and therefore most demanding tool at a programmer’s disposal is the variety of scripting languages created for the purpose of file and text manipulation. These allow for quick implementation of tools that can handle a wide variety of tasks associated with copying and modifying assets. The primary disadvantage to using these tools comes when other programmers need to modify the tools. Depending on the scripting language chosen, other programmers might not understand the language. Even if they do understand the language, some scripting languages such as Perl have such limited structure that it can still be hard to understand what the script is doing. Perl can be excellent for quick one-run scripts that are then deleted, but sticking to a standard language with a reasonable amount of structure, such as Python or Ruby, is the best bet for any script that will be around for more than an hour. See Table 2.1 for a comparison of Perl, Python and Ruby. Another disadvantage of scripting languages is that they are generally interpreted, which requires that the interpreter be installed on the machine on which they are run. This second disadvantage is shared with most other tool solutions, however, so it should not be a major concern. If possible, keep a distributable copy of the files necessary for installation with the scripts for easy distribution.

Table 2.1: Comparison of Three Major Scripting Languages

                              Perl          Python           Ruby
Code Readability              Poor          Good             Good
Community Support             Excellent     Good             Good
Object Oriented/Procedural    Both          Both             OO
Inheritance                   Multiple      Multiple         Single/Multiple Mix-ins
Garbage Collection            Mark/Sweep    Reference Count  Reference Count
Regular Expressions           Integral      Library          Integral

Another obvious choice is to write your own automation tools in the same language as the application. This also has its advantages and disadvantages. The primary advantage is that the tools and understanding will already be there for members of the same team. Nevertheless, many of the higher-level languages used for application development do not have full-featured text and file manipulation utilities that come with the standard language. This can be alleviated to a degree by using third-party libraries to fill in this functionality, but you are then back to the necessity of ensuring that these are installed and understood by any programmer who must maintain the tools. An advantage over interpreted scripting tools is retained, however, because the tool users are not required to install anything additional. It can be argued that it takes less time to develop the tools in the interpreted scripting languages, but this is really only true if the programmer has a good grasp of the scripting language. In the end, the decision between scripted or compiled tools has to be made on a team-by-team and even tool-by-tool basis.

The final step in proper automation is to collect all the various tools, scripts, and batch files, and integrate them properly into the workflow of the programmer. A minimal and complete set of common operations that can be performed with one mouse click or a single command should be created. This usually requires a master script or batch file for each operation that sets off all required tools and scripts. If there are dependencies inherent in the process, it might be better to use common tools such as Make or Ant. These allow operations to be easily written to depend on other operations. Finally, it is beneficial to make these available from a single graphical user interface (GUI) or the IDE (integrated development environment) used by the programmer or programmers.

Avoiding Asset Duplication

While circumstances can dictate the use of automation tools, a better approach is to avoid the need for them altogether. If you have enough control over how the application reads the assets, it should be possible to avoid the need to read the same values from multiple locations.

This is best introduced by giving a simple example. Here is some pseudo-code that illustrates what could happen:

Laser
{
   integer range;
   initialize()
   {
      // ...
      read range;
      // ...
   }
   fire()
   {
      // ...
      use range;
      // ...
   }
}

AI
{
   integer range;
   initialize()
   {
      // ...
      read weapon range;
      // ...
   }
   check_target()
   {
      if target in range
         fire weapon;
   }
}

This requires the range value of the weapon to be available from two different locations; this puts undue responsibility on asset creation to synchronize these values. We can refactor this solution into the following pseudo-code:

Weapon
{
   integer range;
   initialize()
   {
      // ...
      read range;
      // ...
   }
   fire();
}

Laser derived from Weapon
{
   fire()
   {
      // ...
      use range;
      // ...
   }
}

AI
{
   check_target()
   {
      if target is in Weapon.range
         fire Weapon;
   }
}

Now there is only one location where the range value is read, thus avoiding duplication of the value in the assets. This is as important as, or more important than, avoiding code duplication, because finding duplication errors in assets is often more difficult than finding duplicated code. Such errors become more likely in a large system; therefore, it is a good idea for at least one person to understand the asset layout so that he can track down these duplications.

Generative Programming

Up until now, we have talked about practical methods to help prevent cut-and-paste that are available today. Even aspect-oriented programming, while innovative technology, is available for use as a solid working language feature for the Java language. However, other technologies are on the horizon that will be important to reducing code duplication and, even more important, information duplication in general.

To understand the importance of these technologies, you must first understand more about information duplication versus code duplication. Understanding that cut-and-pasting code is error prone and detrimental to software development is only part of the struggle, because in general the code does not contain all the information necessary to develop a software product. There is also information such as customer requirements, domain knowledge, and higher-level design and architecture. Much of the code written represents a translation of this information into a form the computer can read and execute. While some of this translation undeniably requires the creative skills of a human programmer, many translations that could be automated are still performed by hand, even though automating them would yield significant gains in development time without a corresponding loss in performance or completeness.

The topic of generative programming is far too complex and involved to do justice to in a few pages, so here we will look only at how this future technology can reduce information duplication on a larger scale than current programming technologies. More information can be found in [Czarnecki00] and in the new research papers that continue to appear.

While automated programming has been a dream since near the beginning of programming itself, generative programming is an attempt to take a more practical approach to the idea. Rather than attempt to fully automate the process of software development, generative programming aims for the more realistic goal of automating only the processes that make sense to automate with the current technology level.

The primary idea is to take a family of related systems in a specific domain and automate the creation of applications from a high-level customer specification of the desired application features. Every piece of code is created, integrated, and built automatically, except for custom requests that are not available as part of the current generators. These custom components could even be integrated into the generation process for future use, improving the automation even further. This frees the developers to work on only those parts of the application that cannot be automated, greatly reducing development time.

There is, however, an obvious initial cost to building the necessary libraries and generators that can compose these applications. Here some of the language features and technologies we discussed earlier come into play, along with a few other experimental technologies. To facilitate the generation of code supporting a range of features, the concept of active libraries [Czarnecki00] is necessary. An active library acts as a compile-time code generator parameterized on the desired features for the specific instantiation of the library. This provides the flexibility needed to generate a range of applications without the large run-time overhead that standard libraries impose. Language features such as templates and aspects, which support techniques such as generic programming and aspect-oriented programming, are essential to creating these active libraries.
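As a rough illustration of the idea, here is a minimal C++ sketch in the spirit of an active library: the desired features are supplied as compile-time parameters, and the compiler generates only the code that the chosen configuration requires. The names and policy classes (TrackedValue, NoLogging, ConsoleLogging) are hypothetical and only hint at how a real active library would be parameterized on a much richer feature set.

   #include <iostream>

   // Hypothetical feature policies selected at compile time.
   struct NoLogging
   {
      static void log(const char*) {}                  // empty; optimized away
   };

   struct ConsoleLogging
   {
      static void log(const char* msg) { std::cout << msg << "\n"; }
   };

   // A tiny "active library" component: its behavior is generated at
   // compile time from the chosen policy parameter.
   template <typename T, typename LoggingPolicy = NoLogging>
   class TrackedValue
   {
   public:
      explicit TrackedValue(T initial) : value(initial) {}

      void set(T newValue)
      {
         LoggingPolicy::log("value changed");          // no run-time cost for NoLogging
         value = newValue;
      }

      T get() const { return value; }

   private:
      T value;
   };

   int main()
   {
      TrackedValue<int> silent(0);                     // no logging code generated
      TrackedValue<int, ConsoleLogging> verbose(0);    // logging code generated

      silent.set(1);
      verbose.set(2);

      std::cout << silent.get() + verbose.get() << "\n";
      return 0;
   }

A full active library would parameterize much more of its implementation, but the principle is the same: feature selection happens at compile time, so the generated application pays no run-time cost for features it does not use.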

While many of the necessary language features already exist, their use is hindered by difficult syntax and the limited layout capabilities of many current editors. One possible resolution to this problem is the current effort by IntentSoft to develop the Intentional Programming environment. Intentional Programming started as a research project at Microsoft; its main concept is to represent the software as a construct that preserves the intentions of the programmer, rather than forcing the programmer to translate those intentions into text. This is accomplished through an extensible environment of editors, compilers, debuggers, and other programming tools, which allows the concepts necessary for developing active libraries to be generated, edited, and viewed in a more intuitive fashion that greatly assists the development process. Additionally, Intentional Programming supports another goal of generative programming by allowing the source to be represented and stored using domain-specific terms and concepts. This allows the library developer to provide a structure that is directly related to the parameterization of the library based on domain-specific knowledge (Figure 2.3). This technology, and ones similar to it, should be kept in mind for adoption on future projects, particularly if reuse is important.

Figure 2.3: The editor, compiler, and debugger work together with one or more domain-specific plug-ins to allow the programmer to view and interact with the source in a domain-specific manner.

So far, we have been talking about what generative programming is rather than how it reduces information duplication. Obviously, since it relies on several of the language features that reduce code duplication, it is meant to reduce code duplication. With the introduction of Intentional Programming, however, we see the first glimpse of its use in reducing information duplication. Because the programming environment can be extended with domain-specific concepts, it can also be extended to support design knowledge. Thus, Intentional Programming provides the technology necessary to represent domain knowledge and design information directly in the source of the application or library.

Generative programming aims to go one step further, taking customer specifications at a high level, translating them into the correct parameters with which to instantiate the necessary active libraries, and then hooking everything together to form a complete application. Small slots are left for developers to implement code that is not available in the active libraries and therefore must be custom coded to the customer's individual request. Thus, code duplication is reduced through the use of active libraries and their associated language features and programming techniques. In addition, information duplication is reduced by incorporating domain and design knowledge into the source through systems such as Intentional Programming. Finally, information duplication is further reduced by creating a generative system that requests only a high-level feature description and a small amount of custom code. Human-editable information is reduced to the minimum necessary, reducing the possibility of human error and therefore project risk (Figure 2.4).

Figure 2.4: Generative programming requires human interaction only for the customer's feature specification and for the bits of custom code that are not available in the active libraries.

Note that generative programming as a whole is not for every project, but if you are creating applications for multiple customers that are similar in nature but with several variation points, then generative programming should be of great interest to you. Even if your applications tend to differ greatly, you can still benefit from the active library concepts that are a necessary foundation for generative programming. Active libraries are particularly important to overcoming the performance concerns that plague NIH Syndrome, another of the major illnesses.

One caveat does arise when adopting the various technologies of generative programming, such as active libraries. Many of these technologies require more processing power and can therefore increase compile times. This results from both their inherent complexity and the newness of the technology. As time passes, the speed of editors, compilers, and debuggers will increase and remove part of this overhead; the rest must be removed by increasing the power of the hardware used for development. Keep this in mind when deciding how to use these new technologies, as longer compile times, multiplied across the many changes and programmers on a project, can add substantial time to development.


