Recipe11.1.Creating a Hash Code for a Data Type

Recipe 11.1. Creating a Hash Code for a Data Type

Problem

You have created a class or structure that will be used as a key in a Hashtable or Dictionary<T,U>. You need to overload the GetHashCode method in order to return a good distribution of hash values (the Discussion section defines a good distribution of hash values). You also need to choose the best hash-code algorithm to use in the GetHashCode method of your object.

Solution

The following procedures implement hash-code algorithms and can be used to override the GetHashCode method. Included in the discussion of each method are the pros and cons of using it, as well as why you would want to use one instead of another.

In addition, it is desirable, for performance reasons, to use the return value of the GetHashCode method to determine whether the data contained within two objects is equal. Calling GetHashCode to return a hash value of two objects and comparing their hash values can be faster than calling the default implementation of Equals on the Object type, which individually tests the equality of all pertinent data within two objects. In fact, some developers even opt to compare hash-code values returned from GetHashCode within their overloaded Equals method. Using a custom implementation of the Equals method in this fashion is faster than the default implementation of the Object.Equals method.

The simple hash

This hash accepts a variable number of integer values and XORs each value to obtain a hash code. This simple algorithm has a good chance of producing an adequate distribution and good performance. Remember to profile and measure it to confirm that it works as well for your particular data set. It fails when you need to integrate values greater in size than an integer. Its code is:

 public int SimpleHash(params int[] values) {     int hashCode = 0;     if (values != null)     {         foreach (int val in values)         {             hashCode ^= val;         }     }     return (hashCode); }

The folding hash

This hash allows you to integrate the long data type into a hash algorithm. It takes the upper 32 bits of the long value and folds them over the lower 32 bits of this value. The actual process of folding the two values is implemented by XORing them and using the result. Once again, this is a good performing algorithm with good distribution properties, but, again, it fails when you need to go beyond the long data type. A sample implementation is:

 public int FoldingHash(params long[] values) {     int hashCode = 0;     if (values != null)     {         int tempLowerVal = 0;         int tempUpperVal = 0;         foreach (long val in values)         {             tempLowerVal = (int)(val & 0x000000007FFFFFFF);             tempUpperVal = (int)((val >> 32) & 0xFFFFFFFF);             hashCode^= tempLowerVal ^ tempUpperVal;         }     }     return (hashCode); }

The contained object cache

This hash obtains the hash codes from a variable number of object types. The only types that should be passed in to this method are reference-type fields contained within your object. This method XORs all the values returned by the GetHashCode method of each object. Its source code is:

 public int ContainedObjHash(params object[] values) {     int hashCode = 0;     if (values != null)     {         foreach (object val in values)         {             hashCode ^= val.GetHashCode( );         }     }     return (hashCode); }

The CryptoHash method

Potentially the best method of obtaining a hash value for an object is to use the hashing classes built in to the FCL. The CryptoHash method returns a hash value for some input using the MACTripleDES class. This method returns a very good distribution for the hash value, although you may pay for it in performance. If you do not require a near-perfect hash value and are looking for an excellent distribution, consider using this approach to calculate a hash value:

 private readonly byte[] Key = new byte[16] {1,122,3,11,65,7,9,45,42,98,                                             77,34,99,45,167,211}; public int CryptoHash(string strValue) {     int hashCode = 0;     if (strValue != null)     {         byte[] encodedUnHashedString =                 Encoding.Unicode.GetBytes(strValue);         // Replace the following key with your own         // key value.         MACTripleDES hashingObj = new MACTripleDES(Key);         byte[] code =                 hashingObj.ComputeHash(encodedUnHashedString);         // Use the BitConverter class to take the         // first 4 bytes, fold them over the last 4 bytes         // and use them as an int for the hash code.         int hashCodeStart = BitConverter.ToInt32(code, 0);         int hashCodeEnd = BitConverter.ToInt32(code, 4);         hashCode = hashCodeStart ^ hashCodeEnd;     }     return (hashCode); }

The CryptoHash method using a nonstring

This method shows how other, nonstring data types can be used with the built-in hashing classes to obtain a hash code. This method converts a numeric value to a string and then to a byte array. The array is then used to create the hash value using the SHA256Managed class. Finally, the first four values in the byte array are concatenated together to obtain a hash code. The code is:

 private readonly byte[] Key = new byte[16] {1,122,3,11,65,7,9,45,42,98,                                             77,34,99,45,167,211}; public int CryptoHash(long longValue) {     int hashCode = 0;     byte[] encodedUnHashedString =                 Encoding.Unicode.GetBytes(longValue.ToString( ));     MACTripleDES hashingObj = new MACTripleDES(Key);     byte[] code = hashingObj.ComputeHash(encodedUnHashedString);     // Use the BitConverter class to take the     // first 4 bytes, fold them over the last 4 bytes     // and use them as an int for the hash code.     int hashCodeStart = BitConverter.ToInt32(code, 0);     int hashCodeEnd = BitConverter.ToInt32(code, 4);     hashCode = hashCodeStart ^ hashCodeEnd;     return (hashCode); }

The shift and add hash

This method uses each character in the input string, strValue, to determine a hash value. This algorithm produces a good distribution of hash codes even when it is fed similar strings. However, it will break down when long strings that end with the same characters are passed. While this may not happen many times with your applications, it is something to be aware of. If performance is critical, this is an excellent method to use. Its code is:

 public int ShiftAndAddHash (string strValue) {     int hashCode = 0;     long workHashCode = 0;     if (strValue != null)     {         for (int counter=0; counter<strValue.Length; counter++)         {             workHashCode = (workHashCode << (counter % 4)) +                        (int)strValue[counter];         }         workHashCode = workHashCode % (127);     }     hashCode = (int)workHashCode;     return (hashCode); }

The calculated hash

This method is a rather widely accepted method of creating a good hash value that accepts several different data types and uses a different algorithm to compute the hash value for each. It calculates the hash code as follows:

It assigns an arbitrary odd primary number to the HashCode variable. This variable will eventually contain the final hash code. Good primary numbers to use are 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, or 67. Obviously, others exist beyond this set, but this should give you a good starting point.
For numeric types equal to or less than the size of an int and char data types, it multiplies the current HashCode by the primary number selected and then adds to this value the value of the numeric type cast to an integer.
For numeric types greater than the size of an int, it multiplies the current HashCode by the primary number selected and then adds to this the folded version of this numeric value. (For more information on folding, see "The folding hash" method earlier in this recipe.)
For char, floating-point, or decimal data types, it multiplies the current HashCode by the primary number selected, casts the numeric value to an integer, and then uses the folding method to calculate its value.
For bool data types, it multiplies the current HashCode by the primary number selected and then adds a 1 for true and 0 for false (you can reverse this behavior if you wish).
For object data types, it multiplies the current HashCode by the primary number selected and then adds the return value of GetHashCode called on this object. If an object is set to null, use the value 0 in your calculations.
For an array or collection, it determines the contained type(s) and uses each element of the array or collection to calculate the hash value, as follows (in the case of an integer array named MyArray):

 foreach (int element in myArray) { hashCode = (hashCode * 31) + element; }

This algorithm will produce a good distributed hash code for your object and has the added benefit of being able to employ any data type. This is a high-performing algorithm for simple, moderately complex, and even many complex objects. However, for extremely complex objectsones that contain many large arrays, large Hashtables, or other objects that use a slower hash-code algorithmthis algorithm will start performing badly. In this extreme case, you may want to consider switching to another hash-code algorithm to speed performance or simply paring down the amount of fields used in the calculation. Be careful if you choose this second method to increase performance; you could inadvertently cause the algorithm to produce similar values for differing objects. The code for the calculated hash method is:

 public int CalcHash(short someShort, int someInt, long someLong,                     float someFloat, object someObject) {     int hashCode = 7;     hashCode = hashCode * 31 + (int)someShort;     hashCode = hashCode * 31 + someInt;     hashCode = hashCode * 31 +                         (int)(someLong ^ (someLong >> 32));     long someFloatToLong = (long)someFloat;     hashCode = hashCode * 31 +              (int)(someFloatToLong ^ (someFloatToLong >> 32));     if (someObject != null)     {         hashCode = hashCode * 31 +                             someObject.GetHashCode( );     }     return (hashCode); }

The string-concatenation hash

This technique converts its input into a string and then uses that string's GetHashCode method to automatically generate a hash code for an object. It accepts an integer array, but you can substitute any type that can be converted into a string. You can also use several different types of arguments as input to this method. This method iterates through each integer in the array passed as an argument to the method. The ToString method is called on each value to return a string. The ToString method of an int data type returns the value contained in that int. Each string value is appended to the string variable HashString. Finally, the GetHashCode method is called on the HashString variable to return a suitable hash code.

This method is simple and efficient, but it does not work well with objects that have not overridden the ToString method to return something other than their data type. It may be best to simply call the GetHashCode method on each of these objects individually. You should use your own judgment and the rules found in this recipe to make your decision.

 public int ConcatStringGetHashCode(int[] someIntArray) {     int hashCode = 0;     StringBuilder hashString = new StringBuilder( );     if (someIntArray != null)     {         foreach (int i in someIntArray)         {             hashString.Append(i.ToString( ) + "^");         }     }     hashCode = hashString.GetHashCode( );     return (hashCode); }

The following using directives must be added to any file containing this code:

 using System; using System.Text; using System.Security.Cryptography;

Discussion

The GetHashCode method is called when you are using an instance of this class as the key in a Hashtable or Dictionary<T,U> object. Whenever your object is added to a Hashtable or Dictionary<T,U> as a key, its GetHashCode method is. A hash code is also obtained from your object when a search is performed for it in the Hashtable or Dictionary<T,U>.

The following class implements the SimpleHash algorithm for the overloaded GetHashCode method:

 public class SimpleClass {     private int x = 0;     private int y = 0;     public override int GetHashCode( )     {         return(SimpleHash(x, y));     }     private static int SimpleHash(params int[] values)     {         int hashCode = 0;         if (values != null)         {             foreach (int val in values)             {                 hashCode ^= val;             }         }         return (hashCode);     } }

This class can then be used as a key in a Hashtable or Dictionary<T,U> in code like the following:

 SimpleClass simpleClass = new SimpleClass( ); Hashtable hashTable = new Hashtable( ); hashTable.Add(simpleClass, 100); Dictionary<SimpleClass, int> dict = new Dictionary<SimpleClass, int>(); dict.Add(simpleClass, 100);

There are several rules for writing a good GetHashCode method and a good hash-code algorithm:

This method should return the same value for two different objects that have value equality. Value equality means that two objects contain the same data.
The hash algorithm should return a good distribution of values for the best performance in a Hashtable or Dictionary<T,U>. A good distribution of values means that the hash values returned by the GetHashCode method are usually different for objects of the same type, unless those objects have value equality. Note that objects containing very similar data should also return a unique hash value. This distribution allows the Hashtable or Dictionary<T,U> to work more efficiently and thus perform better.
This method should not throw an exception.
Both the Equals method and GetHashCode method should be overridden together.
The GetHashCode method should compute the hash code using the exact set of variables that the overridden Equals method uses when calculating equality.
The hash algorithm should be as fast as possible to speed up the process of adding and searching for keys in a Hashtable or Dictionary<T,U>.
Use the GetHashCode values of any contained objects when calculating the hash code of the parent object.
Use the GetHashCode values of all elements of an array when calculating the array's hash code.

The System.Int32, System.UInt32, and System.IntPtr data types in the FCL use an additional hash-code algorithm not covered in the Solution section. Basically, these data types return the value that they contain as a hash code. Most likely, your objects will not be so simple as to contain a single numeric value, but if they are, this method works extremely well.

You may also want to combine specific algorithms to suit your purposes. For instance, if your object contains one or more string types and one or more long data types, you can combine the ContainedObjHash method and the FoldingHash method to create a hash value for your object. The return values from each method can either be added or XORed together.

Once an object is in use as a key in a Hashtable or Dictionary<T,U>, it should never return a different value for the hash code. Originally, it was documented that hash codes must be immutable, as the authors of Hashtable or Dictionary<T,U> thought that this should be dealt with by whoever writes GetHashCode. It doesn't take much thought to realize that for mutable types, if you require both that the hash code never changes and that Equals represents the equality of the mutable objects and that if a.Equals(b), then a.GetHashCode( ) == b.GetHashCode( ), then the only possible implementation of GetHashCode is one that returns the same integer constant for all values.

The GetHashCode method is called when you are using this object as the key in a Hashtable or Dictionary<T,U> object. Whenever your object is added to a Hashtable or Dictionary<T,U> as a key, the GetHashCode method is called on your object to obtain a hash code. This hash code must not change while your object is a key in the Hashtable or Dictionary<T,U>. If it does, the Hashtable or Dictionary<T,U> will not be able to find your object.