Section 11.4. Equivalence


11.4. Equivalence

My name is Gerald Knight. Gerald D Knight is also my name. They are not equal, but since they both mean me, they are equivalent. This happens with the names of people and businesses, with addresses, and with common words like state names or days of the week. When data is entered by hand or comes from more than one computer system, equivalence problems can make it impossible to use.

There is a solution. When you use a spellchecking program, one of the features is word suggestion . If the program does not recognize the word you typed, it suggests similar words. So the spellcheck program knows how to measure the similarity between two strings of characters. It measures equivalence.

We define equivalence as the percentage of characters from one string that occur in another string in the same order. And we can build a custom Function that does the same thing in Excel.

The details of the algorithm are beyond the scope of this book. Basically, it builds a matrix with one string across the top and the other down the left side. The code counts places in the matrix where the letters for both words are the same. It works its way through the strings backwards, and keeps up with the number of matches as it goes. Here is the code:

 Public Function Str_Comp(st1 As String, st2 As String) As Double Dim MtchTbl(100, 100) Dim MyMax, ThisMax As Double Dim i, j, ii, jj As Integer ' Remove leading and trailing spaces and ' set to proper case st1 = Trim(Application.WorksheetFunction.Proper(st1)) st2 = Trim(Application.WorksheetFunction.Proper(st2)) ' mymax will be the number of letters in st1 that ' occur in st2 in the same order MyMax = 0 For i = Len(st1) To 1 Step -1     For j = Len(st2) To 1 Step -1         If Mid(st1, i, 1) = Mid(st2, j, 1) Then             ThisMax = 0             For ii = i + 1 To Len(st1)                 For jj = j + 1 To Len(st2)                     If MtchTbl(ii, jj) > ThisMax Then                         ThisMax = MtchTbl(ii, jj)                     End If                 Next jj             Next ii             MtchTbl(i, j) = ThisMax + 1             If ThisMax + 1 > MyMax Then                 MyMax = ThisMax + 1             End If         End If     Next j Next i ' divide mymax by the length of st1 ' to get the percentage match Str_Comp = MyMax / Len(st1) End Function 

This is a function so it can be used on the spreadsheet. The arguments are both strings. The function returns the percentage of characters in the first string that occur in the second string in the same order.

The formula =Str_Comp("afce","abcdefghi") returns a value of 0.75. There are four characters in "afce" and three of them (a, c, and e) are in the second string in the same order. The letter f is in both strings but not in the same order. The Str_Comp function processes string in proper case. So it considers the first character of a word to be uppercase and the others to be lowercase.

Figure 11-7 shows how this function is used. The entry in A1 has to be a month, but the data was entered by hand so we cannot be sure what will actually be in the cell. In the range E1:E12 are the correct names of the months. The entry in A1 has to be equivalent to one of these. The formula in F1 is =Str_Comp(A$1,E1). This is filled down to F12. The values in F1:F12 are the equivalence scores between A1 and each of the months.

Figure 11-7. Working with the Str_Comp function


In this example the match with November is 100%. This is because all of the characters in A1 occur in November in the same order. The formula in cell B4 is =MAX(F1:F12). This is the value of the best match we got. Cell B5 is =MATCH(B4,F1:F12,0). This tells us which item in the range F1:F12 contains the maximum value. In this case it is 11 because the maximum value (1) is in row 11. Cell B6 checks to be sure the maximum value only occurs once in F1:F12. If the maximum value is in the range more than once, we cannot decide which month is indicated since we have a tie. The formula is =COUNTIF(F1:F12,"=" & B4). We only have a good answer if this value is 1.

Cell B1 has the final answer. It contains =IF(B6=1,INDEX(E1:E12,B5),"Unknown"). If the value in B6 is 1, this formula retrieves the name of the month from the range E1:E12. If B6 is not 1, Unknown is displayed.

This technique can be used to clean up manually keyed data or data coming from an OCR operation. It can help when sets of data from different systems are to be merged, such as two customer files. Variations in addresses and names can be resolved.

Equivalence problems can involve large amounts of data. Merging data from different systems can result in thousands of possible matches. The Str_Comp function can be used inside other macros to read and match large files.



Analyzing Business Data with Excel
Analyzing Business Data with Excel
ISBN: 0596100736
EAN: 2147483647
Year: 2006
Pages: 101
Authors: Gerald Knight

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net