COMPGED Function


COMPGED Function

Compares two strings by computing the generalized edit distance

Category: Character

Syntax

COMPGED ( string-1 , string-2 <, cutoff ><, modifiers >)

Arguments

string-1

  • specifies a character constant, variable, or expression.

string-2

  • specifies a character constant, variable, or expression.

cutoff

  • is a numeric constant, variable, or expression. If the acutal generalized edit distance is greater than the value of cutoff , the value that is returned is equal to the value of cutoff .

  • Tip: Using a small value of cutoff improves the efficiency of COMPGED if the values of string-1 and string-2 are long.

modifiers

  • specifies a character string that can modify the action of the COMPGED function. You can use one or more of the following characters as a valid modifier:

    i or I

    ignores the case in string-1 and string-2 .

    l or L

    removes leading blanks in string-1 and string-2 before comparing the values.

    n or N

    removes quotation marks from any argument that is an n-literal and ignores the case of string-1 and string-2 .

    : ( colon )

    truncates the longer of string-1 or string-2 to the length of the shorter string, or to one, whichever is greater.

  • TIP: COMPGED ignores blanks that are used as modifiers.

Details

The Order in Which Modifiers Appear The order in which the modifiers appear in the COMPGED function is relevant.

  • 'LN' first removes leading blanks from each string and then removes quotation marks from n-literals.

  • 'NL' first removes quotation marks from n-literals and then removes leading blanks from each string.

Definition of Generalized Edit Distance Generalized edit distance is a generalization of Levenshtein edit distance, which is a measure of dissimilarity between two strings. The Levenshtein edit distance is the number of deletions, insertions, or replacements of single characters that are required to transform string-1 into string-2 .

Computing the Generalized Edit Distance The COMPGED function returns the generalized edit distance between string-1 and string-2 . The generalized edit distance is the minimum-cost sequence of operations for constructing string-1 from string-2 .

The algorithm for computing the sum of the costs involves a pointer that points to a character in string-2 (the input string). An output string is constructped by a sequence of operations that might advance the pointer, add one or more characters to the output string, or both. Initially, the pointer points to the first character in the input string, and the output string is empty.

The operations and their costs are described in the following table.

Operation

Default Cost in Units

Description of Operation

APPEND

50

When the output string is longer than the input string, add anyone character to the end of the output string without moving the pointer.

BLANK

10

Do any of the following:

  • Add one space character to the end of the output string without moving the pointer.

  • When the character at the pointer is a space character, advance the pointer by one position without changing the output string.

  • When the character at the pointer is a space character, add one space character to the end of the output string, and advance the pointer by one position.

If the cost for BLANK is set to zero by the COMPCOST function, the COMPGED function removes all space characters from both strings before doing the comparison.

DELETE

100

Advance the pointer by one position without changing the output string.

DOUBLE

20

Add the character at the pointer to the end of the output string without moving the pointer.

FDELETE

200

When the output string is empty, advance the pointer by one position without changing the output string.

FINSERT

200

When the pointer is in position one, add anyone character to the end of the output string without moving the pointer.

FREPLACE

200

When the pointer is in position one and the output string is empty, add anyone character to the end of the output string, and advance the pointer by one position.

INSERT

100

Add anyone character to the end of the output string without moving the pointer.

MATCH

Copy the character at the pointer from the input string to the end of the output string, and advance the pointer by one position.

PUNCTUATION

30

Do any of the following:

  • Add one punctuation character to the end of the output string without moving the pointer.

  • When the character at the pointer is a punctuation character, advance the pointer by one position without changing the output string.

  • When the character at the pointer is a punctuation character, add one punctuation character to the end of the output string, and advance the pointer by one position.

If the cost for PUNCTUATION is set to zero by the COMPCOST function, the COMPGED function removes all punctuation characters from both strings before doing the comparison.

REPLACE

100

Add any one character to the end of the output string, and advance the pointer by one position.

SINGLE

20

When the character at the pointer is the same as the character that follows in the input string, advance the pointer by one position without changing the output string.

SWAP

20

Copy the character that follows the pointer from the input string to the output string. Then copy the character at the pointer from the input string to the output string. Advance the pointer two positions .

TRUNCATE

10

When the output string is shorter than the input string, advance the pointer by one position without changing the output string.

To set the cost of the string operations, you can use the CALL COMPCOST routine or use default costs. If you use the default costs, the values that are returned by COMPGED are approximately 100 times greater than the values that are returned by COMPLEV.

Examples of Errors The rationale for determining the generalized edit distance is based on the number and kinds of typographical errors that can occur. COMPGED assigns a cost to each error and determines the minimum sum of these costs that could be incurred. Some kinds of errors can be more serious than others. For example, inserting an extra letter at the beginning of a string might be more serious than omitting a letter from the end of a string. For another example, if you type a word or phrase that exists in string-2 and introduce a typographical error, you might produce string-1 instead of string-2 .

Making the Generalized Edit Distance Symmetric Generalized edit distance is not necessarily symmetric. That is, the value that is returned by COMPGED(string1, string2) is not always equal to the value that is returned by COMPGED(string2, string1) . To make the generalized edit distance symmetric, use the CALL COMPCOST routine to assign equal costs to the operations within each of the following pairs:

  • INSERT, DELETE

  • FINSERT, FDELETE

  • APPEND, TRUNCATE

  • DOUBLE, SINGLE

Comparisons

You can compute the Levenshtein edit distance by using the COMPLEV function. You can compute the generalized edit distance by using the CALL COMPCOST routine and the COMPGED function. Computing generalized edit distance requires considerably more computer time than does computing Levenshtein edit distance. But generalized edit distance usually provides a more useful measure than Levenshtein edit distance for applications such as fuzzy file merging and text mining.

Examples

The following example uses the default costs to calculate the generalized edit distance.

 options nodate pageno=1 linesize=70 pagesize=60;  data test;     infile datalines missover;     input String1 $char8. +1 String2 $char8. +1 Operation .;     GED=compged(string1, string2);     datalines;  baboon   baboon   match  baXboon  baboon   insert  baoon    baboon   delete  baXoon   baboon   replace  baboonX  baboon   append  baboo    baboon   truncate  babboon  baboon   double  babon    baboon   single  baobon   baboon   swap  bab oon  baboon   blank  bab,oon  baboon   punctuation  bXaoon   baboon   insert+delete  bXaYoon  baboon   insert+replace  bXoon    baboon   delete+replace  Xbaboon  baboon   finsert  aboon    baboon   trick question: swap+delete  Xaboon   baboon   freplace  axoon    baboon   fdelete+replace  axoo     baboon   fdelete+replace+truncate  axon     baboon   fdelete+replace+single  baby     baboon   replace+truncate*2  balloon  baboon   replace+insert  ;  proc print data=test label;     label GED='Generalized Edit Distance';     var String1 String2 GED Operation;  run; 

The following output shows the results.

Output 4.21: Generalized Edit Distance Based on Operation
start example
 The SAS System                                1                                Generalized                                    Edit   Obs   String1     String2      Distance    Operation     1   baboon      baboon           0       match     2   baXboon     baboon         100       insert     3   baoon       baboon         100       delete     4   baXoon      baboon         100       replace     5   baboonX     baboon          50       append     6   baboo       baboon          10       truncate     7   babboon     baboon          20       double     8   babon       baboon          20       single     9   baobon      baboon          20       swap    10   bab oon     baboon          10       blank    11   bab,oon     baboon          30       punctuation    12   bXaoon      baboon         200       insert+delete    13   bXaYoon     baboon         200       insert+replace    14   bXoon       baboon         200       delete+replace    15   Xbaboon     baboon         200       finsert    16   aboon       baboon         200       trick question: swap+delete    17   Xaboon      baboon         200       freplace    18   axoon       baboon         300       fdelete+replace    19   axoo        baboon         310       fdelete+replace+truncate    20   axon        baboon         320       fdelete+replace+single    21   baby        baboon         120       replace+truncate*2    22   balloon     baboon         200       replace+insert 
end example
 

See Also

Functions:

  • 'COMPARE Function' on page 445

  • 'CALL COMPCOST Routine' on page 342

  • 'COMPLEV Function' on page 454




SAS 9.1 Language Reference Dictionary, Volumes 1, 2 and 3
SAS 9.1 Language Reference Dictionary, Volumes 1, 2 and 3
ISBN: N/A
EAN: N/A
Year: 2004
Pages: 704

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net