Determines the likelihood of two words matching, expressed as the asymmetric spelling distance between the two words.
Category: Character
SPEDIS ( query , keyword )
query
identifies the word to query for the likelihood of a match. SPEDIS removes trailing blanks before comparing the value.
keyword
specifies a target word for the query. SPEDIS removes trailing blanks before comparing the value.
SPEDIS returns the distance between the query and a keyword, a nonnegative value that is usually less than 100 but never greater than 200 with the default costs.
SPEDIS computes an asymmetric spelling distance between two words as the normalized cost for converting the keyword to the query word by using a sequence of operations. SPEDIS( QUERY , KEYWORD ) is not the same as SPEDIS( KEYWORD , QUERY ).
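The asymmetry can be checked directly by calling the function with the arguments in both orders. The following is a minimal sketch, not part of the original entry; the variable names `d_qk` and `d_kq` are illustrative. It writes both distances to the log so that they can be compared.

```sas
/* Sketch: compare both argument orders for one pair of words. */
data _null_;
   query   = 'fuzy';
   keyword = 'fuzzy';
   d_qk = spedis(query, keyword);   /* cost of converting keyword to query */
   d_kq = spedis(keyword, query);   /* same words, roles reversed          */
   put d_qk= d_kq=;                 /* write both distances to the log     */
run;
```

Because the conversion runs in opposite directions in the two calls, the operations that are charged, and the length used for normalization, can differ, so the two values are not expected to agree.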
The costs for each operation that is required to convert the keyword to the query are as follows:
Operation | Cost | Explanation |
---|---|---|
match | 0 | no change |
singlet | 25 | delete one of a double letter |
doublet | 50 | double a letter |
swap | 50 | reverse the order of two consecutive letters |
truncate | 50 | delete a letter from the end |
append | 35 | add a letter to the end |
delete | 50 | delete a letter from the middle |
insert | 100 | insert a letter in the middle |
replace | 100 | replace a letter in the middle |
firstdel | 100 | delete the first letter |
firstins | 200 | insert a letter at the beginning |
firstrep | 200 | replace the first letter |
The distance is the sum of the costs divided by the length of the query. If this ratio is greater than one, the result is rounded down to the nearest whole number.
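As a worked check of this rule, the arithmetic can be written out for two of the query and keyword pairs used in the example in this entry. The pairs, costs, and resulting distances come from the cost table and the example output; the floor notation is only shorthand for "rounded down to the nearest whole number", and both ratios here are greater than one.

```latex
\[
\text{distance} =
  \left\lfloor \frac{\sum \text{operation costs}}{\mathrm{length}(\text{query})} \right\rfloor
\]
\[
\text{singlet (query fuzy, keyword fuzzy):}\quad
  \left\lfloor \frac{25}{4} \right\rfloor = \lfloor 6.25 \rfloor = 6
\]
\[
\text{firstins (query pfuzzy, keyword fuzzy):}\quad
  \left\lfloor \frac{200}{6} \right\rfloor = \lfloor 33.33\ldots \rfloor = 33
\]
```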
The SPEDIS function is similar to the COMPLEV and COMPGED functions, but COMPLEV and COMPGED are much faster, especially for long strings.
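To see how the three functions behave on the same pair of words, they can be called side by side. This is a minimal sketch that assumes the two-argument forms of COMPLEV and COMPGED with their default settings; the variable names are illustrative.

```sas
/* Sketch: call SPEDIS, COMPLEV, and COMPGED on the same pair of words. */
data _null_;
   query   = 'fuzy';
   keyword = 'fuzzy';
   spedis_d  = spedis(query, keyword);    /* normalized spelling distance */
   complev_d = complev(query, keyword);   /* Levenshtein edit distance    */
   compged_d = compged(query, keyword);   /* generalized edit distance    */
   put spedis_d= complev_d= compged_d=;   /* write all three to the log   */
run;
```

The three functions use different cost schemes, and only SPEDIS divides by the length of the query, so the values are on different scales and should not be compared directly with one another.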

    options nodate pageno=1 linesize=64;

    data words;
       input Operation $ Query $ Keyword $;
       Distance = spedis(query,keyword);
       Cost = distance * length(query);
       datalines;
    match    fuzzy   fuzzy
    singlet  fuzy    fuzzy
    doublet  fuuzzy  fuzzy
    swap     fzuzy   fuzzy
    truncate fuzz    fuzzy
    append   fuzzys  fuzzy
    delete   fzzy    fuzzy
    insert   fluzzy  fuzzy
    replace  fizzy   fuzzy
    firstdel uzzy    fuzzy
    firstins pfuzzy  fuzzy
    firstrep wuzzy   fuzzy
    several  floozy  fuzzy
    ;

    proc print data = words;
    run;

The program produces the following output.

                         The SAS System                        1

    Obs    Operation    Query     Keyword    Distance    Cost

      1    match        fuzzy     fuzzy          0          0
      2    singlet      fuzy      fuzzy          6         24
      3    doublet      fuuzzy    fuzzy          8         48
      4    swap         fzuzy     fuzzy         10         50
      5    truncate     fuzz      fuzzy         12         48
      6    append       fuzzys    fuzzy          5         30
      7    delete       fzzy      fuzzy         12         48
      8    insert       fluzzy    fuzzy         16         96
      9    replace      fizzy     fuzzy         20        100
     10    firstdel     uzzy      fuzzy         25        100
     11    firstins     pfuzzy    fuzzy         33        198
     12    firstrep     wuzzy     fuzzy         40        200
     13    several      floozy    fuzzy         50        300

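One way to use these distances is to keep only the queries that fall within a chosen distance of the keyword. The sketch below reuses the `words` data set created by the example program; the cutoff of 15 is an arbitrary illustrative value, not a recommended setting.

```sas
/* Sketch: list only the rows whose distance is within a cutoff. */
/* The cutoff value 15 is an arbitrary choice for illustration.  */
proc print data = words;
   where Distance <= 15;
run;
```

With the distances shown in the output above, this cutoff keeps observations 1 through 7 and drops the rest.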
Functions:
COMPLEV Function
COMPGED Function