Recipe 3.15 Program: Soundex Name Comparisons


The difficulties in comparing (American-style) names inspired the development of the Soundex algorithm, in which each of a given set of consonants maps to a particular number. This was apparently devised for use by the Census Bureau to map similar-sounding names together on the grounds that in those days many people were illiterate and could not spell their family names consistently. But it is still useful today for example, in a company-wide telephone book application. The names Darwin and Derwin, for example, map to D650, and Darwent maps to D653, which puts it adjacent to D650. All of these are historical variants of the same name. Suppose we needed to sort lines containing these names together: if we could output the Soundex numbers at the beginning of each line, this would be easy. Here is a simple demonstration of the Soundex class:

/** Simple demonstration of Soundex.  */ public class SoundexSimple {     /** main */     public static void main(String[] args) {          String[] names = {             "Darwin, Ian",             "Davidson, Greg",             "Darwent, William",             "Derwin, Daemon"         };         for (int i = 0; i< names.length; i++)             System.out.println(Soundex.soundex(names[i]) + ' ' + names[i]);     } }

Let's run it:

> jikes +E -d . SoundexSimple.java > java SoundexSimple | sort D132 Davidson, Greg D650 Darwin, Ian D650 Derwin, Daemon D653 Darwent, William >

As you can see, the Darwin-variant names (including Daemon Derwin[5]) all sort together and are distinct from the Davidson (and Davis, Davies, etc.) names that normally appear between Darwin and Derwin when using a simple alphabetic sort. The Soundex algorithm has done its work.

[5] In Unix terminology, a daemon is a server. The word has nothing to do with demons but refers to a helper or assistant. Derwin Daemon is actually a character in Susannah Coleman's "Source Wars" online comic strip; see http://darby.daemonnews.org.

Here is the Soundex class itself: it uses Strings and StringBuilders to convert names into Soundex codes. A JUnit test (see Recipe 1.14) is also online as SoundexTest.java.

import com.darwinsys.util.Debug; /**  * Soundex - the Soundex Algorithm, as described by Knuth  * <p>  * This class implements the soundex algorithm as described by Donald  * Knuth in Volume 3 of <I>The Art of Computer Programming</I>.  The  * algorithm is intended to hash words (in particular surnames) into  * a small space using a simple model which approximates the sound of  * the word when spoken by an English speaker.  Each word is reduced  * to a four character string, the first character being an upper case  * letter and the remaining three being digits. Double letters are  * collapsed to a single digit.  *   * <h2>EXAMPLES</h2>  * Knuth's examples of various names and the soundex codes they map  * to are:  * <b>Euler, Ellery -> E460  * <b>Gauss, Ghosh -> G200  * <b>Hilbert, Heilbronn -> H416  * <b>Knuth, Kant -> K530  * <b>Lloyd, Ladd -> L300  * <b>Lukasiewicz, Lissajous -> L222  *   * <h2>LIMITATIONS</h2>  * As the soundex algorithm was originally used a <B>long</B> time ago  * in the United States of America, it uses only the English alphabet  * and pronunciation.  * <p>  * As it is mapping a large space (arbitrary length strings) onto a  * small space (single letter plus 3 digits) no inference can be made  * about the similarity of two strings which end up with the same  * soundex code.  For example, both "Hilbert" and "Heilbronn" end up  * with a soundex code of "H416".  * <p>  * The soundex( ) method is static, as it maintains no per-instance  * state; this means you never need to instantiate this class.  *  * @author Perl implementation by Mike Stok (<stok@cybercom.net>) from  * the description given by Knuth.  Ian Phillips (<ian@pipex.net>) and  * Rich Pinder (<rpinder@hsc.usc.edu>) supplied ideas and spotted  * mistakes.  */ public class Soundex {     /* Implements the mapping      * from: AEHIOUWYBFPVCGJKQSXZDTLMNR      * to:   00000000111122222222334556      */     public static final char[] MAP = {         //A  B   C   D   E   F   G   H   I   J   K   L   M         '0','1','2','3','0','1','2','0','0','2','2','4','5',         //N  O   P   W   R   S   T   U   V   W   X   Y   Z         '5','0','1','2','6','2','3','0','1','0','2','0','2'     };     /** Convert the given String to its Soundex code.      * @return null if the given string can't be mapped to Soundex.      */     public static String soundex(String s) {         // Algorithm works on uppercase (mainframe era).         String t = s.toUpperCase( );         StringBuffer res = new StringBuffer( );         char c, prev = '?';         // Main loop: find up to 4 chars that map.         for (int i=0; i<t.length( ) && res.length( ) < 4 &&             (c = t.charAt(i)) != ','; i++) {             // Check to see if the given character is alphabetic.             // Text is already converted to uppercase. Algorithm             // handles only ASCII letters; do NOT use Character.isLetter( )!             // Also, skip double letters.             if (c>='A' && c<='Z' && c != prev) {                 prev = c;                 // First char is installed unchanged, for sorting.                 if (i==0)                     res.append(c);                 else {                     char m = MAP[c-'A'];                     Debug.println("inner", c + " --> " + m);                     if (m != '0')                         res.append(m);                 }             }         }         if (res.length( ) == 0)             return null;         for (int i=res.length( ); i<4; i++)             res.append('0');         return res.toString( );     } }



Java Cookbook
Java Cookbook, Second Edition
ISBN: 0596007019
EAN: 2147483647
Year: 2003
Pages: 409
Authors: Ian F Darwin

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net