Friday, May 11, 2012

Soundex in DB2

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.

SOUNDEX converts an alphanumeric string to a four-character code that can be used to find similar-sounding words or names. The first character of the code is the first character of the input string (character_expression), and the second through fourth characters of the code are digits that represent the letters that follow. Vowels, along with H, W, and Y, are ignored unless they are the first letter of the string. Zeroes are added at the end if necessary to produce a four-character code.
The following table defines the numbers that represent the various letters.

Number     Represents the Letters
1          B, F, P, V
2          C, G, J, K, Q, S, X, Z
3          D, T
4          L
5          M, N
6          R
Ignored    A, E, I, O, U, H, W, and Y

For example, the SOUNDEX code for the expression 'Washington' is W252: W, 2 for the S, 5 for the N, 2 for the G. The remaining letters are disregarded.
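The table lookup and zero padding described above can be sketched in a few lines of Python. This is only an illustration of the basic scheme, not DB2's implementation: the names `SOUNDEX_TABLE`, `DIGIT_FOR`, and `basic_soundex` are made up for this example, and the sketch deliberately leaves out the rules for double letters and adjacent same-coded letters that are listed further down.

```python
# Letter groups copied from the table in this post.
SOUNDEX_TABLE = {
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}

# Invert the table so each letter maps to its digit.
DIGIT_FOR = {
    letter: digit
    for digit, letters in SOUNDEX_TABLE.items()
    for letter in letters
}


def basic_soundex(text):
    """Keep the first letter, map the rest through the table, pad to four.

    Assumption: this handles only the lookup and padding steps; it does
    not collapse double letters or adjacent letters with the same digit.
    """
    letters = [ch for ch in text.upper() if ch.isalpha()]
    if not letters:
        return ""
    # Letters missing from DIGIT_FOR (A, E, I, O, U, H, W, Y) are ignored.
    digits = [DIGIT_FOR[ch] for ch in letters[1:] if ch in DIGIT_FOR]
    # Pad with zeroes, then cut to one letter plus three digits.
    return (letters[0] + "".join(digits) + "000")[:4]


print(basic_soundex("Washington"))  # W252
```

Padding before truncating keeps the function to a single return expression: short names such as 'Lee' come out as L000, and long names are cut after the third digit.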


  1. Names with Double Letters: If the surname has any double letters, they should be treated as one letter. For example:
    • Gutierrez is coded G-362 (G, 3 for the T, 6 for the first R, second R ignored, 2 for the Z).
  2. Names with Letters Side-by-Side that have the Same Soundex Code Number: If the surname has different letters side-by-side that have the same number in the soundex coding guide, they should be treated as one letter. Examples:
    • Pfister is coded as P-236 (P, F ignored, 2 for the S, 3 for the T, 6 for the R).
    • Jackson is coded as J-250 (J, 2 for the C, K ignored, S ignored, 5 for the N, 0 added).
    • Tymczak is coded as T-522 (T, 5 for the M, 2 for the C, Z ignored, 2 for the K). Since the vowel "A" separates the Z and K, the K is coded.
  3. Names with Prefixes: If a surname has a prefix, such as Van, Con, De, Di, La, or Le, code both with and without the prefix because the surname might be listed under either code. Note, however, that Mc and Mac are not considered prefixes. For example:
    • VanDeusen might be coded two ways: V-532 (V, 5 for the N, 3 for the D, 2 for the S) or D-250 (D, 2 for the S, 5 for the N, 0 added).
  4. Consonant Separators: If a vowel (A, E, I, O, U) separates two consonants that have the same soundex code, the consonant to the right of the vowel is coded. Example:
    • Tymczak is coded as T-522 (T, 5 for the M, 2 for the C, Z ignored (see "Side-by-Side" rule above), 2 for the K). Since the vowel "A" separates the Z and K, the K is coded.

    If "H" or "W" separates two consonants that have the same soundex code, the consonant to the right of the "H" or "W" is not coded. Example:
    • Ashcraft is coded A-261 (A, 2 for the S, C ignored, 6 for the R, 1 for the F). It is not coded A-226.
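
The four rules above can be sketched in Python as follows. This is an illustrative sketch, not DB2's implementation, and the built-in SOUNDEX may differ in edge cases. The names `CODES`, `PREFIXES`, `soundex`, and `soundex_variants` are made up for this example, Y is assumed to act like a vowel, and prefix detection is a naive string match that will also fire on surnames such as 'Dean' that merely begin with a prefix.

```python
# Letter-to-digit table from this post; letters not listed get no digit.
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        CODES[letter] = digit

# Prefixes named in rule 3. Mc and Mac are deliberately absent.
PREFIXES = ("VAN", "CON", "DE", "DI", "LA", "LE")


def soundex(name):
    """Four-character code applying rules 1, 2, and 4."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    # The first letter's own digit can suppress the next letter (Pfister).
    prev = CODES.get(first)
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate same-coded consonants
        code = CODES.get(c)
        if code is None:
            prev = None  # a vowel (or Y, by assumption) resets the run
        elif code != prev:
            digits.append(code)
            prev = code
        # code == prev: double or side-by-side letter, so it is skipped
    return (first + "".join(digits) + "000")[:4]


def soundex_variants(surname):
    """Rule 3: code the surname with and without a leading prefix."""
    codes = [soundex(surname)]
    upper = surname.upper()
    for prefix in PREFIXES:
        rest = surname[len(prefix):]
        if upper.startswith(prefix) and rest:
            code = soundex(rest)
            if code and code not in codes:
                codes.append(code)
    return codes


for name in ("Gutierrez", "Pfister", "Jackson", "Tymczak", "Ashcraft"):
    print(name, soundex(name))
print("VanDeusen", soundex_variants("VanDeusen"))
```

Tracking only the previous digit is enough to cover rules 1, 2, and 4 at once: a vowel clears it, H and W leave it untouched, and a consonant is emitted only when its digit differs from it.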
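
In DB2, SOUNDEX can be used in a WHERE clause to find rows whose names sound like a given value. For example, the following query returns the customers whose NM_NAME has the same code as 'POLLY':

SELECT * FROM CUSTOMER
WHERE SOUNDEX(NM_NAME) = SOUNDEX('POLLY');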
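
To see which rows a query like this would match without a DB2 instance at hand, the same predicate can be tried against an in-memory SQLite database with a Python function registered under the name SOUNDEX. This is a stand-in under stated assumptions: the sample rows are invented, the Python `soundex` is a compact sketch of the rules in this post, and DB2's built-in SOUNDEX may return different codes in edge cases.

```python
import sqlite3

# Letter-to-digit table from this post.
DIGITS = {
    c: str(d)
    for d, group in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1)
    for c in group
}


def soundex(name):
    """Compact sketch of the rules in this post (not DB2's own code)."""
    letters = [c for c in (name or "").upper() if "A" <= c <= "Z"]
    if not letters:
        return None
    code, prev = letters[0], DIGITS.get(letters[0])
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate same-coded consonants
        digit = DIGITS.get(c)
        if digit is not None and digit != prev:
            code += digit
        prev = digit  # None after a vowel, so the next consonant is coded
    return (code + "000")[:4]


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call SOUNDEX(...).
conn.create_function("SOUNDEX", 1, soundex)
conn.execute("CREATE TABLE CUSTOMER (NM_NAME TEXT)")
# Hypothetical sample data, not taken from any real table.
conn.executemany(
    "INSERT INTO CUSTOMER VALUES (?)",
    [("POLLY",), ("PAULIE",), ("POHL",), ("PETER",), ("MOLLY",)],
)
rows = conn.execute(
    "SELECT NM_NAME FROM CUSTOMER "
    "WHERE SOUNDEX(NM_NAME) = SOUNDEX('POLLY') "
    "ORDER BY NM_NAME"
).fetchall()
matches = [row[0] for row in rows]
print(matches)  # ['PAULIE', 'POHL', 'POLLY']
```

'MOLLY' is not returned even though it rhymes with 'POLLY': the first letter is kept as-is in the code, so only names that start with the same letter can match.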
