Fuzzy Matching

Fuzzy matching is a technique for identifying similarities between strings that do not exactly match each other. Fuzzy matching lets you apply a threshold matching percentage, above which you can say that two strings "approximately match".

Fuzzy Matching in ETL

Fuzzy matching is useful for ETL data cleansing operations. For example, you may need to detect that personal names were entered into the system with different spellings but that they all refer to the same person.

There are many important algorithms that support fuzzy matching for various types of data matching. ETL provides a useful set of them with the STRINGDIST and PHONETIC functions.

STRINGDIST

STRINGDIST provides distance algorithms for strings to determine the differences, referred to as the "distance", between two strings.

STRINGDIST('Levenshtein|JaroWinkler|Qgram|Ngram|Jaccard|Cosine',[Dimension.Column1], [Dimension.Column2]

Options include:

PHONETIC

The PHONETIC function uses phonetic algorithms to provide a method of indexing strings by their pronunciation. These are useful for matching English-language words. Phonetic transformation algorithms to create a code that represents how a string sounds as it is pronounced. For example, if two strings have the same Metaphone code, they are likely to sound alike.

PHONETIC('Soundex|Metaphone|DoubleMetaphone',[Dimension.Column1], [Dimension.Column2]

Options include:

  • Soundex: The Soundex option encodes a string into a Soundex value. Soundex is an encoding used to relate similar names, but can also be used as a general purpose scheme to find word with similar phonemes. See the Wikipedia entry Soundex.
  • Metaphone: The Metaphone option encodes a string into a Metaphone value and is appropriate for basic English words. See the Wikipedia entry Metaphone.
  • DoubleMetaphone: The DoubleMetaphone option supports more complex English words that are based on other languages. See the Wikipedia entry Metaphone.

Example Results of ETL Fuzzy Matching Functions

The following example shows how to upload data that you want to match, and apply the fuzzy matching functions to a scripted source.

  1. Start with data similar to the following. It should have two columns of strings that almost match each other.
Col1 Col2
abcdd abbcd
ab bxa
abcde abdcde
Cosmo Kramer Cosmo Kramer
Kosmo Kramer Cosmo Kramer
Comso Kramer Cosmo Kramer
Csmo Kramer Cosmo Kramer
Cosmo X. Kramer Cosmo Kramer
Kramer, Cosmo Cosmo Kramer
Jerry Seinfeld Cosmo Kramer
CKaemmoorrs Cosmo Kramer
Cosmer Kramo Cosmo Kramer
Kosmoo Karme Cosmo Kramer
George Costanza Cosmo Kramer
Elaine Benes Cosmo Kramer
Dr. Van Nostren Cosmo Kramer
remarK omsoC Cosmo Kramer
Mr. Kramer Cosmo Kramer
Sir Cosmo Kramer Cosmo Kramer
C.o.s.m.o. .K.r.a.m.e.r Cosmo Kramer
CsoKae Cosmo Kramer
Coso Kraer Cosmo Kramer
  1. Upload it to an Advanced space and enable the source.
  2. In Admin - Define Sources - Manage Sources - Data Sources click the gear icon and select Add Script.
  3. Name the script and save it.
  4. Add the columns and click Save.
  5. Add the query and script. For example, the SELECT statement gets the data from the uploaded source:
    SELECT [MyNames.Col1], [MyNames.Col2] FROM [MyNames]
    The script runs the functions against the data:
    [Col1]=[MyNames.Col1]
    [Col2]=[MyNames.Col2]
    [Levenshtein]=StringDist('Levenshtein',[MyNames.Col1],[MyNames.Col2])
    [JaroWinkler]=StringDist('JaroWinkler',[MyNames.Col1],[MyNames.Col2])
    [Qgram]=StringDist('Qgram',[MyNames.Col1],[MyNames.Col2])
    [Ngram]=StringDist('Ngram',[MyNames.Col1],[MyNames.Col2])
    [Jaccard]=StringDist('Jaccard',[MyNames.Col1],[MyNames.Col2])
    [Cosine]=StringDist('Cosine',[MyNames.Col1],[MyNames.Col2])
    [Soundex]=Phonetic('Soundex',[MyNames.Col1])
    [Metaphone]=Phonetic('Metaphone',[MyNames.Col1])
    [DoubleMetaphone]=Phonetic('DoubleMetaphone',[MyNames.Col1])
    WRITERECORD
  6. Click Validate. Troubleshoot if needed, then click Save.
  7. Click Execute to see the results in the Raw Data tab.
  8. Use these functions in your data cleansing ETL, based on the data and the matching algorithm you need.