Fuzzy Matching

String similarity and comparison functions for row-wise operations.

Usage


| name_a    | name_b   | levenshtein | levenshtein_norm | soundex_match |
|-----------|----------|-------------|------------------|---------------|
| john      | jon      | 1           | 0.75             | true          |
| smith     | smyth    | 1           | 0.80             | true          |
| acme corp | acme inc | 4           | 0.56             | false         |
| robert    | bob      | 4           | 0.33             | false         |

Installation

datacompose add fuzzy_matching

API Reference

Utility Functions

fuzzy.levenshtein

Calculate Levenshtein edit distance between two strings. The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, substitutions) required to transform one string into another.

Parameters

| Property        | Type   | Description          |
|-----------------|--------|----------------------|
| col1 (required) | Column | First string column  |
| col2 (required) | Column | Second string column |
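
The definition above can be sketched in plain Python with the usual two-row dynamic-programming table. This is only an illustration of the metric on ordinary strings, not datacompose's column-based implementation; the function name `levenshtein` here is a stand-alone helper written for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    # previous[j] is the distance between the prefix of a seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("john", "jon"))        # 1
print(levenshtein("kitten", "sitting"))  # 3
```

Keeping only two rows holds memory at O(len(b)) while the running time stays O(len(a) * len(b)).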

fuzzy.levenshtein_normalized

Calculate normalized Levenshtein similarity (0.0 to 1.0). Returns a similarity score where 1.0 means identical strings and 0.0 means completely different. Calculated as: 1 - (levenshtein_distance / max(len(str1), len(str2)))

Parameters

| Property        | Type   | Description          |
|-----------------|--------|----------------------|
| col1 (required) | Column | First string column  |
| col2 (required) | Column | Second string column |
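
As a worked instance of the formula, the sketch below applies it to a precomputed edit distance. The helper name `levenshtein_similarity` is hypothetical, and returning 1.0 for two empty strings is an assumption: the formula alone would divide by zero, and the library's behavior in that case is not documented here.

```python
def levenshtein_similarity(distance: int, a: str, b: str) -> float:
    """Apply 1 - distance / max(len(a), len(b)) to a precomputed edit distance."""
    longest = max(len(a), len(b))
    if longest == 0:
        # Assumption: two empty strings count as identical (the formula gives 0/0).
        return 1.0
    return 1.0 - distance / longest


print(levenshtein_similarity(1, "john", "jon"))                      # 0.75
print(round(levenshtein_similarity(4, "acme corp", "acme inc"), 2))  # 0.56
```

Because the distance is divided by the longer length, the score always stays within 0.0 to 1.0.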

fuzzy.levenshtein_threshold

Check whether the normalized Levenshtein similarity meets the threshold.

Parameters

| Property             | Type   | Description              |
|----------------------|--------|--------------------------|
| col1 (required)      | Column | First string column      |
| col2 (required)      | Column | Second string column     |
| threshold (required) | Column | Minimum similarity score |

fuzzy.soundex

Calculate Soundex phonetic encoding of a string. Soundex encodes a string into a letter followed by three digits, representing how the word sounds in English.

Parameters

| Property       | Type   | Description             |
|----------------|--------|-------------------------|
| col (required) | Column | String column to encode |
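
To make the "letter followed by three digits" encoding concrete, here is a minimal pure-Python sketch of the classic American Soundex rules. It assumes ASCII input and is not the library's code, so results may differ at edge cases such as punctuation or non-English letters.

```python
# Consonant groups of the classic American Soundex table.
GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
CODES = {letter: digit for letters, digit in GROUPS.items() for letter in letters}


def soundex(word: str) -> str:
    """Encode a word as its first letter plus three digits."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]  # assumption: ASCII only
    if not letters:
        return ""
    encoded = letters[0]
    last_digit = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters that share a code
        digit = CODES.get(letter, "")
        if digit and digit != last_digit:
            encoded += digit
        last_digit = digit  # a vowel resets this, so a repeated code counts again
    return (encoded + "000")[:4]  # pad with zeros, then truncate to four characters


print(soundex("Smith"), soundex("Smyth"))  # S530 S530
print(soundex("Robert"))                   # R163
```

Two strings then "sound alike" under this scheme exactly when their four-character codes are equal.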

fuzzy.soundex_match

Check if two strings have the same Soundex encoding. Useful for matching names that sound alike but are spelled differently (e.g., "Smith" and "Smyth").

Parameters

| Property        | Type   | Description          |
|-----------------|--------|----------------------|
| col1 (required) | Column | First string column  |
| col2 (required) | Column | Second string column |

fuzzy.jaccard_similarity

Calculate Jaccard similarity between tokenized strings. Splits both strings into tokens and calculates |intersection| / |union| of the two token sets. Useful for comparing multi-word strings where word order doesn't matter.

Parameters

| Property             | Type   | Description          |
|----------------------|--------|----------------------|
| col1 (required)      | Column | First string column  |
| col2 (required)      | Column | Second string column |
| delimiter (required) | Column | Token delimiter      |
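
The token-set calculation can be illustrated on plain Python strings as follows. This is a sketch rather than the library's implementation: dropping empty tokens and returning 0.0 when neither string has any tokens are assumptions made for the example.

```python
def jaccard_similarity(a: str, b: str, delimiter: str) -> float:
    """|intersection| / |union| of the two token sets."""
    tokens_a = {token for token in a.split(delimiter) if token}
    tokens_b = {token for token in b.split(delimiter) if token}
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # assumption: no tokens on either side scores 0.0
    return len(tokens_a & tokens_b) / len(union)


print(jaccard_similarity("new york city", "city new york", " "))  # 1.0
print(jaccard_similarity("acme corp", "acme inc", " "))           # 0.3333333333333333
```

The first call shows why word order does not matter: both strings reduce to the same set of three tokens.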

fuzzy.token_overlap

Count the number of overlapping tokens between two strings.

Parameters

| Property             | Type   | Description          |
|----------------------|--------|----------------------|
| col1 (required)      | Column | First string column  |
| col2 (required)      | Column | Second string column |
| delimiter (required) | Column | Token delimiter      |
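
A minimal sketch of the count, assuming that "overlapping tokens" means distinct tokens present in both strings (a token repeated within one string is counted once). Whether the library counts duplicates differently is not stated here.

```python
def token_overlap(a: str, b: str, delimiter: str) -> int:
    """Number of distinct tokens that appear in both strings."""
    tokens_a = {token for token in a.split(delimiter) if token}
    tokens_b = {token for token in b.split(delimiter) if token}
    return len(tokens_a & tokens_b)


print(token_overlap("acme data corp", "acme corp inc", " "))  # 2
```

Unlike Jaccard similarity, this is a raw count, so longer strings can score higher simply by containing more tokens.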

fuzzy.exact_match

Check if two strings match exactly.

Parameters

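| Property               | Type   | Description                             |
|------------------------|--------|-----------------------------------------|
| col1 (required)        | Column | First string column                     |
| col2 (required)        | Column | Second string column                    |
| ignore_case (required) | Column | If True, comparison is case-insensitive |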

fuzzy.contains_match

Check if one string contains the other. Returns True if col1 contains col2 OR col2 contains col1.

Parameters

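| Property               | Type   | Description                             |
|------------------------|--------|-----------------------------------------|
| col1 (required)        | Column | First string column                     |
| col2 (required)        | Column | Second string column                    |
| ignore_case (required) | Column | If True, comparison is case-insensitive |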
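
The symmetric check described above can be sketched on plain strings like this. Lowercasing both sides when `ignore_case` is set is an assumption about how case-insensitivity is applied; the library operates on columns and may normalize differently.

```python
def contains_match(a: str, b: str, ignore_case: bool) -> bool:
    """True if either string is a substring of the other."""
    if ignore_case:
        a, b = a.lower(), b.lower()  # assumption: simple lowercasing
    return a in b or b in a


print(contains_match("Acme Corporation", "acme", ignore_case=True))   # True
print(contains_match("acme", "Acme Corporation", ignore_case=True))   # True
print(contains_match("Acme Corporation", "acme", ignore_case=False))  # False
```

Note that in this sketch an empty string matches everything, since the empty string is a substring of any string.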

fuzzy.prefix_match

Check if two strings share the same prefix.

Parameters

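| Property          | Type   | Description                     |
|-------------------|--------|---------------------------------|
| col1 (required)   | Column | First string column             |
| col2 (required)   | Column | Second string column            |
| length (required) | Column | Number of characters to compare |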
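
A sketch of the comparison, assuming it looks only at the first `length` characters of each string. How the library treats strings shorter than `length` is not stated here; this version simply compares whatever characters are available.

```python
def prefix_match(a: str, b: str, length: int) -> bool:
    """True if the first `length` characters of both strings are equal."""
    return a[:length] == b[:length]


print(prefix_match("jonathan", "jonas", 3))    # True
print(prefix_match("jonathan", "johnson", 3))  # False
```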

fuzzy.ngram_similarity

Calculate n-gram (character-level) similarity between two strings. Breaks strings into overlapping character sequences of length n, then calculates Jaccard similarity on the n-gram sets. Good for catching typos and character-level variations.

Parameters

| Property        | Type   | Description          |
|-----------------|--------|----------------------|
| col1 (required) | Column | First string column  |
| col2 (required) | Column | Second string column |
| n (required)    | Column | Size of n-grams      |
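
The two steps (extract overlapping character n-grams, then take Jaccard similarity of the sets) might look like this on plain strings. The helper names are hypothetical, no padding characters are added at the string boundaries, and returning 0.0 when neither string yields an n-gram is an assumption.

```python
def ngrams(text: str, n: int) -> set:
    """Set of overlapping character sequences of length n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a: str, b: str, n: int) -> float:
    """Jaccard similarity of the two n-gram sets."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0  # assumption: both strings shorter than n
    return len(grams_a & grams_b) / len(union)


print(sorted(ngrams("smith", 2)))             # ['it', 'mi', 'sm', 'th']
print(ngram_similarity("smith", "smyth", 2))  # 0.3333333333333333
```

Here "smith" and "smyth" share the bigrams "sm" and "th" out of six distinct bigrams, giving 2/6. Smaller values of n are more forgiving of single-character typos.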

fuzzy.ngram_distance

Calculate n-gram distance (1 - similarity) between two strings.

Parameters

| Property        | Type   | Description          |
|-----------------|--------|----------------------|
| col1 (required) | Column | First string column  |
| col2 (required) | Column | Second string column |
| n (required)    | Column | Size of n-grams      |

fuzzy.cosine_similarity

Calculate cosine similarity between tokenized strings. Treats each string as a bag of words and computes cosine similarity based on term frequency. Good for comparing longer text.

Parameters

| Property             | Type   | Description          |
|----------------------|--------|----------------------|
| col1 (required)      | Column | First string column  |
| col2 (required)      | Column | Second string column |
| delimiter (required) | Column | Token delimiter      |
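
The bag-of-words calculation can be sketched in plain Python with term-frequency counts. This is an illustration of the metric, not the library's code; returning 0.0 when either string has no tokens is an assumption.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str, delimiter: str) -> float:
    """Cosine of the angle between the two term-frequency vectors."""
    counts_a = Counter(token for token in a.split(delimiter) if token)
    counts_b = Counter(token for token in b.split(delimiter) if token)
    # Only tokens present in both strings contribute to the dot product.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty bag of words matches nothing
    return dot / (norm_a * norm_b)


score = cosine_similarity("data data pipeline", "data pipeline", " ")
print(round(score, 4))  # 0.9487
```

Unlike Jaccard similarity, repeated tokens change the score: the doubled "data" in the first string pulls the result below 1.0 even though both strings use the same vocabulary.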