Fuzzy Matching
String similarity and comparison functions for row-wise operations.
Usage
| name_a | name_b | levenshtein | levenshtein_norm | soundex_match |
|---|---|---|---|---|
| john | jon | 1 | 0.75 | true |
| smith | smyth | 1 | 0.80 | true |
| acme corp | acme inc | 4 | 0.56 | false |
| robert | bob | 4 | 0.33 | false |
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from transformers.pyspark.fuzzy_matching import fuzzy

# Initialize Spark
spark = SparkSession.builder.appName("FuzzyMatching").getOrCreate()

# Create sample data
data = [
    ("john", "jon"),
    ("smith", "smyth"),
    ("acme corp", "acme inc"),
]
df = spark.createDataFrame(data, ["name_a", "name_b"])

# Compare strings
result_df = df.select(
    F.col("name_a"),
    F.col("name_b"),
    fuzzy.levenshtein(F.col("name_a"), F.col("name_b")).alias("distance"),
    fuzzy.levenshtein_normalized(F.col("name_a"), F.col("name_b")).alias("similarity"),
    fuzzy.soundex_match(F.col("name_a"), F.col("name_b")).alias("soundex_match"),
)

# Filter to similar matches
similar = result_df.filter(F.col("similarity") >= 0.8)
```
Installation
```bash
datacompose add fuzzy_matching
```
API Reference
Utility Functions
fuzzy.levenshtein
Calculate Levenshtein edit distance between two strings. The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, substitutions) required to transform one string into another.
Parameters
| Property | Type | Description |
|---|---|---|
| col1 (required) | Column | First string column |
| col2 (required) | Column | Second string column |
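To make the definition concrete, here is a minimal pure-Python sketch of the edit-distance computation using the classic dynamic-programming recurrence. It is a reference illustration only, not the code behind `fuzzy.levenshtein`; the helper name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # prev[j] is the distance between the already-processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(
                prev[j] + 1,         # delete ch_a
                curr[j - 1] + 1,     # insert ch_b
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("john", "jon"))      # 1
print(edit_distance("smith", "smyth"))   # 1
print(edit_distance("robert", "bob"))    # 4
```

Keeping only two rows of the table at a time holds memory at O(len(b)) while the running time stays O(len(a) * len(b)).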
fuzzy.levenshtein_normalized
Calculate normalized Levenshtein similarity (0.0 to 1.0). Returns a similarity score where 1.0 means identical strings and 0.0 means completely different. Calculated as `1 - (levenshtein_distance / max(len(str1), len(str2)))`.
Parameters
| Property | Type | Description |
|---|---|---|
| col1 (required) | Column | First string column |
| col2 (required) | Column | Second string column |
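The normalization formula can be checked by hand with a small pure-Python sketch. The helper names are hypothetical, and returning 1.0 for two empty strings is an assumption made here to avoid dividing by zero; the library may handle that case differently.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_similarity(a: str, b: str) -> float:
    """1 - distance / max(len(a), len(b)), as in the formula above."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - edit_distance(a, b) / longest


print(round(normalized_similarity("john", "jon"), 2))            # 0.75
print(round(normalized_similarity("smith", "smyth"), 2))         # 0.8
print(round(normalized_similarity("acme corp", "acme inc"), 2))  # 0.56
```

For "acme corp" and "acme inc" the distance is 4 and the longer string has 9 characters, so the score is 1 - 4/9, roughly 0.56.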
fuzzy.levenshtein_threshold
Check whether the normalized Levenshtein similarity of two strings meets a minimum threshold.
Parameters
| Property | Type | Description |
|---|---|---|
| col1 (required) | Column | First string column |
| col2 (required) | Column | Second string column |
| threshold (required) | Column | Minimum similarity score |
fuzzy.soundex
Calculate Soundex phonetic encoding of a string. Soundex encodes a string into a letter followed by three digits, representing how the word sounds in English.
Parameters
| Property | Type | Description |
|---|---|---|
| col (required) | Column | String column to encode |
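As an illustration of the encoding, the sketch below implements the standard American Soundex rules in plain Python. It is an assumption that `fuzzy.soundex` follows these same rules in every edge case (empty input, non-letter characters); the function here is a hypothetical reference, not the library's code.

```python
# Digit groups used by standard American Soundex.
_CODES = {}
for _letters, _digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return a letter followed by three digits, or "" if word has no ASCII letters."""
    letters = [ch for ch in word.lower() if ch.isascii() and ch.isalpha()]
    if not letters:
        return ""  # assumption: nothing to encode
    encoded = letters[0].upper()
    last_code = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters that share a code
        code = _CODES.get(ch, "")
        if code and code != last_code:
            encoded += code
        last_code = code  # vowels reset last_code, so a repeat after a vowel is kept
    return (encoded + "000")[:4]  # pad or truncate to four characters


print(soundex("Smith"), soundex("Smyth"))  # S530 S530
print(soundex("Robert"), soundex("Bob"))   # R163 B100
```

Two strings then count as a Soundex match when their codes are equal, which is why "Smith" and "Smyth" match while "Robert" and "Bob" do not.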
fuzzy.soundex_match
Check if two strings have the same Soundex encoding. Useful for matching names that sound alike but are spelled differently (e.g., "Smith" and "Smyth").
Parameters
| Property | Type | Description |
|---|---|---|
| col1 (required) | Column | First string column |
| col2 (required) | Column | Second string column |
fuzzy.jaccard_similarity
Calculate Jaccard similarity between tokenized strings. Splits both strings into tokens and calculates `|intersection| / |union|`. Useful for comparing multi-word strings where word order doesn't matter.
Parameters
| Property | Type | Description |
|---|---|---|
| col1 (required) | Column | First string column |
| col2 (required) | Column | Second string column |
| delimiter (required) | Column | Token delimiter |
fuzzy.token_overlap
Count the number of tokens that two strings have in common.
Parameters
| Property | Type | Description |
|---|---|---|
| col1 (required) | Column | First string column |
| col2 (required) | Column | Second string column |
| delimiter (required) | Column | Token delimiter |
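The two token-based measures above can be sketched in plain Python with sets. The function names mirror the documented ones but are hypothetical reimplementations. Several details are assumptions: the space default for `delimiter`, dropping empty tokens, counting each distinct token once, and returning 0.0 when neither string has any tokens.

```python
def _tokens(text: str, delimiter: str) -> set:
    """Split on the delimiter and drop empty tokens (assumption)."""
    return {tok for tok in text.split(delimiter) if tok}


def token_overlap(a: str, b: str, delimiter: str = " ") -> int:
    """Number of distinct tokens present in both strings."""
    return len(_tokens(a, delimiter) & _tokens(b, delimiter))


def jaccard_similarity(a: str, b: str, delimiter: str = " ") -> float:
    """|intersection| / |union| of the two token sets."""
    tokens_a, tokens_b = _tokens(a, delimiter), _tokens(b, delimiter)
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # assumption: no tokens on either side
    return len(tokens_a & tokens_b) / len(union)


print(token_overlap("acme corp", "acme inc"))                 # 1
print(jaccard_similarity("new york city", "city new york"))   # 1.0
```

For "acme corp" and "acme inc" the sets share one token out of three distinct tokens, so the Jaccard score is 1/3. Because sets ignore order, reordered words still score 1.0.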
fuzzy.exact_match
Check if two strings match exactly.
Parameters
| Property | Type | Description |
|---|---|---|
| col1 (required) | Column | First string column |
| col2 (required) | Column | Second string column |
| ignore_case (required) | Column | If True, comparison is case-insensitive |
fuzzy.contains_match
Check if one string contains the other. Returns True if col1 contains col2 OR col2 contains col1.
Parameters
| Property | Type | Description |
|---|---|---|
| col1 (required) | Column | First string column |
| col2 (required) | Column | Second string column |
| ignore_case (required) | Column | If True, comparison is case-insensitive |
fuzzy.prefix_match
Check if two strings share the same prefix of a given length.
Parameters
| Property | Type | Description |
|---|---|---|
| col1 (required) | Column | First string column |
| col2 (required) | Column | Second string column |
| length (required) | Column | Number of characters to compare |
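The three simple comparisons above (exact, contains, prefix) map onto ordinary string operations. The plain-Python sketch below shows the semantics described in this section. The default values, the use of `lower()` for case folding, and the handling of strings shorter than `length` are assumptions, not documented behavior.

```python
def exact_match(a: str, b: str, ignore_case: bool = False) -> bool:
    """True when the strings are identical, optionally ignoring case."""
    if ignore_case:
        a, b = a.lower(), b.lower()
    return a == b


def contains_match(a: str, b: str, ignore_case: bool = False) -> bool:
    """True when either string contains the other."""
    if ignore_case:
        a, b = a.lower(), b.lower()
    return a in b or b in a


def prefix_match(a: str, b: str, length: int = 3) -> bool:
    """True when the first `length` characters agree.

    Strings shorter than `length` are compared whole (assumption).
    """
    return a[:length] == b[:length]


print(exact_match("Acme", "acme"))                    # False
print(exact_match("Acme", "acme", ignore_case=True))  # True
print(contains_match("acme corp", "acme"))            # True
print(prefix_match("robert", "robin", length=3))      # True
```

In this sketch an empty string is contained in every string, so `contains_match` returns True whenever one side is empty; filter empty values first if that is not the intent.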
fuzzy.ngram_similarity
Calculate n-gram (character-level) similarity between two strings. Breaks strings into overlapping character sequences of length n, then calculates Jaccard similarity on the n-gram sets. Good for catching typos and character-level variations.
Parameters
| Property | Type | Description |
|---|---|---|
| col1 (required) | Column | First string column |
| col2 (required) | Column | Second string column |
| n (required) | Column | Size of n-grams |
fuzzy.ngram_distance
Calculate n-gram distance (1 - similarity) between two strings.
Parameters
| Property | Type | Description |
|---|---|---|
| col1 (required) | Column | First string column |
| col2 (required) | Column | Second string column |
| n (required) | Column | Size of n-grams |
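The n-gram similarity and distance described above can be sketched in plain Python as Jaccard similarity over sets of character n-grams. The names are hypothetical reimplementations. The default `n=2`, the absence of padding at string boundaries, and the 0.0 result when neither string is long enough to produce an n-gram are all assumptions.

```python
def char_ngrams(text: str, n: int) -> set:
    """Set of overlapping character sequences of length n (no padding)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity of the two n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0  # assumption: both strings shorter than n
    return len(grams_a & grams_b) / len(union)


def ngram_distance(a: str, b: str, n: int = 2) -> float:
    """1 - similarity."""
    return 1.0 - ngram_similarity(a, b, n)


print(sorted(char_ngrams("john", 2)))   # ['hn', 'jo', 'oh']
print(ngram_similarity("john", "jon"))  # 0.25
print(ngram_distance("john", "jon"))    # 0.75
```

"john" and "jon" share only the bigram "jo" out of four distinct bigrams, so a single dropped letter lowers the score sharply on short strings. Longer strings are affected less by one typo.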
fuzzy.cosine_similarity
Calculate cosine similarity between tokenized strings. Treats each string as a bag of words and computes cosine similarity based on term frequency. Good for comparing longer text.
Parameters
| Property | Type | Description |
|---|---|---|
| col1 (required) | Column | First string column |
| col2 (required) | Column | Second string column |
| delimiter (required) | Column | Token delimiter |
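A bag-of-words cosine similarity based on term frequency can be sketched in plain Python with `collections.Counter`. This is an illustrative reimplementation under assumptions: raw term counts with no TF-IDF weighting, a space default for `delimiter`, empty tokens dropped, and 0.0 returned when either string has no tokens.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str, delimiter: str = " ") -> float:
    """Cosine of the angle between the two term-frequency vectors."""
    freq_a = Counter(tok for tok in a.split(delimiter) if tok)
    freq_b = Counter(tok for tok in b.split(delimiter) if tok)
    # Only tokens present in both strings contribute to the dot product.
    dot = sum(freq_a[tok] * freq_b[tok] for tok in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(count * count for count in freq_a.values()))
    norm_b = math.sqrt(sum(count * count for count in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty string matches nothing
    return dot / (norm_a * norm_b)


print(round(cosine_similarity("acme corp", "acme inc"), 3))    # 0.5
print(round(cosine_similarity("the cat sat", "sat the cat"), 3))  # 1.0
```

Unlike Jaccard similarity, repeated tokens carry weight here: "a a b" against "a b" scores 3 / sqrt(10), about 0.949, rather than 1.0.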