Introducing DataCompose: Transform Your Data Pipeline

The most important part of any Data Product is Data Quality.

Whether you’re a Business Analyst using reporting tools like PowerBI or Superset, a data scientist building RAG systems for LLMs, or a researcher working with gene sequencing — clean data is the foundation of everything. Yet, achieving data quality remains one of the most challenging aspects of data engineering.

The Phone Number Problem

Let’s look at a common scenario: cleaning phone numbers using regex in PySpark.

# This is you at 3 AM trying to clean phone numbers
from pyspark.sql import functions as F

df = df.withColumn("phone_clean",
    F.when(F.col("phone").rlike(r"^\d{10}$"), F.col("phone"))
    .when(F.col("phone").rlike(r"^\d{3}-\d{3}-\d{4}$"),
          F.regexp_replace(F.col("phone"), "-", ""))
    .when(F.col("phone").rlike(r"^\(\d{3}\) \d{3}-\d{4}$"),
          F.regexp_replace(F.regexp_replace(F.col("phone"), r"[()\-]", ""), " ", ""))
    # ... 47 more edge cases you haven't discovered yet
)

But wait, there’s more problems:

  • Extracting phone numbers from free-form text
  • International formats and country codes
  • Extensions like “x1234” or “ext. 5678”
  • Phone numbers embedded in sentences
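Each of these pushes the regex approach further past its breaking point. As a rough illustration only (the "notes" column and the patterns below are hypothetical, not DataCompose code), pulling a number and an optional extension out of free-form text already looks like this:

from pyspark.sql import functions as F

# Illustration: extract a US-style phone number and an optional extension from free text
df = df.withColumn(
    "phone_candidate",
    F.regexp_extract(F.col("notes"), r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}", 0)
).withColumn(
    "extension",
    F.regexp_extract(F.col("notes"), r"(?i)(?:x|ext\.?)\s*(\d{2,5})", 1)
)

And that still ignores country codes, separators you have not seen yet, and the rows where the field contains a sentence instead of a number.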

The Current Solutions Fall Short

Option 1: External Libraries

Packages like Dataprep.ai or PyJanitor seem promising, but:

  • They only work with Pandas (not PySpark)
  • Built-in assumptions you can’t change without forking
  • One-size-fits-all approach doesn’t fit your data

Option 2: Regex Patterns

  • Hard to maintain and difficult to read
  • Brittle and prone to edge cases
  • Each new format requires updating complex patterns

Option 3: LLMs for Data Cleaning

  • Compliance nightmare with PII data
  • Expensive at scale
  • Non-deterministic results

The Root Problem

Bad data is fundamentally a people problem. It’s nearly impossible to abstract away human inconsistency into an external package. People aren’t predictable, and their mistakes don’t follow neat patterns.

Our Data Quality Hypothesis

I believe data errors follow a distribution something like this:

Distribution of errors in human-entered data:
█████████████ 60% - Perfect data (no cleaning needed)
████████      30% - Common errors (typos, formatting)
██             8% - Edge cases (weird but handleable)
              2% - Chaos (someone typed their life story in the phone field)
DataCompose: Clean the 38% that matters
Let the juniors clean the last 2% (it builds character)

The Uncomfortable Truth About AI and Data Quality

Everyone’s racing to implement RAG, fine-tune models, and build AI agents. But here’s what they don’t put in the keynotes: Your RAG system is only as good as your data quality.

You can have GPT-5, Claude, or any frontier model, but if your customer database has three different formats for phone numbers, your AI is going to hallucinate customer service disasters.

The Real Cause of AI Failures

Most “AI failures” are actually data quality failures.

That customer complaint about your AI-powered system giving wrong information? It’s probably because:

  • Your address data has “St.” in one table and “Street” in another
  • Phone numbers are stored in three different formats
  • Names are sometimes “LASTNAME, FIRSTNAME” and sometimes “FirstName LastName”

DataCompose isn’t trying to be AI. We’re trying to make your AI actually work by ensuring it has clean data to work with.

And here’s the kicker: your 38% of problematic data is not the same as everyone else’s. Your business has its own patterns, its own rules, and its own weird edge cases.

DataCompose Principle #1: Own Your Business Logic

Data transformations and data cleaning are business logic. And business logic belongs in your code.

Learn more about DataCompose Concepts

This is the fundamental problem. How do we square the circle: these transformations are painful to maintain by hand, yet too specific to your business to freeze into a rigid external dependency?

We took inspiration from the React/Svelte fullstack world and adopted the shadcn “copy to own” pattern, bringing it to PySpark. Instead of importing an external library that you can’t modify, you get battle-tested transformations that live in your code.

We call our building blocks “primitives” — small, modular functions with clearly defined inputs and outputs that compose into pipelines. When we have a module of primitives that you can compose together, we call it a transformer. These aren’t magical abstractions; they’re just well-written PySpark functions that you own completely.
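To make that concrete, here is a rough sketch of what a primitive could look like once it is copied into your repository. The function name and the rules inside it are illustrative assumptions, not the exact code DataCompose generates:

from pyspark.sql import Column, functions as F

# Illustrative primitive: a plain PySpark function with one clear input and output.
# Because the file lives in your repo, you can change these rules however you like.
def standardize_street_suffix(street: Column) -> Column:
    """Normalize common street suffixes to their abbreviated forms."""
    return F.regexp_replace(
        F.regexp_replace(street, r"(?i)\bStreet\b", "St"),
        r"(?i)\bAvenue\b", "Ave",
    )

Because a primitive is just a function that returns a Column, it composes with withColumn, select, and your other primitives like any other PySpark expression.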

With this approach, you get:

  • Primitives that do 90% of the work - Start with proven patterns
  • Code that lives in YOUR repository - No external dependencies to manage
  • Full ability to modify as needed - It’s your code, change whatever you want
  • No dependencies beyond what you already have - If you have PySpark, you’re ready

DataCompose Principle #2: Validate Everything

Data transformations should be validated at every step for edge cases, and should be adjustable for your use case.

Every primitive comes with the following (a sample test sketch follows the list):

  • Comprehensive test cases
  • Edge case handling
  • Clear documentation of what it does and doesn’t handle
  • Configurable behavior for your specific needs
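Since the primitives live in your repository, their tests do too. As a sketch only, assuming pytest and the illustrative standardize_street_suffix primitive from earlier (all names here are hypothetical), a test might look like:

import pytest
from pyspark.sql import SparkSession, functions as F

# standardize_street_suffix is the illustrative primitive sketched earlier in this post

@pytest.fixture(scope="module")
def spark():
    return SparkSession.builder.master("local[1]").appName("primitive-tests").getOrCreate()

def test_standardize_street_suffix(spark):
    df = spark.createDataFrame(
        [("123 Main Street",), ("456 Oak Avenue",), (None,)], ["address"]
    )
    rows = df.select(standardize_street_suffix(F.col("address")).alias("clean")).collect()
    assert rows[0]["clean"] == "123 Main St"
    assert rows[1]["clean"] == "456 Oak Ave"
    assert rows[2]["clean"] is None  # nulls pass through untouched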

DataCompose Principle #3: Zero Dependencies

No external dependencies beyond Python and PySpark, and that includes DataCompose itself: once the code is copied into your repo, nothing else is required at runtime. Each primitive must be modular and work on your system without adding extra dependencies.

Why this matters:

  • PySpark runs in the JVM — adding dependencies is complex
  • Enterprise environments have strict package approval processes
  • Every new dependency is a potential security risk
  • Simple is more maintainable

Our commitment: Pure PySpark transformations only.

How It Works

Want to dive deeper? Check out our Getting Started Guide for a complete walkthrough.

1. Install DataCompose CLI

pip install datacompose

2. Add the transformers you need - they’re copied to your repo, pre-validated against tests

datacompose add addresses

3. You own the code - use it like any other Python module

# This is in your repository, you own it
from transformers.pyspark.addresses import addresses
from pyspark.sql import functions as F

# Clean and extract address components
result_df = (
    df
    .withColumn("street_number", addresses.extract_street_number(F.col("address")))
    .withColumn("street_name", addresses.extract_street_name(F.col("address")))
    .withColumn("city", addresses.extract_city(F.col("address")))
    .withColumn("state", addresses.standardize_state(F.col("address")))
    .withColumn("zip", addresses.extract_zip_code(F.col("address")))
)

result_df.show(truncate=False)

See all available functions: Check the Address Transformers API Reference

Output:

+-----------------------------------------+-------------+-----------+-----------+-----+-----+
|address                                  |street_number|street_name|city       |state|zip  |
+-----------------------------------------+-------------+-----------+-----------+-----+-----+
|123 Main St, New York, NY 10001          |123          |Main       |New York   |NY   |10001|
|456 Oak Ave Apt 5B, Los Angeles, CA 90001|456          |Oak        |Los Angeles|CA   |90001|
|789 Pine Blvd, Chicago, IL 60601         |789          |Pine       |Chicago    |IL   |60601|
+-----------------------------------------+-------------+-----------+-----------+-----+-----+

4. Need more? Use keyword arguments or modify the source directly
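Because the file is copied into your project, "modify the source" is literal. As a purely illustrative sketch (reusing the hypothetical primitive from earlier, not actual generated code), handling a new edge case is a one-line edit to your copy:

from pyspark.sql import Column, functions as F

# In your copied transformers file: add whatever rules your data needs
def standardize_street_suffix(street: Column) -> Column:
    return F.regexp_replace(
        F.regexp_replace(
            F.regexp_replace(street, r"(?i)\bStreet\b", "St"),
            r"(?i)\bAvenue\b", "Ave",
        ),
        r"(?i)\bBoulevard\b", "Blvd",  # your new edge case, your one-line change
    )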

The Future Vision

Our goal is simple: provide clean data transformations as drop-in replacements that you can compose as YOU see fit.

  • No magic
  • No vendor lock-in
  • Just reliable primitives that work

What’s Available Now

We’re starting with the most common data quality problems:

  • Addresses — Standardize formats, extract components, validate
  • Emails — Clean, validate, extract domains
  • Phone Numbers — Format, extract, validate across regions

What’s Next

Based on community demand, we’re considering:

  • Date/time standardization
  • Name parsing and formatting
  • Currency and number formats
  • Custom business identifiers

Want to see something specific? Let us know!

Frequently Asked Questions