DataCompose

A data transformation framework for building reusable, composable data pipelines in PySpark.

+----------------------------------------+-------------+------------+-----------+-----+-------+
|address                                 |street_number|street_name |city       |state|zip    |
+----------------------------------------+-------------+------------+-----------+-----+-------+
|123 Main St, New York, NY 10001         |123          |Main        |New York   |NY   |10001  |
|456 Oak Ave Apt 5B, Los Angeles, CA...  |456          |Oak         |Los Angeles|CA   |90001  |
|789 Elm Street, Chicago, IL 60601       |789          |Elm         |Chicago    |IL   |60601  |
|321 Pine Road Suite 100, Boston, MA...  |321          |Pine        |Boston     |MA   |02101  |
+----------------------------------------+-------------+------------+-----------+-----+-------+

Every function is native PySpark. No UDFs. No black boxes. Just code that handles edge cases you haven't thought of yet.
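To give a flavor of the parsing behind the table above, here is the same kind of address-splitting logic sketched in plain Python with `re`. The pattern and the `parse_address` helper are hypothetical illustrations, not DataCompose's actual code; the generated transformers express equivalent patterns as native PySpark column expressions such as `F.regexp_extract`.

```python
import re

# Hypothetical sketch of address splitting; illustrative only.
# The generated PySpark code runs the same kind of pattern natively.
ADDRESS_RE = re.compile(
    r"^(?P<street_number>\d+)\s+"
    r"(?P<street_name>\w+)\s+"
    r"(?:St|Ave|Street|Road|Rd|Avenue)\.?"    # street type, discarded
    r"(?:\s+(?:Apt|Suite)\s+\S+)?,\s*"        # optional unit, e.g. "Apt 5B"
    r"(?P<city>[^,]+),\s*"
    r"(?P<state>[A-Z]{2})\s+"
    r"(?P<zip>\d{5})$"
)

def parse_address(address):
    """Return a dict of address parts, or None if the string doesn't match."""
    m = ADDRESS_RE.match(address)
    return m.groupdict() if m else None

parse_address("456 Oak Ave Apt 5B, Los Angeles, CA 90001")
# -> {'street_number': '456', 'street_name': 'Oak',
#     'city': 'Los Angeles', 'state': 'CA', 'zip': '90001'}
```

A real transformer has to handle far more variation than this sketch (missing units, abbreviated vs. spelled-out street types, stray whitespace), which is exactly the edge-case surface DataCompose's generated code covers.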

Quick Start

1. Initialize Your Project

datacompose init

2. Add Transformers

datacompose add addresses

3. Use Your Code

from pyspark.sql import functions as F
from transformers.pyspark.addresses import addresses

df = df.withColumn("city", addresses.extract_city(F.col("address")))

Zero dependencies beyond PySpark. No framework lock-in, no version conflicts.

Copy-paste, don't import. The code is yours to modify and own.

Production-ready transformations. Phone numbers, emails, addresses, and more.
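As an example of the edge cases such transformers absorb, here is a minimal plain-Python sketch of US phone-number normalization. The `normalize_us_phone` helper is hypothetical and not DataCompose's actual implementation; the generated code would express the same logic with native PySpark regexp functions.

```python
import re

def normalize_us_phone(raw):
    """Hypothetical sketch: strip punctuation, drop a leading US country
    code, and return a 10-digit string, or None if one can't be formed."""
    digits = re.sub(r"\D", "", raw or "")   # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                 # drop "+1" country code
    return digits if len(digits) == 10 else None

normalize_us_phone("+1 (212) 555-0123")   # -> "2125550123"
normalize_us_phone("212.555.0123")        # -> "2125550123"
normalize_us_phone("555-0123")            # -> None (too few digits)
```

Even this toy version shows why hand-rolled one-off regexes drift: every input source formats numbers differently, and each format is one more branch to maintain.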

Why DataCompose?

Stop writing the same regex patterns. Stop debugging phone number edge cases. Stop maintaining transformation libraries. DataCompose generates the code you'd write yourself if you had the time, then gives it to you to own, modify, and deploy however you want.