A data transformation framework for building reusable, composable data pipelines in PySpark.
+-----------------------------------------+-------------+-----------+-----------+-----+-----+
|address                                  |street_number|street_name|city       |state|zip  |
+-----------------------------------------+-------------+-----------+-----------+-----+-----+
|123 Main St, New York, NY 10001          |123          |Main       |New York   |NY   |10001|
|456 Oak Ave Apt 5B, Los Angeles, CA 90001|456          |Oak        |Los Angeles|CA   |90001|
|789 Elm Street, Chicago, IL 60601        |789          |Elm        |Chicago    |IL   |60601|
|321 Pine Road Suite 100, Boston, MA 02101|321          |Pine       |Boston     |MA   |02101|
+-----------------------------------------+-------------+-----------+-----------+-----+-----+
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from transformers.pyspark.addresses import addresses
# Initialize Spark
spark = SparkSession.builder.appName("DataCleaning").getOrCreate()
# Create sample data
data = [
    ("123 Main St, New York, NY 10001",),
    ("456 Oak Ave Apt 5B, Los Angeles, CA 90001",),
    ("789 Elm Street, Chicago, IL 60601",),
    ("321 Pine Road Suite 100, Boston, MA 02101",),
]
df = spark.createDataFrame(data, ["address"])
# Extract and standardize address components
result_df = df.select(
    F.col("address"),
    addresses.extract_street_number(F.col("address")).alias("street_number"),
    addresses.extract_street_name(F.col("address")).alias("street_name"),
    addresses.extract_city(F.col("address")).alias("city"),
    addresses.extract_state(F.col("address")).alias("state"),
    addresses.extract_zip_code(F.col("address")).alias("zip"),
)
# Show results
result_df.show(truncate=False)
# Filter to valid addresses
valid_addresses = result_df.filter(addresses.validate_zip_code(F.col("zip")))
Every function is native PySpark. No UDFs. No black boxes. Just code that handles edge cases you haven't thought of yet.
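To make the "no UDFs" point concrete, here is the kind of pattern-matching logic a generated extractor can rely on, sketched with Python's `re` module so it runs standalone. The regex and helper below are illustrative assumptions, not DataCompose's actual output; in the generated PySpark code the same pattern would be passed to the built-in `F.regexp_extract`, which executes natively in the JVM with no Python UDF overhead.

```python
import re

# Illustrative pattern of the kind a generated ZIP extractor might use.
# In the PySpark version this regex would go straight into
# F.regexp_extract(col, pattern, 1), keeping everything native.
ZIP_RE = re.compile(r"\b(\d{5})(?:-\d{4})?$")

def extract_zip_code(address: str) -> str:
    """Pull the trailing 5-digit ZIP (ignoring any +4 suffix) from a US address."""
    match = ZIP_RE.search(address)
    return match.group(1) if match else ""

print(extract_zip_code("123 Main St, New York, NY 10001"))                 # 10001
print(extract_zip_code("321 Pine Road Suite 100, Boston, MA 02101-4321"))  # 02101
```

Because the logic is a plain regex rather than a Python function wrapped in a UDF, Spark can push it down and run it at full speed across the cluster.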
datacompose init
datacompose add addresses
from transformers.pyspark.addresses import addresses
df = df.withColumn("city", addresses.extract_city(F.col("address")))
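Because each transformation is an ordinary function, single-column steps like the one above compose into larger pipelines with plain function composition. A minimal sketch of that idea, using a hypothetical `pipe` helper (not part of DataCompose) that works on any value:

```python
from functools import reduce

def pipe(value, *steps):
    """Thread a value through one-argument transformation steps,
    left to right: pipe(x, f, g) == g(f(x))."""
    return reduce(lambda acc, step: step(acc), steps, value)

# With Spark, each step would be a DataFrame -> DataFrame function, e.g.
#   lambda df: df.withColumn("city", addresses.extract_city(F.col("address")))
# The helper is type-agnostic; a quick check with strings:
print(pipe("  New York  ", str.strip, str.upper))  # NEW YORK
```

PySpark's own `DataFrame.transform(func)` offers the same chaining one step at a time, so pipelines built from these transformers stay flat and readable instead of deeply nested.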
Zero dependencies beyond PySpark. No framework lock-in, no version conflicts.
Copy-paste, don't import. The code is yours to modify and own.
Production-ready transformations. Phone numbers, emails, addresses, and more.
Stop writing the same regex patterns. Stop debugging phone number edge cases. Stop maintaining transformation libraries. DataCompose generates the code you'd write yourself if you had the time. Then gives it to you to own, modify, and deploy however you want.