Introduction to DataCompose
A code generation framework for building reusable, composable data cleaning pipelines in PySpark. Inspired by shadcn-svelte’s approach to UI components.
This is not a data transformation library. It’s how you build your data transformation codebase.
Traditional data libraries work by installing a package from PyPI, importing functions, and using them in your pipeline. This approach works well until you need to customize a transformation to fit your specific data patterns or require one that isn’t included in the library.
Often, you end up writing wrapper functions, fighting with inflexible APIs, or mixing incompatible libraries with different conventions.
DataCompose solves this problem through a fundamentally different approach.
Core Principles
Open Code
The generated transformation code is yours to modify and extend. No black boxes or hidden implementations.
Composition
Every primitive uses a common, composable interface, making them predictable and easy to combine.
Distribution
A CLI tool generates self-contained code that can be versioned with your project, ensuring consistency across your team.
Production Ready
Generated code has zero external dependencies beyond PySpark. Deploy anywhere PySpark runs.
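The shared, composable interface described above can be sketched in a few lines. This is an illustrative pattern only, not DataCompose’s actual API: plain Python string functions stand in for PySpark Column transformations, and the `compose` helper is a hypothetical name chosen for the example.

```python
from functools import reduce

# Illustrative sketch of the composable-interface principle.
# Each "primitive" shares one interface (value in -> value out),
# so any sequence of primitives composes into a pipeline.
# Plain string functions stand in for PySpark Column transforms.

def strip_whitespace(value: str) -> str:
    """Remove leading/trailing whitespace."""
    return value.strip()

def collapse_spaces(value: str) -> str:
    """Collapse internal runs of whitespace to a single space."""
    return " ".join(value.split())

def lowercase(value: str) -> str:
    """Normalize casing."""
    return value.lower()

def compose(*primitives):
    """Chain primitives left to right into one transformation."""
    return lambda value: reduce(lambda acc, f: f(acc), primitives, value)

clean_name = compose(strip_whitespace, collapse_spaces, lowercase)
print(clean_name("  Ada   LOVELACE "))  # -> "ada lovelace"
```

Because every primitive has the same shape, combining them is predictable: the output of one is always a valid input to the next.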
Why DataCompose?
DataCompose generates the actual transformation code directly in your project. You have full control to customize and extend the primitives to fit your needs. This means:
- Full Transparency: You see exactly how each transformation is implemented
- Easy Customization: Modify any part of a primitive to fit your data patterns and business logic
- No Hidden Magic: All transformations are readable PySpark code you can debug and test
- Complete Ownership: The code is yours, with no external dependencies to manage
In a typical library, if you need to change how email validation works, you’re stuck with the library’s implementation. With DataCompose, you simply edit the generated code directly.
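For instance, a generated email check might boil down to a pattern and a function you can edit in place. The snippet below is a hypothetical stand-in written in plain Python with `re`, not actual DataCompose output; the domain allow-list is an invented business rule showing the kind of one-line customization that owning the code makes possible.

```python
import re

# Hypothetical stand-in for generated validation code: the rule is
# plain, editable source, so tightening it is a small local change.
EMAIL_PATTERN = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")

# Suppose your business rule accepts only corporate addresses.
# Because you own the code, you add the rule directly:
ALLOWED_DOMAINS = {"example.com"}  # assumption for illustration

def is_valid_email(value: str) -> bool:
    """Syntactic check plus a custom domain allow-list."""
    if not EMAIL_PATTERN.match(value):
        return False
    domain = value.rsplit("@", 1)[1].lower()
    return domain in ALLOWED_DOMAINS

print(is_valid_email("ada@example.com"))  # True
print(is_valid_email("ada@gmail.com"))    # False
```

In a real generated module the same logic would be expressed over PySpark columns, but the principle is identical: the rule lives in your repository, so changing it is an edit, not a feature request.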
Frequently Asked Questions
How do updates work?
DataCompose follows a headless architecture: the core framework (PrimitiveRegistry, SmartPrimitive) is embedded in each generated module.
Updates to the CLI tool provide new transformers and improved code generation, but your existing generated code remains stable and under your control. You can regenerate specific modules when needed or keep using your customized versions.
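As a rough illustration of this headless pattern, here is a hypothetical miniature of what an embedded registry might look like. The real PrimitiveRegistry and SmartPrimitive that DataCompose embeds will differ; the point is only that the machinery ships inside the generated module, with no runtime dependency on the CLI.

```python
# Hypothetical miniature of an embedded registry, shipped inside a
# generated module so it has no runtime dependency on the CLI tool.

class PrimitiveRegistry:
    """Maps names to transformation functions via a decorator."""

    def __init__(self):
        self._primitives = {}

    def register(self, name):
        def decorator(func):
            self._primitives[name] = func
            return func
        return decorator

    def get(self, name):
        return self._primitives[name]

registry = PrimitiveRegistry()

@registry.register("trim")
def trim(value: str) -> str:
    return value.strip()

print(registry.get("trim")("  hello  "))  # -> "hello"
```

Because the registry lives in your codebase, a CLI upgrade cannot break code that is already deployed.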
How is DataCompose different from a traditional library?
Traditional libraries force you into their patterns and hide implementation details. DataCompose generates code that:
- You can read and understand – No diving through library source code
- You can modify – Need different validation rules? Just change them
- Has no runtime dependencies – Deploy anywhere PySpark runs
- You own completely – No version conflicts or breaking changes
Is DataCompose just a collection of code snippets?
DataCompose is much more than code snippets:
- Intelligent generation – Code is generated based on your specific requirements and configuration
- Consistent patterns – All generated code follows the same conventions and interfaces
- Built-in testing – Each transformer comes with comprehensive test templates
- Documentation included – Generated code includes detailed docstrings and usage examples
- Production optimized – Code follows PySpark best practices for performance and reliability
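To make the testing point concrete, a generated test template might pair each transformer with table-driven cases. The shape below is a sketch of the idea, not actual DataCompose output, and it uses a plain-Python trim transformer for brevity.

```python
# Hypothetical shape of a generated test template: table-driven
# cases for one transformer (a plain-Python stand-in shown here).

def trim(value: str) -> str:
    return value.strip()

CASES = [
    ("  padded  ", "padded"),  # leading/trailing whitespace removed
    ("clean", "clean"),        # already-clean input is unchanged
    ("", ""),                  # empty string is preserved
]

def test_trim():
    for raw, expected in CASES:
        assert trim(raw) == expected

test_trim()
print("all cases passed")
```

Because the cases live next to the transformer, customizing a rule and extending its tests happen in the same place.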
Philosophy & Inspiration
DataCompose is inspired by shadcn-svelte and huntabyte’s approach to component libraries. Just as shadcn-svelte provides “copy and paste” components rather than npm packages, DataCompose generates data transformation code that becomes part of YOUR codebase.
Why We Believe in This Approach
You Own Your Code
No external dependencies to manage and no breaking changes to worry about. The generated code is yours to version, modify, and deploy as you see fit.
Full Transparency
Every transformation is readable, debuggable PySpark code you can understand. No magic, no hidden complexity.
Customization First
Need to adjust a transformation? Just edit the code. No need to work around library limitations or wait for feature requests.
Learn by Reading
The generated code serves as documentation and learning material. See exactly how professional data transformations are implemented.
Get Started
Ready to transform your data pipelines?
Start generating production-ready transformation code in minutes with DataCompose.