Introduction to DataCompose
A code generation framework for building reusable, composable data cleaning pipelines in PySpark. Inspired by shadcn-svelte’s approach to UI components.
This is not a data transformation library. It’s how you build your data transformation codebase.
Traditional data libraries work by installing a package from PyPI, importing functions, and using them in your pipeline. This approach works well until you need to customize a transformation to fit your specific data patterns or require one that isn’t included in the library.
Often, you end up writing wrapper functions, fighting with inflexible APIs, or mixing incompatible libraries with different conventions.
DataCompose solves this problem through a fundamentally different approach.
Core Principles
Open Code
The generated transformation code is yours to modify and extend. No black boxes or hidden implementations.
Composition
Every primitive uses a common, composable interface, making them predictable and easy to combine.
Distribution
A CLI tool generates self-contained code that can be versioned with your project, ensuring consistency across your team.
Production Ready
Generated code has zero external dependencies beyond PySpark. Deploy anywhere PySpark runs.
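The shared, composable interface described above can be sketched in a few lines. This is an illustrative pattern only, not DataCompose’s actual API: plain Python string functions stand in for PySpark Column transformations, and the `compose` helper is a hypothetical name chosen for the example.

```python
from functools import reduce

# Illustrative sketch of the composable-interface principle.
# Each "primitive" shares one interface (value in -> value out),
# so any sequence of primitives composes into a pipeline.
# Plain string functions stand in for PySpark Column transforms.

def strip_whitespace(value: str) -> str:
    """Remove leading/trailing whitespace."""
    return value.strip()

def collapse_spaces(value: str) -> str:
    """Collapse internal runs of whitespace to a single space."""
    return " ".join(value.split())

def lowercase(value: str) -> str:
    """Normalize casing."""
    return value.lower()

def compose(*primitives):
    """Chain primitives left to right into one transformation."""
    return lambda value: reduce(lambda acc, f: f(acc), primitives, value)

clean_name = compose(strip_whitespace, collapse_spaces, lowercase)
print(clean_name("  Ada   LOVELACE "))  # -> "ada lovelace"
```

Because every primitive has the same shape, combining them is predictable: the output of one is always a valid input to the next.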
Why DataCompose?
DataCompose generates the actual transformation code directly in your project. You have full control to customize and extend the primitives to fit your needs. This means:
- Full Transparency: You see exactly how each transformation is implemented
- Easy Customization: Modify any part of a primitive to fit your data patterns and business logic
- No Hidden Magic: All transformations are readable PySpark code you can debug and test
- Complete Ownership: The code is yours, with no external dependencies to manage
In a typical library, if you need to change how email validation works, you’re stuck with the library’s implementation. With DataCompose, you simply edit the generated code directly.
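For instance, a generated email check might boil down to a pattern and a function you can edit in place. The snippet below is a hypothetical stand-in written in plain Python with `re`, not actual DataCompose output; the domain allow-list is an invented business rule showing the kind of one-line customization that owning the code makes possible.

```python
import re

# Hypothetical stand-in for generated validation code: the rule is
# plain, editable source, so tightening it is a small local change.
EMAIL_PATTERN = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")

# Suppose your business rule accepts only corporate addresses.
# Because you own the code, you add the rule directly:
ALLOWED_DOMAINS = {"example.com"}  # assumption for illustration

def is_valid_email(value: str) -> bool:
    """Syntactic check plus a custom domain allow-list."""
    if not EMAIL_PATTERN.match(value):
        return False
    domain = value.rsplit("@", 1)[1].lower()
    return domain in ALLOWED_DOMAINS

print(is_valid_email("ada@example.com"))  # True
print(is_valid_email("ada@gmail.com"))    # False
```

In a real generated module the same logic would be expressed over PySpark columns, but the principle is identical: the rule lives in your repository, so changing it is an edit, not a feature request.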
Frequently Asked Questions
How do updates work?
DataCompose follows a headless architecture: the core framework (PrimitiveRegistry, SmartPrimitive) is embedded in each generated module.
Updates to the CLI tool provide new transformers and improved code generation, but your existing generated code remains stable and under your control. You can regenerate specific modules when needed or keep using your customized versions.
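As a rough illustration of this headless pattern, here is a hypothetical miniature of what an embedded registry might look like. The real PrimitiveRegistry and SmartPrimitive that DataCompose embeds will differ; the point is only that the machinery ships inside the generated module, with no runtime dependency on the CLI.

```python
# Hypothetical miniature of an embedded registry, shipped inside a
# generated module so it has no runtime dependency on the CLI tool.

class PrimitiveRegistry:
    """Maps names to transformation functions via a decorator."""

    def __init__(self):
        self._primitives = {}

    def register(self, name):
        def decorator(func):
            self._primitives[name] = func
            return func
        return decorator

    def get(self, name):
        return self._primitives[name]

registry = PrimitiveRegistry()

@registry.register("trim")
def trim(value: str) -> str:
    return value.strip()

print(registry.get("trim")("  hello  "))  # -> "hello"
```

Because the registry lives in your codebase, a CLI upgrade cannot break code that is already deployed.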
How is DataCompose different from a traditional library?
Traditional libraries force you into their patterns and hide implementation details. DataCompose generates code that:
- You can read and understand – No diving through library source code
- You can modify – Need different validation rules? Just change them
- Has no runtime dependencies – Deploy anywhere PySpark runs
- You own completely – No version conflicts or breaking changes
Is DataCompose just a collection of code snippets?
DataCompose is much more than code snippets:
- Intelligent generation – Code is generated based on your specific requirements and configuration
- Consistent patterns – All generated code follows the same conventions and interfaces
- Built-in testing – Each transformer comes with comprehensive test templates
- Documentation included – Generated code includes detailed docstrings and usage examples
- Production optimized – Code follows PySpark best practices for performance and reliability
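To make the testing point concrete, a generated test template might pair each transformer with table-driven cases. The shape below is a sketch of the idea, not actual DataCompose output, and it uses a plain-Python trim transformer for brevity.

```python
# Hypothetical shape of a generated test template: table-driven
# cases for one transformer (a plain-Python stand-in shown here).

def trim(value: str) -> str:
    return value.strip()

CASES = [
    ("  padded  ", "padded"),  # leading/trailing whitespace removed
    ("clean", "clean"),        # already-clean input is unchanged
    ("", ""),                  # empty string is preserved
]

def test_trim():
    for raw, expected in CASES:
        assert trim(raw) == expected

test_trim()
print("all cases passed")
```

Because the cases live next to the transformer, customizing a rule and extending its tests happen in the same place.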
Philosophy & Inspiration
DataCompose is inspired by shadcn-svelte and huntabyte’s approach to component libraries. Just as shadcn-svelte provides “copy and paste” components rather than npm packages, DataCompose generates data transformation code that becomes part of YOUR codebase.
Why We Believe in This Approach
You Own Your Code
No external dependencies to manage and no breaking changes to worry about. The generated code is yours to version, modify, and deploy as you see fit.
Full Transparency
Every transformation is readable, debuggable PySpark code you can understand. No magic, no hidden complexity.
Customization First
Need to adjust a transformation? Just edit the code. No need to work around library limitations or wait for feature requests.
Learn by Reading
The generated code serves as documentation and learning material. See exactly how professional data transformations are implemented.
Get Started
Ready to transform your data pipelines?
Start generating production-ready transformation code in minutes with DataCompose.