Text Transformers

Clean, normalize, and transform text data.

Installation

datacompose add text

API Reference

Extract Functions

text.extract_hex

Extract the first hex value from mixed content. Looks for hex with a prefix (0x, #) or in MAC-address format (XX:XX:XX). Dialects: pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing mixed content
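
As a rough illustration of the matching rule described above, the following plain-Python sketch looks for a prefixed hex value (0x or #) or a colon-separated MAC-style value. The name `extract_hex_sketch` and the exact regular expression are assumptions made for illustration; they are not taken from the library, whose pattern and return value may differ.

```python
import re

# Hypothetical pattern: MAC-style pairs, 0x-prefixed hex, or #-prefixed hex.
_HEX_PATTERN = re.compile(
    r"(?:[0-9A-Fa-f]{2}:){2,}[0-9A-Fa-f]{2}"  # e.g. 00:1A:2B or a full MAC
    r"|0[xX][0-9A-Fa-f]+"                      # e.g. 0x1F
    r"|#[0-9A-Fa-f]+"                          # e.g. #ff8800
)


def extract_hex_sketch(text):
    """Return the first hex-like token in text, or None if there is none."""
    if text is None:
        return None
    match = _HEX_PATTERN.search(text)
    return match.group(0) if match else None
```

For example, `extract_hex_sketch("color: #ff8800;")` picks out the `#`-prefixed token and ignores the surrounding text.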

text.extract_base64

Extract base64 from mixed content. Looks for base64 strings with = padding or that follow a "base64," prefix. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing mixed content

Validation Functions

text.is_valid_hex

Check if string is valid hexadecimal. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string to validate

text.is_valid_base64

Check if string is valid base64. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string to validate

text.is_valid_url_encoded

Check if string is valid URL encoded (no malformed percent sequences). Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string to validate
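
To make "no malformed percent sequences" concrete, here is a minimal plain-Python sketch. It assumes that a `%` is well formed only when followed by exactly two hex digits; the name `is_valid_url_encoded_sketch` is hypothetical and the library's own rule may be stricter or looser.

```python
import re

# A "%" that is NOT followed by two hex digits is treated as malformed.
_MALFORMED_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def is_valid_url_encoded_sketch(text):
    """Return True when every "%" in text starts a complete %XX sequence."""
    return _MALFORMED_PERCENT.search(text) is None
```

Under this assumption a string with no `%` at all counts as valid, while a trailing `%` or a sequence such as `%2G` does not.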

text.has_control_characters

Check if string contains control characters (excluding tab/newline/CR). Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string to check
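
The exclusion of tab, newline, and CR can be sketched with a character class. The set below (C0 controls minus those three, plus DEL) is an assumption about what "control characters" covers here, and `has_control_characters_sketch` is a hypothetical name rather than the library's code.

```python
import re

# C0 controls except tab (\x09), LF (\x0a), and CR (\x0d), plus DEL (\x7f).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def has_control_characters_sketch(text):
    """Return True if text contains a control character outside tab/LF/CR."""
    return _CONTROL_CHARS.search(text) is not None
```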

text.has_zero_width_characters

Check if string contains zero-width characters. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string to check

text.has_non_ascii

Check if string contains non-ASCII characters. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string to check

text.has_escape_sequences

Check if string contains literal escape sequences (\n, \t, etc.). Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string to check

text.has_url_encoding

Check if string contains URL percent encoding. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string to check

text.has_html_entities

Check if string contains HTML entities. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string to check

text.has_ansi_codes

Check if string contains ANSI escape codes. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string to check

text.has_non_printable

Check if string contains non-printable characters. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string to check

text.has_accents

Check if string contains accented characters. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string to check

text.has_unicode_issues

Check if string contains unicode normalization issues. Detects curly quotes, fancy dashes, special spaces, full-width chars, and combining characters (accents as separate codepoints). Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string to check
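
The categories listed above can be approximated in plain Python as follows. The specific codepoints chosen for each category are assumptions picked for illustration, and `has_unicode_issues_sketch` is a hypothetical name; the library may check a different set.

```python
import re
import unicodedata

# Assumed representatives of each category named in the description.
_SUSPECT_CHARS = re.compile(
    "[\u2018\u2019\u201c\u201d"   # curly single and double quotes
    "\u2013\u2014"                # en dash, em dash
    "\u00a0\u2000-\u200a\u3000"   # no-break, typographic, ideographic spaces
    "\uff01-\uff5e]"              # full-width ASCII variants
)


def has_unicode_issues_sketch(text):
    """Return True if text contains any of the assumed problem characters."""
    if _SUSPECT_CHARS.search(text):
        return True
    # Combining marks: accents stored as separate codepoints.
    return any(unicodedata.combining(ch) for ch in text)
```

Note that a precomposed character such as U+00E9 is not flagged by this sketch, whereas `e` followed by the combining acute accent U+0301 is.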

text.has_whitespace_issues

Check if string has whitespace issues. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string to check

Utility Functions

text.hex_to_text

Convert hexadecimal string to text. Dialects: pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing hex string

text.text_to_hex

Convert text to hexadecimal string. Dialects: pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing text

text.clean_hex

Clean hex string (remove prefix, normalize case, remove separators). Dialects: pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing hex string
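
The three cleaning steps named above (remove prefix, normalize case, remove separators) might look like this in plain Python. Lowercasing as the target case and the separator set (colon, hyphen, whitespace) are assumptions, and `clean_hex_sketch` is a hypothetical name.

```python
import re


def clean_hex_sketch(text):
    """Strip a 0x/# prefix, drop separators, and lowercase the result."""
    value = text.strip()
    # Drop a leading 0x/0X or # prefix.
    value = re.sub(r"^(?:0[xX]|#)", "", value)
    # Drop separators commonly placed between byte pairs.
    value = re.sub(r"[:\-\s]", "", value)
    # Assumption: normalize to lowercase (the library may choose uppercase).
    return value.lower()
```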

text.decode_base64

Decode base64 string to text. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing base64 string

text.encode_base64

Encode text to base64 string. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing text

text.clean_base64

Clean base64 string (remove whitespace, fix padding). Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing base64 string
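
"Fix padding" most likely refers to restoring the trailing `=` characters so the length is a multiple of four. A minimal sketch under that assumption, using the hypothetical name `clean_base64_sketch`:

```python
import re


def clean_base64_sketch(text):
    """Remove whitespace and re-pad so the length is a multiple of 4."""
    # Remove line breaks and spaces that often appear in wrapped base64.
    value = re.sub(r"\s+", "", text)
    # Discard any existing padding, then add back exactly what is needed.
    value = value.rstrip("=")
    value += "=" * (-len(value) % 4)
    return value
```

This sketch does not validate the alphabet; an input whose unpadded length leaves a remainder of 1 is not valid base64 and will still fail to decode after padding.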

text.decode_url

Decode URL percent-encoded string. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing URL encoded string

text.encode_url

Encode string with URL percent-encoding. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string
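
For reference, the round trip that text.decode_url and text.encode_url describe can be reproduced with Python's standard library. The wrappers below are hypothetical and only show the expected shape of the transformation; which characters the library leaves unencoded is not stated, so `safe=""` (encode everything reserved) is an assumption.

```python
from urllib.parse import quote, unquote


def encode_url_sketch(text):
    """Percent-encode text, including "/" (assumed; see safe="")."""
    return quote(text, safe="")


def decode_url_sketch(text):
    """Decode %XX sequences back to characters (UTF-8)."""
    return unquote(text)
```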

text.decode_html_entities

Decode HTML entities to characters. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing HTML entities

text.encode_html_entities

Encode special characters as HTML entities. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string
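
The decode and encode directions for HTML entities correspond closely to Python's `html.unescape` and `html.escape`. The wrappers below are hypothetical names used for illustration; the set of characters the library encodes may differ from the standard library's.

```python
import html


def decode_html_entities_sketch(text):
    """Turn entities such as &lt; or &eacute; into characters."""
    return html.unescape(text)


def encode_html_entities_sketch(text):
    """Escape &, <, >, and quotes as HTML entities."""
    return html.escape(text)
```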

text.unescape_string

Convert literal escape sequences to actual characters. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string with escape sequences

text.escape_string

Convert special characters to literal escape sequences. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string

text.normalize_line_endings

Normalize line endings to LF. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string
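
Normalizing to LF usually means rewriting both CRLF and bare CR. A minimal plain-Python sketch, with the hypothetical name `normalize_line_endings_sketch`:

```python
def normalize_line_endings_sketch(text):
    """Convert CRLF and lone CR line endings to LF."""
    # Replace CRLF first so it does not turn into two line feeds.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```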

text.to_ascii

Transliterate non-ASCII characters to ASCII equivalents. Dialects: pyspark

Parameters

Property | Type | Description
col (required) | Column | Column containing string

text.to_codepoints

Convert string to Unicode codepoints representation. Dialects: pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string

text.from_codepoints

Convert Unicode codepoints representation to string. Dialects: pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing codepoints
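
The codepoints representation is not specified here. The sketch below assumes a space-separated `U+XXXX` notation purely for illustration; both function names are hypothetical, and the library's actual format may differ.

```python
def to_codepoints_sketch(text):
    """Render each character as U+XXXX, separated by spaces (assumed format)."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def from_codepoints_sketch(codepoints):
    """Inverse of to_codepoints_sketch for the same assumed format."""
    # token[2:] drops the leading "U+" before parsing the hex digits.
    return "".join(chr(int(token[2:], 16)) for token in codepoints.split())
```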

text.reverse_string

Reverse a string. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string

text.truncate

Truncate string to maximum length. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string
max_length (required) | Column | Maximum length
ellipsis (required) | Column | Whether to add "..." when truncating
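
The interaction between `max_length` and `ellipsis` is not spelled out above. The sketch below assumes the "..." counts toward `max_length`, so the result never exceeds it; that choice, the default value for `ellipsis`, and the name `truncate_sketch` are all assumptions.

```python
def truncate_sketch(text, max_length, ellipsis=False):
    """Cut text to at most max_length characters, optionally ending in "..."."""
    if len(text) <= max_length:
        return text
    if ellipsis and max_length > 3:
        # Reserve three characters so the result still fits in max_length.
        return text[: max_length - 3] + "..."
    return text[:max_length]
```

If the library instead appends "..." after cutting at `max_length`, the output would be three characters longer than in this sketch.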

text.pad_left

Pad string on the left to specified width. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string
width (required) | Column | Target width
pad_char (required) | Column | Character to pad with

text.pad_right

Pad string on the right to specified width. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string
width (required) | Column | Target width
pad_char (required) | Column | Character to pad with
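
The padding behavior of text.pad_left and text.pad_right is analogous to Python's `str.rjust` and `str.ljust`. The wrappers below are hypothetical. One caveat: these Python methods leave strings longer than `width` unchanged, whereas SQL-style `lpad`/`rpad` functions typically truncate them; this page does not say which behavior applies.

```python
def pad_left_sketch(text, width, pad_char=" "):
    """Pad on the left with pad_char until text is width characters long."""
    return text.rjust(width, pad_char)


def pad_right_sketch(text, width, pad_char=" "):
    """Pad on the right with pad_char until text is width characters long."""
    return text.ljust(width, pad_char)
```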

text.remove_control_characters

Remove control characters (preserving tab, newline, CR). Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string

text.remove_zero_width_characters

Remove zero-width characters. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string

text.remove_non_printable

Remove non-printable characters (preserving tab, newline, CR). Dialects: pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string

text.remove_ansi_codes

Remove ANSI escape codes. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string
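
ANSI escape codes in terminal output are most often CSI sequences such as `ESC[31m`. The sketch below removes only those; treating "ANSI escape codes" as CSI sequences is an assumption, and `remove_ansi_codes_sketch` is a hypothetical name.

```python
import re

# CSI sequence: ESC [ + parameter bytes (0x30-0x3F) + intermediate bytes
# (0x20-0x2F) + one final byte (0x40-0x7E).
_ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def remove_ansi_codes_sketch(text):
    """Strip CSI escape sequences (colors, cursor movement) from text."""
    return _ANSI_CSI.sub("", text)
```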

text.strip_invisible

Remove all invisible characters (control chars, zero-width, BOM). Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string

text.remove_bom

Remove byte order mark (BOM). Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string

text.normalize_unicode

Normalize unicode (replace curly quotes, fancy dashes, special spaces). Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string

text.remove_accents

Remove accents/diacritics from characters. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string
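
A common way to remove diacritics is to decompose each character and drop the combining marks. The sketch below shows that technique with Python's `unicodedata`; it is not necessarily how the library does it, and `remove_accents_sketch` is a hypothetical name.

```python
import unicodedata


def remove_accents_sketch(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left into canonical composed form.
    return unicodedata.normalize("NFC", stripped)
```

Letters that have no decomposition, such as `ø`, pass through this approach unchanged.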

text.normalize_whitespace

Normalize whitespace (trim and collapse multiple spaces). Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string
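
Trim and collapse can be expressed in one substitution. The sketch below treats tabs and newlines as whitespace to collapse, which is an assumption (the description mentions only spaces); `normalize_whitespace_sketch` is a hypothetical name.

```python
import re


def normalize_whitespace_sketch(text):
    """Collapse runs of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()
```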

text.remove_html_tags

Remove HTML tags from string. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string

text.remove_urls

Remove URLs from string. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string

text.remove_emojis

Remove emojis from string. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string

text.remove_punctuation

Remove punctuation from string. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string

text.remove_digits

Remove digits from string. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string

text.remove_letters

Remove letters from string. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string

text.remove_escape_sequences

Remove literal escape sequences from string. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string

text.strip_to_alphanumeric

Keep only alphanumeric characters. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string

text.clean_for_comparison

Clean string for comparison (lowercase, trim, normalize whitespace, remove accents). Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string

text.slugify

Convert string to URL-safe slug. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string
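
What counts as "URL-safe" is not defined above. A typical slug is lowercase ASCII letters and digits joined by hyphens, and the sketch below assumes that convention; `slugify_sketch` is a hypothetical name and the library's output may differ (for example, in how it treats underscores).

```python
import re
import unicodedata


def slugify_sketch(text):
    """Lowercase, strip accents, and join alphanumeric runs with hyphens."""
    # Decompose accented letters, then drop anything outside ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Replace every run of non-alphanumeric characters with one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_text.lower())
    return slug.strip("-")
```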

text.collapse_repeats

Collapse repeated characters to maximum count. Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string
max_repeat (required) | Column | Maximum allowed consecutive repetitions (1 or 2)
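
Collapsing repeats can be sketched with a backreference: any run longer than `max_repeat` is cut down to exactly `max_repeat` copies. The name `collapse_repeats_sketch` and the default value are assumptions made for illustration.

```python
import re


def collapse_repeats_sketch(text, max_repeat=2):
    """Shorten runs of one repeated character to at most max_repeat copies."""
    if max_repeat < 1:
        raise ValueError("max_repeat must be at least 1")
    # (.)\1{n,} matches a character followed by n or more copies of itself,
    # that is, a run of at least n + 1 identical characters.
    pattern = re.compile(r"(.)\1{%d,}" % max_repeat, re.DOTALL)
    return pattern.sub(lambda m: m.group(1) * max_repeat, text)
```

With `max_repeat=2`, legitimate double letters such as the "ll" in "hello" are left alone; with `max_repeat=1` they are collapsed too.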

text.clean_string

Comprehensive string cleaning (remove BOM, zero-width, control chars, normalize unicode). Dialects: postgres, pyspark, duckdb

Parameters

Property | Type | Description
col (required) | Column | Column containing string