Data transformation
This page outlines data transformation concepts in GlassFlow.
Data transformation is a critical process in data streaming and processing. It enables the conversion or mapping of data from one format or structure into another. GlassFlow facilitates this process using a custom Python transformation function, allowing for a wide range of transformation scenarios including data cleaning, aggregation, normalization, enrichment, and more.
Implementing Transformations
To perform data transformations in GlassFlow, you must write a Python script containing a mandatory handler function. This function is where you define your transformation logic. GlassFlow automatically invokes this function when a data pipeline runs and passes it two arguments:
- data: represents the event dispatched to the pipeline, accessible within the function as a JSON object or Python dictionary.
- log: a Python logging object used to generate logs. Any logs you create are included in the pipeline logs, which can be viewed through the CLI.
The handler function processes this data and returns the transformed data as a JSON object or Python dictionary. Here's the basic structure of a transformation script in GlassFlow:
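A minimal sketch of such a script, assuming only the handler(data, log) signature described above. The "processed" field and the local test run at the bottom are illustrative assumptions, not part of GlassFlow:

```python
import logging


def handler(data, log):
    # data: the incoming event, treated here as a Python dictionary.
    # log: the Python logging object passed in by the pipeline.
    log.info("Received event with %d field(s)", len(data))

    # Transformation logic goes here. This sketch copies the event and
    # adds one hypothetical field so the input is not modified in place.
    transformed = dict(data)
    transformed["processed"] = True
    return transformed


if __name__ == "__main__":
    # Local smoke test with a standard-library logger standing in for
    # the logger that the pipeline would supply.
    logging.basicConfig(level=logging.INFO)
    print(handler({"id": 1}, logging.getLogger("local-test")))
```

Running the script locally like this is only a convenience for checking the logic before uploading the function.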
You can also include other Python dependencies (Python packages that you import into your script) in the transformation function. See the libraries supported by GlassFlow.
Common Data Transformation Scenarios
Stateless
Data Cleaning
Data Enrichment
Data Validation
Data Anomaly Detection
Data Profiling
Data Quality Check
Data Normalization
Data Conversion
Real-time API integration
LLM integration
ML-trained model integration
Stateful
Data Aggregation
Data Filtering
Data transformation based on history
Transformation code samples
Data Cleaning
Data cleaning involves removing or correcting inaccurate, incomplete, or irrelevant parts of the data, for example trimming whitespace, correcting typos, or filtering out unwanted records.
Example: Removing Null Values
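One possible sketch of this example, assuming that "null values" means top-level fields whose value is None and that the event arrives as a flat Python dictionary:

```python
def handler(data, log):
    # Keep only the fields whose value is not None. Empty strings and
    # zeros are kept, because they are values rather than nulls.
    cleaned = {key: value for key, value in data.items() if value is not None}

    removed = len(data) - len(cleaned)
    if removed:
        log.info("Removed %d null field(s)", removed)
    return cleaned
```

Nested dictionaries are left untouched in this sketch. Cleaning them would need a recursive variant.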
IP Address Masking
IP address masking is useful for anonymizing user data. This transformation replaces the last octet of an IPv4 address with 0 to mask the user's specific location.
Example: Masking IP Addresses
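A hedged sketch of the masking step, assuming the event carries the address in a field named "ip_address" (a hypothetical name) and that only IPv4 addresses need masking. Validation uses the standard-library ipaddress module:

```python
import ipaddress


def handler(data, log):
    ip = data.get("ip_address")  # hypothetical field name
    if not isinstance(ip, str):
        return data

    try:
        # Validate first so that malformed input is never rewritten.
        ipaddress.IPv4Address(ip)
    except ValueError:
        log.warning("ip_address is not a valid IPv4 address; left unchanged")
        return data

    # Replace the last octet with 0, e.g. 192.168.1.57 becomes 192.168.1.0.
    masked = ip.rsplit(".", 1)[0] + ".0"
    return {**data, "ip_address": masked}
```

Leaving invalid values unchanged is a design choice made for this sketch. A stricter pipeline might drop the field instead so that no unmasked value passes through.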
Data Enrichment
Data enrichment involves enhancing existing data with additional information, for instance adding user demographic information based on an email address or user ID.
Example: Adding User Type
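A minimal sketch of this enrichment, assuming events carry a "user_id" field and that user types come from a small in-memory lookup table. Both the field name and the USER_TYPES table are hypothetical. A real pipeline might query a database or an API instead:

```python
# Hypothetical lookup table mapping user IDs to user types.
USER_TYPES = {
    "u-1001": "premium",
    "u-1002": "trial",
}


def handler(data, log):
    user_id = data.get("user_id")  # hypothetical field name
    user_type = USER_TYPES.get(user_id, "unknown")
    if user_type == "unknown":
        log.info("No user type found for user_id=%s", user_id)

    # Return a copy of the event with the additional field attached.
    return {**data, "user_type": user_type}
```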
Python dependencies for transformation
With each import statement in your transformation function script, you are bringing in a Python dependency. GlassFlow needs to install those dependencies to compile and run the function successfully. When you upload your transformation function through the GlassFlow interface or using the CLI command, GlassFlow automatically compiles your function with the supported libraries installed. This process verifies that your function is compatible with the serverless execution environment.
Supported libraries
As of now, GlassFlow supports the following Python libraries for use in transformation functions:
Requesting Additional Libraries
Data transformation tasks vary widely, and you might require libraries that are not currently supported. If you need additional libraries for your transformation functions, request them by raising an issue on our GitHub repository.
Next
For further details on configuring your data pipelines to utilize these transformations effectively, please proceed to the Pipeline Configuration page in our documentation.