Data transformation

This page outlines data transformation concepts in GlassFlow.

Data transformation converts or maps data from one format or structure into another, and it is a core step in data streaming and processing. In GlassFlow, you implement it as a custom Python transformation function, which covers a wide range of scenarios including data cleaning, aggregation, normalization, and enrichment.

Implementing Transformations

To perform data transformations in GlassFlow, you must write a Python script containing a mandatory handler function. This function is where you define your transformation logic. GlassFlow automatically invokes it when a data pipeline runs and passes it two arguments:

data - the event dispatched to the pipeline, accessible within the function as a Python dictionary (the parsed JSON event).
log - a Python logging object for generating logs. Any log entries you create are included in the pipeline logs, which you can view through the CLI.

The handler function processes this data and returns the transformed data as a JSON or Python dictionary. Here's the basic structure of a transformation script in GlassFlow:

import json


def handler(data, log):
    log.info("Echo:" + json.dumps(data))
    # Your transformation logic goes here.
    return data

You can also use other Python dependencies (Python packages that you import into your script) in the transformation function. See the libraries supported by GlassFlow.
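Before uploading a script, it can help to call the handler locally with a sample event. The following is a minimal sketch, not part of GlassFlow itself: it assumes that a standard `logging.Logger` is a reasonable stand-in for the `log` argument, and the `sample_event` fields and the `local-test` logger name are hypothetical.

```python
import json
import logging


def handler(data, log):
    # Same shape as the basic structure: log the event, then return it.
    log.info("Echo: " + json.dumps(data))
    return data


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Stand-in for the logging object GlassFlow passes in (assumption).
    local_log = logging.getLogger("local-test")
    # Hypothetical event used only for this local check.
    sample_event = {"user_id": "001", "ip_address": "203.0.113.42"}
    print(handler(sample_event, local_log))
```

Running the script directly prints the returned dictionary, which makes it easy to check the transformation logic before the pipeline invokes it.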

Common Data Transformation Scenarios

Stateless

Data Cleaning
Data Enrichment
Data Validation
Data Anomaly Detection
Data Profiling
Data Quality Check
Data Normalization
Data Conversion
Real-time APIs integration
LLMs integration
ML-trained model integration

Stateful

Data Aggregation
Data Filtering
Data transformation based on history

Transformation code samples

Data Cleaning

Data cleaning involves removing or correcting inaccurate, incomplete, or irrelevant parts of the data, for example by trimming whitespace, correcting typos, or filtering out unwanted records.

Example: Removing Null Values

def handler(data, log):
    # data is already a Python dictionary, so no json.loads call is needed.
    # Keep only the fields whose value is not None.
    cleaned_data = {k: v for k, v in data.items() if v is not None}
    return cleaned_data
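The whitespace case mentioned earlier in this section can be handled in a similar way. This is a sketch under stated assumptions: it only looks at top-level fields, and treating whitespace-only strings as missing is a design choice made here for illustration, not GlassFlow behavior.

```python
def handler(data, log):
    cleaned = {}
    for key, value in data.items():
        if value is None:
            continue  # drop null values
        if isinstance(value, str):
            value = value.strip()  # trim leading and trailing whitespace
            if value == "":
                continue  # assumption: whitespace-only strings count as missing
        cleaned[key] = value
    log.info("Dropped %d field(s)", len(data) - len(cleaned))
    return cleaned
```

Nested dictionaries and lists are passed through unchanged; a recursive version would be needed to clean those as well.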

IP Address Masking

IP address masking is useful for anonymizing user data. This transformation replaces the last octet of an IPv4 address with 0 to hide the user's specific location.

Example: Masking IP Addresses


def handler(data, log):
    ip = data.get('ip_address', '')
    if ip:
        # Replace the last octet with 0, e.g. 192.168.1.57 -> 192.168.1.0
        data['ip_address'] = '.'.join(ip.split('.')[:-1] + ['0'])
    return data
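String splitting only works for well-formed IPv4 addresses. A slightly more defensive variant can be sketched with Python's standard `ipaddress` module. The `mask_ip` helper is hypothetical, and the prefix lengths (/24 for IPv4, /48 for IPv6) as well as the decision to replace unparseable values with `None` are assumptions chosen for illustration.

```python
import ipaddress


def mask_ip(ip):
    """Return the network address of ip, or None if ip cannot be parsed."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return None
    # Assumed prefix lengths: zero the last octet for IPv4, keep /48 for IPv6.
    prefix = 24 if addr.version == 4 else 48
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(network.network_address)


def handler(data, log):
    ip = data.get("ip_address")
    if ip is not None:
        masked = mask_ip(str(ip))
        if masked is None:
            log.warning("Replacing unparseable ip_address value with None")
        data["ip_address"] = masked
    return data
```

Replacing an invalid value with `None` avoids forwarding a raw, unmasked string downstream; events without an `ip_address` field are returned unchanged.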

Data Enrichment

Data enrichment involves enhancing existing data with additional information, for instance, adding user demographic information based on an email address or user ID.

Example: Adding User Type

def handler(data, log):
    user_id = data.get('user_id', '')
    # Look up the user type for this user_id (getUserType is defined below)
    data['user_type'] = getUserType(user_id)
    return data

def getUserType(user_id):
    user_types = {
        '001': 'Admin',
        '002': 'Editor',
        '003': 'Viewer',
        # Add more mappings as needed
    }
    return user_types.get(user_id, 'Guest')
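Enrichment can also derive a new field from a value already present in the event, such as the email address mentioned above. The sketch below assumes hypothetical `email` and `email_domain` field names and uses a deliberately simple check (exactly one `@` with text on both sides) rather than full email validation.

```python
def handler(data, log):
    email = data.get("email")
    domain = None
    # Simple shape check only: exactly one "@" with text on both sides.
    if isinstance(email, str) and email.count("@") == 1:
        local_part, _, domain_part = email.partition("@")
        if local_part and domain_part:
            domain = domain_part.lower()
    if domain is None:
        log.warning("No usable email; email_domain set to None")
    data["email_domain"] = domain
    return data
```

Always setting `email_domain`, even to `None`, keeps the output schema the same for every event, which tends to simplify downstream consumers.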

Python dependencies for transformation

Each import statement in your transformation script brings in a Python dependency, and GlassFlow must install those dependencies to compile and run the function successfully. When you upload your transformation function through the GlassFlow interface or the CLI, GlassFlow automatically compiles it with the supported libraries installed. This process verifies that your function is compatible with the serverless execution environment.

Supported libraries

As of now, GlassFlow supports the following Python libraries for use in transformation functions:

Library | Description

Requesting Additional Libraries

Data transformation tasks vary widely, and you might need libraries that are not currently supported. If so, request them by raising an issue on our GitHub repository.

For further details on configuring your data pipelines to utilize these transformations effectively, please proceed to the Pipeline Configuration page in our documentation.

