Real-time clickstream analytics

A practical example of creating a pipeline for analyzing clickstream data and continuously updating real-time dashboards.

Clickstream data contains the information gathered as a user navigates through a web application. Clickstream analytics involves tracking, analyzing, and reporting the web pages visited and user behavior on those pages. This data provides valuable insights into user behavior, such as how they discovered the product or service and their interactions on the website.

In this tutorial, we will build a clickstream analytics dashboard using GlassFlow. We will use the Google Analytics Data API in Python to collect clickstream data from a website and send it to a GlassFlow pipeline. Our transformation function will analyze the data to calculate additional metrics, and we will use Streamlit and Plotly to visualize the results.

Pipeline components

Producer

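There are two options for data producers: ga_producer.py, which reads data from Google Analytics, and fake_producer.py, which generates mock events.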

GlassFlow

GlassFlow receives real-time analytics data from the producer through the Python SDK, applies the transformation function, and makes the transformed data available to the consumer.

Consumer

The dashboard component is built using Streamlit, a powerful tool for creating interactive web applications. This component visualizes the clickstream data by creating various charts and graphs in Plotly.

We'll use the GlassFlow CLI to create a new space and configure the data pipeline.

Prerequisites

Make sure that you have the following before proceeding with the installation:

  • You created a GlassFlow account.

  • You installed GlassFlow CLI and logged into your account via the CLI.

  • Basic knowledge of Google Analytics, Streamlit, and Plotly.

  • You have a Google Analytics (GA) account if you use GA as the data producer.

Installation

  1. Clone the glassflow-examples repository to your local machine:

    git clone https://github.com/glassflow/glassflow-examples.git
  2. Navigate to the project directory:

    cd use-cases/clickstream-analytics-dashboard
  3. Create a new virtual environment:

    python -m venv .venv && source .venv/bin/activate
  4. Install the required dependencies:

    pip install -r requirements.txt

Steps to set up Google Analytics 4 API

Google Analytics 4 (or GA4) has an API that provides access to page views, traffic sources, and other data points. With this API, you can build custom dashboards, automate reporting, and integrate with other applications. We focus only on accessing and exporting data to GlassFlow using Python. You can find more comprehensive information about how to set up the Google Cloud Project (GCP), enable the API, and configure authentication in the API quickstart, or follow this step-by-step guide.

  1. Enable the Google Analytics Data API for a new project or select an existing project.

  2. Go to https://console.cloud.google.com/apis/credentials. Click "Create credentials" and choose a "Service Account" option. Name the service user and click through the next steps.

  3. Go to https://console.cloud.google.com/apis/credentials once more and click on your newly created user (under Service Accounts). Go to "Keys", then click "Add key" -> "Create new key" -> "JSON". A JSON file will be saved to your computer.

  4. Rename this JSON file to credentials.json and put it under use-cases/clickstream-analytics-dashboard. Then set the path to this file to the environment variable GOOGLE_APPLICATION_CREDENTIALS:

    export GOOGLE_APPLICATION_CREDENTIALS=credentials.json
  5. Add a service account to the Google Analytics property. Using a text editor or VS Code, open the credentials.json file downloaded in the previous step and search for the client_email field to obtain the service account email address, which looks similar to:

ga-467@clickstream-metrics.iam.gserviceaccount.com

Use this email address to add a user to the Google Analytics property you want to access via the Google Analytics Data API v1. For this tutorial, only Viewer permissions are needed.

  6. Copy the ID of the Google Analytics property you want to analyze and save it as the value of GA_PROPERTY_ID in a .env file in the project directory.
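
If you prefer not to open the key file by hand, the service account address from step 5 can be read programmatically. The following is a minimal sketch using only the Python standard library; the helper name extract_client_email is hypothetical and is not part of the example repository.

```python
import json
from pathlib import Path


def extract_client_email(key_json: str) -> str:
    """Return the client_email field from the text of a service account key file."""
    key = json.loads(key_json)
    try:
        return key["client_email"]
    except KeyError:
        raise ValueError("key file has no client_email field") from None


if __name__ == "__main__":
    # Assumes credentials.json sits in the current directory, as in step 4.
    text = Path("credentials.json").read_text(encoding="utf-8")
    print(extract_client_email(text))
```

The printed address is the one to add as a Viewer on the Google Analytics property.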

Define the transformation function

To provide meaningful insights based on the dimensions and metrics received from Google Analytics, the sample transformation function enriches each input event with the following:

  • Engagement Score: Calculates an engagement score based on event count, screen page views, and active users.

  • Device Usage Insights: Analyzes the proportion of different device categories.

  • Content Popularity: Tracks the popularity of different screens/pages.

  • Geographic Distribution: Provides insights on user distribution based on geographic location.
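
To make the first enrichment concrete, here is a rough sketch of what an engagement score calculation could look like. It is not the transform.py shipped with the example: the handler(data, log) signature is assumed to match the GlassFlow transformation function convention, the field names eventCount, screenPageViews, and activeUsers are assumed to follow GA4 Data API naming, and the weighting in the formula is purely illustrative.

```python
import logging


def _to_number(value):
    """Coerce a metric to float; the GA Data API returns metric values as strings."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.0


def engagement_score(event_count, page_views, active_users):
    """Hypothetical score: weighted activity per active user (weights are an assumption)."""
    if active_users <= 0:
        return 0.0
    return round((event_count + 2 * page_views) / active_users, 2)


def handler(data, log):
    """Return a copy of the input event with an engagementScore field added."""
    enriched = dict(data)
    enriched["engagementScore"] = engagement_score(
        _to_number(data.get("eventCount")),
        _to_number(data.get("screenPageViews")),
        _to_number(data.get("activeUsers")),
    )
    log.info("engagement score: %s", enriched["engagementScore"])
    return enriched


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    event = {"eventCount": "10", "screenPageViews": "5", "activeUsers": "4"}
    print(handler(event, logging.getLogger("transform")))
```

Dividing by active users keeps the score comparable between busy and quiet intervals; any other normalization would work equally well for the dashboard.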

Steps to run the GlassFlow pipeline

  1. Create a Space via CLI

Open a terminal and create a new space called examples to organize multiple pipelines:

glassflow space create examples

After creating the space successfully, you will get a SpaceID in the terminal.

  2. Create a Pipeline via CLI

Create a new data pipeline inside the space.

glassflow pipeline create clickstream-analytics-dashboard --space-id={space_id} --function=transform.py

This command initializes the pipeline with the name clickstream-analytics-dashboard in the examples space and specifies the transformation function transform.py. After running the command, it returns a new Pipeline ID with its Access Token.

  3. Add pipeline credentials to the environment configuration file

Add the following configuration variables to the .env file in the project directory:

GA_PROPERTY_ID=your_ga_property_id # Not needed if you are generating mock events.
PIPELINE_ID=your_pipeline_id
SPACE_ID=your_space_id
PIPELINE_ACCESS_TOKEN=your_pipeline_access_token

Replace your_pipeline_id, your_space_id, and your_pipeline_access_token with appropriate values obtained from your GlassFlow account.
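
Before starting the producer, it can help to confirm that the placeholders in .env were actually replaced. The sketch below parses the file with the standard library only and reports missing or placeholder values. The helper names are hypothetical, and the parser is deliberately simple: it treats everything after a # as a comment, so it would not handle values that contain that character.

```python
from pathlib import Path

# Keys the pipeline needs; GA_PROPERTY_ID is left out because it is optional
# when mock events are used.
REQUIRED_KEYS = ("PIPELINE_ID", "SPACE_ID", "PIPELINE_ACCESS_TOKEN")


def parse_env(text):
    """Parse KEY=value lines into a dict, dropping blank lines and # comments."""
    config = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        config[key.strip()] = value.strip()
    return config


def missing_keys(config, required=REQUIRED_KEYS):
    """Return required keys that are absent, empty, or still set to a your_... placeholder."""
    return [
        key
        for key in required
        if not config.get(key) or config[key].startswith("your_")
    ]


if __name__ == "__main__":
    config = parse_env(Path(".env").read_text(encoding="utf-8"))
    problems = missing_keys(config)
    print("Missing values:", problems if problems else "none")
```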

Design Streamlit dashboard

The Streamlit dashboard code in the consumer.py script visualizes the output of the GlassFlow transformation, which includes additional insights such as engagement score, device usage, content popularity, and geographic distribution.

The dashboard is updated in real-time with data being continuously consumed from the GlassFlow pipeline.
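
A dashboard like this typically keeps a rolling window of recent events and recomputes its aggregates whenever a new event arrives. The following sketch shows one way to do that in plain Python, independent of Streamlit and Plotly. The RollingStats class is hypothetical and is not taken from consumer.py, and the field names deviceCategory, unifiedScreenName, and screenPageViews are assumed to follow GA4 Data API naming.

```python
from collections import Counter, deque


class RollingStats:
    """Keep the most recent events and derive chart-ready aggregates from them."""

    def __init__(self, max_events=500):
        # A bounded deque discards the oldest event once the window is full.
        self.events = deque(maxlen=max_events)

    def add(self, event):
        self.events.append(event)

    def device_share(self):
        """Fraction of events in the window per device category."""
        counts = Counter(e.get("deviceCategory", "unknown") for e in self.events)
        total = sum(counts.values())
        return {device: n / total for device, n in counts.items()} if total else {}

    def top_pages(self, n=5):
        """Pages ranked by total page views within the window."""
        views = Counter()
        for e in self.events:
            page = e.get("unifiedScreenName", "unknown")
            views[page] += int(float(e.get("screenPageViews", 0)))
        return views.most_common(n)


if __name__ == "__main__":
    stats = RollingStats(max_events=3)
    stats.add({"deviceCategory": "desktop", "unifiedScreenName": "/home", "screenPageViews": "3"})
    stats.add({"deviceCategory": "mobile", "unifiedScreenName": "/pricing", "screenPageViews": "4"})
    print(stats.device_share(), stats.top_pages())
```

The dictionaries and lists returned here could be passed to a Plotly pie or bar chart on each refresh.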

Run the pipeline

Run data producer

Run the ga_producer.py script to publish data from Google Analytics, or the fake_producer.py script to publish mock events:

python ga_producer.py
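
For readers who want a feel for the mock data path, the sketch below generates one clickstream-like event at a time. It is an illustration only and does not reproduce fake_producer.py: the field names are assumed to follow GA4 Data API naming, and the value lists and ranges are made up.

```python
import random

# Hypothetical value pools for the mock dimensions.
DEVICES = ("desktop", "mobile", "tablet")
COUNTRIES = ("Germany", "United States", "India", "Brazil")
PAGES = ("/home", "/pricing", "/docs", "/blog")


def make_mock_event(rng):
    """Build one mock event; eventCount is always at least screenPageViews."""
    page_views = rng.randint(1, 20)
    return {
        "deviceCategory": rng.choice(DEVICES),
        "country": rng.choice(COUNTRIES),
        "unifiedScreenName": rng.choice(PAGES),
        "activeUsers": rng.randint(1, 10),
        "screenPageViews": page_views,
        "eventCount": page_views + rng.randint(0, 30),
    }


if __name__ == "__main__":
    rng = random.Random()  # pass a seed for reproducible output
    for _ in range(3):
        print(make_mock_event(rng))
```

Passing the random generator in as an argument keeps the function easy to test with a fixed seed.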

Run the dashboard

Use the streamlit command to run the dashboard:

streamlit run consumer.py

You should see several dashboards updating in real time.

You learned how to integrate real-time analytics data from Google Analytics into GlassFlow for further processing and visualization. Analytics data can also be stored in a database like ClickHouse for future use.

© 2023 GlassFlow