Training Fraud Detection ML Models with Real-time Data Streams

Learn step-by-step how to train a real-time fraud detection model with Apache Kafka and River.

In today's rapidly evolving digital landscape, static datasets can quickly become obsolete, rendering conventional machine learning models less effective over time. Enter the paradigm of online machine learning, or as I like to call it, real-time machine learning, where models are not just trained once but constantly updated using live data streams. This approach ensures that models stay relevant, adaptive, and increasingly accurate, reflecting real-time changes and patterns in data. As businesses and industries race to harness the most current insights from their operations, integrating continuous learning with real-time data streams is becoming an indispensable strategy.

Historically, machine learning relied heavily on batch processing. This meant accumulating large amounts of data, processing them, and then updating the model. While effective, this method often led to significant lag times between data collection and actionable insights. It also meant that sudden changes in patterns or anomalies in incoming data might not be promptly recognized. As the digital world became faster and more interconnected, the need for real-time data processing and adaptability became glaringly evident.

In this article, we'll delve into the transformative world of training machine learning models using these real-time data streams, exploring its benefits, challenges, and unparalleled potential in driving real-time decision-making and predictive accuracy.

Overview

In this article, we will set up a pipeline that feeds a machine learning model with data. The model will be trained continuously as we keep feeding it more data. Along the way, we will get familiar with the two tools that make this possible: Apache Kafka and River.

Then, we will finish the article with a discussion of the importance and challenges of streaming data in the context of machine learning.

Tools I used

  • Apache Kafka. Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to the Apache Software Foundation. It is designed to handle real-time data feeds with high throughput and low latency.

  • River. River is a cutting-edge machine-learning package that efficiently handles real-time data streams. With a suite of advanced tools and algorithms, River enables users to train models that adapt and learn continuously as new data becomes available.
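
Before running the code in this article, both Python libraries need to be installed. The examples below use the kafka-python package, which provides the `kafka` module; assuming pip, the installation looks like this:

$ pip install river kafka-python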

The Data

In this article, we are going to use the credit card fraud dataset. The dataset can be downloaded from Kaggle, but the River package already ships with it, so we will simply load it from River.
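
As a quick sanity check, here is a minimal way to load the dataset from River and inspect the first observation:

from river import datasets

# Load the credit card fraud dataset bundled with River
dataset = datasets.CreditCard()

# Each observation is a (features, label) pair
x, y = next(iter(dataset))
print(x)  # dict of Time, V1..V28, Amount
print(y)  # 0 = legitimate, 1 = fraud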

Here is a quick summary of the data:

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.

Kafka Setup

To set up Kafka locally, let's use docker-compose. This is much simpler than installing Kafka manually. If you don't have Docker and docker-compose installed, please refer to the official Docker documentation.

# filename: docker-compose.yaml
version: '3.7'
services:
  zookeeper:
    image: zookeeper:latest
    container_name: "zookeeper-stream-ml"
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
  kafka:
    image: wurstmeister/kafka:latest
    restart: unless-stopped
    container_name: "kafka-stream-ml"
    ports:
      - "9092:9092"
    expose:
      - "9093"
    depends_on:
      - zookeeper
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper-stream-ml:2181/kafka
      KAFKA_BROKER_ID: 0
      KAFKA_ADVERTISED_HOST_NAME: kafka-stream-ml
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-stream-ml:9093,OUTSIDE://localhost:9092
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9093,OUTSIDE://0.0.0.0:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,OUTSIDE:PLAINTEXT
      KAFKA_CREATE_TOPICS: "ml_training_data:1:1"

Now that we have the docker-compose file ready, we can run Kafka. Remember that you need to be in the same directory as the above docker-compose.yaml file to run the following command.

$ docker-compose up
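
Once the containers are up, you can optionally verify that the broker is reachable and that the ml_training_data topic was created. The wurstmeister image puts the Kafka CLI scripts on the container's PATH, so a check like the following should list the topic:

$ docker exec -it kafka-stream-ml kafka-topics.sh --bootstrap-server localhost:9092 --list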

Send Streaming Data

Once Kafka is up and running, we can start sending data to it. The following code uses the credit card fraud dataset and sends it to Kafka with a random sleep between messages.

import time
import json
import random

from kafka import KafkaProducer
from river import datasets


# Create a Kafka producer that connects to Kafka on port 9092
producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda x: json.dumps(x).encode("utf-8"),
)

# Initialize the River credit card fraud dataset.
dataset = datasets.CreditCard()

# Send observations to the Kafka topic with a random sleep
for x, y in dataset:
    print("Sending:")
    print(f"   x: {x}")
    print(f"   y: {y}")
    data = {"x": x, "y": y}     
    data = {"x": x, "y": y}
    producer.send("ml_training_data", value=data)

    time.sleep(

If we run the code above, we will see the following data:

Sending:
   x: {'Time': 0.0, 'V1': -1.3598071336738, 'V2': -0.0727811733098497, 'V3': 2.53634673796914, 'V4': 1.37815522427443, 'V5': -0.338320769942518, 'V6': 0.462387777762292, 'V7': 0.239598554061257, 'V8': 0.0986979012610507, 'V9': 0.363786969611213, 'V10': 0.0907941719789316, 'V11': -0.551599533260813, 'V12': -0.617800855762348, 'V13': -0.991389847235408, 'V14': -0.311169353699879, 'V15': 1.46817697209427, 'V16': -0.470400525259478, 'V17': 0.207971241929242, 'V18': 0.0257905801985591, 'V19': 0.403992960255733, 'V20': 0.251412098239705, 'V21': -0.018306777944153, 'V22': 0.277837575558899, 'V23': -0.110473910188767, 'V24': 0.0669280749146731, 'V25': 0.128539358273528, 'V26': -0.189114843888824, 'V27': 0.133558376740387, 'V28': -0.0210530534538215, 'Amount': 149.62}
   y: 0
Sending:
   x: {'Time': 0.0, 'V1': 1.19185711131486, 'V2': 0.26615071205963, 'V3': 0.16648011335321, 'V4': 0.448154078460911, 'V5': 0.0600176492822243, 'V6': -0.0823608088155687, 'V7': -0.0788029833323113, 'V8': 0.0851016549148104, 'V9': -0.255425128109186, 'V10': -0.166974414004614, 'V11': 1.61272666105479, 'V12': 1.06523531137287, 'V13': 0.48909501589608, 'V14': -0.143772296441519, 'V15': 0.635558093258208, 'V16': 0.463917041022171, 'V17': -0.114804663102346, 'V18': -0.183361270123994, 'V19': -0.145783041325259, 'V20': -0.0690831352230203, 'V21': -0.225775248033138, 'V22': -0.638671952771851, 'V23': 0.101288021253234, 'V24': -0.339846475529127, 'V25': 0.167170404418143, 'V26': 0.125894532368176, 'V27': -0.00898309914322813, 'V28': 0.0147241691924927, 'Amount': 2.69}
   y: 0
Sending:
   x: {'Time': 1.0, 'V1': -1.35835406159823, 'V2': -1.34016307473609, 'V3': 1.77320934263119, 'V4': 0.379779593034328, 'V5': -0.503198133318193, 'V6': 1.80049938079263, 'V7': 0.791460956450422, 'V8': 0.247675786588991, 'V9': -1.51465432260583, 'V10': 0.207642865216696, 'V11': 0.624501459424895, 'V12': 0.066083685268831, 'V13': 0.717292731410831, 'V14': -0.165945922763554, 'V15': 2.34586494901581, 'V16': -2.89008319444231, 'V17': 1.10996937869599, 'V18': -0.121359313195888, 'V19': -2.26185709530414, 'V20': 0.524979725224404, 'V21': 0.247998153469754, 'V22': 0.771679401917229, 'V23': 0.909412262347719, 'V24': -0.689280956490685, 'V25': -0.327641833735251, 'V26': -0.139096571514147, 'V27': -0.0553527940384261, 'V28': -0.0597518405929204, 'Amount': 378.66}
   y: 0

Train Model with the Streaming Data

Credit card fraud detection is a classic problem in the domain of anomaly detection and classification, where the objective is to differentiate between legitimate and fraudulent transactions. For this article, we are going to use logistic regression. Logistic regression is simple to use and easy to interpret, and it can perform surprisingly well even on very large amounts of data. It also serves as a good baseline to benchmark against when we decide to use more complex methods later.

While logistic regression has its strengths, it also has limitations, especially in the context of highly imbalanced datasets like our credit card fraud data. To tackle this issue, we will use the RandomUnderSampler. I won't go into too much detail about handling data imbalance in this article. Using the River library, we can set up our model pipeline to handle imbalanced data before we perform the classification step.

Here is the code where we set up and train our model using streaming data from Kafka.

import json

from kafka import KafkaConsumer

from river import linear_model
from river import preprocessing
from river import metrics
from river import imblearn
from river import optim


# Use rocauc as the metric for evaluation
metric = metrics.ROCAUC()

# Create a logistic regression model with a scaler
# and Random Under Sampler to deal with data imbalance
model = (
    preprocessing.StandardScaler() |
    imblearn.RandomUnderSampler(
        classifier=linear_model.LogisticRegression(
            loss=optim.losses.Log(weight_pos=5)
        ),
        desired_dist={0: .8, 1: .2},
        seed=42
    )
)

# Create our Kafka consumer
consumer = KafkaConsumer(
    "ml_training_data",
    bootstrap_servers=["localhost:9092"],
    auto_offset_reset="earliest",
    enable_auto_commit=True,
    group_id="the-group-id",
    value_deserializer=lambda x: json.loads(x.decode("utf-8")),
)

# Use each event to update our model and print the prediction and metrics
for event in consumer:
    event_data = event.value
    try:
        x = event_data["x"]
        y = event_data["y"]
        prediction = model.predict_one(x)
        y_pred = model.predict_proba_one(x)

        model.learn_one(x, y)
        metric.update(y, y_pred)
        print(f"Prediction: {prediction}   Accuracy:{metric}")
    except Exception:
        # Skip malformed events instead of crashing the consumer
        print("Processing bad data...")

When we run the script above, we can see the output as follows.

Prediction: False   Accuracy:ROCAUC: -0.00%
Prediction: False   Accuracy:ROCAUC: -0.00%
Prediction: True   Accuracy:ROCAUC: -0.00%
...
Prediction: False   Accuracy:ROCAUC: 96.52%
Prediction: False   Accuracy:ROCAUC: 96.52%
Prediction: False   Accuracy:ROCAUC: 96.52%
Prediction: False   Accuracy:ROCAUC: 96.52%

As more data is consumed, the model's ROC AUC improves.
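
A natural follow-up is persisting the continuously trained model between runs. Since River pipelines are plain Python objects, one straightforward option is pickle; here is a sketch, where the filename is just an example:

import pickle

# Save the trained pipeline so a later run can pick up where it left off
with open("fraud_model.pkl", "wb") as f:  # example filename
    pickle.dump(model, f)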

Use Cases, Benefits, and Challenges

The world of data science and machine learning is continually evolving, and one area of rapid development is real-time machine learning. In traditional batch machine learning, models are trained on a large, static dataset all at once. In contrast, real-time machine learning involves models that learn continuously as new data becomes available. This paradigm is instrumental when data is streamed in real-time. Here, we explore some use cases, benefits, and challenges of this approach.

Benefits

  • Real-time Adaptation: One of the most notable advantages of this type of machine learning is the model's ability to adapt to changes in real-time. When data is streamed, patterns can change over time. Online learning allows models to adjust immediately, ensuring they remain relevant and accurate.

  • Memory Efficiency: Since real-time algorithms process data points sequentially, there's no need to store large datasets in memory. This can be especially beneficial for applications running on devices with limited memory resources, such as IoT devices.

  • Scalability: For vast datasets or those that grow indefinitely, traditional batch learning can be infeasible due to computational or storage limitations. Real-time machine learning offers a scalable solution, as models update with each new data point, eliminating the need for retraining from scratch.

  • Immediate Feedback: In many real-world scenarios, it's crucial to get immediate feedback to make timely decisions. With real-time machine learning, predictions and model adjustments can be made on the fly, enhancing decision-making processes.

  • Less Redundant Computation: Unlike batch learning, where the entire model might be retrained with the addition of new data, real-time learning focuses only on the new data, leading to computational savings. In fact, a 2021 study from Grubhub reported a +20% increase in metrics and a 45x cost saving from leveraging online learning. (A minimal sketch of this incremental pattern follows this list.)
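
These benefits all stem from the same basic pattern, often called test-then-train: each observation is used once for prediction, then for a single incremental update, and can then be discarded. Here is a minimal sketch with River, using a small in-memory list as a stand-in for a real stream (the feature names are made up for illustration):

from river import linear_model, preprocessing

# A scaler piped into logistic regression, updated one observation at a time
model = preprocessing.StandardScaler() | linear_model.LogisticRegression()

# Stand-in for an unbounded stream of (features, label) pairs
stream = [
    ({"amount": 12.0, "hour": 3}, 0),
    ({"amount": 980.0, "hour": 2}, 1),
]

for x, y in stream:
    model.predict_one(x)   # test: predict before the label is revealed
    model.learn_one(x, y)  # train: one in-place update, no retraining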

Use cases

Because of the benefits above, new opportunities become available for businesses and industries that leverage the prevalence of data streaming and the power of machine learning. Some use cases that come to mind:

  • Fraud detection: In this article, we train our model to predict credit card fraud. With the correct setup, we can detect and prevent fraudulent activities in real time. This will help businesses and industries minimize financial losses.

  • Real-time decision-making: Streaming data gives organizations the ability to evaluate and interpret data instantly, facilitating swift and knowledgeable choices. This capability is especially vital for sectors like finance, healthcare, and transportation, where prompt action is crucial.

  • Enhanced Customer Experience: By utilizing streaming data to observe and assess customer interactions as they happen, businesses can elevate their customer service and offer tailored suggestions.

  • Predictive analytics: Streaming data facilitates the real-time training of machine learning models, paving the way for enhanced predictive analytics and future predictions.

Challenges

With its many advantages, real-time machine learning is not without challenges. It demands efficient algorithms that can process data swiftly without consuming excessive computational resources. There's also a data quality issue; erroneous data can mislead the algorithm, leading to poor predictions. This is particularly concerning when considering adversarial attacks or situations where the data stream can be manipulated.

  • Noise Sensitivity: One of the primary challenges of real-time machine learning is its sensitivity to noisy data. A single mislabeled or erroneous data point can negatively impact the model, especially if the model gives it too much importance.

  • Concept Drift: Over time, the underlying patterns in data can change, a phenomenon known as concept drift. While real-time machine learning is designed to adapt to changes, rapid or frequent drifts can still be challenging to address effectively (see the drift-detection sketch after this list).

  • Computational Constraints: Although real-time machine learning is more memory-efficient, it still requires rapid processing to keep up with high-frequency data streams. This can be computationally demanding in scenarios with massive data inflow rates.

  • Parameter Tuning: Setting the learning rate and other algorithm parameters can be more challenging in real-time machine learning. An aggressive learning rate can cause the model to overfit to recent data, while a slow rate might make the model too rigid.

  • Lack of Comprehensive Review: With the continuous inflow of data, there might not be a chance to re-analyze or reconsider earlier data. This continuous mode can sometimes mean the model overlooks broader trends or misses out on refining its understanding of certain patterns.
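
For concept drift in particular, River ships with drift detectors that can flag distribution changes as they happen. Below is a minimal sketch using the ADWIN detector on a synthetic stream whose mean shifts halfway through; the attribute names follow recent River releases and may differ in older versions:

from river import drift

# ADWIN watches a stream of numbers and signals when its distribution changes
detector = drift.ADWIN()

# Synthetic stream: the mean jumps from 0 to 5 at index 500
stream = [0.0] * 500 + [5.0] * 500

for i, value in enumerate(stream):
    detector.update(value)
    if detector.drift_detected:
        print(f"Drift detected at index {i}")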

Conclusion

The age of static, one-time learning is giving way to the dynamic, continuous learning heralded by real-time machine learning. This isn't just another advancement; it's a paradigm shift. As we stand on the brink of a data-driven future, with interconnected devices and relentless data streams, it becomes clear: real-time machine learning isn't just better than the alternatives—it might very well be the blueprint for the future of technology.

Real-time machine learning with streamed data offers a compelling solution for continuous learning and adaptation in use cases such as real-time fraud detection. Of course, while its benefits regarding adaptability, efficiency, and scalability are clear, the challenges it presents require careful consideration. Addressing issues like noise sensitivity, concept drift, and parameter tuning demands a deep understanding of the domain and the specific application at hand. As technology advances, we can expect the development of tools and strategies that will further harness the strengths of real-time machine learning while mitigating its challenges.

Streaming technology, in particular, holds a pivotal role in the framework of real-time machine learning. In today's data-driven landscape, the continuous influx of information demands real-time processing to derive actionable insights without delay. Streaming technology facilitates this immediate intake and processing of data, ensuring that machine learning models are constantly fed with fresh data points. This continuous data stream allows models to learn, adapt, and evolve on the fly, reflecting the most recent changes and patterns in the data. Without streaming technology, the dynamism inherent to real-time machine learning would be stifled, hindering the model's ability to stay updated and relevant. Simply put, streaming technology is the lifeblood that empowers real-time machine-learning systems to function at their full potential, making instantaneous learning and decision-making a reality.
