The final—and most advanced—requirement of the project was to simulate a real-time Industry 5.0 production environment. Static models (like the Random Forest trained in Phase 4b) degrade over time in production due to sensor decalibration, operator fatigue, or process changes—a phenomenon known as Concept Drift.
My goal was to build a streaming architecture that not only learns incrementally (row by row) but also monitors its own error rate to detect when the physical reality on the factory floor changes.
The Streaming Architecture & Prequential Evaluation
Traditional machine learning relies on a fixed train/test split. Here, I instead used the river library to implement Prequential Evaluation (Test-Then-Train).
As each of the 97,612 time-windows arrives:
1. Test: the model predicts the window's label before seeing the ground truth.
2. Evaluate: the prediction is scored against the true label to update the running accuracy.
3. Monitor: the outcome (correct or incorrect) is fed to the drift detector.
4. Train: only then does the model learn from the window.
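For reference, river bundles this exact test-then-train pattern into a helper, evaluate.progressive_val_score; the hand-written loop in the code analysis below does the same job while also exposing the drift detector. A minimal sketch, assuming the simulate_stream generator and the model defined later in this section:
Python
from river import evaluate, metrics

# Prequential (test-then-train) evaluation in a single call:
# every window is scored before the model is allowed to learn from it.
final_accuracy = evaluate.progressive_val_score(
    dataset=simulate_stream(feat_df, inject_drift=True),
    model=model,
    metric=metrics.Accuracy(),
    print_every=10_000,  # log the running accuracy every 10,000 windows
)
print(final_accuracy)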
Simulating the Real World: The Drift Injection
To rigorously test the system, I designed a simulate_stream function. For the first half of the dataset (0 to 48,806 windows), the stream runs normally. However, precisely at the 50% mark, I injected a 30% label noise rate (randomly flipping labels to incorrect classes). This simulated a sudden, severe process failure or sensor malfunction on the production line.
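The original simulate_stream implementation isn't reproduced here, so the following is only a minimal sketch of the injection logic; the label column name, the uniform random label flip, and the assumption that feat_df is a pandas DataFrame of engineered features are mine:
Python
import random

def simulate_stream(feat_df, inject_drift=False, noise_rate=0.30,
                    label_col="label", seed=42):
    """Yield (features, label) pairs, corrupting labels after the 50% mark."""
    rng = random.Random(seed)
    n = len(feat_df)
    classes = list(feat_df[label_col].unique())
    for i, (_, row) in enumerate(feat_df.iterrows()):
        x = row.drop(label_col).to_dict()
        y = row[label_col]
        # Simulated process failure: from the halfway point onward,
        # flip ~30% of labels to a randomly chosen wrong class.
        if inject_drift and i >= n // 2 and rng.random() < noise_rate:
            y = rng.choice([c for c in classes if c != y])
        yield x, y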
The ADWIN & ARF Response (The "Aha!" Moment)
For the modeling and detection engine, I paired an Adaptive Random Forest (ARF) with an ADWIN (Adaptive Windowing) drift detector.
ADWIN registered a statistically significant spike in the error rate and fired a [DRIFT DETECTED] alert just 58 windows (~3 seconds) after the process was corrupted.
Phase 5 Code Analysis:
1. The Incremental Learning Loop
Python
for i, (x, y_true) in enumerate(simulate_stream(feat_df, inject_drift=True)):
    y_pred = model.predict_one(x)                 # Test
    accuracy.update(y_true, y_pred)               # Evaluate
    drift_detector.update(int(y_pred != y_true))  # Monitor error
    model.learn_one(x, y_true)                    # Train
This loop is the beating heart of the system. It processes the feature-engineered equivalent of the entire 10 GB dataset continuously, one window at a time, and memory use stays near-constant because only the model and the current window ever live in RAM.
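The [DRIFT DETECTED] alert quoted above is raised from inside this same loop. A minimal sketch of the check, placed right after the drift_detector.update(...) call, assuming a recent river release where ADWIN exposes a drift_detected flag (older releases name it change_detected) and an illustrative print format:
Python
    # Inside the loop body, immediately after drift_detector.update(...)
    if drift_detector.drift_detected:
        print(f"[DRIFT DETECTED] at window {i} "
              f"(running accuracy: {accuracy.get():.3f})")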
2. Model Definition
Python
from river import drift, forest, metrics, preprocessing

# Online pipeline: incremental scaling feeding a 10-tree Adaptive Random Forest
model = preprocessing.StandardScaler() | forest.ARFClassifier(n_models=10, seed=42)

# Running prequential accuracy and the ADWIN detector used in the loop above
accuracy = metrics.Accuracy()
drift_detector = drift.ADWIN(delta=0.002)
The | operator chains the scaler and the classifier into a single online pipeline, so each window is incrementally scaled before hitting the ARF. n_models=10 provided enough ensemble power to handle the 15 classes while keeping per-window computation light enough for real-time use.
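Worth noting: the ARF itself also embeds per-tree drift and warning detectors, which govern when individual trees are retired and replaced; the global ADWIN above monitors the error rate of the whole pipeline instead. If those internal detectors needed tuning, the constructor exposes them. A sketch with illustrative values (not the settings used in the project):
Python
from river import drift, forest, preprocessing

# Same pipeline, with the ARF's internal per-tree detectors made explicit.
model = preprocessing.StandardScaler() | forest.ARFClassifier(
    n_models=10,
    seed=42,
    drift_detector=drift.ADWIN(delta=0.001),   # per-tree drift threshold
    warning_detector=drift.ADWIN(delta=0.01),  # starts a background tree earlier
)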