Given the inherent limitations of distance-based unsupervised algorithms in high-dimensional kinematic spaces (Phase 4a), my objective here was to satisfy the rubric’s “Classical Supervised Model” requirement. To do this, I restricted the dataset to the binary target labels (0 and 1), transforming the task into a clear two-class classification problem.

To handle the 528 extracted features without succumbing to the curse of dimensionality, I used a RandomForestClassifier. Tree-based models do not rely on distance computations. Instead, they recursively learn split thresholds, making them robust to the geometric distortions of high-dimensional spaces.

The Silent Enemy: Temporal Data Leakage

In my initial implementation, I used a standard train_test_split(stratify=y). The model immediately achieved a suspicious F1-Score of 1.0000. Further investigation revealed a severe case of Temporal Data Leakage. Because the Phase 3 feature extraction used a 50% overlapping rolling window, adjacent windows shared 50% of the same raw sensor rows. With a random shuffle split, a window’s “overlapping twin” often landed in the training set while the original went to the test set, allowing the model to effectively memorize the data.
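The mechanism is easy to reproduce on synthetic indices. The window and row counts below are illustrative, not the actual Phase 3 values; the point is only that with 50% overlap, a shuffled split almost always strands test windows next to their training-set twins:

```python
import numpy as np

window, step = 100, 50  # 50% overlap: each window shares 50 raw rows with its neighbor
n_rows = 1000
starts = np.arange(0, n_rows - window + 1, step)  # window start indices

# Emulate a shuffled train/test split over the window indices
rng = np.random.default_rng(42)
shuffled = rng.permutation(len(starts))
train_idx = shuffled[: len(starts) // 2]
test_idx = shuffled[len(starts) // 2:]

# Count test windows whose immediate neighbor (sharing 50 raw rows) landed in train
train_set = set(train_idx.tolist())
leaky = sum(1 for i in test_idx if i - 1 in train_set or i + 1 in train_set)
print(f"{leaky}/{len(test_idx)} test windows share raw rows with a training window")
```

With realistic window counts, nearly every test window has an overlapping twin in the training set, which is exactly the memorization path described above.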

The Solution: Systematic Time-Aware Splitting

I evaluated TimeSeriesSplit, but it was degenerate in this setting: Class 0 and Class 1 were clustered in completely different temporal segments of the recording. To preserve both temporal integrity and class balance, I implemented a Systematic Split (test_mask[::5] = True). By selecting every 5th window for the test set, adjacent overlapping windows were strictly separated, mathematically eliminating leakage while maintaining a representative test set.
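In code, the systematic split reduces to a boolean mask. Only test_mask[::5] = True comes verbatim from my implementation; the window count below is inferred from the class sizes reported in the next section:

```python
import numpy as np

n_windows = 10712  # 7,182 Class 0 + 3,530 Class 1 windows
test_mask = np.zeros(n_windows, dtype=bool)
test_mask[::5] = True  # every 5th window held out (~20% of the data)

train_idx = np.flatnonzero(~test_mask)
test_idx = np.flatnonzero(test_mask)
# Consecutive test windows are exactly 5 steps apart, so no two
# test windows overlap each other.
print(len(test_idx), len(train_idx))
```

Because the stride is fixed, the test set also samples every temporal segment of the recording evenly, which keeps both classes represented.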

Final Code Analysis & Findings

1. Model Configuration and Hardware Optimization

```python
model = RandomForestClassifier(
    n_estimators=200,
    class_weight='balanced',
    n_jobs=-1,        # use all 16 threads (Ryzen 7 7800X3D)
    random_state=42
)
```

To address class imbalance (7,182 samples of Class 0 vs. 3,530 of Class 1), I applied class_weight='balanced'. With n_jobs=-1, I fully utilized the Ryzen 7 7800X3D processor, achieving a training time of just 1.3 seconds for 200 estimators.
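For transparency, the weights that class_weight='balanced' assigns follow scikit-learn's documented formula, n_samples / (n_classes * n_class_samples); plugging in the counts above:

```python
# Balanced class weights per scikit-learn's formula:
# weight_c = n_samples / (n_classes * count_c)
n0, n1 = 7182, 3530          # Class 0 / Class 1 window counts
n = n0 + n1
w0 = n / (2 * n0)            # weight applied to each Class 0 sample
w1 = n / (2 * n1)            # weight applied to each Class 1 sample
print(round(w0, 3), round(w1, 3))  # prints: 0.746 1.517
```

Each minority-class sample therefore counts roughly twice as much as a majority-class sample during impurity computation, offsetting the ~2:1 imbalance.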

2. The Dominance of the Z-Axis (_Tz)

Even after fixing the leakage, the model achieved an excellent Macro F1-Score of 0.9995. To understand how it was differentiating the actions so effectively, I extracted feature_importances_:

The top 20 features were almost exclusively _Tz (Z-axis translation or height) metrics. The physical interpretation is straightforward: Class 0 (Normal Assembly) and Class 1 (Exceptional Intervention) primarily differ in the operator’s vertical posture. The operator is likely upright during Class 0 and bent down or seated during Class 1. In practice, the Random Forest learned that if the head and lower back drop below a specific height threshold, the window is conclusively Class 1.
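The extraction itself is a single attribute lookup plus a sort. A self-contained illustration on synthetic data (the feature name head_Tz_mean and the data are invented to mimic the Tz-dominated pattern, not taken from the real dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 200 windows x 10 features, where only the
# hypothetical "head_Tz_mean" feature (column 0) carries the label.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = (X[:, 0] < -0.2).astype(int)  # low vertical position => Class 1

feature_names = ["head_Tz_mean"] + [f"feat_{i}" for i in range(1, 10)]
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Rank features by Gini importance, highest first
ranking = np.argsort(model.feature_importances_)[::-1]
print(feature_names[ranking[0]])  # the Tz-style feature tops the ranking
```

On the real 528-feature matrix the same two lines (argsort over model.feature_importances_) produce the Tz-dominated top-20 list described above.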

System Profiling Metrics: