In this phase, my objective was to tackle the core requirement of the project: building an unsupervised learning model to cluster the sensor data into the 15 distinct production actions. I utilized MiniBatchKMeans for memory-efficient clustering and integrated the Hungarian Algorithm (scipy.optimize.linear_sum_assignment) to optimally map the arbitrary cluster IDs (0-14) back to the ground-truth labels for a robust F1-Score evaluation.
However, this phase turned out to be the most revealing stress-test of the entire project.
The First Trap: The Curse of Dimensionality

Initially, I fed the model the full feature matrix: the 528 statistical features extracted during Phase 3. The resulting F1-Score was catastrophic (around 0.15). I quickly identified the culprit: the curse of dimensionality. K-Means is a distance-based algorithm relying on Euclidean distance, and in a 528-dimensional space the distances between any two points become virtually identical. The algorithm was essentially guessing blindly because every data point looked equally far apart.
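To make the distance-concentration effect concrete, here is a small sketch (synthetic uniform data, not the project's features) comparing the relative spread of pairwise Euclidean distances in 2 dimensions versus 528:

```python
import numpy as np

rng = np.random.default_rng(42)

def distance_spread(n_dims, n_points=100):
    """Relative contrast: (max - min) pairwise distance divided by min."""
    X = rng.random((n_points, n_dims))
    # All pairwise Euclidean distances via broadcasting
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d = d[np.triu_indices(n_points, k=1)]  # keep each unordered pair once
    return (d.max() - d.min()) / d.min()

# In 2-D the nearest and farthest pairs differ by orders of magnitude;
# in 528-D all pairwise distances cluster tightly around the same value.
print(distance_spread(2))
print(distance_spread(528))
```

When the relative contrast approaches zero, "nearest cluster centroid" carries almost no signal, which is exactly what cripples K-Means here.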
The Attempted Fix: Principal Component Analysis (PCA)

To mitigate the curse of dimensionality, I introduced PCA before the K-Means clustering step.
The Brutal Reality: Why Unsupervised Learning Failed Here

Despite retaining 94.3% of the variance, the F1-Score barely changed, remaining around 0.14. The confusion matrix heatmap showed substantial overlap between classes.

Design Justification: Unsupervised, distance-based algorithms like K-Means are fundamentally ill-suited to 15-class, fine-grained kinematic action recognition. Human motion is highly nuanced; the difference between "tightening" and "loosening" a screw can be a slight rotational variation. PCA improves the geometry, but K-Means still cannot separate 15 dense, overlapping action clusters without label guidance.
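As a sketch of how a retained-variance figure like this can be checked (synthetic data stands in for the real feature matrix; note that scikit-learn's PCA also accepts a float `n_components`, which keeps the smallest number of components reaching that variance fraction):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic stand-in for the real 528-feature matrix (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 528))

X_scaled = StandardScaler().fit_transform(X)

# A float n_components = target fraction of variance to retain
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

# How many components were needed, and how much variance they explain
print(pca.n_components_)
print(pca.explained_variance_ratio_.sum())
```

`explained_variance_ratio_` is also how a figure like "94.3% retained" is read off after fitting with a fixed integer component count.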
This "failure" proved that while unsupervised models are great for broad anomaly detection, real-time industrial action classification absolutely requires the non-linear, non-distance-based decision boundaries of tree-based supervised algorithms.
Phase 4a Code Analysis:
1. Dimensionality Reduction (PCA)
```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize so every feature contributes on the same scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"Applying PCA ({N_PCA_COMPS} components)...")
pca = PCA(n_components=N_PCA_COMPS, random_state=42)
X_pca = pca.fit_transform(X_scaled)
```
Standardization is mandatory before PCA and K-Means to ensure sensors measured in different units (e.g., millimeters vs. degrees) do not dominate the distance calculations.
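A toy illustration of why this matters, using two made-up sensor columns on very different scales (values are illustrative, not from the dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Column 0: position in millimetres; column 1: joint angle in degrees
X = np.array([
    [1500.0,  10.0],
    [1520.0, 170.0],   # nearly same position, very different angle
    [1900.0,  12.0],   # very different position, nearly same angle
    [1880.0, 168.0],
])

# Raw distances: the millimetre axis swamps the angular difference
raw = (np.linalg.norm(X[0] - X[1]), np.linalg.norm(X[0] - X[2]))

# After standardization each feature has unit variance,
# so both sensors contribute comparably to the distance
Xs = StandardScaler().fit_transform(X)
scaled = (np.linalg.norm(Xs[0] - Xs[1]), np.linalg.norm(Xs[0] - Xs[2]))

print(raw)     # position difference dominates
print(scaled)  # the two distances are now comparable
```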
2. Fast Clustering
```python
from sklearn.cluster import MiniBatchKMeans

kmeans = MiniBatchKMeans(
    n_clusters=N_CLUSTERS,
    random_state=42,
    batch_size=4096,
    n_init=10,
)
kmeans.fit(X_pca)
```
I opted for MiniBatchKMeans over standard KMeans. It produces nearly identical results but operates in a fraction of the time and RAM, peaking at just 7.6 MB during training.
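The "nearly identical results" claim can be sanity-checked on synthetic data (blobs standing in for the PCA-reduced features; inertia, the within-cluster sum of squares, is the comparison metric):

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Well-separated synthetic clusters as a stand-in for the real features
X, _ = make_blobs(n_samples=20_000, centers=15, n_features=20,
                  random_state=42)

full = KMeans(n_clusters=15, n_init=10, random_state=42).fit(X)
mini = MiniBatchKMeans(n_clusters=15, batch_size=4096, n_init=10,
                       random_state=42).fit(X)

# Inertia should be close for both; MiniBatchKMeans trades a tiny
# amount of solution quality for large savings in time and memory
print(full.inertia_, mini.inertia_)
```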
3. The Hungarian Algorithm (Linear Sum Assignment)
```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(true_labels, cluster_labels, n_clusters):
    # Cost matrix: entry [i, j] = samples in cluster i with true label j
    cost = np.zeros((n_clusters, n_clusters), dtype=np.int64)
    for c, t in zip(cluster_labels, true_labels):
        cost[c, t] += 1
    # linear_sum_assignment minimizes cost, so negate to maximize overlap
    rows, cols = linear_sum_assignment(-cost)
    # Optimal 1-to-1 mapping: cluster ID -> true label
    return dict(zip(rows, cols))
```
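A self-contained sketch of the full match-then-score step, using synthetic labels in place of the real ones (the 20% noise level and random permutation are illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)
n_clusters = 15

# Synthetic ground truth; cluster IDs are a noisy permutation of it,
# mimicking arbitrary K-Means cluster numbering
true = rng.integers(0, n_clusters, size=5000)
perm = rng.permutation(n_clusters)
pred = perm[true].copy()
noise = rng.random(5000) < 0.2
pred[noise] = rng.integers(0, n_clusters, size=noise.sum())

# Overlap matrix: entry [i, j] = samples with cluster i and true label j
overlap = np.zeros((n_clusters, n_clusters), dtype=np.int64)
np.add.at(overlap, (pred, true), 1)

# Maximize total overlap by minimizing its negation
rows, cols = linear_sum_assignment(-overlap)
mapping = dict(zip(rows, cols))

# Remap cluster IDs to their matched labels, then score as usual
remapped = np.array([mapping[c] for c in pred])
print(f1_score(true, remapped, average="weighted"))
```

Without this matching step the F1-Score would be meaningless, since cluster 3 might correspond to action 11; the Hungarian assignment recovers the best possible relabeling before scoring.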