In this phase, my objective was to tackle the core requirement of the project: building an unsupervised learning model to cluster the sensor data into the 15 distinct production actions. I utilized MiniBatchKMeans for memory-efficient clustering and integrated the Hungarian Algorithm (scipy.optimize.linear_sum_assignment) to optimally map the arbitrary cluster IDs (0-14) back to the ground-truth labels for a robust F1-Score evaluation.
However, this phase turned out to be the most revealing stress-test of the entire project.
The First Trap: The Curse of Dimensionality

Initially, I fed the model the full feature matrix: the 528 statistical features extracted during Phase 3. The resulting F1-Score was catastrophic (around 0.15). I quickly identified the culprit: the curse of dimensionality. K-Means is a distance-based algorithm relying on Euclidean distance, and in a 528-dimensional space the distances between any two points become virtually identical. The algorithm was essentially guessing blindly because every data point looked equally far apart.
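To make the distance-concentration effect concrete, here is a small sketch (synthetic uniform data, not the project's features) comparing the relative spread of pairwise Euclidean distances in 2 dimensions versus 528:

```python
import numpy as np

rng = np.random.default_rng(42)

def distance_spread(n_dims, n_points=100):
    """Relative contrast: (max - min) pairwise distance divided by min."""
    X = rng.random((n_points, n_dims))
    # All pairwise Euclidean distances via broadcasting
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d = d[np.triu_indices(n_points, k=1)]  # keep each unordered pair once
    return (d.max() - d.min()) / d.min()

# In 2-D the nearest and farthest pairs differ by orders of magnitude;
# in 528-D all pairwise distances cluster tightly around the same value.
print(distance_spread(2))
print(distance_spread(528))
```

When the relative contrast approaches zero, "nearest cluster centroid" carries almost no signal, which is exactly what cripples K-Means here.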
The Attempted Fix: Principal Component Analysis (PCA)

To mitigate the curse of dimensionality, I introduced PCA before the K-Means clustering step.
The Brutal Reality: Why Unsupervised Learning Failed Here

Despite retaining 94.3% of the variance, the F1-Score barely changed, remaining around 0.14. The confusion matrix heatmap showed substantial overlap between classes.

Design Justification: Unsupervised, distance-based algorithms like K-Means are fundamentally ill-suited to 15-class, fine-grained kinematic action recognition. Human motion is highly nuanced; the difference between "tightening" and "loosening" a screw can be a slight rotational variation. PCA improves the geometry, but K-Means still cannot separate 15 dense, overlapping action clusters without label guidance.
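As a sketch of how a retained-variance figure like this can be checked (synthetic data stands in for the real feature matrix; note that scikit-learn's PCA also accepts a float `n_components`, which keeps the smallest number of components reaching that variance fraction):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic stand-in for the real 528-feature matrix (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 528))

X_scaled = StandardScaler().fit_transform(X)

# A float n_components = target fraction of variance to retain
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

# How many components were needed, and how much variance they explain
print(pca.n_components_)
print(pca.explained_variance_ratio_.sum())
```

`explained_variance_ratio_` is also how a figure like "94.3% retained" is read off after fitting with a fixed integer component count.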
This "failure" proved that while unsupervised models are great for broad anomaly detection, real-time industrial action classification absolutely requires the non-linear, non-distance-based decision boundaries of tree-based supervised algorithms.
Phase 4a Code Analysis:
1. Dimensionality Reduction (PCA)
```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize so every feature contributes on the same scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"Applying PCA ({N_PCA_COMPS} components)...")
pca = PCA(n_components=N_PCA_COMPS, random_state=42)
X_pca = pca.fit_transform(X_scaled)
```
Standardization is mandatory before PCA and K-Means to ensure sensors measured in different units (e.g., millimeters vs. degrees) do not dominate the distance calculations.
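A toy illustration of why this matters, using two made-up sensor columns on very different scales (values are illustrative, not from the dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Column 0: position in millimetres; column 1: joint angle in degrees
X = np.array([
    [1500.0,  10.0],
    [1520.0, 170.0],   # nearly same position, very different angle
    [1900.0,  12.0],   # very different position, nearly same angle
    [1880.0, 168.0],
])

# Raw distances: the millimetre axis swamps the angular difference
raw = (np.linalg.norm(X[0] - X[1]), np.linalg.norm(X[0] - X[2]))

# After standardization each feature has unit variance,
# so both sensors contribute comparably to the distance
Xs = StandardScaler().fit_transform(X)
scaled = (np.linalg.norm(Xs[0] - Xs[1]), np.linalg.norm(Xs[0] - Xs[2]))

print(raw)     # position difference dominates
print(scaled)  # the two distances are now comparable
```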
2. Fast Clustering
```python
from sklearn.cluster import MiniBatchKMeans

kmeans = MiniBatchKMeans(
    n_clusters=N_CLUSTERS,
    random_state=42,
    batch_size=4096,
    n_init=10,
)
kmeans.fit(X_pca)
```
I opted for MiniBatchKMeans over standard KMeans. It produces nearly identical results but operates in a fraction of the time and RAM, peaking at just 7.6 MB during training.
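The "nearly identical results" claim can be sanity-checked on synthetic data (blobs standing in for the PCA-reduced features; inertia, the within-cluster sum of squares, is the comparison metric):

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Well-separated synthetic clusters as a stand-in for the real features
X, _ = make_blobs(n_samples=20_000, centers=15, n_features=20,
                  random_state=42)

full = KMeans(n_clusters=15, n_init=10, random_state=42).fit(X)
mini = MiniBatchKMeans(n_clusters=15, batch_size=4096, n_init=10,
                       random_state=42).fit(X)

# Inertia should be close for both; MiniBatchKMeans trades a tiny
# amount of solution quality for large savings in time and memory
print(full.inertia_, mini.inertia_)
```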
3. The Hungarian Algorithm (Linear Sum Assignment)
```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(true_labels, cluster_labels, n_clusters):
    # Cost matrix: entry [i, j] = samples in cluster i with true label j
    cost = np.zeros((n_clusters, n_clusters), dtype=np.int64)
    for c, t in zip(cluster_labels, true_labels):
        cost[c, t] += 1
    # linear_sum_assignment minimizes cost, so negate to maximize overlap
    rows, cols = linear_sum_assignment(-cost)
    # Optimal 1-to-1 mapping: cluster ID -> true label
    return dict(zip(rows, cols))
```
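A self-contained sketch of the full match-then-score step, using synthetic labels in place of the real ones (the 20% noise level and random permutation are illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)
n_clusters = 15

# Synthetic ground truth; cluster IDs are a noisy permutation of it,
# mimicking arbitrary K-Means cluster numbering
true = rng.integers(0, n_clusters, size=5000)
perm = rng.permutation(n_clusters)
pred = perm[true].copy()
noise = rng.random(5000) < 0.2
pred[noise] = rng.integers(0, n_clusters, size=noise.sum())

# Overlap matrix: entry [i, j] = samples with cluster i and true label j
overlap = np.zeros((n_clusters, n_clusters), dtype=np.int64)
np.add.at(overlap, (pred, true), 1)

# Maximize total overlap by minimizing its negation
rows, cols = linear_sum_assignment(-overlap)
mapping = dict(zip(rows, cols))

# Remap cluster IDs to their matched labels, then score as usual
remapped = np.array([mapping[c] for c in pred])
print(f1_score(true, remapped, average="weighted"))
```

Without this matching step the F1-Score would be meaningless, since cluster 3 might correspond to action 11; the Hungarian assignment recovers the best possible relabeling before scoring.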