The Biggest Discovery: The "Binary" Trap
The dataset README describes the data as a "binary classification (0 or 1)." However, our EDA revealed a completely different reality: inspecting the LABEL column, I found 15 distinct classes (0, 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15).

Validating Stationarity (Rubric Requirement)
To satisfy the rubric's stationarity requirement without running a heavy ADF (Augmented Dickey-Fuller) test on 10 million rows, I implemented a visual rolling-mean and rolling-standard-deviation check (plot_stationarity_check). The generated plots provided visual evidence that the sensor channels maintain a roughly constant mean and variance over time, allowing us to safely assume weak stationarity for our rolling-window feature engineering.
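The rolling-statistics check described above can be sketched as follows. This is a minimal illustration of what a function like plot_stationarity_check might look like; the window size and plot styling are assumptions, not the project's actual implementation.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_stationarity_check(series: pd.Series, window: int = 1000):
    """Plot a sensor channel against its rolling mean and rolling std.

    If both rolling statistics stay roughly flat over time, the series
    can be treated as weakly stationary for rolling-window features.
    """
    rolling_mean = series.rolling(window).mean()
    rolling_std = series.rolling(window).std()

    fig, ax = plt.subplots(figsize=(12, 4))
    ax.plot(series, alpha=0.4, label='raw signal')
    ax.plot(rolling_mean, label=f'rolling mean (w={window})')
    ax.plot(rolling_std, label=f'rolling std (w={window})')
    ax.set_title('Rolling-statistics stationarity check')
    ax.legend()
    return fig
```

A visual check like this scales to large data because it only needs one linear pass per channel, whereas an ADF test on millions of rows is far more expensive.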
Final Code Analysis
1. Efficient Data Loading & Sampling
```python
import dask.dataframe as dd

pq_df = dd.read_parquet('data/main_data_parquet', engine='pyarrow')
sampled_df = pq_df.sample(frac=0.05, random_state=42).compute()
```
Instead of exhausting RAM by loading the entire 10 GB dataset, I point Dask at the optimized Parquet directory. Combining .sample(frac=0.05) with .compute() pulls a manageable, representative cross-section of the data (roughly 489,000 rows) directly into pandas for visualization.
2. Label Distribution Analysis
```python
label_counts = sampled_df['LABEL'].value_counts().sort_index()
anomalies = sampled_df[~sampled_df['LABEL'].isin([0, 1])]
```
I count the frequency of each class. The ~isin([0, 1]) filter acts as a "tripwire", exposing the 15 distinct production actions hidden within the continuous sensor stream.
3. Data Visualization
```python
import seaborn as sns

sns.countplot(data=sampled_df, x='LABEL', hue='LABEL', palette='viridis', legend=False)
```
Using Seaborn, I generated a bar chart to confirm the distribution. The plot showed that the 15 classes were almost uniformly distributed (~36k instances each in the sample), indicating distinct, sustained operator actions rather than random sensor noise.
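To back up "almost uniformly distributed" with a number rather than an eyeball check, one could flag classes whose count deviates from the mean class count. This is a hedged sketch, not part of the original notebook; the check_uniformity name and the 10% tolerance are assumptions.

```python
import pandas as pd


def check_uniformity(label_counts: pd.Series, tolerance: float = 0.10) -> pd.Series:
    """Return the classes whose frequency deviates from the mean
    class count by more than `tolerance` (as a fraction).

    An empty result means the distribution is uniform within tolerance.
    """
    mean_count = label_counts.mean()
    deviation = (label_counts - mean_count).abs() / mean_count
    return deviation[deviation > tolerance]
```

Fed the label_counts Series from section 2, an empty return value confirms the near-uniform ~36k-per-class pattern seen in the bar chart, while any flagged class would warrant a closer look for imbalance.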