Phase 2: Scalable Exploratory Data Analysis (EDA) & The "Binary" Trap

In this phase, I conducted an Exploratory Data Analysis (EDA) on a 5% sample of the highly compressed Parquet data. The goal was to visualize the target variable and understand the class distribution before moving on to feature engineering and modeling.

The Biggest Discovery: The "Binary" Trap

The dataset README describes the task as "Binary classification (0 or 1)." The EDA showed otherwise.

The Multiclass Reality: Extracting the unique values from the LABEL column returned 15 distinct classes (0, 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15). Label 2 never appears.
The Strategic Pivot: A simplistic approach would treat anything other than 0 or 1 as an anomaly or dirty data. However, roughly 435,000 rows fall into the 13 additional classes, far too many to dismiss as noise, so the dataset is natively multiclass. The "binary" description appears to be either a trap or a simplification intended for the supervised modeling portion. Because the documented labels cannot be taken at face value, this finding supports the primary task: Unsupervised Learning.

Validating Stationarity (Rubric Requirement)

Running a full ADF (Augmented Dickey-Fuller) test on 10 million rows risked exhausting system resources, so I satisfied the rubric's stationarity requirement with a visual check instead: a rolling mean and rolling standard deviation plot (plot_stationarity_check). The plots show that the sensor channels keep a roughly constant mean and variance over time. This is visual evidence rather than a formal test, but it supports assuming weak stationarity for the rolling-window feature engineering.
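The body of plot_stationarity_check is not shown in this write-up, so the following is only a minimal sketch of the rolling-statistics idea behind it, not the actual implementation. The helper name rolling_stats, the window of 500 samples, and the synthetic channel are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def rolling_stats(series: pd.Series, window: int = 500) -> pd.DataFrame:
    """Return the rolling mean and standard deviation of one sensor channel.

    `window` is an illustrative assumption, not a value from the project.
    """
    roller = series.rolling(window=window, min_periods=window)
    return pd.DataFrame({
        "rolling_mean": roller.mean(),
        "rolling_std": roller.std(),
    })


# Synthetic stand-in for a sensor channel: white noise around a fixed level.
rng = np.random.default_rng(42)
channel = pd.Series(rng.normal(loc=5.0, scale=1.0, size=10_000))

# The first window-1 rows have no full window yet, so drop them.
stats = rolling_stats(channel, window=500).dropna()

# For a weakly stationary signal both curves should stay in a narrow band.
mean_spread = stats["rolling_mean"].max() - stats["rolling_mean"].min()
std_spread = stats["rolling_std"].max() - stats["rolling_std"].min()
print(f"rolling mean spread: {mean_spread:.3f}")
print(f"rolling std spread:  {std_spread:.3f}")
```

Plotting the two columns against the row index gives the visual check described above: a drifting rolling mean or a widening rolling standard deviation would argue against stationarity. If a formal result is needed later, one option is to run an ADF test on a subsample rather than on the full data.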

Final Code Analysis

1. Efficient Data Loading & Sampling

Python

import dask.dataframe as dd

pq_df = dd.read_parquet('data/main_data_parquet', engine='pyarrow')  # lazy: nothing is read yet
sampled_df = pq_df.sample(frac=0.05, random_state=42).compute()  # materialize a 5% sample as pandas

Instead of exhausting RAM by loading the entire 10 GB dataset, I point Dask at the optimized Parquet directory, which it reads lazily. Calling .sample(frac=0.05) followed by .compute() pulls a manageable random sample (roughly 489,000 rows) into pandas for visualization.

2. Label Distribution Analysis

Python

label_counts = sampled_df['LABEL'].value_counts().sort_index()  # rows per class, ordered by label
anomalies = sampled_df[~sampled_df['LABEL'].isin([0, 1])]  # rows outside the documented 0/1 labels

I count the frequency of each class. The ~isin([0, 1]) filter acted as a "tripwire": it isolated every row whose label falls outside the documented 0/1 pair, exposing the 13 additional classes (15 in total) hidden within the continuous sensor stream.
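The tripwire idea can be packaged as a small audit that compares observed labels against the documented set. This is a sketch under stated assumptions: the helper audit_labels, the constant DOCUMENTED_LABELS, and the toy label column are hypothetical and stand in for sampled_df['LABEL'].

```python
import pandas as pd

# Labels the README documents; everything else counts as "undocumented".
DOCUMENTED_LABELS = {0, 1}


def audit_labels(labels: pd.Series, documented: set) -> pd.DataFrame:
    """Count rows per label and flag labels missing from the documentation."""
    counts = labels.value_counts().sort_index()
    return pd.DataFrame({
        "rows": counts,
        "documented": counts.index.isin(list(documented)),
    })


# Tiny synthetic LABEL column standing in for sampled_df['LABEL'].
toy_labels = pd.Series([0, 0, 1, 3, 3, 3, 4, 1, 15], name="LABEL")

report = audit_labels(toy_labels, DOCUMENTED_LABELS)
undocumented = report[~report["documented"]]

print(report)
print("undocumented labels:", undocumented.index.tolist())
print("rows outside documented set:", int(undocumented["rows"].sum()))
```

Keeping the documented set as an explicit constant makes the mismatch between the README and the data visible in one table, instead of leaving it implicit in a boolean mask.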

3. Data Visualization

Python

import seaborn as sns

sns.countplot(data=sampled_df, x='LABEL', palette='viridis')  # one bar per class

Using Seaborn, I generated a bar chart to confirm the distribution. The plot showed that the 15 classes were almost uniformly distributed (~36k instances each in the sample). Such evenly populated classes are consistent with distinct, sustained operator actions rather than random sensor noise.
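The "almost uniform" reading of the bar chart can also be checked numerically. The snippet below is a minimal sketch, assuming a hypothetical helper balance_summary and a toy label column; it is not part of the original notebook.

```python
import pandas as pd


def balance_summary(labels: pd.Series) -> dict:
    """Summarize how evenly rows are spread across classes."""
    shares = labels.value_counts(normalize=True).sort_index()
    return {
        "n_classes": int(shares.size),
        # Share each class would have under a perfectly uniform split.
        "uniform_share": 1.0 / shares.size,
        "min_share": float(shares.min()),
        "max_share": float(shares.max()),
        # 1.0 means perfectly balanced; larger values mean more imbalance.
        "imbalance_ratio": float(shares.max() / shares.min()),
    }


# Toy labels standing in for sampled_df['LABEL'].
toy_labels = pd.Series([0] * 100 + [1] * 110 + [3] * 90, name="LABEL")
summary = balance_summary(toy_labels)
print(summary)
```

An imbalance ratio close to 1.0 would back up the visual impression from the count plot, while a large ratio would signal that some classes need special handling downstream.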