Error 1 (ValueError)
Real-world datasets are never perfectly clean. While the sensors were recording, they occasionally shut down and restarted, so the column headers that should have appeared only at the top were written again and again in the middle of the file. On top of that, invisible end-of-file (EOF) characters such as \x1a were hidden inside the file. When standard reading methods tried to convert these strings to numbers (floats), the parser understandably threw a ValueError and crashed.
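To see why a plain float cast fails on this kind of debris, here is a minimal reproduction with made-up tokens (a header fragment and an EOF control character; the values are illustrative, not from the real logs):

for token in ['12.5', 'sensor_id', '\x1a']:   # 'sensor_id' is an invented header fragment
    try:
        print(float(token))
    except ValueError as exc:
        print('ValueError:', exc)   # both the header token and \x1a land here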
Error 2 (KilledWorker)
To get past the error above, we told Dask: “Read anything you cannot convert temporarily as text (string/object).” But in Python, text data takes up far more memory than numeric data. To stay within the limits of our 8-core CPU and 32 GB of RAM, we set a 3 GB memory limit per worker. However, when 128 MB data blocks were read as text, they suddenly bloated past the 3 GB limit. To protect itself, Dask began killing workers (the KilledWorker error).
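The inflation is easy to observe on synthetic data: the same one million values take a few megabytes as floats but several times more once stored as Python strings (the numbers printed below are approximate and machine-dependent):

import numpy as np
import pandas as pd

values = pd.Series(np.random.rand(1_000_000))          # float64: about 8 MB
as_text = values.astype(str)                           # object dtype: one Python string per cell
print(values.memory_usage(deep=True) / 2**20, 'MB')
print(as_text.memory_usage(deep=True) / 2**20, 'MB')   # several times larger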
Error 3 (Expected 134 fields... saw 225)
Just when we thought we had solved everything, we discovered that some sensor logs had been concatenated, so a row that should have contained 134 columns instead contained 225.

1. Client Setup
from dask.distributed import Client

# 8 worker processes, 2 threads each, hard-capped at 3 GB of memory apiece
client = Client(n_workers=8, threads_per_worker=2, memory_limit='3GB')
First, we put all 8 physical cores to work as separate worker processes, each with two threads. We cap each worker at 3 GB of memory, so the workers together use at most 24 GB, leaving a comfortable 8 GB for the operating system.
2. Lazy but Safe Data Reading
import dask.dataframe as dd

# Path elided; dtype='object' disables type inference, malformed rows are skipped
df = dd.read_csv(..., blocksize='64MB', dtype='object', on_bad_lines='skip')
We tell Dask to skip its usual type inference and read everything as plain text (object). To keep RAM from ballooning, we shrink the block size from 128 MB to 64 MB. We also pass on_bad_lines='skip' so that malformed rows with too many or too few fields are discarded immediately, as the toy example below shows.
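The same option exists in plain pandas, which makes the behavior easy to demonstrate on a toy CSV (the data here is invented purely for illustration):

import io
import pandas as pd

raw = 'a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n'   # second data row has one field too many
df_demo = pd.read_csv(io.StringIO(raw), on_bad_lines='skip')
print(df_demo)   # the 4-field row is dropped; the rest parse normally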
3. Cleaning (map_partitions)
After the data enters RAM in small, safe 64 MB partitions, we use the CPU to pass each block through our custom cleaning function, clean_and_cast_partition.
The heart of this function is pd.to_numeric(..., errors='coerce'). In plain terms: “Convert strings to numbers aggressively. If you encounter nonsense text like 'LABEL' or '\x1a', do not crash the program; write a missing/unknown value (NaN) there instead.” A df_part.dropna() call then removes those “empty” (NaN) rows in one shot, and the surviving columns are downcast to compact types (float32 and int8).
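Putting the three steps together, a minimal sketch of clean_and_cast_partition might look like the following (only the function name comes from our pipeline; the blanket float32 cast stands in for the real per-column float32/int8 mapping):

import pandas as pd

def clean_and_cast_partition(df_part: pd.DataFrame) -> pd.DataFrame:
    # Coerce every column to numeric; junk like 'LABEL' or '\x1a' becomes NaN
    df_part = df_part.apply(pd.to_numeric, errors='coerce')
    # Drop the rows that turned out to be repeated headers or EOF debris
    df_part = df_part.dropna()
    # Downcast to compact types (the real pipeline also used int8 for some columns)
    return df_part.astype('float32')

df_cleaned = df.map_partitions(clean_and_cast_partition)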
4. Use of Parquet
df_cleaned.to_parquet('data/main_data_parquet', ...)
These cleaned blocks are written to disk immediately in Parquet format. Parquet is a modern columnar format that compresses what were massive CSV files extremely well and speeds up subsequent reads by a factor of 10 to 20.
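From here on, analysis can start from the Parquet copy instead of the raw CSVs. A column-pruned read looks like this (the column names are placeholders, not the real sensor fields):

import dask.dataframe as dd

# Only the requested columns are read from disk, a key Parquet advantage
df_fast = dd.read_parquet('data/main_data_parquet',
                          columns=['sensor_a', 'sensor_b'])   # hypothetical names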