Error 1 (ValueError)
Real-world datasets are never perfectly clean. While the sensors were recording, they occasionally shut down and restarted, so the column headers that should have appeared only at the top were written again and again in the middle of the file. On top of that, invisible end-of-file (EOF) characters such as \x1a were hidden inside the file. When standard reading methods tried to convert these strings to numbers (floats), the parser understandably threw a ValueError and crashed.
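To see why a plain float cast fails on this kind of debris, here is a minimal reproduction with made-up tokens (a header fragment and an EOF control character; the values are illustrative, not from the real logs):

for token in ['12.5', 'sensor_id', '\x1a']:   # 'sensor_id' is an invented header fragment
    try:
        print(float(token))
    except ValueError as exc:
        print('ValueError:', exc)   # both the header token and \x1a land here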
Error 2 (KilledWorker)
To get past the error above, we told Dask: “Read anything you cannot convert temporarily as text (string/object).” But in Python, text data takes up far more memory than numeric data. To stay within the limits of our 8-core CPU and 32 GB of RAM, we set a 3 GB memory limit per worker. However, when 128 MB data blocks were read as text, they suddenly bloated past the 3 GB limit. To protect itself, Dask began killing workers (the KilledWorker error).
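The inflation is easy to observe on synthetic data: the same one million values take a few megabytes as floats but several times more once stored as Python strings (the numbers printed below are approximate and machine-dependent):

import numpy as np
import pandas as pd

values = pd.Series(np.random.rand(1_000_000))          # float64: about 8 MB
as_text = values.astype(str)                           # object dtype: one Python string per cell
print(values.memory_usage(deep=True) / 2**20, 'MB')
print(as_text.memory_usage(deep=True) / 2**20, 'MB')   # several times larger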
Error 3 (Expected 134 fields... saw 225)
Just when we thought we had solved everything, we discovered that some sensor logs had been concatenated, so a row that should have contained 134 columns instead contained 225.

1. Client Setup
from dask.distributed import Client

# 8 worker processes, 2 threads each, hard-capped at 3 GB of memory apiece
client = Client(n_workers=8, threads_per_worker=2, memory_limit='3GB')
First, we put all 8 physical cores to work as separate worker processes, each with two threads. We cap each worker at 3 GB of memory, so the workers together use at most 24 GB, leaving a comfortable 8 GB for the operating system.
2. Lazy but Safe Data Reading
import dask.dataframe as dd

# Path elided; dtype='object' disables type inference, malformed rows are skipped
df = dd.read_csv(..., blocksize='64MB', dtype='object', on_bad_lines='skip')
We tell Dask to skip its usual type inference and read everything as plain text (object). To keep RAM from ballooning, we shrink the block size from 128 MB to 64 MB. We also pass on_bad_lines='skip' so that malformed rows with too many or too few fields are discarded immediately, as the toy example below shows.
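The same option exists in plain pandas, which makes the behavior easy to demonstrate on a toy CSV (the data here is invented purely for illustration):

import io
import pandas as pd

raw = 'a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n'   # second data row has one field too many
df_demo = pd.read_csv(io.StringIO(raw), on_bad_lines='skip')
print(df_demo)   # the 4-field row is dropped; the rest parse normally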
3. Cleaning (map_partitions)
After the data enters RAM in small, safe 64 MB partitions, we use the CPU to pass each block through our custom cleaning function, clean_and_cast_partition.
The heart of this function is pd.to_numeric(..., errors='coerce'). In plain terms: “Convert strings to numbers aggressively. If you encounter nonsense text like 'LABEL' or '\x1a', do not crash the program; write a missing/unknown value (NaN) there instead.” A df_part.dropna() call then removes those “empty” (NaN) rows in one shot, and the surviving columns are downcast to compact types (float32 and int8).
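Putting the three steps together, a minimal sketch of clean_and_cast_partition might look like the following (only the function name comes from our pipeline; the blanket float32 cast stands in for the real per-column float32/int8 mapping):

import pandas as pd

def clean_and_cast_partition(df_part: pd.DataFrame) -> pd.DataFrame:
    # Coerce every column to numeric; junk like 'LABEL' or '\x1a' becomes NaN
    df_part = df_part.apply(pd.to_numeric, errors='coerce')
    # Drop the rows that turned out to be repeated headers or EOF debris
    df_part = df_part.dropna()
    # Downcast to compact types (the real pipeline also used int8 for some columns)
    return df_part.astype('float32')

df_cleaned = df.map_partitions(clean_and_cast_partition)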
4. Use of Parquet
df_cleaned.to_parquet('data/main_data_parquet', ...)
These cleaned blocks are written to disk immediately in Parquet format. Parquet is a modern columnar format that compresses what were massive CSV files extremely well and speeds up subsequent reads by a factor of 10 to 20.
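From here on, analysis can start from the Parquet copy instead of the raw CSVs. A column-pruned read looks like this (the column names are placeholders, not the real sensor fields):

import dask.dataframe as dd

# Only the requested columns are read from disk, a key Parquet advantage
df_fast = dd.read_parquet('data/main_data_parquet',
                          columns=['sensor_a', 'sensor_b'])   # hypothetical names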