In this phase, I processed the data using Dask with an out-of-core approach (processing data that does not fit in RAM by reading it from disk in chunks).

The 3 Biggest Challenges I Faced (And How I Overcame Them)

Final Code Analysis

1. Client Setup

from dask.distributed import Client

client = Client(n_workers=8, threads_per_worker=2, memory_limit='3GB')

First, we spin up 8 worker processes, each running 2 threads (16 threads in total). Each worker is capped at 3 GB of memory, so the cluster uses at most 24 GB, leaving a comfortable 8 GB for the operating system.
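Before going further, it is worth a quick sanity check that the cluster actually came up with this budget. A minimal sketch using the standard dask.distributed client handle:

# Print a summary of workers, threads, and memory limits.
print(client)

# The dashboard URL lets you watch per-worker memory live while jobs run.
print(client.dashboard_link)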

2. Lazy but Safe Data Reading

import dask.dataframe as dd

df = dd.read_csv(..., blocksize='64MB', dtype='object', on_bad_lines='skip')

We tell Dask to skip any type inference "intelligence" and read every column as plain text (object); guessing dtypes from a sample of a messy file is a common source of mid-computation failures. To keep RAM from ballooning, we also shrink the partition size from 128 MB to 64 MB. Finally, on_bad_lines='skip' discards malformed rows with too many or too few fields on the spot.
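Filled out, the call looks roughly like this; the file path is a hypothetical placeholder, while blocksize, dtype, and on_bad_lines are the exact options named above:

import dask.dataframe as dd

# 'data/main_data.csv' is a placeholder path, not the author's actual file.
df = dd.read_csv(
    'data/main_data.csv',
    blocksize='64MB',     # cut the file into ~64 MB partitions
    dtype='object',       # no type inference: every column comes in as text
    on_bad_lines='skip',  # drop rows with the wrong number of fields
)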

3. Cleaning (map_partitions)

Once the data enters RAM in small, safe 64 MB partitions, we use map_partitions to push each block through our custom cleaning function, clean_and_cast_partition (sketched below).
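The post does not show the body of clean_and_cast_partition, so the following is only a sketch of the map_partitions pattern; the column names and casts inside it are illustrative assumptions:

import pandas as pd

def clean_and_cast_partition(part: pd.DataFrame) -> pd.DataFrame:
    # Placeholder logic: the real function's body is not shown in the post,
    # so these columns ('id', 'amount', 'date') and casts are assumptions.
    part = part.dropna(subset=['id'])                                # drop rows missing a key field
    part['amount'] = pd.to_numeric(part['amount'], errors='coerce')  # text -> float, bad values -> NaN
    part['date'] = pd.to_datetime(part['date'], errors='coerce')     # text -> datetime
    return part

# Dask applies the function lazily, one pandas DataFrame per 64 MB partition.
df_cleaned = df.map_partitions(clean_and_cast_partition)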

4. Use of Parquet

df_cleaned.to_parquet('data/main_data_parquet', ...)

These cleaned blocks are written to disk immediately in Parquet format. Parquet is a modern columnar format: it stores the same data in a fraction of the space of a massive CSV and improves subsequent read speed by 10 to 20 times.
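A minimal sketch of the write and the payoff on the next read; the engine and compression settings here are assumptions, not the author's confirmed choices:

# Write each cleaned partition straight to compressed Parquet on disk.
df_cleaned.to_parquet(
    'data/main_data_parquet',
    engine='pyarrow',      # assumption: pyarrow backend
    compression='snappy',  # assumption: snappy codec, the common default
    write_index=False,
)

# Later analyses read the columnar files instead of re-parsing the raw CSV.
df_fast = dd.read_parquet('data/main_data_parquet')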