This project was created as a midterm for Sakarya University’s Data Visualization course (ISE314).
This document outlines the end-to-end development of the data visualization project across two phases: Data Generation & Preprocessing (assignment phase) and Data Modeling & Visualization (project phase).
The project's raw dataset is generated procedurally by a Python script. Its primary purpose is to introduce real-world data quality issues on purpose (anomalies, missing values, and formatting inconsistencies) so that the dataset meets the assignment's preprocessing requirements.
The script uses several helper functions to produce intentionally messy fields:
rand_date(): Randomly selects among three date formats (dd/mm/yyyy, mm-dd-yyyy, and ISO yyyy-mm-dd) to simulate multi-region locale issues.
rand_price(): Generates monetary values using a randomized mix of plain numbers, dollar signs ($), euro signs (€), and Turkish comma-decimal formatting.
rand_customer(): Concatenates the customer’s name, dynamically generated email, and phone number into a single pipe-separated (|) string, so that the data requires split-column operations in Power Query.
rand_address(): Combines street, city, and country into a single comma-separated string, intentionally using inconsistent country names and codes (e.g., “Turkey”, “TR”, “Türkiye”).

Product reference data (products_ref.json)
Contains product_id, product_name, category, and randomized stock levels.

Raw orders data (orders_raw.csv)
Uses the statuses_dirty list to inject spelling errors (e.g., “Deliverd”, “Shiped”) and case inconsistencies (e.g., “PROCESSING”).
Injects anomalous values into quantity and invalid values (0 and 6) into the 1–5 customer_rating scale.
Uses random.random() probability thresholds to insert None (NULL) values into key columns such as customer_info, product_id, and shipping_address.
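The helpers and the NULL-injection step described above could look roughly like the sketch below. Only the function names rand_date(), rand_price(), and rand_customer() come from the project. Their bodies, the date range, the price range, the sample names, the example.com email domain, and the maybe_null() helper with its 5% default probability are all assumptions made for illustration, not the script's actual code.

```python
import random
from datetime import date, timedelta

# Assumed date window; the range used by the real script is not documented.
START = date(2023, 1, 1)
SPAN_DAYS = 365


def rand_date(rng: random.Random) -> str:
    """Render a random date in one of the three formats named above."""
    day = START + timedelta(days=rng.randrange(SPAN_DAYS))
    fmt = rng.choice(["%d/%m/%Y", "%m-%d-%Y", "%Y-%m-%d"])
    return day.strftime(fmt)


def rand_price(rng: random.Random) -> str:
    """Render a random price as plain, $-prefixed, €-prefixed, or comma-decimal."""
    value = round(rng.uniform(5, 500), 2)  # assumed price range
    style = rng.choice(["plain", "dollar", "euro", "comma"])
    if style == "dollar":
        return f"${value:.2f}"
    if style == "euro":
        return f"€{value:.2f}"
    if style == "comma":
        return f"{value:.2f}".replace(".", ",")
    return f"{value:.2f}"


def rand_customer(rng: random.Random) -> str:
    """Join name, email, and phone into one pipe-separated string."""
    first = rng.choice(["Ayse", "Mehmet", "Elif"])   # hypothetical sample names
    last = rng.choice(["Yilmaz", "Kaya", "Demir"])
    email = f"{first}.{last}@example.com".lower()
    phone = f"+90 5{rng.randrange(10**9):09d}"
    return f"{first} {last}|{email}|{phone}"


def maybe_null(value, rng: random.Random, p: float = 0.05):
    """Hypothetical helper: return None with probability p, else the value."""
    return None if rng.random() < p else value


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the messy output is reproducible
    for _ in range(3):
        print(rand_date(rng), rand_price(rng), maybe_null(rand_customer(rng), rng))
```

Passing an explicit random.Random instance instead of calling the module-level functions keeps the generated mess reproducible from a seed, which makes it easier to verify the Power Query cleaning steps against a known input.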