ISE314 Data Visualization – Project Summary

This project was created as a midterm for Sakarya University’s Data Visualization course (ISE314).

This document outlines the end-to-end development of the data visualization project in two stages: Data Generation & Preprocessing (assignment phase) and Data Modeling & Visualization (project phase).

Phase 1: Procedural Dirty Data Generation

This Python script procedurally generates the project’s raw dataset. Its primary purpose is to intentionally introduce real-world data quality issues—such as anomalies, missing values, and formatting inconsistencies—to meet the assignment’s preprocessing requirements.

1. Helper Functions (Intentional Obfuscation)

The script uses several helper functions to produce intentionally messy fields:

rand_date(): Randomly selects among three date formats (dd/mm/yyyy, mm-dd-yyyy, and ISO yyyy-mm-dd) to simulate multi-region locale issues.
rand_price(): Generates monetary values using a randomized mix of plain numbers, dollar signs ($), euro signs (€), and Turkish comma-decimal formatting.
rand_customer(): Concatenates the customer’s name, dynamically generated email, and phone number into a single pipe-separated (|) string to require split-column operations in Power Query.
rand_address(): Combines street, city, and country into a single comma-separated string, intentionally using inconsistent country labels (e.g., “Turkey”, “TR”, “Türkiye”).
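
The four helpers described above could look roughly like the following. This is a minimal sketch, not the project's actual script: the sample name, street, and city lists, the 2024 date range, the price range, the `example.com` email domain, the `+90` phone prefix, and the optional `rng` parameter are all assumptions added for illustration.

```python
import random
from datetime import date, timedelta

# Hypothetical sample values; the real script's lists are not shown in this summary.
FIRST_NAMES = ["Ayse", "Mehmet", "Elif", "John"]
LAST_NAMES = ["Yilmaz", "Kaya", "Demir", "Smith"]
STREETS = ["Ataturk Cd.", "Main St.", "Cumhuriyet Sk."]
CITIES = ["Sakarya", "Istanbul", "Ankara"]


def rand_date(rng=random):
    """Return a random 2024 date in one of three competing formats."""
    d = date(2024, 1, 1) + timedelta(days=rng.randrange(366))
    fmt = rng.choice(["%d/%m/%Y", "%m-%d-%Y", "%Y-%m-%d"])
    return d.strftime(fmt)


def rand_price(rng=random):
    """Return a price as plain text, with $ or €, or with a comma decimal."""
    value = f"{rng.uniform(5, 500):.2f}"
    style = rng.choice(["plain", "dollar", "euro", "comma"])
    if style == "dollar":
        return "$" + value
    if style == "euro":
        return "€" + value
    if style == "comma":
        return value.replace(".", ",")  # Turkish-style decimal separator
    return value


def rand_customer(rng=random):
    """Return 'name|email|phone' packed into one pipe-separated string."""
    first, last = rng.choice(FIRST_NAMES), rng.choice(LAST_NAMES)
    email = f"{first.lower()}.{last.lower()}@example.com"
    phone = "+90" + "".join(rng.choice("0123456789") for _ in range(10))
    return f"{first} {last}|{email}|{phone}"


def rand_address(rng=random):
    """Return 'street, city, country' with an inconsistent country label."""
    street = f"{rng.randint(1, 200)} {rng.choice(STREETS)}"
    country = rng.choice(["Turkey", "TR", "Türkiye"])
    return f"{street}, {rng.choice(CITIES)}, {country}"
```

Passing a seeded `random.Random` instance as `rng` makes the dirty output reproducible, which helps when the same raw file needs to be regenerated for grading.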

2. Dimension Table (`products_ref.json`)

Generates a clean reference array containing 20 unique products.
Includes standardized product_id, product_name, category, and randomized stock levels.
Exports the data as JSON to satisfy the requirement of using a non-Excel, non-SQL data source.
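
As an illustration of this step, the sketch below builds 20 clean product records and writes them to JSON. The ID pattern (`P001`), the placeholder product names, the category list, the `stock` field name, and the stock range are assumptions; only the four fields and the record count come from the description above.

```python
import json
import random

# Hypothetical categories; the real list is not given in this summary.
CATEGORIES = ["Electronics", "Home", "Office", "Toys"]


def build_products(n=20, rng=random):
    """Build n clean product records with unique, standardized IDs."""
    return [
        {
            "product_id": f"P{i:03d}",
            "product_name": f"Product {i}",
            "category": rng.choice(CATEGORIES),
            "stock": rng.randint(0, 500),  # randomized stock level
        }
        for i in range(1, n + 1)
    ]


def export_products(products, path="products_ref.json"):
    """Write the reference array to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(products, f, ensure_ascii=False, indent=2)
```

Calling `export_products(build_products())` would produce the JSON dimension file in the working directory.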

3. Fact Table (`orders_raw.csv`)

Anomaly injection: Creates 105 order records while deliberately injecting structural and logical data issues.
Typos & casing: Uses a statuses_dirty list to inject spelling errors (e.g., “Deliverd”, “Shiped”) and case inconsistencies (e.g., “PROCESSING”).
Out-of-bounds data: Assigns negative values to quantity and invalid values (0 and 6) to the 1–5 customer_rating scale.
Missing values: Uses random.random() probability thresholds to insert None (NULL) values into key columns such as customer_info, product_id, and shipping_address.
Format fragmentation: Stores discount rates as a mix of percentages (e.g., “10%”) and raw decimals (e.g., “0.10”).
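
The injection techniques listed above can be sketched as follows. This is an illustrative reconstruction under assumptions, not the original script: the contents of `STATUSES_DIRTY`, the 5% null probability, the `order_id` column, the value pools, and the `maybe_null` helper are hypothetical, and only a subset of the columns is shown.

```python
import csv
import random

# Hypothetical dirty status list mixing correct values, typos, and odd casing.
STATUSES_DIRTY = ["Delivered", "Deliverd", "Shipped", "Shiped",
                  "PROCESSING", "processing"]


def maybe_null(value, p, rng):
    """Return None with probability p, otherwise the value unchanged."""
    return None if rng.random() < p else value


def build_orders(n=105, rng=random):
    """Build n order rows with deliberately injected data quality issues."""
    rows = []
    for i in range(1, n + 1):
        rows.append({
            "order_id": i,
            # Missing values: occasionally drop the product reference.
            "product_id": maybe_null(f"P{rng.randint(1, 20):03d}", 0.05, rng),
            # Typos and casing: sample from the dirty status list.
            "status": rng.choice(STATUSES_DIRTY),
            # Out-of-bounds data: negative quantities and ratings 0 or 6.
            "quantity": rng.choice([-2, -1, 1, 2, 3, 5]),
            "customer_rating": rng.choice([0, 1, 2, 3, 4, 5, 6]),
            # Format fragmentation: percentages mixed with raw decimals.
            "discount": rng.choice(["10%", "0.10", "5%", "0.05"]),
        })
    return rows


def export_orders(rows, path="orders_raw.csv"):
    """Write rows to CSV; csv.DictWriter emits None as an empty field."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Because `None` is written as an empty field, the missing values surface as blanks in the CSV, which Power Query then loads as nulls or empty strings depending on the column type.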
