
2024-04-14
Data Exploration
Advanced EDA and strategic preprocessing on a global medical dataset to optimize predictive risk classification.
Case Study: Strategic Data Exploration for COVID-19 Survival Analytics
In this project, I acted as a Data Analyst consultant for Nexoid. The goal was to perform deep data exploration and preparation on a specialized medical dataset—sourced from the "Nexoid COVID-19 Survival Calculator"—to enable high-precision predictive modeling for infection and mortality risks.
The core challenge was identifying hidden data-quality issues and structural anomalies across 36 diverse variables spanning demographics, health conditions, and behavioral patterns.
🎯 Project Mission: Ensuring Data Integrity
My mission was to move beyond surface-level cleaning and perform a comprehensive audit of the dataset's foundation:
- Quality Diagnosis: Identify skewness, outliers, and errors using statistical measures and visualization.
- Structural Correction: Align mismatched data types with technical specifications and recover misclassified information.
- Feature Selection: Use correlation analysis to reduce redundancy while maximizing the model's predictive power.
🛠️ My Technical Approach
I applied critical reasoning to data anomalies to ensure the highest level of data integrity:
1. Recovering "Lost" Data (The NA Paradox)
During exploration, I discovered that the `region` column appeared to contain a massive amount of missing data. Upon investigation, I realized that pandas was auto-parsing the 'NA' (North America) region code as NaN (missing value). By cross-referencing with country data, I successfully recovered 4,227 records that would otherwise have been discarded.
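This 'NA' pitfall is easy to reproduce. The sketch below uses a tiny, made-up CSV (not the Nexoid data) to show how pandas' default null parsing swallows the 'NA' region code, and how `keep_default_na=False` preserves it:

```python
import io
import pandas as pd

# Hypothetical excerpt of the survey export: 'NA' here is a region code, not a null.
csv = io.StringIO(
    "country,region\n"
    "United States,NA\n"
    "Canada,NA\n"
    "Germany,EU\n"
)

# Default parsing silently converts the 'NA' region code into NaN.
naive = pd.read_csv(csv)
print(naive["region"].isna().sum())  # 2 rows now look "missing"

csv.seek(0)
# keep_default_na=False stops pandas from treating 'NA' (and similar strings) as null.
fixed = pd.read_csv(csv, keep_default_na=False)
print(fixed["region"].value_counts())
```

In practice you would pass an explicit `na_values` list alongside `keep_default_na=False` so genuinely missing markers are still recognized.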
2. Contextual Data Imputation
Instead of simple mean-filling, I applied domain-specific logic to handle missingness:
- Invalid Values: I identified instances of `-1` in the `alcohol` column. Rather than treating these as errors, I re-interpreted them as "non-drinkers" (0) based on the intake-level context, preserving the dataset's volume.
- Grouped Imputation: Missing values for `insurance` and `income` were imputed using the mode within each age group, ensuring the filled data remained demographically plausible.
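Both steps above can be sketched in a few lines of pandas. The column names and values here are illustrative stand-ins, not the actual Nexoid schema:

```python
import pandas as pd

df = pd.DataFrame({
    "age_group": ["0-18", "0-18", "19-35", "19-35", "19-35"],
    "income":    ["low",  None,   "med",   "med",   None],
    "alcohol":   [-1, 2, 0, -1, 3],
})

# Re-code the -1 sentinel as "non-drinker" (0 units of intake) instead of dropping it.
df["alcohol"] = df["alcohol"].replace(-1, 0)

# Group-wise mode imputation: fill each age group's gaps with that group's own mode.
def fill_with_group_mode(s: pd.Series) -> pd.Series:
    mode = s.mode()
    return s.fillna(mode.iloc[0]) if not mode.empty else s

df["income"] = df.groupby("age_group")["income"].transform(fill_with_group_mode)
```

Using `groupby(...).transform` keeps the result aligned with the original index, so the filled column drops straight back into the frame.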
3. Dimensionality Reduction & Feature Selection
I optimized the feature set to prevent multi-collinearity and reduce model noise:
- Redundancy Elimination: Correlation analysis revealed a near-perfect linear relationship between `weight` and `BMI`. I retained `BMI` as the more indicative health metric and removed `height` and `weight` to streamline the model.
- Hypothesis Testing: Using stacked bar charts, I demonstrated that smoking status had no strong statistical relationship with COVID-19-positive status in this dataset, leading to the strategic removal of the `smoking` variable.
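A minimal sketch of the redundancy check, on synthetic data (real heights and weights would show a weaker pairwise correlation; the near-perfect `weight`/`BMI` relationship is a property of this particular dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
height = rng.normal(170, 2, 500)   # cm; nearly constant here for illustration
weight = rng.normal(70, 12, 500)   # kg
df = pd.DataFrame({
    "height": height,
    "weight": weight,
    "bmi": weight / (height / 100) ** 2,
})

# Absolute correlations, upper triangle only, so each pair is counted once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Any column in a near-perfect (> 0.9) pair is a redundancy candidate.
redundant = [c for c in upper.columns if (upper[c] > 0.9).any()]

# Mirror the choice made above: keep BMI, drop its raw components.
reduced = df.drop(columns=["height", "weight"])
```

The 0.9 threshold is a common rule of thumb; the right cutoff depends on the downstream model's sensitivity to multicollinearity.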
📈 Results & Key Achievements
My exploration and preprocessing pipeline transformed a fragmented dataset into a high-quality, model-ready environment:
| Achievement | Impact |
|---|---|
| Data Recovery | Successfully restored 4,200+ region codes misidentified as null values. |
| Feature Engineering | Created New Age Groups (0-18, 19-35, 36-60, 60+) to better capture risk patterns. |
| Outlier Management | Implemented the Capping Method (IQR) to minimize the impact of extreme numerical values. |
| New Metrics | Designed a Health Condition Index by integrating multiple disease variables into a single scoring system. |
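The three transformation achievements above can be sketched together. The toy values and the additive form of the health index are assumptions for illustration; only the age bands come from the table:

```python
import pandas as pd

df = pd.DataFrame({
    "age":      [5, 25, 45, 70, 500],   # 500 is a deliberate data-entry outlier
    "diabetes": [0, 1, 0, 1, 1],
    "asthma":   [1, 0, 0, 1, 0],
})

# Outlier management via IQR capping: clip to [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df["age"] = df["age"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Age groups matching the bands in the table above.
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 35, 60, 200],
                         labels=["0-18", "19-35", "36-60", "60+"])

# A simple additive Health Condition Index over the disease indicators.
df["health_index"] = df[["diabetes", "asthma"]].sum(axis=1)
```

Capping (rather than dropping) outliers preserves every record while bounding the leverage of extreme values.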
The "Model-Ready" Outcome
By the end of this exploration, I delivered a final processed dataset where categorical variables were encoded and all data types were perfectly aligned. Based on these findings, I identified Classification as the most suitable data mining task to maximize future risk prediction accuracy.
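The final encoding step might look like the sketch below, with hypothetical column names standing in for the real schema:

```python
import pandas as pd

df = pd.DataFrame({
    "gender":         ["male", "female", "male"],
    "region":         ["NA", "EU", "NA"],
    "covid_positive": [1, 0, 1],
})

# One-hot encode the categorical predictors; drop_first avoids the dummy-variable trap.
encoded = pd.get_dummies(df, columns=["gender", "region"], drop_first=True)
```

After this pass every column is numeric or boolean, which is exactly the "model-ready" state a classifier expects.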
