Advanced EDA and strategic preprocessing on a global medical dataset to optimize predictive risk classification.

Case Study: Strategic Data Exploration for COVID-19 Survival Analytics

In this project, I acted as a Data Analyst consultant for Nexoid. The goal was to explore and prepare a specialized medical dataset, sourced from the "Nexoid COVID-19 Survival Calculator", so that it could support predictive modeling of infection and mortality risk.

The challenge was identifying hidden data quality issues and structural anomalies across 36 diverse variables spanning demographics, health conditions, and behavior patterns.

🎯 Project Mission: Ensuring Data Integrity

My mission was to move beyond surface-level cleaning and perform a comprehensive audit of the dataset's foundation:

Quality Diagnosis: Identify skewness, outliers, and errors using statistical measures and visualization.
Structural Correction: Align mismatched data types with technical specifications and recover misclassified information.
Feature Selection: Use correlation analysis to reduce redundancy while maximizing the model's predictive power.

🛠️ My Technical Approach

I applied critical reasoning to data anomalies to ensure the highest level of data integrity:

1. Recovering "Lost" Data (The NA Paradox)

During exploration, the region column appeared to have a large amount of missing data. On investigation, I found that pandas was parsing the region code 'NA' (North America) as NaN, because the string "NA" is one of its default missing-value markers. By cross-referencing with the country column, I recovered 4,227 records that would otherwise have been discarded.
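The effect is easy to reproduce on a toy file. The snippet below is an illustrative sketch, not the project's actual code: the column names (country, region), the country codes, and the sample rows are assumptions. It shows the default parsing, a load-time fix with `keep_default_na=False`, and an after-the-fact repair that cross-references the country column.

```python
import io
import pandas as pd

# Hypothetical miniature of the raw file; column names and values are assumptions.
raw_csv = "country,region\nUS,NA\nCA,NA\nFR,EU\nJP,AS\n"

# Default parsing: the string "NA" is in pandas' default missing-value list,
# so the North America code silently becomes NaN.
default_df = pd.read_csv(io.StringIO(raw_csv))
lost = int(default_df["region"].isna().sum())

# Option 1: fix it at load time by disabling the default list and
# declaring only the markers that really mean "missing" (here: empty string).
fixed_df = pd.read_csv(io.StringIO(raw_csv), keep_default_na=False, na_values=[""])
kept = int((fixed_df["region"] == "NA").sum())

# Option 2: repair an already-loaded frame by cross-referencing the country.
north_america = {"US", "CA"}  # hypothetical lookup set
mask = default_df["region"].isna() & default_df["country"].isin(north_america)
default_df.loc[mask, "region"] = "NA"

print(f"lost by default parsing: {lost}, kept with keep_default_na=False: {kept}")
```

Fixing the problem at load time is usually the safer choice, since it does not depend on the country lookup being complete.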

2. Contextual Data Imputation

Instead of simple mean-filling, I applied domain-specific logic to handle missingness:

Invalid Values: I identified instances of -1 in the alcohol column. Rather than treating these as errors, I re-interpreted them as "non-drinkers" (0) based on the intake level context, preserving the dataset's volume.
Grouped Imputation: Missing values for insurance and income were imputed using the mode for each age group, ensuring the filled data remained demographically plausible.
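Both rules can be sketched in a few lines of pandas. This is a minimal illustration under assumptions: the column names (age_group, alcohol, insurance, income), the category labels, and the helper `fill_with_group_mode` are hypothetical and not taken from the project.

```python
import pandas as pd

# Toy data; names and values are assumptions for illustration only.
df = pd.DataFrame({
    "age_group": ["19-35", "19-35", "19-35", "36-60", "36-60", "36-60"],
    "alcohol": [2, -1, 0, 5, -1, 3],
    "insurance": ["yes", None, "yes", "no", "no", None],
    "income": ["med", "med", None, "high", None, "high"],
})

# Recode the -1 sentinel as 0 (non-drinker) instead of dropping the rows.
df["alcohol"] = df["alcohol"].replace(-1, 0)

def fill_with_group_mode(series: pd.Series) -> pd.Series:
    """Fill missing values with the most frequent value within the group."""
    modes = series.mode(dropna=True)
    # If the whole group is missing there is no mode; leave it untouched.
    return series.fillna(modes.iloc[0]) if not modes.empty else series

# Impute each column separately within every age group.
for col in ["insurance", "income"]:
    df[col] = df.groupby("age_group")[col].transform(fill_with_group_mode)

print(df)
```

When a group has several equally frequent values, `mode()` returns them sorted and this sketch takes the first one, which is a simplification worth revisiting on real data.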

3. Dimensionality Reduction & Feature Selection

I optimized the feature set to prevent multicollinearity and reduce model noise:

Redundancy Elimination: Correlation analysis revealed a near-perfect linear relationship between weight and BMI, which is expected because BMI is computed from height and weight. I retained BMI as the more indicative health metric and removed height and weight to streamline the model.
Hypothesis Testing: Stacked bar charts showed no strong relationship between smoking status and COVID-19 positive status in this dataset, so I removed the smoking variable.
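The two checks above can be approximated as follows. This is a sketch on made-up data: the column names (height, weight, bmi, smoking, covid19_positive) are assumptions, and the toy rows are deliberately constructed so that every smoking category has the same positive share. The row-normalised crosstab holds the numbers a 100% stacked bar chart would display.

```python
import pandas as pd

# Toy data; names and values are assumptions for illustration only.
df = pd.DataFrame({
    "height": [150, 160, 170, 180, 165, 175],   # cm
    "weight": [50, 62, 75, 90, 58, 82],         # kg
    "smoking": ["never", "never", "quit", "quit", "yes", "yes"],
    "covid19_positive": [0, 1, 0, 1, 0, 1],
})
df["bmi"] = df["weight"] / (df["height"] / 100) ** 2

# Pearson correlation between the candidate redundant features.
corr = df[["height", "weight", "bmi"]].corr()
weight_bmi_corr = corr.loc["weight", "bmi"]

# Share of negative/positive cases per smoking category
# (the values behind a 100% stacked bar chart).
positive_share = pd.crosstab(
    df["smoking"], df["covid19_positive"], normalize="index"
)

# Drop the redundant and uninformative columns.
reduced = df.drop(columns=["height", "weight", "smoking"])

print(f"weight vs BMI correlation: {weight_bmi_corr:.2f}")
print(positive_share)
```

To draw the chart itself, `positive_share.plot(kind="bar", stacked=True)` would work if matplotlib is installed. A chi-square test of independence on the same crosstab would be a natural way to back the visual impression with a formal test.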

📈 Results & Key Achievements

My exploration and preprocessing pipeline transformed a fragmented dataset into a high-quality, model-ready environment:

Achievement	Impact
Data Recovery	Successfully restored 4,200+ region codes misidentified as null values.
Feature Engineering	Created New Age Groups (0-18, 19-35, 36-60, 60+) to better capture risk patterns.
Outlier Management	Implemented the Capping Method (IQR) to minimize the impact of extreme numerical values.
New Metrics	Designed a Health Condition Index by integrating multiple disease variables into a single scoring system.
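The three derived features in the table can be sketched as below. Everything specific here is an assumption made for illustration: the column names, the choice to put age 60 in the "36-60" band, the conventional 1.5 × IQR fences, and the definition of the index as an unweighted count of conditions. The project's actual scoring may differ.

```python
import pandas as pd

# Toy data; names and values are assumptions for illustration only.
df = pd.DataFrame({
    "age": [5, 18, 19, 35, 36, 60, 61, 90],
    "bmi": [22.0, 24.0, 23.0, 25.0, 21.0, 26.0, 24.5, 80.0],
    "diabetes": [0, 0, 1, 0, 1, 1, 0, 1],
    "asthma": [0, 1, 0, 0, 1, 0, 0, 1],
    "heart_disease": [0, 0, 0, 0, 1, 1, 0, 1],
})

# Age bands: bin edges are right-inclusive, so 60 lands in "36-60"
# and 61 in "60+" (an assumption about where the boundary sits).
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 18, 35, 60, float("inf")],
    labels=["0-18", "19-35", "36-60", "60+"],
    include_lowest=True,
)

def cap_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] instead of dropping them."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

df["bmi_capped"] = cap_iqr(df["bmi"])

# Hypothetical index: unweighted count of reported conditions per person.
condition_cols = ["diabetes", "asthma", "heart_disease"]
df["health_condition_index"] = df[condition_cols].sum(axis=1)

print(df[["age", "age_group", "bmi", "bmi_capped", "health_condition_index"]])
```

Capping keeps every row in the dataset while limiting how far a single extreme value can pull a model, which is the trade-off behind choosing it over row deletion.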

The "Model-Ready" Outcome

By the end of this exploration, I delivered a final processed dataset in which categorical variables were encoded and every column's data type matched the technical specification. Based on these findings, I identified Classification as the most suitable data mining task for future risk prediction.
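As an illustration of what "encoded" can look like for a classification task, here is a minimal sketch. The encoding scheme is an assumption (ordinal codes for the ordered age bands, one-hot columns for the nominal region), as are the column names and the target covid19_positive.

```python
import pandas as pd

# Toy data; names and values are assumptions for illustration only.
df = pd.DataFrame({
    "region": ["NA", "EU", "AS", "NA"],
    "age_group": ["0-18", "19-35", "60+", "36-60"],
    "covid19_positive": [0, 1, 0, 1],
})

# Ordinal encoding: age bands have a natural order, so map them to 0..3.
age_order = ["0-18", "19-35", "36-60", "60+"]
df["age_group"] = pd.Categorical(
    df["age_group"], categories=age_order, ordered=True
).codes

# One-hot encoding: region has no order, so expand it into indicator columns.
encoded = pd.get_dummies(df, columns=["region"], dtype=int)

# Split into feature matrix and classification target.
X = encoded.drop(columns=["covid19_positive"])
y = encoded["covid19_positive"]

print(X)
```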

Gyuri's Blog
