A dataset is just a collection of numbers until you perform exploratory data analysis (EDA). It is the process of summarizing the main characteristics of the data to uncover its patterns, anomalies, and relationships before any modeling begins.
1. Understanding the Shape and Stats
EDA starts with pure statistical math. We first check the 'shape' (df.shape) to know the scope of the problem—whether we are dealing with thousands or millions of rows.
Then, we generate an executive summary using .describe(). By default it covers the numeric columns and reports the count, mean, standard deviation, min, quartiles, and max, which is enough to spot obvious anomalies at a glance, like a maximum age of 999 due to a typo. You must know your battlefield before training a model.
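The 999 typo can be reproduced on a tiny in-memory frame. This is a minimal sketch: the toy data and the MAX_PLAUSIBLE_AGE bound are assumptions made for illustration, not part of the original dataset.

```python
import pandas as pd

# Hypothetical toy data; the age of 999 stands in for a typo.
toy = pd.DataFrame({
    "age": [23, 35, 41, 29, 999],
    "plan_type": ["Free", "Free", "Premium", "Free", "Free"],
})

# describe() summarizes only the numeric columns by default.
numeric_summary = toy.describe()
max_age = numeric_summary.loc["max", "age"]

# MAX_PLAUSIBLE_AGE is an assumed domain bound, not a pandas feature.
MAX_PLAUSIBLE_AGE = 120
suspicious = toy[toy["age"] > MAX_PLAUSIBLE_AGE]

print(numeric_summary)
print(f"Rows with implausible age: {len(suspicious)}")

# include="all" also summarizes text columns (count, unique, top, freq).
full_summary = toy.describe(include="all")
print(full_summary)
```

The max row exposes the typo immediately, while the mean is dragged far above every legitimate age, which is why checking min and max first tends to pay off.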
import pandas as pd
df = pd.read_csv('user_data.csv')
print(f"Shape: {df.shape}")
summary = df.describe()
print(summary)
2. Categorical Counts
Not everything in data science is continuous numbers; we also have categorical data like subscription plans or countries. Using .value_counts() allows us to quickly see the distribution of these categorical groups.
If we are trying to predict which users will upgrade to a 'Premium' plan, but 99% of our historical data is from 'Free' users, we have a massive class imbalance. If we don't catch this during EDA and address it before training (for example with resampling or class weights), the model can reach 99% accuracy by lazily predicting 'Free' every time.
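To make the "lazy model" concrete, the share of the largest class is exactly the accuracy of a model that always predicts it. A minimal sketch, assuming a hypothetical label column with 99 'Free' users and 1 'Premium' user:

```python
import pandas as pd

# Hypothetical labels: 99 'Free' users and 1 'Premium' user.
plans = pd.Series(["Free"] * 99 + ["Premium"], name="plan_type")

# Relative frequency of each plan.
shares = plans.value_counts(normalize=True)

# The largest share equals the accuracy of always predicting that class.
majority_class = shares.idxmax()
baseline_accuracy = shares.max()

print(f"Majority class: {majority_class}")
print(f"Accuracy of always predicting it: {baseline_accuracy:.2%}")
```

Any model trained on this data has to beat that baseline on a metric that looks at the minority class, such as recall for 'Premium', before its accuracy means anything.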
plan_dist = df['plan_type'].value_counts(normalize=True)
print(plan_dist)
3. Visualizing Distributions
Raw numbers are notoriously hard to interpret, so we visualize them. We analyze individual feature distributions using histograms or Kernel Density Estimate (KDE) curves to find 'skewness' or asymmetry.
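For example, salaries often have a long tail to the right because of a few very high earners. In such a right-skewed distribution the mean sits well above the median, and the extreme values can dominate models that are sensitive to scale, such as linear regression, unless we detect the skew now and apply a mathematical transformation (a logarithm is a common choice) later.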
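Skewness can also be measured as a number, which makes the effect of a transformation easy to check. The sketch below uses hypothetical salaries that double at each step, and a log transform as one common choice among several; neither comes from the original dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries that double at each step, giving a long right tail.
salary = pd.Series(20_000 * 2 ** np.arange(10), name="salary")

# Positive skewness means the tail extends to the right.
skew_before = salary.skew()

# log1p compresses large values far more than small ones.
log_salary = np.log1p(salary)
skew_after = log_salary.skew()

print(f"Skew before log transform: {skew_before:.2f}")
print(f"Skew after log transform:  {skew_after:.2f}")
```

On real data the transformed skew will rarely land this close to zero; the point is only that the number should move toward zero if the transformation suits the feature.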
import seaborn as sns
import matplotlib.pyplot as plt
# Check individual distribution for bias
sns.kdeplot(df['salary'])
plt.title('Income Distribution Skew')
plt.show()
4. Detecting Outliers
While exploring the data, we will inevitably encounter 'Outliers'—atypical values that stray wildly from the main group. They could be measurement errors (a broken sensor) or rare but legitimate events.
In EDA, our mission as engineers is to actively hunt them down using boxplots or quantiles. Once found, we must make a crucial engineering decision: do we delete them to clean the dataset, or do we keep them because they represent a critical edge case?
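A boxplot draws its whiskers from the interquartile range (IQR), and the same rule can be applied directly in code. This is a minimal sketch on hypothetical prices; the 1.5 multiplier is the usual boxplot convention, not a value taken from this dataset.

```python
import pandas as pd

# Hypothetical prices; 500 stands in for a suspicious value.
price = pd.Series([10, 12, 11, 13, 12, 14, 11, 500], name="price")

# Quartiles and the interquartile range.
q1 = price.quantile(0.25)
q3 = price.quantile(0.75)
iqr = q3 - q1

# Conventional boxplot fences: 1.5 * IQR beyond each quartile.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Flag values outside the fences for review; nothing is deleted here.
flagged = price[(price < lower_fence) | (price > upper_fence)]

print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Flagged values: {flagged.tolist()}")
```

Unlike a fixed percentile cutoff, this rule flags nothing when the data has no extreme values, so the count of flagged rows is itself informative.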
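# Flag values above the 99th percentile as candidate outliers.
# By construction this marks roughly the top 1% of rows, so inspect them before deleting anything.
upper_limit = df['price'].quantile(0.99)
outliers = df[df['price'] > upper_limit]
print(f"Found {len(outliers)} candidate outliers")
5. Heatmaps for Correlation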
To precisely analyze relationships between features, professionals use Heatmaps. Instead of squinting at dots on a scatter plot, we look at colors representing correlation coefficients (ranging from -1 to 1).
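If the correlation between 'number of rooms' and 'house price' is a bright red 0.95, we know the two move together almost linearly, which makes this feature a strong candidate predictor. Keep in mind that Pearson correlation only measures linear association, so a value near 0 does not rule out a nonlinear relationship. This thermal map acts as a cheat sheet, guiding us toward the columns most likely to matter to the AI model.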
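# Calculating the Pearson correlation matrix for the numeric columns only
# (text columns such as plan_type raise an error in pandas 2.0+ otherwise)
corr_matrix = df.corr(numeric_only=True)
# Rendering heatmap; vmin/vmax pin the color scale to the full -1 to 1 range
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()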
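When the matrix has many columns, reading one column of it as a ranked list is often quicker than scanning colors. A minimal sketch, assuming a hypothetical housing frame whose column names are illustrative only:

```python
import pandas as pd

# Hypothetical housing data; column names are illustrative only.
houses = pd.DataFrame({
    "rooms": [2, 3, 3, 4, 5, 6],
    "age_years": [30, 5, 18, 40, 12, 25],
    "city": ["A", "B", "A", "C", "B", "A"],
    "price": [150, 210, 205, 290, 350, 410],
})

# numeric_only=True skips the text column instead of raising an error.
corr_matrix = houses.corr(numeric_only=True)

# Rank features by the strength of their linear relationship with price.
ranked = (
    corr_matrix["price"]
    .drop("price")
    .abs()
    .sort_values(ascending=False)
)
print(ranked)
```

Taking the absolute value keeps strongly negative correlations near the top of the list, since they are just as informative as strongly positive ones.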