Data Cleaning: Missing Values in Pandas

Code Syllabus Team
AI & Data Science Instructors
"Garbage in, garbage out." The quality of any AI or Machine Learning model is strictly bottlenecked by the quality of the data fed into it. Handling missing values correctly is one of the most critical steps in Exploratory Data Analysis (EDA).
Why Data Goes Missing
Missing data (`NaN`, Nulls, or `None`) occurs for various reasons. In data science, we categorize missingness into three types:
- MCAR (Missing Completely At Random): The probability of data being missing is the same for all observations. (e.g., A sensor randomly failed).
- MAR (Missing At Random): The probability depends on other observed data. (e.g., Men might be less likely to fill out a survey question about depression than women).
- MNAR (Missing Not At Random): The missing value depends on the unobserved value itself. (e.g., People with very high incomes refusing to report their income).
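Before choosing a strategy, you first need to see where the gaps are. A minimal sketch (the column names and values are made up to mirror the three categories above):

```python
import pandas as pd
import numpy as np

# Toy data with hypothetical columns illustrating each missingness type
df = pd.DataFrame({
    "sensor_reading": [21.5, np.nan, 19.8, 22.1],    # MCAR: random sensor failure
    "depression_score": [np.nan, 3.0, np.nan, 5.0],  # MAR: depends on another column
    "income": [48000.0, np.nan, 52000.0, np.nan],    # MNAR: high earners decline
})

# Count missing values per column -- the usual first step in EDA
print(df.isna().sum())
```

`df.isna().sum()` returns a Series of NaN counts per column, which tells you at a glance which columns need attention.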
Strategy 1: Deletion (dropna)
If a column or row has too many missing values, and the data is MCAR, the easiest approach is to drop it using Pandas dropna().
However, if you drop too much data, your model loses statistical power. Rule of thumb: if a column is missing more than 60% of its values, drop the column. If only a few rows are missing data in a large dataset, drop the rows.
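The rule of thumb above can be applied directly with a column-wise missing ratio followed by a row-wise `dropna()` (the column names here are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, np.nan],  # 60% missing -> kept (at the limit)
    "b": [np.nan, np.nan, np.nan, np.nan, 5.0],  # 80% missing -> dropped
    "c": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Drop any column missing more than 60% of its values
threshold = 0.6
df = df.loc[:, df.isna().mean() <= threshold]

# Then drop the few remaining rows that still contain a NaN
df = df.dropna()
```

`df.isna().mean()` gives the fraction of missing values per column, so the boolean mask implements the 60% rule in one line.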
Strategy 2: Imputation (fillna)
Imputation means filling in the missing data with educated guesses. This is done via fillna().
- Mean/Median Imputation: Great for numerical columns. Use median if the data is heavily skewed (has outliers).
- Mode Imputation: The most frequent value. Used for categorical data (e.g., filling a missing 'City' with the most common city).
- Forward/Backward Fill: `ffill` or `bfill` are typically used in Time Series data to carry the last known value forward (or the next known value backward).
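The three imputation styles above can be sketched on a toy DataFrame (column names and values are hypothetical):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "price": [10.0, np.nan, 200.0, 12.0],       # skewed numeric -> median
    "city": ["Oslo", "Oslo", None, "Bergen"],   # categorical -> mode
    "temp": [20.1, np.nan, np.nan, 21.3],       # time series -> forward fill
})

# Median imputation: robust to the 200.0 outlier (mean would be pulled up)
df["price"] = df["price"].fillna(df["price"].median())

# Mode imputation: mode() returns a Series, so take the first entry
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Forward fill: carry the last known reading forward
df["temp"] = df["temp"].ffill()
```

Note how the median (12.0) ignores the outlier, whereas the mean of `price` would have been pulled toward 200.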
Frequently Asked Questions
How to handle missing values in Pandas DataFrames?
To handle missing values in Pandas, you first identify them using df.isna().sum(). Then, you can choose to remove them using df.dropna() if the missing data is negligible, or replace them using df.fillna(). For numerical data, a standard practice is replacing NaNs with the mean or median: df['column'] = df['column'].fillna(df['column'].median()). (Avoid inplace=True on a column selection, as chained inplace assignment is deprecated in recent Pandas versions.)
What is the difference between dropna() and fillna() in Pandas?
dropna(): Completely removes rows (or columns) that contain at least one NaN value. It reduces the size of your dataset and can lead to loss of valuable information if overused.
fillna(): Replaces the NaN values with a specified value (like 0, a string, or a statistical measure like mean/mode) while keeping the row intact. This preserves dataset size and is generally preferred for Machine Learning.
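The trade-off between the two can be seen directly by comparing row counts (a minimal sketch with a made-up `score` column):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"score": [90.0, np.nan, 75.0, np.nan, 88.0]})

# dropna shrinks the dataset; fillna preserves its size
dropped = df.dropna()
filled = df.fillna(df["score"].mean())

print(len(dropped), len(filled))  # 3 5
```

Losing two of five rows here costs 40% of the data, which is why imputation is generally preferred when training Machine Learning models.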
How do I replace NaN with the average in a Pandas DataFrame?
In a Pandas DataFrame, you calculate the mean of the column and pass it into the `fillna()` method.
```python
# Calculate the column mean (NaN values are ignored by default)
avg = df['price'].mean()

# Replace NaN with the average
df['price'] = df['price'].fillna(avg)
```