Missing Values

Real-world datasets are rarely complete. Learn how to clean your DataFrames using Pandas deletion and imputation strategies.

A.D.A: Real-world datasets are messy. Missing values (NaN) can crash machine learning models. Let's learn how to fix them using Pandas.

Stage: Detect Nulls

Before cleaning, you must know what is broken. Use df.isna().sum() to count the missing values in each column.
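As a minimal sketch of this detection step, the snippet below builds a small toy DataFrame (the column names and values are made up for illustration) and counts the gaps per column. Taking the mean of the same boolean mask gives the share of missing values instead of the count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["Lima", "Oslo", None, "Oslo"],
    "price": [9.5, 12.0, 7.25, 3.0],
})

# isna() marks each missing cell as True; sum() adds the Trues per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# mean() of the same mask gives the fraction missing per column.
missing_share = df.isna().mean()
print(missing_share)
```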

Data Cleaning: Missing Values in Pandas

Author

Code Syllabus Team

AI & Data Science Instructors

"Garbage in, garbage out." The quality of any AI or Machine Learning model is strictly bottlenecked by the quality of the data fed into it. Handling missing values correctly is the most critical step in Exploratory Data Analysis (EDA).

Why Data Goes Missing

Missing data (`NaN`, Nulls, or `None`) occurs for various reasons. In data science, we categorize missingness into three types:

  • MCAR (Missing Completely At Random): The probability of a value being missing is the same for all observations and is unrelated to any variable, observed or unobserved (e.g., a sensor fails at random).
  • MAR (Missing At Random): The probability of a value being missing depends on other observed data, but not on the missing value itself (e.g., men might be less likely than women to answer a survey question about depression, so missingness depends on the recorded gender).
  • MNAR (Missing Not At Random): The probability of a value being missing depends on the unobserved value itself (e.g., people with very high incomes refusing to report their income).
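To see why the category matters, here is a small simulation sketch. It covers only MCAR and MNAR, and everything in it is an assumption made for illustration: the synthetic income column, the 30% removal rate, and the "top 10% refuse to answer" rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical income column with no missing values yet.
income = pd.Series(rng.normal(loc=50_000, scale=10_000, size=10_000))

# MCAR: every value has the same 30% chance of being removed.
mcar = income.mask(rng.random(income.size) < 0.30)

# MNAR: the highest incomes (top 10%) are the ones that go unreported.
mnar = income.mask(income > income.quantile(0.90))

true_mean = income.mean()
mcar_mean = mcar.mean()  # mean() skips NaN by default
mnar_mean = mnar.mean()

print(f"true: {true_mean:.0f}  MCAR: {mcar_mean:.0f}  MNAR: {mnar_mean:.0f}")
```

Under MCAR the mean of the remaining values stays close to the true mean, while under MNAR it is pulled downward because the largest values are the ones that disappeared. This is why simply dropping rows is safest when the data is MCAR.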

Strategy 1: Deletion (dropna)

If a column or row has too many missing values, and the data is MCAR, the easiest approach is to drop it using Pandas dropna().

However, if you drop too much data, your model loses statistical power. A common rule of thumb (a heuristic, not a hard threshold): if a column is missing more than 60% of its values, drop the column. If only a few rows are missing data in a large dataset, drop the rows.
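The rule of thumb above could be applied roughly as follows. This is a sketch on made-up data: the 0.60 threshold comes from the heuristic, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "notes" is missing in 4 of 5 rows (80%).
df = pd.DataFrame({
    "age":   [25, 31, np.nan, 40, 22],
    "price": [9.5, 12.0, 7.25, 3.0, 8.0],
    "notes": [np.nan, np.nan, "ok", np.nan, np.nan],
})

# Step 1: drop columns whose share of missing values exceeds 60%.
missing_share = df.isna().mean()
cols_to_drop = missing_share[missing_share > 0.60].index
trimmed = df.drop(columns=cols_to_drop)

# Step 2: drop the few remaining rows that still contain a NaN.
cleaned = trimmed.dropna()
print(cleaned)
```

Dropping the sparse column first matters here: calling dropna() on the original frame would discard every row, because each row has a gap in either "notes" or "age".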

Strategy 2: Imputation (fillna)

Imputation means filling in the missing data with educated guesses. This is done via fillna().

  • Mean/Median Imputation: Great for numerical columns. Use median if the data is heavily skewed (has outliers).
  • Mode Imputation: The most frequent value. Used for categorical data (e.g., filling a missing 'City' with the most common city).
  • Forward/Backward Fill: `ffill` carries the last known value forward and `bfill` carries the next known value backward. Both are mainly used on ordered data such as Time Series.
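The three strategies in the list can be sketched on one toy DataFrame. The columns and values are invented for illustration, and the rows are assumed to be in time order for the forward fill.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "income": [30.0, 32.0, np.nan, 35.0, 400.0],        # skewed by one outlier
    "city": ["Oslo", "Lima", "Oslo", None, "Oslo"],
    "temperature": [20.5, np.nan, np.nan, 22.0, 21.0],  # rows are in time order
})

# Median imputation: the outlier (400) barely moves the median.
df["income"] = df["income"].fillna(df["income"].median())

# Mode imputation: mode() returns a Series, so take its first entry.
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Forward fill: repeat the last observed reading until a new one appears.
df["temperature"] = df["temperature"].ffill()

print(df)
```

In this data the median of the observed incomes is 33.5, whereas the mean would be 124.25 because of the single outlier, which is the reason the median is the safer choice for skewed columns.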

โ“ Frequently Asked Questions (GEO Optimized)

How to handle missing values in Pandas DataFrames?

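To handle missing values in Pandas, you first identify them using df.isna().sum(). Then, you can choose to remove them using df.dropna() if only a negligible share of the data is missing, or replace them using df.fillna(). For numerical data, a standard practice is replacing NaNs with the mean or median: df['column'] = df['column'].fillna(df['column'].median()). Assigning the result back is more reliable than calling fillna(..., inplace=True) on a selected column, which may not update the original DataFrame in recent Pandas versions.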

What is the difference between dropna() and fillna() in Pandas?

dropna(): Completely removes rows (or columns) that contain at least one NaN value. It reduces the size of your dataset and can lead to loss of valuable information if overused.

fillna(): Replaces the NaN values with a specified value (like 0, a string, or a statistical measure like mean/mode) while keeping the row intact. This preserves dataset size, which is why it is often preferred for Machine Learning when rows are too valuable to discard.

How do I replace NaN with average in a Python dictionary or dataframe?

In a Pandas DataFrame, you calculate the mean of the column and pass it into the `fillna()` method.

# Calculate mean
avg = df['price'].mean()

# Replace NaN with average
df['price'] = df['price'].fillna(avg)
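For the dictionary case in the question, one possible approach is to load the dictionary into a DataFrame, fill the gaps there, and convert back. The sketch below assumes the dictionary maps column names to equal-length lists, with None marking missing entries. That input shape is an assumption, not something the question specifies.

```python
import pandas as pd

# Assumed input shape: a dict of equal-length lists, with None marking gaps.
raw = {"price": [10.0, None, 20.0], "qty": [1, 2, None]}

df = pd.DataFrame(raw)  # None in numeric columns becomes NaN

# mean() skips NaN per column; fillna() aligns the means by column name.
filled = df.fillna(df.mean())

# Convert back to a dict of lists if a dictionary is needed.
result = filled.to_dict(orient="list")
print(result)
```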

Pandas Data Cleaning Glossary

df.isna()
Returns a boolean same-sized object indicating if the values are NA (Not Available). Alias: isnull().
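A short illustrative sketch on a made-up Series, showing the boolean mask, the isnull() alias, and the inverse mask notna():

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

mask = s.isna()                 # one boolean per element, same shape as s
same = s.isnull().equals(mask)  # isnull() is an alias, so the masks match
present = s[s.notna()]          # notna() is the inverse mask
```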
df.dropna()
Removes missing values. By default, it drops any row containing at least one missing value.
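A brief sketch of the default behaviour next to two common variations, on a hypothetical three-column frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

any_nan = df.dropna()               # default how="any": only the last row survives
subset_a = df.dropna(subset=["a"])  # only NaNs in column "a" count
full_cols = df.dropna(axis=1)       # drop columns instead of rows
```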
df.fillna()
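Fills NA/NaN values with a specified value: a scalar, or a dict or Series of per-column values. The older method= argument is deprecated in recent Pandas versions in favour of ffill() and bfill().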
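As a small example, a dict lets each column receive its own replacement value. The column names and fill values here are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "qty": [1.0, np.nan, 3.0],
    "city": ["Oslo", None, "Lima"],
})

# A dict maps each column to its own replacement value.
filled = df.fillna({"qty": 0, "city": "unknown"})
```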
df.ffill()
Forward fill. Propagates the last valid observation forward until the next valid observation.
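An illustrative sketch on a made-up Series, with the limit parameter and the backward counterpart bfill() for comparison:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()        # 1.0 is carried into both gaps
capped = s.ffill(limit=1)  # fill at most one consecutive NaN
backward = s.bfill()       # 4.0 is carried backward instead
```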
df.interpolate()
Fills NaN values using an interpolation method (e.g., linear, polynomial) based on data points.
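A minimal sketch with the default linear method, assuming the values are equally spaced:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 40.0])

# Default method="linear" treats the values as equally spaced,
# so the two gaps are filled on a straight line between 10 and 40.
linear = s.interpolate()
```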
SimpleImputer
A Scikit-Learn class that provides basic strategies for imputing missing values (used in ML pipelines).
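A short sketch of median imputation with SimpleImputer, assuming scikit-learn is installed. The data and column names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "income": [30.0, 50.0, np.nan],
})

# fit() learns one median per column; transform() fills the NaNs with it.
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```

In an ML workflow the imputer is typically fitted on the training data only and then reused to transform the test data, so that statistics from the test set do not leak into training.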