Handle Missing Values

Missing values are quite common in real-world data analysis. Dealing with missing data is a fundamental skill to acquire. In Pandas DataFrame, we can easily identify missing data through isnull() or isna() functions.

This tutorial uses classic Iris dataset, which can be downloaded here Iris dataset.

import pandas as pddf = pd.read_csv('Iris.csv')

The Iris dataset is free of missing values. To facilitate our demonstration, we will replace some values with "None" before proceeding.

In the Replace value lesson, we've learned the method of substituting one value with another. Now, let's put this technique into practice by replacing one of the species, specifically "Iris-setosa," with "None." This action will result in the creation of rows containing missing values within the dataset.

Replace Values with "None"

df_new = df.replace(to_replace='Iris-setosa', value=None)

After generating missing values for the Iris dataset and save it to a new variable df_new, we can use isna() to identify rows with missing values.

It's important to note that df_new['Species'].isna() produces a Boolean array that signifies whether a record is missing or not. See example below:

df_new['Species'].isna()

Output:

0 True 1 True 2 True 3 True 4 True ... 145 False 146 False 147 False 148 False 149 False Name: Species, Length: 150, dtype: bool

To filter out original rows containing missing values, we can use:

df_new[df_new['Species'].isna()].head(5)

Output:


Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 1 5.1 3.5 1.4 0.2 None
1 2 4.9 3.0 1.4 0.2 None
2 3 4.7 3.2 1.3 0.2 None
3 4 4.6 3.1 1.5 0.2 None
4 5 5.0 3.6 1.4 0.2 None

Identifying missing values is good, but not enough. In data wrangling, addressing missing values is another crucial step. There are a few commonly taken options to consider:

If you can find the accurate values for the missing data, please replace them with the appropriate information.
If finding accurate values are impossible, then for numerical data, consider using the mean or median to fill in the gaps.
If missing values constitute less than 5% of the entire dataset, you may opt to remove rows containing missing data.

The choice among these options depends on the the data engineer's domain knowledge.

In our situation, we know the correct value for those records, so we will simply replace them with 'Iris-setosa'.

Fill Missing Values via `fillna()` function

Apply the fillna() function only on the column where we intend to substitute missing values with new data.

df_new['Species'].fillna('Iris-setosa', inplace=True)

df_new.head(5)

Output:


Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa

As shown above, the missing values in the Species column have been replaced by "Iris-setosa".

Amazing eBook to learn ggplot2 FAST & EASY

book cover for sliding your way to ggplot2 mastery

	Id	SepalLengthCm	SepalWidthCm	PetalLengthCm	PetalWidthCm	Species
0	1	5.1	3.5	1.4	0.2	None
1	2	4.9	3.0	1.4	0.2	None
2	3	4.7	3.2	1.3	0.2	None
3	4	4.6	3.1	1.5	0.2	None
4	5	5.0	3.6	1.4	0.2	None

	Id	SepalLengthCm	SepalWidthCm	PetalLengthCm	PetalWidthCm	Species
0	1	5.1	3.5	1.4	0.2	Iris-setosa
1	2	4.9	3.0	1.4	0.2	Iris-setosa
2	3	4.7	3.2	1.3	0.2	Iris-setosa
3	4	4.6	3.1	1.5	0.2	Iris-setosa
4	5	5.0	3.6	1.4	0.2	Iris-setosa

Handle Missing Values

Replace Values with "None"

Fill Missing Values via fillna() function

Amazing eBook to learn ggplot2 FAST & EASY

Fill Missing Values via `fillna()` function