Handle Missing Values
Missing values are quite common in real-world data analysis. Dealing with missing data is a fundamental skill to acquire. In Pandas DataFrame, we can easily identify missing data through isnull()
or isna()
functions.
This tutorial uses classic Iris dataset, which can be downloaded here Iris dataset.
import pandas as pddf = pd.read_csv('Iris.csv')
The Iris dataset is free of missing values. To facilitate our demonstration, we will replace some values with "None" before proceeding.
In the Replace value lesson, we've learned the method of substituting one value with another. Now, let's put this technique into practice by replacing one of the species, specifically "Iris-setosa," with "None." This action will result in the creation of rows containing missing values within the dataset.
Replace Values with "None"
df_new = df.replace(to_replace='Iris-setosa', value=None)
After generating missing values for the Iris dataset and save it to a new variable df_new, we can use isna()
to identify rows with missing values.
It's important to note that df_new['Species'].isna()
produces a Boolean array that signifies whether a record is missing or not. See example below:
df_new['Species'].isna()
Output:
0 True
1 True
2 True
3 True
4 True ...
145 False
146 False
147 False
148 False
149 False
Name: Species, Length: 150, dtype: bool
To filter out original rows containing missing values, we can use:
df_new[df_new['Species'].isna()].head(5)
Output:
Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species | |
---|---|---|---|---|---|---|
0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | None |
1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | None |
2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | None |
3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | None |
4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | None |
Identifying missing values is good, but not enough. In data wrangling, addressing missing values is another crucial step. There are a few commonly taken options to consider:
- If you can find the accurate values for the missing data, please replace them with the appropriate information.
- If finding accurate values are impossible, then for numerical data, consider using the mean or median to fill in the gaps.
- If missing values constitute less than 5% of the entire dataset, you may opt to remove rows containing missing data.
The choice among these options depends on the the data engineer's domain knowledge.
In our situation, we know the correct value for those records, so we will simply replace them with 'Iris-setosa'.
Fill Missing Values via fillna()
function
Apply the fillna()
function only on the column where we intend to substitute missing values with new data.
df_new['Species'].fillna('Iris-setosa', inplace=True)
df_new.head(5)
Output:
Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species | |
---|---|---|---|---|---|---|
0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
As shown above, the missing values in the Species column have been replaced by "Iris-setosa".