Find and Drop Duplicates
In this lesson, we'll dive into methods of finding and dropping duplicated rows of a DataFrame.
This tutorial uses the classic Iris dataset.
import pandas as pd

df = pd.read_csv('Iris.csv')
1. Find Duplicated Rows via the duplicated() Function
The duplicated() function returns a boolean Series that marks whether each row is a duplicate of an earlier row. This boolean Series can be used as a filter to select the duplicated rows from the original DataFrame, or, inverted, to remove them.
Running the code below returns an empty result, indicating that no duplicated rows are present in the data.
df[df.duplicated()]
Output:
Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species
---|---|---|---|---|---
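Because duplicated() returns a plain boolean Series, it composes with other pandas operations; for instance, summing the mask counts the duplicates. A minimal sketch, using a small hand-made frame in place of Iris.csv:

```python
import pandas as pd

# A tiny stand-in frame; the tutorial itself uses Iris.csv.
df = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 5.1],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
})

mask = df.duplicated()   # True for each row that repeats an earlier row
print(mask.sum())        # 1 -- only the third row repeats the first
print(df[mask])          # the duplicated row itself
```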
To better illustrate how the function works, we'll manually insert some duplicated rows into the data.
Insert duplicated rows
We will take the last 4 rows of the data and append them to the original dataset, producing duplicated rows in the new DataFrame.
# concatenate the original data with the last 4 rows
df_duplicate = pd.concat([df, df.tail(4)])
# reset the row index
df_duplicate.reset_index(drop=True, inplace=True)
By running the tail(8) function on the new DataFrame df_duplicate, we can see that rows 150 to 153 are duplicates of rows 146 to 149.
df_duplicate.tail(8)
Output:
 | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
---|---|---|---|---|---|---|
146 | 147 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
147 | 148 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
148 | 149 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
149 | 150 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
150 | 147 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
151 | 148 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
152 | 149 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
153 | 150 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
Now let's apply the duplicated() function to the new DataFrame df_duplicate and see what it returns:
df_duplicate[df_duplicate.duplicated()]
Output:
 | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
---|---|---|---|---|---|---|
150 | 147 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
151 | 148 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
152 | 149 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
153 | 150 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
By default, the duplicated() function marks every occurrence after the first as a duplicate. In our example, it has flagged rows 150 to 153 as duplicates, while rows 146 to 149 are the first occurrences and are not flagged.
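This "keep the first occurrence" behaviour is controlled by the keep parameter of duplicated(); a minimal sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"Id": [1, 2, 2], "Species": ["setosa", "virginica", "virginica"]})

print(list(df.duplicated()))              # keep='first' (default): [False, False, True]
print(list(df.duplicated(keep='last')))   # flag earlier copies instead: [False, True, False]
print(list(df.duplicated(keep=False)))    # flag every member of a duplicate group: [False, True, True]
```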
2. Drop Duplicated Rows via the drop_duplicates() Function
Just as the duplicated() function identifies duplicate entries, the drop_duplicates() function removes all duplicate rows, keeping only the first occurrence of each.
df_duplicate.drop_duplicates()
Output:
 | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
---|---|---|---|---|---|---|
0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
... | ... | ... | ... | ... | ... | ... |
145 | 146 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
146 | 147 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
147 | 148 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
148 | 149 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
149 | 150 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
150 rows × 6 columns
As shown above, rows 150 to 153 have been removed by the drop_duplicates() function.
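drop_duplicates() accepts the same keep parameter as duplicated(), plus an ignore_index option (available in pandas 1.0 and later) that renumbers the surviving rows so a separate reset_index() call isn't needed. A small sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"Id": [1, 2, 2, 3]})

kept_first = df.drop_duplicates()  # default: keep the first of each duplicate pair
print(list(kept_first["Id"]))      # [1, 2, 3]

# Keep the last occurrence and renumber the index in one call.
kept_last = df.drop_duplicates(keep='last', ignore_index=True)
print(list(kept_last.index))       # [0, 1, 2]
```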
3. Find and Drop Duplicates Based on a Subset of Columns
In our earlier examples, we searched for duplicate rows with identical values in all columns. Yet there are situations where we need to identify duplicate rows within only a subset of columns. To accomplish this, we can use the subset parameter of the duplicated() and drop_duplicates() functions.
The example below shows all rows that duplicate an earlier row's value in the Species column.
df[df.duplicated(subset=['Species'])]
Output:
 | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
---|---|---|---|---|---|---|
1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
5 | 6 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
... | ... | ... | ... | ... | ... | ... |
145 | 146 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
146 | 147 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
147 | 148 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
148 | 149 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
149 | 150 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
147 rows × 6 columns
Here the duplicated() function looks for duplicates only in the Species column. Since the Species column contains just 3 unique values, 147 of the 150 rows are flagged as duplicates.
Next, we'll use drop_duplicates() to remove the duplicated rows:
df.drop_duplicates(subset=['Species'])
Output:
 | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
---|---|---|---|---|---|---|
0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
50 | 51 | 7.0 | 3.2 | 4.7 | 1.4 | Iris-versicolor |
100 | 101 | 6.3 | 3.3 | 6.0 | 2.5 | Iris-virginica |
After dropping duplicates in the Species column, we are left with only three rows, one per unique species. As mentioned previously, these three rows are retained because they are the first occurrences.
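The subset and keep parameters combine; for instance (a sketch on a toy frame), we can keep the last row per species instead of the first:

```python
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Species": ["setosa", "setosa", "virginica", "virginica"],
})

first_per_species = df.drop_duplicates(subset=["Species"])
print(list(first_per_species["Id"]))   # [1, 3]

last_per_species = df.drop_duplicates(subset=["Species"], keep="last")
print(list(last_per_species["Id"]))    # [2, 4]
```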
Well done! We have examined functions for finding and removing duplicate records. In our upcoming tutorial, we will explore methods for grouping data.