Find and Drop Duplicates
In this lesson, we'll dive into methods of finding and dropping duplicated rows of a DataFrame.
This tutorial uses the classic Iris dataset.
import pandas as pd

df = pd.read_csv('Iris.csv')
1. Find Duplicated Rows via the duplicated() Function
The duplicated() function returns a boolean Series that marks whether each row is a duplicate of an earlier row. This boolean Series can be used as a filter to select the duplicated rows from the original DataFrame, or, inverted, to remove them.
Running the code below returns an empty result, indicating that no duplicated rows are present in the data.
df[df.duplicated()]
Output:
Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species
---|---|---|---|---|---
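Because duplicated() returns a plain boolean Series, it composes with other pandas operations; for instance, summing the mask counts the duplicates. A minimal sketch, using a small hand-made frame in place of Iris.csv:

```python
import pandas as pd

# A tiny stand-in frame; the tutorial itself uses Iris.csv.
df = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 5.1],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
})

mask = df.duplicated()   # True for each row that repeats an earlier row
print(mask.sum())        # 1 -- only the third row repeats the first
print(df[mask])          # the duplicated row itself
```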
To better illustrate how the function works, we'll manually insert some duplicated rows into the data.
Insert duplicated rows
We will take the last 4 rows of the data and append them to the original dataset, producing duplicated rows in the new DataFrame.
# concatenate the original data with the last 4 rows
df_duplicate = pd.concat([df, df.tail(4)])
# reset the row index
df_duplicate.reset_index(drop=True, inplace=True)
By running the tail(8) function on the new DataFrame df_duplicate, we can see that rows 150 to 153 are duplicates of rows 146 to 149.
df_duplicate.tail(8)
Output:
 | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
---|---|---|---|---|---|---|
146 | 147 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
147 | 148 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
148 | 149 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
149 | 150 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
150 | 147 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
151 | 148 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
152 | 149 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
153 | 150 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
Now let's apply the duplicated() function to the new DataFrame df_duplicate and see what it returns:
df_duplicate[df_duplicate.duplicated()]
Output:
 | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
---|---|---|---|---|---|---|
150 | 147 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
151 | 148 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
152 | 149 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
153 | 150 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
By default, the duplicated() function marks every occurrence after the first as a duplicate. In our example, it has flagged rows 150 to 153 as duplicates, while rows 146 to 149 are the first occurrences and are not flagged.
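This "keep the first occurrence" behaviour is controlled by the keep parameter of duplicated(); a minimal sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"Id": [1, 2, 2], "Species": ["setosa", "virginica", "virginica"]})

print(list(df.duplicated()))              # keep='first' (default): [False, False, True]
print(list(df.duplicated(keep='last')))   # flag earlier copies instead: [False, True, False]
print(list(df.duplicated(keep=False)))    # flag every member of a duplicate group: [False, True, True]
```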
2. Drop Duplicated Rows via the drop_duplicates() Function
Just as the duplicated() function identifies duplicate entries, the drop_duplicates() function removes all duplicate rows, keeping only the first occurrence of each.
df_duplicate.drop_duplicates()
Output:
 | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
---|---|---|---|---|---|---|
0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
... | ... | ... | ... | ... | ... | ... |
145 | 146 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
146 | 147 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
147 | 148 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
148 | 149 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
149 | 150 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
150 rows × 6 columns
As shown above, rows 150 to 153 have been removed by the drop_duplicates() function.
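drop_duplicates() accepts the same keep parameter as duplicated(), plus an ignore_index option (available in pandas 1.0 and later) that renumbers the surviving rows so a separate reset_index() call isn't needed. A small sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"Id": [1, 2, 2, 3]})

kept_first = df.drop_duplicates()  # default: keep the first of each duplicate pair
print(list(kept_first["Id"]))      # [1, 2, 3]

# Keep the last occurrence and renumber the index in one call.
kept_last = df.drop_duplicates(keep='last', ignore_index=True)
print(list(kept_last.index))       # [0, 1, 2]
```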
3. Find and Drop Duplicates Based on a Subset of Columns
In our earlier examples, we searched for duplicate rows with identical values in all columns. Yet there are situations where we need to identify duplicate rows within only a subset of columns. To accomplish this, we can use the subset parameter of the duplicated() and drop_duplicates() functions.
The example below shows all rows that duplicate an earlier row's value in the Species column.
df[df.duplicated(subset=['Species'])]
Output:
 | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
---|---|---|---|---|---|---|
1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
5 | 6 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
... | ... | ... | ... | ... | ... | ... |
145 | 146 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
146 | 147 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
147 | 148 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
148 | 149 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
149 | 150 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
147 rows × 6 columns
Here the duplicated() function looks for duplicates only in the Species column. Since the Species column contains just 3 unique values, 147 of the 150 rows are flagged as duplicates.
Next, we'll use drop_duplicates() to remove the duplicated rows:
df.drop_duplicates(subset=['Species'])
Output:
 | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
---|---|---|---|---|---|---|
0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
50 | 51 | 7.0 | 3.2 | 4.7 | 1.4 | Iris-versicolor |
100 | 101 | 6.3 | 3.3 | 6.0 | 2.5 | Iris-virginica |
After dropping duplicates in the Species column, we are left with only three rows, one per unique species. As mentioned previously, these three rows are retained because they are the first occurrences.
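The subset and keep parameters combine; for instance (a sketch on a toy frame), we can keep the last row per species instead of the first:

```python
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Species": ["setosa", "setosa", "virginica", "virginica"],
})

first_per_species = df.drop_duplicates(subset=["Species"])
print(list(first_per_species["Id"]))   # [1, 3]

last_per_species = df.drop_duplicates(subset=["Species"], keep="last")
print(list(last_per_species["Id"]))    # [2, 4]
```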
Well done! We have examined functions for finding and removing duplicate records. In our upcoming tutorial, we will explore methods for grouping data.