
dropDuplicates()
The dropDuplicates() function removes duplicate rows from a DataFrame. Called without arguments, it compares rows across all columns and keeps one copy of each distinct row. Alternatively, you can pass a list of column names so that only that subset is considered when identifying duplicates.
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
# Initialize Spark Session
spark = SparkSession.builder.appName("selectExample").getOrCreate()
# Sample data with duplicates
data = [
    ("James", "Smith", "USA", 1),
    ("Anna", "Rose", "UK", 2),
    ("Robert", "Williams", "USA", 3),
    ("James", "Bond", "USA", 1),       # Same Firstname, Country and ID as row 1, different Lastname
    ("Anna", "Rose", "UK", 2),         # Exact duplicate
    ("Robert", "Williams", "USA", 3),  # Exact duplicate
]
columns = ["Firstname", "Lastname", "Country", "ID"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+---------+--------+-------+---+
|Firstname|Lastname|Country| ID|
+---------+--------+-------+---+
|    James|   Smith|    USA|  1|
|     Anna|    Rose|     UK|  2|
|   Robert|Williams|    USA|  3|
|    James|    Bond|    USA|  1|
|     Anna|    Rose|     UK|  2|
|   Robert|Williams|    USA|  3|
+---------+--------+-------+---+
Example: Drop duplicate rows based on all columns of a DataFrame
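df.dropDuplicates(): This removes all duplicate rows in the DataFrame df. If two or more rows are exactly the same across all columns, only one is kept. Here the repeated Anna Rose and Robert Williams rows are removed, while the James Bond row stays because its Lastname differs from that of James Smith.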
df.dropDuplicates().show()
Output:
+---------+--------+-------+---+
|Firstname|Lastname|Country| ID|
+---------+--------+-------+---+
|    James|   Smith|    USA|  1|
|     Anna|    Rose|     UK|  2|
|   Robert|Williams|    USA|  3|
|    James|    Bond|    USA|  1|
+---------+--------+-------+---+
Example: Drop duplicates based on a specified column
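df.dropDuplicates(["Lastname"]): This removes duplicate rows based on the Lastname column alone. It keeps one row for each unique Lastname, irrespective of the other column values. On a distributed DataFrame, Spark does not guarantee which of the matching rows is kept. In this sample every Lastname clash is between identical rows, so the result contains the same four rows as the all-columns example, possibly in a different order.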
# Remove duplicates based on the Lastname column
df.dropDuplicates(["Lastname"]).show()
Output:
+---------+--------+-------+---+
|Firstname|Lastname|Country| ID|
+---------+--------+-------+---+
|    James|    Bond|    USA|  1|
|     Anna|    Rose|     UK|  2|
|    James|   Smith|    USA|  1|
|   Robert|Williams|    USA|  3|
+---------+--------+-------+---+
# Stop the Spark Session
spark.stop()