
replace()
The replace() method replaces values in a DataFrame. It substitutes specific values in one or more columns, which is useful for data cleaning and transformation.
Usage
replace(search_value, replacement_value, column_name) takes three arguments. (PySpark itself names these parameters to_replace, value, and subset.)
- search_value: The value to search for and replace.
- replacement_value: The value that replaces search_value in the specified column or columns.
- column_name (optional): The name of a column, or a list of column names, in which to perform the replacement. If not specified, the entire DataFrame is searched for search_value, and every match is replaced with replacement_value.
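The effect of the optional column argument can be modelled in plain Python, without Spark. This is only an illustrative sketch: replace_values and the list-of-dicts row layout are hypothetical stand-ins, not PySpark's implementation, and the third row is a made-up edge case where the search value also appears in the Name column.

```python
def replace_values(rows, search_value, replacement_value, column_name=None):
    """Hypothetical model of replace(): returns new rows, leaves the input untouched."""
    # Normalise column_name to a set of target columns; None means "all columns".
    if column_name is None:
        targets = None
    elif isinstance(column_name, str):
        targets = {column_name}
    else:
        targets = set(column_name)

    result = []
    for row in rows:
        new_row = {}
        for col, val in row.items():
            in_scope = targets is None or col in targets
            # Only exact matches inside the targeted columns are replaced.
            new_row[col] = replacement_value if in_scope and val == search_value else val
        result.append(new_row)
    return result


rows = [
    {"Name": "James", "State": "New York"},
    {"Name": "Anna", "State": "California"},
    {"Name": "California", "State": "California"},  # made-up edge case
]

# Restricted to the State column: the Name value "California" is left alone.
only_state = replace_values(rows, "California", "CA", ["State"])

# No column given: every column is searched, so both values are replaced.
everywhere = replace_values(rows, "California", "CA")
```

Restricting the replacement to specific columns is usually the safer choice, because it avoids changing a matching value that happens to appear in an unrelated column.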
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
# Initialize Spark Session
spark = SparkSession.builder.appName("replaceExample").getOrCreate()
# Sample DataFrame
data = [("James", "New York"), ("Anna", "California"), ("Robert", "California")]
columns = ["Name", "State"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+------+----------+
|  Name|     State|
+------+----------+
| James|  New York|
|  Anna|California|
|Robert|California|
+------+----------+
Example: Use replace() to search and replace a value
replace("California", "CA", ["State"]) replaces the value "California" with "CA" in the State column.
replaced_df = df.replace("California", "CA", ["State"])
replaced_df.show()
Output:
+------+--------+
|  Name|   State|
+------+--------+
| James|New York|
|  Anna|      CA|
|Robert|      CA|
+------+--------+
# Stop the Spark Session
spark.stop()