
withColumn()
The withColumn() function is a versatile tool used to add a new column to a DataFrame or to replace an existing column with new values, making it essential for data transformation tasks. withColumn() takes two arguments: the name of the new or existing column, and the expression that defines the values of the column. Because DataFrames are immutable, it returns a new DataFrame with the specified column added or modified rather than changing the original.
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.appName("withColumnExample").getOrCreate()

# Create a Spark DataFrame
data = [("John", 28), ("Sara", 33), ("Mike", 23)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+----+---+
|Name|Age|
+----+---+
|John| 28|
|Sara| 33|
|Mike| 23|
+----+---+
Example: Use withColumn() to add a new column to the DataFrame
df.withColumn("AgeAfter10Years", col("Age") + 10): this line adds a new column named AgeAfter10Years to the DataFrame df. The values in the new column are calculated by adding 10 to each value in the Age column. Essentially, it projects the age of each individual ten years into the future.
from pyspark.sql.functions import col

# Adding a new column
df.withColumn("AgeAfter10Years", col("Age") + 10).show()
Output:
+----+---+---------------+
|Name|Age|AgeAfter10Years|
+----+---+---------------+
|John| 28| 38|
|Sara| 33| 43|
|Mike| 23| 33|
+----+---+---------------+
Example: Use withColumn() to replace an existing column
df.withColumn("Age", col("Age") * 2): here, instead of adding a new column, the existing Age column in the DataFrame df is replaced. The replacement values are computed by multiplying each original age value by 2. This could be useful, for example, when you need to scale the age values for a specific analytical purpose.
df.withColumn("Age", col("Age") * 2).show()
Output:
+----+---+
|Name|Age|
+----+---+
|John| 56|
|Sara| 66|
|Mike| 46|
+----+---+
# Stop the Spark Session
spark.stop()