
withColumn()
The withColumn() function is a versatile tool used to add a new column to a DataFrame or to replace an existing column with new values, making it essential for data transformation tasks. withColumn() takes two arguments: the name of the new or existing column, and the expression that defines the values of the column. Because DataFrames are immutable, it returns a new DataFrame with the specified column added or modified rather than changing the original.
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.appName("withColumnExample").getOrCreate()

# Create a Spark DataFrame
data = [("John", 28), ("Sara", 33), ("Mike", 23)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+----+---+
|Name|Age|
+----+---+
|John| 28|
|Sara| 33|
|Mike| 23|
+----+---+
Example: Use withColumn() to add a new column to the DataFrame
df.withColumn("AgeAfter10Years", col("Age") + 10): this line adds a new column named AgeAfter10Years to the DataFrame df. The values in the new column are calculated by adding 10 to each value in the Age column. Essentially, it projects the age of each individual ten years into the future.
from pyspark.sql.functions import col

# Adding a new column
df.withColumn("AgeAfter10Years", col("Age") + 10).show()
Output:
+----+---+---------------+
|Name|Age|AgeAfter10Years|
+----+---+---------------+
|John| 28| 38|
|Sara| 33| 43|
|Mike| 23| 33|
+----+---+---------------+
Example: Use withColumn() to replace an existing column
df.withColumn("Age", col("Age") * 2): here, instead of adding a new column, the existing Age column in the DataFrame df is replaced. The replacement values are computed by multiplying each original age value by 2. This could be useful, for example, when you need to scale the age values for a specific analytical purpose.
df.withColumn("Age", col("Age") * 2).show()
Output:
+----+---+
|Name|Age|
+----+---+
|John| 56|
|Sara| 66|
|Mike| 46|
+----+---+
# Stop the Spark Session
spark.stop()