
selectExpr()
The selectExpr()
function is used to perform complex operations and transformations on DataFrame columns using SQL expressions. It provides a more flexible and powerful way to manipulate data compared to the basic select()
function.selectExpr()
is particularly useful when you need to perform operations that are easily expressed in SQL syntax, such as conditional logic, string manipulation, and mathematical computations.
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
# Initialize Spark Sessionspark = SparkSession.builder.appName("Example").getOrCreate()
# Create a Spark DataFrame data = [("James", "Smith", "USA", 1), ("Anna", "Rose", "UK", 2), ("Robert", "Williams", "USA", 3)]
columns = ["Firstname", "Lastname", "Country", "ID"]
df = spark.createDataFrame(data, columns)df.show()
Output:
+---------+--------+-------+---+
|Firstname|Lastname|Country| ID|
+---------+--------+-------+---+
| James| Smith| USA| 1|
| Anna| Rose| UK| 2|
| Robert|Williams| USA| 3|
+---------+--------+-------+---+
Example: Use selectExpr()
for creating a boolean column based on conditions
- The code below creates a new DataFrame called selected_df by selecting specific columns Firstname and Lastname from the original
df
DataFrame. - It also includes a new boolean column IsInUS using the
selectExpr()
function. The SQL expressionCountry in ('USA')
checks if the Country column has a value equal to 'USA', resulting in a Boolean column. Thus, the IsInUS column will contain 'true' for rows where 'Country' is 'USA' and 'false' for other rows.
selected_df = df.selectExpr("Firstname", "Lastname", "Country in ('USA') as IsInUS")selected_df.show()
Output:
+---------+--------+------+
|Firstname|Lastname|IsInUS|
+---------+--------+------+
| James| Smith| true|
| Anna| Rose| false|
| Robert|Williams| true|
+---------+--------+------+
Example: Use selectExpr()
for concatenating two string columns
df.selectExpr("Firstname || ' ' || Lastname as fullName")
: this line of code creates a new DataFrame named df_fullName by combining the Firstname column and the Lastname to create a new column named fullName.
df_fullName = df.selectExpr("Firstname || ' ' || Lastname as fullName")df_fullName.show()
Output:
+---------------+
| fullName|
+---------------+
| James Smith|
| Anna Rose|
|Robert Williams|
+---------------+
Example: Use selectExpr()
for math calculations
Create a new sample DataFrame
# create a new DataFramedf = spark.createDataFrame([(10, 2), (20, 4), (30, 6)], ["value", "weight"])df.show()
Output:
+-----+------+
|value|weight|
+-----+------+
| 10| 2|
| 20| 4|
| 30| 6|
+-----+------+
df.selectExpr("*", "weight*2 as doublingWeight")
: this line of code selects all columns of the df and creates a new column doublingWeight by multiplying the weight column by 2.
# Using selectExpr for a multipliation calculationdf.selectExpr("*", "weight*2 as doublingWeight").show()
Output:
+-----+------+--------------+
|value|weight|doublingWeight|
+-----+------+--------------+
| 10| 2| 4|
| 20| 4| 8|
| 30| 6| 12|
+-----+------+--------------+
# Stop the Spark Sessionspark.stop()