
select()
The select() function selects specific columns from a DataFrame and returns a new DataFrame. It is one of the most common operations in data processing and analysis.
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
# Initialize Spark Session
spark = SparkSession.builder.appName("selectExample").getOrCreate()
# Create a Spark DataFrame
data = [("James", "Smith", "USA", 1), ("Anna", "Rose", "UK", 2), ("Robert", "Williams", "USA", 3)]
columns = ["Firstname", "Lastname", "Country", "ID"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+---------+--------+-------+---+
|Firstname|Lastname|Country| ID|
+---------+--------+-------+---+
|    James|   Smith|    USA|  1|
|     Anna|    Rose|     UK|  2|
|   Robert|Williams|    USA|  3|
+---------+--------+-------+---+
Example: use select() to select a single column from a DataFrame
df.select("Firstname"): selects a single column from DataFrame df and returns a new DataFrame
# Select a single column
df.select("Firstname").show()
Output:
+---------+
|Firstname|
+---------+
|    James|
|     Anna|
|   Robert|
+---------+
Example: use select() to select multiple columns from a DataFrame
df.select("Firstname", "Lastname"): selects multiple columns from DataFrame df and returns a new DataFrame
# Select only the Firstname and Lastname columns and save the new DataFrame to df_name
df_name = df.select("Firstname", "Lastname")
df_name.show()
Output:
+---------+--------+
|Firstname|Lastname|
+---------+--------+
|    James|   Smith|
|     Anna|    Rose|
|   Robert|Williams|
+---------+--------+
Example: use select() to select all columns of a DataFrame
df.select("*"): selects all columns of the DataFrame df and returns a new DataFrame
# Select all columns and save the new DataFrame to df_all
df_all = df.select("*")
df_all.show()
Output:
+---------+--------+-------+---+
|Firstname|Lastname|Country| ID|
+---------+--------+-------+---+
|    James|   Smith|    USA|  1|
|     Anna|    Rose|     UK|  2|
|   Robert|Williams|    USA|  3|
+---------+--------+-------+---+
End the Spark Session
spark.stop()