
filter()
The filter() function filters the rows of a DataFrame based on one or more conditions, such as equality checks, range queries, and string operations. It returns a new DataFrame containing only the rows that satisfy the specified condition.
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession

# Initialize the Spark Session
spark = SparkSession.builder.appName("filterExample").getOrCreate()

# Create a Spark DataFrame
data = [("John", 28, "Software Engineer"), ("Sara", 33, "Doctor"), ("Mike", 23, "Teacher")]
columns = ["Name", "Age", "Profession"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+----+---+-----------------+
|Name|Age| Profession|
+----+---+-----------------+
|John| 28|Software Engineer|
|Sara| 33| Doctor|
|Mike| 23| Teacher|
+----+---+-----------------+
Example: Filter data based on a single condition
df.filter(col("Age") > 30) filters the DataFrame df, keeping only the rows where the Age column is greater than 30.
from pyspark.sql.functions import col

df.filter(col("Age") > 30).show()
Output:
+----+---+----------+
|Name|Age|Profession|
+----+---+----------+
|Sara| 33| Doctor|
+----+---+----------+
Example: Filter data based on multiple conditions
df.filter((col("Age") > 25) & (col("Profession") == "Doctor")) applies multiple conditions, keeping the rows where Age is greater than 25 and Profession is "Doctor". Note that each condition must be wrapped in its own parentheses when combined with & (and) or | (or).
df.filter((col("Age") > 25) & (col("Profession") == "Doctor")).show()
Output:
+----+---+----------+
|Name|Age|Profession|
+----+---+----------+
|Sara| 33| Doctor|
+----+---+----------+
# Stop the Spark Session
spark.stop()