
filter()
The filter() function filters the rows of a DataFrame based on one or more conditions, such as equality checks, range queries, and string operations. It returns a new DataFrame containing only the rows that satisfy the specified condition.
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession

# Initialize the Spark Session
spark = SparkSession.builder.appName("filterExample").getOrCreate()

# Create a Spark DataFrame
data = [("John", 28, "Software Engineer"), ("Sara", 33, "Doctor"), ("Mike", 23, "Teacher")]
columns = ["Name", "Age", "Profession"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+----+---+-----------------+
|Name|Age| Profession|
+----+---+-----------------+
|John| 28|Software Engineer|
|Sara| 33| Doctor|
|Mike| 23| Teacher|
+----+---+-----------------+
Example: Filter data based on a single condition
df.filter(col("Age") > 30) filters the DataFrame df, keeping only the rows where the Age column is greater than 30.
from pyspark.sql.functions import col

df.filter(col("Age") > 30).show()
Output:
+----+---+----------+
|Name|Age|Profession|
+----+---+----------+
|Sara| 33| Doctor|
+----+---+----------+
Example: Filter data based on multiple conditions
df.filter((col("Age") > 25) & (col("Profession") == "Doctor")) applies multiple conditions, keeping the rows where Age is greater than 25 and Profession is "Doctor". Note that each condition must be wrapped in its own parentheses when combined with & (and) or | (or).
df.filter((col("Age") > 25) & (col("Profession") == "Doctor")).show()
Output:
+----+---+----------+
|Name|Age|Profession|
+----+---+----------+
|Sara| 33| Doctor|
+----+---+----------+
# Stop the Spark Session
spark.stop()