
min()
The min() function in Apache Spark is an aggregation function that computes the minimum value of a column in a DataFrame.
Usage
- The min() function can be applied directly to a DataFrame to find the minimum value in a specific column.
- When used with groupBy(), it returns the minimum value of the column for each group.
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import min

# Initialize Spark Session
spark = SparkSession.builder.appName("minExample").getOrCreate()

# Sample DataFrame
data = [
    ("group A", 45),
    ("group A", 30),
    ("group A", 55),
    ("group B", 10),
    ("group B", 20),
    ("group B", 60),
]
columns = ["Group", "Variable"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+-------+--------+
| Group|Variable|
+-------+--------+
|group A| 45|
|group A| 30|
|group A| 55|
|group B| 10|
|group B| 20|
|group B| 60|
+-------+--------+
Example: Use min() to compute the minimum value of a column
- min("Variable"): computes the minimum value of the Variable column.
- .alias("Minimum Value"): renames the resulting column to Minimum Value.
df.select(min("Variable").alias("Minimum Value")).show()
Output:
+-------------+
|Minimum Value|
+-------------+
| 10|
+-------------+
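
Note: like other aggregate functions in Spark, min() ignores null values, so a null in the column does not affect the result, and the minimum of an all-null column is null. A minimal sketch illustrating this (the data_nulls and df_nulls names are introduced here purely for illustration):
# min() skips the null row, so the result is 30 rather than null
data_nulls = [("group A", None), ("group A", 30), ("group B", 45)]
df_nulls = spark.createDataFrame(data_nulls, ["Group", "Variable"])
df_nulls.select(min("Variable").alias("Minimum Value")).show()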
Example: Use min() with groupBy() to compute the minimum value of each group
- groupBy("Group"): groups the data by the Group column.
- .agg(min("Variable").alias("Minimum Value")): computes the minimum value for each group and renames the resulting column to Minimum Value.
grouped_data = df.groupBy("Group").agg(min("Variable").alias("Minimum Value"))
grouped_data.show()
Output:
+-------+-------------+
| Group|Minimum Value|
+-------+-------------+
|group A| 30|
|group B| 10|
+-------+-------------+
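
agg() can also combine min() with other aggregates so each group is summarized in a single pass. A minimal sketch along those lines (the max and avg imports and the summary name are illustrative additions, not part of the example above):
from pyspark.sql.functions import avg, max

# Compute several aggregates per group in one agg() call
summary = df.groupBy("Group").agg(
    min("Variable").alias("Minimum Value"),
    max("Variable").alias("Maximum Value"),
    avg("Variable").alias("Average Value"),
)
summary.show()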
# Stop the Spark Session
spark.stop()