
max
The max() function in Apache Spark is an aggregation function that computes the maximum value of a column in a DataFrame.
Usage
- max() can be applied directly to a DataFrame to find the maximum value in a specific column.
- When combined with groupBy(), it returns the maximum value for each group in a column.
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import max

# Initialize Spark Session
spark = SparkSession.builder.appName("maxExample").getOrCreate()

# Sample DataFrame
data = [
    ("group A", 45), ("group A", 30), ("group A", 55),
    ("group B", 10), ("group B", 20), ("group B", 60),
]
columns = ["Group", "Variable"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+-------+--------+
| Group|Variable|
+-------+--------+
|group A| 45|
|group A| 30|
|group A| 55|
|group B| 10|
|group B| 20|
|group B| 60|
+-------+--------+
Example: Use max() to return the max value of a column
- max("Variable"): returns the max value of the Variable column.
- alias("Maximum Value"): renames the returned column to Maximum Value.
max_df = df.select(max("Variable").alias("Maximum Value"))
max_df.show()
Output:
+-------------+
|Maximum Value|
+-------------+
| 60|
+-------------+
Example: Use max() with groupBy() to return the max value of each group
- groupBy("Group"): groups the data by the Group column.
- max("Variable").alias("Maximum Value"): returns the max value of each group and renames the column to Maximum Value.
grouped_data = df.groupBy("Group").agg(max("Variable").alias("Maximum Value"))
grouped_data.show()
Output:
+-------+-------------+
| Group|Maximum Value|
+-------+-------------+
|group A| 55|
|group B| 60|
+-------+-------------+
# Stop the Spark Session
spark.stop()