
avg()
The avg() function in Apache Spark is an aggregation function used to calculate the average value of a numeric column in a DataFrame. avg() can be used on its own to compute the average of a column, or in conjunction with groupBy() to calculate the average for each group.
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

# Initialize Spark Session
spark = SparkSession.builder.appName("avgExample").getOrCreate()

# Sample DataFrame
data = [
    ("James", "classA", 85),
    ("Anna", "classA", 90),
    ("Robert", "classA", 88),
    ("James", "classB", 90),
    ("Anna", "classB", 80),
    ("Robert", "classB", 90),
    ("James", "classC", 82),
    ("Anna", "classC", 94),
    ("Robert", "classC", 92),
]
columns = ["Name", "Class", "Grade"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+------+------+-----+
| Name| Class|Grade|
+------+------+-----+
| James|classA| 85|
| Anna|classA| 90|
|Robert|classA| 88|
| James|classB| 90|
| Anna|classB| 80|
|Robert|classB| 90|
| James|classC| 82|
| Anna|classC| 94|
|Robert|classC| 92|
+------+------+-----+
Example: Use avg() function to compute the average of a numeric column
avg("Grade"): calculates the average of the entire Grade column.
alias("Average Grade"): renames the resulting average column to Average Grade.
df.select(avg("Grade").alias("Average Grade")).show()
Output:
+-----------------+
| Average Grade|
+-----------------+
|87.88888888888889|
+-----------------+
Example: Use avg() with groupBy() to calculate the average of each group in a column
groupBy("Class"): groups the data by the Class column.
avg("Grade"): calculates the average grade of each class, based on the groups formed by the Class column.
agg(): the agg() function is used to chain the avg() function together with the alias() function.
grouped_data = df.groupBy("Class").agg(avg("Grade").alias("Average Grade"))
grouped_data.show()
Output:
+------+-----------------+
| Class| Average Grade|
+------+-----------------+
|classA|87.66666666666667|
|classB|86.66666666666667|
|classC|89.33333333333333|
+------+-----------------+
# Stop the Spark Session
spark.stop()