
agg()
In Apache Spark, the agg() function is used to compute aggregations on a DataFrame. It is typically applied after a groupBy() operation and can evaluate multiple aggregation functions at once.
How It Works
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Initialize Spark Session
spark = SparkSession.builder.appName("aggExample").getOrCreate()

# Sample DataFrame
data = [("James", "Sales", 3000),
        ("Michael", "Sales", 4600),
        ("Robert", "Sales", 4100),
        ("Maria", "Finance", 3000),
        ("James", "Sales", 3000),
        ("Scott", "Finance", 3300),
        ("Jen", "Finance", 3900),
        ("Jeff", "Marketing", 3000),
        ("Kumar", "Marketing", 2000),
        ("Saif", "Sales", 4100)]
columns = ["employee_name", "department", "salary"]

df = spark.createDataFrame(data, schema=columns)
df.show()
Output:
+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
| James| Sales| 3000|
| Michael| Sales| 4600|
| Robert| Sales| 4100|
| Maria| Finance| 3000|
| James| Sales| 3000|
| Scott| Finance| 3300|
| Jen| Finance| 3900|
| Jeff| Marketing| 3000|
| Kumar| Marketing| 2000|
| Saif| Sales| 4100|
+-------------+----------+------+
Example: Apply multiple aggregate functions on grouped data
- We're going to group the DataFrame df by the department column.
- The agg() function is used to calculate the average, sum, and maximum salary for each department, and the alias() function renames the output columns for better readability.
# GroupBy and Agg
agg_df = df.groupBy("department") \
    .agg(F.avg("salary").alias("averageSalary"),
         F.sum("salary").alias("totalSalary"),
         F.max("salary").alias("maxSalary"))
agg_df.show()
Output:
+----------+-------------+-----------+---------+
|department|averageSalary|totalSalary|maxSalary|
+----------+-------------+-----------+---------+
| Sales| 3760.0| 18800| 4600|
| Finance| 3400.0| 10200| 3900|
| Marketing| 2500.0| 5000| 3000|
+----------+-------------+-----------+---------+
# Stop the Spark Session
spark.stop()