
agg()
In Apache Spark, the agg() function is used to compute aggregations on a DataFrame. It is typically applied after a groupBy() operation and can evaluate multiple aggregation functions at once.
How It Works
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Initialize Spark Session
spark = SparkSession.builder.appName("aggExample").getOrCreate()

# Sample DataFrame
data = [("James", "Sales", 3000),
        ("Michael", "Sales", 4600),
        ("Robert", "Sales", 4100),
        ("Maria", "Finance", 3000),
        ("James", "Sales", 3000),
        ("Scott", "Finance", 3300),
        ("Jen", "Finance", 3900),
        ("Jeff", "Marketing", 3000),
        ("Kumar", "Marketing", 2000),
        ("Saif", "Sales", 4100)]
columns = ["employee_name", "department", "salary"]

df = spark.createDataFrame(data, schema=columns)
df.show()
Output:
+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
| James| Sales| 3000|
| Michael| Sales| 4600|
| Robert| Sales| 4100|
| Maria| Finance| 3000|
| James| Sales| 3000|
| Scott| Finance| 3300|
| Jen| Finance| 3900|
| Jeff| Marketing| 3000|
| Kumar| Marketing| 2000|
| Saif| Sales| 4100|
+-------------+----------+------+
Example: Apply multiple aggregate functions on grouped data
- We're going to group the DataFrame df by the department column.
- The agg() function is used to calculate the average, sum, and maximum salary for each department, and the alias() function renames the output columns for better readability.
# GroupBy and Agg
agg_df = df.groupBy("department") \
    .agg(F.avg("salary").alias("averageSalary"),
         F.sum("salary").alias("totalSalary"),
         F.max("salary").alias("maxSalary"))
agg_df.show()
Output:
+----------+-------------+-----------+---------+
|department|averageSalary|totalSalary|maxSalary|
+----------+-------------+-----------+---------+
| Sales| 3760.0| 18800| 4600|
| Finance| 3400.0| 10200| 3900|
| Marketing| 2500.0| 5000| 3000|
+----------+-------------+-----------+---------+
# Stop the Spark Session
spark.stop()