
sumDistinct()
The sumDistinct() function computes the sum of all distinct values in a column of a DataFrame. This aggregation function is handy when a column contains duplicate values that should be counted only once in the sum.
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import sumDistinct

# Initialize Spark Session
spark = SparkSession.builder.appName("sumDistinctExample").getOrCreate()

# Sample DataFrame
data = [("James", 10), ("Anna", 15), ("James", 10), ("Robert", 20)]
columns = ["Name", "Score"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+------+-----+
| Name|Score|
+------+-----+
| James| 10|
| Anna| 15|
| James| 10|
|Robert| 20|
+------+-----+
Example: Use sumDistinct() to sum distinct values in a column
sumDistinct("Score"): sums all the distinct values in the Score column.
.alias("Distinct Sum of Scores"): renames the resulting column to Distinct Sum of Scores.
distinct_sum_df = df.select(sumDistinct("Score").alias("Distinct Sum of Scores"))
distinct_sum_df.show()
Output:
+----------------------+
|Distinct Sum of Scores|
+----------------------+
| 45|
+----------------------+
# Stop the Spark Session
spark.stop()