
approx_count_distinct()
The approx_count_distinct() function uses an approximation algorithm (HyperLogLog++) to quickly estimate the number of distinct values in a column. It is particularly helpful on large datasets, where computing an exact distinct count can be resource-intensive.
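To build intuition for how a sketch-based approximation can work, here is a simplified, illustrative HyperLogLog-style estimator in pure Python. This is only a sketch of the idea: Spark's actual HyperLogLog++ implementation adds bias correction and a sparse representation, and the function name and parameters below are hypothetical.

```python
import hashlib
import math

def approx_distinct(values, p=12):
    """Illustrative HyperLogLog-style estimate of a distinct count.

    Uses m = 2**p registers; this is a simplified sketch of the idea
    behind Spark's approx_count_distinct(), not its implementation.
    """
    m = 1 << p
    registers = [0] * m
    for v in values:
        # 64-bit hash of the value
        h = int.from_bytes(hashlib.sha256(str(v).encode()).digest()[:8], "big")
        idx = h & (m - 1)   # low p bits choose a register
        w = h >> p          # remaining bits
        # rank = 1 + number of trailing zero bits in w
        rank = (w & -w).bit_length() if w else (64 - p + 1)
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    estimate = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if estimate <= 2.5 * m and zeros:
        # small-range correction: fall back to linear counting
        estimate = m * math.log(m / zeros)
    return round(estimate)

print(approx_distinct(range(10000)))  # close to 10000, not exact
```

The key property is that memory use depends only on the number of registers (2**p), not on the number of distinct values, which is why this family of algorithms scales to very large datasets.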
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import approx_count_distinct

# Initialize Spark Session
spark = SparkSession.builder.appName("approxCountDistinctExample").getOrCreate()

# Sample DataFrame
data = [("James", "Sales"), ("Michael", "Sales"), ("Robert", "Sales"),
        ("Maria", "Finance"), ("James", "Sales"), ("Scott", "Finance")]
columns = ["Employee Name", "Department"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+-------------+----------+
|Employee Name|Department|
+-------------+----------+
| James| Sales|
| Michael| Sales|
| Robert| Sales|
| Maria| Finance|
| James| Sales|
| Scott| Finance|
+-------------+----------+
Example: Use the approx_count_distinct() function to count the number of distinct values in a column
approx_count_distinct("Department"): passing the Department column to the function counts the approximate number of distinct departments in that column.
alias("Distinct Departments"): renames the resulting column to Distinct Departments.
df.select(): selects only the calculated column.
approx_distinct_count = df.select(approx_count_distinct("Department").alias("Distinct Departments"))
approx_distinct_count.show()
Output:
+--------------------+
|Distinct Departments|
+--------------------+
| 2|
+--------------------+
# Stop the Spark Session
spark.stop()