
approx_count_distinct()
The approx_count_distinct() function uses an approximation algorithm (HyperLogLog++) to quickly estimate the number of distinct values in a column. It is particularly helpful on large datasets, where computing an exact distinct count can be resource-intensive.
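To build intuition for how a sketch-based approximation can work, here is a simplified, illustrative HyperLogLog-style estimator in pure Python. This is only a sketch of the idea: Spark's actual HyperLogLog++ implementation adds bias correction and a sparse representation, and the function name and parameters below are hypothetical.

```python
import hashlib
import math

def approx_distinct(values, p=12):
    """Illustrative HyperLogLog-style estimate of a distinct count.

    Uses m = 2**p registers; this is a simplified sketch of the idea
    behind Spark's approx_count_distinct(), not its implementation.
    """
    m = 1 << p
    registers = [0] * m
    for v in values:
        # 64-bit hash of the value
        h = int.from_bytes(hashlib.sha256(str(v).encode()).digest()[:8], "big")
        idx = h & (m - 1)   # low p bits choose a register
        w = h >> p          # remaining bits
        # rank = 1 + number of trailing zero bits in w
        rank = (w & -w).bit_length() if w else (64 - p + 1)
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    estimate = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if estimate <= 2.5 * m and zeros:
        # small-range correction: fall back to linear counting
        estimate = m * math.log(m / zeros)
    return round(estimate)

print(approx_distinct(range(10000)))  # close to 10000, not exact
```

The key property is that memory use depends only on the number of registers (2**p), not on the number of distinct values, which is why this family of algorithms scales to very large datasets.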
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import approx_count_distinct

# Initialize Spark Session
spark = SparkSession.builder.appName("approxCountDistinctExample").getOrCreate()

# Sample DataFrame
data = [("James", "Sales"), ("Michael", "Sales"), ("Robert", "Sales"),
        ("Maria", "Finance"), ("James", "Sales"), ("Scott", "Finance")]
columns = ["Employee Name", "Department"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+-------------+----------+
|Employee Name|Department|
+-------------+----------+
| James| Sales|
| Michael| Sales|
| Robert| Sales|
| Maria| Finance|
| James| Sales|
| Scott| Finance|
+-------------+----------+
Example: Use the approx_count_distinct() function to count the number of distinct values in a column
approx_count_distinct("Department"): passing the Department column to the function counts the approximate number of distinct departments in that column.
alias("Distinct Departments"): renames the resulting column to Distinct Departments.
df.select(): selects only the calculated column.
approx_distinct_count = df.select(approx_count_distinct("Department").alias("Distinct Departments"))
approx_distinct_count.show()
Output:
+--------------------+
|Distinct Departments|
+--------------------+
| 2|
+--------------------+
# Stop the Spark Session
spark.stop()