
count()
The count() function returns the total number of rows in a DataFrame. Combined with groupBy(), it counts the number of rows in each group instead.
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
# Initialize Spark Session
spark = SparkSession.builder.appName("countExample").getOrCreate()

# Sample DataFrame
data = [("James", "Sales"), ("Ana", "Sales"), ("Robert", "IT"), ("Maria", "IT")]
columns = ["Employee Name", "Department"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+-------------+----------+
|Employee Name|Department|
+-------------+----------+
| James| Sales|
| Ana| Sales|
| Robert| IT|
| Maria| IT|
+-------------+----------+
Example: Use count() to count the total number of rows in the DataFrame
total_count = df.count()
print("Total Row Count:", total_count)
Output:
Total Row Count: 4
Example: Use count() with groupBy() to count the number of rows in each group
groupBy("Department"): groups the DataFrame by the Department column.
count(): counts the number of rows in each department.
# Counting Rows by Group
grouped_count = df.groupBy("Department").count()
grouped_count.show()
Output:
+----------+-----+
|Department|count|
+----------+-----+
| Sales| 2|
| IT| 2|
+----------+-----+
# Stop the Spark Session
spark.stop()