Group Data
To obtain aggregated statistics, we can use the groupby() function, which groups data by specific criteria so that statistics can then be computed for each category.
In the case of the Iris dataset, we'll examine examples of how to get the count of records for each species, as well as calculate the mean and standard deviation for each numerical feature.
Let's begin by exploring how the groupby() function works together with aggregation functions to accomplish this task.
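Before turning to the Iris data, here is a minimal sketch of the idea on a toy table. The DataFrame toy and its group and value columns are made up for illustration and are not part of the Iris dataset.

```python
import pandas as pd

# Hypothetical toy data (not the Iris dataset)
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Split the rows by "group", apply mean() to "value",
# and combine the results into one Series indexed by group
group_means = toy.groupby("group")["value"].mean()
print(group_means)  # a -> 2.0, b -> 4.0
```

The same pattern (group, select a column, aggregate) is used in every example that follows.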
This tutorial uses the classic Iris dataset, which can be downloaded as a CSV file (Iris.csv).
import pandas as pd
df = pd.read_csv('Iris.csv')
1. Get Counts of Each Species through groupby() and count() Functions
In this example, we group the data by the distinct values in the Species column and then count the records in each category. The result shows that each category contains 50 records:
df.groupby('Species')['Id'].count()
Output:
Species
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
Name: Id, dtype: int64
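One detail worth knowing: count() counts only the non-null values in the selected column, while size() counts every row in the group. The two agree when the column has no missing values, as Id does in the output above. The sketch below uses a small made-up sample with one missing value (the sample frame and its values are assumptions for illustration only) to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing measurement
sample = pd.DataFrame({
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor"],
    "PetalLengthCm": [1.4, np.nan, 4.7],
})

# count() skips the NaN in the selected column
non_null_counts = sample.groupby("Species")["PetalLengthCm"].count()

# size() counts rows regardless of missing values
row_counts = sample.groupby("Species").size()

print(non_null_counts)  # Iris-setosa -> 1, Iris-versicolor -> 1
print(row_counts)       # Iris-setosa -> 2, Iris-versicolor -> 1
```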
2. Calculate Mean of Numerical Columns through groupby() and mean() Functions
In this example, we group the data by species and then apply mean(numeric_only=True) to compute the mean of every numerical column in the dataset:
df.groupby('Species').mean(numeric_only=True)
Output:
Species | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
---|---|---|---|---|---|
Iris-setosa | 25.5 | 5.006 | 3.418 | 1.464 | 0.244 |
Iris-versicolor | 75.5 | 5.936 | 2.770 | 4.260 | 1.326 |
Iris-virginica | 125.5 | 6.588 | 2.974 | 5.552 | 2.026 |
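The Id column is averaged along with the measurements, although the mean of a row identifier carries no meaning. One possible way to leave it out is to drop it before grouping. The sketch below assumes a small made-up sample with the same column names as the dataset; the values are illustrative, not taken from Iris.csv.

```python
import pandas as pd

# Hypothetical sample that mimics the dataset's column layout
sample = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "SepalLengthCm": [5.0, 5.2, 6.0, 6.4],
    "PetalLengthCm": [1.4, 1.6, 4.5, 4.7],
    "Species": ["Iris-setosa", "Iris-setosa",
                "Iris-versicolor", "Iris-versicolor"],
})

# Drop the identifier column, then average the remaining features per species
feature_means = (
    sample.drop(columns="Id")
          .groupby("Species")
          .mean(numeric_only=True)
)
print(feature_means)
```

Selecting the columns of interest after grouping, for example groupby('Species')[['SepalLengthCm', 'PetalLengthCm']], would give the same result.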
3. Calculate Standard Deviation of Numerical Columns through groupby() and std() Functions
In this example, we again group the data by species and compute the standard deviation of each numerical feature:
df.groupby('Species').std()
Output:
Species | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
---|---|---|---|---|---|
Iris-setosa | 14.57738 | 0.352490 | 0.381024 | 0.173511 | 0.107210 |
Iris-versicolor | 14.57738 | 0.516171 | 0.313798 | 0.469911 | 0.197753 |
Iris-virginica | 14.57738 | 0.635880 | 0.322497 | 0.551895 | 0.274650 |
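The count, mean, and standard deviation can also be requested in a single call by passing a list of function names to agg(). The sketch below is one way to do this on a small made-up sample (the values are chosen for easy arithmetic and are not from Iris.csv). Note that pandas computes the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Hypothetical sample with simple values
sample = pd.DataFrame({
    "Species": ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3,
    "PetalLengthCm": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# One row per species, one column per statistic
summary = sample.groupby("Species")["PetalLengthCm"].agg(["count", "mean", "std"])
print(summary)
# Iris-setosa:     count 3, mean 2.0, std 1.0
# Iris-versicolor: count 3, mean 5.0, std 1.0
```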
Excellent! We have explored methods for grouping data and computing group-specific statistics. In the upcoming tutorial, we'll delve into using custom functions to perform data transformations.