Sort a DataFrame
Sometimes, you may need to arrange the rows of a DataFrame based on specific criteria. To do this, we can use the sort_values()
function.
This tutorial uses classic Iris dataset, which can be downloaded here Iris dataset.
import pandas as pddf = pd.read_csv('Iris.csv')
Before sorting, the data is ordered by its Id column ascendingly.
df.head()
Output:
Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species | |
---|---|---|---|---|---|---|
0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
1. Sort a DataFrame by a Single Column
To sort a DataFrame by a specific column, such as SepalLengthCm, we can use the sort_values()
function. Simply provide the column name to the by
parameter like this: df.sort_values(by='SepalLengthCm')
. By default, this function arranges the data in ascending order.
df.sort_values(by=['SepalLengthCm']).head(10)
Output:
Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species | |
---|---|---|---|---|---|---|
13 | 14 | 4.3 | 3.0 | 1.1 | 0.1 | Iris-setosa |
42 | 43 | 4.4 | 3.2 | 1.3 | 0.2 | Iris-setosa |
38 | 39 | 4.4 | 3.0 | 1.3 | 0.2 | Iris-setosa |
8 | 9 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
41 | 42 | 4.5 | 2.3 | 1.3 | 0.3 | Iris-setosa |
22 | 23 | 4.6 | 3.6 | 1.0 | 0.2 | Iris-setosa |
3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
6 | 7 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
47 | 48 | 4.6 | 3.2 | 1.4 | 0.2 | Iris-setosa |
2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
In the returned DataFrame displayed above, you might notice that the Id column is no longer in ascending order. This is because the DataFrame has been sorted based on the SepalLengthCm column.
To sort the data by the SepalLengthCm column descendingly, we can set the ascending
parameter of the sort_values()
function to False
, as shown in the example below:
df.sort_values(by=['SepalLengthCm'], ascending=False).head(10)
Output:
Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species | |
---|---|---|---|---|---|---|
131 | 132 | 7.9 | 3.8 | 6.4 | 2.0 | Iris-virginica |
135 | 136 | 7.7 | 3.0 | 6.1 | 2.3 | Iris-virginica |
122 | 123 | 7.7 | 2.8 | 6.7 | 2.0 | Iris-virginica |
117 | 118 | 7.7 | 3.8 | 6.7 | 2.2 | Iris-virginica |
118 | 119 | 7.7 | 2.6 | 6.9 | 2.3 | Iris-virginica |
105 | 106 | 7.6 | 3.0 | 6.6 | 2.1 | Iris-virginica |
130 | 131 | 7.4 | 2.8 | 6.1 | 1.9 | Iris-virginica |
107 | 108 | 7.3 | 2.9 | 6.3 | 1.8 | Iris-virginica |
125 | 126 | 7.2 | 3.2 | 6.0 | 1.8 | Iris-virginica |
109 | 110 | 7.2 | 3.6 | 6.1 | 2.5 | Iris-virginica |
The data has now been aranged in descending order based on the SepalLengthCm column.
2. Sort a DataFrame by Multiple Columns
Occasionally, a situation arises where a tie occurs while sorting data based on a single column. In such cases, we can enhance our sorting strategy by including an additional column within the by
argument. This way, if there's a tie in the primary column, the data will be sorted by the secondary column.
Let's see the following example where we use the sort_values()
function to sort the data based on both the SepalLengthCm and SepalWidthCm columns.
df.sort_values(by=['SepalLengthCm', 'SepalWidthCm']).head(10)
Output:
Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species | |
---|---|---|---|---|---|---|
13 | 14 | 4.3 | 3.0 | 1.1 | 0.1 | Iris-setosa |
8 | 9 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
38 | 39 | 4.4 | 3.0 | 1.3 | 0.2 | Iris-setosa |
42 | 43 | 4.4 | 3.2 | 1.3 | 0.2 | Iris-setosa |
41 | 42 | 4.5 | 2.3 | 1.3 | 0.3 | Iris-setosa |
3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
47 | 48 | 4.6 | 3.2 | 1.4 | 0.2 | Iris-setosa |
6 | 7 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
22 | 23 | 4.6 | 3.6 | 1.0 | 0.2 | Iris-setosa |
2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
The data has now been primarily sorted based on the SepalLengthCm column. In the event of a tie in the SepalLengthCm column, it is further sorted by the secondary column SepalWidthCm. For instance, for rows with SepalLengthCm value of 4.4, like Id
of 9, 39, and 43, they have been further ordered based on their SepalWidthCm values to resolve the tie.
3. Sort a DataFrame in Ascending and Descending Order
If you want to sort a DataFrame in ascending order based on the SepalLengthCm column and in descending order based on the SepalWidthCm column, you can achieve this by providing a list of boolean values to specify the sorting order for each column using the ascending
parameter within the sort_values
function. Here's how:
df.sort_values(by=['SepalLengthCm', 'SepalWidthCm'], ascending=[True, False]).head(10)
Output:
Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species | |
---|---|---|---|---|---|---|
13 | 14 | 4.3 | 3.0 | 1.1 | 0.1 | Iris-setosa |
42 | 43 | 4.4 | 3.2 | 1.3 | 0.2 | Iris-setosa |
38 | 39 | 4.4 | 3.0 | 1.3 | 0.2 | Iris-setosa |
8 | 9 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
41 | 42 | 4.5 | 2.3 | 1.3 | 0.3 | Iris-setosa |
22 | 23 | 4.6 | 3.6 | 1.0 | 0.2 | Iris-setosa |
6 | 7 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
47 | 48 | 4.6 | 3.2 | 1.4 | 0.2 | Iris-setosa |
3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
Excellent! In the next tutorial, we'll delve into another crucial aspect of Data Wrangling: Replace Values.