Select a Subset of Data
When working with data, you may not always need all columns or want to anallyze a specific portion of your dataset. In such situations, you can use the loc[]
or iloc[]
functions to extract the desired subset of data.
This tutorial uses classic Iris dataset, which can be downloaded here Iris dataset.
import pandas as pddf = pd.read_csv('Iris.csv')
1. Select a Subset of Data via iloc[]
The df.iloc[]
function is used for integer based indexing for selection by position, allowing access to a group of rows and columns by integer indices.
Usage
.iloc[row_index, column_index]
takes two arguments:
- row_index: This argument specifies the rows you want to select based on their integer positions. You can provide a single integer, a slice (e.g., 0:5), or a list of integers to select specific rows.
- column_index: This argument specifies the columns you want to select based on their integer positions. Similar to the rows, you can provide a single integer, a slice, or a list of integers to select specific columns.
Select One Column by Column Index
In this example, we've selected all rows using :
and specifically chosen the first column with [0]
.
df.iloc[:, [0]]
Output:
Id | |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
3 | 4 |
4 | 5 |
... | ... |
145 | 146 |
146 | 147 |
147 | 148 |
148 | 149 |
149 | 150 |
150 rows × 1 columns
Select First 3 Rows and One column
In this example. we've selected the first 3 rows of data with :3
and the second column with [1]
.
df.iloc[:3, [1]]
Output:
SepalLengthCm | |
---|---|
0 | 5.1 |
1 | 4.9 |
2 | 4.7 |
2. Select a Subset of Data via loc[]
Different from iloc[]
using positions, loc[]
accesses specific rows and columns by labels or a filter rows based on a condition.
Select one column
In this example, we've selected the first 5 rows and the SepalLengthCm
column.
It's essential to understand that in Python, data slicing with [:]
is typically left-inclusive and right-exclusive. However, with loc[:]
, it becomes inclusive on both sides. Therefore, to select the first 5 rows, we used loc[:4]
or iloc[:5]
.
df.loc[:4, ['SepalLengthCm']]
Output:
SepalLengthCm | |
---|---|
0 | 5.1 |
1 | 4.9 |
2 | 4.7 |
3 | 4.6 |
4 | 5.0 |
Select multiple columns
Select first 10 rows and two columns out from the data using loc[]
df.loc[:9, ['SepalLengthCm', 'SepalWidthCm']]
Output:
SepalLengthCm | SepalWidthCm | |
---|---|---|
0 | 5.1 | 3.5 |
1 | 4.9 | 3.0 |
2 | 4.7 | 3.2 |
3 | 4.6 | 3.1 |
4 | 5.0 | 3.6 |
5 | 5.4 | 3.9 |
6 | 4.6 | 3.4 |
7 | 5.0 | 3.4 |
8 | 4.4 | 2.9 |
9 | 4.9 | 3.1 |
150 rows × 1 columns
3. Filter Data Based on Conditions with loc
Single condition Example
select records where its SepalLengthCm
column values are larger than 7.
df.loc[df['SepalLengthCm']>7]
Output:
Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species | |
---|---|---|---|---|---|---|
102 | 103 | 7.1 | 3.0 | 5.9 | 2.1 | Iris-virginica |
105 | 106 | 7.6 | 3.0 | 6.6 | 2.1 | Iris-virginica |
107 | 108 | 7.3 | 2.9 | 6.3 | 1.8 | Iris-virginica |
109 | 110 | 7.2 | 3.6 | 6.1 | 2.5 | Iris-virginica |
117 | 118 | 7.7 | 3.8 | 6.7 | 2.2 | Iris-virginica |
118 | 119 | 7.7 | 2.6 | 6.9 | 2.3 | Iris-virginica |
122 | 123 | 7.7 | 2.8 | 6.7 | 2.0 | Iris-virginica |
125 | 126 | 7.2 | 3.2 | 6.0 | 1.8 | Iris-virginica |
129 | 130 | 7.2 | 3.0 | 5.8 | 1.6 | Iris-virginica |
130 | 131 | 7.4 | 2.8 | 6.1 | 1.9 | Iris-virginica |
131 | 132 | 7.9 | 3.8 | 6.4 | 2.0 | Iris-virginica |
135 | 136 | 7.7 | 3.0 | 6.1 | 2.3 | Iris-virginica |
Multiple conditions example
To select records where its SepalLengthCm
are larger than 7 and PetalWidthCm
are smaller than 2, we can use the &
operator for 'and' to combine the two conditions. It's important to remember that when using and &
operator and or |
opeartor to combine multiple conditions, we need to use parenthesis ()
to wrap each condition to ensure the filtering is working properly.
df.loc[(df['SepalLengthCm']>7) & (df['PetalWidthCm']<2)]
Output:
Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species | |
---|---|---|---|---|---|---|
107 | 108 | 7.3 | 2.9 | 6.3 | 1.8 | Iris-virginica |
125 | 126 | 7.2 | 3.2 | 6.0 | 1.8 | Iris-virginica |
129 | 130 | 7.2 | 3.0 | 5.8 | 1.6 | Iris-virginica |
130 | 131 | 7.4 | 2.8 | 6.1 | 1.9 | Iris-virginica |
Awesome! In this lesson, you've learned the skills to select specific data subsets using iloc
and loc
. In our next chapter, we'll delve into data sorting. Stay tuned!