Quick Overview of a DataFrame
After loading data into a pandas DataFrame, the first step is to inspect its characteristics and structure.
This tutorial uses the classic Iris dataset, loaded below from a CSV file named Iris.csv.
import pandas as pd
df = pd.read_csv('Iris.csv')
1. DataFrame Overview with df.head()
In pandas, the df.head(n) method returns the first n rows of a DataFrame, giving a quick overview of your data. If n is not specified, it defaults to 5.
df.head()
Output:
| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
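As a minimal sketch of how n changes the result, the snippet below calls head() with and without an argument. The `demo` DataFrame and its values are made up for illustration; they are not the Iris data.

```python
import pandas as pd

# Hypothetical stand-in data with 8 rows, so the default of 5 is visible.
demo = pd.DataFrame({"Id": range(1, 9), "Value": [x * 0.5 for x in range(8)]})

first_default = demo.head()   # n omitted, so the first 5 rows are returned
first_three = demo.head(3)    # explicit n, so the first 3 rows are returned

print(len(first_default))  # 5
print(len(first_three))    # 3
```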
2. Inspect Last n Rows with df.tail()
Similarly, df.tail(n) returns the last n rows of a DataFrame. As with df.head(), n defaults to 5 if omitted.
df.tail(5)
Output:
| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 145 | 146 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 147 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 148 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 149 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 150 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
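Note that the rows returned by tail() keep their original index labels, which is why the output above starts at 145 rather than 0. The sketch below shows this on a small made-up DataFrame (`demo` is a hypothetical stand-in, not the Iris data).

```python
import pandas as pd

# Hypothetical stand-in data with 8 rows and the default 0..7 index.
demo = pd.DataFrame({"Id": range(1, 9)})

last_default = demo.tail()   # n omitted, so the last 5 rows are returned
last_two = demo.tail(2)      # explicit n, so the last 2 rows are returned

print(len(last_default))     # 5
print(list(last_two.index))  # [6, 7]
print(list(last_two["Id"]))  # [7, 8]
```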
3. Get Dimensionality of a DataFrame
To quickly check how many rows and columns a DataFrame has, use its shape attribute. It returns a tuple in the format (rows, columns). As shown below, the Iris data has 150 rows and 6 columns.
df.shape
Output:
(150, 6)
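Because shape is a plain tuple, it can be unpacked directly into row and column counts. The following is a small sketch on a made-up DataFrame (`demo`, `n_rows`, and `n_cols` are illustrative names, not part of the tutorial's data).

```python
import pandas as pd

# Hypothetical stand-in data: 3 rows, 2 columns.
demo = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# shape is an attribute, not a method, so it takes no parentheses.
n_rows, n_cols = demo.shape

print(n_rows, n_cols)     # 3 2
print(len(demo))          # 3, the row count on its own
print(len(demo.columns))  # 2, the column count on its own
```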
4. DataFrame Summary with info()
The info() method is a handy tool for getting a concise summary of a DataFrame. The summary shows the number of rows in the dataset, the count of non-null values in each column, and the data type of each column.
You can identify missing data by comparing each column's Non-Null Count to the total number of entries reported on the RangeIndex line at the top of the summary. Here every column has 150 non-null values out of 150 entries, so nothing is missing.
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
5 Species 150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
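The comparison between non-null counts and the total row count can also be done programmatically. The sketch below uses a small made-up DataFrame with deliberately missing values (the Iris data itself has none); `demo`, `non_null`, and `missing` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with one missing value per column.
demo = pd.DataFrame({
    "SepalLengthCm": [5.1, np.nan, 4.7, 4.6],
    "Species": ["Iris-setosa", "Iris-setosa", None, "Iris-setosa"],
})

total_rows = len(demo)           # the entry count info() reports at the top
non_null = demo.count()          # per-column non-null counts, as listed by info()
missing = total_rows - non_null  # per-column number of missing values

print(missing)
```

The same per-column result is available directly from demo.isna().sum().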
5. Numerical Feature Statistics with describe()
The describe() method summarizes the numerical columns of a DataFrame; by default it skips non-numerical columns. For each column it reports the count, mean, standard deviation, minimum, maximum, and the 25th, 50th (median), and 75th percentiles. In our example, it computes these statistics for the first 5 columns in the DataFrame, excluding the Species column because it is not numerical.
This summary helps us quickly gauge the spread of values and spot skew in the columns. Not every statistic is meaningful for every numerical column, however: the Id column, for example, is only a row identifier.
df.describe()
Output:
| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean | 75.500000 | 5.843333 | 3.054000 | 3.758667 | 1.198667 |
| std | 43.445368 | 0.828066 | 0.433594 | 1.764420 | 0.763161 |
| min | 1.000000 | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 38.250000 | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 75.500000 | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
| 75% | 112.750000 | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 150.000000 | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
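One way to keep an identifier column such as Id out of the summary is to drop it before calling describe(). This is a sketch on a small made-up DataFrame rather than the Iris data; `demo` and `stats` are illustrative names. It also checks that the 50% row is the column median.

```python
import pandas as pd

# Hypothetical stand-in data: an identifier, one measurement, one text column.
demo = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "PetalLengthCm": [1.4, 1.3, 4.5, 5.1, 6.0],
    "Species": ["setosa", "setosa", "versicolor", "virginica", "virginica"],
})

# Drop the identifier first; describe() then skips the text column by default.
stats = demo.drop(columns=["Id"]).describe()

print(list(stats.columns))  # ['PetalLengthCm']
print(stats.loc["50%", "PetalLengthCm"] == demo["PetalLengthCm"].median())  # True
```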
Great! Now that we've learned how to get a quick glimpse of the data, let's move on to the method for selecting a specific subset of the dataset.