Quick Overview of a DataFrame

After loading data into a pandas DataFrame, the first step is to inspect its structure and basic characteristics.

This tutorial uses the classic Iris dataset, loaded below from a local file named Iris.csv.

import pandas as pd
df = pd.read_csv('Iris.csv')

1. DataFrame Overview with `df.head()`

In pandas, the df.head(n) method returns the first n rows of a DataFrame, giving a quick look at the column names and typical values. If n is not specified, it defaults to 5.

df.head()

Output:


   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa
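To make the role of n concrete, here is a minimal sketch. It uses a small made-up frame named demo (an assumption for illustration, not the Iris file) so that it runs on its own.

```python
import pandas as pd

# Hypothetical 10-row stand-in frame; not part of the Iris data.
demo = pd.DataFrame({"x": range(10), "y": list("abcdefghij")})

first_three = demo.head(3)        # rows at positions 0, 1, 2
default_head = demo.head()        # n defaults to 5
all_but_last_two = demo.head(-2)  # negative n: every row except the last 2

print(first_three)
print(len(default_head), len(all_but_last_two))
```

A negative n is handy when you want to drop a known number of trailing rows, such as a footer, without computing the length first.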

2. Inspect Last n Rows with `df.tail()`

Like df.head(), df.tail(n) returns the last n rows of a DataFrame, and n also defaults to 5.

df.tail(5)

Output:


      Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm         Species
145  146            6.7           3.0            5.2           2.3  Iris-virginica
146  147            6.3           2.5            5.0           1.9  Iris-virginica
147  148            6.5           3.0            5.2           2.0  Iris-virginica
148  149            6.2           3.4            5.4           2.3  Iris-virginica
149  150            5.9           3.0            5.1           1.8  Iris-virginica
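The following sketch shows that tail() keeps the original index labels, which is why the output above starts at 145 rather than 0. The frame demo is a hypothetical stand-in, not the Iris data.

```python
import pandas as pd

# Hypothetical 10-row frame used only for illustration.
demo = pd.DataFrame({"x": range(10)})

last_three = demo.tail(3)          # rows labelled 7, 8, 9
all_but_first_two = demo.tail(-2)  # negative n: every row except the first 2

print(last_three.index.tolist())
print(all_but_first_two.index.tolist())
```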

3. Get Dimensionality of a DataFrame

To check how many rows and columns a DataFrame has, use its shape attribute. It is an attribute rather than a method, so it is written without parentheses, and it returns a tuple in the format (rows, columns). As shown below, the Iris data has 150 rows and 6 columns.

df.shape

Output:
(150, 6)
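Because shape is a plain tuple, it can be unpacked directly. The sketch below, again on a small hypothetical frame, also shows two related ways to get counts: len() for the row count and the size attribute for the total number of cells.

```python
import pandas as pd

# Hypothetical 3-row, 2-column frame used only for illustration.
demo = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows, n_cols = demo.shape  # tuple unpacking: (rows, columns)
row_count = len(demo)        # same as demo.shape[0]
cell_count = demo.size       # rows * columns

print(n_rows, n_cols, row_count, cell_count)
```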

4. DataFrame Summary with `info()`

The info() method gives a concise summary of a DataFrame: the number of rows (entries), the number of non-null values in each column, the data type of each column, and the approximate memory usage.

You can identify missing data by comparing each column's Non-Null Count to the total entry count on the RangeIndex line at the top of the summary. Here every column has 150 non-null values out of 150 entries, so nothing is missing.

df.info()

Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             150 non-null    int64
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
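info() prints its summary rather than returning it, so when you need the missing-value counts as data, a common approach is isna().sum(). The sketch below uses a small hypothetical frame with gaps deliberately inserted, since the Iris data has none.

```python
import pandas as pd

# Hypothetical frame with one missing value per column (not the Iris data).
demo = pd.DataFrame({
    "length": [5.1, float("nan"), 4.7, 4.6],
    "species": ["setosa", "setosa", None, "virginica"],
})

non_null = demo.count()     # per-column non-null counts, as in info()'s Non-Null Count
missing = demo.isna().sum() # per-column missing counts

print(non_null)
print(missing)
print(demo.dtypes)          # per-column data types, as in info()'s Dtype column
```

For every column, the non-null count plus the missing count equals the number of rows.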

5. Numerical Feature Statistics with `describe()`

The describe() method summarizes the numerical columns of a DataFrame; by default it skips non-numeric columns. For each numeric column it reports the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum. The 50% row is the median; the mode is not included. In our example, it computes these statistics for the first 5 columns in the DataFrame, excluding the Species column because it is non-numerical.

This summary helps us quickly see how values are spread and spot possible skew, for example when the mean and the median (50%) differ noticeably. Note that it reports statistics for every numerical column, even ones where they carry little meaning, such as the Id column, which is only a row identifier.

df.describe()

Output:


               Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count  150.000000     150.000000    150.000000     150.000000    150.000000
mean    75.500000       5.843333      3.054000       3.758667      1.198667
std     43.445368       0.828066      0.433594       1.764420      0.763161
min      1.000000       4.300000      2.000000       1.000000      0.100000
25%     38.250000       5.100000      2.800000       1.600000      0.300000
50%     75.500000       5.800000      3.000000       4.350000      1.300000
75%    112.750000       6.400000      3.300000       5.100000      1.800000
max    150.000000       7.900000      4.400000       6.900000      2.500000
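A few variations are worth knowing; the sketch below is one way to apply them, using a tiny hypothetical frame with Iris-like column names. It drops the identifier column before summarizing, describes a text column on its own, and computes the median and mode separately.

```python
import pandas as pd

# Hypothetical 5-row frame with Iris-like column names (values are made up).
demo = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "PetalLengthCm": [1.4, 1.4, 1.3, 4.7, 6.0],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-setosa",
                "Iris-versicolor", "Iris-virginica"],
})

# Leave out the identifier column so its statistics do not clutter the summary.
numeric_summary = demo.drop(columns="Id").describe()

# On a text column, describe() reports count, unique, top, and freq instead.
species_summary = demo["Species"].describe()

# The median matches the 50% row; the mode has to be requested explicitly.
median = demo["PetalLengthCm"].median()
modes = demo["PetalLengthCm"].mode()  # a Series, since there can be ties

print(numeric_summary)
print(species_summary)
print(median, modes.tolist())
```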

Great! Now that we've learned how to get a quick glimpse of the data, let's move on to the method for selecting a specific subset of the dataset.
