Merge DateFrames
In real life, the data we need often comes from different places like databases or files. Thus, it requires us to combine various data into a single table before preceeding with analysis. In this lesson, we'll explore how to combine two dataframes based on columns.
In pandas, you can use both the merge()
and join()
functions to merge DataFrames, but they have different behaviors. merge()
is a more versatile and powerful method that provides finer control when combining DataFrames, whereas join()
is simpler. In this tutorial, we'll focus on using the merge()
function because of its flexibility.
To demonstrate how to merge DataFrames using the merge()
function, we'll create two example dataframes.
Create dataframes
We'll create two dataframes: one named df_restaurants, containing restaurant IDs and their respective locations; and another dataframe named df_ratings, containing restaurant IDs and their corresponding ratings.
import pandas as pd
restaurants = {'id': [1, 2, 3, 4], 'city': ['NY', 'LA', 'SF', 'Boston'] }df_restaurants = pd.DataFrame(restaurants)df_restaurants
Output:
id | city | |
---|---|---|
0 | 1 | NY |
1 | 2 | LA |
2 | 3 | SF |
3 | 4 | Boston |
ratings = {'id': [2, 3, 4, 5], 'rating': [4.2, 4.5, 4.0, 4.1] }df_ratings = pd.DataFrame(ratings)df_ratings
Output:
id | rating | |
---|---|---|
0 | 2 | 4.2 |
1 | 3 | 4.5 |
2 | 4 | 4.0 |
3 | 5 | 4.1 |
Merge two dataframes
The next step is to use the merge()
function to merge df_restaurants and df_ratings based on their common restaurant IDs. This merging process will create a single table including both the restaurant's location and its corresponding rating information.
Here is the syntax for the merge()
function: df_left.merge(df_right, how='inner', on=None)
- df_left and df_right are the two DataFrames that you want to join. You can place them in any order, but it's common to put the dataframe with the main information on the left.
- how: the default value is
inner
, showing only the matching records. Other options include "outer", "left" and "right". Examples of each option is provided below. - on: this parameter specifies the columns to use for joining, and these columns must exist in both dataframes. If the joining columns have different names in the two DataFrames, you can use the left_on and right_on parameters to specify the respective column names.
- left_on: use this parameter to specify the column names to join on in the left DataFrame.
- right_on: use this parameter to specify the column names to join on in the right DataFrame.
Example: Inner join
An inner join displays records that exist in both DataFrames. In our example, only restaurants 2, 3 and 4 are found in both location and rating tables.
df_restaurants.merge(df_ratings, on='id', how='inner')
Output:
id | city | rating | |
---|---|---|---|
0 | 2 | LA | 4.2 |
1 | 3 | SF | 4.5 |
2 | 4 | Boston | 4.0 |
Example: outer join
An outer join combines data from two dataframes and displays all records, filling in missing values with NaN. In our example, restaurant 1 doesn't have a matching entry in the rating table, so its rating is displayed as NaN. Similarly, restaurant 5 doesn't appear in the location table, resulting in a NaN value for its location.
df_restaurants.merge(df_ratings, on='id', how='outer')
Output:
id | city | rating | |
---|---|---|---|
0 | 1 | NY | NaN |
1 | 2 | LA | 4.2 |
2 | 3 | SF | 4.5 |
3 | 4 | Boston | 4.0 |
4 | 5 | NaN | 4.1 |
Example: left join
A left join displays all the records from the left DataFrame and includes only the matching records from the right DataFrame. In this specific case, the merged DataFrame will contain information about restaurants 1 to 4.
df_restaurants.merge(df_ratings, on='id', how='left')
Output:
id | city | rating | |
---|---|---|---|
0 | 1 | NY | NaN |
1 | 2 | LA | 4.2 |
2 | 3 | SF | 4.5 |
3 | 4 | Boston | 4.0 |
Example: right join
A right join displays all the records from the right DataFrame while including only the matching records from the left DataFrame. Therefore, in the merged DataFrame, you will find restaurants 2 to 5.
df_restaurants.merge(df_ratings, on='id', how='right')
Output:
id | city | rating | |
---|---|---|---|
0 | 2 | LA | 4.2 |
1 | 3 | SF | 4.5 |
2 | 4 | Boston | 4.0 |
3 | 5 | NaN | 4.1 |
Excellent! In this tutorial, we have explored various methods for merging DataFrames. In our next tutorial, we will delve into an alternative approach to combine DataFrames that does not rely on having common columns. Please stay tuned for more insights!