
join()
The join() operation combines two DataFrames based on a common key. Different types of joins are available, each serving specific requirements.
Types of Joins
- Inner Join
- Full Outer Join
- Left Outer Join
- Right Outer Join
- Cartesian Join (Cross Join)
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
# Initialize Spark Session
spark = SparkSession.builder.appName("joinExamples").getOrCreate()
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df1.show()
df2 = spark.createDataFrame([(1, "Apple"), (3, "Banana")], ["id", "favorite_fruit"])
df2.show()
Output:
+---+-----+
| id| name|
+---+-----+
| 1|Alice|
| 2| Bob|
+---+-----+
+---+--------------+
| id|favorite_fruit|
+---+--------------+
| 1| Apple|
| 3| Banana|
+---+--------------+
Example: Apply Inner Join
- Combines rows from both DataFrames that match the join condition.
- Syntax:
df1.join(df2, on="key", how="inner")
inner_join_df = df1.join(df2, df1.id == df2.id, how="inner")
inner_join_df.show()
Output:
+---+-----+---+--------------+
| id| name| id|favorite_fruit|
+---+-----+---+--------------+
| 1|Alice| 1| Apple|
+---+-----+---+--------------+
Example: Apply Full Outer Join
- Returns all rows from both DataFrames, with matching rows from both sides where available.
- If there is no match, the result is null on the side of the DataFrame without a match.
- Syntax:
df1.join(df2, on="key", how="outer")
or df1.join(df2, on="key", how="full")
full_outer_join_df = df1.join(df2, df1.id == df2.id, how="outer")
full_outer_join_df.show()
Output:
+----+-----+----+--------------+
| id| name| id|favorite_fruit|
+----+-----+----+--------------+
| 1|Alice| 1| Apple|
| 2| Bob|NULL| NULL|
|NULL| NULL| 3| Banana|
+----+-----+----+--------------+
Example: Apply Left Outer Join
- Returns all rows from the left DataFrame, and the matched rows from the right DataFrame.
- The result is null on the right side if there is no match.
- Syntax:
df1.join(df2, on="key", how="left_outer")
or df1.join(df2, on="key", how="left")
left_outer_join_df = df1.join(df2, df1.id == df2.id, how="left_outer")
left_outer_join_df.show()
Output:
+---+-----+----+--------------+
| id| name| id|favorite_fruit|
+---+-----+----+--------------+
| 1|Alice| 1| Apple|
| 2| Bob|NULL| NULL|
+---+-----+----+--------------+
Example: Apply Right Outer Join
- Similar to the left outer join, but returns all rows from the right DataFrame.
- Syntax:
df1.join(df2, on="key", how="right_outer")
or df1.join(df2, on="key", how="right")
right_outer_join_df = df1.join(df2, df1.id == df2.id, how="right_outer")
right_outer_join_df.show()
Output:
+----+-----+---+--------------+
| id| name| id|favorite_fruit|
+----+-----+---+--------------+
| 1|Alice| 1| Apple|
|NULL| NULL| 3| Banana|
+----+-----+---+--------------+
Example: Apply Cartesian Join
- Combines each row of the first DataFrame with every row of the second DataFrame.
- Often used for exhaustive combinations.
- Syntax:
df1.crossJoin(df2)
cartesian_join_df = df1.crossJoin(df2)
cartesian_join_df.show()
Output:
+---+-----+---+--------------+
| id| name| id|favorite_fruit|
+---+-----+---+--------------+
| 1|Alice| 1| Apple|
| 1|Alice| 3| Banana|
| 2| Bob| 1| Apple|
| 2| Bob| 3| Banana|
+---+-----+---+--------------+
# Stop the Spark Session
spark.stop()