
join()
The join() operation combines two DataFrames based on a common key. Different types of joins are available, each serving specific requirements.
Types of Joins
- Inner Join
- Full Outer Join
- Left Outer Join
- Right Outer Join
- Cartesian Join (Cross Join)
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
# Initialize Spark Session
spark = SparkSession.builder.appName("joinExamples").getOrCreate()
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df1.show()
df2 = spark.createDataFrame([(1, "Apple"), (3, "Banana")], ["id", "favorite_fruit"])
df2.show()
Output:
+---+-----+
| id| name|
+---+-----+
| 1|Alice|
| 2| Bob|
+---+-----+
+---+--------------+
| id|favorite_fruit|
+---+--------------+
| 1| Apple|
| 3| Banana|
+---+--------------+
Example: Apply Inner Join
- Combines rows from both DataFrames that match the join condition.
- Syntax:
df1.join(df2, on="key", how="inner")
inner_join_df = df1.join(df2, df1.id == df2.id, how="inner")
inner_join_df.show()
Output:
+---+-----+---+--------------+
| id| name| id|favorite_fruit|
+---+-----+---+--------------+
| 1|Alice| 1| Apple|
+---+-----+---+--------------+
Example: Apply Full Outer Join
- Returns all rows from both DataFrames, with matching rows from both sides where available.
- If there is no match, the result is null on the side of the DataFrame without a match.
- Syntax:
df1.join(df2, on="key", how="outer")
or df1.join(df2, on="key", how="full")
full_outer_join_df = df1.join(df2, df1.id == df2.id, how="outer")
full_outer_join_df.show()
Output:
+----+-----+----+--------------+
| id| name| id|favorite_fruit|
+----+-----+----+--------------+
| 1|Alice| 1| Apple|
| 2| Bob|NULL| NULL|
|NULL| NULL| 3| Banana|
+----+-----+----+--------------+
Example: Apply Left Outer Join
- Returns all rows from the left DataFrame, and the matched rows from the right DataFrame.
- The result is null on the right side if there is no match.
- Syntax:
df1.join(df2, on="key", how="left_outer")
or df1.join(df2, on="key", how="left")
left_outer_join_df = df1.join(df2, df1.id == df2.id, how="left_outer")
left_outer_join_df.show()
Output:
+---+-----+----+--------------+
| id| name| id|favorite_fruit|
+---+-----+----+--------------+
| 1|Alice| 1| Apple|
| 2| Bob|NULL| NULL|
+---+-----+----+--------------+
Example: Apply Right Outer Join
- Similar to the left outer join, but returns all rows from the right DataFrame.
- Syntax:
df1.join(df2, on="key", how="right_outer")
or df1.join(df2, on="key", how="right")
right_outer_join_df = df1.join(df2, df1.id == df2.id, how="right_outer")
right_outer_join_df.show()
Output:
+----+-----+---+--------------+
| id| name| id|favorite_fruit|
+----+-----+---+--------------+
| 1|Alice| 1| Apple|
|NULL| NULL| 3| Banana|
+----+-----+---+--------------+
Example: Apply Cartesian Join
- Combines each row of the first DataFrame with every row of the second DataFrame.
- Often used for exhaustive combinations.
- Syntax:
df1.crossJoin(df2)
cartesian_join_df = df1.crossJoin(df2)
cartesian_join_df.show()
Output:
+---+-----+---+--------------+
| id| name| id|favorite_fruit|
+---+-----+---+--------------+
| 1|Alice| 1| Apple|
| 1|Alice| 3| Banana|
| 2| Bob| 1| Apple|
| 2| Bob| 3| Banana|
+---+-----+---+--------------+
# Stop the Spark Session
spark.stop()