`union()` and `unionByName()`

union() and unionByName() both concatenate the rows of two DataFrames. union() matches columns by position, while unionByName() matches them by name.

`union()` vs `unionByName()`

union() matches columns by position and ignores column names: the first column of the first DataFrame is combined with the first column of the second, and so on. The result takes its column names from the first DataFrame. It is best used when both DataFrames have the same columns in the same order.
unionByName() aligns columns by name rather than by position, which makes it useful when the DataFrames share the same column names but in a different order. Columns with the same name are combined regardless of where they appear.
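
The two alignment rules can be modeled in a few lines of plain Python. This is only an illustrative sketch, not Spark code: the helper names `union_by_position` and `union_by_name` are hypothetical, and each table is represented as a list of column names plus a list of row tuples.

```python
# Plain-Python model of the two alignment rules (illustration only, not Spark code).
# Both helper names are hypothetical and exist only for this sketch.

def union_by_position(cols_a, rows_a, cols_b, rows_b):
    # The second input's column names are ignored; only position matters.
    return cols_a, rows_a + rows_b


def union_by_name(cols_a, rows_a, cols_b, rows_b):
    # Reorder each row of the second input to follow the first input's column order.
    index = [cols_b.index(c) for c in cols_a]
    return cols_a, rows_a + [tuple(row[i] for i in index) for row in rows_b]


cols_a, rows_a = ["id", "name"], [(1, "Alice")]
cols_b, rows_b = ["name", "id"], [("Bob", 2)]

print(union_by_position(cols_a, rows_a, cols_b, rows_b))
print(union_by_name(cols_a, rows_a, cols_b, rows_b))
```

In the positional result the second row stays as ("Bob", 2) under the headers id and name, while the by-name result reorders it to (2, "Bob").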

Create a Spark session and sample DataFrames

from pyspark.sql import SparkSession
from pyspark.sql import Row

# Initialize Spark Session
spark = SparkSession.builder.appName("unionExample").getOrCreate()

# Create Sample DataFrames
df1 = spark.createDataFrame([Row(name="Alice", age=31), Row(name="Bob", age=22)])
df1.show()
df2 = spark.createDataFrame([Row(name="Charlie", age=47), Row(name="David", age=29)])
df2.show()

Output:
+-----+---+
| name|age|
+-----+---+
|Alice| 31|
|  Bob| 22|
+-----+---+
+-------+---+
|   name|age|
+-------+---+
|Charlie| 47|
|  David| 29|
+-------+---+

Example: Use `union()` to combine two DataFrames with the same column order

union_df = df1.union(df2)
union_df.show()

Output:
+-------+---+
|   name|age|
+-------+---+
|  Alice| 31|
|    Bob| 22|
|Charlie| 47|
|  David| 29|
+-------+---+

Example: Use `union()` to combine two DataFrames with different column orders

# Create Sample DataFrames
df3 = spark.createDataFrame([Row(age=31, name="Alice"), Row(age=22, name="Bob")])
df3.show()
df4 = spark.createDataFrame([Row(name="Charlie", age=47), Row(name="David", age=29)])
df4.show()

Output:
+---+-----+
|age| name|
+---+-----+
| 31|Alice|
| 22|  Bob|
+---+-----+
+-------+---+
|   name|age|
+-------+---+
|Charlie| 47|
|  David| 29|
+-------+---+

df3.union(df4).show()

Output:
+-------+-----+
|    age| name|
+-------+-----+
|     31|Alice|
|     22|  Bob|
|Charlie|   47|
|  David|   29|
+-------+-----+

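In this example, union() concatenates the two DataFrames strictly by position and ignores column names. The result keeps the first DataFrame's column names (age, name), so Charlie and David end up under age and their ages under name. Because each column now mixes numbers and strings, Spark casts both columns to string instead of raising an error, which makes this kind of mismatch easy to miss.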

Example: Use `unionByName()` to combine two DataFrames with different column orders

# Creating Sample DataFrames with Columns in Different Orders
df5 = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df5.show()
df6 = spark.createDataFrame([("Charlie", 3), ("David", 4)], ["name", "id"])
df6.show()

# Union by Name
union_df = df5.unionByName(df6)
union_df.show()

Output:
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
+---+-----+
+-------+---+
|   name| id|
+-------+---+
|Charlie|  3|
|  David|  4|
+-------+---+
+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
|  4|  David|
+---+-------+

In this example, despite the name and id columns being arranged in different orders in the two DataFrames, the concatenation process aligns them seamlessly based on their corresponding column names.

# Stop the Spark Session
spark.stop()
