
union() and unionByName()
union() and unionByName() are both used to concatenate two DataFrames. While the union() function merges DataFrames based on column positions, the unionByName() function concatenates them based on their column names.
union() vs unionByName()
union() concatenates two DataFrames by column position and ignores column names. The first column in the first DataFrame is combined with the first column in the second DataFrame, and so on, and the result takes its column names from the first DataFrame. It is best used when the schemas of the DataFrames are exactly the same and in the same order.
unionByName() aligns columns by name, not by position, making it useful when the DataFrames to be merged have the same column names but in different orders. It ensures that columns with the same name are merged, irrespective of their position.
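The difference between the two alignment rules can be sketched in plain Python, without Spark. This is only a simplified model under stated assumptions: the function names union_by_position and union_by_name are hypothetical, rows are plain tuples, and type handling is ignored entirely. It is not how Spark implements either method.

```python
def union_by_position(cols_a, rows_a, cols_b, rows_b):
    """Model of positional alignment: rows are appended as-is.

    The result keeps the column names of the first input; the names of
    the second input are ignored.
    """
    if len(cols_a) != len(cols_b):
        raise ValueError("both inputs must have the same number of columns")
    return list(cols_a), list(rows_a) + list(rows_b)


def union_by_name(cols_a, rows_a, cols_b, rows_b):
    """Model of name-based alignment: rows of the second input are
    reordered to the first input's column order before appending."""
    if set(cols_a) != set(cols_b):
        raise ValueError("both inputs must have the same column names")
    # For each column of the first input, find its position in the second.
    idx = [cols_b.index(c) for c in cols_a]
    reordered = [tuple(row[i] for i in idx) for row in rows_b]
    return list(cols_a), list(rows_a) + reordered


if __name__ == "__main__":
    a_cols, a_rows = ["age", "name"], [(31, "Alice")]
    b_cols, b_rows = ["name", "age"], [("Charlie", 47)]
    print(union_by_position(a_cols, a_rows, b_cols, b_rows))
    print(union_by_name(a_cols, a_rows, b_cols, b_rows))
```

In the positional model the second row keeps its original order, so "Charlie" ends up under age. In the name-based model the row is reordered first, so 47 ends up under age.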
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
from pyspark.sql import Row
# Initialize Spark Session
spark = SparkSession.builder.appName("unionExample").getOrCreate()
# Create Sample DataFrames
df1 = spark.createDataFrame([Row(name="Alice", age=31), Row(name="Bob", age=22)])
df1.show()
df2 = spark.createDataFrame([Row(name="Charlie", age=47), Row(name="David", age=29)])
df2.show()
Output:
+-----+---+
| name|age|
+-----+---+
|Alice| 31|
|  Bob| 22|
+-----+---+
+-------+---+
|   name|age|
+-------+---+
|Charlie| 47|
|  David| 29|
+-------+---+
Example: Use union() to union two DataFrames having the same column order
union_df = df1.union(df2)
union_df.show()
Output:
+-------+---+
|   name|age|
+-------+---+
|  Alice| 31|
|    Bob| 22|
|Charlie| 47|
|  David| 29|
+-------+---+
Example: Use union() to union two DataFrames having different column orders
# Create Sample DataFrames
df3 = spark.createDataFrame([Row(age=31, name="Alice"), Row(age=22, name="Bob")])
df3.show()
df4 = spark.createDataFrame([Row(name="Charlie", age=47), Row(name="David", age=29)])
df4.show()
Output:
+---+-----+
|age| name|
+---+-----+
| 31|Alice|
| 22|  Bob|
+---+-----+
+-------+---+
|   name|age|
+-------+---+
|Charlie| 47|
|  David| 29|
+-------+---+
df3.union(df4).show()
Output:
+-------+-----+
|    age| name|
+-------+-----+
|     31|Alice|
|     22|  Bob|
|Charlie|   47|
|  David|   29|
+-------+-----+
In this example, union() pairs the columns purely by position and ignores their names. The name values from df4 land under age and its age values land under name, while the result keeps df3's column names, so the data is silently misaligned.
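One way to avoid this misalignment while still using union() is to reorder the second DataFrame's columns to match the first before combining them. The sketch below assumes both DataFrames have exactly the same set of column names. The helper name union_after_reordering is hypothetical and not part of the PySpark API; it relies only on the columns attribute and the select() and union() methods of a DataFrame.

```python
def union_after_reordering(left, right):
    """Reorder `right` to `left`'s column order, then union by position.

    Assumption: both DataFrames carry the same set of column names;
    otherwise a ValueError is raised instead of producing misaligned rows.
    """
    if set(left.columns) != set(right.columns):
        raise ValueError(
            f"column names differ: {left.columns} vs {right.columns}"
        )
    # select() with the left-hand column order puts the right-hand
    # columns in matching positions, so the positional union is safe.
    return left.union(right.select(*left.columns))
```

With the DataFrames from this example, union_after_reordering(df3, df4).show() should list Charlie and David under name and their ages under age.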
Example: Use unionByName() to union two DataFrames having different column orders
# Creating Sample DataFrames with Columns in Different Orders
df5 = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df5.show()
df6 = spark.createDataFrame([("Charlie", 3), ("David", 4)], ["name", "id"])
df6.show()
# Union by Name
union_df = df5.unionByName(df6)
union_df.show()
Output:
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
+---+-----+
+-------+---+
|   name| id|
+-------+---+
|Charlie|  3|
|  David|  4|
+-------+---+
+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
|  4|  David|
+---+-------+
In this example, although the name and id columns appear in different orders in the two DataFrames, unionByName() matches them by column name, and the result follows the column order of the first DataFrame (df5).
# Stop the Spark Session
spark.stop()