
explode()
The explode() function is used to convert each element in an array or each key-value pair in a map into a separate row. This transformation is particularly useful for flattening complex nested data structures in DataFrames.
Usage
explode() is applied to an array or map column.
- In the case of an array, each element becomes a new row.
- For a map, each key-value pair is turned into a new row.
Create Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

# Initialize Spark Session
spark = SparkSession.builder.appName("explodeExample").getOrCreate()
Example: Use explode() with Array columns
Create a sample DataFrame with an Array column
array_data = [(1, ["Java", "Python", "C++"]), (2, ["Spark", "Java", "C++"]), (3, ["Python", "Scala"])]
array_columns = ["Id", "Languages"]
array_df = spark.createDataFrame(array_data, array_columns)
array_df.show()
Output:
+---+-------------------+
| Id|          Languages|
+---+-------------------+
|  1|[Java, Python, C++]|
|  2| [Spark, Java, C++]|
|  3|    [Python, Scala]|
+---+-------------------+
Use explode() on the Languages column
exploded_df = array_df.select(array_df.Id, explode(array_df.Languages).alias("Language"))
exploded_df.show()
Output:
+---+--------+
| Id|Language|
+---+--------+
|  1|    Java|
|  1|  Python|
|  1|     C++|
|  2|   Spark|
|  2|    Java|
|  2|     C++|
|  3|  Python|
|  3|   Scala|
+---+--------+
- explode(array_df.Languages): this transforms each element in the Languages array column into a separate row.
- The Id column is retained for each exploded row, and the new Language column contains the individual elements from the arrays.
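The row-multiplication behaviour described above can be sketched in plain Python, without Spark. This is only an illustrative model, not how Spark implements explode(); the helper name explode_array_rows is hypothetical. It also assumes that a row whose array is empty or null produces no output rows, which matches the documented behaviour of explode().

```python
# Plain-Python model of explode() on an array column (illustration only, not Spark code).
from typing import Iterable, Iterator, List, Optional, Tuple


def explode_array_rows(
    rows: Iterable[Tuple[int, Optional[List[str]]]]
) -> Iterator[Tuple[int, str]]:
    """Yield one (id, element) pair per array element, repeating the id."""
    for row_id, items in rows:
        # A None or empty list yields nothing, so that input row disappears.
        for item in items or []:
            yield (row_id, item)


# Same sample data as the array example in this tutorial.
array_data = [
    (1, ["Java", "Python", "C++"]),
    (2, ["Spark", "Java", "C++"]),
    (3, ["Python", "Scala"]),
]

exploded = list(explode_array_rows(array_data))
for row in exploded:
    print(row)
```

Three input rows with 3, 3, and 2 elements become 8 output rows, each carrying the Id of the row it came from.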
Example: Use explode() with Map columns
Create a sample DataFrame with a Map column
map_data = [(1, {"Java": "JVM", "Python": "CPython"}), (2, {"C++": "GCC", "Java": "OpenJDK"}), (3, {"Python": "PyPy", "Scala": "JVM"})]
map_columns = ["Id", "LanguageMap"]
map_df = spark.createDataFrame(map_data, map_columns)
map_df.show(truncate=False)
Output:
+---+--------------------------------+
|Id |LanguageMap                     |
+---+--------------------------------+
|1  |{Java -> JVM, Python -> CPython}|
|2  |{Java -> OpenJDK, C++ -> GCC}   |
|3  |{Scala -> JVM, Python -> PyPy}  |
+---+--------------------------------+
exploded_map_df = map_df.select(map_df.Id, explode(map_df.LanguageMap).alias("Language", "Platform"))
exploded_map_df.show()
Output:
+---+--------+--------+
| Id|Language|Platform|
+---+--------+--------+
|  1|    Java|     JVM|
|  1|  Python| CPython|
|  2|    Java| OpenJDK|
|  2|     C++|     GCC|
|  3|   Scala|     JVM|
|  3|  Python|    PyPy|
+---+--------+--------+
- explode(map_df.LanguageMap): this transforms each key-value pair in the LanguageMap column into separate rows.
- The resulting DataFrame has three columns: Id, Language (the key), and Platform (the value).
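The same idea can be modelled for maps in plain Python: each key-value pair becomes its own (Id, key, value) row. This is a sketch for illustration only, and the helper name explode_map_rows is hypothetical. One difference is worth noting: a Python dict iterates in insertion order, so this model yields C++ before Java for Id 2, whereas the Spark output above lists Java first. It is probably safest not to rely on the order of pairs within a map.

```python
# Plain-Python model of explode() on a map column (illustration only, not Spark code).
from typing import Dict, Iterable, Iterator, Optional, Tuple


def explode_map_rows(
    rows: Iterable[Tuple[int, Optional[Dict[str, str]]]]
) -> Iterator[Tuple[int, str, str]]:
    """Yield one (id, key, value) triple per map entry, repeating the id."""
    for row_id, mapping in rows:
        # A None or empty dict yields nothing, so that input row disappears.
        for key, value in (mapping or {}).items():
            yield (row_id, key, value)


# Same sample data as the map example in this tutorial.
map_data = [
    (1, {"Java": "JVM", "Python": "CPython"}),
    (2, {"C++": "GCC", "Java": "OpenJDK"}),
    (3, {"Python": "PyPy", "Scala": "JVM"}),
]

exploded_map = list(explode_map_rows(map_data))
for row in exploded_map:
    print(row)
```

Each output triple corresponds to the Id, Language, and Platform columns of the exploded DataFrame.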
# Stop the Spark Session
spark.stop()