
split()
The split() function divides a string column into an array of strings using a specified delimiter. It is a useful function for breaking down and analyzing complex string data.
Usage
- split() takes a string column and a delimiter (interpreted as a regular expression pattern) as arguments; an optional third argument, limit, caps the number of resulting elements.
- It returns an array column whose elements are the substrings produced by splitting on the delimiter.
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

# Initialize Spark Session
spark = SparkSession.builder.appName("splitExample").getOrCreate()

# Sample DataFrame
data = [("James,Smith,USA",), ("Anna,Rose,Canada",), ("Robert,Williams,UK",)]
columns = ["PersonInfo"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+------------------+
| PersonInfo|
+------------------+
| James,Smith,USA|
| Anna,Rose,Canada|
|Robert,Williams,UK|
+------------------+
Example: Use split() to divide a column
- df.withColumn("SplittedInfo", ...) creates a new column named SplittedInfo in the df DataFrame.
- split(col("PersonInfo"), ",") divides the PersonInfo column on the "," delimiter; the resulting substrings are stored as an array in the new SplittedInfo column.
split_df = df.withColumn("SplittedInfo", split(col("PersonInfo"), ","))
split_df.show(truncate=False)
Output:
+------------------+----------------------+
|PersonInfo |SplittedInfo |
+------------------+----------------------+
|James,Smith,USA |[James, Smith, USA] |
|Anna,Rose,Canada |[Anna, Rose, Canada] |
|Robert,Williams,UK|[Robert, Williams, UK]|
+------------------+----------------------+
# Stop the Spark Session
spark.stop()