
split()
The split() function divides a string column into an array of strings using a specified delimiter. It is a useful function for breaking down and analyzing complex string data.
Usage
- split() takes a string column and a delimiter (interpreted as a regular expression pattern) as arguments; an optional third argument, limit, caps the number of resulting elements.
- It returns an array column whose elements are the substrings produced by splitting on the delimiter.
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

# Initialize Spark Session
spark = SparkSession.builder.appName("splitExample").getOrCreate()

# Sample DataFrame
data = [("James,Smith,USA",), ("Anna,Rose,Canada",), ("Robert,Williams,UK",)]
columns = ["PersonInfo"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+------------------+
| PersonInfo|
+------------------+
| James,Smith,USA|
| Anna,Rose,Canada|
|Robert,Williams,UK|
+------------------+
Example: Use split() to divide a column
- df.withColumn("SplittedInfo", ...) creates a new column named SplittedInfo in the df DataFrame.
- split(col("PersonInfo"), ",") divides the PersonInfo column on the "," delimiter; the resulting substrings are stored as an array in the new SplittedInfo column.
split_df = df.withColumn("SplittedInfo", split(col("PersonInfo"), ","))
split_df.show(truncate=False)
Output:
+------------------+----------------------+
|PersonInfo |SplittedInfo |
+------------------+----------------------+
|James,Smith,USA |[James, Smith, USA] |
|Anna,Rose,Canada |[Anna, Rose, Canada] |
|Robert,Williams,UK|[Robert, Williams, UK]|
+------------------+----------------------+
# Stop the Spark Session
spark.stop()