
date_format()
The date_format() function converts a date, timestamp, or string column into a string column with a specified date format. date_format() takes two arguments: the date column and the date format string.
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import date_format, to_date

# Initialize Spark Session
spark = SparkSession.builder.appName("dateFormatExample").getOrCreate()

# Sample DataFrame with date strings
data = [("2021-01-01",), ("2021-06-24",)]
columns = ["Date"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+----------+
|      Date|
+----------+
|2021-01-01|
|2021-06-24|
+----------+
Example: Use date_format() to convert a date to a format like January 1, 2021
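date_format("Date", "MMMM d, yyyy"): the first argument is the Date column of the DataFrame df. The second argument, MMMM d, yyyy, specifies the format to apply to the Date column: MMMM is the full month name, d is the day of the month without a leading zero, and yyyy is the four-digit year.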
# Convert the string column to a date column
df = df.withColumn("Date", to_date("Date"))

# Use date_format
df.withColumn("Formatted Date", date_format("Date", "MMMM d, yyyy")).show()
Output:
+----------+---------------+
|      Date| Formatted Date|
+----------+---------------+
|2021-01-01|January 1, 2021|
|2021-06-24|  June 24, 2021|
+----------+---------------+
Example: Use date_format() to convert a date to a format like 01/01/2021
df.withColumn("Formatted Date", date_format("Date", "MM/dd/yyyy")).show()
Output:
+----------+--------------+
|      Date|Formatted Date|
+----------+--------------+
|2021-01-01|    01/01/2021|
|2021-06-24|    06/24/2021|
+----------+--------------+
Example: Use date_format() to extract the time portion of a timestamp
Create a new sample DataFrame with timestamps
data = [("2021-01-01 12:30:00",), ("2021-06-24 15:45:30",), ("2021-07-11 08:00:15",)]
columns = ["Timestamp"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+-------------------+
|          Timestamp|
+-------------------+
|2021-01-01 12:30:00|
|2021-06-24 15:45:30|
|2021-07-11 08:00:15|
+-------------------+
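date_format("Timestamp", "HH:mm:ss.SSSSSS"): the first argument is the Timestamp column of the DataFrame df. The second argument, HH:mm:ss.SSSSSS, specifies a format that includes only the time part of the Timestamp column: HH is the hour on a 24-hour clock, mm the minutes, ss the seconds, and SSSSSS the fractional seconds to six digits. Timestamp is a string column here, so date_format() interprets its values as timestamps before formatting them.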
formatted_df = df.withColumn("Timestamp only", date_format("Timestamp", "HH:mm:ss.SSSSSS"))
formatted_df.show(truncate=False)
Output:
+-------------------+---------------+
|Timestamp          |Timestamp only |
+-------------------+---------------+
|2021-01-01 12:30:00|12:30:00.000000|
|2021-06-24 15:45:30|15:45:30.000000|
|2021-07-11 08:00:15|08:00:15.000000|
+-------------------+---------------+
# Stop the Spark Session
spark.stop()