
Save PySpark DataFrame to Files
This tutorial explains how to save a PySpark DataFrame to various file types using the DataFrameWriter returned by `df.write`.
We'll cover the following topics:
Functions for Writing to Different File Types

| File Type | Function |
|---|---|
| CSV | `df.write.csv(path, options)` |
| JSON | `df.write.json(path, options)` |
| Parquet | `df.write.parquet(path)` |
| ORC | `df.write.orc(path)` |
| Text | `df.write.text(path)` |
Different Modes of File Writing
The `mode()` method specifies how data is written to the target location. There are four commonly used modes when writing to a file: `overwrite`, `append`, `ignore`, and `errorIfExists`.
| Mode | Usage |
|---|---|
| overwrite | Overwrites existing data |
| append | Appends new data to existing data |
| ignore | Does not write the new data if the target location already exists |
| errorIfExists | Raises an error if the target location already exists |
Create Spark Session and Sample DataFrame
```python
from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.appName("app").getOrCreate()

# Create a sample DataFrame
data = [("James", "Sales", 3000), ("Michael", "Sales", 4600)]
columns = ["Employee Name", "Department", "Salary"]
df = spark.createDataFrame(data, columns)
df.show()
```
Output:

```
+-------------+----------+------+
|Employee Name|Department|Salary|
+-------------+----------+------+
|        James|     Sales|  3000|
|      Michael|     Sales|  4600|
+-------------+----------+------+
```
Save DataFrame to CSV
```python
df.write.csv(path="path/to/save/csv_file", header=True, mode="append")
```
Alternatively, we can set the options outside the `csv()` function using `mode()` and `option()`:

```python
df.write.mode("append").option("header", "true").csv(path="path/to/save/csv_file")
```
Save DataFrame to Parquet
```python
df.write.parquet(path="path/to/save/file", compression="snappy", mode="overwrite")
```

Similarly, you can also write it as:

```python
df.write.mode("overwrite").option("compression", "snappy").parquet(path="path/to/save/file")
```