
## Quick Intro to DataFrame Reader

The `spark.read` attribute returns a `DataFrameReader`, which reads data from a variety of sources into a PySpark DataFrame. It supports reading file formats including CSV, JSON, Parquet, ORC, plain text, and more.
## Functions for Reading Different File Types

| File Type | Function |
|---|---|
| CSV file | `spark.read.csv(path, options)` |
| JSON file | `spark.read.json(path, options)` |
| Parquet file | `spark.read.parquet(path)` |
| ORC file | `spark.read.orc(path)` |
| Text file | `spark.read.text(path)` |
## Read CSV Files

Read from CSV with the DataFrameReader's `csv()` method and the following options:

- `option("sep", "\t")`: sets `"\t"` as the delimiter.
- `option("header", True)`: treats the first line as the header row.
- `option("inferSchema", True)`: lets PySpark infer the schema automatically.
- `csv(path)`: the path to the CSV file.
```python
df = spark.read.option("sep", "\t") \
    .option("header", True) \
    .option("inferSchema", True) \
    .csv("/path/to/your/csvfile.csv")
```
Instead of calling `option()` for each reading parameter before `csv()`, you can pass all the options as keyword arguments to the `csv()` function:

```python
df = spark.read.csv(
    "/path/to/your/csvfile.csv",
    sep="\t",
    header=True,
    inferSchema=True,
)
```
## Read JSON Files

Read from JSON with the DataFrameReader's `json()` method and the infer-schema option:

```python
df = spark.read.option("inferSchema", True).json("/path/to/your/jsonfile.json")
```
## Read Text Files

If we have a text file `example.txt` with the following pipe-delimited data:

```
id|name|age
1|Alice|30
2|Bob|25
3|Charlie|35
```

we can also use the `csv()` function to read the text file, setting `sep="|"`:

```python
df = spark.read.csv("example.txt", sep="|", header=True)
df.show()
```
Output:

```
+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Alice| 30|
|  2|    Bob| 25|
|  3|Charlie| 35|
+---+-------+---+
```