
## Quick Intro to DataFrame Reader

The `spark.read` attribute returns a `DataFrameReader`, which reads data from a variety of sources into a PySpark DataFrame. It supports reading file formats including CSV, JSON, Parquet, ORC, plain text, and more.
## Functions for Reading Different File Types

| File Type | Function |
|---|---|
| CSV file | `spark.read.csv(path, options)` |
| JSON file | `spark.read.json(path, options)` |
| Parquet file | `spark.read.parquet(path)` |
| ORC file | `spark.read.orc(path)` |
| Text file | `spark.read.text(path)` |
## Read CSV Files

Read from CSV with the DataFrameReader's `csv()` method and the following options:

- `option("sep", "\t")`: sets `"\t"` as the delimiter.
- `option("header", True)`: treats the first line as the header row.
- `option("inferSchema", True)`: lets PySpark infer the schema automatically.
- `csv(path)`: the path to the CSV file.
```python
df = spark.read.option("sep", "\t") \
    .option("header", True) \
    .option("inferSchema", True) \
    .csv("/path/to/your/csvfile.csv")
```
Instead of calling `option()` for each reading parameter before `csv()`, you can pass all the options as keyword arguments to the `csv()` function:

```python
df = spark.read.csv(
    "/path/to/your/csvfile.csv",
    sep="\t",
    header=True,
    inferSchema=True,
)
```
## Read JSON Files

Read from JSON with the DataFrameReader's `json()` method and the infer-schema option:

```python
df = spark.read.option("inferSchema", True).json("/path/to/your/jsonfile.json")
```
## Read Text Files

If we have a text file `example.txt` with the following pipe-delimited data:

```
id|name|age
1|Alice|30
2|Bob|25
3|Charlie|35
```

we can also use the `csv()` function to read the text file, setting `sep="|"`:

```python
df = spark.read.csv("example.txt", sep="|", header=True)
df.show()
```
Output:

```
+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Alice| 30|
|  2|    Bob| 25|
|  3|Charlie| 35|
+---+-------+---+
```