
Create a PySpark DataFrame Using createDataFrame()
The createDataFrame() method creates a PySpark DataFrame. If you're not familiar with PySpark DataFrames, check out this article: What is PySpark DataFrame.
This article covers the following topics:
- Create a PySpark DataFrame from different data structures
- Use the schema argument to define a DataFrame's schema
Create a Spark Session
To start, initialize a SparkSession and assign it to a variable spark. If you're not familiar with Spark sessions, click here to learn more.
from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.appName("createDataFrameExample").getOrCreate()
Create a PySpark DataFrame From a List of Lists
createDataFrame() takes a list of lists as input and turns it into a PySpark DataFrame.
# Sample data (a list of lists)
data = [["Alice", 25], ["Bob", 30], ["Charlie", 22]]

# Create a PySpark DataFrame
df = spark.createDataFrame(data)
df.show()
Output:
+-------+---+
| _1| _2|
+-------+---+
| Alice| 25|
| Bob| 30|
|Charlie| 22|
+-------+---+
The DataFrame created above has generic column names, _1 and _2, because we didn't specify custom names. We can use the schema argument to define specific column names.
Define Schema
In the example below, the variable schema is a StructType object containing two StructField objects: "Name" (String) and "Age" (Integer).
To learn more about DataFrame schemas and how to create one, check out this article: Create a PySpark DataFrame Schema.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Define a schema variable
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True)
])
Pass the defined schema variable as an argument to the schema parameter:
# Pass the schema variable as an argument to createDataFrame()
df = spark.createDataFrame(data=data, schema=schema)
df.show()
Output:
+-------+---+
| Name|Age|
+-------+---+
| Alice| 25|
| Bob| 30|
|Charlie| 22|
+-------+---+
Display df's schema:
df.printSchema()
Output:
root
|-- Name: string (nullable = true)
|-- Age: integer (nullable = true)
Create a PySpark DataFrame From a List of Tuples
# Sample data (a list of tuples)
data = [("James", 34), ("Anna", 28), ("Robert", 45)]
# Define a schema variable
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True)
])
# Pass the schema variable as an argument to the schema parameter
df = spark.createDataFrame(data=data, schema=schema)
df.show()
Output:
+------+---+
| Name|Age|
+------+---+
| James| 34|
| Anna| 28|
|Robert| 45|
+------+---+
Create a PySpark DataFrame From a List of Dictionaries
# Sample data (a list of dictionaries)
data = [
    {"Name": "James", "Age": 34},
    {"Name": "Anna", "Age": 28},
    {"Name": "Robert", "Age": 45}
]
# Define schema
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True)
])
# Create a PySpark DataFrame
df = spark.createDataFrame(data=data, schema=schema)
df.show()
Output:
+------+---+
| Name|Age|
+------+---+
| James| 34|
| Anna| 28|
|Robert| 45|
+------+---+
Create a PySpark DataFrame From a Pandas DataFrame
import pandas as pd
# Create a pandas DataFrame
pandas_df = pd.DataFrame({
    'Name': ['James', 'Anna', 'Robert'],
    'Age': [34, 28, 45]
})

# Create a PySpark DataFrame from a pandas DataFrame
df = spark.createDataFrame(pandas_df)
df.show()
Output:
+------+---+
| Name|Age|
+------+---+
| James| 34|
| Anna| 28|
|Robert| 45|
+------+---+
# Stop the Spark Session
spark.stop()