
Create a PySpark DataFrame Using createDataFrame()
The createDataFrame() method creates a PySpark DataFrame. If you're not familiar with PySpark DataFrames, check out this article: What is PySpark DataFrame.
This article covers the following topics:
- Create a PySpark DataFrame from different data structures
- Use the schema argument to define a DataFrame's schema
Create a Spark Session
To start, initialize a SparkSession and assign it to a variable spark. If you're not familiar with Spark sessions, click here to learn more.
from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.appName("createDataFrameExample").getOrCreate()
Create a PySpark DataFrame From a List of Lists
createDataFrame() takes a list of lists as input and turns it into a PySpark DataFrame.
# Sample data (a list of lists)
data = [["Alice", 25], ["Bob", 30], ["Charlie", 22]]

# Create a PySpark DataFrame
df = spark.createDataFrame(data)
df.show()
Output:
+-------+---+
| _1| _2|
+-------+---+
| Alice| 25|
| Bob| 30|
|Charlie| 22|
+-------+---+
The DataFrame created above has generic column names, _1 and _2, because we didn't specify custom names. We can use the schema argument to define specific column names.
Define Schema
In the example below, the variable schema is a StructType object containing two StructField objects: "Name" (String) and "Age" (Integer).
To learn more about DataFrame schemas and how to create one, check out this article: Create a PySpark DataFrame Schema.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Define a schema variable
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True)
])
Pass the defined schema variable as an argument to the schema parameter:
# Pass the schema variable as an argument to createDataFrame()
df = spark.createDataFrame(data=data, schema=schema)
df.show()
Output:
+-------+---+
| Name|Age|
+-------+---+
| Alice| 25|
| Bob| 30|
|Charlie| 22|
+-------+---+
Display df's schema:
df.printSchema()
Output:
root
|-- Name: string (nullable = true)
|-- Age: integer (nullable = true)
Create a PySpark DataFrame From a List of Tuples
# Sample data (a list of tuples)
data = [("James", 34), ("Anna", 28), ("Robert", 45)]
# Define a schema variable
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True)
])
# Pass the schema variable as an argument to the schema parameter
df = spark.createDataFrame(data=data, schema=schema)
df.show()
Output:
+------+---+
| Name|Age|
+------+---+
| James| 34|
| Anna| 28|
|Robert| 45|
+------+---+
Create a PySpark DataFrame From a List of Dictionaries
# Sample data (a list of dictionaries)
data = [
    {"Name": "James", "Age": 34},
    {"Name": "Anna", "Age": 28},
    {"Name": "Robert", "Age": 45}
]
# Define schema
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True)
])
# Create a PySpark DataFrame
df = spark.createDataFrame(data=data, schema=schema)
df.show()
Output:
+------+---+
| Name|Age|
+------+---+
| James| 34|
| Anna| 28|
|Robert| 45|
+------+---+
Create a PySpark DataFrame From a Pandas DataFrame
import pandas as pd
# Create a pandas DataFrame
pandas_df = pd.DataFrame({
    'Name': ['James', 'Anna', 'Robert'],
    'Age': [34, 28, 45]
})

# Create a PySpark DataFrame from a pandas DataFrame
df = spark.createDataFrame(pandas_df)
df.show()
Output:
+------+---+
| Name|Age|
+------+---+
| James| 34|
| Anna| 28|
|Robert| 45|
+------+---+
# Stop the Spark Session
spark.stop()