
Create Schema for PySpark DataFrame
A schema defines the structure of a DataFrame, specifying the names and data types of its columns. It is essential for ensuring data consistency and integrity throughout DataFrame operations, such as creation, reading, and writing.
This article introduces two ways to create a DataFrame schema:
1. Programmatic Schema Creation (Using StructType and StructField)
- Flexibility: You can specify the data types and nullable settings for each column individually.
- Readability: Provides a clear definition of the schema in your code.
- Type Safety: PySpark will check the data types at runtime against the specified schema.
2. String-Based Schema Definition
- Conciseness: Easier to define a schema with a shorter syntax, especially for simple cases.
- Less Control: You cannot specify nullable settings or other advanced options.
- Potential Errors: The schema is a plain string, so a typo in a type name is only caught when Spark parses the string at runtime; errors also occur if the data does not match the specified schema.
Create a Spark Session
To start, initialize a SparkSession and assign it to a variable named spark.
from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.appName("app").getOrCreate()
Programmatic Schema Creation
Create a StructType object with a list of StructField objects. Each StructField object defines the column name, column type, and whether it allows null values.
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType,
    LongType, DoubleType, DateType, TimestampType,
)
from datetime import datetime, date

# Sample data (list of tuples)
data = [
    (1, 2., 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),
    (2, 3., 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),
    (3, 4., 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0))
]

# Define a schema variable
schema = StructType([
    StructField("col1", LongType(), nullable=True),
    StructField("col2", DoubleType(), nullable=True),
    StructField("col3", StringType(), nullable=True),
    StructField("col4", DateType(), nullable=True),
    StructField("col5", TimestampType(), nullable=True)
])
- StructType([]): creates a StructType object from a list of StructField objects.
- StructField(name, dataType, nullable) accepts three arguments:
  - name: specifies the column name.
  - dataType: defines the column type. In the example, five different data types are used.
  - nullable: a boolean indicating whether the column allows null values.
Pass the defined schema variable to the createDataFrame() function:
# Pass the schema variable as an argument to the schema parameter
df = spark.createDataFrame(data=data, schema=schema)
df.show()
Output:
+----+----+-------+----------+-------------------+
|col1|col2|   col3|      col4|               col5|
+----+----+-------+----------+-------------------+
|   1| 2.0|string1|2000-01-01|2000-01-01 12:00:00|
|   2| 3.0|string2|2000-02-01|2000-01-02 12:00:00|
|   3| 4.0|string3|2000-03-01|2000-01-03 12:00:00|
+----+----+-------+----------+-------------------+
String-Based Schema Definition
A string-based schema definition is usually used when you have a simple DataFrame structure.
# Sample data (list of tuples)
data = [
    (1, 2., 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),
    (2, 3., 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),
    (3, 4., 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0))
]

df = spark.createDataFrame(
    data=data,
    schema='col1 long, col2 double, col3 string, col4 date, col5 timestamp'
)
df.show()
Output:
+----+----+-------+----------+-------------------+
|col1|col2|   col3|      col4|               col5|
+----+----+-------+----------+-------------------+
|   1| 2.0|string1|2000-01-01|2000-01-01 12:00:00|
|   2| 3.0|string2|2000-02-01|2000-01-02 12:00:00|
|   3| 4.0|string3|2000-03-01|2000-01-03 12:00:00|
+----+----+-------+----------+-------------------+
Choosing between these methods depends on your specific use case.
Use StructType and StructField:
- When you need precise control over the schema, including data types and nullable settings.
- For more complex schemas or when you need to ensure type safety.
Use String-Based Schema:
- For simple schemas with basic data types, where the default nullable settings are sufficient.
- When you prioritize concise code over detailed schema definition.