
Quick Intro to Spark Session
Spark Session is a key component in Apache Spark, serving as the main entry point for interacting with Spark's functionality. To use PySpark in a Python environment, the first thing to do is to create a Spark Session.
Create a Spark Session
Key Methods
- builder: Constructs a Spark Session.
- appName(): Sets the name of the application.
- getOrCreate(): Returns an existing Spark Session or creates a new one if none exists.
from pyspark.sql import SparkSession
# Initialize Spark Session
spark = SparkSession.builder.appName("MySparkApp").getOrCreate()
Using Spark Session
After creation, the Spark Session (spark) can be used to create and manipulate DataFrames, execute SQL queries, and interact with datasets from different data sources.
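For example, here is a minimal sketch of both tasks using the spark session created above; the column names, values, and the people view name are made up for illustration.

# Build a small DataFrame in memory (contents are illustrative)
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.show()

# Register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()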
Stop a Spark Session
It's important to stop the Spark Session when your application is finished to free up resources. We can use spark.stop() to terminate the session.
spark.stop()
Now that you know how to initiate a Spark Session, keep reading the tutorial to learn about PySpark's core data structure: the PySpark DataFrame.
Other Key Features of Spark Session
It's a Unified Entry Point
- Spark Session consolidates various functionalities of Spark, including the Spark SQL, DataFrame, Dataset, and Streaming APIs.
- It simplifies the process of interacting with different data formats and sources.
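As a rough illustration, the session's spark.read interface exposes the same entry point for several formats; the file paths below are hypothetical placeholders.

# Read data from different formats through the same entry point (paths are placeholders)
csv_df = spark.read.csv("data/people.csv", header=True, inferSchema=True)
json_df = spark.read.json("data/events.json")
parquet_df = spark.read.parquet("data/metrics.parquet")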
It's a Replacement for SQLContext and HiveContext
- Before Spark 2.0, SQLContext and HiveContext were used. Spark Session subsumes these contexts, providing a more streamlined and unified approach.
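To make the contrast concrete, here is a small sketch using the session created earlier; the query itself is illustrative. SQLContext still exists for backward compatibility but has been deprecated since Spark 3.0.

# Pre-2.0 style: a separate context built on top of the SparkContext (deprecated)
from pyspark.sql import SQLContext
sqlContext = SQLContext(spark.sparkContext)
sqlContext.sql("SELECT 1 AS one").show()

# Spark 2.0+ style: the same query through the unified Spark Session
spark.sql("SELECT 1 AS one").show()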