
Quick Intro to Spark Session
Spark Session is a key component in Apache Spark, serving as the main entry point for interacting with Spark's functionality. To use PySpark in a Python environment, the first thing to do is to create a Spark Session.
Create a Spark Session
Key Methods
- builder: Constructs a Spark Session.
- appName(): Sets the name of the application.
- getOrCreate(): Returns an existing Spark Session or creates a new one if none exists.
from pyspark.sql import SparkSession
# Initialize Spark Session
spark = SparkSession.builder.appName("MySparkApp").getOrCreate()
Using Spark Session
After creation, the Spark Session (spark) can be used to create and manipulate DataFrames, execute SQL queries, and interact with datasets from different data sources.
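For example, here is a minimal sketch of both tasks using the spark session created above; the column names, values, and the people view name are made up for illustration.

# Build a small DataFrame in memory (contents are illustrative)
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.show()

# Register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()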
Stop a Spark Session
It's important to stop the Spark Session when your application is finished to free up resources. We can use spark.stop() to terminate the session.
spark.stop()
Now that you know how to initiate a Spark Session, keep reading the tutorial to learn about PySpark's core data structure: the PySpark DataFrame.
Other Key Features of Spark Session
It's a Unified Entry Point
- Spark Session consolidates various functionalities of Spark, including the Spark SQL, DataFrame, Dataset, and Streaming APIs.
- It simplifies the process of interacting with different data formats and sources.
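As a rough illustration, the session's spark.read interface exposes the same entry point for several formats; the file paths below are hypothetical placeholders.

# Read data from different formats through the same entry point (paths are placeholders)
csv_df = spark.read.csv("data/people.csv", header=True, inferSchema=True)
json_df = spark.read.json("data/events.json")
parquet_df = spark.read.parquet("data/metrics.parquet")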
It's a Replacement for SQLContext and HiveContext
- Before Spark 2.0, SQLContext and HiveContext were used. Spark Session subsumes these contexts, providing a more streamlined and unified approach.
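To make the contrast concrete, here is a small sketch using the session created earlier; the query itself is illustrative. SQLContext still exists for backward compatibility but has been deprecated since Spark 3.0.

# Pre-2.0 style: a separate context built on top of the SparkContext (deprecated)
from pyspark.sql import SQLContext
sqlContext = SQLContext(spark.sparkContext)
sqlContext.sql("SELECT 1 AS one").show()

# Spark 2.0+ style: the same query through the unified Spark Session
spark.sql("SELECT 1 AS one").show()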