
corr()
The corr() function calculates the Pearson correlation coefficient between two columns in a DataFrame. This coefficient measures the strength and direction of the linear relationship between two variables.
Usage
- corr() is applied to a DataFrame as an aggregate function to compute the correlation between two numeric columns.
- It returns a value between -1 and 1, where 1 indicates a perfect positive linear correlation, -1 a perfect negative linear correlation, and 0 no linear correlation.
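To make the value range concrete, here is a minimal plain-Python sketch of the standard Pearson formula (covariance divided by the product of the standard deviations). The helper name pearson and the input lists are hypothetical and chosen only for illustration; this is not how Spark computes the value internally, only the same quantity worked out by hand.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Hypothetical helper for illustration; not part of PySpark.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Illustrative inputs (assumed values, not taken from any real dataset)
r_pos = pearson([1, 2, 3, 4], [10, 20, 30, 40])   # y rises in step with x
r_neg = pearson([1, 2, 3, 4], [40, 30, 20, 10])   # y falls in step with x
r_zero = pearson([1, 2, 3, 4], [1, -1, -1, 1])    # no linear trend

print(r_pos, r_neg, r_zero)
```

The first pair lies on a rising straight line, the second on a falling one, and the third has no linear trend, so the three results sit at the two ends and the middle of the range described above. Note that the formula is undefined when either column is constant, because the denominator becomes zero.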
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import corr

# Initialize Spark Session
spark = SparkSession.builder.appName("corrExample").getOrCreate()

# Sample DataFrame
data = [("James", 1, 10), ("Michael", 2, 20), ("Robert", 3, 30), ("Maria", 4, 40)]
columns = ["Name", "Variable1", "Variable2"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+-------+---------+---------+
|   Name|Variable1|Variable2|
+-------+---------+---------+
|  James|        1|       10|
|Michael|        2|       20|
| Robert|        3|       30|
|  Maria|        4|       40|
+-------+---------+---------+
Example: Use corr() to find the correlation coefficient between two columns
- corr("Variable1", "Variable2"): calculates the Pearson correlation coefficient between the Variable1 and Variable2 columns.
- alias("correlation"): names the resulting column correlation.
correlation_df = df.select(corr("Variable1", "Variable2").alias("correlation"))
correlation_df.show()
Output:
+-----------+
|correlation|
+-----------+
|        1.0|
+-----------+
# Stop the Spark Session
spark.stop()