`corr()`

The `corr()` function calculates the Pearson correlation coefficient between two columns in a DataFrame. This coefficient measures the strength and direction of the linear relationship between two variables.

Usage

`corr()` is an aggregate function from `pyspark.sql.functions`. Apply it to a DataFrame, for example inside `select()`, to compute the correlation between two numeric columns.
It returns a value between -1 and 1, where 1 indicates a perfect positive linear correlation, -1 a perfect negative linear correlation, and 0 no linear correlation.
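
To make that range concrete, here is a minimal plain-Python sketch of the Pearson formula (covariance divided by the product of the standard deviations). The helper name `pearson` and the sample lists are illustrative assumptions and are not part of PySpark; the sketch only shows what the coefficient measures.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [-2, -1, 0, 1, 2]
positive = pearson(xs, [2 * x + 1 for x in xs])  # y rises linearly with x
negative = pearson(xs, [-3 * x for x in xs])     # y falls linearly with x
curved = pearson(xs, [x * x for x in xs])        # y depends on x, but not linearly
print(positive, negative, curved)
```

The third case is worth noting: `y = x * x` is fully determined by `x`, yet the coefficient comes out as 0 because the relationship is not linear.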

Create Spark Session and sample DataFrame

from pyspark.sql import SparkSession
from pyspark.sql.functions import corr

# Initialize Spark Session
spark = SparkSession.builder.appName("corrExample").getOrCreate()

# Sample DataFrame
data = [("James", 1, 10), ("Michael", 2, 20), ("Robert", 3, 30), ("Maria", 4, 40)]
columns = ["Name", "Variable1", "Variable2"]
df = spark.createDataFrame(data, columns)
df.show()

Output:
+-------+---------+---------+
|   Name|Variable1|Variable2|
+-------+---------+---------+
|  James|        1|       10|
|Michael|        2|       20|
| Robert|        3|       30|
|  Maria|        4|       40|
+-------+---------+---------+

Example: Use `corr()` to find the correlation coefficient between two columns

`corr("Variable1", "Variable2")`: calculates the Pearson correlation coefficient between the Variable1 and Variable2 columns.
`alias("correlation")`: names the resulting column correlation.

correlation_df = df.select(corr("Variable1", "Variable2").alias("correlation"))
correlation_df.show()

Output:
+-----------+
|correlation|
+-----------+
|        1.0|
+-----------+

# Stop the Spark Session
spark.stop()
