var_pop()

The var_pop() function in Apache Spark is an aggregate function that returns the population variance of a numeric column in a DataFrame. Variance measures how widely a set of values is spread around its mean.

Usage

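var_pop() computes the variance treating the data as the entire population: it divides the sum of squared deviations from the mean by N, the number of values. It differs from var_samp(), which calculates the sample variance and divides by N - 1 instead.
Use var_pop() when the column holds every value of interest rather than a sample drawn from a larger population.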
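As a rough illustration of the difference between the two functions, the two definitions can be sketched in plain Python, outside Spark. The helper names population_variance and sample_variance and the input values are hypothetical, chosen only for this sketch; they are not part of the Spark API.

```python
def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def sample_variance(values):
    """Same sum of squared deviations, divided by N - 1 (needs N >= 2)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


# Illustrative values only (not taken from the DataFrame example)
values = [2, 4, 4, 4, 5, 5, 7, 9]
pop_var = population_variance(values)
samp_var = sample_variance(values)
print(pop_var)
print(samp_var)
```

For the same data the sample variance is never smaller than the population variance, because it divides the same sum by a smaller number.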

Create Spark Session and sample DataFrame

from pyspark.sql import SparkSession
from pyspark.sql.functions import var_pop

# Initialize Spark Session
spark = SparkSession.builder.appName("varPopExample").getOrCreate()

# Sample DataFrame
data = [("James", 23), ("Anna", 30), ("Robert", 34), ("Maria", 29)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()

Output:
+------+---+
|  Name|Age|
+------+---+
| James| 23|
|  Anna| 30|
|Robert| 34|
| Maria| 29|
+------+---+

Example: Use `var_pop()` to compute population variance of a numeric column

var_pop("Age"): it computes the population variance of the Age column in the df DataFrame.
alias("Population Variance"): it renames the resulting column as Population Variance.

population_variance_df = df.select(var_pop("Age").alias("Population Variance"))
population_variance_df.show()

Output:
+-------------------+
|Population Variance|
+-------------------+
|               15.5|
+-------------------+
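The 15.5 above can be cross-checked outside Spark. The following sketch assumes only Python's standard statistics module and re-enters the four Age values by hand; it is a sanity check, not part of the Spark job.

```python
import statistics

# Age values copied from the sample DataFrame
ages = [23, 30, 34, 29]

# Population variance: the mean is 29, the squared deviations
# (36, 1, 25, 0) sum to 62, and 62 / 4 = 15.5
pop_var = statistics.pvariance(ages)

# Sample variance for comparison: the same sum divided by N - 1 = 3
samp_var = statistics.variance(ages)

print(pop_var)   # 15.5
print(samp_var)
```

The second value is what var_samp("Age") would be expected to report for the same column, since it divides by N - 1 rather than N.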

# Stop the Spark Session
spark.stop()
