
var_pop()
The var_pop() function in Apache Spark calculates the population variance of a numeric column in a DataFrame. Population variance is the mean of the squared deviations from the column's mean, so it measures how widely the values are spread around that mean.
Usage
- var_pop() computes the variance treating the data as the entire population, so it divides the sum of squared deviations by n. It differs from var_samp(), which calculates the sample variance and divides by n - 1.
- Use it when the column holds every value of interest, rather than a sample drawn from a larger population.
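As a rough illustration of that difference, the following plain-Python sketch computes both variances by hand. The helper names population_variance and sample_variance and the input values are made up for this example and are not part of Spark; the two functions differ only in the divisor.

```python
def population_variance(values):
    """Mean of squared deviations from the mean (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def sample_variance(values):
    """Same sum of squared deviations, but with divisor n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


# Hypothetical values, chosen only to keep the arithmetic simple.
values = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_variance(values))  # 4.0
print(sample_variance(values))      # slightly larger, because of the n - 1 divisor
```

For the same data the sample variance is always at least as large as the population variance, and the gap shrinks as the number of values grows.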
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import var_pop

# Initialize Spark Session
spark = SparkSession.builder.appName("varPopExample").getOrCreate()

# Sample DataFrame
data = [("James", 23), ("Anna", 30), ("Robert", 34), ("Maria", 29)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+------+---+
|  Name|Age|
+------+---+
| James| 23|
|  Anna| 30|
|Robert| 34|
| Maria| 29|
+------+---+
Example: Use var_pop() to compute population variance of a numeric column
- var_pop("Age"): computes the population variance of the Age column in the df DataFrame.
- alias("Population Variance"): renames the resulting column to Population Variance.
population_variance_df = df.select(var_pop("Age").alias("Population Variance"))
population_variance_df.show()
Output:
+-------------------+
|Population Variance|
+-------------------+
|               15.5|
+-------------------+
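The value 15.5 can be cross-checked outside Spark. The four ages have a mean of 29, their squared deviations (36, 1, 25, 0) sum to 62, and 62 / 4 = 15.5. A minimal sketch of that check, assuming Python's standard statistics module rather than any Spark API:

```python
import statistics

# The Age values from the sample DataFrame above.
ages = [23, 30, 34, 29]

pop_var = statistics.pvariance(ages)   # divisor n, the population variance
samp_var = statistics.variance(ages)   # divisor n - 1, the sample variance

print(pop_var)   # 15.5
print(samp_var)  # 62 / 3, roughly 20.67
```

The second number is the sample variance of the same column, which is the quantity var_samp() is described as computing.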
# Stop the Spark Session
spark.stop()