
stddev_samp()
The stddev_samp() function calculates the sample standard deviation of a given numeric column in a DataFrame. This statistical measure is used to quantify the amount of variation or spread in a set of data values.
Usage
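- stddev_samp() computes the standard deviation using the sample formula, which divides the sum of squared deviations by n - 1 rather than n. This makes it the appropriate estimate when the data is a sample drawn from a larger population, as in inferential statistics.
- The function is particularly helpful in data analysis for understanding the variability of data.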
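As a minimal sketch of the sample formula described above, the following plain-Python snippet computes the standard deviation with the n - 1 divisor and contrasts it with the population version, which divides by n. The helper names sample_stddev and population_stddev and the input values are illustrative assumptions. They are not part of PySpark.

```python
import math


def sample_stddev(values):
    """Standard deviation with the n - 1 divisor (hypothetical helper)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are required")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))


def population_stddev(values):
    """Standard deviation with the n divisor (hypothetical helper)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / n)


# Illustrative values, not taken from the DataFrame used in this article
values = [2, 4, 4, 4, 5, 5, 7, 9]

print(sample_stddev(values))      # slightly larger than the population value
print(population_stddev(values))  # 2.0
```

The sample version is always at least as large as the population version for the same data, because it divides by a smaller number.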
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import stddev_samp

# Initialize Spark Session
spark = SparkSession.builder.appName("stddevSampExample").getOrCreate()

# Sample DataFrame
data = [("James", 23), ("Anna", 30), ("Robert", 34), ("Maria", 29)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+------+---+
|  Name|Age|
+------+---+
| James| 23|
|  Anna| 30|
|Robert| 34|
| Maria| 29|
+------+---+
Example: Use stddev_samp() to compute the standard deviation of values in a column
- stddev_samp("Age"): computes the sample standard deviation of the values in the Age column.
- alias("Sample StdDev Age"): renames the resulting column to Sample StdDev Age.
stddev_age_df = df.select(stddev_samp("Age").alias("Sample StdDev Age"))
stddev_age_df.show()
Output:
+-----------------+
|Sample StdDev Age|
+-----------------+
|4.546060565661952|
+-----------------+
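The result can be cross-checked without Spark. The sketch below uses Python's standard statistics.stdev, which also applies the n - 1 divisor, on the same four Age values. The variable names are illustrative, and the last few digits may differ slightly from Spark's output because of floating-point rounding.

```python
import math
import statistics

# The Age values from the sample DataFrame
ages = [23, 30, 34, 29]

# statistics.stdev uses the sample (n - 1) formula
expected = statistics.stdev(ages)

# By hand: the mean is 29, the squared deviations are 36, 1, 25 and 0,
# so the sample variance is 62 / 3
by_hand = math.sqrt(62 / 3)

print(round(expected, 6))  # 4.546061
print(math.isclose(expected, by_hand))  # True
```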
# Stop the Spark Session
spark.stop()