
isNull()
The isNull() function is used to identify null values in a DataFrame.
Usage
- isNull() is applied to a column of a DataFrame to create a Boolean expression indicating whether each value in the column is null.
- It's often used in conjunction with filter() or where() for selecting rows with null values or for data cleaning tasks (a short example follows the walkthrough below).
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize Spark Session
spark = SparkSession.builder.appName("isNullExample").getOrCreate()

# Sample DataFrame with Null Values
data = [("James", None), ("Anna", 28), (None, 34), ("Robert", None)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+------+----+
| Name| Age|
+------+----+
| James|NULL|
| Anna| 28|
| NULL| 34|
|Robert|NULL|
+------+----+
Example: Use isNull() to identify Null values
df.select('*', col('Name').isNull().alias('missingName'), col('Age').isNull().alias('missingAge')).show()
Output:
+------+----+-----------+----------+
| Name| Age|missingName|missingAge|
+------+----+-----------+----------+
| James|NULL| false| true|
| Anna| 28| false| false|
| NULL| 34| true| false|
|Robert|NULL| false| true|
+------+----+-----------+----------+
- select('*', ...): "*" tells select() to keep all columns of the df DataFrame.
- col('Name').isNull(): uses col('Name') to refer to the Name column and applies .isNull() to identify null values in that column.
- .alias('missingName'): gives the returned Boolean column the name "missingName".
- col('Age').isNull(): uses col('Age') to refer to the Age column and applies .isNull() to identify null values in that column.
- .alias('missingAge'): gives the returned Boolean column the name "missingAge".
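Example: Use isNull() with filter() to select rows with Null values
As mentioned under Usage, isNull() is often combined with filter() or where() to select the rows that actually contain nulls. The snippet below is a minimal sketch that reuses the df and col already defined above; the output shown is what the sample data produces.
# Select only the rows where the Age column is null
df.filter(col('Age').isNull()).show()
Expected output:
+------+----+
|  Name| Age|
+------+----+
| James|NULL|
|Robert|NULL|
+------+----+
Note that where() is an alias of filter(), so df.where(col('Age').isNull()) returns the same result.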
# Stop the Spark Session
spark.stop()