Replacing Null Values
Last updated on: 2025-05-30
Null values in a DataFrame can disrupt data analysis and hinder accurate numerical computations. To ensure clean and reliable datasets, it is crucial to handle nulls properly. In this article, we explore various methods to replace null values in Apache Spark DataFrames.
Let's consider a student DataFrame containing null values:
+---+--------+-----+---------+
| ID| Name|Marks| Address|
+---+--------+-----+---------+
| 1| Ajay| 25| NULL|
| 2|Bharghav| NULL|Hyderabad|
| 3| Chaitra| 30| NULL|
| 4| NULL| 28| Bhopal|
| 5| Shohaib| NULL| Chennai|
+---+--------+-----+---------+
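The examples that follow assume a DataFrame named df built from this table. One possible way to construct it is sketched below; the SparkSession setup and the Option-based column types (Int for ID and Marks, String for Name and Address) are assumptions, not part of the original article.

```scala
import org.apache.spark.sql.SparkSession

// Assumed setup: a local SparkSession for experimentation.
val spark = SparkSession.builder()
  .appName("ReplaceNullValues")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Sample student data from the table above; None becomes NULL in the DataFrame.
val df = Seq(
  (1, Some("Ajay"),     Some(25), None),
  (2, Some("Bharghav"), None,     Some("Hyderabad")),
  (3, Some("Chaitra"),  Some(30), None),
  (4, None,             Some(28), Some("Bhopal")),
  (5, Some("Shohaib"),  None,     Some("Chennai"))
).toDF("ID", "Name", "Marks", "Address")

df.show()
```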
Method-1: Replacing Null Values in a Single Column
You can replace null values in a specific column with a constant value using na.fill().
val replaceSingleColumn = df.na.fill(0, Seq("Marks"))
replaceSingleColumn.show()
Output
+---+--------+-----+---------+
| ID| Name|Marks| Address|
+---+--------+-----+---------+
| 1| Ajay| 25| NULL|
| 2|Bharghav| 0|Hyderabad|
| 3| Chaitra| 30| NULL|
| 4| NULL| 28| Bhopal|
| 5| Shohaib| 0| Chennai|
+---+--------+-----+---------+
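As a side note (this variant goes beyond the example above, but follows Spark's DataFrameNaFunctions API): calling na.fill() with only a value and no column list applies it to every column of a compatible type.

```scala
// Sketch, reusing the df defined earlier: with no column list, na.fill(0)
// fills nulls in all numeric columns, and na.fill("Unknown") fills nulls
// in all string columns.
val fillAllNumeric = df.na.fill(0)
val fillAllStrings = fillAllNumeric.na.fill("Unknown")
fillAllStrings.show()
```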
Method-2: Replace Null Values in Multiple Columns
You can replace null values in multiple columns at once by passing a Map of column names to replacement values to na.fill().
val replaceMultipleColumns = df.na.fill(
  Map("Marks" -> 0, "Address" -> "Unknown")
)
replaceMultipleColumns.show()
Output
+---+--------+-----+---------+
| ID| Name|Marks| Address|
+---+--------+-----+---------+
| 1| Ajay| 25| Unknown|
| 2|Bharghav| 0|Hyderabad|
| 3| Chaitra| 30| Unknown|
| 4| NULL| 28| Bhopal|
| 5| Shohaib| 0| Chennai|
+---+--------+-----+---------+
Method-3: Replace using coalesce()
The coalesce() function returns the first non-null value from its list of inputs. It's handy for replacing nulls in a single column with a fallback value.
val replaceCoalesce = df.withColumn("Marks", coalesce(col("Marks"), lit(15)))
replaceCoalesce.show()
Output
+---+--------+-----+---------+
| ID| Name|Marks| Address|
+---+--------+-----+---------+
| 1| Ajay| 25| NULL|
| 2|Bharghav| 15|Hyderabad|
| 3| Chaitra| 30| NULL|
| 4| NULL| 28| Bhopal|
| 5| Shohaib| 15| Chennai|
+---+--------+-----+---------+
Method-4: Using when().otherwise()
For conditional replacements, use when().otherwise() to apply different logic when a value is null.
val dfUpdated = df
.withColumn("Name", when(col("Name").isNull, "Unknown")
.otherwise(col("Name")))
.withColumn("Marks", when(col("Marks").isNull, 15)
.otherwise(col("Marks")))
.withColumn("Address", when(col("Address").isNull, "Not Provided")
.otherwise(col("Address")))
dfUpdated.show()
Output
+---+--------+-----+------------+
| ID| Name|Marks| Address|
+---+--------+-----+------------+
| 1| Ajay| 25|Not Provided|
| 2|Bharghav| 15| Hyderabad|
| 3| Chaitra| 30|Not Provided|
| 4| Unknown| 28| Bhopal|
| 5| Shohaib| 15| Chennai|
+---+--------+-----+------------+
Summary
In this article, we explored different methods to replace null values in Spark DataFrames.
- We covered replacing nulls in a single column, in multiple columns using a map, with coalesce() for fallback values, and with conditional logic via when().otherwise().
- The right method depends on the complexity of the replacement logic: na.fill() is the most straightforward, while when().otherwise() offers the greatest flexibility.