Replacing Null Values

Last updated on: 2025-05-30

Null values in a DataFrame can disrupt data analysis and hinder accurate numerical computations. To ensure clean and reliable datasets, it is crucial to handle nulls properly. In this article, we explore various methods to replace null values in Apache Spark DataFrames.

Let's consider a student DataFrame containing null values:

+---+--------+-----+---------+
| ID|    Name|Marks|  Address|
+---+--------+-----+---------+
|  1|    Ajay|   25|     NULL|
|  2|Bharghav| NULL|Hyderabad|
|  3| Chaitra|   30|     NULL|
|  4|    NULL|   28|   Bhopal|
|  5| Shohaib| NULL|  Chennai|
+---+--------+-----+---------+
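If you want to follow along, a DataFrame like the one above can be built roughly as follows. This is a minimal sketch assuming a local SparkSession (in spark-shell, the `spark` variable already exists); an explicit schema is used so the Marks column can hold nulls.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("NullReplacement")
  .master("local[*]")
  .getOrCreate()

// Explicit schema: nullable Name, Marks, and Address columns.
val schema = StructType(Seq(
  StructField("ID", IntegerType, nullable = false),
  StructField("Name", StringType, nullable = true),
  StructField("Marks", IntegerType, nullable = true),
  StructField("Address", StringType, nullable = true)
))

val data = Seq(
  Row(1, "Ajay", 25, null),
  Row(2, "Bharghav", null, "Hyderabad"),
  Row(3, "Chaitra", 30, null),
  Row(4, null, 28, "Bhopal"),
  Row(5, "Shohaib", null, "Chennai")
)

val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.show()
```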

Method-1: Replacing Null Values in a Single Column

You can replace null values in a specific column with a constant value using na.fill().

val replaceSingleColumn = df.na.fill(0, Seq("Marks"))

replaceSingleColumn.show()

Output

+---+--------+-----+---------+
| ID|    Name|Marks|  Address|
+---+--------+-----+---------+
|  1|    Ajay|   25|     NULL|
|  2|Bharghav|    0|Hyderabad|
|  3| Chaitra|   30|     NULL|
|  4|    NULL|   28|   Bhopal|
|  5| Shohaib|    0|  Chennai|
+---+--------+-----+---------+

Method-2: Replace Null Values in Multiple Columns

You can replace null values in several columns at once by passing na.fill() a Map from column names to their replacement values.

val replaceMultipleColumns = df.na.fill(
  Map("Marks" -> 0, "Address" -> "Unknown")
)

replaceMultipleColumns.show()

Output

+---+--------+-----+---------+
| ID|    Name|Marks|  Address|
+---+--------+-----+---------+
|  1|    Ajay|   25|  Unknown|
|  2|Bharghav|    0|Hyderabad|
|  3| Chaitra|   30|  Unknown|
|  4|    NULL|   28|   Bhopal|
|  5| Shohaib|    0|  Chennai|
+---+--------+-----+---------+

Method-3: Replace using coalesce()

The coalesce() function returns the first non-null value among its arguments. Combined with lit() to supply a constant fallback, it is handy for replacing nulls in a single column (coalesce, col, and lit all come from org.apache.spark.sql.functions).

val replaceCoalesce = df.withColumn("Marks", coalesce(col("Marks"), lit(15)))

replaceCoalesce.show()

Output

+---+--------+-----+---------+
| ID|    Name|Marks|  Address|
+---+--------+-----+---------+
|  1|    Ajay|   25|     NULL|
|  2|Bharghav|   15|Hyderabad|
|  3| Chaitra|   30|     NULL|
|  4|    NULL|   28|   Bhopal|
|  5| Shohaib|   15|  Chennai|
+---+--------+-----+---------+

Method-4: Replace using when().otherwise()

For conditional replacements, use when().otherwise() to apply one value when a column is null and keep the original value otherwise.

val dfUpdated = df
  .withColumn("Name", when(col("Name").isNull, "Unknown")
    .otherwise(col("Name")))
  .withColumn("Marks", when(col("Marks").isNull, 15)
    .otherwise(col("Marks")))
  .withColumn("Address", when(col("Address").isNull, "Not Provided")
    .otherwise(col("Address")))

dfUpdated.show()

Output

+---+--------+-----+------------+
| ID|    Name|Marks|     Address|
+---+--------+-----+------------+
|  1|    Ajay|   25|Not Provided|
|  2|Bharghav|   15|   Hyderabad|
|  3| Chaitra|   30|Not Provided|
|  4| Unknown|   28|      Bhopal|
|  5| Shohaib|   15|     Chennai|
+---+--------+-----+------------+

Summary

In this article, we explored different methods to replace null values in Spark DataFrames.

  • We covered replacing nulls in a single column, multiple columns using a map, using coalesce() for fallback values, and applying conditional logic with when().otherwise().

  • Each method is useful based on the complexity of the replacement logic, with na.fill() being straightforward and when().otherwise() offering greater flexibility.
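As a closing note, when the replacements are simple constants like those above, the map-based na.fill() from Method-2 can cover all three columns in a single call and produce the same result as the when().otherwise() chain (a sketch, assuming the same df as above):

```scala
// One na.fill() call covering all three nullable columns.
val dfFilled = df.na.fill(
  Map("Name" -> "Unknown", "Marks" -> 15, "Address" -> "Not Provided")
)
dfFilled.show()
```

when().otherwise() remains the better choice when the replacement depends on other columns or on conditions beyond a simple null check.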
