Handling Null Values in Spark DataFrames

Last updated on: 2025-05-30

When working with large datasets, encountering null values is almost inevitable. It's rare to find a DataFrame entirely free of them. These null values, while often overlooked, can significantly impact analysis results and lead to misinterpretation or ambiguity. Therefore, managing them effectively is crucial in any data pipeline.

Apache Spark provides several built-in methods to deal with null values in DataFrames. In this article, we’ll explore the most common techniques for handling them.

Let’s use the following sample DataFrame to demonstrate the different approaches:

+----+-------+-----------+-----------+------------+
|Roll|   Name|Final Marks|Float Marks|Double Marks|
+----+-------+-----------+-----------+------------+
|   1|   Ajay|        300|       null|       92.75|
|   2|   null|        350|       63.2|        88.5|
|null|Chaitra|        320|       60.1|        75.8|
|   4|  Kamal|       null|       75.0|        null|
|   5| Sohaib|        450|       null|        90.6|
+----+-------+-----------+-----------+------------+
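If you want to follow along, the sample DataFrame can be built explicitly with a schema whose columns are all nullable. The column types below (IntegerType, StringType, FloatType, DoubleType) are assumptions inferred from the table:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("NullHandlingDemo")
  .master("local[*]")
  .getOrCreate()

// Every field is nullable so rows can carry nulls in any column
val schema = StructType(Seq(
  StructField("Roll", IntegerType, nullable = true),
  StructField("Name", StringType, nullable = true),
  StructField("Final Marks", IntegerType, nullable = true),
  StructField("Float Marks", FloatType, nullable = true),
  StructField("Double Marks", DoubleType, nullable = true)
))

val data = Seq(
  Row(1, "Ajay", 300, null, 92.75),
  Row(2, null, 350, 63.2f, 88.5),
  Row(null, "Chaitra", 320, 60.1f, 75.8),
  Row(4, "Kamal", null, 75.0f, null),
  Row(5, "Sohaib", 450, null, 90.6)
)

val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.show()
```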

Dropping Rows with Null Values

  1. Drop Rows with All Fields Null

You can remove rows where all columns contain null values using na.drop("all"):

val allNull = df.na.drop("all") // drops rows in which every column is null

allNull.show()

Output

+----+-------+-----------+-----------+------------+
|Roll|   Name|Final Marks|Float Marks|Double Marks|
+----+-------+-----------+-----------+------------+
|   1|   Ajay|        300|       null|       92.75|
|   2|   null|        350|       63.2|        88.5|
|null|Chaitra|        320|       60.1|        75.8|
|   4|  Kamal|       null|       75.0|        null|
|   5| Sohaib|        450|       null|        90.6|
+----+-------+-----------+-----------+------------+

Since no rows in our DataFrame have all fields as null, this will return the original DataFrame unchanged.

  2. Drop Rows with Any Nulls

To eliminate rows that have at least one null value, use na.drop("any"):

val fewNull = df.na.drop("any") // drops rows that contain at least one null value

fewNull.show()

Output

+----+----+-----------+-----------+------------+
|Roll|Name|Final Marks|Float Marks|Double Marks|
+----+----+-----------+-----------+------------+
+----+----+-----------+-----------+------------+

Here, all rows contain at least one null value, so the resulting DataFrame is empty.
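Between these two extremes, na.drop also accepts a minimum count of non-null values a row must have to survive. A sketch against our sample data:

```scala
// Keep only rows with at least 4 non-null fields.
// In the sample, only Kamal's row has more than one null
// (3 non-null fields), so it should be the only row dropped.
val threshNull = df.na.drop(minNonNulls = 4)

threshNull.show()
```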

Dropping Rows Based on Nulls in Specific Columns

You can target specific columns when dropping nulls. For example:

val conditionNull = df.na.drop(
  Seq("Final Marks", "Float Marks")
) // drops rows where "Final Marks" or "Float Marks" is null

conditionNull.show() 

Output

+----+-------+-----------+-----------+------------+
|Roll|   Name|Final Marks|Float Marks|Double Marks|
+----+-------+-----------+-----------+------------+
|   2|   null|        350|       63.2|        88.5|
|null|Chaitra|        320|       60.1|        75.8|
+----+-------+-----------+-----------+------------+

This command removes rows where either "Final Marks" or "Float Marks" contains null values.
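The "how" argument can be combined with a column subset as well. With "all", a row is dropped only when every listed column is null, so on our sample data the result should be unchanged (no row has both columns null):

```scala
// Drop rows only if BOTH "Final Marks" AND "Float Marks" are null
val bothNull = df.na.drop("all", Seq("Final Marks", "Float Marks"))

bothNull.show()
```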

Replacing Null Values

  1. Replace All Nulls with a Single Value

You can replace all null values in numeric columns with a specific value using na.fill():

val fillNull = df.na.fill(0) // this will fill all the null values of numeric data type with 0

fillNull.show()

Output

+----+-------+-----------+-----------+------------+
|Roll|   Name|Final Marks|Float Marks|Double Marks|
+----+-------+-----------+-----------+------------+
|   1|   Ajay|        300|        0.0|       92.75|
|   2|   null|        350|       63.2|        88.5|
|   0|Chaitra|        320|       60.1|        75.8|
|   4|  Kamal|          0|       75.0|         0.0|
|   5| Sohaib|        450|        0.0|        90.6|
+----+-------+-----------+-----------+------------+

Note: Non-numeric nulls (like strings) are unaffected.

  2. Replace Nulls in a Specific Column

To replace nulls in just one column, pass the replacement value along with a sequence of column names:

val nameNull = df.na.fill("Unknown", 
  Seq("Name")
) // executing this command will replace the null value in "Name" with "Unknown"

nameNull.show()

Output

+----+-------+-----------+-----------+------------+
|Roll|   Name|Final Marks|Float Marks|Double Marks|
+----+-------+-----------+-----------+------------+
|   1|   Ajay|        300|       null|       92.75|
|   2|Unknown|        350|       63.2|        88.5|
|null|Chaitra|        320|       60.1|        75.8|
|   4|  Kamal|       null|       75.0|        null|
|   5| Sohaib|        450|       null|        90.6|
+----+-------+-----------+-----------+------------+

Replacing Nulls with Custom Values Per Column

To replace nulls with a different value per column, pass a Map to na.fill(), where each key is a column name and each value is the replacement to use for that column.

val multiNull = df.na.fill(
  Map("Final Marks" -> 0, 
    "Float Marks" -> 55.7F, 
    "Name" -> "Unknown")
) // replace null values in each column with a different value.

multiNull.show()

Output

+----+-------+-----------+-----------+------------+
|Roll|   Name|Final Marks|Float Marks|Double Marks|
+----+-------+-----------+-----------+------------+
|   1|   Ajay|        300|       55.7|       92.75|
|   2|Unknown|        350|       63.2|        88.5|
|null|Chaitra|        320|       60.1|        75.8|
|   4|  Kamal|          0|       75.0|        null|
|   5| Sohaib|        450|       55.7|        90.6|
+----+-------+-----------+-----------+------------+
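After a fill, it can be useful to verify which nulls remain. One way, assuming the multiNull DataFrame from above, is to count nulls per column with isNull:

```scala
import org.apache.spark.sql.functions.col

// Count remaining nulls in each column of the filled DataFrame.
// Only "Roll" and "Double Marks" were left out of the Map,
// so only they should still report nulls.
multiNull.columns.foreach { c =>
  val nulls = multiNull.filter(col(c).isNull).count()
  println(s"$c: $nulls null(s)")
}
```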

Summary

  • Null values are common in real-world data and must be addressed to maintain data accuracy.

  • Spark provides powerful options like na.drop() and na.fill() to either remove or substitute nulls.

  • You can choose to drop rows with nulls entirely, limit the removal to specific columns, or replace nulls with default values either globally or per column.

  • These operations are vital for preparing clean datasets suitable for analysis or modeling.
