Parsing Null & Corrupt Values

Last updated on: 2025-05-30

When working with large datasets, JSON is a common format for data exchange. However, real-world data is often messy, containing null values or even corrupted records. These inconsistencies can cause a Spark job to fail or lead to incorrect parsing, ultimately affecting downstream analysis.

In this article, we explore Spark’s built-in configurations to efficiently parse JSON files with null or corrupted values without compromising data integrity.

1. Parsing null and corrupted values in JSON files

Let’s consider a file corruptedValues.json:

[
  {
    "Roll": 1,
    "Name": "Ajay",
    "Marks": "fifty five"
  },
  {
    "Roll": 2,
    "Name": "Bharghav",
    "Marks": 63
  },
...

Here, the Marks field contains inconsistent data types (a mix of string and integer). By default, Spark uses the PERMISSIVE mode when reading JSON files, which allows it to parse such records by converting all values to a compatible type (usually StringType in case of mixed types).

val df = spark.read
  .option("multiLine", "true")
  .option("mode", "PERMISSIVE")
  .json("jsonFiles/corruptedValues.json")

df.show()

Output

+----------+--------+----+
|     Marks|    Name|Roll|
+----------+--------+----+
|fifty five|    Ajay|   1|
|        63|Bharghav|   2|
|     sixty| Chaitra|   3|
|        75|   Kamal|   4|
|        70|  Sohaib|   5|
+----------+--------+----+

Notably, PERMISSIVE is also Spark's default mode, so the file is read successfully even without setting the option explicitly. The drawback is that the data type changes while parsing.

df.printSchema()

Output

root
 |-- Marks: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Roll: long (nullable = true)

While this mode ensures the file is read successfully, it may compromise data type consistency, which can affect numeric computations.
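
If numeric computations are needed downstream, one option is to cast the column back after the permissive read. The snippet below is a minimal sketch, not part of the original example; it assumes the DataFrame df from above, and values that cannot be parsed as numbers (such as "fifty five") simply become null instead of failing the job.

import org.apache.spark.sql.functions.col

// Cast Marks back to an integer type; non-numeric strings like "fifty five"
// become null rather than raising an error.
val typedDf = df.withColumn("Marks", col("Marks").cast("int"))

typedDf.printSchema()
typedDf.show()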

2. Ignoring columns with all null values

Now consider another file, allNullValues.json, where the city field is null for all records:

[
  {
    "Roll": 1,
    "Name": "Ajay",
    "Marks": 55,
    "city": null
  },
  {
    "Roll": 2,
    "Name": "Bharghav",
    "Marks": 63,
    "city": null
  },
...

By default, Spark reads all columns, including those that are entirely null:

val allNullValues = spark.read
  .option("multiLine", "true")
  .json("jsonFiles/allNullValues.json")

allNullValues.show()
allNullValues.printSchema()

If we execute the above Spark code, the all-null column is still read and displayed, occupying memory without adding any value.

Output

+-----+--------+----+----+
|Marks|    Name|Roll|city|
+-----+--------+----+----+
|   55|    Ajay|   1|null|
|   63|Bharghav|   2|null|
|   60| Chaitra|   3|null|
|   75|   Kamal|   4|null|
|   70|  Sohaib|   5|null|
+-----+--------+----+----+

root
 |-- Marks: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Roll: long (nullable = true)
 |-- city: string (nullable = true)

To drop such unnecessary columns, use the built-in dropFieldIfAllNull option:

val ignoreNullValues = spark.read
  .option("multiLine", "true")
  .option("dropFieldIfAllNull", "true")
  .json("jsonFiles/allNullValues.json")

ignoreNullValues.show()
ignoreNullValues.printSchema()

With this option set to true, the city column is dropped from the resulting DataFrame.

Output

+-----+--------+----+
|Marks|    Name|Roll|
+-----+--------+----+
|   55|    Ajay|   1|
|   63|Bharghav|   2|
|   60| Chaitra|   3|
|   75|   Kamal|   4|
|   70|  Sohaib|   5|
+-----+--------+----+

root
 |-- Marks: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Roll: long (nullable = true)

Summary

This article demonstrated how Spark can help gracefully handle anomalies in JSON files:

  • PERMISSIVE mode allows parsing of corrupted records without failing the job.

  • dropFieldIfAllNull configuration eliminates completely null columns to optimize memory and processing.

By understanding and applying these options, developers can build more resilient and efficient Spark data pipelines that are tolerant to dirty or inconsistent data.
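
As a quick recap, both options can be combined in a single read. The following is a small sketch assuming the same file layout used above; the DataFrame name cleanedDf is illustrative.

val cleanedDf = spark.read
  .option("multiLine", "true")
  .option("mode", "PERMISSIVE")           // tolerate records with inconsistent types
  .option("dropFieldIfAllNull", "true")   // drop columns that are null in every record
  .json("jsonFiles/allNullValues.json")

cleanedDf.show()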
