Handling Corrupt Values
Last updated on: 2025-05-30
When working with large datasets, encountering corrupt or null values is common and often unavoidable. Apache Spark provides flexible options to handle these cases efficiently when reading CSV files.
Example CSV with Null and Corrupt Values
Consider a sample CSV file named corruptData.csv:
Roll,Name,Final Marks,Float Marks,Double Marks
1,Ajay,300,55.5,
2,Bharghav,350,63.2,88.5
3,Chaitra,320,60.1,75.8
4,Kamal,360,75.0,82.3
5,Sohaib,gchbnv,70.8,90.6
Observations:
- Ajay’s Double Marks value is missing (null).
- Sohaib’s Final Marks is a non-numeric string, making it a corrupt entry.
Defining a Schema Explicitly
To correctly parse data types and manage corrupt/null values, define a schema:
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("Roll", IntegerType, nullable = true),
  StructField("Name", StringType, nullable = true),
  StructField("Final Marks", IntegerType, nullable = true),
  StructField("Float Marks", FloatType, nullable = true),
  StructField("Double Marks", DoubleType, nullable = true)
))
Now read the file with this schema:
val df = spark.read
  .schema(schema)
  .option("header", "true")
  .csv("csvFiles/corruptData.csv")

df.show()
When a CSV file containing corrupt and null values is read with an explicit schema, Spark's default read mode displays both the null values and the corrupt values as NULL.
Output
+----+--------+-----------+-----------+------------+
|Roll| Name|Final Marks|Float Marks|Double Marks|
+----+--------+-----------+-----------+------------+
| 1| Ajay| 300| 55.5| NULL|
| 2|Bharghav| 350| 63.2| 88.5|
| 3| Chaitra| 320| 60.1| 75.8|
| 4| Kamal| 360| 75.0| 82.3|
| 5| Sohaib| NULL| 70.8| 90.6|
+----+--------+-----------+-----------+------------+
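With only this default read, a genuinely missing value and a corrupt value are indistinguishable: both surface as NULL. A minimal check, assuming the df defined above and the col function from org.apache.spark.sql.functions:

import org.apache.spark.sql.functions.col

// Rows where either numeric column parsed to NULL -- Ajay's missing
// Double Marks and Sohaib's corrupt "gchbnv" both appear here.
df.filter(col("Final Marks").isNull || col("Double Marks").isNull).show()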
Capturing Corrupt Records
We can use option("mode", "PERMISSIVE") to identify and extract rows with corrupt data. Spark stores the raw text of each bad record in an additional column, which means we have to redefine the schema with that extra column.
val schema1 = StructType(Seq(
  StructField("Roll", IntegerType, nullable = true),
  StructField("Name", StringType, nullable = true),
  StructField("Final Marks", IntegerType, nullable = true),
  StructField("Float Marks", FloatType, nullable = true),
  StructField("Double Marks", DoubleType, nullable = true),
  StructField("Bad records", StringType, nullable = true) // holds the raw text of corrupt rows
))
val permissiveDf = spark.read
  .option("header", "true")
  .schema(schema1)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "Bad records")
  .csv("csvFiles/corruptData.csv")

permissiveDf.show()
Output
+----+--------+-----------+-----------+------------+-------------------------+
|Roll|Name |Final Marks|Float Marks|Double Marks|Bad records |
+----+--------+-----------+-----------+------------+-------------------------+
|1 |Ajay |300 |55.5 |NULL |NULL |
|2 |Bharghav|350 |63.2 |88.5 |NULL |
|3 |Chaitra |320 |60.1 |75.8 |NULL |
|4 |Kamal |360 |75.0 |82.3 |NULL |
|5 |Sohaib |NULL |70.8 |90.6 |5,Sohaib,gchbnv,70.8,90.6|
+----+--------+-----------+-----------+------------+-------------------------+
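To pull out just the rows that were flagged as corrupt, filter on the corrupt-record column. A minimal sketch against the permissiveDf above (the badRows name is ours; for aggregations that reference only this column, Spark may ask you to cache the parsed DataFrame first):

import org.apache.spark.sql.functions.col

// Only rows whose raw text was captured in the "Bad records" column,
// i.e. rows Spark could not parse against the declared schema.
val badRows = permissiveDf.filter(col("Bad records").isNotNull)
badRows.show(false)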
Dropping Corrupt Rows
To exclude rows with malformed data, we use option("mode", "DROPMALFORMED"), which omits the rows containing corrupt values.
val corruptDf = spark.read
  .option("header", "true")
  .schema(schema)
  .option("mode", "DROPMALFORMED")
  .csv("csvFiles/corruptData.csv")

corruptDf.show()
Output
+----+--------+-----------+-----------+------------+
|Roll| Name|Final Marks|Float Marks|Double Marks|
+----+--------+-----------+-----------+------------+
| 1| Ajay| 300| 55.5| NULL|
| 2|Bharghav| 350| 63.2| 88.5|
| 3| Chaitra| 320| 60.1| 75.8|
| 4| Kamal| 360| 75.0| 82.3|
+----+--------+-----------+-----------+------------+
Note: Nulls (like Ajay’s missing Double Marks) are preserved. Only corrupted rows (like Sohaib’s) are dropped.
Replacing or Handling Null Values
After reading the file into a DataFrame, you can use Spark’s built-in DataFrame functions to handle nulls. For example:
val filledDf = df.na.fill(Map(
  "Double Marks" -> 0.0,
  "Final Marks" -> 0
))

filledDf.show()
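Rows that still contain nulls can also be removed rather than filled. A minimal sketch using na.drop with the same column names (the cleanedDf name is ours):

// Drop any row with a null in either of the listed columns.
val cleanedDf = df.na.drop(Seq("Final Marks", "Double Marks"))
cleanedDf.show()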
To learn more about replacing and removing nulls, refer to the Spark Null Values article.
Summary
In this guide, we explored how to:
- Define a schema to correctly interpret null and corrupt data in CSV files.
- Identify corrupt records using PERMISSIVE mode.
- Drop malformed rows using DROPMALFORMED mode.
- Replace null values programmatically using DataFrame operations.
These techniques are crucial when working with messy, real-world data in Spark.