File Value Parsing
Last updated on: 2025-05-30
When working with JSON data in Apache Spark, it’s not just the field names that need attention; the values themselves can present challenges. Especially in large or messy datasets, you may encounter formatting issues, unusual numerical values, or inconsistent data types.
Thankfully, Spark provides several configuration options to help you control how JSON data is parsed.
In this article, let’s look at each of these configurable options.
1. Parsing records with leading zeros ('0')
Problem
Here’s a sample JSON file that contains roll numbers with leading zeros:
[
{
"Roll": 001,
"Name": "Ajay",
"Marks": 55
},
{
"Roll": 002,
"Name": "Bharghav",
"Marks": 63
},
...
In standard JSON, numeric values cannot start with a leading zero. If you try to load this file using Spark:
val leadZero = spark.read
.option("multiline", "true")
.option("inferSchema", "true")
.json("jsonFiles/leadZero.json")
leadZero.show()
Output
the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column
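If you want to see what Spark considered malformed instead of failing the query outright, one workaround (a minimal sketch, assuming the same file path and the default corrupt-record column name) is to cache the parsed result and then inspect the internal _corrupt_record column:
// Sketch: caching the DataFrame allows querying _corrupt_record on its own,
// which is otherwise disallowed for raw JSON reads.
val rawLeadZero = spark.read
  .option("multiLine", "true")
  .json("jsonFiles/leadZero.json")
rawLeadZero.cache()
rawLeadZero.select("_corrupt_record").show(truncate = false)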
Solution
To handle this, enable the allowNumericLeadingZeros option:
val leadZero = spark.read
.option("multiline", "true")
.option("inferSchema", "true")
.option("allowNumericLeadingZeros", "true")
.json("jsonFiles/leadZero.json")
leadZero.show()
Output
+-----+--------+----+
|Marks| Name|Roll|
+-----+--------+----+
| 55| Ajay| 1|
| 63|Bharghav| 2|
| 60| Chaitra| 3|
| 75| Kamal| 4|
| 70| Sohaib| 5|
+-----+--------+----+
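Note that the leading zeros are gone because Roll is now inferred as a numeric (long) column, so 001 is stored as 1. A quick check (sketch):
// Sketch: confirm the inferred types after enabling allowNumericLeadingZeros.
leadZero.printSchema()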
2. Handling escape characters in strings
Problem
In JSON, special characters like double quotes and tabs are escaped using a backslash (\).
Consider the following JSON file, backslash.json:
[
{
"Roll": 1,
"Name": "Ajay",
"Marks": 55,
"Dialogue": "I am in a \"Safe\" room"
},
{
"Roll": 2,
"Name": "Bharghav",
"Marks": 63,
"Dialogue": "Please don't \tgo there"
},
...
Without proper configuration, Spark might fail to interpret these escape characters correctly.
Solution
Use the allowBackslashEscapingAnyCharacter option:
val backslash = spark.read
.option("multiLine", "true")
.option("allowBackslashEscapingAnyCharacter", "true")
.json("jsonFiles/backslash.json")
backslash.show(truncate = false)
Output
+----------------------------+-----+--------+----+
|Dialogue |Marks|Name |Roll|
+----------------------------+-----+--------+----+
|I am in a "Safe" room |55 |Ajay |1 |
|Please don't go there |63 |Bharghav|2 |
|The world is so "beautiful!"|60 |Chaitra |3 |
|I like ice-cream |75 |Kamal |4 |
|Let's go on a vacation |70 |Sohaib |5 |
+----------------------------+-----+--------+----+
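Note that \" and \t are already legal JSON escapes; where this option matters most is for escapes that standard JSON rejects, such as \'. A minimal sketch, assuming a hypothetical file nonStandardEscape.json containing a string like "It\'s fine":
// Sketch: without allowBackslashEscapingAnyCharacter, \' is rejected as an
// invalid escape and the record is treated as corrupt; with it, the string parses.
val nonStandard = spark.read
  .option("multiLine", "true")
  .option("allowBackslashEscapingAnyCharacter", "true")
  .json("jsonFiles/nonStandardEscape.json")
nonStandard.show(truncate = false)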
3. Handling non-numeric values
Problem
Sometimes JSON data includes non-numeric numbers like NaN, Infinity, or -Infinity. These aren’t valid numeric values in standard JSON. To handle this case, we will use allowNonNumericNumbers and set it to true.
Consider the following JSON file, nonNum.json:
[
{
"Roll": 1,
"Name": "Ajay",
"Marks": 55
},
{
"Roll": 2,
"Name": "Bharghav",
"Marks": NaN
},
...
Trying to load this with the default setting, allowNonNumericNumbers = false, will give an error.
val nonNum = spark.read
.option("multiLine", "true")
.option("inferSchema", "true")
.option("allowNonNumericNumbers", "false")
.json("jsonFiles/nonNum.json")
nonNum.show()
Output
the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column
Solution
Enable the allowNonNumericNumbers option:
val nonNum = spark.read
.option("multiLine", "true")
.option("inferSchema", "true")
.option("allowNonNumericNumbers", "true")
.json("jsonFiles/nonNum.json")
nonNum.show()
Output
+--------+--------+----+
| Marks| Name|Roll|
+--------+--------+----+
| 55.0| Ajay| 1|
| NaN|Bharghav| 2|
|Infinity| Chaitra| 3|
| NaN| Kamal| 4|
| 70.0| Sohaib| 5|
+--------+--------+----+
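Once parsed, NaN and Infinity values can propagate through aggregations such as averages, so you may want to filter them out before analysis. A minimal sketch:
import org.apache.spark.sql.functions.{col, isnan}

// Sketch: keep only rows whose Marks value is a finite number
// (add a similar check for Double.NegativeInfinity if your data contains it).
val finiteMarks = nonNum.filter(!isnan(col("Marks")) && col("Marks") =!= Double.PositiveInfinity)
finiteMarks.show()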
4. Dealing with float values
There are many scenarios where we need to record precise values, e.g., in product design.
Problem
By default, Spark reads decimal numbers as Double type. This can lead to rounding issues when high precision is needed.
Consider the following JSON file, floatMarks.json:
[
{
"Roll": 1,
"Name": "Ajay",
"Marks": 55.875454
},
{
"Roll": 2,
"Name": "Bharghav",
"Marks": 63.25451
},
...
By default, Spark infers the schema as:
val floatMarks = spark.read
.option("multiLine", "true")
.option("inferSchema", "true")
.json("jsonFiles/floatMarks.json")
floatMarks.printSchema()
Output
root
|-- Marks: double (nullable = true)
|-- Name: string (nullable = true)
|-- Roll: long (nullable = true)
Solution
Spark provides the prefersDecimal option, which tells schema inference to read floating-point values as a decimal type (when they fit) instead of double, preserving the original precision.
When prefersDecimal is set to true, the Marks values are read as Decimal type.
val floatMarks = spark.read
.option("multiLine", "true")
.option("inferSchema", "true")
.option("prefersDecimal", "true")
.json("jsonFiles/floatMarks.json")
floatMarks.printSchema()
Output
root
|-- Marks: decimal(11,9) (nullable = true)
|-- Name: string (nullable = true)
|-- Roll: long (nullable = true)
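If you already know the precision you need, another approach (a sketch, with a hypothetical precision and scale) is to skip inference and declare a DecimalType column explicitly:
import org.apache.spark.sql.types._

// Sketch: an explicit schema pins Marks to a fixed precision and scale
// (10 digits in total, 6 after the decimal point) instead of relying on inference.
val marksSchema = StructType(Seq(
  StructField("Roll", LongType),
  StructField("Name", StringType),
  StructField("Marks", DecimalType(10, 6))
))

val floatMarksExplicit = spark.read
  .option("multiLine", "true")
  .schema(marksSchema)
  .json("jsonFiles/floatMarks.json")
floatMarksExplicit.printSchema()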
Summary
In this article, you learned:
- JSON files may contain non-standard formats like numbers with leading zeros, escape characters, or special values like NaN and Infinity.
- Spark provides configurable options such as allowNumericLeadingZeros, allowBackslashEscapingAnyCharacter, allowNonNumericNumbers, and prefersDecimal to handle such cases.
- These options help improve the accuracy and flexibility of data parsing, especially when dealing with large or inconsistent datasets.
Proper configuration of these options ensures that Spark reads JSON files more reliably, helping to prevent data corruption and improving downstream processing in real-world scenarios.
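To wrap up, here is a minimal sketch (with a hypothetical messy.json) that combines all four options in a single read:
// Sketch: all four JSON parsing options applied together.
val messy = spark.read
  .option("multiLine", "true")
  .option("inferSchema", "true")
  .option("allowNumericLeadingZeros", "true")
  .option("allowBackslashEscapingAnyCharacter", "true")
  .option("allowNonNumericNumbers", "true")
  .option("prefersDecimal", "true")
  .json("jsonFiles/messy.json") // hypothetical file combining the issues above
messy.show()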