File Value Parsing
Last updated on: 2025-05-30
When working with JSON data in Apache Spark, it’s not just the field names that need attention; the values themselves can present challenges. Especially in large or messy datasets, you may encounter formatting issues, unusual numerical values, or inconsistent data types.
Thankfully, Spark provides several configuration options to help you control how JSON data is parsed.
In this article, let’s look at each of these configurable options.
1. Parsing records with leading zeros ('0')
Problem
Here’s a sample JSON file that contains roll numbers with leading zeros:
[
{
"Roll": 001,
"Name": "Ajay",
"Marks": 55
},
{
"Roll": 002,
"Name": "Bharghav",
"Marks": 63
},
...
In standard JSON, numeric values cannot start with a leading zero. If you try to load this file using Spark:
val leadZero = spark.read
.option("multiline", "true")
.option("inferSchema", "true")
.json("jsonFiles/leadZero.json")
leadZero.show()
Output
the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column
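If you want to see what Spark considered malformed instead of failing the query outright, one workaround (a minimal sketch, assuming the same file path and the default corrupt-record column name) is to cache the parsed result and then inspect the internal _corrupt_record column:
// Sketch: caching the DataFrame allows querying _corrupt_record on its own,
// which is otherwise disallowed for raw JSON reads.
val rawLeadZero = spark.read
  .option("multiLine", "true")
  .json("jsonFiles/leadZero.json")
rawLeadZero.cache()
rawLeadZero.select("_corrupt_record").show(truncate = false)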
Solution
To handle this, enable the allowNumericLeadingZeros option:
val leadZero = spark.read
.option("multiline", "true")
.option("inferSchema", "true")
.option("allowNumericLeadingZeros", "true")
.json("jsonFiles/leadZero.json")
leadZero.show()
Output
+-----+--------+----+
|Marks| Name|Roll|
+-----+--------+----+
| 55| Ajay| 1|
| 63|Bharghav| 2|
| 60| Chaitra| 3|
| 75| Kamal| 4|
| 70| Sohaib| 5|
+-----+--------+----+
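Note that the leading zeros are gone because Roll is now inferred as a numeric (long) column, so 001 is stored as 1. A quick check (sketch):
// Sketch: confirm the inferred types after enabling allowNumericLeadingZeros.
leadZero.printSchema()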
2. Handling escape characters in strings
Problem
In JSON, special characters like double quotes and tabs are escaped using a backslash (\).
Consider the following JSON file, backslash.json:
[
{
"Roll": 1,
"Name": "Ajay",
"Marks": 55,
"Dialogue": "I am in a \"Safe\" room"
},
{
"Roll": 2,
"Name": "Bharghav",
"Marks": 63,
"Dialogue": "Please don't \tgo there"
},
...
Without proper configuration, Spark might fail to interpret these escape characters correctly.
Solution
Use the allowBackslashEscapingAnyCharacter option:
val backslash = spark.read
.option("multiLine", "true")
.option("allowBackslashEscapingAnyCharacter", "true")
.json("jsonFiles/backslash.json")
backslash.show(truncate = false)
Output
+----------------------------+-----+--------+----+
|Dialogue |Marks|Name |Roll|
+----------------------------+-----+--------+----+
|I am in a "Safe" room |55 |Ajay |1 |
|Please don't go there |63 |Bharghav|2 |
|The world is so "beautiful!"|60 |Chaitra |3 |
|I like ice-cream |75 |Kamal |4 |
|Let's go on a vacation |70 |Sohaib |5 |
+----------------------------+-----+--------+----+
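Note that \" and \t are already legal JSON escapes; where this option matters most is for escapes that standard JSON rejects, such as \'. A minimal sketch, assuming a hypothetical file nonStandardEscape.json containing a string like "It\'s fine":
// Sketch: without allowBackslashEscapingAnyCharacter, \' is rejected as an
// invalid escape and the record is treated as corrupt; with it, the string parses.
val nonStandard = spark.read
  .option("multiLine", "true")
  .option("allowBackslashEscapingAnyCharacter", "true")
  .json("jsonFiles/nonStandardEscape.json")
nonStandard.show(truncate = false)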
3. Handling non-numeric values
Problem
Sometimes JSON data includes non-numeric numbers like NaN, Infinity, or -Infinity. These aren’t valid numeric values in standard JSON. To handle this case, we will use allowNonNumericNumbers and set it to true.
Consider the following JSON file, nonNum.json:
[
{
"Roll": 1,
"Name": "Ajay",
"Marks": 55
},
{
"Roll": 2,
"Name": "Bharghav",
"Marks": NaN
},
...
Trying to load this with the default setting, allowNonNumericNumbers = false, will give an error.
val nonNum = spark.read
.option("multiLine", "true")
.option("inferSchema", "true")
.option("allowNonNumericNumbers", "false")
.json("jsonFiles/nonNum.json")
nonNum.show()
Output
the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column
Solution
Enable the allowNonNumericNumbers option:
val nonNum = spark.read
.option("multiLine", "true")
.option("inferSchema", "true")
.option("allowNonNumericNumbers", "true")
.json("jsonFiles/nonNum.json")
nonNum.show()
Output
+--------+--------+----+
| Marks| Name|Roll|
+--------+--------+----+
| 55.0| Ajay| 1|
| NaN|Bharghav| 2|
|Infinity| Chaitra| 3|
| NaN| Kamal| 4|
| 70.0| Sohaib| 5|
+--------+--------+----+
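Once parsed, NaN and Infinity values can propagate through aggregations such as averages, so you may want to filter them out before analysis. A minimal sketch:
import org.apache.spark.sql.functions.{col, isnan}

// Sketch: keep only rows whose Marks value is a finite number
// (add a similar check for Double.NegativeInfinity if your data contains it).
val finiteMarks = nonNum.filter(!isnan(col("Marks")) && col("Marks") =!= Double.PositiveInfinity)
finiteMarks.show()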
4. Dealing with float values
There are many scenarios where we need to record precise values, e.g., in product design.
Problem
By default, Spark reads decimal numbers as Double type. This can lead to rounding issues when high precision is needed.
Consider the following JSON file, floatMarks.json:
[
{
"Roll": 1,
"Name": "Ajay",
"Marks": 55.875454
},
{
"Roll": 2,
"Name": "Bharghav",
"Marks": 63.25451
},
...
By default, Spark infers the schema as:
val floatMarks = spark.read
.option("multiLine", "true")
.option("inferSchema", "true")
.json("jsonFiles/floatMarks.json")
floatMarks.printSchema()
Output
root
|-- Marks: double (nullable = true)
|-- Name: string (nullable = true)
|-- Roll: long (nullable = true)
Solution
Spark provides the prefersDecimal option, which tells schema inference to read floating-point values as a decimal type (when they fit) instead of double, preserving the original precision.
When prefersDecimal is set to true, the Marks values are read as Decimal type.
val floatMarks = spark.read
.option("multiLine", "true")
.option("inferSchema", "true")
.option("prefersDecimal", "true")
.json("jsonFiles/floatMarks.json")
floatMarks.printSchema()
Output
root
|-- Marks: decimal(11,9) (nullable = true)
|-- Name: string (nullable = true)
|-- Roll: long (nullable = true)
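If you already know the precision you need, another approach (a sketch, with a hypothetical precision and scale) is to skip inference and declare a DecimalType column explicitly:
import org.apache.spark.sql.types._

// Sketch: an explicit schema pins Marks to a fixed precision and scale
// (10 digits in total, 6 after the decimal point) instead of relying on inference.
val marksSchema = StructType(Seq(
  StructField("Roll", LongType),
  StructField("Name", StringType),
  StructField("Marks", DecimalType(10, 6))
))

val floatMarksExplicit = spark.read
  .option("multiLine", "true")
  .schema(marksSchema)
  .json("jsonFiles/floatMarks.json")
floatMarksExplicit.printSchema()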
Summary
In this article, you learned:
- JSON files may contain non-standard formats like numbers with leading zeros, escape characters, or special values like NaN and Infinity.
- Spark provides configurable options such as allowNumericLeadingZeros, allowBackslashEscapingAnyCharacter, allowNonNumericNumbers, and prefersDecimal to handle such cases.
- These options help improve the accuracy and flexibility of data parsing, especially when dealing with large or inconsistent datasets.
Proper configuration of these options ensures that Spark reads JSON files more reliably, helping to prevent data corruption and improving downstream processing in real-world scenarios.
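To wrap up, here is a minimal sketch (with a hypothetical messy.json) that combines all four options in a single read:
// Sketch: all four JSON parsing options applied together.
val messy = spark.read
  .option("multiLine", "true")
  .option("inferSchema", "true")
  .option("allowNumericLeadingZeros", "true")
  .option("allowBackslashEscapingAnyCharacter", "true")
  .option("allowNonNumericNumbers", "true")
  .option("prefersDecimal", "true")
  .json("jsonFiles/messy.json") // hypothetical file combining the issues above
messy.show()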