File Parsing

JSON files are a common format for storing and exchanging structured data. However, when working with large or manually written JSON datasets, data might not always follow strict JSON standards. Spark provides several configuration options that allow us to read such malformed or non-standard JSON files seamlessly.

In this article, we’ll explore some key configuration options available in Spark for reading JSON files that have issues like single quotes, unquoted field names, or comments.

1. Reading JSON files with field names in single quotes (' ')

In standard JSON, field names must be enclosed in double quotes ("). However, sometimes we encounter JSON files where field names use single quotes ('). Spark throws an error unless we explicitly allow single quotes.

Example: singleQuote.json

[
  {
    "Roll": 1,
    'Name': "Ajay",
    "Marks": 55
  },
  {
    'Roll': 2,
    "Name": "Bharghav",
    "Marks": 63
  },
...

Without allowing single quotes:

Let us see what happens when allowSingleQuotes is set to false.

val stdMarks = spark.read
  .option("multiline", "true")
  .option("inferSchema", "true")
  .option("allowSingleQuotes", "false")
  .json("jsonFiles/singleQuote.json")

stdMarks.show()

Output

the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column

As discussed, Spark throws an error. Now let's see what happens if allowSingleQuotes is set to true.

With allowSingleQuotes = true:

val stdMarks = spark.read
  .option("multiline", "true")
  .option("inferSchema", "true")
  .option("allowSingleQuotes", "true")
  .json("jsonFiles/singleQuote.json")

stdMarks.show()

Output

+-----+--------+----+
|Marks|    Name|Roll|
+-----+--------+----+
|   55|    Ajay|   1|
|   63|Bharghav|   2|
|   60| Chaitra|   3|
|   75|   Kamal|   4|
|   70|  Sohaib|   5|
+-----+--------+----+

Now the JSON file is parsed correctly. It is safer to keep allowSingleQuotes enabled whenever we deal with large or hand-written datasets that may mix quoting styles.
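
Once the file parses, it is also worth verifying the inferred schema. A quick sanity check (the commented output assumes Spark inferred the numeric columns as long, which is its usual choice for JSON integers):

// Sanity-check the schema Spark inferred for the parsed file.
stdMarks.printSchema()

// root
//  |-- Marks: long (nullable = true)
//  |-- Name: string (nullable = true)
//  |-- Roll: long (nullable = true)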

2. Reading JSON files with unquoted field names

This is a common issue: file authors sometimes forget to enclose field names in quotes. To parse such files properly, we use the allowUnquotedFieldNames option, which tolerates the missing quotes and parses the JSON file.

Example: unQuoted.json

[
  {
    "Roll": 1,
     Name: "Ajay",
    "Marks": 55
  },
  {
     Roll: 2,
    "Name": "Bharghav",
    "Marks": 63
  },
...

Without allowing unquoted field names:

val unquote = spark.read
  .option("multiline", "true")
  .option("inferSchema", "true")
  .json("jsonFiles/unQuoted.json")

unquote.show()

Executing the above Spark command throws the following error:

the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column

With allowUnquotedFieldNames = true:

Let's see the result after enabling allowUnquotedFieldNames.

val unquote = spark.read
  .option("multiline", "true")
  .option("inferSchema", "true")
  .option("allowUnquotedFieldNames", "true")
  .json("jsonFiles/unQuoted.json")

unquote.show()

Output

+-----+--------+----+
|Marks|    Name|Roll|
+-----+--------+----+
|   55|    Ajay|   1|
|   63|Bharghav|   2|
|   60| Chaitra|   3|
|   75|   Kamal|   4|
|   70|  Sohaib|   5|
+-----+--------+----+

3. Reading JSON files containing comments

Standard JSON does not support comments. However, in practice, developers often add inline comments for clarification. Spark will fail when it encounters comments unless we enable the appropriate option.

Example: commentMarks.json

[
  {
    "Roll": 1,
    "Name": "Ajay",
    "Marks": 55   //These is the lowest score
  },
  {
    "Roll": 2,
    "Name": "Bharghav",
    "Marks": 63
  },
...

In some cases, the authors of JSON files write comments about the data to help others understand it better. But when we try to read such a file, we get an error.

Without allowing comments:

Let's see what error Spark gives when we read data with comments in it.

val commentMarks = spark.read
  .option("multiline", "true")
  .option("inferSchema", "true")
  .json("jsonFiles/commentMarks.json")

commentMarks.show()

Output

the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column

To handle this situation, we use the allowComments option, which ignores the comments and reads the data.

With allowComments = true:

Let's see the output when allowComments is set to true.

val commentMarks = spark.read
  .option("multiline", "true")
  .option("inferSchema", "true")
  .option("allowComments", "true")
  .json("jsonFiles/commentMarks.json")

commentMarks.show()

Output

+-----+--------+----+
|Marks|    Name|Roll|
+-----+--------+----+
|   55|    Ajay|   1|
|   63|Bharghav|   2|
|   60| Chaitra|   3|
|   75|   Kamal|   4|
|   70|  Sohaib|   5|
+-----+--------+----+

When allowComments is set to true, Spark ignores the comments and reads the JSON file. It is advisable to enable allowComments when dealing with hand-written files that may contain comments.
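
Note that allowComments tolerates both // line comments and /* ... */ block comments, since Spark's JSON reader delegates to the Jackson parser, whose ALLOW_COMMENTS feature accepts both styles. A minimal sketch (using a hypothetical inline record read through a Dataset[String] instead of a file):

import spark.implicits._

// Hypothetical record containing a /* block */ comment.
val raw = Seq("""{"Roll": 6, "Name": "Divya", /* mid-year entry */ "Marks": 68}""").toDS()
val extra = spark.read.option("allowComments", "true").json(raw)

extra.show()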

Summary

In this article, we have looked at:

  • Common issues faced while reading a JSON file.
  • Reading JSON files whose field names are enclosed in single quotes (' ') or not quoted at all.
  • Reading JSON files when there are comments in the files.

These options are particularly useful when working with large or manually written JSON files that may not follow strict JSON formatting rules.
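
If a single file mixes several of these issues, the options can be chained on one reader. A sketch under that assumption (messyMarks.json is a hypothetical file combining single-quoted names, unquoted names, and comments):

// Hypothetical file combining all three issues discussed above.
val messy = spark.read
  .option("multiline", "true")
  .option("allowSingleQuotes", "true")
  .option("allowUnquotedFieldNames", "true")
  .option("allowComments", "true")
  .json("jsonFiles/messyMarks.json")

messy.show()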
