File Parsing
Last updated on: 2025-05-30
JSON files are a common format for storing and exchanging structured data. However, when working with large or manually written JSON datasets, data might not always follow strict JSON standards. Spark provides several configuration options that allow us to read such malformed or non-standard JSON files seamlessly.
In this article, we’ll explore some key configuration options available in Spark for reading JSON files that have issues like single quotes, unquoted field names, or comments.
1. Reading JSON files with field names in single quotes (' ')
In standard JSON, field names must be enclosed in double quotes ("). However, sometimes we encounter JSON files where field names use single quotes ('). Spark's allowSingleQuotes option controls whether such files are accepted; when it is set to false, Spark throws an error on these files.
Example: singleQuote.json
[
  {
    "Roll": 1,
    'Name': "Ajay",
    "Marks": 55
  },
  {
    'Roll': 2,
    "Name": "Bharghav",
    "Marks": 63
  },
  ...
Without allowing single quotes:
Let us see what happens when allowSingleQuotes is set to false.
val stdMarks = spark.read
  .option("multiline", "true")
  .option("inferSchema", "true")
  .option("allowSingleQuotes", "false")
  .json("jsonFiles/singleQuote.json")
stdMarks.show()
Output
the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column
As discussed, Spark throws an error. Now let's see what happens if allowSingleQuotes is set to true.
With allowSingleQuotes = true:
val stdMarks = spark.read
  .option("multiline", "true")
  .option("inferSchema", "true")
  .option("allowSingleQuotes", "true")
  .json("jsonFiles/singleQuote.json")
stdMarks.show()
Output
+-----+--------+----+
|Marks| Name|Roll|
+-----+--------+----+
| 55| Ajay| 1|
| 63|Bharghav| 2|
| 60| Chaitra| 3|
| 75| Kamal| 4|
| 70| Sohaib| 5|
+-----+--------+----+
Now the JSON file is parsed correctly. It is safer to keep allowSingleQuotes set to true whenever the source files may contain single-quoted field names.
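Besides relaxing the parser, we can also surface the rows Spark could not parse. The sketch below is a hedged illustration, assuming the same singleQuote.json file and an in-scope spark session: supplying an explicit schema that includes Spark's corrupt-record column (named _corrupt_record by default) makes malformed input land in that column instead of failing the query, and caching before selecting only that column avoids the error shown above.

```scala
// Sketch: capture malformed rows instead of failing the whole read.
// Assumes the singleQuote.json file from above; the schema is written
// out by hand and includes Spark's internal corrupt-record column.
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("Roll", IntegerType),
  StructField("Name", StringType),
  StructField("Marks", IntegerType),
  StructField("_corrupt_record", StringType)
))

val raw = spark.read
  .option("multiline", "true")
  .option("allowSingleQuotes", "false")
  .option("mode", "PERMISSIVE")   // PERMISSIVE (the default) keeps bad rows
  .schema(schema)
  .json("jsonFiles/singleQuote.json")
  .cache()                        // cache before querying only _corrupt_record

// Show the raw text of the records that failed to parse
raw.filter(raw("_corrupt_record").isNotNull).show(false)
```

Note that with multiline input, a parse failure typically puts the whole file's text into a single corrupt-record row, since each file is parsed as one unit.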
2. Reading JSON files with unquoted field names
This is a common issue: file authors sometimes forget to enclose field names in quotes. To parse such files properly, we use the allowUnquotedFieldNames option, which accepts unquoted field names instead of raising an error.
Example: unQuoted.json
[
  {
    "Roll": 1,
    Name: "Ajay",
    "Marks": 55
  },
  {
    Roll: 2,
    "Name": "Bharghav",
    "Marks": 63
  },
  ...
Without allowing unquoted field names:
val unquote = spark.read
  .option("multiline", "true")
  .option("inferSchema", "true")
  .json("jsonFiles/unQuoted.json")
unquote.show()
Executing the above Spark command will throw the error:
the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column
With allowUnquotedFieldNames = true:
We will now see the results after enabling allowUnquotedFieldNames.
val unquote = spark.read
  .option("multiline", "true")
  .option("inferSchema", "true")
  .option("allowUnquotedFieldNames", "true")
  .json("jsonFiles/unQuoted.json")
unquote.show()
Output
+-----+--------+----+
|Marks| Name|Roll|
+-----+--------+----+
| 55| Ajay| 1|
| 63|Bharghav| 2|
| 60| Chaitra| 3|
| 75| Kamal| 4|
| 70| Sohaib| 5|
+-----+--------+----+
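When only a handful of records are broken and losing them is acceptable, Spark's mode option offers a coarser alternative to fixing each parser option individually. A hedged sketch, reusing the unQuoted.json path from above:

```scala
// Sketch: skip unparseable records instead of relaxing the parser.
// DROPMALFORMED removes bad rows silently, so compare row counts with
// expectations after the read.
val tolerant = spark.read
  .option("multiline", "true")
  .option("mode", "DROPMALFORMED")
  .json("jsonFiles/unQuoted.json")

tolerant.show()
```

Be careful with multiline input here: each file is parsed as one unit, so a single bad record can cause the entire file's contents to be dropped.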
3. Reading JSON files containing comments
Standard JSON does not support comments. However, in practice, developers often add inline comments for clarification. Spark will fail when it encounters comments unless we enable the appropriate option.
Example: commentMarks.json
[
  {
    "Roll": 1,
    "Name": "Ajay",
    "Marks": 55 // This is the lowest score
  },
  {
    "Roll": 2,
    "Name": "Bharghav",
    "Marks": 63
  },
  ...
In a few cases, the authors of JSON files write comments about the data to help others understand it better. But when we try to read such JSON files, we get an error.
Without allowing comments:
Let's see what error we get in Spark when we read data with comments in it.
val commentMarks = spark.read
  .option("multiline", "true")
  .option("inferSchema", "true")
  .json("jsonFiles/commentMarks.json")
commentMarks.show()
Output
the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column
To handle this situation, we use the allowComments option. This ignores the comments and reads the data.
With allowComments = true:
Let's see how the output looks when allowComments is set to true.
val commentMarks = spark.read
  .option("multiline", "true")
  .option("inferSchema", "true")
  .option("allowComments", "true")
  .json("jsonFiles/commentMarks.json")
commentMarks.show()
Output
+-----+--------+----+
|Marks| Name|Roll|
+-----+--------+----+
| 55| Ajay| 1|
| 63|Bharghav| 2|
| 60| Chaitra| 3|
| 75| Kamal| 4|
| 70| Sohaib| 5|
+-----+--------+----+
When allowComments is set to true, Spark ignores the comments and reads the JSON file. It is advisable to set allowComments to true whenever the files we read may contain comments.
Summary
In this article, we have looked at:
- Common issues faced while reading JSON files.
- Reading JSON files whose field names are enclosed in single quotes (' ') or not quoted at all.
- Reading JSON files that contain comments.
These options are particularly useful when working with large or manually written JSON files that may not follow strict JSON formatting rules.
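The three options are independent and can be combined in a single read when a file mixes all of these issues. A minimal sketch, with a hypothetical messy.json path:

```scala
// Sketch: enable all three relaxations at once. The file path is
// hypothetical. Schema inference happens by default for JSON, so no
// inferSchema option is needed here.
val lenient = spark.read
  .option("multiline", "true")
  .option("allowSingleQuotes", "true")
  .option("allowUnquotedFieldNames", "true")
  .option("allowComments", "true")
  .json("jsonFiles/messy.json")

lenient.printSchema()
lenient.show()
```

Each option only relaxes one specific rule, so enabling all three is still far stricter (and safer) than silently dropping records with mode = DROPMALFORMED.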