Handling JSON files

Last updated on: 2025-05-30

JavaScript Object Notation (JSON) is a popular format for storing and exchanging data due to its lightweight and human-readable structure. Apache Spark provides powerful capabilities for reading and writing JSON data, making it an essential tool for processing structured and semi-structured datasets.

In this article, we’ll explore how to read different types of JSON files in Spark and perform basic operations.

Reading JSON files

Let’s begin with a basic JSON file singleLine.json where each line represents a single JSON object:

{"name": "Alice", "age": 30, "city": "New York"}
{"name": "Bob", "age": 25, "city": "Los Angeles"}
{"name": "Charlie", "age": 35, "city": "Chicago"}

To read this file, we call the .json() method on the DataFrameReader, passing the file path:

val singleLine = spark.read
  .json("jsonFiles/singleLine.json")

singleLine.show()
singleLine.printSchema()

Output

+---+-----------+-------+
|age|       city|   name|
+---+-----------+-------+
| 30|   New York|  Alice|
| 25|Los Angeles|    Bob|
| 35|    Chicago|Charlie|
+---+-----------+-------+

root
 |-- age: long (nullable = true)
 |-- city: string (nullable = true)
 |-- name: string (nullable = true)
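With the file loaded, ordinary DataFrame operations apply. As a small illustration (a sketch assuming the singleLine DataFrame from above), we can filter rows and project columns:

```scala
import org.apache.spark.sql.functions.col

// Keep only people older than 26, then select two columns.
// With the sample data this keeps Alice and Charlie.
val adults = singleLine
  .filter(col("age") > 26)
  .select("name", "city")

adults.show()
```

The same fluent style works on any DataFrame, regardless of whether it came from JSON, CSV, or Parquet.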

Reading Multiline JSON Files

Now consider a file students.json that stores its records in a multiline format:

[
  {
    "Roll": 1,
    "Name": "Ajay",
    "Marks": 55
  },
  {
    "Roll": 2,
    "Name": "Bharghav",
    "Marks": 63
  },
  {
    "Roll": 3,
    "Name": "Chaitra",
    "Marks": 60
  },
  {
    "Roll": 4,
    "Name": "Kamal",
    "Marks": 75
  },
  {
    "Roll": 5,
    "Name": "Sohaib",
    "Marks": 70
  }
]

Let's try reading this file the same way as before and see whether it works.

val multiline = spark.read
  .json("jsonFiles/students.json")
    
multiline.show()
multiline.printSchema()

Output

Exception in thread "main" org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column

root
 |-- _corrupt_record: string (nullable = true)

Reading the multiline JSON file raises an exception, and the inferred schema contains nothing but the internal _corrupt_record column. So how do we handle this?

To fix this, we enable the multiLine option with .option("multiLine", "true"):

val jsonDf = spark.read
  .option("multiLine", "true")
  .json("jsonFiles/students.json")

jsonDf.show()
jsonDf.printSchema()

Output

+-----+--------+----+
|Marks|    Name|Roll|
+-----+--------+----+
|   55|    Ajay|   1|
|   63|Bharghav|   2|
|   60| Chaitra|   3|
|   75|   Kamal|   4|
|   70|  Sohaib|   5|
+-----+--------+----+

root
 |-- Marks: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Roll: long (nullable = true)

Whenever a file's records span multiple lines, set option("multiLine", "true"); without it, Spark expects one complete JSON object per line and treats anything else as corrupt.
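Schema inference requires an extra pass over the data. If you already know the structure, you can supply an explicit schema instead; the following is a sketch assuming the students.json file above:

```scala
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Explicit schema matching students.json; field names must match the JSON keys
val studentSchema = StructType(Seq(
  StructField("Roll", LongType, nullable = true),
  StructField("Name", StringType, nullable = true),
  StructField("Marks", LongType, nullable = true)
))

val students = spark.read
  .option("multiLine", "true")
  .schema(studentSchema) // skips the schema-inference pass over the data
  .json("jsonFiles/students.json")
```

Providing the schema up front also guards against inference picking a wider type than you expect.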

You might ask yourself, "There are many cases where the data is stored in nested JSON format, how to deal with those files?"

Let's answer that question as well.

Reading Nested JSON files

Consider a more complex JSON structure with nested fields, as in studentDetails.json:

[
  {
    "Roll": 1,
    "Name": "Ajay",
    "Marks": 55,
    "Contact": [
      {
        "Mail": "[email protected]",
        "Mobile": "8973 113"
      }
    ]
  },
  {
    "Roll": 2,
    "Name": "Bharghav",
    "Marks": 63,
    "Contact": [
      {
        "Mail": "[email protected]",
        "Mobile": "9876 205"
      }
    ]
  },
...

As with the multiline file, we use option("multiLine", "true") to read nested JSON files.

val stdDetails = spark.read
  .option("multiLine","true")
  .json("jsonFiles/studentDetails.json")

stdDetails.printSchema()
stdDetails.show(truncate = false)

Output

+-------------------------------+-----+--------+----+
|Contact                        |Marks|Name    |Roll|
+-------------------------------+-----+--------+----+
|[{[email protected], 8973 113}]    |55   |Ajay    |1   |
|[{[email protected], 9876 205}]|63   |Bharghav|2   |
|[{[email protected], 7789 656}] |60   |Chaitra |3   |
|[{[email protected], 8867 325}]   |75   |Kamal   |4   |
|[{[email protected], 9546 365}]  |70   |Sohaib  |5   |
+-------------------------------+-----+--------+----+

root
 |-- Contact: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Mail: string (nullable = true)
 |    |    |-- Mobile: string (nullable = true)
 |-- Marks: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Roll: long (nullable = true)

Summary

In this article, we learned:

  • How to read simple JSON files using Spark.

  • How to handle multiline JSON files with the multiLine option.

  • How to read nested JSON files and inspect their complex schema.

By using Spark's flexible JSON reading capabilities, you can efficiently process diverse data formats in your ETL pipelines and big data workflows.
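Although this article focused on reading, writing a DataFrame back out as JSON is symmetric. A minimal sketch, assuming the jsonDf DataFrame from earlier and a hypothetical output directory:

```scala
// Writes one JSON object per line into part files under jsonFiles/output
jsonDf.write
  .mode("overwrite") // replace the directory if it already exists
  .json("jsonFiles/output")
```

Note that Spark writes a directory of part files, not a single .json file.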
