JSON File Schema
When working with JSON files in Apache Spark, it's common to rely on Spark's automatic schema inference. However, there are scenarios where this might misinterpret certain data types. To ensure accuracy, especially for production-grade applications, it’s a good practice to define your own schema explicitly.
Let’s explore how to define and use custom schemas when reading JSON data in Spark.
Example JSON File
Suppose we have a JSON file studentMarks.json containing student records:
[
{
"Roll": 1,
"Name": "Ajay",
"Final Marks": 300,
"Float Marks": 55.5,
"Double Marks": 92.75
},
{
"Roll": 2,
"Name": "Bharghav",
"Final Marks": 350,
"Float Marks": 63.2,
"Double Marks": 88.5
},
...
Schema Inference: The Problem
If we read this JSON file with schema inference enabled:
// Read the JSON file and let Spark infer the schema
val stdMarks = spark.read
  .option("multiLine", "true")   // the file is a multi-line JSON array
  .option("inferSchema", "true")
  .json("jsonFiles/studentMarks.json")

stdMarks.printSchema()
Output
root
|-- Double Marks: double (nullable = true)
|-- Final Marks: long (nullable = true)
|-- Float Marks: double (nullable = true)
|-- Name: string (nullable = true)
|-- Roll: long (nullable = true)
Here, Spark has inferred Roll as Long, although we may want it as a Short. Similarly, Float Marks has been interpreted as Double. These subtle inaccuracies can cause issues in downstream processing or analytics.
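To see exactly what was inferred, we can also inspect the column/type pairs programmatically (a quick check, not part of the original example; it assumes the stdMarks DataFrame read above):

// Print each column name with the type Spark inferred (sketch)
stdMarks.dtypes.foreach { case (name, dtype) => println(s"$name -> $dtype") }
// e.g. Roll -> LongType, Float Marks -> DoubleType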
Defining a Custom Schema
To correct this, we can define our own schema using StructType and StructField:
import org.apache.spark.sql.types._

val ownSchema = StructType(Seq(
  StructField("Roll", ShortType, true),
  StructField("Name", StringType, true),
  StructField("Final Marks", IntegerType, true),
  StructField("Float Marks", FloatType, true),
  StructField("Double Marks", DoubleType, true)
))
// Read the same file, this time supplying the schema explicitly
val schMarks = spark.read
  .option("multiLine", "true")
  .schema(ownSchema)
  .json("jsonFiles/studentMarks.json")

schMarks.printSchema()
Output
root
|-- Roll: short (nullable = true)
|-- Name: string (nullable = true)
|-- Final Marks: integer (nullable = true)
|-- Float Marks: float (nullable = true)
|-- Double Marks: double (nullable = true)
Now the data types of all columns match our expectations.
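As a quick sanity check (a minimal sketch, not from the original article), we can preview the data and confirm that the declared types are applied:

// Preview the rows and verify the narrowed types on a couple of columns (sketch)
schMarks.show()
schMarks.select("Roll", "Float Marks").printSchema()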
Defining a Custom Schema for Nested JSON
Let’s say we are working with a nested JSON file such as studentDetails.json, which contains a nested Contact field:
[
{
"Roll": 1,
"Name": "Ajay",
"Final Marks": 300,
"Contact": {
"Mail": "[email protected]",
"Mobile": "8973 113"
}
},
{
"Roll": 2,
"Name": "Bharghav",
"Final Marks": 350,
"Contact": {
"Mail": "[email protected]",
"Mobile": "9876 205"
}
}
]
We define a nested schema by embedding a StructType within another StructType:
val sch = StructType(Seq(
  StructField("Roll", ShortType, true),
  StructField("Name", StringType, true),
  StructField("Final Marks", IntegerType, true),
  StructField("Contact", StructType(Seq(
    StructField("Mail", StringType, true),
    StructField("Mobile", StringType, true)
  )), true)
))
// Read the nested JSON with the explicit schema
val nestedSchema = spark.read
  .option("multiLine", "true")
  .schema(sch)
  .json("jsonFiles/studentDetails.json")

nestedSchema.printSchema()
Output
root
|-- Roll: short (nullable = true)
|-- Name: string (nullable = true)
|-- Final Marks: integer (nullable = true)
|-- Contact: struct (nullable = true)
| |-- Mail: string (nullable = true)
| |-- Mobile: string (nullable = true)
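Nested fields declared this way can be queried with dot notation; as a small illustrative sketch (assuming the nestedSchema DataFrame read above):

// Select a top-level column alongside a nested Contact field (sketch)
nestedSchema.select("Name", "Contact.Mail").show(false)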
Summary
In this article, we learned:
- Spark's schema inference may not always match our expectations.
- We can explicitly define a schema using StructType and StructField to control data types.
- Custom schemas are especially useful when working with nested JSON structures.
Explicit schema definition makes Spark applications more predictable, maintainable, and robust—especially in production settings dealing with complex or large datasets.