JSON File Schema
When working with JSON files in Apache Spark, it's common to rely on Spark's automatic schema inference. However, there are scenarios where this might misinterpret certain data types. To ensure accuracy, especially for production-grade applications, it’s a good practice to define your own schema explicitly.
Let’s explore how to define and use custom schemas when reading JSON data in Spark.
Example JSON File
Suppose we have a JSON file studentMarks.json containing student records:
[
{
"Roll": 1,
"Name": "Ajay",
"Final Marks": 300,
"Float Marks": 55.5,
"Double Marks": 92.75
},
{
"Roll": 2,
"Name": "Bharghav",
"Final Marks": 350,
"Float Marks": 63.2,
"Double Marks": 88.5
},
...
Schema Inference: The Problem
If we read this JSON file with schema inference enabled:
// Read the JSON file and let Spark infer the schema
val stdMarks = spark.read
  .option("multiLine", "true")   // the file is a multi-line JSON array
  .option("inferSchema", "true")
  .json("jsonFiles/studentMarks.json")

stdMarks.printSchema()
Output
root
|-- Double Marks: double (nullable = true)
|-- Final Marks: long (nullable = true)
|-- Float Marks: double (nullable = true)
|-- Name: string (nullable = true)
|-- Roll: long (nullable = true)
Here, Spark has inferred Roll as Long, although we may want it as a Short. Similarly, Float Marks has been interpreted as Double. These subtle inaccuracies can cause issues in downstream processing or analytics.
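To see exactly what was inferred, we can also inspect the column/type pairs programmatically (a quick check, not part of the original example; it assumes the stdMarks DataFrame read above):

// Print each column name with the type Spark inferred (sketch)
stdMarks.dtypes.foreach { case (name, dtype) => println(s"$name -> $dtype") }
// e.g. Roll -> LongType, Float Marks -> DoubleType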
Defining a Custom Schema
To correct this, we can define our own schema using StructType and StructField:
import org.apache.spark.sql.types._

val ownSchema = StructType(Seq(
  StructField("Roll", ShortType, true),
  StructField("Name", StringType, true),
  StructField("Final Marks", IntegerType, true),
  StructField("Float Marks", FloatType, true),
  StructField("Double Marks", DoubleType, true)
))
// Read the same file, this time supplying the schema explicitly
val schMarks = spark.read
  .option("multiLine", "true")
  .schema(ownSchema)
  .json("jsonFiles/studentMarks.json")

schMarks.printSchema()
Output
root
|-- Roll: short (nullable = true)
|-- Name: string (nullable = true)
|-- Final Marks: integer (nullable = true)
|-- Float Marks: float (nullable = true)
|-- Double Marks: double (nullable = true)
Now the data types of all columns match our expectations.
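As a quick sanity check (a minimal sketch, not from the original article), we can preview the data and confirm that the declared types are applied:

// Preview the rows and verify the narrowed types on a couple of columns (sketch)
schMarks.show()
schMarks.select("Roll", "Float Marks").printSchema()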
Defining a Custom Schema for Nested JSON
Let’s say we are working with a nested JSON file such as studentDetails.json, which contains a nested Contact field:
[
{
"Roll": 1,
"Name": "Ajay",
"Final Marks": 300,
"Contact": {
"Mail": "[email protected]",
"Mobile": "8973 113"
}
},
{
"Roll": 2,
"Name": "Bharghav",
"Final Marks": 350,
"Contact": {
"Mail": "[email protected]",
"Mobile": "9876 205"
}
}
]
We define a nested schema by embedding a StructType within another StructType:
val sch = StructType(Seq(
  StructField("Roll", ShortType, true),
  StructField("Name", StringType, true),
  StructField("Final Marks", IntegerType, true),
  StructField("Contact", StructType(Seq(
    StructField("Mail", StringType, true),
    StructField("Mobile", StringType, true)
  )), true)
))
// Read the nested JSON with the explicit schema
val nestedSchema = spark.read
  .option("multiLine", "true")
  .schema(sch)
  .json("jsonFiles/studentDetails.json")

nestedSchema.printSchema()
Output
root
|-- Roll: short (nullable = true)
|-- Name: string (nullable = true)
|-- Final Marks: integer (nullable = true)
|-- Contact: struct (nullable = true)
| |-- Mail: string (nullable = true)
| |-- Mobile: string (nullable = true)
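Nested fields declared this way can be queried with dot notation; as a small illustrative sketch (assuming the nestedSchema DataFrame read above):

// Select a top-level column alongside a nested Contact field (sketch)
nestedSchema.select("Name", "Contact.Mail").show(false)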
Summary
In this article, we learned:
- Spark's schema inference may not always match our expectations.
- We can explicitly define a schema using StructType and StructField to control data types.
- Custom schemas are especially useful when working with nested JSON structures.
Explicit schema definition makes Spark applications more predictable, maintainable, and robust—especially in production settings dealing with complex or large datasets.