Writing DataFrame to JSON Files
Last updated on: 2025-05-30
In previous articles, we explored various ways to read JSON files in Spark and addressed potential issues that may arise. In this article, we’ll shift our focus to writing JSON files from Spark DataFrames, covering different scenarios including nested structures, null values, overwriting, and appending.
Sample DataFrame
Let's begin with a simple DataFrame:
+----+--------+-----------+
|Roll| Name|Final Marks|
+----+--------+-----------+
| 1| Ajay| 300|
| 2|Bharghav| 350|
| 3| Chaitra| 320|
| 4| Kamal| 360|
| 5| Sohaib| 450|
+----+--------+-----------+
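For reference, here is a minimal PySpark sketch that builds this DataFrame (the SparkSession setup and the column types are assumptions, inferred from the table above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jsonWriteDemo").getOrCreate()

# Sample data matching the table above
df = spark.createDataFrame(
    [(1, "Ajay", 300), (2, "Bharghav", 350), (3, "Chaitra", 320),
     (4, "Kamal", 360), (5, "Sohaib", 450)],
    ["Roll", "Name", "Final Marks"],
)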
1. Writing DataFrame to JSON File
To write a DataFrame to JSON, use the .write.json() method:
df.write.json("jsonFiles/studentData")
This creates a folder named studentData inside the jsonFiles directory, containing multiple JSON part files. By default, Spark writes one file per partition of the DataFrame, not one file per record.
2. Writing all DataFrame records to a single JSON file
To consolidate all records into a single JSON file, use .coalesce(1) before writing:
(df.coalesce(1)
    .write
    .json("jsonFiles/combinedStudentData"))
When this runs, Spark writes a single JSON part file containing all the records. Keep in mind that coalesce(1) funnels all data through a single partition, which can be slow for large DataFrames.
Output
{"Roll":1,"Name":"Ajay","Final Marks":300}
{"Roll":2,"Name":"Bharghav","Final Marks":350}
{"Roll":3,"Name":"Chaitra","Final Marks":320}
{"Roll":4,"Name":"Kamal","Final Marks":360}
{"Roll":5,"Name":"Sohaib","Final Marks":450}
3. Writing nested JSON data
Spark handles writing nested structures seamlessly. Consider this nested DataFrame:
+----+--------+-----+-------------------------------+
|Roll|Name |Marks|Contact |
+----+--------+-----+-------------------------------+
|1 |Ajay |55 |[[[email protected], 8973 113]] |
|2 |Bharghav|63 |[[[email protected], 9876 205]]|
|3 |Chaitra |60 |[[[email protected], 7789 656]] |
|4 |Kamal |75 |[[[email protected], 8867 325]] |
|5 |Sohaib |70 |[[[email protected], 9546 365]] |
+----+--------+-----+-------------------------------+
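One way such a DataFrame might be constructed (the Contact schema is an assumption inferred from the output below, and the mail addresses are placeholders, since the real ones are masked in the table above):

from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                               StructField, StructType)

# Contact is assumed to be an array of structs with Mail and Mobile fields
schema = StructType([
    StructField("Roll", IntegerType()),
    StructField("Name", StringType()),
    StructField("Marks", IntegerType()),
    StructField("Contact", ArrayType(StructType([
        StructField("Mail", StringType()),
        StructField("Mobile", StringType()),
    ]))),
])

# Placeholder addresses; the originals are masked above
nestedDf = spark.createDataFrame(
    [(1, "Ajay", 55, [("ajay@example.com", "8973 113")]),
     (2, "Bharghav", 63, [("bharghav@example.com", "9876 205")]),
     (3, "Chaitra", 60, [("chaitra@example.com", "7789 656")]),
     (4, "Kamal", 75, [("kamal@example.com", "8867 325")]),
     (5, "Sohaib", 70, [("sohaib@example.com", "9546 365")])],
    schema,
)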
To write this DataFrame to a JSON file, execute the following Spark code:
(nestedDf.coalesce(1)
    .write
    .json("jsonFiles/studentDetails"))
Output
{"Roll":1,"Name":"Ajay","Marks":55,"Contact":[{"Mail":"[email protected]","Mobile":"8973 113"}]}
{"Roll":2,"Name":"Bharghav","Marks":63,"Contact":[{"Mail":"[email protected]","Mobile":"9876 205"}]}
{"Roll":3,"Name":"Chaitra","Marks":60,"Contact":[{"Mail":"[email protected]","Mobile":"7789 656"}]}
{"Roll":4,"Name":"Kamal","Marks":75,"Contact":[{"Mail":"[email protected]","Mobile":"8867 325"}]}
{"Roll":5,"Name":"Sohaib","Marks":70,"Contact":[{"Mail":"[email protected]","Mobile":"9546 365"}]}
Each record is written as a JSON object whose Contact field contains nested JSON objects.
4. Overwriting existing JSON files
By default, Spark throws an error if the target path already exists when writing (the default save mode is errorifexists). For example, re-running the earlier write:
(nestedDf.coalesce(1)
    .write
    .json("jsonFiles/studentDetails"))
Output
path file:jsonFiles/studentDetails already exists.
To overcome this, set the save mode to overwrite, which replaces the existing data:
(nestedDf.coalesce(1)
    .write
    .mode("overwrite")
    .json("jsonFiles/studentDetails"))
5. Appending new JSON files to an existing path
Setting the save mode to append tells Spark to add new JSON files to an existing path without removing the files already there:
(nestedDf.coalesce(1)
    .write
    .mode("append")
    .json("jsonFiles/combinedStudentData"))
Output (records from the newly appended file; the part files written earlier to combinedStudentData remain in the directory):
{"Roll":1,"Name":"Ajay","Marks":55,"Contact":[{"Mail":"[email protected]","Mobile":"8973 113"}]}
{"Roll":2,"Name":"Bharghav","Marks":63,"Contact":[{"Mail":"[email protected]","Mobile":"9876 205"}]}
{"Roll":3,"Name":"Chaitra","Marks":60,"Contact":[{"Mail":"[email protected]","Mobile":"7789 656"}]}
{"Roll":4,"Name":"Kamal","Marks":75,"Contact":[{"Mail":"[email protected]","Mobile":"8867 325"}]}
{"Roll":5,"Name":"Sohaib","Marks":70,"Contact":[{"Mail":"[email protected]","Mobile":"9546 365"}]}
6. Handling null values in JSON files
Consider this DataFrame:
+----+--------+-----------+
|Roll| Name|Final Marks|
+----+--------+-----------+
| 1| Ajay| null|
| 2|Bharghav| 350|
| 3| Chaitra| 320|
| 4| null| 360|
| 5| Sohaib| 450|
+----+--------+-----------+
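A quick sketch of how this DataFrame might be built (Python's None becomes null in Spark, and the column types are inferred from the non-null rows):

# None maps to null in the resulting DataFrame
newDf = spark.createDataFrame(
    [(1, "Ajay", None), (2, "Bharghav", 350), (3, "Chaitra", 320),
     (4, None, 360), (5, "Sohaib", 450)],
    ["Roll", "Name", "Final Marks"],
)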
This DataFrame contains a few null values. By default, Spark omits null-valued fields when writing JSON, silently dropping them from the output:
(newDf.coalesce(1)
    .write
    .mode("overwrite")
    .json("jsonFiles/corruptStudentData"))
Output
{"Roll":1,"Name":"Ajay"}
{"Roll":2,"Name":"Bharghav","Final Marks":350}
{"Roll":3,"Name":"Chaitra","Final Marks":320}
{"Roll":4,"Final Marks":360}
{"Roll":5,"Name":"Sohaib","Final Marks":450}
To preserve them, set the option ignoreNullFields to false, which writes null-valued fields explicitly:
(newDf.coalesce(1)
    .write
    .mode("overwrite")
    .option("ignoreNullFields", "false")
    .json("jsonFiles/corruptStudentData"))
Output
{"Roll":1,"Name":"Ajay","Final Marks":null}
{"Roll":2,"Name":"Bharghav","Final Marks":350}
{"Roll":3,"Name":"Chaitra","Final Marks":320}
{"Roll":4,"Name":null,"Final Marks":360}
{"Roll":5,"Name":"Sohaib","Final Marks":450}
Summary
In this article, you learned:
- Writing DataFrames to JSON files in Spark.
- Writing all records to a single file.
- Creating nested JSON structures.
- Handling overwriting and appending behaviors.
- Preserving null values in output.
Understanding these nuances will help ensure your Spark JSON writing operations are both efficient and data-complete.