Writing DataFrame to JSON Files

Last updated on: 2025-05-30

In previous articles, we explored various ways to read JSON files in Spark and addressed potential issues that may arise. In this article, we’ll shift our focus to writing JSON files from Spark DataFrames, covering different scenarios including nested structures, null values, overwriting, and appending.

Sample DataFrame

Let's begin with a simple DataFrame:

+----+--------+-----------+
|Roll|    Name|Final Marks|
+----+--------+-----------+
|   1|    Ajay|        300|
|   2|Bharghav|        350|
|   3| Chaitra|        320|
|   4|   Kamal|        360|
|   5|  Sohaib|        450|
+----+--------+-----------+
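If you want to follow along, here is a minimal PySpark sketch that builds this DataFrame (the article's snippets work in both the Scala and Python APIs; the SparkSession setup below is an assumption):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jsonWriteDemo").getOrCreate()

# Sample data matching the table above
df = spark.createDataFrame(
    [(1, "Ajay", 300),
     (2, "Bharghav", 350),
     (3, "Chaitra", 320),
     (4, "Kamal", 360),
     (5, "Sohaib", 450)],
    ["Roll", "Name", "Final Marks"],
)
df.show()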

1. Writing a DataFrame to a JSON file

To write a DataFrame to JSON, use the .write.json() method:

df.write.json("jsonFiles/studentData")

This creates a studentData directory inside the root directory jsonFiles, containing one JSON part file per partition of the DataFrame. With a small DataFrame spread across many partitions this can look like one file per record, but the file count is driven by the partitioning, not by the number of records.
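Because the file count mirrors the DataFrame's partitioning, you can check the number of partitions before writing; a quick sketch using the df built earlier:

# Each partition becomes one part-* file in the output directory
print(df.rdd.getNumPartitions())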

2. Writing all DataFrame records to a single JSON file

To consolidate all records into a single JSON file, use .coalesce(1) before writing:

df.coalesce(1)
  .write
  .json("jsonFiles/combinedStudentData")

Upon executing the above code, Spark creates a single JSON file containing all the records. Keep in mind that coalesce(1) funnels every record through a single partition, so reserve it for data that comfortably fits on one executor.

Output

{"Roll":1,"Name":"Ajay","Final Marks":300}
{"Roll":2,"Name":"Bharghav","Final Marks":350}
{"Roll":3,"Name":"Chaitra","Final Marks":320}
{"Roll":4,"Name":"Kamal","Final Marks":360}
{"Roll":5,"Name":"Sohaib","Final Marks":450}

3. Writing nested JSON data

Spark seamlessly handles writing nested structures. Consider this nested DataFrame:

+----+--------+-----+-------------------------------+
|Roll|Name    |Marks|Contact                        |
+----+--------+-----+-------------------------------+
|1   |Ajay    |55   |[[[email protected], 8973 113]]    |
|2   |Bharghav|63   |[[[email protected], 9876 205]]|
|3   |Chaitra |60   |[[[email protected], 7789 656]] |
|4   |Kamal   |75   |[[[email protected], 8867 325]]   |
|5   |Sohaib  |70   |[[[email protected], 9546 365]]  |
+----+--------+-----+-------------------------------+
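Here is a minimal sketch to build this nested DataFrame, with Contact modeled as an array of structs (the email addresses below are hypothetical placeholders, since the originals are redacted above):

from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                               StructField, StructType)

# Contact is an array of {Mail, Mobile} structs
schema = StructType([
    StructField("Roll", IntegerType()),
    StructField("Name", StringType()),
    StructField("Marks", IntegerType()),
    StructField("Contact", ArrayType(StructType([
        StructField("Mail", StringType()),
        StructField("Mobile", StringType()),
    ]))),
])

# Placeholder email addresses; the real ones are redacted in the article
nestedDf = spark.createDataFrame(
    [(1, "Ajay", 55, [("ajay@example.com", "8973 113")]),
     (2, "Bharghav", 63, [("bharghav@example.com", "9876 205")]),
     (3, "Chaitra", 60, [("chaitra@example.com", "7789 656")]),
     (4, "Kamal", 75, [("kamal@example.com", "8867 325")]),
     (5, "Sohaib", 70, [("sohaib@example.com", "9546 365")])],
    schema,
)
nestedDf.show(truncate=False)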

To write this to a JSON file, run:

nestedDf.coalesce(1)
  .write
  .json("jsonFiles/studentDetails")

Output

{"Roll":1,"Name":"Ajay","Marks":55,"Contact":[{"Mail":"[email protected]","Mobile":"8973 113"}]}
{"Roll":2,"Name":"Bharghav","Marks":63,"Contact":[{"Mail":"[email protected]","Mobile":"9876 205"}]}
{"Roll":3,"Name":"Chaitra","Marks":60,"Contact":[{"Mail":"[email protected]","Mobile":"7789 656"}]}
{"Roll":4,"Name":"Kamal","Marks":75,"Contact":[{"Mail":"[email protected]","Mobile":"8867 325"}]}
{"Roll":5,"Name":"Sohaib","Marks":70,"Contact":[{"Mail":"[email protected]","Mobile":"9546 365"}]}

Each record is written as a JSON object, with the Contact array preserved as nested JSON objects.

4. Overwriting existing JSON files

By default, Spark throws an error if the target path already exists when writing (the default save mode is errorifexists). For example:

nestedDf.coalesce(1)
  .write
  .json("jsonFiles/studentDetails")

Output

path file:jsonFiles/studentDetails already exists.

To overcome this, set the save mode to overwrite, which replaces the existing data:

nestedDf.coalesce(1)
  .write
  .mode("overwrite")
  .json("jsonFiles/studentDetails")

5. Appending new JSON files to an existing path

Setting the save mode to append tells Spark to add new JSON files to the existing path, leaving the files already there untouched:

nestedDf.coalesce(1)
  .write
  .mode("append") 
  .json("jsonFiles/combinedStudentData")

Output (contents of the newly added file)

{"Roll":1,"Name":"Ajay","Marks":55,"Contact":[{"Mail":"[email protected]","Mobile":"8973 113"}]}
{"Roll":2,"Name":"Bharghav","Marks":63,"Contact":[{"Mail":"[email protected]","Mobile":"9876 205"}]}
{"Roll":3,"Name":"Chaitra","Marks":60,"Contact":[{"Mail":"[email protected]","Mobile":"7789 656"}]}
{"Roll":4,"Name":"Kamal","Marks":75,"Contact":[{"Mail":"[email protected]","Mobile":"8867 325"}]}
{"Roll":5,"Name":"Sohaib","Marks":70,"Contact":[{"Mail":"[email protected]","Mobile":"9546 365"}]}

6. Handling null values in JSON files

Consider the following DataFrame:

+----+--------+-----------+
|Roll|    Name|Final Marks|
+----+--------+-----------+
|   1|    Ajay|       null|
|   2|Bharghav|        350|
|   3| Chaitra|        320|
|   4|    null|        360|
|   5|  Sohaib|        450|
+----+--------+-----------+
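A minimal sketch to build this DataFrame, using Python's None for the nulls (an explicit schema is assumed so the nullable columns are typed correctly):

from pyspark.sql.types import IntegerType, StringType, StructField, StructType

schema = StructType([
    StructField("Roll", IntegerType()),
    StructField("Name", StringType()),
    StructField("Final Marks", IntegerType()),
])

# None becomes a SQL null in the resulting DataFrame
newDf = spark.createDataFrame(
    [(1, "Ajay", None),
     (2, "Bharghav", 350),
     (3, "Chaitra", 320),
     (4, None, 360),
     (5, "Sohaib", 450)],
    schema,
)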

This DataFrame contains a couple of null values. By default, Spark omits null-valued fields from the JSON output entirely, which effectively loses that information:

newDf.coalesce(1)
  .write
  .mode("overwrite")
  .json("jsonFiles/corruptStudentData")

Output

{"Roll":1,"Name":"Ajay"}
{"Roll":2,"Name":"Bharghav","Final Marks":350}
{"Roll":3,"Name":"Chaitra","Final Marks":320}
{"Roll":4,"Final Marks":360}
{"Roll":5,"Name":"Sohaib","Final Marks":450}

To preserve them, set the option ignoreNullFields to false, which writes the null-valued fields explicitly:

newDf.coalesce(1)
  .write
  .mode("overwrite")
  .option("ignoreNullFields","false")
  .json("jsonFiles/corruptStudentData")

Output

{"Roll":1,"Name":"Ajay","Final Marks":null}
{"Roll":2,"Name":"Bharghav","Final Marks":350}
{"Roll":3,"Name":"Chaitra","Final Marks":320}
{"Roll":4,"Name":null,"Final Marks":360}
{"Roll":5,"Name":"Sohaib","Final Marks":450}

Summary

In this article, you learned:

  • Writing DataFrames to JSON files in Spark.

  • Writing all records to a single file.

  • Creating nested JSON structures.

  • Handling overwriting and appending behaviors.

  • Preserving null values in output.

Understanding these nuances will help ensure your Spark JSON writes are both efficient and lossless.
