Writing Files in Spark

Last updated on: 2025-05-30

Until now, we have explored various techniques for reading CSV files in Spark, including:

  • Handling standard and encoded CSV files,
  • Defining custom schemas, and
  • Selecting only relevant fields to optimize performance.

In this article, we shift our focus to writing data to CSV files with Spark: we explore the default behavior and address the common challenges along the way.

Writing a DataFrame to a CSV file

By default, when writing a DataFrame to a CSV file in Spark, the following occurs:

  • Spark creates a directory at the target path.

  • The data is partitioned and written across multiple CSV files.

Let’s consider the following DataFrame:

+----+--------+-----------+-----------+------------+
|Roll|    Name|Final Marks|Float Marks|Double Marks|
+----+--------+-----------+-----------+------------+
|   1|    Ajay|        300|       55.5|       92.75|
|   2|Bharghav|        350|       63.2|        88.5|
|   3| Chaitra|        320|       60.1|        75.8|
|   4|   Kamal|        360|       75.0|        82.3|
|   5|  Sohaib|        450|       70.8|        90.6|
+----+--------+-----------+-----------+------------+

To write this DataFrame into a CSV:

df.write
  .format("csv")
  .option("header", "true")
  .save("csvFiles/studentData")

This will create a directory csvFiles/studentData with multiple part files, each containing a portion of the data. This default behavior may not be desirable if you need a single consolidated CSV file.
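If a single file is needed but you would rather not change the Spark job itself, the part files can also be merged after the write completes. The sketch below is plain Python, not Spark; the `merge_part_files` helper and the paths are hypothetical, and it assumes every part file starts with the same header row:

```python
import glob
import os

def merge_part_files(spark_output_dir: str, merged_path: str) -> None:
    """Concatenate Spark's part-*.csv files into one CSV, keeping one header."""
    part_files = sorted(glob.glob(os.path.join(spark_output_dir, "part-*.csv")))
    with open(merged_path, "w", encoding="utf-8") as out:
        for i, path in enumerate(part_files):
            with open(path, encoding="utf-8") as f:
                lines = f.readlines()
            # Every part file repeats the header; keep it only from the first.
            out.writelines(lines if i == 0 else lines[1:])
```

For example, merge_part_files("csvFiles/studentData", "students.csv") would produce one consolidated CSV. For large outputs it is usually better to reduce the number of partitions in Spark itself, which the next section covers.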

Writing to a single CSV file

Spark's coalesce() method lets us control the number of output files by first reducing the DataFrame to the desired number of partitions. With coalesce(1), all rows end up in a single partition, and therefore a single CSV file:

df.coalesce(1)
  .write
  .format("csv")
  .option("header", "true")
  .save("csvFiles/combinedStudentData")

Output

Roll,Name,Final Marks,Float Marks,Double Marks
1,Ajay,300,55.5,92.75
2,Bharghav,350,63.2,88.5
3,Chaitra,320,60.1,75.8
4,Kamal,360,75.0,82.3
5,Sohaib,450,70.8,90.6

NOTE: coalesce(n) writes the data out as n CSV files by reducing the DataFrame to n partitions. It can only decrease the partition count, never increase it; to produce more output files than the DataFrame currently has partitions, use repartition(n) instead. There is no default argument, so coalesce(1) must be spelled out explicitly to get a single file.

Updating an existing CSV file

Sometimes we want to replace the contents of an output folder with new data. By default, Spark does not allow this and raises an error if the target path already exists. Suppose we run the same write again:

df.coalesce(1)
  .write
  .format("csv")
  .option("header", "true")
  .save("csvFiles/studentData")

When this code is executed, Spark raises an error because the target path already exists.

Output

path file:csvFiles/studentData already exists.

To tackle this, use the mode("overwrite") configuration, which replaces the existing directory and its contents with the new output:

df.coalesce(1)
  .write
  .format("csv")
  .option("header","true")
  .mode("overwrite")
  .save("csvFiles/studentData")

Output

Roll,Name,Final Marks,Float Marks,Double Marks
1,Ajay,300,55.5,92.75
2,Bharghav,350,63.2,88.5
3,Chaitra,320,60.1,75.8
4,Kamal,360,75.0,82.3
5,Sohaib,450,70.8,90.6

Writing a CSV file with a custom delimiter

Just as we can read a CSV file whose delimiter is something other than the default comma (,), we can write one with a custom delimiter. For example, to use a pipe (|):

df.coalesce(1)
  .write
  .format("csv")
  .option("header", "true")
  .option("delimiter", "|")
  .mode("overwrite")
  .save("csvFiles/customDelimiterData")

Output

Roll|Name|Final Marks|Float Marks|Double Marks
1|Ajay|300|55.5|92.75
2|Bharghav|350|63.2|88.5
3|Chaitra|320|60.1|75.8
4|Kamal|360|75.0|82.3
5|Sohaib|450|70.8|90.6

You can also use other delimiters such as ";" or a space. Be careful with characters that appear inside the data itself: "." for example also occurs in the decimal values, so any field containing the delimiter will be quoted by the writer.
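To see why a delimiter like "." is risky, here is a quick illustration using Python's built-in csv module rather than Spark; Spark's CSV writer applies the same kind of minimal quoting by default:

```python
import csv
import io

# Write the header and one data row using "." as the delimiter.
# Fields that themselves contain "." (the decimal marks) get quoted.
buf = io.StringIO()
writer = csv.writer(buf, delimiter=".", lineterminator="\n")
writer.writerow(["Roll", "Name", "Final Marks", "Float Marks", "Double Marks"])
writer.writerow([1, "Ajay", 300, 55.5, 92.75])
print(buf.getvalue())
```

The data row comes out as 1.Ajay.300."55.5"."92.75": the numeric fields had to be quoted because they contain the delimiter, which makes the file harder for other tools to consume.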

Appending new files to an existing folder

To add new data without overwriting the files already in a folder, use the mode("append") configuration:

df.coalesce(1)
  .write
  .format("csv")
  .option("header", "true")
  .option("delimiter", ";")
  .mode("append")
  .save("csvFiles/customDelimiterData")

This creates additional part files in the same directory without modifying the existing ones. The CSV file below is created and stored in customDelimiterData alongside the earlier files. Note that it uses ";" while the earlier files use "|"; in practice, keep the delimiter consistent within a directory so the data can be read back uniformly.

Output

Roll;Name;Final Marks;Float Marks;Double Marks
1;Ajay;300;55.5;92.75
2;Bharghav;350;63.2;88.5
3;Chaitra;320;60.1;75.8
4;Kamal;360;75.0;82.3
5;Sohaib;450;70.8;90.6
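The caveat about mixed delimiters is easy to demonstrate outside Spark. In this plain-Python sketch, one row is taken from the pipe-delimited output and one from the semicolon-delimited output above, and a reader configured for "|" cannot split the ";"-delimited row:

```python
import csv
import io

# One pipe-delimited row and one semicolon-delimited row in the same "file".
mixed = "1|Ajay|300|55.5|92.75\n2;Bharghav;350;63.2;88.5\n"

reader = csv.reader(io.StringIO(mixed), delimiter="|")
rows = list(reader)
print(rows[0])  # ['1', 'Ajay', '300', '55.5', '92.75']
print(rows[1])  # ['2;Bharghav;350;63.2;88.5']
```

A reader configured for one delimiter silently mis-parses rows written with the other, which is why appended files should use the same delimiter as the directory they join.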

Summary

In this article, we have seen:

  • The default behavior of Spark when writing CSV files,
  • How to write a Spark DataFrame to CSV files,
  • How to write a DataFrame to a single CSV file,
  • How to overwrite the files of an existing directory,
  • How to write a CSV file with a custom delimiter, and
  • How to append new CSV files to an existing directory.
