Comparing partitionBy(), repartition(), and coalesce() in Spark CSV Writes

Last updated on: 2025-05-30

In previous articles, we explored three different methods used to manage partitions while writing CSV files in Apache Spark: partitionBy(), repartition(), and coalesce(). Each serves a distinct purpose and has its own pros and cons depending on the scenario.

In this article, we’ll compare these three methods, discuss their similarities and differences, and help you understand when to use each one.

When to use partitionBy()?

The partitionBy() method is applied during the write operation and organizes the output files by the values of one or more specified columns. Spark creates one subdirectory per distinct value, named column=value (for example Age=<value>), so engines that filter on that column can read only the matching directories instead of scanning the whole dataset.

Use case:

Use partitionBy() when you want to store data in a well-organized structure based on column values to improve query performance.

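// coalesce(1) leaves a single in-memory partition, so each Age=<value> directory receives one file
df.coalesce(1)
  .write
  .partitionBy("Age")
  .option("header", "true")
  .mode("overwrite")
  .csv("csvFiles/combinedStudentAgeData")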

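Output

Each header line below starts the content of a separate part file, one per Age=<value> directory. The Age column does not appear in the files because partitionBy() stores its value in the directory name.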

Roll,Name,DOB,Submit Time,Final Marks
3,Chaitra,12/12/2010,2025/02/17 12:45:10,75.8
5,Sohaib,14/04/2009,2025/02/17 12:55:20,90.6
8,Ganesh,30/09/2010,2025/02/17 12:50:30,88.2
11,Jasmine,10/02/2011,2025/02/17 12:05:25,79.5
Roll,Name,DOB,Submit Time,Final Marks
1,Ajay,01/01/2010,2025/02/17 12:30:45,92.7
4,Kamal,25/08/2010,2025/02/17 12:40:05,82.3
6,Divya,18/07/2010,2025/02/17 12:20:15,85.4
9,Hema,05/11/2009,2025/02/17 12:15:45,91.0
12,Kiran,28/06/2009,2025/02/17 12:00:40,93.8
Roll,Name,DOB,Submit Time,Final Marks
2,Bharghav,04/06/2009,2025/02/17 12:35:30,88.5
7,Faisal,23/05/2009,2025/02/17 12:25:50,78.9
10,Ishaan,20/03/2008,2025/02/17 12:10:05,87.6
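
To make the directory layout concrete, here is a minimal sketch rather than the article's own job. It assumes a local SparkSession, an output path supplied by the caller, and three made-up rows. The names PartitionByLayout, sampleData, writeByAge, and readBack are hypothetical and exist only for this illustration.

```scala
import java.io.File
import org.apache.spark.sql.{DataFrame, SparkSession}

object PartitionByLayout {
  // Assumption: a local session for experimentation; a real job would
  // normally receive its master from spark-submit.
  lazy val spark: SparkSession = SparkSession.builder()
    .appName("partitionBy-layout")
    .master("local[2]")
    .getOrCreate()

  // Hypothetical rows; only the column names mirror the article's data.
  def sampleData(): DataFrame = {
    import spark.implicits._
    Seq((1, "StudentA", 14), (2, "StudentB", 15), (3, "StudentC", 14))
      .toDF("Roll", "Name", "Age")
  }

  // Writes one file per Age value and returns the partition directory names.
  def writeByAge(df: DataFrame, out: String): Seq[String] = {
    df.coalesce(1)
      .write
      .partitionBy("Age")
      .option("header", "true")
      .mode("overwrite")
      .csv(out)
    new File(out).listFiles()
      .filter(_.isDirectory)
      .map(_.getName)
      .sorted
      .toSeq
  }

  // Reads the directory tree back; Spark rebuilds the Age column
  // from the Age=<value> directory names.
  def readBack(out: String): DataFrame =
    spark.read.option("header", "true").csv(out)
}
```

With the sample rows, writeByAge should report the directories Age=14 and Age=15, and readBack returns a DataFrame in which Age is a column again even though it is absent from the CSV files themselves.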

Pros:

  • Organized directory structure for faster filtering.

  • Improved query efficiency with tools like Hive, Presto, or Athena, which can skip directories that do not match a filter on the partition column.

Cons:

  • Can lead to many small files if there are numerous unique values.

  • Requires more disk writes and can increase job execution time.
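
One way to gauge the small-files risk before writing is to estimate an upper bound on the number of output files. The sketch below assumes a local SparkSession and made-up rows; SmallFilesEstimate, sample, and maxOutputFiles are hypothetical names. It relies on the fact that each in-memory partition writes one file for every partition value it contains.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SmallFilesEstimate {
  // Assumption: a local session for experimentation only.
  lazy val spark: SparkSession = SparkSession.builder()
    .appName("small-files-estimate")
    .master("local[2]")
    .getOrCreate()

  // Hypothetical rows with three distinct Age values.
  def sample(): DataFrame = {
    import spark.implicits._
    Seq((1, 14), (2, 15), (3, 14), (4, 16)).toDF("Roll", "Age")
  }

  // Upper bound on the files partitionBy(column) can write: every in-memory
  // partition may hold rows for every distinct value, and each such
  // (partition, value) pair becomes one file.
  def maxOutputFiles(df: DataFrame, column: String): Long = {
    val distinctValues = df.select(column).distinct().count()
    distinctValues * df.rdd.getNumPartitions
  }
}
```

For the sample rows, four in-memory partitions give a bound of 12 files, while coalesce(1) brings it down to 3, one per distinct Age value.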

When to use repartition()?

The repartition() method changes the number of in-memory partitions before the data is written; it can either increase or decrease that number. It performs a full shuffle of the data across the network and, when called with only a partition count, spreads the records roughly evenly across the new partitions.

Use case:

Use repartition() to evenly distribute large datasets for parallel processing or when writing to multiple output files.

// Each non-empty in-memory partition becomes one part file in the output directory
df.repartition(5)
  .write
  .option("header", "true")
  .mode("overwrite")
  .csv("csvFiles/studentDataDistributed")
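
The effect of repartition() can be checked before and after the write. The following is a sketch under stated assumptions: a local SparkSession, a synthetic DataFrame supplied by the caller, and hypothetical names (RepartitionCheck, rowsPerPartition, writeAndCountFiles). It uses spark_partition_id() from org.apache.spark.sql.functions to count rows per in-memory partition.

```scala
import java.io.File
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.spark_partition_id

object RepartitionCheck {
  // Assumption: a local session for experimentation only.
  lazy val spark: SparkSession = SparkSession.builder()
    .appName("repartition-check")
    .master("local[2]")
    .getOrCreate()

  // Number of rows held by each in-memory partition, keyed by partition id.
  def rowsPerPartition(df: DataFrame): Map[Int, Long] =
    df.groupBy(spark_partition_id().as("pid"))
      .count()
      .collect()
      .map(row => row.getInt(0) -> row.getLong(1))
      .toMap

  // Writes df as CSV and returns how many part files landed in the output directory.
  def writeAndCountFiles(df: DataFrame, out: String): Int = {
    df.write
      .option("header", "true")
      .mode("overwrite")
      .csv(out)
    new File(out).listFiles()
      .count(f => f.getName.startsWith("part-") && f.getName.endsWith(".csv"))
  }
}
```

For a DataFrame of 1,000 rows repartitioned to five partitions, rowsPerPartition should return five entries that add up to 1,000, and writeAndCountFiles should report five part files.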

Pros:

  • Balances load for parallel processing.

  • Produces roughly evenly sized output files when called with only a partition count.

Cons:

  • Full shuffle operation is expensive.

  • Should be avoided for small datasets or when minimizing I/O is critical.

When to use coalesce()?

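The coalesce() method reduces the number of partitions. It avoids a full shuffle by merging existing partitions, which makes it cheaper than repartition() when lowering the partition count. It cannot increase the count: requesting more partitions than currently exist leaves the number of partitions unchanged.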

Use case:

Use coalesce() to combine data into fewer output files, particularly for small datasets or when a single file output is needed.

// A single partition yields a single part-*.csv file inside the csvFiles/studentData directory
df.coalesce(1)
  .write
  .format("csv")
  .option("header", "true")
  .mode("overwrite")
  .save("csvFiles/studentData")

Sample Output (Single File)

Roll,Name,Final Marks,Float Marks,Double Marks
1,Ajay,300,55.5,92.75
2,Bharghav,350,63.2,88.5
3,Chaitra,320,60.1,75.8
4,Kamal,360,75.0,82.3
5,Sohaib,450,70.8,90.6
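
Two properties of coalesce() are easy to verify: it only ever lowers the partition count, and coalesce(1) produces a directory that holds exactly one part file. The sketch below assumes a local SparkSession and a synthetic 1,000-row DataFrame; CoalesceCheck, coalescedPartitions, and writeSingleFile are hypothetical names.

```scala
import java.io.File
import org.apache.spark.sql.{DataFrame, SparkSession}

object CoalesceCheck {
  // Assumption: a local session for experimentation only.
  lazy val spark: SparkSession = SparkSession.builder()
    .appName("coalesce-check")
    .master("local[2]")
    .getOrCreate()

  // Synthetic data standing in for a real DataFrame.
  private def numbers(): DataFrame = spark.range(0, 1000).toDF("id")

  // Partition count after repartitioning to `from` and then coalescing to `to`.
  def coalescedPartitions(from: Int, to: Int): Int =
    numbers().repartition(from).coalesce(to).rdd.getNumPartitions

  // Writes with coalesce(1) and returns the names of the part files produced.
  def writeSingleFile(out: String): Seq[String] = {
    numbers().coalesce(1)
      .write
      .format("csv")
      .option("header", "true")
      .mode("overwrite")
      .save(out)
    new File(out).listFiles()
      .map(_.getName)
      .filter(name => name.startsWith("part-") && name.endsWith(".csv"))
      .toSeq
  }
}
```

Here coalescedPartitions(8, 2) should return 2, while coalescedPartitions(4, 10) stays at 4 because coalesce() does not add partitions. The path passed to writeSingleFile ends up as a directory containing one part-*.csv file, not as a single file with that name.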

Pros:

  • Reduces number of output files.

  • More efficient for small datasets.

Cons:

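  • Not ideal for large datasets: merged partitions can be uneven in size, and coalesce(1) funnels the whole write through a single task, which can cause memory pressure.

  • Can slow down writes, since fewer partitions mean fewer parallel write tasks.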

Summary

In this article, we have seen:

  • The use cases for the partitionBy(), repartition(), and coalesce() methods in Spark.
  • The pros and cons of each of these methods.

Understanding these methods and when to apply them can significantly optimize your data writing strategy in Spark. In the next article, we will explore how these partitioning strategies impact performance during read operations and how to balance between efficient writes and fast reads.
