Writing Files using Partitions

Last updated on: 2025-05-30

Sometimes, even though we prefer to store data in a single file, it becomes more efficient and readable to split the data based on certain conditions. This can make it easier to navigate and analyze subsets of data without having to filter through an entire dataset.

Splitting Data into Multiple CSV Files

Let’s say we have the following DataFrame:

// toDF requires the implicits from an active SparkSession
import spark.implicits._

val df = Seq(
        (1, "Ajay", 14, "01/01/2010", "2025/02/17 12:30:45", 92.7),
        (2, "Bharghav", 15, "04/06/2009", "2025/02/17 12:35:30", 88.5),
        (3, "Chaitra", 13, "12/12/2010", "2025/02/17 12:45:10", 75.8),
        (4, "Kamal", 14, "25/08/2010", "2025/02/17 12:40:05", 82.3),
        (5, "Sohaib", 13, "14/04/2009", "2025/02/17 12:55:20", 90.6),
        (6, "Divya", 14, "18/07/2010", "2025/02/17 12:20:15", 85.4),
        (7, "Faisal", 15, "23/05/2009", "2025/02/17 12:25:50", 78.9),
        (8, "Ganesh", 13, "30/09/2010", "2025/02/17 12:50:30", 88.2),
        (9, "Hema", 14, "05/11/2009", "2025/02/17 12:15:45", 91.0),
        (10, "Ishaan", 15, "20/03/2008", "2025/02/17 12:10:05", 87.6),
        (11, "Jasmine", 13, "10/02/2011", "2025/02/17 12:05:25", 79.5),
        (12, "Kiran", 14, "28/06/2009", "2025/02/17 12:00:40", 93.8)
    ).toDF("Roll", "Name", "Age","DOB","Submit Time","Final Marks")

We can split the data by a column value — for example, Age — using the partitionBy() method:

df.write
  .partitionBy("Age")
  .option("header", "true")
  .mode("overwrite")
  .csv("csvFiles/studentAgeData")

This creates one subdirectory per unique age in the DataFrame — Age=13, Age=14, Age=15 — with each row written under its matching directory.

When we open the subdirectories, we see that each one contains several part files rather than a single CSV. This is because Spark writes one file per in-memory partition by default, so every Age directory can receive multiple files.
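Conceptually, partitionBy("Age") groups rows by the partition column and writes each group under its own Age=&lt;value&gt; directory. A plain-Scala sketch of that grouping (no Spark required; the rows are a subset of the DataFrame above, and no files are actually written):

```scala
// Each tuple mirrors a row of the DataFrame: (Roll, Name, Age)
val rows = Seq(
  (1, "Ajay", 14), (2, "Bharghav", 15), (3, "Chaitra", 13),
  (4, "Kamal", 14), (5, "Sohaib", 13), (7, "Faisal", 15)
)

// partitionBy-style grouping: one bucket per distinct Age value,
// keyed the way Spark names the output subdirectories
val buckets: Map[String, Seq[(Int, String)]] =
  rows.groupBy { case (_, _, age) => s"Age=$age" }
      .map { case (dir, rs) => dir -> rs.map { case (roll, name, _) => (roll, name) } }

buckets.keys.toSeq.sorted.foreach(println)
// A real write would create csvFiles/studentAgeData/Age=13/, Age=14/, Age=15/
```

Note that the partition column itself is dropped from the bucketed rows — just as Spark omits the Age column from the CSV contents, since its value is encoded in the directory name.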

Avoiding Multiple Files per Partition

To consolidate all records of a group, such as Age=13, into a single file, chain coalesce(1) before the write. This reduces the DataFrame to one in-memory partition, so each Age directory receives exactly one file.

df.coalesce(1)
  .write
  .partitionBy("Age")
  .option("header", "true")
  .mode("overwrite")
  .csv("csvFiles/combinedStudentAgeData")

Example Outputs:

Age = 13

Roll,Name,DOB,Submit Time,Final Marks
3,Chaitra,12/12/2010,2025/02/17 12:45:10,75.8
5,Sohaib,14/04/2009,2025/02/17 12:55:20,90.6
8,Ganesh,30/09/2010,2025/02/17 12:50:30,88.2
11,Jasmine,10/02/2011,2025/02/17 12:05:25,79.5

Age = 14

Roll,Name,DOB,Submit Time,Final Marks
1,Ajay,01/01/2010,2025/02/17 12:30:45,92.7
4,Kamal,25/08/2010,2025/02/17 12:40:05,82.3
6,Divya,18/07/2010,2025/02/17 12:20:15,85.4
9,Hema,05/11/2009,2025/02/17 12:15:45,91.0
12,Kiran,28/06/2009,2025/02/17 12:00:40,93.8

Age = 15

Roll,Name,DOB,Submit Time,Final Marks
2,Bharghav,04/06/2009,2025/02/17 12:35:30,88.5
7,Faisal,23/05/2009,2025/02/17 12:25:50,78.9
10,Ishaan,20/03/2008,2025/02/17 12:10:05,87.6

Partition the data based on a condition

To classify students as 'Pass' or 'Fail' based on whether their marks meet the cutoff, we use the when() and otherwise() functions:

// when() and col() come from the Spark SQL functions package
import org.apache.spark.sql.functions.{col, when}

val cutOff = 90.0
val marksDf = df.withColumn("Pass/Fail",
  when(col("Final Marks") >= cutOff, "Pass")
    .otherwise("Fail"))

marksDf.coalesce(1)
  .write
  .partitionBy("Pass/Fail")
  .option("header", "true")
  .mode("overwrite")
  .csv("csvFiles/studentResults")
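Given the cutoff of 90.0, a student passes when their marks meet or exceed it, as the Pass output below shows (90.6 and 91.0 pass; 88.5 fails). The same rule expressed in plain Scala, without Spark:

```scala
val cutOff = 90.0

// Equivalent of when(col("Final Marks") >= cutOff, "Pass").otherwise("Fail")
def passFail(finalMarks: Double): String =
  if (finalMarks >= cutOff) "Pass" else "Fail"

// Marks drawn from the sample DataFrame
val results = Seq(92.7, 88.5, 75.8, 90.6).map(m => m -> passFail(m))
results.foreach { case (m, r) => println(s"$m -> $r") }
```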

Example Outputs:

Pass

Roll,Name,Age,DOB,Submit Time,Final Marks
1,Ajay,14,01/01/2010,2025/02/17 12:30:45,92.7
5,Sohaib,13,14/04/2009,2025/02/17 12:55:20,90.6
9,Hema,14,05/11/2009,2025/02/17 12:15:45,91.0
12,Kiran,14,28/06/2009,2025/02/17 12:00:40,93.8

Fail

Roll,Name,Age,DOB,Submit Time,Final Marks
2,Bharghav,15,04/06/2009,2025/02/17 12:35:30,88.5
3,Chaitra,13,12/12/2010,2025/02/17 12:45:10,75.8
4,Kamal,14,25/08/2010,2025/02/17 12:40:05,82.3
6,Divya,14,18/07/2010,2025/02/17 12:20:15,85.4
7,Faisal,15,23/05/2009,2025/02/17 12:25:50,78.9
8,Ganesh,13,30/09/2010,2025/02/17 12:50:30,88.2
10,Ishaan,15,20/03/2008,2025/02/17 12:10:05,87.6
11,Jasmine,13,10/02/2011,2025/02/17 12:05:25,79.5

What is repartition()?

The repartition() method performs a full shuffle, redistributing rows evenly across the specified number of partitions. This can improve parallelism and processing speed for large datasets.

df.repartition(5)
  .write
  .option("header", "true")
  .mode("overwrite")
  .csv("csvFiles/studentDataDistributed")

Key Points:

  • repartition() is used mainly for performance tuning.

  • It’s helpful when you need more partitions (e.g. for parallelism in large-scale data).

  • Unlike partitionBy(), repartition(n) doesn’t create value-based output directories — it only controls how many part files are written.
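To make the contrast concrete, here is a plain-Scala sketch of how 12 rows might be spread round-robin across 5 partitions, ignoring row values entirely. (Spark's actual assignment strategy is an implementation detail, so treat this purely as an illustration of value-independent distribution.)

```scala
// Roll numbers standing in for the 12 rows of the sample DataFrame
val rollNumbers = (1 to 12).toSeq

// repartition(5)-style spread: rows are assigned by position, not by value
val partitions: Map[Int, Seq[Int]] =
  rollNumbers.zipWithIndex
    .groupBy { case (_, i) => i % 5 }
    .map { case (p, rs) => p -> rs.map(_._1) }

(0 until 5).foreach(p => println(s"partition $p: ${partitions(p).mkString(", ")}"))
```

Unlike the Age=13 / Age=14 / Age=15 directories produced by partitionBy(), these 5 buckets mix ages freely — the grouping carries no meaning, only balance.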

Summary

In this article, you learned:

  • How to use partitionBy() to split data into multiple CSVs based on column values.

  • How to use coalesce() to ensure one file per group.

  • How to conditionally partition data (e.g., Pass/Fail) using when() and otherwise().

  • The purpose and use of repartition() for better parallelism and performance.

This approach is extremely useful when organizing or exporting data for reporting, auditing, or analysis.
