Compressing CSV Files in Spark

Last updated on: 2025-05-30

CSV files are widely used due to their simplicity and readability. However, with large datasets, these files can become bulky and storage-intensive. To tackle this, Apache Spark supports built-in compression options that reduce file size during write operations. In this article, we explore various compression methods available in Spark and how to effectively apply them when writing CSV files.

Dataset Reference

We'll use the same sample DataFrame as in our previous article, Writing CSV files using partitions.

// toDF on a local Seq requires the SparkSession implicits
// (imported automatically in spark-shell)
import spark.implicits._

val df = Seq(
    (1, "Ajay", 14, "01/01/2010", "2025/02/17 12:30:45", 92.7),
    (2, "Bharghav", 15, "04/06/2009", "2025/02/17 12:35:30", 88.5),
    (3, "Chaitra", 13, "12/12/2010", "2025/02/17 12:45:10", 75.8),
    (4, "Kamal", 14, "25/08/2010", "2025/02/17 12:40:05", 82.3),
    (5, "Sohaib", 13, "14/04/2009", "2025/02/17 12:55:20", 90.6),
    (6, "Divya", 14, "18/07/2010", "2025/02/17 12:20:15", 85.4),
    (7, "Faisal", 15, "23/05/2009", "2025/02/17 12:25:50", 78.9),
    (8, "Ganesh", 13, "30/09/2010", "2025/02/17 12:50:30", 88.2),
    (9, "Hema", 14, "05/11/2009", "2025/02/17 12:15:45", 91.0),
    (10, "Ishaan", 15, "20/03/2008", "2025/02/17 12:10:05", 87.6),
    (11, "Jasmine", 13, "10/02/2011", "2025/02/17 12:05:25", 79.5),
    (12, "Kiran", 14, "28/06/2009", "2025/02/17 12:00:40", 93.8)
).toDF("Roll", "Name", "Age", "DOB", "Submit Time", "Final Marks")

Compression Codecs Supported by Spark

Before we explore examples, it’s important to know that Spark supports the following compression codecs:

  • gzip
  • bzip2
  • snappy
  • lz4
  • deflate
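
All of these are enabled through the same compression write option; only the codec name changes, and Spark appends a matching extension (.gz, .bz2, .snappy, .lz4, .deflate) to every part file it writes. A minimal sketch of the general pattern (the output path here is illustrative):

// Generic pattern: only the codec name passed to "compression" changes
df.write
  .option("header", "true")
  .option("compression", "gzip")   // or "bzip2", "snappy", "lz4", "deflate"
  .mode("overwrite")
  .csv("csvFiles/output")          // illustrative output path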

Using gzip Compression

gzip is commonly used because of its strong compression ratio. To enable it, set the compression option when writing:

df.write
  .partitionBy("Age")
  .option("header", "true")
  .option("compression", "gzip")
  .mode("overwrite")
  .csv("csvFiles/compressedStudentAgeData")

Observation:

In the output folder, compressed files are created in subdirectories segregated by student age. On closer inspection, though, each CSV file holds only a single record, because every Spark task writes its own part file for each Age partition. How do we handle this?

Combine Partitions with Compression

To reduce the number of files while still compressing the output, we can use the coalesce() method.

df.coalesce(1)
  .write
  .partitionBy("Age")
  .option("header", "true")
  .option("compression", "gzip")
  .mode("overwrite")
  .csv("csvFiles/compressedStudentAgeData")

This creates one compressed file per partition (here, per Age group), simplifying storage and access.

For more on coalesce(), please refer to our earlier articles on the method.

The benefit of gzip compression is that it drastically reduces file size, making it efficient for storage and exchange; the trade-off is that decompression consumes more CPU and memory. Keep in mind, too, that gzip files are not splittable, so a single large .gz file cannot be read in parallel by multiple Spark tasks.
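
Reading the compressed output back needs no special handling, since Spark infers the codec from the file extension. A quick sketch using the path from the write above (Age reappears as a column because the data was partitioned by it):

// Spark detects .gz files automatically; no decompression option is needed
val readBack = spark.read
  .option("header", "true")
  .csv("csvFiles/compressedStudentAgeData")

readBack.show()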

Fast Compression with Snappy

When compression speed matters more than storage savings, we can set the compression option to snappy. It is best suited when performance is the priority and we are dealing with large volumes of data; for smaller datasets, gzip's better compression ratio usually makes it the more sensible choice.

df.coalesce(1)
  .write
  .partitionBy("Age")
  .option("header", "true")
  .option("compression", "snappy")
  .mode("overwrite")
  .csv("csvFiles/snappyCompressedData")

High Compression with BZIP2

bzip2 offers stronger compression than gzip, but it is more CPU-intensive. This method is:

  • Best for archival or long-term storage
  • Not ideal for frequent reads due to slower decompression

df.coalesce(1)
  .write
  .partitionBy("Age")
  .option("header","true")
  .option("compression","bzip2")
  .mode("overwrite")
  .csv("csvFiles/bzipCompressedData")

Balanced Compression with Deflate

The deflate codec offers a good balance between speed and compression ratio.

df.coalesce(1)
  .write
  .partitionBy("Age")
  .option("header","true")
  .option("compression","deflate")
  .mode("overwrite")
  .csv("csvFiles/deflateCompressedData")

Good for: Balanced performance and storage requirements.
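
To choose a codec empirically, it can help to write the same DataFrame once per codec and compare the resulting sizes on disk. A minimal sketch, assuming the output lands on the local filesystem; dirSize is a hypothetical helper defined here, not a Spark API:

import java.io.File

// Hypothetical helper: recursively sum the sizes of all files under a path
def dirSize(f: File): Long =
  if (f.isFile) f.length
  else Option(f.listFiles).map(_.map(dirSize).sum).getOrElse(0L)

val codecs = Seq("gzip", "bzip2", "snappy", "lz4", "deflate")

codecs.foreach { codec =>
  val path = s"csvFiles/codecComparison/$codec"   // illustrative output location
  df.coalesce(1)
    .write
    .option("header", "true")
    .option("compression", codec)
    .mode("overwrite")
    .csv(path)
  println(f"$codec%-8s -> ${dirSize(new File(path))}%,d bytes on disk")
}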

Summary

In this article, we explored:

  • Compression options supported by Spark

  • How to compress CSV files using Spark’s write API

  • How to combine coalesce() with compression for efficient storage

  • Pros and cons of each codec:

    • GZIP: High compression, slower

    • Snappy: Fast, lower compression

    • BZIP2: Highest compression, CPU-intensive

    • Deflate: Balanced option

Choosing the right compression codec depends on your priorities: storage, speed, or processing efficiency.
