Handling CSV Files

Last updated on: 2025-05-30

Data can be stored in multiple formats—one common format is CSV.

CSV (Comma-Separated Values) is a simple file format where each line represents a data record, and fields are separated by commas.

When working with big data, reading CSV files efficiently becomes important. Apache Spark provides powerful tools to read, transform, and write CSV files at scale. Let us now look at handling CSV data using Spark.
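
The examples below assume a SparkSession named spark is already available, as it is in spark-shell. For a standalone application, a minimal setup sketch might look like this (the app name and master are illustrative choices):

import org.apache.spark.sql.SparkSession

// Minimal SparkSession setup for a standalone application;
// in spark-shell, the `spark` variable is provided automatically.
val spark = SparkSession.builder()
  .appName("CsvHandling")
  .master("local[*]") // run locally using all available cores
  .getOrCreate()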

Reading CSV Files in Spark

Sample CSV File: students.csv

Roll,Name,Marks
1,Ajay,55
2,Bharghav,63
3,Chaitra,60
4,Kamal,75
5,Sohaib,70

Basic CSV Reading Syntax in Spark

To load a CSV file in Spark, we use spark.read, set any options with .option(), and then call .csv(filepath).

//reading a csv file
val readCsv = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("encoding", "UTF-8")
  .csv("csvFiles/students.csv")

readCsv.show()
readCsv.printSchema()

Explanation:

  • .option("header", "true"): Treats the first row as column names.

  • .option("inferSchema", "true"): Infers data types for each column.

  • .option("encoding", "UTF-8"): Handles character encoding.

Output

+----+--------+-----+
|Roll|    Name|Marks|
+----+--------+-----+
|   1|    Ajay|   55|
|   2|Bharghav|   63|
|   3| Chaitra|   60|
|   4|   Kamal|   75|
|   5|  Sohaib|   70|
+----+--------+-----+

root
 |-- Roll: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Marks: integer (nullable = true)
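
Schema inference requires Spark to scan the data an extra time. For larger files, or when you want fixed column types, you can supply the schema explicitly instead of using inferSchema. A minimal sketch for this file:

import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

// Explicit schema matching students.csv, so inferSchema is not needed
val studentSchema = StructType(Seq(
  StructField("Roll", IntegerType, nullable = true),
  StructField("Name", StringType, nullable = true),
  StructField("Marks", IntegerType, nullable = true)
))

val typedCsv = spark.read
  .option("header", "true")
  .schema(studentSchema)
  .csv("csvFiles/students.csv")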

Defining the Header of the CSV File

Spark provides a header flag. When it is set to true, the first row of the CSV file is read as the column names. When it is set to false, the first row is treated as data and Spark assigns default column names instead.

With Header:

val headCsvYes = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("csvFiles/students.csv")

headCsvYes.show()

Output

+----+--------+-----+
|Roll|    Name|Marks|
+----+--------+-----+
|   1|    Ajay|   55|
|   2|Bharghav|   63|
|   3| Chaitra|   60|
|   4|   Kamal|   75|
|   5|  Sohaib|   70|
+----+--------+-----+

Without Header:

val headCsv = spark.read
  .option("header", "false")
  .option("inferSchema", "true")
  .csv("csvFiles/students.csv")

headCsv.show()

Output

+----+--------+-----+
| _c0|     _c1|  _c2|
+----+--------+-----+
|Roll|    Name|Marks|
|   1|    Ajay|   55|
|   2|Bharghav|   63|
|   3| Chaitra|   60|
|   4|   Kamal|   75|
|   5|  Sohaib|   70|
+----+--------+-----+

Without specifying the header, Spark assigns default column names like _c0, _c1, etc.
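
If a file genuinely has no header row, you can assign your own column names with toDF(). A sketch (note that applying this to students.csv would leave its original header line as a data row, so this approach suits files that truly lack headers):

// Rename the default _c0, _c1, _c2 columns; intended for files
// that have no header row, since any existing header line
// would otherwise remain in the data.
val namedCsv = spark.read
  .option("header", "false")
  .csv("csvFiles/students.csv")
  .toDF("Roll", "Name", "Marks")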

Reading CSV Files with Different Delimiters

Assume we have a CSV file that uses a semicolon (;) as its delimiter. To read such data, we specify the delimiter with option("delimiter", ";").

Consider the csv file studentsDiffDelimiter.csv

Roll;Name;Marks
1;Ajay;55
2;Bharghav;63
3;Chaitra;60
4;Kamal;75
5;Sohaib;70

Spark Code to Read:

val delimiterCsv = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("encoding", "UTF-8")
  .option("delimiter", ";")
  .csv("csvFiles/studentsDiffDelimiter.csv")

delimiterCsv.show()

Output

+----+--------+-----+
|Roll|    Name|Marks|
+----+--------+-----+
|   1|    Ajay|   55|
|   2|Bharghav|   63|
|   3| Chaitra|   60|
|   4|   Kamal|   75|
|   5|  Sohaib|   70|
+----+--------+-----+

You can also use other delimiters like "|", "\t" (tab), " " (space), etc., by changing the value in option("delimiter", "...").
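
For example, a tab-separated file (here a hypothetical studentsTab.tsv) could be read like this:

// Reading a tab-separated file; the option "sep" is an
// equivalent alias for "delimiter" in Spark's CSV reader
val tabCsv = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\t")
  .csv("csvFiles/studentsTab.tsv")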

Selecting Specific Columns from a CSV File

We use the select() function to pick the desired columns from the CSV data.

val selectCols = readCsv.select("Name", "Marks")

selectCols.show()

Output

+--------+-----+
|    Name|Marks|
+--------+-----+
|    Ajay|   55|
|Bharghav|   63|
| Chaitra|   60|
|   Kamal|   75|
|  Sohaib|   70|
+--------+-----+
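
select() also accepts column expressions, which lets you derive new values while selecting. A sketch that, purely for illustration, adds 5 grace marks to each student:

import org.apache.spark.sql.functions.col

// Select Name along with a derived column computed from Marks
val graced = readCsv.select(
  col("Name"),
  (col("Marks") + 5).alias("GraceMarks")
)
graced.show()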

Filtering Rows from a CSV File

We use the filter() function to keep only the rows that satisfy a condition.

val selectVals = readCsv.filter("Marks>=63")

selectVals.show()

Output

+----+--------+-----+
|Roll|    Name|Marks|
+----+--------+-----+
|   2|Bharghav|   63|
|   4|   Kamal|   75|
|   5|  Sohaib|   70|
+----+--------+-----+
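
filter() also accepts Column expressions, which make it easy to combine conditions. A sketch combining two conditions (the conditions themselves are illustrative):

import org.apache.spark.sql.functions.col

// Keep rows where Marks >= 63 and the name is not "Kamal"
val filtered = readCsv.filter(col("Marks") >= 63 && col("Name") =!= "Kamal")
filtered.show()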

Renaming Columns of a CSV File

We can opt for the withColumnRenamed() method to rename a column of the CSV file.

val colName = readCsv.withColumnRenamed("Marks", "Math Marks")

colName.show()

Output

+----+--------+----------+
|Roll|    Name|Math Marks|
+----+--------+----------+
|   1|    Ajay|        55|
|   2|Bharghav|        63|
|   3| Chaitra|        60|
|   4|   Kamal|        75|
|   5|  Sohaib|        70|
+----+--------+----------+
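
To rename several columns, withColumnRenamed() calls can be chained. A minimal sketch (the new names are illustrative):

// Chain withColumnRenamed to rename more than one column
val renamed = readCsv
  .withColumnRenamed("Roll", "RollNumber")
  .withColumnRenamed("Marks", "Math Marks")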

Writing the CSV File with the Updates

Writing the CSV file means saving the changes that were made after loading the CSV file into a DataFrame. To save the changes, we use the write API on the DataFrame, followed by .csv(folderpath).

colName.write
  .mode("overwrite")
  .option("header", "true")
  .csv("output")

This writes the updated data as CSV part files inside the output directory. Note that Spark produces a directory of part files rather than a single CSV file.
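
If a single CSV file is required, you can coalesce the DataFrame to one partition before writing. A sketch, reasonable only when the data comfortably fits in one partition (the path outputSingle is illustrative):

// Coalesce to a single partition so Spark emits one part file;
// suitable only for data small enough to sit on one executor.
colName.coalesce(1).write
  .mode("overwrite")
  .option("header", "true")
  .csv("outputSingle")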

Summary

In this article, you learned how to handle CSV files in Spark:

  • Read CSV files using spark.read.csv()

  • Control headers with option("header", "true" | "false")

  • Use custom delimiters with option("delimiter", ";")

  • Select and filter rows

  • Rename columns

  • Save the DataFrame back as a CSV file

These operations are essential for preprocessing and analyzing structured data using Apache Spark.
