Handling CSV Files
Last updated on: 2025-05-30
Data can be stored in multiple formats—one common format is CSV.
CSV (Comma-Separated Values) is a simple file format where each line represents a data record, and fields are separated by commas.
When working with big data, reading CSV files efficiently becomes important. Apache Spark provides powerful tools to read, transform, and write CSV files at scale. Let us now look into handling CSV data using Spark.
Reading CSV files in Spark
Sample CSV File: students.csv
Roll,Name,Marks
1,Ajay,55
2,Bharghav,63
3,Chaitra,60
4,Kamal,75
5,Sohaib,70
Basic CSV Reading Syntax in Spark
To load CSV files in the Spark framework, we use spark.read, followed by .option() calls and .csv(filepath).
//reading a csv file
val readCsv = spark.read
.option("header", "true")
.option("inferSchema", "true")
.option("encoding", "UTF-8")
.csv("csvFiles/students.csv")
readCsv.show()
readCsv.printSchema()
Explanation:
- .option("header", "true"): Treats the first row as column names.
- .option("inferSchema", "true"): Infers data types for each column.
- .option("encoding", "UTF-8"): Handles character encoding.
Output
+----+--------+-----+
|Roll| Name|Marks|
+----+--------+-----+
| 1| Ajay| 55|
| 2|Bharghav| 63|
| 3| Chaitra| 60|
| 4| Kamal| 75|
| 5| Sohaib| 70|
+----+--------+-----+
root
|-- Roll: integer (nullable = true)
|-- Name: string (nullable = true)
|-- Marks: integer (nullable = true)
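Note that inferSchema makes Spark scan the data an extra time to guess the column types. For larger files, you can skip that pass by declaring the schema yourself. Here is a minimal sketch, assuming the same students.csv file:
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Declare the column names and types up front instead of inferring them
val studentSchema = StructType(Seq(
  StructField("Roll", IntegerType, nullable = true),
  StructField("Name", StringType, nullable = true),
  StructField("Marks", IntegerType, nullable = true)
))

val typedCsv = spark.read
  .option("header", "true")
  .schema(studentSchema) // no extra inference pass over the file
  .csv("csvFiles/students.csv")
typedCsv.printSchema()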
Define the header of the CSV file
Spark provides a header flag which, when set to true, reads the first row of the CSV file as the column names.
When header is set to false, the first row is treated as ordinary data and Spark assigns default column names instead.
With Header:
val headCsvYes = spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv("csvFiles/students.csv")
headCsvYes.show()
Output
+----+--------+-----+
|Roll| Name|Marks|
+----+--------+-----+
| 1| Ajay| 55|
| 2|Bharghav| 63|
| 3| Chaitra| 60|
| 4| Kamal| 75|
| 5| Sohaib| 70|
+----+--------+-----+
Without Header:
val headCsv = spark.read
.option("header", "false")
.option("inferSchema", "true")
.csv("csvFiles/students.csv")
headCsv.show()
Output
+----+--------+-----+
| _c0| _c1| _c2|
+----+--------+-----+
|Roll| Name|Marks|
| 1| Ajay| 55|
| 2|Bharghav| 63|
| 3| Chaitra| 60|
| 4| Kamal| 75|
| 5| Sohaib| 70|
+----+--------+-----+
Without specifying the header, Spark assigns default column names like _c0, _c1, etc.
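If a file genuinely has no header row, you can still assign meaningful names yourself. A small sketch, assuming a hypothetical headerless file students_no_header.csv with the same columns:
// Hypothetical file without a header row; toDF assigns our own column names
val namedCsv = spark.read
  .option("header", "false")
  .option("inferSchema", "true")
  .csv("csvFiles/students_no_header.csv")
  .toDF("Roll", "Name", "Marks")
namedCsv.show()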
Reading CSV files with different delimiters
Assume we have a CSV file that uses a semicolon (;) as its delimiter. To read such data, we specify the delimiter with option("delimiter", ";").
Consider the CSV file studentsDiffDelimiter.csv:
Roll;Name;Marks
1;Ajay;55
2;Bharghav;63
3;Chaitra;60
4;Kamal;75
5;Sohaib;70
Spark Code to Read:
val delimiterCsv = spark.read
.option("header", "true")
.option("inferSchema", "true")
.option("encoding", "UTF-8")
.option("delimiter", ";")
.csv("csvFiles/studentsDiffDelimiter.csv")
delimiterCsv.show()
Output
+----+--------+-----+
|Roll| Name|Marks|
+----+--------+-----+
| 1| Ajay| 55|
| 2|Bharghav| 63|
| 3| Chaitra| 60|
| 4| Kamal| 75|
| 5| Sohaib| 70|
+----+--------+-----+
You can also use other delimiters like "|", "\t" (tab), " " (space), etc., by changing the value in option("delimiter", "...").
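For instance, a tab-separated file could be read as follows. This is a sketch, assuming a hypothetical file students.tsv with the same columns:
// Hypothetical tab-separated file; only the delimiter option changes
val tabCsv = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\t")
  .csv("csvFiles/students.tsv")
tabCsv.show()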
Selecting specific columns from a CSV file
We use the select() function to pick the desired columns from the CSV data.
val selectCols = readCsv.select("Name", "Marks")
selectCols.show()
Output
+--------+-----+
| Name|Marks|
+--------+-----+
| Ajay| 55|
|Bharghav| 63|
| Chaitra| 60|
| Kamal| 75|
| Sohaib| 70|
+--------+-----+
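select() also accepts column expressions, so you can derive new columns while selecting. A minimal sketch, assuming we want a hypothetical AdjustedMarks column that adds 5 grace marks:
import org.apache.spark.sql.functions.col

// Select Name along with a column derived from Marks
val adjusted = readCsv.select(col("Name"), (col("Marks") + 5).as("AdjustedMarks"))
adjusted.show()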
Filter rows from a CSV file
We use the filter() function to keep only the rows that satisfy a given condition.
val selectVals = readCsv.filter("Marks>=63")
selectVals.show()
Output
+----+--------+-----+
|Roll| Name|Marks|
+----+--------+-----+
| 2|Bharghav| 63|
| 4| Kamal| 75|
| 5| Sohaib| 70|
+----+--------+-----+
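filter() also accepts Column expressions, which makes it easy to combine conditions. A sketch with an assumed compound condition:
import org.apache.spark.sql.functions.col

// Keep rows where Marks >= 63 and the name starts with "B"
val filtered = readCsv.filter(col("Marks") >= 63 && col("Name").startsWith("B"))
filtered.show()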
Renaming columns of a CSV file
We can opt for the withColumnRenamed() method to rename a column of the CSV file.
val colName = readCsv.withColumnRenamed("Marks", "Math Marks")
colName.show()
Output
+----+--------+----------+
|Roll| Name|Math Marks|
+----+--------+----------+
| 1| Ajay| 55|
| 2|Bharghav| 63|
| 3| Chaitra| 60|
| 4| Kamal| 75|
| 5| Sohaib| 70|
+----+--------+----------+
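To rename several columns, withColumnRenamed() calls can simply be chained, since each call returns a new DataFrame. A sketch using an assumed new name RollNumber:
// Each withColumnRenamed returns a new DataFrame, so the calls chain naturally
val renamedAll = readCsv
  .withColumnRenamed("Roll", "RollNumber")
  .withColumnRenamed("Marks", "Math Marks")
renamedAll.show()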
Write the CSV file with the updates
Writing the CSV file means saving the changes that were made after loading the CSV file into a DataFrame.
To save the changes, we use the write method, followed by .csv(folderpath).
colName.write
.format("csv")
.mode("overwrite")
.option("header","true")
.csv("output")
This writes the updated data as CSV part files in the output directory.
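Spark writes one part file per partition inside the output directory. If you want a single CSV file for a small result, you can coalesce to one partition first. A sketch, assuming a hypothetical outputSingle directory (note that this sacrifices write parallelism):
// Coalescing to one partition makes Spark emit a single part file
colName.coalesce(1)
  .write
  .mode("overwrite")
  .option("header", "true")
  .csv("outputSingle")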
Summary
In this article, you learned how to handle CSV files in Spark:
- Read CSV files using spark.read.csv()
- Control headers with option("header", "true" | "false")
- Use custom delimiters with option("delimiter", ";")
- Select columns and filter rows
- Rename columns
- Save the DataFrame back as a CSV file
These operations are essential for preprocessing and analyzing structured data using Apache Spark.