File Operations

Last updated on: 2025-05-30

CSV files can store data in a variety of formats:

  • Records may appear on a single line, separated by delimiters.

  • Delimiters used can vary—common examples include commas (,), pipes (|), or semicolons (;).

  • In some cases, individual records may span multiple lines.

Apache Spark offers several options to correctly interpret and read CSV files, regardless of how the data is structured. In this article, we’ll explore how to handle multiline records and different line separators when working with CSV files in Spark.
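As a first example of these options, a non-comma delimiter can be handled with Spark's sep option. This is a minimal sketch; the pipe-delimited file name here is hypothetical:

```scala
// Hypothetical pipe-delimited file: csvFiles/studentsPipe.csv
// Roll|Name|Marks
// 1|Ajay|55
val pipeDf = spark.read
  .option("header", "true")
  .option("sep", "|")          // delimiter; defaults to ","
  .csv("csvFiles/studentsPipe.csv")
```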

Reading CSV Files with Multiline Records

Sometimes, fields in a CSV file, especially those containing free text, may include newline characters, causing a single record to span multiple lines. By default, Spark reads each line as a separate row, which can result in incorrect parsing. To handle such cases, Spark provides the multiline option.

The multiline option is set to false by default. To demonstrate this behavior, consider the following multiLine.csv file:

Roll,Name,Marks,Description
1,Ajay,55,"He loves
to sing"
2,Bharghav,63,"He is a
basketball player"
3,Chaitra,60,"She is the best
person for debate competitions"
4,Kamal,75,"He is the topper of the class"
5,Sohaib,70,"He is the son of
Vice Principal"

If we attempt to read this file without enabling multiline parsing:

val df = spark.read
  .option("header", "true")
  .csv("csvFiles/multiline.csv")

df.show(truncate = false)

Output

+-------------------------------+--------+-----+--------------------------------+
|Roll                           |Name    |Marks| Description                    |
+-------------------------------+--------+-----+--------------------------------+
|1                              |Ajay    |55   |He loves                        |
|to sing"                       |NULL    |NULL |NULL                            |
|2                              |Bharghav|63   |He is a                         |
|basketball player"             |NULL    |NULL |NULL                            |
|3                              |Chaitra |60   |She is the best                 |
|person for debate competitions"|NULL    |NULL |NULL                            |
|4                              |Kamal   |75   |He is the topper of the class   |
|5                              |Sohaib  |70   |He is the son of                |
|Vice Principal"                |NULL    |NULL |NULL                            |
+-------------------------------+--------+-----+--------------------------------+

As seen above, Spark incorrectly splits records by line, treating each new line as a separate row.

To fix this, set the multiline option to true:

val multilineDf = spark.read
  .option("header", "true")
  .option("multiline", "true")
  .csv("csvFiles/multiline.csv")

multilineDf.show(truncate = false)

Output

+----+--------+-----+-----------------------------------------------+
|Roll|Name    |Marks| Description                                   |
+----+--------+-----+-----------------------------------------------+
|1   |Ajay    |55   |He loves\nto sing                              |
|2   |Bharghav|63   |He is a\nbasketball player                     |
|3   |Chaitra |60   |She is the best\nperson for debate competitions|
|4   |Kamal   |75   |He is the topper of the class                  |
|5   |Sohaib  |70   |He is the son of\nVice Principal               |
+----+--------+-----+-----------------------------------------------+

The data is now parsed correctly, treating each multiline record as a single row.

Handling Different Line Separators in CSV Files

Line endings in CSV files can vary depending on the operating system used to create them. There are three common line separators:

  • CR (\r) - used by classic Mac OS
  • LF (\n) - common on Linux and modern macOS
  • CRLF (\r\n) - standard on Windows
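
To see why the separator matters, here is a plain Scala sketch (independent of Spark) showing the same bytes splitting differently under each assumption:

```scala
object LineSepDemo extends App {
  // CRLF-encoded content, as a Windows tool might write it.
  val crlf = "Roll,Name,Marks\r\n1,Ajay,55\r\n2,Bharghav,63"

  // Splitting on "\r\n" recovers the three intended records.
  println(crlf.split("\r\n").length)                 // 3

  // Splitting on "\n" alone leaves a stray '\r' at the end of
  // the first two chunks, corrupting the last field of each.
  println(crlf.split("\n").count(_.endsWith("\r")))  // 2
}
```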

Let’s say we have two files:

  • students.csv encoded with CRLF (\r\n) line endings.

  • studentsLineSeparators.csv encoded with LF (\n) line endings.

Here is an example CSV structure:

Roll,Name,Marks
1,Ajay,55
2,Bharghav,63
3,Chaitra,60
4,Kamal,75
5,Sohaib,70

Now, if we read the studentsLineSeparators.csv file using the wrong line separator (\r\n instead of \n), Spark will misinterpret the structure:

val lineSepCsv = spark.read
  .option("header", "true")
  .option("lineSep", "\r\n")
  .csv("csvFiles/studentsLineSeparators.csv")

lineSepCsv.show()

Incorrect Output

+----+----+--------+----+-----+--------+-----+-------+-----+-----+-----+------+---+
|Roll|Name|Marks\n1|Ajay|55\n2|Bharghav|63\n3|Chaitra|60\n4|Kamal|75\n5|Sohaib| 70|
+----+----+--------+----+-----+--------+-----+-------+-----+-----+-----+------+---+
+----+----+--------+----+-----+--------+-----+-------+-----+-----+-----+------+---+

To correctly parse the file, we must specify the appropriate line separator:

val lineSepCsv1 = spark.read
  .option("header", "true")
  .option("lineSep", "\n")
  .csv("csvFiles/studentsLineSeparators.csv")

lineSepCsv1.show()

Correct Output

+----+--------+-----+
|Roll|    Name|Marks|
+----+--------+-----+
|   1|    Ajay|   55|
|   2|Bharghav|   63|
|   3| Chaitra|   60|
|   4|   Kamal|   75|
|   5|  Sohaib|   70|
+----+--------+-----+

Now, the file is parsed as expected.
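
The CRLF-encoded students.csv mentioned earlier is handled the same way; a sketch, assuming the file path from the examples above, simply matches lineSep to the file's actual encoding:

```scala
val crlfCsv = spark.read
  .option("header", "true")
  .option("lineSep", "\r\n")   // matches the file's Windows-style endings
  .csv("csvFiles/students.csv")

crlfCsv.show()
```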

Summary

In this article, you learned:

  • How to use the multiline option in Spark to correctly parse CSV files with records spanning multiple lines.

  • The importance of specifying the correct lineSep when reading CSV files written with different line endings (CR, LF, CRLF).

  • What happens when incorrect line separators are specified—resulting in malformed or misinterpreted data.

These techniques help ensure accurate data ingestion when working with varied CSV formats in Apache Spark.
