Handling Unescaped Quotes

Last updated on: 2025-05-30

Sometimes source data files contain records with unescaped quotes inside field values. For example:

She said, "hello" to everyone.

Here, the entire sentence is one field, but the inner quotes ("hello") can cause parsing issues if not handled properly.
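
In a well-formed CSV file, the inner quotes would be escaped by doubling them (per RFC 4180), so a compliant parser could read the field cleanly:

"She said, ""hello"" to everyone."

The files this article deals with lack that escaping, which is exactly what trips up the parser.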

Reading CSV Files with Unescaped Quote Records

Consider the CSV file delimiter.csv with the following content:

Roll,Name,Marks,Dialogue
1,Ajay,55,"He said "good morning" to the teacher"
2,Bharghav,63,"This is the "locker" in the country"
3,Chaitra,60,"You cannot be serious"
4,Kamal,75,"I love "apple pie""
5,Sohaib,70,"It is raining outside"

Notice that rows 1, 2, and 4 contain unescaped quotes inside the Dialogue field (good morning, locker, and apple pie).

Spark’s unescapedQuoteHandling Option

Apache Spark provides the unescapedQuoteHandling option to deal with unescaped quotes in CSV files. It offers several modes for parsing such data: STOP_AT_CLOSING_QUOTE, BACK_TO_DELIMITER, STOP_AT_DELIMITER, SKIP_VALUE, and RAISE_ERROR. This article walks through the first three.

Let’s explore how Spark behaves with different settings.
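
If you want to run the examples yourself, here is a minimal setup sketch; the app name, master URL, and file path are illustrative, and the sample file is written out locally so the reads below are reproducible:

import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession

// Local session for experimentation (app name and master are illustrative).
val spark = SparkSession.builder()
  .appName("UnescapedQuotes")
  .master("local[*]")
  .getOrCreate()

// Write the sample file from above so the examples are self-contained.
val content =
  """Roll,Name,Marks,Dialogue
    |1,Ajay,55,"He said "good morning" to the teacher"
    |2,Bharghav,63,"This is the "locker" in the country"
    |3,Chaitra,60,"You cannot be serious"
    |4,Kamal,75,"I love "apple pie""
    |5,Sohaib,70,"It is raining outside"
    |""".stripMargin

Files.createDirectories(Paths.get("csvFiles"))
Files.write(Paths.get("csvFiles/delimiter.csv"), content.getBytes("UTF-8"))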

Method-1: STOP_AT_CLOSING_QUOTE

val dfDelimiter = spark.read
  .option("header", "true")
  .option("unescapedQuoteHandling", "STOP_AT_CLOSING_QUOTE")
  .csv("csvFiles/delimiter.csv")

dfDelimiter.show(truncate = false)

Output

+----+--------+-----+-------------------------------------+
|Roll|Name    |Marks|Dialogue                             |
+----+--------+-----+-------------------------------------+
|1   |Ajay    |55   |He said "good morning" to the teacher|
|2   |Bharghav|63   |This is the "locker" in the country  |
|3   |Chaitra |60   |You cannot be serious                |
|4   |Kamal   |75   |I love "apple pie"                   |
|5   |Sohaib  |70   |It is raining outside                |
+----+--------+-----+-------------------------------------+

With this setting, the parser keeps treating the value as quoted until a closing quote is found. It generally works well, as the output above shows, but may truncate data in some cases.

Method-2: BACK_TO_DELIMITER

val dfDelimiter1 = spark.read
  .option("header", "true")
  .option("unescapedQuoteHandling", "BACK_TO_DELIMITER")
  .csv("csvFiles/delimiter.csv")

dfDelimiter1.show(truncate = false)

Output

+----+--------+-----+--------------------------------------+
|Roll|Name    |Marks|Dialogue                              |
+----+--------+-----+--------------------------------------+
|1   |Ajay    |55   |"He said good morning" to the teacher"|
|2   |Bharghav|63   |"This is the locker" in the country"  |
|3   |Chaitra |60   |You cannot be serious                 |
|4   |Kamal   |75   |"I love apple pie" "                  |
|5   |Sohaib  |70   |It is raining outside                 |
+----+--------+-----+--------------------------------------+

Upon encountering an unescaped quote, this mode goes back to the delimiter and re-parses the value as unquoted, which avoids truncation but, as the stray quotes above show, can cause misinterpretation of the data.
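
Given the stray quotes visible in the output above, a quick sanity check can flag values that were likely misparsed. A minimal sketch (the Dialogue column name comes from the sample file):

import org.apache.spark.sql.functions.col

// Flag Dialogue values that still begin or end with a stray quote,
// a telltale sign the parser guessed the field boundary wrong.
val suspicious = dfDelimiter1.filter(
  col("Dialogue").startsWith("\"") || col("Dialogue").endsWith("\"")
)
suspicious.show(truncate = false)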

Method-3: STOP_AT_DELIMITER

val dfDelimiter2 = spark.read
  .option("header", "true")
  .option("unescapedQuoteHandling", "STOP_AT_DELIMITER")
  .csv("csvFiles/delimiter.csv")

dfDelimiter2.show(truncate = false)

Output

+----+--------+-----+---------------------------------------+
|Roll|Name    |Marks|Dialogue                               |
+----+--------+-----+---------------------------------------+
|1   |Ajay    |55   |"He said "good morning" to the teacher"|
|2   |Bharghav|63   |"This is the "locker" in the country"  |
|3   |Chaitra |60   |You cannot be serious                  |
|4   |Kamal   |75   |"I love "apple pie" "                  |
|5   |Sohaib  |70   |It is raining outside                  |
+----+--------+-----+---------------------------------------+

This option treats the value as unquoted and keeps reading until the delimiter is reached, which also avoids truncation but can lead to incorrect parsing, again leaving stray quotes in the values.
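
Besides these three modes, the parser also supports SKIP_VALUE and RAISE_ERROR. Here is a hedged sketch of SKIP_VALUE which, on the assumption that it nulls out any value containing an unescaped quote rather than guessing at its boundaries, trades data loss for predictability:

// Assumption: SKIP_VALUE sets values containing unescaped quotes to null.
val dfSkip = spark.read
  .option("header", "true")
  .option("unescapedQuoteHandling", "SKIP_VALUE")
  .csv("csvFiles/delimiter.csv")

dfSkip.show(truncate = false) // expect null Dialogue for rows 1, 2, and 4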

We have now seen how Spark reads the CSV file under each of these unescapedQuoteHandling settings.

Choosing the Right Option

  • STOP_AT_CLOSING_QUOTE often works well but can sometimes truncate data.

  • BACK_TO_DELIMITER and STOP_AT_DELIMITER reduce truncation risk but might cause data misinterpretation.

For large and diverse datasets, it's crucial to test and choose the unescapedQuoteHandling option that best fits your data to maintain both accuracy and completeness.
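
One lightweight way to run that comparison is to read the same file under each mode and inspect the results side by side; a sketch using the modes covered above:

// Parse the file under each mode and print the Dialogue column,
// making the trade-offs visible at a glance.
val modes = Seq("STOP_AT_CLOSING_QUOTE", "BACK_TO_DELIMITER", "STOP_AT_DELIMITER")

modes.foreach { mode =>
  val df = spark.read
    .option("header", "true")
    .option("unescapedQuoteHandling", mode)
    .csv("csvFiles/delimiter.csv")

  println(s"--- $mode ---")
  df.select("Roll", "Dialogue").show(truncate = false)
}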

Summary

In this article we have seen:

  • The challenges of reading data that contains unescaped quotes.
  • How Spark's unescapedQuoteHandling option helps deal with columns containing unescaped quotes.
  • The pros and cons of each unescapedQuoteHandling mode.
