Handling Encodings

Last updated on: 2025-05-30

Encoding CSV files properly ensures that data is correctly interpreted and displayed across systems and regions. An encoding defines how characters are stored in and read from a file. Using the wrong encoding can corrupt special characters, making them unreadable or causing data loss, especially for non-English languages such as Hindi, German, and Chinese.

When converting a file from one encoding to another, make sure the target encoding can represent every character in the source; otherwise, information is silently lost or corrupted.

This article explores how to handle encoding and encoding conversions for CSV files using Apache Spark.

Safely Reading Encoded CSV Files

If the encoding of a CSV file is known, it should be specified explicitly when reading the file. If the encoding is unknown, it's recommended to try UTF-8 first: it can represent every Unicode character and is also Spark's default for CSV sources. If that still produces garbled text, a charset-detection library can provide a first guess, as sketched below.
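For illustration, assuming the third-party juniversalchardet library is on the classpath (it is not bundled with Spark), a detection pass might look like this. Detection is heuristic, so treat the result as a hint rather than a guarantee:

import java.io.FileInputStream
import org.mozilla.universalchardet.UniversalDetector

// Feed the file's bytes to the detector until it is confident or we hit EOF
val in = new FileInputStream("csvFiles/empSalary.csv")
val detector = new UniversalDetector(null)
val buf = new Array[Byte](4096)
var n = in.read(buf)
while (n > 0 && !detector.isDone()) {
  detector.handleData(buf, 0, n)
  n = in.read(buf)
}
detector.dataEnd()
println(detector.getDetectedCharset)  // e.g. "UTF-8", or null if undetermined
in.close()

Consider the following UTF-8 encoded CSV file: empSalary.csv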

Name,Nationality,Salary
Jürgen Müller,Germany,£1400
Élodie Durand,France,£1450
José García,Spain,£1395
Åsa Björk,Sweden,£1500

Reading the File with Incorrect Encoding

Let's read the file using the Windows-1252 encoding instead of UTF-8:

// Read the UTF-8 file with the wrong encoding on purpose
val encodeDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("encoding", "Windows-1252")
  .csv("csvFiles/empSalary.csv")

encodeDf.show()
encodeDf.printSchema()

Output

+---------------+-----------+------+
|           Name|Nationality|Salary|
+---------------+-----------+------+
|JÃ¼rgen MÃ¼ller|    Germany|Â£1400|
| Ã‰lodie Durand|     France|Â£1450|
|  JosÃ© GarcÃ­a|      Spain|Â£1395|
|    Ã…sa BjÃ¶rk|     Sweden|Â£1500|
+---------------+-----------+------+

root
 |-- Name: string (nullable = true)
 |-- Nationality: string (nullable = true)
 |-- Salary: string (nullable = true)

Here, each multi-byte UTF-8 sequence has been decoded as two unrelated Windows-1252 characters, producing corrupted values (mojibake) in the Name and Salary columns.
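This corruption is easy to reproduce outside Spark. The snippet below encodes a string as UTF-8 bytes and then decodes those bytes as Windows-1252, which is effectively what the reader did above:

import java.nio.charset.StandardCharsets

// Encode as UTF-8, then decode the raw bytes as Windows-1252
val original = "Jürgen Müller"
val garbled = new String(original.getBytes(StandardCharsets.UTF_8), "Windows-1252")
println(garbled)  // JÃ¼rgen MÃ¼ller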

Reading the File with Correct Encoding

// Read the same file with the encoding it was actually written in
val encodeDf1 = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("encoding", "UTF-8")
  .csv("csvFiles/empSalary.csv")

encodeDf1.show()
encodeDf1.printSchema()

Output

+-------------+-----------+------+
|         Name|Nationality|Salary|
+-------------+-----------+------+
|Jürgen Müller|    Germany| £1400|
|Élodie Durand|     France| £1450|
|  José García|      Spain| £1395|
|    Åsa Björk|     Sweden| £1500|
+-------------+-----------+------+

root
 |-- Name: string (nullable = true)
 |-- Nationality: string (nullable = true)
 |-- Salary: string (nullable = true)

The characters are now correctly interpreted, preserving data accuracy and readability.

Common File Encodings

Here are some common encodings you may encounter:

  • UTF-8: The most widely used encoding; it can represent every Unicode character.

  • UTF-8 with BOM: Often required for compatibility with Excel on Windows.

  • Windows-1252: Commonly used for Western European languages.

When working with files from different sources or languages, always test for encoding compatibility to prevent issues during data ingestion.
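A common way to deal with mixed sources is to normalize everything to UTF-8 during ingestion. Here is a minimal sketch that reads a hypothetical Windows-1252 export and rewrites it as UTF-8; Spark's CSV writer accepts the same encoding option as the reader:

// Read a legacy Windows-1252 file (path is illustrative)
val legacyDf = spark.read
  .option("header", "true")
  .option("encoding", "Windows-1252")
  .csv("csvFiles/legacyExport.csv")

// Write it back out as UTF-8 (also the default encoding for writes)
legacyDf.write
  .option("header", "true")
  .option("encoding", "UTF-8")
  .csv("csvFiles/legacyExportUtf8")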

Summary

In this article, we covered:

  • The importance of encoding in CSV files and the risks of incorrect encoding.

  • How to read CSV files with the correct encoding using Apache Spark.

  • A comparison of file outputs when read with incorrect vs. correct encoding.

  • A brief overview of commonly used encoding formats.

Using the correct encoding is essential for maintaining data integrity and ensuring accurate downstream processing.
