Handling Date and Time Formats

Last updated on: 2025-05-30

When working with big data, it’s common to deal with CSV files containing date and timestamp values. These values are crucial for deriving insights and performing accurate analyses. However, inconsistent or non-standard formats often require careful parsing to avoid misinterpretation or loss of information.

Apache Spark provides options to handle such custom date and timestamp formats effectively. In this article, we’ll explore multiple ways to read CSV files with non-standard date and timestamp formats using Spark.

Sample Dataset 1: studentDate.csv

Consider the CSV file studentDate.csv:

ID,Name,LongValue,DOB,Submit Time,DoubleMarks
1,"Ajay",10,"2010:01:01","2025-02-17 12:30:45",92.75
2,"Bharghav",20,"2009:06:04","2025-02-17 08:15:30",88.5
3,"Chaitra",30,"2010:12:12","2025-02-17 14:45:10",75.8
4,"Kamal",40,"2010:08:25","2025-02-17 17:10:05",82.3
5,"Sohaib",50,"2009:04:14","2025-02-17 09:55:20",90.6

Here, DOB is in the format yyyy:MM:dd, and Submit Time is in the format yyyy-MM-dd HH:mm:ss.

Reading CSV with Custom Date Format

We can instruct Spark to parse these values correctly by supplying the dateFormat and timestampFormat options:

val dateDf2 = spark.read
  .option("header", "true")      // first row holds column names
  .option("inferSchema", "true") // let Spark infer column types
  .option("dateFormat", "yyyy:MM:dd")               // matches DOB
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") // matches Submit Time
  .csv("csvFiles/studentDate.csv")

println("Updated CSV File")
dateDf2.show()
dateDf2.printSchema()

Output

Updated CSV File
+---+--------+---------+----------+-------------------+-----------+
| ID|    Name|LongValue|       DOB|        Submit Time|DoubleMarks|
+---+--------+---------+----------+-------------------+-----------+
|  1|    Ajay|       10|2010-01-01|2025-02-17 12:30:45|      92.75|
|  2|Bharghav|       20|2009-06-04|2025-02-17 08:15:30|       88.5|
|  3| Chaitra|       30|2010-12-12|2025-02-17 14:45:10|       75.8|
|  4|   Kamal|       40|2010-08-25|2025-02-17 17:10:05|       82.3|
|  5|  Sohaib|       50|2009-04-14|2025-02-17 09:55:20|       90.6|
+---+--------+---------+----------+-------------------+-----------+

root
 |-- ID: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- LongValue: integer (nullable = true)
 |-- DOB: date (nullable = true)
 |-- Submit Time: timestamp (nullable = true)
 |-- DoubleMarks: double (nullable = true)

Spark correctly infers DOB as a date column and Submit Time as a timestamp column.
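
Because DOB and Submit Time are now genuine date and timestamp columns, Spark's built-in date and time functions work on them directly. A minimal sketch (the filter threshold and the derived column name SubmitHour are illustrative):

import org.apache.spark.sql.functions._

// DOB is a DateType column, so date comparisons need no manual casting
dateDf2
  .filter(col("DOB") >= to_date(lit("2010-01-01")))
  .withColumn("SubmitHour", hour(col("Submit Time"))) // hour component of the timestamp
  .show()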

Sample Dataset 2: studentDate2.csv

Consider the CSV file studentDate2.csv:

ID,Name,LongValue,DOB,Submit Time,DoubleMarks
1,"Ajay",10,"01/01/2010","2025/02/17 12:30:45",92.75
2,"Bharghav",20,"04/06/2009","2025/02/17 12:35:30",88.5
3,"Chaitra",30,"12/12/2010","2025/02/17 12:45:10",75.8
4,"Kamal",40,"25/08/2010","2025/02/17 12:40:05",82.3
5,"Sohaib",50,"14/04/2009","2025/02/17 12:55:20",90.6

In this file, DOB uses the dd/MM/yyyy format and Submit Time uses yyyy/MM/dd HH:mm:ss.

Reading CSV with Different Custom Formats

We can handle these variations using the same options with updated format strings:

val dateDf3 = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("dateFormat", "dd/MM/yyyy")               // matches DOB in this file
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss") // matches Submit Time
  .csv("csvFiles/studentDate2.csv")

println("New CSV file")
dateDf3.show()
dateDf3.printSchema()

Output

New CSV file
+---+--------+---------+----------+-------------------+-----------+
| ID|    Name|LongValue|       DOB|        Submit Time|DoubleMarks|
+---+--------+---------+----------+-------------------+-----------+
|  1|    Ajay|       10|2010-01-01|2025-02-17 12:30:45|      92.75|
|  2|Bharghav|       20|2009-06-04|2025-02-17 12:35:30|       88.5|
|  3| Chaitra|       30|2010-12-12|2025-02-17 12:45:10|       75.8|
|  4|   Kamal|       40|2010-08-25|2025-02-17 12:40:05|       82.3|
|  5|  Sohaib|       50|2009-04-14|2025-02-17 12:55:20|       90.6|
+---+--------+---------+----------+-------------------+-----------+

root
 |-- ID: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- LongValue: integer (nullable = true)
 |-- DOB: date (nullable = true)
 |-- Submit Time: timestamp (nullable = true)
 |-- DoubleMarks: double (nullable = true)

Again, Spark successfully parses the values when given the correct format.
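
If you would rather not rely on schema inference, you can also supply an explicit schema; the dateFormat and timestampFormat options are still applied when parsing the declared DateType and TimestampType columns. A minimal sketch of this alternative (dateDf4 and studentSchema are illustrative names):

import org.apache.spark.sql.types._

// Explicit schema: no inference pass over the data is required
val studentSchema = StructType(Seq(
  StructField("ID", IntegerType),
  StructField("Name", StringType),
  StructField("LongValue", IntegerType),
  StructField("DOB", DateType),
  StructField("Submit Time", TimestampType),
  StructField("DoubleMarks", DoubleType)
))

val dateDf4 = spark.read
  .option("header", "true")
  .option("dateFormat", "dd/MM/yyyy")
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss")
  .schema(studentSchema)
  .csv("csvFiles/studentDate2.csv")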

Why Format Matters

If the correct format isn’t specified, Spark falls back to reading these fields as plain strings instead of dates or timestamps, which may:

  • Break downstream data transformations or comparisons.

  • Lead to incorrect aggregations or filtering logic.

  • Affect the performance and integrity of your analytics pipeline.
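
To see this fallback in action, here is a minimal sketch that reads studentDate2.csv without the format options (rawDf is an illustrative name). Neither column matches Spark's default patterns, so both are inferred as plain strings:

// Without dateFormat/timestampFormat, the custom patterns don't match
// Spark's defaults, so DOB and Submit Time fall back to string
val rawDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("csvFiles/studentDate2.csv")

rawDf.printSchema() // DOB: string, Submit Time: string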

Summary

  • Spark provides dateFormat and timestampFormat options to correctly parse date and timestamp fields in non-standard formats.

  • Always inspect your input file formats before processing, especially when ingesting data from varied sources.

  • Incorrect formats lead to schema misinterpretation and data quality issues.
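
One caveat worth keeping in mind: dateFormat and timestampFormat each apply to every date or timestamp column in the file. If a single file mixes several formats, read the affected columns as strings and convert them individually with to_date and to_timestamp. A minimal sketch, reusing rawDf from the earlier example:

import org.apache.spark.sql.functions._

// Parse each column with its own pattern after reading it as a string
val fixedDf = rawDf
  .withColumn("DOB", to_date(col("DOB"), "dd/MM/yyyy"))
  .withColumn("Submit Time", to_timestamp(col("Submit Time"), "yyyy/MM/dd HH:mm:ss"))

fixedDf.printSchema() // DOB: date, Submit Time: timestamp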
