Handling Date and Time Format
Last updated on: 2025-05-30
When working with big data, it’s common to deal with CSV files containing date and timestamp values. These values are crucial for deriving insights and performing accurate analyses. However, inconsistent or non-standard formats often require careful parsing to avoid misinterpretation or loss of information.
Apache Spark provides options to handle such custom date and timestamp formats effectively. In this article, we’ll explore multiple ways to read CSV files with non-standard date and timestamp formats using Spark.
Sample Dataset 1: studentDate.csv
Consider the CSV file studentDate.csv:
ID,Name,LongValue,DOB,Submit Time,DoubleMarks
1,"Ajay",10,"2010:01:01","2025-02-17 12:30:45",92.75
2,"Bharghav",20,"2009:06:04","2025-02-17 08:15:30",88.5
3,"Chaitra",30,"2010:12:12","2025-02-17 14:45:10",75.8
4,"Kamal",40,"2010:08:25","2025-02-17 17:10:05",82.3
5,"Sohaib",50,"2009:04:14","2025-02-17 09:55:20",90.6
Here, the DOB is in the format yyyy:MM:dd, and Submit Time is in the format yyyy-MM-dd HH:mm:ss.
Reading CSV with Custom Date Format
We can instruct Spark to parse these values correctly using the .option("dateFormat", ...) and .option("timestampFormat", ...) options:
val dateDf2 = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("dateFormat", "yyyy:MM:dd")                // matches DOB values like 2010:01:01
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")  // matches Submit Time values
  .csv("csvFiles/studentDate.csv")
println("Updated CSV File")
dateDf2.show()
dateDf2.printSchema()
Output
Updated CSV File
+---+--------+---------+----------+-------------------+-----------+
| ID| Name|LongValue| DOB| Submit Time|DoubleMarks|
+---+--------+---------+----------+-------------------+-----------+
| 1| Ajay| 10|2010-01-01|2025-02-17 12:30:45| 92.75|
| 2|Bharghav| 20|2009-06-04|2025-02-17 08:15:30| 88.5|
| 3| Chaitra| 30|2010-12-12|2025-02-17 14:45:10| 75.8|
| 4| Kamal| 40|2010-08-25|2025-02-17 17:10:05| 82.3|
| 5| Sohaib| 50|2009-04-14|2025-02-17 09:55:20| 90.6|
+---+--------+---------+----------+-------------------+-----------+
root
|-- ID: integer (nullable = true)
|-- Name: string (nullable = true)
|-- LongValue: integer (nullable = true)
|-- DOB: date (nullable = true)
|-- Submit Time: timestamp (nullable = true)
|-- DoubleMarks: double (nullable = true)
With these options, Spark correctly infers DOB as a date and Submit Time as a timestamp.
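If you prefer not to rely on schema inference, you can pair the same format options with an explicit schema. The sketch below shows one way to do this; the variable names (studentSchema, dateDfExplicit) are illustrative:

import org.apache.spark.sql.types._

// Explicit schema: no inference pass over the data is needed
val studentSchema = StructType(Seq(
  StructField("ID", IntegerType),
  StructField("Name", StringType),
  StructField("LongValue", IntegerType),
  StructField("DOB", DateType),
  StructField("Submit Time", TimestampType),
  StructField("DoubleMarks", DoubleType)
))

val dateDfExplicit = spark.read
  .option("header", "true")
  .schema(studentSchema)
  .option("dateFormat", "yyyy:MM:dd")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .csv("csvFiles/studentDate.csv")

dateDfExplicit.printSchema()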
Sample Dataset 2: studentDate2.csv
Consider the CSV file studentDate2.csv:
ID,Name,LongValue,DOB,Submit Time,DoubleMarks
1,"Ajay",10,"01/01/2010","2025/02/17 12:30:45",92.75
2,"Bharghav",20,"04/06/2009","2025/02/17 12:35:30",88.5
3,"Chaitra",30,"12/12/2010","2025/02/17 12:45:10",75.8
4,"Kamal",40,"25/08/2010","2025/02/17 12:40:05",82.3
5,"Sohaib",50,"14/04/2009","2025/02/17 12:55:20",90.6
In this file, the DOB uses the dd/MM/yyyy format and the Submit Time uses yyyy/MM/dd HH:mm:ss.
Reading CSV with Different Custom Formats
We can handle these variations using the same options with updated format strings:
val dateDf3 = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("dateFormat", "dd/MM/yyyy")                // matches DOB values like 25/08/2010
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss")  // matches Submit Time values
  .csv("csvFiles/studentDate2.csv")
println("New CSV file")
dateDf3.show()
dateDf3.printSchema()
Output
New CSV file
+---+--------+---------+----------+-------------------+-----------+
| ID| Name|LongValue| DOB| Submit Time|DoubleMarks|
+---+--------+---------+----------+-------------------+-----------+
| 1| Ajay| 10|2010-01-01|2025-02-17 12:30:45| 92.75|
| 2|Bharghav| 20|2009-06-04|2025-02-17 12:35:30| 88.5|
| 3| Chaitra| 30|2010-12-12|2025-02-17 12:45:10| 75.8|
| 4| Kamal| 40|2010-08-25|2025-02-17 12:40:05| 82.3|
| 5| Sohaib| 50|2009-04-14|2025-02-17 12:55:20| 90.6|
+---+--------+---------+----------+-------------------+-----------+
root
|-- ID: integer (nullable = true)
|-- Name: string (nullable = true)
|-- LongValue: integer (nullable = true)
|-- DOB: date (nullable = true)
|-- Submit Time: timestamp (nullable = true)
|-- DoubleMarks: double (nullable = true)
Again, Spark successfully parses the values when given the correct format.
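Another way to handle non-standard formats is to read the columns as plain strings and convert them afterwards with the to_date and to_timestamp functions. A minimal sketch (the variable names rawDf and parsedDf are illustrative):

import org.apache.spark.sql.functions.{col, to_date, to_timestamp}

// Read everything as strings first, then convert explicitly
val rawDf = spark.read
  .option("header", "true")
  .csv("csvFiles/studentDate2.csv")

val parsedDf = rawDf
  .withColumn("DOB", to_date(col("DOB"), "dd/MM/yyyy"))
  .withColumn("Submit Time", to_timestamp(col("Submit Time"), "yyyy/MM/dd HH:mm:ss"))

parsedDf.printSchema()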
Why Format Matters
If the correct format isn’t specified, Spark will treat these fields as strings instead of dates or timestamps (as the sketch after this list demonstrates), which may:
- Break downstream data transformations or comparisons.
- Lead to incorrect aggregations or filtering logic.
- Affect the performance and integrity of your analytics pipeline.
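To see this for yourself, try reading the first file without the format options. This is a minimal sketch; exact inference results can vary slightly between Spark versions:

// No dateFormat supplied: DOB values like "2010:01:01" match no default
// pattern, so the column falls back to string
val noFormatDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("csvFiles/studentDate.csv")

noFormatDf.printSchema()  // expect DOB to come back as string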
Summary
- Spark provides dateFormat and timestampFormat options to correctly parse date and timestamp fields in non-standard formats.
- Always inspect your input file formats before processing, especially when ingesting data from varied sources.
- Incorrect formats lead to schema misinterpretation and data quality issues.