Date and Time Parsing
Last updated on: 2025-05-30
Date and time values appear in almost every kind of data, from financial transaction records to student exam submission times.
However, working with this type of data can be challenging. Different regions and systems store dates and times in different formats, which can lead to errors during data analysis if the values are not parsed correctly.
To solve this issue, Apache Spark provides functions that parse date and time strings read from files into proper Date and Timestamp types. In this article, we’ll learn how to handle such values using Spark.
Sample JSON File: studentSubmitData.json
Consider the following JSON file with student data, studentSubmitData.json:
{
"Roll": 1,
"Name": "Ajay",
"Age": 14,
"DOB": "01/01/2010",
"Submit Time": "2025/02/17 12:30:45",
"Final Marks": 92.7
},
{
"Roll": 2,
"Name": "Bharghav",
"Age": 15,
"DOB": "04/06/2009",
"Submit Time": "2025/02/17 12:35:30",
"Final Marks": 88.5
},
...
1. Parsing date values in JSON file
JSON does not have a built-in Date or Timestamp type. It can only store values as strings, numbers, booleans, or null, plus arrays and objects built from them.
Thus, Spark reads the DOB column as a String.
Let’s read the JSON file:
val submitData = spark.read
.option("multiLine", "true")
.option("inferSchema", "true")
.json("jsonFiles/studentSubmitData.json")
submitData.printSchema()
Output
root
|-- Age: long (nullable = true)
|-- DOB: string (nullable = true)
|-- Final Marks: double (nullable = true)
|-- Name: string (nullable = true)
|-- Roll: long (nullable = true)
|-- Submit Time: string (nullable = true)
As you can see, the DOB and Submit Time columns are interpreted as string types.
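The string types appear because Spark infers the schema from the JSON values alone. One way to get typed columns directly at read time, sketched here under assumptions, is to pass an explicit schema together with the JSON reader's dateFormat and timestampFormat options. The temporary file, the reduced set of fields, and the app name below are hypothetical and exist only to keep the sketch self-contained. The file is written in JSON Lines form (one object per line), so multiLine is not needed.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// Assumption: a local session; inside spark-shell, getOrCreate() reuses the existing one
val spark = SparkSession.builder()
  .appName("read-time-parsing-sketch") // hypothetical app name
  .master("local[*]")
  .getOrCreate()

// Hypothetical temp file holding one record with a subset of the sample fields
val path = Files.createTempFile("studentSubmitData", ".json")
Files.write(
  path,
  """{"Roll": 1, "Name": "Ajay", "DOB": "01/01/2010", "Submit Time": "2025/02/17 12:30:45"}"""
    .getBytes("UTF-8")
)

// Explicit schema: DOB and Submit Time are declared as date and timestamp columns
val schema = StructType(Seq(
  StructField("Roll", LongType),
  StructField("Name", StringType),
  StructField("DOB", DateType),
  StructField("Submit Time", TimestampType)
))

val typed = spark.read
  .schema(schema)
  .option("dateFormat", "dd/MM/yyyy")               // pattern used for DateType fields
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss") // pattern used for TimestampType fields
  .json(path.toUri.toString)

typed.printSchema()
typed.show()
```

With this approach the patterns apply to every date or timestamp field in the schema, so it suits files where all such fields share one format. The sections that follow take the other route and convert individual columns after the read.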
2. Converting String to Date Format
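To convert the DOB field to a proper Date type, we can use the to_date() function from org.apache.spark.sql.functions. Its second argument is a pattern describing the input string; here dd/MM/yyyy means day, month, and four-digit year separated by slashes.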
import org.apache.spark.sql.functions._
val submitData = spark.read
.option("multiLine", "true")
.option("inferSchema", "true")
.json("jsonFiles/studentSubmitData.json")
.withColumn("DOB", to_date(col("DOB"), "dd/MM/yyyy"))
submitData.show()
submitData.printSchema()
Output
+---+----------+-----------+--------+----+-------------------+
|Age|       DOB|Final Marks|    Name|Roll|        Submit Time|
+---+----------+-----------+--------+----+-------------------+
| 14|2010-01-01|       92.7|    Ajay|   1|2025/02/17 12:30:45|
| 15|2009-06-04|       88.5|Bharghav|   2|2025/02/17 12:35:30|
| 13|2010-12-12|       75.8| Chaitra|   3|2025/02/17 12:45:10|
| 14|2010-08-25|       82.3|   Kamal|   4|2025/02/17 12:40:05|
| 13|2009-04-14|       90.6|  Sohaib|   5|2025/02/17 12:55:20|
+---+----------+-----------+--------+----+-------------------+
root
|-- Age: long (nullable = true)
|-- DOB: date (nullable = true)
|-- Final Marks: double (nullable = true)
|-- Name: string (nullable = true)
|-- Roll: long (nullable = true)
|-- Submit Time: string (nullable = true)
Now, DOB is correctly converted to the Date data type.
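To try the conversion without the JSON file, the following minimal sketch builds a two-row DataFrame in memory from the first two sample records and applies the same to_date() pattern. The local session, the app name, and the extra BirthYear column are assumptions added for illustration; BirthYear simply shows that date functions such as year() work once the column is a real Date.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date, year}

// Assumption: a local session; inside spark-shell, getOrCreate() reuses the existing one
val spark = SparkSession.builder()
  .appName("dob-parsing-sketch") // hypothetical app name
  .master("local[*]")
  .getOrCreate()

// Two records taken from the sample file, with DOB still stored as a string
val students = spark.createDataFrame(Seq(
  ("Ajay", "01/01/2010"),
  ("Bharghav", "04/06/2009")
)).toDF("Name", "DOB")

val parsed = students
  .withColumn("DOB", to_date(col("DOB"), "dd/MM/yyyy")) // string -> date
  .withColumn("BirthYear", year(col("DOB")))            // hypothetical derived column

parsed.printSchema()
parsed.show()
```

If the pattern does not match the data, for example MM/dd/yyyy against a day-first value such as 25/08/2010, the conversion does not produce a valid date, so it is worth checking a few converted rows against the source.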
3. Parsing JSON files with timestamp records
To convert the Submit Time field into a Timestamp, we use the to_timestamp() function. This function also requires the correct format string.
val submitTime = spark.read
.option("multiLine", "true")
.option("inferSchema", "true")
.json("jsonFiles/studentSubmitData.json")
.withColumn("DOB", to_date(col("DOB"), "dd/MM/yyyy"))
.withColumn("Submit Time", to_timestamp(col("Submit Time"), "yyyy/MM/dd HH:mm:ss"))
submitTime.show()
submitTime.printSchema()
Output
+---+----------+-----------+--------+----+-------------------+
|Age|       DOB|Final Marks|    Name|Roll|        Submit Time|
+---+----------+-----------+--------+----+-------------------+
| 14|2010-01-01|       92.7|    Ajay|   1|2025-02-17 12:30:45|
| 15|2009-06-04|       88.5|Bharghav|   2|2025-02-17 12:35:30|
| 13|2010-12-12|       75.8| Chaitra|   3|2025-02-17 12:45:10|
| 14|2010-08-25|       82.3|   Kamal|   4|2025-02-17 12:40:05|
| 13|2009-04-14|       90.6|  Sohaib|   5|2025-02-17 12:55:20|
+---+----------+-----------+--------+----+-------------------+
root
|-- Age: long (nullable = true)
|-- DOB: date (nullable = true)
|-- Final Marks: double (nullable = true)
|-- Name: string (nullable = true)
|-- Roll: long (nullable = true)
|-- Submit Time: timestamp (nullable = true)
The Submit Time field is now properly formatted as a Timestamp.
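Once the column is a real Timestamp, it can be sorted chronologically and split into parts. The sketch below illustrates this on two in-memory rows taken from the sample data. The local session, the app name, and the SubmitHour and SubmitMinute columns are assumptions for illustration, not part of the original example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, hour, minute, to_timestamp}

// Assumption: a local session; inside spark-shell, getOrCreate() reuses the existing one
val spark = SparkSession.builder()
  .appName("submit-time-sketch") // hypothetical app name
  .master("local[*]")
  .getOrCreate()

// Two records taken from the sample file, with Submit Time still stored as a string
val submissions = spark.createDataFrame(Seq(
  ("Ajay", "2025/02/17 12:30:45"),
  ("Bharghav", "2025/02/17 12:35:30")
)).toDF("Name", "Submit Time")

// Pattern letters are case-sensitive: MM = month, mm = minute, HH = hour on a 24-hour clock
val withFields = submissions
  .withColumn("Submit Time", to_timestamp(col("Submit Time"), "yyyy/MM/dd HH:mm:ss"))
  .withColumn("SubmitHour", hour(col("Submit Time")))     // hypothetical derived column
  .withColumn("SubmitMinute", minute(col("Submit Time"))) // hypothetical derived column
  .orderBy(col("Submit Time"))                            // chronological order, earliest first

withFields.printSchema()
withFields.show()
```

Sorting the original string column would only work here because the yyyy/MM/dd layout happens to sort lexicographically; with a day-first layout the string order and the chronological order would differ, which is one reason to convert before analysis.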
Summary
In this article, we learned:
- JSON files store date and time values as strings by default.
- We can convert string columns to proper Date and Timestamp types using the to_date() and to_timestamp() functions in Spark.
- Correct parsing of date and time fields is important for accurate analysis, especially when working with non-standard formats.