CSV File Schema Handling
Last updated on: 2025-05-30
Viewing and Defining the Schema of a CSV File in Spark
Apache Spark provides powerful features to infer or define schemas when working with CSV files. This article demonstrates:
- How to automatically infer a schema
- How to define a custom schema
- How to enforce a schema using `.option("enforceSchema", "true/false")`

For any given CSV file, Spark can automatically infer the schema when the `inferSchema` option is enabled.
Example CSV File – students.csv
```
Roll,Name,Marks
1,Ajay,55
2,Bharghav,63
3,Chaitra,60
4,Kamal,75
5,Sohaib,70
```
Automatically Inferring Schema
By using the `inferSchema` option, Spark can detect the data types based on the contents of the file.
```scala
val df = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv("csvFiles/students.csv")

df.show()
df.printSchema()
```
Output
```
+----+--------+-----+
|Roll|    Name|Marks|
+----+--------+-----+
|   1|    Ajay|   55|
|   2|Bharghav|   63|
|   3| Chaitra|   60|
|   4|   Kamal|   75|
|   5|  Sohaib|   70|
+----+--------+-----+

root
 |-- Roll: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Marks: integer (nullable = true)
```
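For contrast, `inferSchema` defaults to `false`; if it is omitted, Spark reads every column as `StringType` (a minimal sketch, reusing the `spark` session and file from the example above):

```scala
// Sketch: same read without inferSchema (which defaults to false).
val dfNoInfer = spark.read
  .option("header", "true")
  .csv("csvFiles/students.csv")

// Every field comes back as string when no schema is inferred or supplied.
dfNoInfer.printSchema()
```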
Reading CSV with a Custom Schema
You can explicitly define the schema using Spark’s `StructType`.
```scala
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

val defineSchema = StructType(Seq(
  StructField("Roll", IntegerType, true),
  StructField("Name", StringType, true),
  StructField("Marks", DoubleType, true)
))

val dfSchema = spark.read
  .schema(defineSchema)
  .option("header", "true")
  .csv("csvFiles/students.csv")

dfSchema.show()
dfSchema.printSchema()
```
Output
```
+----+--------+------+
|Roll|    Name| Marks|
+----+--------+------+
|   1|    Ajay|  55.0|
|   2|Bharghav|  63.0|
|   3| Chaitra|  60.0|
|   4|   Kamal|  75.0|
|   5|  Sohaib|  70.0|
+----+--------+------+

root
 |-- Roll: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Marks: double (nullable = true)
```
Note: Be careful with leading/trailing whitespace in `StructField` names — a field defined as `" Marks"` is a different column name than `Marks`.
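As an alternative to building a `StructType` by hand, `DataFrameReader.schema` also accepts a DDL-formatted string, which avoids the type imports (a sketch reusing the same session and file):

```scala
// Same schema expressed as a DDL string instead of a StructType.
val dfDDL = spark.read
  .schema("Roll INT, Name STRING, Marks DOUBLE")
  .option("header", "true")
  .csv("csvFiles/students.csv")

// Yields the same types as the StructType version: int, string, double.
dfDDL.printSchema()
```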
Understanding the enforceSchema Option
The `enforceSchema` option (default `true`) controls whether Spark forcibly applies the specified or inferred schema to the file, ignoring the column names in the CSV header. When it is set to `false`, Spark instead validates the schema's field names against the header (when `header` is `true`). In the example below no schema is supplied and `inferSchema` is not enabled, so every column defaults to `StringType` — that is the no-schema default, not an effect of `enforceSchema` itself.
```scala
val dfEnforceSchema = spark.read
  .option("enforceSchema", "false")
  .option("header", "true")
  .csv("csvFiles/students.csv")

dfEnforceSchema.show()
dfEnforceSchema.printSchema()
```
Output
```
+----+--------+-----+
|Roll|    Name|Marks|
+----+--------+-----+
|   1|    Ajay|   55|
|   2|Bharghav|   63|
|   3| Chaitra|   60|
|   4|   Kamal|   75|
|   5|  Sohaib|   70|
+----+--------+-----+

root
 |-- Roll: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Marks: string (nullable = true)
```
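To see what `enforceSchema = false` actually checks, supply a schema whose field names differ from the header. With `header` set to `true`, Spark validates the schema's field names against the header row and logs a warning on a mismatch (a sketch under the same session; the exact log output may vary by Spark version):

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// "RollNo" deliberately differs from the CSV header "Roll".
val mismatched = StructType(Seq(
  StructField("RollNo", IntegerType, true),
  StructField("Name", StringType, true),
  StructField("Marks", IntegerType, true)
))

// With enforceSchema=false, Spark compares these field names against the
// header row; the data is still read using the schema's names and types.
val dfChecked = spark.read
  .schema(mismatched)
  .option("enforceSchema", "false")
  .option("header", "true")
  .csv("csvFiles/students.csv")

dfChecked.printSchema()
```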
When `enforceSchema` is set to `true` (the default), Spark applies the defined schema to the CSV file and ignores the column names in the header.
```scala
import org.apache.spark.sql.types.{FloatType, IntegerType, StringType, StructField, StructType}

val enfSchema = StructType(Seq(
  StructField("Roll", IntegerType, true),
  StructField("Name", StringType, true),
  StructField("Marks", FloatType, true)
))

val df2 = spark.read
  .option("enforceSchema", "true")
  .schema(enfSchema)
  .option("header", "true")
  .csv("csvFiles/students.csv")

df2.show()
df2.printSchema()
```
Output
```
+----+--------+-----+
|Roll|    Name|Marks|
+----+--------+-----+
|   1|    Ajay| 55.0|
|   2|Bharghav| 63.0|
|   3| Chaitra| 60.0|
|   4|   Kamal| 75.0|
|   5|  Sohaib| 70.0|
+----+--------+-----+

root
 |-- Roll: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Marks: float (nullable = true)
```
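Enforcing a schema also interacts with the `mode` option, which decides what happens when a value cannot be parsed into the declared type: the default `PERMISSIVE` mode turns it into `null`, while `FAILFAST` throws on the first malformed record (a sketch reusing `enfSchema` from above):

```scala
// Sketch: fail fast on values that do not match the enforced schema.
val dfStrict = spark.read
  .option("enforceSchema", "true")
  .option("mode", "FAILFAST")
  .schema(enfSchema)
  .option("header", "true")
  .csv("csvFiles/students.csv")

// students.csv is clean, so this read succeeds; a non-numeric Marks value
// would raise an exception at read time instead of becoming null.
dfStrict.show()
```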
Summary
In this article, you learned how to:
- Use `inferSchema` to let Spark automatically detect column data types.
- Define a custom schema using `StructType` and apply it with `.schema()`.
- Use `enforceSchema` to control whether Spark strictly applies the provided schema: with `enforceSchema = true` (the default), Spark applies the specified data types and ignores the header's column names; with `enforceSchema = false`, Spark validates the schema's field names against the CSV header.