CSV File Schema Handling

Last updated on: 2025-05-30

Viewing and Defining the Schema of a CSV File in Spark

Apache Spark provides powerful features to infer or define schemas when working with CSV files. This article demonstrates:

  • How to automatically infer schema

  • How to define a custom schema

  • How to enforce a schema using `.option("enforceSchema", "true/false")`

For any given CSV file, Spark can infer the schema automatically with the help of the inferSchema flag.

Example CSV File: students.csv

Roll,Name,Marks
1,Ajay,55
2,Bharghav,63
3,Chaitra,60
4,Kamal,75
5,Sohaib,70

Automatically Inferring Schema

By using the inferSchema option, Spark can detect the data types based on the contents of the file.

val df = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv("csvFiles/students.csv")

df.show()
df.printSchema()

Output

+----+--------+-----+
|Roll|    Name|Marks|
+----+--------+-----+
|   1|    Ajay|   55|
|   2|Bharghav|   63|
|   3| Chaitra|   60|
|   4|   Kamal|   75|
|   5|  Sohaib|   70|
+----+--------+-----+

root
 |-- Roll: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Marks: integer (nullable = true)

Reading CSV with a Custom Schema

You can explicitly define the schema using Spark’s StructType.

import org.apache.spark.sql.types._

val defineSchema = StructType(Seq(
  StructField("Roll", IntegerType, true),
  StructField("Name", StringType, true),
  StructField(" Marks", DoubleType, true) // note the accidental leading space
))

val dfSchema = spark.read
  .schema(defineSchema)
  .option("header", "true")
  .csv("csvFiles/students.csv")

dfSchema.show()
dfSchema.printSchema()

Output

+----+--------+------+
|Roll|    Name| Marks|
+----+--------+------+
|   1|    Ajay|  55.0|
|   2|Bharghav|  63.0|
|   3| Chaitra|  60.0|
|   4|   Kamal|  75.0|
|   5|  Sohaib|  70.0|
+----+--------+------+

root
 |-- Roll: integer (nullable = true)
 |-- Name: string (nullable = true)
 |--  Marks: double (nullable = true)
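As a lighter-weight alternative to building a StructType by hand, Spark also accepts a schema written as a DDL-style string via the same .schema() method. A minimal sketch, assuming the same students.csv file (a local session is spun up here only to keep the snippet self-contained):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: passing the schema as a DDL string instead of a StructType.
val spark = SparkSession.builder().master("local[*]").appName("ddlSchema").getOrCreate()

val dfDdl = spark.read
  .schema("Roll INT, Name STRING, Marks DOUBLE") // "name TYPE" pairs, comma-separated
  .option("header", "true")
  .csv("csvFiles/students.csv")

dfDdl.printSchema()
```

The DDL form is harder to mistype than nested StructField calls, and it sidesteps whitespace mistakes like the " Marks" field above.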

! Be careful with leading/trailing whitespace in column names: the schema above defines " Marks" with a leading space, and that space is carried verbatim into the DataFrame column name.
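If a stray space does slip into a schema definition, the resulting column is awkward to reference later. One way to clean it up after reading, assuming the dfSchema DataFrame from the example above, is a sketch like:

```scala
// Sketch: renaming the accidentally space-prefixed column " Marks" to "Marks",
// assuming dfSchema was read with the custom schema defined above.
val dfClean = dfSchema.withColumnRenamed(" Marks", "Marks")

// A more general approach: trim whitespace from every column name at once.
val dfTrimmed = dfSchema.toDF(dfSchema.columns.map(_.trim): _*)
```

The toDF variant is handy when a file arrives with several padded headers and you do not want to rename columns one by one.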

Understanding enforceSchema

The enforceSchema option controls how Spark treats the CSV header when a schema is supplied: when it is true (the default), the specified or inferred schema is forcibly applied and the header names are ignored; when it is false, Spark validates the header against the schema's column names. In the example below, no schema is provided and inferSchema is left at its default (false), so every column is read as StringType regardless of content.

val dfEnforceSchema = spark.read
  .option("enforceSchema", "false")
  .option("header", "true")
  .csv("csvFiles/students.csv")

dfEnforceSchema.show()
dfEnforceSchema.printSchema()

Output

+----+--------+-----+
|Roll|    Name|Marks|
+----+--------+-----+
|   1|    Ajay|   55|
|   2|Bharghav|   63|
|   3| Chaitra|   60|
|   4|   Kamal|   75|
|   5|  Sohaib|   70|
+----+--------+-----+

root
 |-- Roll: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Marks: string (nullable = true)

If enforceSchema is set to true (the default), Spark forcibly applies the defined schema to the CSV file, matching columns by position and ignoring the header names.

import org.apache.spark.sql.types._

val enfSchema = StructType(Seq(
  StructField("Roll", IntegerType, true),
  StructField("Name", StringType, true),
  StructField("Marks", FloatType, true)
))

val df2 = spark.read
  .option("enforceSchema", "true")
  .schema(enfSchema)
  .option("header", "true")
  .csv("csvFiles/students.csv")

df2.show()
df2.printSchema()

Output

+----+--------+-----+
|Roll|    Name|Marks|
+----+--------+-----+
|   1|    Ajay| 55.0|
|   2|Bharghav| 63.0|
|   3| Chaitra| 60.0|
|   4|   Kamal| 75.0|
|   5|  Sohaib| 70.0|
+----+--------+-----+

root
 |-- Roll: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Marks: float (nullable = true)
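For completeness, a schema can also be combined with enforceSchema set to false. In that case Spark checks the CSV header against the schema's field names before applying the types; a sketch, assuming the same students.csv file (the exact failure behavior on a name mismatch may vary by Spark version):

```scala
// Sketch: with enforceSchema=false, header names are validated against the schema.
// Since the names here match the CSV header, the provided types are applied as usual.
val dfValidated = spark.read
  .option("enforceSchema", "false")
  .schema("Roll INT, Name STRING, Marks FLOAT") // names match Roll,Name,Marks
  .option("header", "true")
  .csv("csvFiles/students.csv")

dfValidated.printSchema()
```

If the schema's field names did not line up with the header (say, "Score" instead of "Marks"), Spark would flag the mismatch instead of silently applying the schema by position.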

Summary

In this article, you learned how to:

  • Use inferSchema to let Spark automatically detect column data types.

  • Define a custom schema using StructType and apply it with .schema().

  • Use enforceSchema to control whether Spark strictly applies the provided schema.

When enforceSchema = true (the default), Spark forcibly applies the specified data types to the CSV by position, ignoring the header names.

When enforceSchema = false, Spark validates the header against the schema's column names; and if no schema is provided and inferSchema is disabled, all columns fall back to StringType.
