Struct Data Type

Last updated on: 2025-05-30

StructType is a complex data type in Apache Spark used to store a group of columns as a single entity. Simply put, structs are like dataframes nested within dataframes. A StructType is a collection of StructField elements, each of which includes:

  • A name
  • A datatype
  • A nullable flag indicating whether the field allows null values

Creating a dataframe with StructType

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.col

val studentSchema = StructType(Array(
  StructField("ID", IntegerType, nullable = false),
  StructField("Name", StringType, nullable = false),
  StructField("Marks", IntegerType, nullable = false),
  StructField("Hobbies", StructType(Array(
    StructField("hobby1", StringType, nullable = false),
    StructField("hobby2", StringType, nullable = false)
  )), nullable = true)
))

val sampleData = Seq(
  Row(1, "Ajay", 55, Row("Singing", "Sudoku")),
  Row(2, "Bhargav", 63, Row("Dancing", "Painting")),
  Row(3, "Chaitra", 60, Row("Chess", "Long Walks")),
  Row(4, "Kamal", 75, Row("Reading Books", "Cooking")),
  Row(5, "Sohaib", 70, Row("Singing", "Cooking"))
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(sampleData), 
  studentSchema)

df.show(truncate = false)

Output

+---+-------+-----+------------------------+
|ID |Name   |Marks|Hobbies                 |
+---+-------+-----+------------------------+
|1  |Ajay   |55   |[Singing, Sudoku]       |
|2  |Bhargav|63   |[Dancing, Painting]     |
|3  |Chaitra|60   |[Chess, Long Walks]     |
|4  |Kamal  |75   |[Reading Books, Cooking]|
|5  |Sohaib |70   |[Singing, Cooking]      |
+---+-------+-----+------------------------+
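As an aside, the same schema can also be built incrementally with StructType's add() method, which some find more readable for deeply nested schemas. A sketch of the equivalent definition:

    import org.apache.spark.sql.types._

    // Inner struct for the two hobby fields
    val hobbiesSchema = new StructType()
      .add("hobby1", StringType, nullable = false)
      .add("hobby2", StringType, nullable = false)

    // Equivalent to the StructType(Array(...)) definition above
    val studentSchemaAlt = new StructType()
      .add("ID", IntegerType, nullable = false)
      .add("Name", StringType, nullable = false)
      .add("Marks", IntegerType, nullable = false)
      .add("Hobbies", hobbiesSchema, nullable = true)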

Access the nested fields of the dataframe

To access individual fields inside the Hobbies struct, it helps to first inspect the schema:

df.printSchema()  

Output

root
 |-- ID: integer (nullable = false)
 |-- Name: string (nullable = false)
 |-- Marks: integer (nullable = false)
 |-- Hobbies: struct (nullable = true)
 |    |-- hobby1: string (nullable = false)
 |    |-- hobby2: string (nullable = false)
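If you need the schema programmatically rather than printed, df.schema returns the StructType itself, and each element of it is a StructField. A rough sketch of walking the top-level fields:

    // df.schema is a StructType; iterate over its StructField elements
    df.schema.fields.foreach { field =>
      println(s"${field.name}: ${field.dataType.simpleString}, nullable = ${field.nullable}")
    }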

Accessing Fields Inside a Struct

Once you're familiar with the schema, you can extract individual nested fields like this:

val accessNestedFields = df.select(col("Name"),
  col("Hobbies.hobby1"))

accessNestedFields.show()

Output

+-------+-------------+
|   Name|       hobby1|
+-------+-------------+
|   Ajay|      Singing|
|Bhargav|      Dancing|
|Chaitra|        Chess|
|  Kamal|Reading Books|
| Sohaib|      Singing|
+-------+-------------+

You can apply the same logic to extract other nested fields.
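Equivalently, you can reach a nested field with getField() on the struct column, which is handy when the field name is held in a variable. This sketch selects the same hobby1 field:

    import org.apache.spark.sql.functions.col

    // Same result as col("Hobbies.hobby1"), written with getField()
    val viaGetField = df.select(col("Name"), col("Hobbies").getField("hobby1"))
    viaGetField.show()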

Renaming a nested column

val renameColumns = df.select(col("Name"), 
  col("Hobbies.hobby1")
    .alias("hobby")
)
  
renameColumns.show()

Output

+-------+-------------+
|   Name|        hobby|
+-------+-------------+
|   Ajay|      Singing|
|Bhargav|      Dancing|
|Chaitra|        Chess|
|  Kamal|Reading Books|
| Sohaib|      Singing|
+-------+-------------+
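Note that alias() only renames the extracted column; the field inside the Hobbies struct keeps its original name. If you want to rename fields while keeping the struct intact, one approach is to rebuild the struct with the struct() function (a sketch; the new field names "primary" and "secondary" here are arbitrary examples):

    import org.apache.spark.sql.functions.{col, struct}

    // Rebuild Hobbies with renamed inner fields
    val renamedStruct = df.withColumn("Hobbies",
      struct(
        col("Hobbies.hobby1").alias("primary"),
        col("Hobbies.hobby2").alias("secondary")
      ))

    renamedStruct.printSchema()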

Flattening Struct columns

You can flatten nested struct fields using selectExpr().

val flattenStruct = df.selectExpr("Name",
  "Hobbies.hobby1 as hobby1",
  "Hobbies.hobby2 as hobby2")
  
flattenStruct.show()

Output

+-------+-------------+----------+
|   Name|       hobby1|    hobby2|
+-------+-------------+----------+
|   Ajay|      Singing|    Sudoku|
|Bhargav|      Dancing|  Painting|
|Chaitra|        Chess|Long Walks|
|  Kamal|Reading Books|   Cooking|
| Sohaib|      Singing|   Cooking|
+-------+-------------+----------+
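When a struct has many fields, listing each one gets tedious. Spark also supports star expansion on struct columns, which flattens every field in one go:

    // "Hobbies.*" expands into one column per field of the struct
    val flattenAll = df.select("Name", "Hobbies.*")
    flattenAll.show()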

To read more about select() and selectExpr(), check out this article: Select vs SelectExpr.

Filtering dataframe rows based on nested field values

You can also apply filters to nested fields within a struct:

val filterColumns = df.filter(col("Hobbies.hobby1") === "Singing")

filterColumns.show(truncate = false)

Output

+---+------+-----+------------------+
| ID|  Name|Marks|           Hobbies|
+---+------+-----+------------------+
|  1|  Ajay|   55| [Singing, Sudoku]|
|  5|Sohaib|   70|[Singing, Cooking]|
+---+------+-----+------------------+
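Nested fields behave like ordinary columns in predicates, so they can be combined with boolean operators. As a sketch, this keeps students whose first hobby is singing or whose second hobby is cooking:

    import org.apache.spark.sql.functions.col

    // Combine predicates on different nested fields
    val singersOrCooks = df.filter(
      col("Hobbies.hobby1") === "Singing" || col("Hobbies.hobby2") === "Cooking")

    singersOrCooks.show(truncate = false)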

Summary

  • The StructType in Spark enables you to combine multiple fields into one structured column, making it easier to manage nested data.

  • Each nested field is defined using StructField, specifying its name, type, and nullability.

  • You can access, rename, flatten, and filter nested fields using Spark SQL functions like col(), alias(), and selectExpr().