Struct Data type
Last updated on: 2025-05-30
StructType is a complex data type in Apache Spark. It groups multiple fields into a single column. Put simply, a struct column holds a nested row, with its own named and typed fields, inside each row of the DataFrame.
A StructType is a collection of StructField elements, each of which has:
- A name
- A datatype
- A nullable flag indicating whether the field allows null values
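As a minimal sketch of these three properties, the snippet below builds a small schema with StructType's add() method and prints the name, data type, and nullable flag of each StructField. The schema and the name miniSchema are hypothetical and used only for illustration.

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Hypothetical two-field schema, built one field at a time with add().
val miniSchema = new StructType()
  .add("ID", IntegerType, nullable = false)
  .add("Name", StringType, nullable = true)

// Each element of the schema is a StructField exposing name, dataType and nullable.
miniSchema.fields.foreach { f =>
  println(s"${f.name}: ${f.dataType.simpleString}, nullable = ${f.nullable}")
}
```

Building the schema this way needs no SparkSession, which makes it convenient for checking a schema definition before any data is loaded.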
Creating a dataframe with StructType
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Schema with three top-level fields and one nested struct (Hobbies)
val studentSchema = StructType(Array(
  StructField("ID", IntegerType, nullable = false),
  StructField("Name", StringType, nullable = false),
  StructField("Marks", IntegerType, nullable = false),
  StructField("Hobbies", StructType(Array(
    StructField("hobby1", StringType, nullable = false),
    StructField("hobby2", StringType, nullable = false)
  )), nullable = true)
))

// Each nested Row supplies the values for the Hobbies struct
val sampleData = Seq(
  Row(1, "Ajay", 55, Row("Singing", "Sudoku")),
  Row(2, "Bhargav", 63, Row("Dancing", "Painting")),
  Row(3, "Chaitra", 60, Row("Chess", "Long Walks")),
  Row(4, "Kamal", 75, Row("Reading Books", "Cooking")),
  Row(5, "Sohaib", 70, Row("Singing", "Cooking"))
)

// `spark` is the SparkSession (predefined in spark-shell)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(sampleData),
  studentSchema)

df.show(truncate = false)
Output
+---+-------+-----+------------------------+
|ID |Name   |Marks|Hobbies                 |
+---+-------+-----+------------------------+
|1  |Ajay   |55   |[Singing, Sudoku]       |
|2  |Bhargav|63   |[Dancing, Painting]     |
|3  |Chaitra|60   |[Chess, Long Walks]     |
|4  |Kamal  |75   |[Reading Books, Cooking]|
|5  |Sohaib |70   |[Singing, Cooking]      |
+---+-------+-----+------------------------+
Note: Spark 2.x prints struct values in square brackets, as shown here. Spark 3.0 and later print them in curly braces, for example {Singing, Sudoku}.
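A struct column does not have to come from an explicit schema. As a rough sketch, assuming the data starts out flat, the struct() function from org.apache.spark.sql.functions can pack existing columns into one struct column. The flatDf data, the local SparkSession settings, and the variable names are assumptions made for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}

// Assumption: a local session for experimentation; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("struct-from-columns").getOrCreate()
import spark.implicits._

// Hypothetical flat data with one column per hobby.
val flatDf = Seq(
  (1, "Ajay", "Singing", "Sudoku"),
  (2, "Bhargav", "Dancing", "Painting")
).toDF("ID", "Name", "hobby1", "hobby2")

// struct() packs the listed columns into a single struct column.
val nestedDf = flatDf.select(
  col("ID"),
  col("Name"),
  struct(col("hobby1"), col("hobby2")).alias("Hobbies")
)

nestedDf.printSchema()
```

The nested field names are taken from the source columns, so the resulting Hobbies struct has the fields hobby1 and hobby2.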
Access the nested fields of the dataframe
Before accessing individual fields inside the Hobbies struct, check the schema to see how the fields are nested:
df.printSchema()
Output
root
 |-- ID: integer (nullable = false)
 |-- Name: string (nullable = false)
 |-- Marks: integer (nullable = false)
 |-- Hobbies: struct (nullable = true)
 |    |-- hobby1: string (nullable = false)
 |    |-- hobby2: string (nullable = false)
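The same information is available programmatically. The sketch below rebuilds a cut-down version of the schema so that it runs on its own, then looks up the Hobbies field and lists the names of its nested fields. With a real DataFrame you would start from df.schema instead; the names schema, hobbiesType, and nestedNames are illustrative.

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Same shape as the Hobbies column, rebuilt here so the snippet is self-contained.
val hobbiesType = StructType(Array(
  StructField("hobby1", StringType, nullable = false),
  StructField("hobby2", StringType, nullable = false)
))
val schema = StructType(Array(
  StructField("ID", IntegerType, nullable = false),
  StructField("Hobbies", hobbiesType, nullable = true)
))

// Look up a top-level field by name, then pattern-match on its data type.
val nestedNames: Seq[String] = schema("Hobbies").dataType match {
  case st: StructType => st.fieldNames.toSeq
  case _              => Seq.empty
}

println(nestedNames.mkString(", "))
```

This is handy when the nested field names are not known in advance, for example when writing a generic flattening helper.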
Accessing Fields Inside a Struct
Once you're familiar with the schema, you can extract individual nested fields like this:
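import org.apache.spark.sql.functions.col

// Dot notation (struct.field) selects a single nested field
val accessNestedFields = df.select(
  col("Name"),
  col("Hobbies.hobby1"))

accessNestedFields.show()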
Output
+-------+-------------+
|   Name|       hobby1|
+-------+-------------+
|   Ajay|      Singing|
|Bhargav|      Dancing|
|Chaitra|        Chess|
|  Kamal|Reading Books|
| Sohaib|      Singing|
+-------+-------------+
You can apply the same logic to extract other nested fields.
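For instance, the second hobby can be read either with dot notation or with Column.getField(). The following is a sketch on a reduced, hypothetical two-row dataset; the local SparkSession and the names people and secondHobby are assumptions for this example.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumption: a local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("struct-get-field").getOrCreate()

val schema = StructType(Array(
  StructField("Name", StringType, nullable = false),
  StructField("Hobbies", StructType(Array(
    StructField("hobby1", StringType, nullable = false),
    StructField("hobby2", StringType, nullable = false)
  )), nullable = true)
))
val rows = Seq(
  Row("Ajay", Row("Singing", "Sudoku")),
  Row("Kamal", Row("Reading Books", "Cooking"))
)
val people = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

// Dot notation and getField() address the same nested field.
val secondHobby = people.select(
  col("Name"),
  col("Hobbies.hobby2"),
  col("Hobbies").getField("hobby2").alias("hobby2_via_getField")
)

secondHobby.show(truncate = false)
```

getField() can be the more convenient form when the field name is held in a variable rather than written inline.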
Renaming a nested column
// alias() renames the extracted nested field in the result
val renameColumns = df.select(
  col("Name"),
  col("Hobbies.hobby1").alias("hobby"))

renameColumns.show()
Output
+-------+-------------+
|   Name|        hobby|
+-------+-------------+
|   Ajay|      Singing|
|Bhargav|      Dancing|
|Chaitra|        Chess|
|  Kamal|Reading Books|
| Sohaib|      Singing|
+-------+-------------+
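Aliasing renames the extracted column, not the field inside the struct. One possible way to rename the nested fields while keeping the struct is to rebuild it with struct() and aliased fields, as sketched below. The new field names primary and secondary, the sample data, and the local SparkSession are assumptions for illustration.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumption: a local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("struct-rename").getOrCreate()

val schema = StructType(Array(
  StructField("Name", StringType, nullable = false),
  StructField("Hobbies", StructType(Array(
    StructField("hobby1", StringType, nullable = false),
    StructField("hobby2", StringType, nullable = false)
  )), nullable = true)
))
val rows = Seq(
  Row("Ajay", Row("Singing", "Sudoku")),
  Row("Sohaib", Row("Singing", "Cooking"))
)
val people = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

// Rebuild the struct with renamed fields and overwrite the original column.
val renamed = people.withColumn(
  "Hobbies",
  struct(
    col("Hobbies.hobby1").alias("primary"),
    col("Hobbies.hobby2").alias("secondary")
  )
)

renamed.printSchema()
```

One caveat with this approach: struct() always produces a non-null struct, so a row whose Hobbies value was null ends up with a struct of null fields instead.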
Flattening Struct columns
You can flatten nested struct fields into top-level columns using selectExpr().
// Each SQL expression pulls one nested field up to the top level
val flattenStruct = df.selectExpr(
  "Name",
  "Hobbies.hobby1 as hobby1",
  "Hobbies.hobby2 as hobby2")

flattenStruct.show()
Output
+-------+-------------+----------+
|   Name|       hobby1|    hobby2|
+-------+-------------+----------+
|   Ajay|      Singing|    Sudoku|
|Bhargav|      Dancing|  Painting|
|Chaitra|        Chess|Long Walks|
|  Kamal|Reading Books|   Cooking|
| Sohaib|      Singing|   Cooking|
+-------+-------------+----------+
To read more about select() and selectExpr(), check out this article: Select vs SelectExpr.
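When a struct has many fields, listing each one by hand gets tedious. As a sketch of an alternative, the star syntax ("Hobbies.*") expands every field of the struct into its own top-level column. The two-row dataset, the local SparkSession, and the names people and flattened are assumptions for this example.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumption: a local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("struct-flatten").getOrCreate()

val schema = StructType(Array(
  StructField("Name", StringType, nullable = false),
  StructField("Hobbies", StructType(Array(
    StructField("hobby1", StringType, nullable = false),
    StructField("hobby2", StringType, nullable = false)
  )), nullable = true)
))
val rows = Seq(
  Row("Chaitra", Row("Chess", "Long Walks")),
  Row("Kamal", Row("Reading Books", "Cooking"))
)
val people = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

// "Hobbies.*" expands to one column per nested field, in schema order.
val flattened = people.select(col("Name"), col("Hobbies.*"))

flattened.show()
```

The expanded columns keep the nested field names, so watch for clashes if a top-level column already uses the same name.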
Filtering dataframe rows based on nested field values
You can also apply filters to nested fields within a struct:
val filterColumns = df.filter(col("Hobbies.hobby1") === "Singing")
filterColumns.show(truncate = false)
Output
+---+------+-----+------------------+
|ID |Name  |Marks|Hobbies           |
+---+------+-----+------------------+
|1  |Ajay  |55   |[Singing, Sudoku] |
|5  |Sohaib|70   |[Singing, Cooking]|
+---+------+-----+------------------+
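Because the Hobbies struct is nullable, it helps to know how filters behave when the struct is missing. The sketch below combines conditions on two nested fields and separately selects rows with a null struct. The student "Meera" with no Hobbies value, the local SparkSession, and the variable names are hypothetical.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumption: a local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("struct-filter").getOrCreate()

val schema = StructType(Array(
  StructField("Name", StringType, nullable = false),
  StructField("Hobbies", StructType(Array(
    StructField("hobby1", StringType, nullable = false),
    StructField("hobby2", StringType, nullable = false)
  )), nullable = true)
))
val rows = Seq(
  Row("Ajay", Row("Singing", "Sudoku")),
  Row("Sohaib", Row("Singing", "Cooking")),
  Row("Meera", null) // hypothetical student with no Hobbies struct
)
val people = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

// Combine predicates on two nested fields; a null struct yields null, so the row is dropped.
val singingCooks = people.filter(
  col("Hobbies.hobby1") === "Singing" && col("Hobbies.hobby2") === "Cooking"
)

// Select rows where the whole struct is missing.
val noHobbies = people.filter(col("Hobbies").isNull)

singingCooks.show(truncate = false)
noHobbies.show(truncate = false)
```

Reading a field from a null struct returns null rather than raising an error, which is why the comparison silently excludes such rows.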
Summary
- The StructType in Spark enables you to combine multiple fields into one structured column, making it easier to manage nested data.
- Each nested field is defined using StructField, specifying its name, type, and nullability.
- You can access, rename, flatten, and filter nested fields using Spark SQL functions like col(), alias(), and selectExpr().