Arrays and Map Types
Last updated on: 2025-05-30
What are Arrays?
An array is a collection of elements stored within a single row. Arrays are useful for storing multiple values that belong to a single entity.
To create a DataFrame with array-type columns, we use Scala's Array() function in the row data.
Here’s how it works:
import spark.implicits._                // enables .toDF on local Seqs
import org.apache.spark.sql.functions._ // col, size, explode, etc. used below

val df = Seq(
  (1, "Ajay", Array(55, 65, 72)),
  (2, "Bharghav", Array(63, 79, 71)),
  (3, "Chaitra", Array(60, 63, 75)),
  (4, "Kamal", Array(73, 75, 69)),
  (5, "Sohaib", Array(86, 74, 70))
).toDF("Roll", "Name", "Marks")
df.show()
Output
+----+--------+------------+
|Roll| Name| Marks|
+----+--------+------------+
| 1| Ajay|[55, 65, 72]|
| 2|Bharghav|[63, 79, 71]|
| 3| Chaitra|[60, 63, 75]|
| 4| Kamal|[73, 75, 69]|
| 5| Sohaib|[86, 74, 70]|
+----+--------+------------+
Note: Array indexing begins at 0, so for an array of size n, the index of the last element is n-1.
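Alongside the 0-based apply-style access shown below, Spark also ships element_at(), which counts from 1 and accepts negative indices to count from the end; a sketch against the df above (the SecondMark/LastMark column names are illustrative):

```scala
import org.apache.spark.sql.functions.{col, element_at}

// element_at() is 1-based, unlike the 0-based col("Marks")(1) style;
// a negative index counts back from the end of the array.
val secondAndLast = df.select(
  col("Name"),
  element_at(col("Marks"), 2).as("SecondMark"),
  element_at(col("Marks"), -1).as("LastMark")
)
secondAndLast.show()
```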
Extracting Array Elements from a Column
To retrieve a specific element from an array column, use .withColumn() and specify the index:
val arrayElements = df.withColumn("Marks",
  col("Marks")(1)
)
arrayElements.show()
Output
+----+--------+-----+
|Roll| Name|Marks|
+----+--------+-----+
| 1| Ajay| 65|
| 2|Bharghav| 79|
| 3| Chaitra| 63|
| 4| Kamal| 75|
| 5| Sohaib| 74|
+----+--------+-----+
Finding the Size of an Array in a DataFrame
You can determine the number of elements in an array using the size() function.
val arrSize = df.withColumn("Marks Length",
  size(col("Marks"))
)
arrSize.show()
Output
+----+--------+------------+------------+
|Roll| Name| Marks|Marks Length|
+----+--------+------------+------------+
| 1| Ajay|[55, 65, 72]| 3|
| 2|Bharghav|[63, 79, 71]| 3|
| 3| Chaitra|[60, 63, 75]| 3|
| 4| Kamal|[73, 75, 69]| 3|
| 5| Sohaib|[86, 74, 70]| 3|
+----+--------+------------+------------+
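Because size() yields an ordinary integer column, it can also drive a filter; a sketch (the threshold of 3 is arbitrary):

```scala
import org.apache.spark.sql.functions.{col, size}

// Keep only the rows whose Marks array holds at least three entries.
// Depending on your Spark version and spark.sql.legacy.sizeOfNull,
// size(null) is either -1 or null, so null arrays are filtered out too.
val fullRecords = df.filter(size(col("Marks")) >= 3)
fullRecords.show()
```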
Checking if an Array Contains a Specific Value
The array_contains() function can be used to check for a specific element in an array:
// returns true if the array contains marks '63'
val hasMarks = df.withColumn("Has Marks",
  array_contains(col("Marks"), 63)
)
hasMarks.show()
Output
+----+--------+------------+---------+
|Roll| Name| Marks|Has Marks|
+----+--------+------------+---------+
| 1| Ajay|[55, 65, 72]| false|
| 2|Bharghav|[63, 79, 71]| true|
| 3| Chaitra|[60, 63, 75]| true|
| 4| Kamal|[73, 75, 69]| false|
| 5| Sohaib|[86, 74, 70]| false|
+----+--------+------------+---------+
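array_contains() works equally well inside filter(), keeping only the matching rows instead of adding a boolean column; a sketch:

```scala
import org.apache.spark.sql.functions.{array_contains, col}

// Keep only the students who scored exactly 63 in some subject
// (Bharghav and Chaitra in the sample data above).
val scored63 = df.filter(array_contains(col("Marks"), 63))
scored63.show()
```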
Flattening Array Elements using explode()
Exploding an array transforms its elements into separate rows. To accomplish this, we use explode() inside the withColumn() function.
val arrExplode = df.withColumn("Marks",
  explode(col("Marks"))
)
arrExplode.show()
Output
+----+--------+-----+
|Roll| Name|Marks|
+----+--------+-----+
| 1| Ajay| 55|
| 1| Ajay| 65|
| 1| Ajay| 72|
| 2|Bharghav| 63|
| 2|Bharghav| 79|
| 2|Bharghav| 71|
| 3| Chaitra| 60|
| 3| Chaitra| 63|
| 3| Chaitra| 75|
| 4| Kamal| 73|
| 4| Kamal| 75|
| 4| Kamal| 69|
| 5| Sohaib| 86|
| 5| Sohaib| 74|
| 5| Sohaib| 70|
+----+--------+-----+
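If the position of each element matters, posexplode() generates a 0-based position column next to the value. Because it produces two columns, it must be used inside select() rather than withColumn(); the SubjectIdx/Mark names below are illustrative:

```scala
import org.apache.spark.sql.functions.{col, posexplode}

// posexplode() emits (position, element) pairs as two columns;
// Column.as(Seq(...)) names both generated columns at once.
val withPos = df.select(
  col("Roll"),
  col("Name"),
  posexplode(col("Marks")).as(Seq("SubjectIdx", "Mark"))
)
withPos.show()
```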
What are Maps in Spark?
Maps store key-value pairs, similar to dictionaries or hash tables in other languages.
Constructing a DataFrame with Map Type
To create a DataFrame with map-type rows, use Scala's Map() function in the row data.
val mapDf = Seq(
  (1, Map("name" -> "Ajay", "Marks" -> "55")),
  (2, Map("name" -> "Bharghav", "Marks" -> "63")),
  (3, Map("name" -> "Chaitra", "Marks" -> "60")),
  (4, Map("name" -> "Kamal", "Marks" -> "75")),
  (5, Map("name" -> "Sohaib", "Marks" -> "70"))
).toDF("Id", "Details")
mapDf.show(truncate = false)
Output
+---+-------------------------------+
|Id |Details |
+---+-------------------------------+
|1 |{name -> Ajay, Marks -> 55} |
|2 |{name -> Bharghav, Marks -> 63}|
|3 |{name -> Chaitra, Marks -> 60} |
|4 |{name -> Kamal, Marks -> 75} |
|5 |{name -> Sohaib, Marks -> 70} |
+---+-------------------------------+
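Map columns need not be hard-coded in the source data; the map() function builds one from alternating key and value columns. A sketch that derives a similar Details column from the array-based df created earlier (all map values must share one type, hence the cast):

```scala
import org.apache.spark.sql.functions.{col, lit, map}

// map(k1, v1, k2, v2, ...) assembles a MapType column; keys here are
// string literals, values come from existing columns.
val detailsDf = df.select(
  col("Roll").as("Id"),
  map(
    lit("name"), col("Name"),
    lit("Marks"), col("Marks")(0).cast("string")
  ).as("Details")
)
detailsDf.show(truncate = false)
```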
Accessing Map Values in a DataFrame
val mapValues = mapDf
  .withColumn("Name", col("Details")("name"))
  .withColumn("Marks", col("Details")("Marks"))
mapValues.show(truncate = false)
Output
+---+-------------------------------+--------+-----+
|Id |Details                        |Name    |Marks|
+---+-------------------------------+--------+-----+
|1  |{name -> Ajay, Marks -> 55}    |Ajay    |55   |
|2  |{name -> Bharghav, Marks -> 63}|Bharghav|63   |
|3  |{name -> Chaitra, Marks -> 60} |Chaitra |60   |
|4  |{name -> Kamal, Marks -> 75}   |Kamal   |75   |
|5  |{name -> Sohaib, Marks -> 70}  |Sohaib  |70   |
+---+-------------------------------+--------+-----+
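Two equivalent spellings exist for the apply-style access above: Column.getItem() and element_at(), which works on maps as well as arrays. A missing key yields null rather than an error; the "grade" key below is deliberately absent to show this:

```scala
import org.apache.spark.sql.functions.{col, element_at}

val viaElementAt = mapDf
  .withColumn("Name", element_at(col("Details"), "name"))
  .withColumn("Grade", col("Details").getItem("grade")) // no such key -> null
viaElementAt.show(truncate = false)
```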
Retrieving Keys and Values from a Map
To list out the keys and values separately, use map_keys()
and map_values()
.
val keyValue = mapDf
  .withColumn("Keys", map_keys(col("Details")))
  .withColumn("Values", map_values(col("Details")))
keyValue.show(truncate = false)
Output
+---+-------------------------------+-------------+--------------+
|Id |Details                        |Keys         |Values        |
+---+-------------------------------+-------------+--------------+
|1  |{name -> Ajay, Marks -> 55}    |[name, Marks]|[Ajay, 55]    |
|2  |{name -> Bharghav, Marks -> 63}|[name, Marks]|[Bharghav, 63]|
|3  |{name -> Chaitra, Marks -> 60} |[name, Marks]|[Chaitra, 60] |
|4  |{name -> Kamal, Marks -> 75}   |[name, Marks]|[Kamal, 75]   |
|5  |{name -> Sohaib, Marks -> 70}  |[name, Marks]|[Sohaib, 70]  |
+---+-------------------------------+-------------+--------------+
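explode() accepts map columns too, turning each entry into its own row with separate key and value columns; like posexplode() on arrays, it has to appear in select() when it generates more than one column:

```scala
import org.apache.spark.sql.functions.{col, explode}

// Each map entry becomes a row; Spark names the generated
// columns 'key' and 'value' by default.
val entries = mapDf.select(col("Id"), explode(col("Details")))
entries.show()
```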
Summary
In this article, we covered:
- Arrays in Spark: structure, access, length, condition checks, and flattening.
- Maps in Spark: creation, element access, and splitting into keys and values.
Arrays and Maps are essential data structures in Spark for handling complex data within DataFrames, especially in big data processing tasks.