Arrays and Map Types

Last updated on: 2025-05-30

What are Arrays?

An array column stores an ordered collection of elements of the same type within a single row. Arrays are useful for storing multiple values that belong to a single entity.

To create a DataFrame with an ArrayType column, build each row's value with Scala's Array() constructor.

Here’s how it works:

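// spark-shell brings these imports into scope automatically;
// in a standalone application, `spark` is your SparkSession.
import spark.implicits._                 // enables .toDF on a Seq
import org.apache.spark.sql.functions._  // col, size, array_contains, explode, ...

val df = Seq(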
  (1, "Ajay", Array(55, 65, 72)),
  (2, "Bharghav", Array(63, 79, 71)),
  (3, "Chaitra", Array(60, 63, 75)),
  (4, "Kamal", Array(73, 75, 69)),
  (5, "Sohaib", Array(86, 74, 70))
).toDF("Roll", "Name", "Marks")

df.show()

Output

+----+--------+------------+
|Roll|    Name|       Marks|
+----+--------+------------+
|   1|    Ajay|[55, 65, 72]|
|   2|Bharghav|[63, 79, 71]|
|   3| Chaitra|[60, 63, 75]|
|   4|   Kamal|[73, 75, 69]|
|   5|  Sohaib|[86, 74, 70]|
+----+--------+------------+

Note: Array indexing begins at 0, so for an array of size n, the index of the last element is n-1.
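
To confirm that Spark treats Marks as an array column, you can inspect the schema. The following is a minimal, self-contained sketch. It assumes a local SparkSession (in spark-shell, `spark` already exists and the builder line can be dropped), and the app name is an arbitrary placeholder.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.ArrayType

// Assumption: local mode, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("ArraySchemaSketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "Ajay", Array(55, 65, 72)),
  (2, "Bharghav", Array(63, 79, 71))
).toDF("Roll", "Name", "Marks")

// Prints the column tree; Marks is listed as an array with an integer element type.
df.printSchema()

// The same check done programmatically.
val marksIsArray = df.schema("Marks").dataType.isInstanceOf[ArrayType]
println(marksIsArray)
```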

Extracting Array Elements from a Column

To retrieve a specific element from an array column, use .withColumn() and specify the index:

// col("Marks")(1) selects the element at index 1 (the second mark) of each array
val arrayElements = df.withColumn("Marks",
  col("Marks")(1)
)

arrayElements.show()

Output

+----+--------+-----+
|Roll|    Name|Marks|
+----+--------+-----+
|   1|    Ajay|   65|
|   2|Bharghav|   79|
|   3| Chaitra|   63|
|   4|   Kamal|   75|
|   5|  Sohaib|   74|
+----+--------+-----+
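
Besides the apply syntax, Spark offers getItem() and element_at() for reading array elements. The sketch below compares them on a smaller copy of the sample data. It assumes a local SparkSession, and the output column names (First, AlsoFirst, Last) are made up for this illustration. Note that element_at() counts from 1 rather than 0, and a negative index counts from the end.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, element_at}

// Assumption: local mode, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("ArrayAccessSketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "Ajay", Array(55, 65, 72)),
  (2, "Bharghav", Array(63, 79, 71)),
  (3, "Chaitra", Array(60, 63, 75))
).toDF("Roll", "Name", "Marks")

val picked = df.select(
  col("Roll"),
  col("Marks").getItem(0).as("First"),          // 0-based, same as col("Marks")(0)
  element_at(col("Marks"), 1).as("AlsoFirst"),  // element_at is 1-based
  element_at(col("Marks"), -1).as("Last")       // -1 refers to the last element
)

picked.show()

// Collect the results in Roll order so they can be inspected on the driver.
val rows = picked.orderBy("Roll").collect()
val firstMarks = rows.map(_.getInt(1)).toSeq
val lastMarks = rows.map(_.getInt(3)).toSeq
```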

Finding the Size of an Array in a dataframe

You can determine the number of elements in an array using the size() function.

val arrSize = df.withColumn("Marks Length",
  size(col("Marks"))
)

arrSize.show()

Output

+----+--------+------------+------------+
|Roll|    Name|       Marks|Marks Length|
+----+--------+------------+------------+
|   1|    Ajay|[55, 65, 72]|           3|
|   2|Bharghav|[63, 79, 71]|           3|
|   3| Chaitra|[60, 63, 75]|           3|
|   4|   Kamal|[73, 75, 69]|           3|
|   5|  Sohaib|[86, 74, 70]|           3|
+----+--------+------------+------------+
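
Every array in the sample data has three elements, so the effect of size() is easier to see on arrays of different lengths. The sketch below uses a small hypothetical dataset (the Id and Values columns are not part of the example above) and assumes a local SparkSession.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, size}

// Assumption: local mode, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("ArraySizeSketch").getOrCreate()
import spark.implicits._

// Hypothetical rows with arrays of length 3, 1 and 0.
val ragged = Seq(
  (1, Array(10, 20, 30)),
  (2, Array(40)),
  (3, Array.empty[Int])
).toDF("Id", "Values")

val withSize = ragged.withColumn("Length", size(col("Values")))
withSize.show()

// Lengths per row, in Id order.
val lengths = withSize.orderBy("Id").collect().map(_.getInt(2)).toSeq
```

An empty array has size 0, which makes size() convenient for filtering out rows that carry no values.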

Checking Whether an Array Contains a Specific Value

The array_contains() function can be used to check for a specific element in an array:

// true when the array contains the value 63
val hasMarks = df.withColumn("Has Marks",
  array_contains(col("Marks"), 63)
)

hasMarks.show()

Output

+----+--------+------------+---------+
|Roll|    Name|       Marks|Has Marks|
+----+--------+------------+---------+
|   1|    Ajay|[55, 65, 72]|    false|
|   2|Bharghav|[63, 79, 71]|     true|
|   3| Chaitra|[60, 63, 75]|     true|
|   4|   Kamal|[73, 75, 69]|    false|
|   5|  Sohaib|[86, 74, 70]|    false|
+----+--------+------------+---------+
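
Because array_contains() returns a boolean column, it can also be used directly as a filter condition. The sketch below keeps only the students who scored 63 in some subject; it assumes a local SparkSession and repeats the sample data so that it runs on its own.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_contains, col}

// Assumption: local mode, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("ArrayContainsSketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "Ajay", Array(55, 65, 72)),
  (2, "Bharghav", Array(63, 79, 71)),
  (3, "Chaitra", Array(60, 63, 75)),
  (4, "Kamal", Array(73, 75, 69)),
  (5, "Sohaib", Array(86, 74, 70))
).toDF("Roll", "Name", "Marks")

// Keep only the rows whose Marks array contains 63.
val with63 = df.filter(array_contains(col("Marks"), 63))
with63.show()

val names = with63.orderBy("Roll").collect().map(_.getString(1)).toSeq
```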

Flattening Array Elements using explode()

Exploding an array means transforming its elements into separate rows, one row per element. To accomplish this, we use explode() inside the withColumn() function.

val arrExplode = df.withColumn("Marks",
  explode(col("Marks"))
)

arrExplode.show()

Output

+----+--------+-----+
|Roll|    Name|Marks|
+----+--------+-----+
|   1|    Ajay|   55|
|   1|    Ajay|   65|
|   1|    Ajay|   72|
|   2|Bharghav|   63|
|   2|Bharghav|   79|
|   2|Bharghav|   71|
|   3| Chaitra|   60|
|   3| Chaitra|   63|
|   3| Chaitra|   75|
|   4|   Kamal|   73|
|   4|   Kamal|   75|
|   4|   Kamal|   69|
|   5|  Sohaib|   86|
|   5|  Sohaib|   74|
|   5|  Sohaib|   70|
+----+--------+-----+
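
One detail worth knowing is that explode() produces no rows for an empty (or null) array, so such records disappear from the result. The explode_outer() function keeps them and fills the exploded column with null. The sketch below illustrates the difference on a hypothetical two-row dataset (the "NoMarks" student is invented for this example) and assumes a local SparkSession.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, explode_outer}

// Assumption: local mode, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("ExplodeSketch").getOrCreate()
import spark.implicits._

val marks = Seq(
  (1, "Ajay", Array(55, 65)),
  (2, "NoMarks", Array.empty[Int])  // hypothetical student with no marks recorded
).toDF("Roll", "Name", "Marks")

// explode: one row per element; the empty array contributes nothing.
val exploded = marks.withColumn("Mark", explode(col("Marks")))
exploded.show()

// explode_outer: the empty array still yields one row, with Mark = null.
val explodedOuter = marks.withColumn("Mark", explode_outer(col("Marks")))
explodedOuter.show()

val explodedCount = exploded.count()
val outerCount = explodedOuter.count()
```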

What are Maps in Spark?

Maps store key-value pairs, similar to dictionaries or hash tables in other languages.

Construct a dataframe with Map Type

To create a DataFrame with a MapType column, build each row's value with Scala's Map() constructor.

val mapDf = Seq(
  (1, Map("name"->"Ajay", "Marks"->"55")),
  (2, Map("name"->"Bharghav", "Marks"->"63")),
  (3, Map("name"->"Chaitra", "Marks"->"60")),
  (4, Map("name"->"Kamal", "Marks"->"75")),
  (5, Map("name"->"Sohaib", "Marks"->"70"))
).toDF("Id", "Details")

mapDf.show(truncate = false)

Output

+---+-------------------------------+
|Id |Details                        |
+---+-------------------------------+
|1  |{name -> Ajay, Marks -> 55}    |
|2  |{name -> Bharghav, Marks -> 63}|
|3  |{name -> Chaitra, Marks -> 60} |
|4  |{name -> Kamal, Marks -> 75}   |
|5  |{name -> Sohaib, Marks -> 70}  |
+---+-------------------------------+
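
When the data already lives in ordinary columns, a map column can also be assembled with the map() function from org.apache.spark.sql.functions, which takes alternating key and value columns. The sketch below is one possible way to do this, assuming a local SparkSession and a hypothetical flat DataFrame named flat. The marks are cast to string so that all map values share one type, matching the example above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, map}

// Assumption: local mode, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("MapBuildSketch").getOrCreate()
import spark.implicits._

// Hypothetical flat input with one column per attribute.
val flat = Seq(
  (1, "Ajay", 55),
  (2, "Bharghav", 63)
).toDF("Id", "Name", "Marks")

// map(key1, value1, key2, value2, ...) builds a map column.
val mapDf = flat.select(
  col("Id"),
  map(
    lit("name"), col("Name"),
    lit("Marks"), col("Marks").cast("string")
  ).as("Details")
)

mapDf.show(truncate = false)

// Read the first row's map back on the driver.
val firstDetails = mapDf.orderBy("Id").first().getMap[String, String](1)
```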

Access Map values in a dataframe

// col("Details")("name") looks up the value stored under the key "name"
val mapValues = mapDf.withColumn("Name",
    col("Details")("name"))
  .withColumn("Marks",
    col("Details")("Marks"))

mapValues.show(truncate = false)

Output

+---+-------------------------------+--------+-----+
|Id |Details                        |Name    |Marks|
+---+-------------------------------+--------+-----+
|1  |{name -> Ajay, Marks -> 55}    |Ajay    |55   |
|2  |{name -> Bharghav, Marks -> 63}|Bharghav|63   |
|3  |{name -> Chaitra, Marks -> 60} |Chaitra |60   |
|4  |{name -> Kamal, Marks -> 75}   |Kamal   |75   |
|5  |{name -> Sohaib, Marks -> 70}  |Sohaib  |70   |
+---+-------------------------------+--------+-----+
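
Since every value in the Details map is a string, the extracted Marks column is a string as well. If you need to do arithmetic on it, one option is to cast it while extracting. The sketch below assumes a local SparkSession and uses a three-row copy of the sample data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

// Assumption: local mode, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("MapAccessSketch").getOrCreate()
import spark.implicits._

val mapDf = Seq(
  (1, Map("name" -> "Ajay", "Marks" -> "55")),
  (2, Map("name" -> "Bharghav", "Marks" -> "63")),
  (3, Map("name" -> "Chaitra", "Marks" -> "60"))
).toDF("Id", "Details")

val typed = mapDf
  .withColumn("Name", col("Details")("name"))
  .withColumn("Marks", col("Details")("Marks").cast("int"))  // string -> int

typed.printSchema()

// With an integer column, numeric aggregation works as expected.
val total = typed.agg(sum("Marks")).first().getLong(0)
println(total)
```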

Retrieving Keys and Values from a Map

To list out the keys and values separately, use map_keys() and map_values().

val keyValue = mapDf.withColumn("Keys", 
    map_keys(col("Details")))
  .withColumn("Values", 
    map_values(col("Details")))

keyValue.show(truncate = false)

Output

+---+-------------------------------+-------------+--------------+
|Id |Details                        |Keys         |Values        |
+---+-------------------------------+-------------+--------------+
|1  |{name -> Ajay, Marks -> 55}    |[name, Marks]|[Ajay, 55]    |
|2  |{name -> Bharghav, Marks -> 63}|[name, Marks]|[Bharghav, 63]|
|3  |{name -> Chaitra, Marks -> 60} |[name, Marks]|[Chaitra, 60] |
|4  |{name -> Kamal, Marks -> 75}   |[name, Marks]|[Kamal, 75]   |
|5  |{name -> Sohaib, Marks -> 70}  |[name, Marks]|[Sohaib, 70]  |
+---+-------------------------------+-------------+--------------+
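
The explode() function shown earlier for arrays also works on map columns, where it produces one row per entry with two default columns named key and value. The sketch below assumes a local SparkSession and uses a two-row copy of the sample data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Assumption: local mode, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("MapExplodeSketch").getOrCreate()
import spark.implicits._

val mapDf = Seq(
  (1, Map("name" -> "Ajay", "Marks" -> "55")),
  (2, Map("name" -> "Bharghav", "Marks" -> "63"))
).toDF("Id", "Details")

// Each map entry becomes its own row with `key` and `value` columns.
val entries = mapDf.select(col("Id"), explode(col("Details")))
entries.show()

val entryColumns = entries.columns.toSeq
val entryCount = entries.count()
```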

Summary

In this article, we covered:

  • Arrays in Spark: structure, access, length, condition checks, and flattening.
  • Maps in Spark: creation, element access, and splitting into keys and values.

Arrays and Maps are essential data structures in Spark for handling complex data within DataFrames, especially in big data processing tasks.
