Data Aggregations with Grouping Operations

Last updated on: 2025-05-30

Apache Spark offers powerful capabilities for data aggregation, simplifying complex data analysis and extraction. Two key features that enable advanced aggregation are the groupBy() operator and grouping sets, which Spark exposes through the rollup() and cube() operators.

In this article, we’ll explore these features using practical examples to demonstrate how they work and when to use each.

groupBy() operator

The groupBy() operator in Spark is used to group records based on one or more columns and apply aggregation functions on the grouped data.

If you’d like a refresher on various aggregation functions, check out our article on Aggregate Functions.

Example DataFrame

Let’s start with the following sample DataFrame:

+--------+-----------+----------+-----+
|Item no.|  Item Name|  Category|  MRP|
+--------+-----------+----------+-----+
|       1|Paper Clips|Stationery|   23|
|       2|     Butter|     Dairy|   57|
|       3|      Jeans|   Clothes|  799|
|       4|      Shirt|   Clothes|  570|
|       5|Butter Milk|     Dairy|   50|
|       6|        Bag|   Apparel|  455|
|       7|      Shoes|   Apparel|  901|
|       8|    Stapler|Stationery|   50|
|       9|       Pens|Stationery|  120|
+--------+-----------+----------+-----+
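If you’d like to follow along, the snippet below builds an equivalent DataFrame. The session setup and variable names are illustrative choices, not part of the original example:

import org.apache.spark.sql.SparkSession

// A local session for experimentation; adjust the master URL for your cluster.
val spark = SparkSession.builder()
  .appName("GroupingOperations")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Build the sample rows and name the columns to match the table above.
val df = Seq(
  (1, "Paper Clips", "Stationery", 23),
  (2, "Butter", "Dairy", 57),
  (3, "Jeans", "Clothes", 799),
  (4, "Shirt", "Clothes", 570),
  (5, "Butter Milk", "Dairy", 50),
  (6, "Bag", "Apparel", 455),
  (7, "Shoes", "Apparel", 901),
  (8, "Stapler", "Stationery", 50),
  (9, "Pens", "Stationery", 120)
).toDF("Item no.", "Item Name", "Category", "MRP")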

Aggregating by Category

To calculate the total MRP for each category:

import org.apache.spark.sql.functions.sum

val result = df.groupBy("Category")
  .agg(sum("MRP").alias("Category wise total"))

result.show()

Output

+----------+-------------------+
|  Category|Category wise total|
+----------+-------------------+
|   Apparel|               1356|
|Stationery|                193|
|     Dairy|                107|
|   Clothes|               1369|
+----------+-------------------+
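The same aggregation can also be expressed in Spark SQL by registering the DataFrame as a temporary view. The view name items is an illustrative choice:

// Register the DataFrame so it can be queried with SQL.
df.createOrReplaceTempView("items")

spark.sql(
  """SELECT Category, SUM(MRP) AS `Category wise total`
    |FROM items
    |GROUP BY Category""".stripMargin
).show()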

Counting Items per Category

You can also count the number of items in each category:

import org.apache.spark.sql.functions.count

val categoryCount = df.groupBy("Category")
  .agg(count("Category").alias("Total number of items"))

categoryCount.show(truncate = false)

Output

+----------+----------------------+
|Category  |Total number of items |
+----------+----------------------+
|Apparel   |2                     |
|Stationery|3                     |
|Dairy     |2                     |
|Clothes   |2                     |
+----------+----------------------+

You can similarly apply other aggregation functions like avg(), min(), and max() with groupBy(), as shown below.
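For instance, a single agg() call can combine several of these functions. The column aliases here are illustrative:

import org.apache.spark.sql.functions.{avg, max, min}

// Compute several statistics per category in one pass.
val categoryStats = df.groupBy("Category")
  .agg(
    avg("MRP").alias("Average MRP"),
    min("MRP").alias("Min MRP"),
    max("MRP").alias("Max MRP")
  )

categoryStats.show()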

Grouping Sets: rollup() and cube()

In SQL, GROUPING SETS allows performing multiple levels of aggregation in a single query. Spark provides the same capability through the rollup() and cube() operators, each of which computes a predefined set of grouping combinations in one pass.
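For reference, here is what the raw SQL form looks like against the items view registered earlier. This is a sketch that produces per-category subtotals plus a grand total:

spark.sql(
  """SELECT Category, SUM(MRP) AS `Total MRP`
    |FROM items
    |GROUP BY Category GROUPING SETS ((Category), ())""".stripMargin
).show()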

Let’s look at a more detailed DataFrame for demonstrating these operations:

+--------+-----------+----------+---+------------------+-----------------+
|Item no.|  Item Name|  Category|MRP|  Discounted Price|  Price After Tax|
+--------+-----------+----------+---+------------------+-----------------+
|       1|Paper Clips|Stationery| 23|              20.7|            24.84|
|       2|     Butter|     Dairy| 57|51.300000000000004|            61.56|
|       3|      Jeans|   Clothes|799|             719.1|           862.92|
|       4|      Shirt|   Clothes|570|             513.0|            615.6|
|       5|Butter Milk|     Dairy| 50|              45.0|             54.0|
|       6|        Bag|   Apparel|455|             409.5|            491.4|
|       7|      Shoes|   Apparel|901|             810.9|973.0799999999999|
|       8|    Stapler|Stationery| 50|              45.0|             54.0|
|       9|       Pens|Stationery|120|             108.0|            129.6|
+--------+-----------+----------+---+------------------+-----------------+
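The figures above imply a 10% discount off MRP and a 20% tax applied to the discounted price, so one way to derive this DataFrame from the earlier one is the sketch below. The rates and the variable name priceAfterTax are assumptions inferred from the output:

import org.apache.spark.sql.functions.col

val priceAfterTax = df
  .withColumn("Discounted Price", col("MRP") * 0.9)              // assumed 10% discount
  .withColumn("Price After Tax", col("Discounted Price") * 1.2)  // assumed 20% tax

priceAfterTax.show()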

rollup() operator

The rollup() function performs hierarchical aggregation: it computes subtotals at each level of the column hierarchy, working from left to right through the listed columns, and finishes with a grand total. This makes it ideal for naturally hierarchical breakdowns such as category/item or year/month/day.

Example: Aggregating with rollup()

import org.apache.spark.sql.functions.sum

val rollupDf = priceAfterTax.rollup("Category", "Item Name")
  .agg(
    sum("MRP").as("Total MRP"),
    sum("Discounted Price").as("Total Discounted Price"),
    sum("Price After Tax").as("Total Price After Tax")
  )
  .orderBy("Category", "Item Name")

rollupDf.show()

Output

+----------+-----------+---------+----------------------+---------------------+
|  Category|  Item Name|Total MRP|Total Discounted Price|Total Price After Tax|
+----------+-----------+---------+----------------------+---------------------+
|      null|       null|     3025|                2722.5|               3267.0|
|   Apparel|       null|     1356|                1220.4|              1464.48|
|   Apparel|        Bag|      455|                 409.5|                491.4|
|   Apparel|      Shoes|      901|                 810.9|    973.0799999999999|
|   Clothes|       null|     1369|                1232.1|              1478.52|
|   Clothes|      Jeans|      799|                 719.1|               862.92|
|   Clothes|      Shirt|      570|                 513.0|                615.6|
|     Dairy|       null|      107|     96.30000000000001|               115.56|
|     Dairy|     Butter|       57|    51.300000000000004|                61.56|
|     Dairy|Butter Milk|       50|                  45.0|                 54.0|
|Stationery|       null|      193|                 173.7|               208.44|
|Stationery|Paper Clips|       23|                  20.7|                24.84|
|Stationery|       Pens|      120|                 108.0|                129.6|
|Stationery|    Stapler|       50|                  45.0|                 54.0|
+----------+-----------+---------+----------------------+---------------------+

In the output, the per-category subtotals (rows where Item Name is null) and the grand total across all items (the row where both columns are null) appear alongside the per-item rows in a single table.
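One caveat: the nulls that mark subtotal rows are indistinguishable from genuine null values in the data. Spark’s grouping() function (and its companion grouping_id()) flags which rows are subtotals. Here is a minimal sketch:

import org.apache.spark.sql.functions.{grouping, sum}

val rollupFlagged = priceAfterTax.rollup("Category", "Item Name")
  .agg(
    sum("MRP").as("Total MRP"),
    grouping("Item Name").as("Item rolled up")  // 1 when "Item Name" was aggregated away, 0 otherwise
  )

rollupFlagged.show()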

cube() operator

Unlike rollup(), the cube() function computes all possible combinations of the specified grouping columns. This is helpful for exploring the data from every aggregation perspective.

Example: Aggregating with cube()

val cubeDf = priceAfterTax.cube("Category", "Item Name")
  .agg(
    sum("MRP").as("Total MRP"),
    sum("Discounted Price").as("Total Discounted Price"),
    sum("Price After Tax").as("Total Price After Tax")
  )
  .orderBy("Category", "Item Name")

cubeDf.show()

cubeDf.show()

Output

+----------+-----------+---------+----------------------+---------------------+
|  Category|  Item Name|Total MRP|Total Discounted Price|Total Price After Tax|
+----------+-----------+---------+----------------------+---------------------+
|      null|       null|     3025|                2722.5|               3267.0|
|      null|        Bag|      455|                 409.5|                491.4|
|      null|     Butter|       57|    51.300000000000004|                61.56|
|      null|Butter Milk|       50|                  45.0|                 54.0|
|      null|      Jeans|      799|                 719.1|               862.92|
|      null|Paper Clips|       23|                  20.7|                24.84|
|      null|       Pens|      120|                 108.0|                129.6|
|      null|      Shirt|      570|                 513.0|                615.6|
|      null|      Shoes|      901|                 810.9|    973.0799999999999|
|      null|    Stapler|       50|                  45.0|                 54.0|
|   Apparel|       null|     1356|                1220.4|              1464.48|
|   Apparel|        Bag|      455|                 409.5|                491.4|
|   Apparel|      Shoes|      901|                 810.9|    973.0799999999999|
|   Clothes|       null|     1369|                1232.1|              1478.52|
|   Clothes|      Jeans|      799|                 719.1|               862.92|
|   Clothes|      Shirt|      570|                 513.0|                615.6|
|     Dairy|       null|      107|     96.30000000000001|               115.56|
|     Dairy|     Butter|       57|    51.300000000000004|                61.56|
|     Dairy|Butter Milk|       50|                  45.0|                 54.0|
|Stationery|       null|      193|                 173.7|               208.44|
+----------+-----------+---------+----------------------+---------------------+

Note that show() displays at most 20 rows by default; the full cube result has 23 rows, so the three per-item Stationery rows are truncated above (use cubeDf.show(23) to see them all). With cube(), Spark computes aggregations:

  • For every unique item

  • For every category

  • For all combinations of category and item

  • And the overall total

Note: While cube() provides the most exhaustive aggregation, it is also the most computationally intensive, especially for large datasets: with k grouping columns, cube() evaluates all 2^k grouping combinations, whereas rollup() evaluates only k + 1. For large-scale data, prefer groupBy() or rollup() unless you genuinely need every combination.

Summary

In this article, we covered:

  • How to use the groupBy() operator with aggregation functions.

  • The concept of grouping sets using Spark’s rollup() and cube() operators.

  • Practical use cases demonstrating subtotal and multidimensional aggregations.

  • Performance considerations for large datasets.

These aggregation tools in Spark make it easier to derive insights from complex datasets efficiently and effectively.
