Data Aggregations with Grouping Operations
Apache Spark offers powerful capabilities for data aggregation, simplifying complex data analysis and extraction. Two key features that enable advanced aggregation are the groupBy() operator and grouping sets.
In this article, we'll explore these features using practical examples to demonstrate how they work and when to use each.
groupBy() operator
The groupBy() operator in Spark groups records by one or more columns and applies aggregation functions to the grouped data.
If you’d like a refresher on various aggregation functions, check out our article on Aggregate Functions.
Example DataFrame
Let's start with the following sample DataFrame:
+--------+-----------+----------+-----+
|Item no.| Item Name| Category| MRP|
+--------+-----------+----------+-----+
| 1|Paper Clips|Stationery| 23|
| 2| Butter| Dairy| 57|
| 3| Jeans| Clothes| 799|
| 4| Shirt| Clothes| 570|
| 5|Butter Milk| Dairy| 50|
| 6| Bag| Apparel| 455|
| 7| Shoes| Apparel| 901|
| 8| Stapler|Stationery| 50|
| 9| Pens|Stationery| 120|
+--------+-----------+----------+-----+
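If you want to follow along, here is one way this DataFrame could be constructed. This is a minimal sketch: it assumes a local SparkSession (the app name is arbitrary) and simply hard-codes the rows shown above.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("GroupingOperations") // illustrative app name
  .master("local[*]")            // assumes a local run
  .getOrCreate()
import spark.implicits._

// Hard-code the sample rows from the table above
val df = Seq(
  (1, "Paper Clips", "Stationery", 23),
  (2, "Butter", "Dairy", 57),
  (3, "Jeans", "Clothes", 799),
  (4, "Shirt", "Clothes", 570),
  (5, "Butter Milk", "Dairy", 50),
  (6, "Bag", "Apparel", 455),
  (7, "Shoes", "Apparel", 901),
  (8, "Stapler", "Stationery", 50),
  (9, "Pens", "Stationery", 120)
).toDF("Item no.", "Item Name", "Category", "MRP")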
Aggregating by Category
To calculate the total MRP for each category:
val result = df.groupBy("Category")
  .agg(sum("MRP").alias("Category wise total"))

result.show()
Output
+----------+-------------------+
| Category|Category wise total|
+----------+-------------------+
| Apparel| 1356|
|Stationery| 193|
| Dairy| 107|
| Clothes| 1369|
+----------+-------------------+
Counting Items per Category
You can also count the number of items in each category:
val categoryCount = df.groupBy("Category")
  .agg(count("Category").alias("Total number of items"))

categoryCount.show(truncate = false)
Output
+----------+----------------------+
|Category |Total number of items |
+----------+----------------------+
|Apparel |2 |
|Stationery|3 |
|Dairy |2 |
|Clothes |2 |
+----------+----------------------+
You can similarly apply other aggregation functions like avg()
, min()
, and max()
with groupBy()
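For instance, here is a sketch that combines several of these in a single agg() call (the aliases are illustrative):
val categoryStats = df.groupBy("Category")
  .agg(
    avg("MRP").alias("Average MRP"),
    min("MRP").alias("Lowest MRP"),
    max("MRP").alias("Highest MRP")
  )

categoryStats.show(truncate = false)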
Grouping Sets: rollup() and cube()
In SQL, GROUPING SETS allows performing multiple levels of aggregation in a single query. Spark provides the same capability through the rollup() and cube() operators.
Let’s look at a more detailed DataFrame for demonstrating these operations:
+--------+-----------+----------+---+------------------+-----------------+
|Item no.| Item Name| Category|MRP| Discounted Price| Price After Tax|
+--------+-----------+----------+---+------------------+-----------------+
| 1|Paper Clips|Stationery| 23| 20.7| 24.84|
| 2| Butter| Dairy| 57|51.300000000000004| 61.56|
| 3| Jeans| Clothes|799| 719.1| 862.92|
| 4| Shirt| Clothes|570| 513.0| 615.6|
| 5|Butter Milk| Dairy| 50| 45.0| 54.0|
| 6| Bag| Apparel|455| 409.5| 491.4|
| 7| Shoes| Apparel|901| 810.9|973.0799999999999|
| 8| Stapler|Stationery| 50| 45.0| 54.0|
| 9| Pens|Stationery|120| 108.0| 129.6|
+--------+-----------+----------+---+------------------+-----------------+
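The construction of this priceAfterTax DataFrame isn't shown, but the values above are consistent with a flat 10% discount on MRP and 20% tax on the discounted price. A sketch under that assumption:
// Assumed derivation: 10% discount, then 20% tax (matches the numbers above)
val priceAfterTax = df
  .withColumn("Discounted Price", col("MRP") * 0.9)
  .withColumn("Price After Tax", col("Discounted Price") * 1.2)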
rollup() operator
The rollup() operator performs hierarchical aggregation: it computes subtotals at each level of the specified column hierarchy, from left to right, plus a grand total. This makes it ideal for time-series or categorical analysis.
Example: Aggregating with rollup()
val rollupDf = priceAfterTax
  .rollup("Category", "Item Name")
  .agg(
    sum("MRP").as("Total MRP"),
    sum("Discounted Price").as("Total Discounted Price"),
    sum("Price After Tax").as("Total Price After Tax")
  )
  .orderBy("Category", "Item Name")

rollupDf.show()
Output
+----------+-----------+---------+----------------------+---------------------+
| Category| Item Name|Total MRP|Total Discounted Price|Total Price After Tax|
+----------+-----------+---------+----------------------+---------------------+
| null| null| 3025| 2722.5| 3267.0|
| Apparel| null| 1356| 1220.4| 1464.48|
| Apparel| Bag| 455| 409.5| 491.4|
| Apparel| Shoes| 901| 810.9| 973.0799999999999|
| Clothes| null| 1369| 1232.1| 1478.52|
| Clothes| Jeans| 799| 719.1| 862.92|
| Clothes| Shirt| 570| 513.0| 615.6|
| Dairy| null| 107| 96.30000000000001| 115.56|
| Dairy| Butter| 57| 51.300000000000004| 61.56|
| Dairy|Butter Milk| 50| 45.0| 54.0|
|Stationery| null| 193| 173.7| 208.44|
|Stationery|Paper Clips| 23| 20.7| 24.84|
|Stationery| Pens| 120| 108.0| 129.6|
|Stationery| Stapler| 50| 45.0| 54.0|
+----------+-----------+---------+----------------------+---------------------+
In the output, the null rows carry the subtotals: a row with a null Item Name holds the total for that category, and the row where both columns are null holds the grand total across all items, all in the same table.
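For reference, rollup("Category", "Item Name") is shorthand for SQL GROUPING SETS over ((Category, Item Name), (Category), ()). A sketch of the equivalent Spark SQL (the view name items is illustrative):
priceAfterTax.createOrReplaceTempView("items")

val rollupSql = spark.sql("""
  SELECT Category, `Item Name`, sum(MRP) AS `Total MRP`
  FROM items
  GROUP BY Category, `Item Name`
    GROUPING SETS ((Category, `Item Name`), (Category), ())
  ORDER BY Category, `Item Name`
""")

rollupSql.show()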
cube() operator
Unlike rollup(), the cube() operator computes aggregates for all possible combinations of the specified grouping columns. This is helpful for exploring the data from every aggregation perspective.
Example: Aggregating with cube()
val cubeDf = priceAfterTax
  .cube("Category", "Item Name")
  .agg(
    sum("MRP").as("Total MRP"),
    sum("Discounted Price").as("Total Discounted Price"),
    sum("Price After Tax").as("Total Price After Tax")
  )
  .orderBy("Category", "Item Name")

cubeDf.show()
Output
+----------+-----------+---------+----------------------+---------------------+
| Category| Item Name|Total MRP|Total Discounted Price|Total Price After Tax|
+----------+-----------+---------+----------------------+---------------------+
| null| null| 3025| 2722.5| 3267.0|
| null| Bag| 455| 409.5| 491.4|
| null| Butter| 57| 51.300000000000004| 61.56|
| null|Butter Milk| 50| 45.0| 54.0|
| null| Jeans| 799| 719.1| 862.92|
| null|Paper Clips| 23| 20.7| 24.84|
| null| Pens| 120| 108.0| 129.6|
| null| Shirt| 570| 513.0| 615.6|
| null| Shoes| 901| 810.9| 973.0799999999999|
| null| Stapler| 50| 45.0| 54.0|
| Apparel| null| 1356| 1220.4| 1464.48|
| Apparel| Bag| 455| 409.5| 491.4|
| Apparel| Shoes| 901| 810.9| 973.0799999999999|
| Clothes| null| 1369| 1232.1| 1478.52|
| Clothes| Jeans| 799| 719.1| 862.92|
| Clothes| Shirt| 570| 513.0| 615.6|
| Dairy| null| 107| 96.30000000000001| 115.56|
| Dairy| Butter| 57| 51.300000000000004| 61.56|
| Dairy|Butter Milk| 50| 45.0| 54.0|
|Stationery| null| 193| 173.7| 208.44|
+----------+-----------+---------+----------------------+---------------------+
With cube(), Spark computes aggregations:
- For every unique item
- For every category
- For every combination of category and item
- And the overall total
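One caveat: in the output, null marks a subtotal row, which becomes ambiguous if the grouping columns themselves contain nulls. Spark's grouping() and grouping_id() functions (from org.apache.spark.sql.functions) flag subtotal rows explicitly; a sketch:
// grouping_id() encodes which columns were rolled up in each row:
// 0 = detail row, 1 = Item Name rolled up (category subtotal),
// 2 = Category rolled up (per-item total), 3 = grand total
val cubeWithIds = priceAfterTax
  .cube("Category", "Item Name")
  .agg(sum("MRP").as("Total MRP"), grouping_id().as("level"))
  .orderBy("level")

cubeWithIds.show()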
Note: While cube() provides the most exhaustive aggregation, it is also the most computationally intensive, especially on large datasets. It is advisable to use groupBy() or rollup() for large-scale data, when you don't need every combination, to optimize performance and keep the output readable.
Summary
In this article, we covered:
- How to use the groupBy() operator with aggregation functions.
- The concept of grouping sets, using Spark's rollup() and cube() operators.
- Practical use cases demonstrating subtotal and multidimensional aggregations.
- Performance considerations for large datasets.
These aggregation tools in Spark make it easier to derive insights from complex datasets efficiently and effectively.