Aggregations

Last updated on: 2025-05-30

In the previous articles, we explored various mathematical functions in Spark, from the basics to advanced use cases, and how they're applied to real-time data. Now, we take our discussion to the next level by introducing aggregate functions and their practical utility in data analysis.

What are Aggregate Functions?

Aggregate functions help summarize large datasets by reducing data to a single value or set of summary values, making analysis more efficient and meaningful.

Common Spark Aggregate Functions

Here are some frequently used aggregate functions in Spark:

  • sum() - Computes the sum of all values in a column
  • avg() - Computes the average of values in a column
  • min() - Returns the minimum value of a column
  • max() - Returns the maximum value of a column
  • count() - Returns the number of non-null entries in a column
  • variance() - Calculates the sample variance of values in a column

We'll illustrate each of these using the sample DataFrame below:

+----+--------+-----------+
|Roll|    Name|Final Marks|
+----+--------+-----------+
|   1|    Ajay|        300|
|   2|Bharghav|        350|
|   3| Chaitra|        320|
|   4|   Kamal|        360|
|   5|  Sohaib|        450|
|   6|    Neha|        390|
|   7|  Ramesh|        410|
|   8|   Sneha|        280|
|   9|   Varun|        330|
|  10|   Pooja|        370|
+----+--------+-----------+

Sum of Column Values

To sum a column's values, use Spark's sum() function inside the .agg() method:

df.agg(sum("Final Marks").alias("Total Score")).show()

Output

+-----------+
|Total Score|
+-----------+
|       3560|
+-----------+

Average of Column Values

Use the avg() function inside the .agg() method to compute the mean.

df.agg(avg("Final Marks").alias("Average Score")).show()

Output

+-------------+
|Average Score|
+-------------+
|        356.0|
+-------------+

Minimum Value in a Column

Use Spark's min() function inside the .agg() method to find the minimum value in a given column. For example, to find the lowest mark:

df.agg(min("Final Marks").alias("Minimum Score")).show()

Output

+-------------+
|Minimum Score|
+-------------+
|          280|
+-------------+

Maximum Value in a Column

Use Spark's max() function inside the .agg() method to find the maximum value in a given column. For example, to find the highest mark:

df.agg(max("Final Marks").alias("Maximum Score")).show()

Output

+-------------+
|Maximum Score|
+-------------+
|          450|
+-------------+

Count of Records

The count() function returns the number of non-null values in a given column of a DataFrame.

df.agg(count("Name").alias("Total Students")).show()

Output

+--------------+
|Total Students|
+--------------+
|            10|
+--------------+

Variance of Values

Use the variance() function to measure the spread of marks:

df.agg(variance(col("Final Marks")).alias("Variance of Marks")).show()

Output

+------------------+
| Variance of Marks|
+------------------+
|2671.1111111111118|
+------------------+
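Spark's variance() computes the sample variance (it is an alias for var_samp(), dividing by n - 1 rather than n). As a sanity check, the figure above can be reproduced with plain Python:

```python
# Recompute the sample variance of Final Marks by hand.
marks = [300, 350, 320, 360, 450, 390, 410, 280, 330, 370]

mean = sum(marks) / len(marks)                       # 356.0
squared_devs = sum((m - mean) ** 2 for m in marks)   # 24040.0
sample_variance = squared_devs / (len(marks) - 1)    # divide by n - 1
print(sample_variance)  # ~2671.11, matching Spark's variance()
```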

Correlation Between Two Columns

Let's consider an updated DataFrame with an additional Unit Test column:

+----+--------+-----------+---------+
|Roll|    Name|Final Marks|Unit Test|
+----+--------+-----------+---------+
|   1|    Ajay|        300|       49|
|   2|Bharghav|        350|       65|
|   3| Chaitra|        320|       48|
|   4|   Kamal|        360|       70|
|   5|  Sohaib|        450|       28|
|   6|    Neha|        390|       49|
|   7|  Ramesh|        410|       77|
|   8|   Sneha|        280|       38|
|   9|   Varun|        330|       47|
|  10|   Pooja|        370|       99|
+----+--------+-----------+---------+

Use corr() to calculate the correlation between Final Marks and Unit Test scores:

updatedDf.agg(corr(col("Final Marks"), col("Unit Test")).alias("Correlation")).show()

Output

+-----------+
|Correlation|
+-----------+
|  0.1186461|
+-----------+

Interpretation: Correlation is a statistical measure of the linear relationship between two variables. Its value ranges from -1 to +1:

  • +1: Perfect positive correlation - the variables move in the same direction (when one increases, the other also increases).
  • 0: No correlation - the variables have no linear relationship with each other.
  • -1: Perfect negative correlation - the variables move in opposite directions (when one increases, the other decreases).

Here, a value of about 0.12 indicates only a weak positive relationship between Final Marks and Unit Test scores.
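corr() computes the Pearson correlation coefficient. As a sanity check, it can be recomputed by hand in plain Python from its definition (covariance over the product of the standard deviations):

```python
import math

final_marks = [300, 350, 320, 360, 450, 390, 410, 280, 330, 370]
unit_test = [49, 65, 48, 70, 28, 49, 77, 38, 47, 99]

mean_x = sum(final_marks) / len(final_marks)
mean_y = sum(unit_test) / len(unit_test)

# Pearson r = sum((x - mx)(y - my)) / sqrt(sum((x - mx)^2) * sum((y - my)^2))
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(final_marks, unit_test))
var_x = sum((x - mean_x) ** 2 for x in final_marks)
var_y = sum((y - mean_y) ** 2 for y in unit_test)

r = cov / math.sqrt(var_x * var_y)
print(r)  # ~0.1186
```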

Summary

In this article, we learned:

  • What aggregate functions are and their importance in summarizing datasets.

  • Key aggregate functions like sum(), avg(), min(), max(), count(), variance(), and corr().

  • How to apply these functions using Spark's .agg() method.

These functions are foundational tools for any data analyst or engineer working with Spark to derive insights from large datasets quickly and effectively.
