Ranking Window Functions
Last updated on: 2025-05-30
Window functions in Apache Spark are powerful tools for performing complex analytical operations across rows in a DataFrame. They allow calculations that depend on the values of surrounding rows, making them ideal for tasks such as ranking, running totals, and cumulative statistics. One of their core strengths lies in efficiently processing large-scale datasets.
To leverage these functions, you must first import the org.apache.spark.sql.expressions.Window class into your Spark session.
In this article, we’ll focus on ranking functions—a group of window functions commonly used to assign order or rank to rows based on specified criteria.
We will cover the following ranking functions:
- rank()
- dense_rank()
- percent_rank()
- row_number()
Consider the following DataFrame:
+---+--------+-----------+----------+-------------------+-----+
| ID| Name|Room Number| DOB| Submit Time|Marks|
+---+--------+-----------+----------+-------------------+-----+
| 1| Ajay| 10|2010-01-01|2025-02-17 12:30:45|92.75|
| 2|Bharghav| 20|2009-06-04|2025-02-17 08:15:30| 88.5|
| 3| Chaitra| 30|2010-12-12|2025-02-17 14:45:10| 75.8|
| 4| Kamal| 20|2010-08-25|2025-02-17 17:10:05| 82.3|
| 5| Sohaib| 30|2009-04-14|2025-02-17 09:55:20| 90.6|
| 6| Tanish| 20|2009-05-11|2025-02-17 09:45:30| 88.5|
| 7| Uday| 20|2009-09-06|2025-02-17 09:45:30| 92.3|
+---+--------+-----------+----------+-------------------+-----+
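The snippets below assume a DataFrame named df with this data. As a sketch, it could be built like this (assuming an active SparkSession named spark; the imports also cover the Window and ranking functions used throughout the article):

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Bring toDF into scope; `spark` is assumed to be an active SparkSession
import spark.implicits._

// Sample data matching the table above
val df = Seq(
  (1, "Ajay",     10, "2010-01-01", "2025-02-17 12:30:45", 92.75),
  (2, "Bharghav", 20, "2009-06-04", "2025-02-17 08:15:30", 88.5),
  (3, "Chaitra",  30, "2010-12-12", "2025-02-17 14:45:10", 75.8),
  (4, "Kamal",    20, "2010-08-25", "2025-02-17 17:10:05", 82.3),
  (5, "Sohaib",   30, "2009-04-14", "2025-02-17 09:55:20", 90.6),
  (6, "Tanish",   20, "2009-05-11", "2025-02-17 09:45:30", 88.5),
  (7, "Uday",     20, "2009-09-06", "2025-02-17 09:45:30", 92.3)
).toDF("ID", "Name", "Room Number", "DOB", "Submit Time", "Marks")
```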
Ranking Records with rank()
The rank() function assigns a ranking number to each row within a partition. In the case of ties (i.e., equal values), it assigns the same rank and skips the subsequent number(s).
val rankRow = Window.partitionBy(col("Room Number"))
.orderBy(col("Marks"))
val result = df.withColumn("rank", rank().over(rankRow))
result.show()
Output
+---+--------+-----------+----------+-------------------+-----+----+
| ID| Name|Room Number| DOB| Submit Time|Marks|rank|
+---+--------+-----------+----------+-------------------+-----+----+
| 4| Kamal| 20|2010-08-25|2025-02-17 17:10:05| 82.3| 1|
| 2|Bharghav| 20|2009-06-04|2025-02-17 08:15:30| 88.5| 2|
| 6| Tanish| 20|2009-05-11|2025-02-17 09:45:30| 88.5| 2|
| 7| Uday| 20|2009-09-06|2025-02-17 09:45:30| 92.3| 4|
| 1| Ajay| 10|2010-01-01|2025-02-17 12:30:45|92.75| 1|
| 3| Chaitra| 30|2010-12-12|2025-02-17 14:45:10| 75.8| 1|
| 5| Sohaib| 30|2009-04-14|2025-02-17 09:55:20| 90.6| 2|
+---+--------+-----------+----------+-------------------+-----+----+
Each partition (based on Room Number) is ranked independently, and ties result in skipped ranks.
Using dense_rank() for Consecutive Ranking
The dense_rank() function works similarly to rank() but does not skip numbers after ties. It ensures consecutive ranks even when multiple rows share the same value.
val denseResult = df.withColumn("rank",dense_rank().over(rankRow))
denseResult.show()
Output
+---+--------+-----------+----------+-------------------+-----+----+
| ID| Name|Room Number| DOB| Submit Time|Marks|rank|
+---+--------+-----------+----------+-------------------+-----+----+
| 4| Kamal| 20|2010-08-25|2025-02-17 17:10:05| 82.3| 1|
| 2|Bharghav| 20|2009-06-04|2025-02-17 08:15:30| 88.5| 2|
| 6| Tanish| 20|2009-05-11|2025-02-17 09:45:30| 88.5| 2|
| 7| Uday| 20|2009-09-06|2025-02-17 09:45:30| 92.3| 3|
| 1| Ajay| 10|2010-01-01|2025-02-17 12:30:45|92.75| 1|
| 3| Chaitra| 30|2010-12-12|2025-02-17 14:45:10| 75.8| 1|
| 5| Sohaib| 30|2009-04-14|2025-02-17 09:55:20| 90.6| 2|
+---+--------+-----------+----------+-------------------+-----+----+
Note how there are no gaps in the ranking values compared to the rank() output.
Calculating Relative Rank with percent_rank()
The percent_rank() function returns the relative rank of a row as a decimal between 0 and 1. It is computed using the formula: (rank - 1) / (number of rows in partition - 1)
val percentRank = df.withColumn("Percent Rank",percent_rank().over(rankRow))
percentRank.show()
Output
+---+--------+-----------+----------+-------------------+-----+------------------+
| ID| Name|Room Number| DOB| Submit Time|Marks| Percent Rank|
+---+--------+-----------+----------+-------------------+-----+------------------+
| 4| Kamal| 20|2010-08-25|2025-02-17 17:10:05| 82.3| 0.0|
| 2|Bharghav| 20|2009-06-04|2025-02-17 08:15:30| 88.5|0.3333333333333333|
| 6| Tanish| 20|2009-05-11|2025-02-17 09:45:30| 88.5|0.3333333333333333|
| 7| Uday| 20|2009-09-06|2025-02-17 09:45:30| 92.3| 1.0|
| 1| Ajay| 10|2010-01-01|2025-02-17 12:30:45|92.75| 0.0|
| 3| Chaitra| 30|2010-12-12|2025-02-17 14:45:10| 75.8| 0.0|
| 5| Sohaib| 30|2009-04-14|2025-02-17 09:55:20| 90.6| 1.0|
+---+--------+-----------+----------+-------------------+-----+------------------+
This function is especially useful for statistical analysis and distribution-based logic.
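As a quick check against the output above: in Room 20, Bharghav has rank 2 out of 4 rows, so (2 - 1) / (4 - 1) ≈ 0.3333. One way this can be used (a sketch building on the percentRank DataFrame from the previous snippet) is to threshold on the relative rank:

```scala
// Keep only students at or below the median Marks of their room
val lowerHalf = percentRank.filter(col("Percent Rank") <= 0.5)
lowerHalf.show()
```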
Assigning Row Numbers with row_number()
The row_number() function assigns a unique, sequential number to each row within a partition, regardless of duplicate values.
val rowNumber = df.withColumn("Row Number", row_number().over(rankRow))
rowNumber.show()
Output
+---+--------+-----------+----------+-------------------+-----+----------+
| ID| Name|Room Number| DOB| Submit Time|Marks|Row Number|
+---+--------+-----------+----------+-------------------+-----+----------+
| 4| Kamal| 20|2010-08-25|2025-02-17 17:10:05| 82.3| 1|
| 2|Bharghav| 20|2009-06-04|2025-02-17 08:15:30| 88.5| 2|
| 6| Tanish| 20|2009-05-11|2025-02-17 09:45:30| 88.5| 3|
| 7| Uday| 20|2009-09-06|2025-02-17 09:45:30| 92.3| 4|
| 1| Ajay| 10|2010-01-01|2025-02-17 12:30:45|92.75| 1|
| 3| Chaitra| 30|2010-12-12|2025-02-17 14:45:10| 75.8| 1|
| 5| Sohaib| 30|2009-04-14|2025-02-17 09:55:20| 90.6| 2|
+---+--------+-----------+----------+-------------------+-----+----------+
This function is helpful when you need deterministic ordering for tasks like selecting the top N records per group.
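For example, the top-N-per-group pattern might be sketched as follows. Note it reuses the same partitioning but orders Marks descending, so the highest score in each room receives row number 1:

```scala
// Highest scorer per room: number rows by Marks descending, keep the first
val byMarksDesc = Window.partitionBy(col("Room Number")).orderBy(col("Marks").desc)
val topPerRoom = df.withColumn("Row Number", row_number().over(byMarksDesc))
  .filter(col("Row Number") === 1)
  .drop("Row Number")
topPerRoom.show()
```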
Summary
In this article, you learned:
- The concept of window functions and their significance in large-scale data processing.
- Practical usage of four ranking-related functions in Spark:
  - rank() – assigns rank, skipping numbers on ties.
  - dense_rank() – assigns rank without skipping.
  - percent_rank() – computes relative position within a group.
  - row_number() – gives each row a unique sequential identifier.
These tools allow deeper insight and flexibility when working with structured data in Spark.