Sorting Records with Null Values
Last updated on: 2025-05-30
In the previous article, we covered how sorting works in Spark and how it helps you read and organize data efficiently. But what happens when your dataset contains null values? How does Spark treat them during sorting operations? This is a common scenario in big data processing, where incomplete or missing data is often inevitable.
In this article, we’ll explore how Spark handles sorting with null values and the specific functions it offers to manage them effectively. Before proceeding, you may want to revisit our Sort Functions article for a quick refresher.
Let’s consider a sample studentData DataFrame containing some null values:
+----+--------+-----+
|Roll| Name|Marks|
+----+--------+-----+
| 1| Ajay| 85|
| 2|Bharghav| 76|
| 3| Chaitra| null|
| 4| null| 90|
| 5| Sohaib| null|
+----+--------+-----+
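If you want to follow along, here is a minimal sketch for building this DataFrame. It assumes an active SparkSession named spark (for example, the one provided by spark-shell); the Option values stand in for the nulls.
import spark.implicits._

// Option encodes the missing Name and Marks entries as nulls in the DataFrame.
val studentData = Seq(
  (1, Some("Ajay"),     Some(85)),
  (2, Some("Bharghav"), Some(76)),
  (3, Some("Chaitra"),  None),
  (4, None,             Some(90)),
  (5, Some("Sohaib"),   None)
).toDF("Roll", "Name", "Marks")

studentData.show()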
Spark provides specific methods to sort data with null values, allowing you to decide whether nulls should appear at the beginning or the end of the sorted results. These functions are:
1. asc_nulls_first
Sorts the specified column in ascending order and places null values at the beginning (hence the name 'first').
import org.apache.spark.sql.functions.col

// Sort by Name in ascending order, with nulls placed first.
val sortNullValues = studentData.orderBy(col("Name").asc_nulls_first)
sortNullValues.show()
Output
+----+--------+-----+
|Roll| Name|Marks|
+----+--------+-----+
| 4| null| 90|
| 1| Ajay| 85|
| 2|Bharghav| 76|
| 3| Chaitra| null|
| 5| Sohaib| null|
+----+--------+-----+
Since the sorting is applied only to the student names, null values in other columns are ignored.
Note: By default, Spark arranges the values in ascending order.
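Building on that note: a plain orderBy on Name (with no null qualifier) also sorts ascending and, because Spark treats nulls as the smallest values, places them first, matching asc_nulls_first. A quick sketch, assuming the same studentData DataFrame and col import from above:
// Plain ascending sort: nulls come first by default.
val defaultSort = studentData.orderBy(col("Name"))
defaultSort.show()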
2. asc_nulls_last
Sorts the specified column in ascending order, placing null values at the end.
val sortNullValues = studentData.orderBy(col("Name").asc_nulls_last)
sortNullValues.show()
Output
+----+--------+-----+
|Roll| Name|Marks|
+----+--------+-----+
| 1| Ajay| 85|
| 2|Bharghav| 76|
| 3| Chaitra| null|
| 5| Sohaib| null|
| 4| null| 90|
+----+--------+-----+
3. desc_nulls_first
Sorts the specified column in descending order and places null values at the beginning.
val descNullValues = studentData.orderBy(col("Marks").desc_nulls_first)
descNullValues.show()
Output
+----+--------+-----+
|Roll| Name|Marks|
+----+--------+-----+
| 5| Sohaib| null|
| 3| Chaitra| null|
| 4| null| 90|
| 1| Ajay| 85|
| 2|Bharghav| 76|
+----+--------+-----+
Here the sorting is done in descending order on the Marks column, so records with a null in Marks are placed on top. The null in the Name column (Roll 4) does not affect the ordering, since Name is not the sort key.
4. desc_nulls_last
Sorts the specified column in descending order and places null values at the end.
val descNullValues = studentData.orderBy(col("Marks").desc_nulls_last)
descNullValues.show()
Output
+----+--------+-----+
|Roll| Name|Marks|
+----+--------+-----+
| 4| null| 90|
| 1| Ajay| 85|
| 2|Bharghav| 76|
| 5| Sohaib| null|
| 3| Chaitra| null|
+----+--------+-----+
Summary
In this article, we learned how Spark handles sorting with null values and explored four dedicated functions:
- asc_nulls_first: Ascending order with nulls at the top
- asc_nulls_last: Ascending order with nulls at the bottom
- desc_nulls_first: Descending order with nulls at the top
- desc_nulls_last: Descending order with nulls at the bottom
These functions provide fine-grained control over how missing values are treated in sort operations, which is essential when working with real-world datasets containing incomplete information.
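These sort expressions can also be combined in a single orderBy call. As a rough sketch using the same studentData DataFrame, the following sorts primarily by Marks in descending order with nulls last, breaking ties by Name in ascending order with nulls first:
val combinedSort = studentData.orderBy(
  col("Marks").desc_nulls_last,  // primary key: highest marks first, missing marks at the end
  col("Name").asc_nulls_first    // tie-breaker: names ascending, missing names first
)
combinedSort.show()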