Advanced Joins
Last updated on: 2025-05-30
In the previous article, we explored the foundational joins in Spark that are crucial for combining datasets efficiently. Those basic joins set the stage for more complex and specialized join operations; for a refresher, see Essential Joins in Spark.
Now let's move on to advanced joins in Spark. These join types are particularly useful for filtering data and for generating Cartesian products, and understanding them can significantly improve the performance and clarity of your Spark queries. Consider the DataFrames below.
Sample DataFrames
We’ll use the following two DataFrames for demonstration:
DataFrame-2 (df2)
+----+----------------+---+
|Roll| Subject|Age|
+----+----------------+---+
| 1| Chemistry| 14|
| 2|Computer Science| 15|
| 5| BioChemistry| 16|
| 8| Statistics| 14|
| 7| Real Analysis| 15|
+----+----------------+---+
DataFrame-3 (df3)
+----+------+-------------+
|Roll| Name| Subject|
+----+------+-------------+
| 4| Kamal| Biology|
| 5|Sohaib| BioChemistry|
| 7|Tanmay|Real Analysis|
| 8| Tanuj| Statistics|
+----+------+-------------+
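If you want to follow along, a minimal sketch for constructing these two DataFrames (assuming an existing SparkSession in scope as spark) could look like this:

import spark.implicits._

// DataFrame-2 (df2): Roll, Subject, Age
val df2 = Seq(
  (1, "Chemistry", 14),
  (2, "Computer Science", 15),
  (5, "BioChemistry", 16),
  (8, "Statistics", 14),
  (7, "Real Analysis", 15)
).toDF("Roll", "Subject", "Age")

// DataFrame-3 (df3): Roll, Name, Subject
val df3 = Seq(
  (4, "Kamal", "Biology"),
  (5, "Sohaib", "BioChemistry"),
  (7, "Tanmay", "Real Analysis"),
  (8, "Tanuj", "Statistics")
).toDF("Roll", "Name", "Subject")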
1. Cross Join
A Cross Join, also known as a Cartesian Product, matches every row in the left DataFrame with every row in the right DataFrame.
// Cross join: every row of df3 is paired with every row of df2
val cJoin = df3.crossJoin(df2)
cJoin.show()
Output
+----+------+-------------+----+----------------+---+
|Roll| Name| Subject|Roll| Subject|Age|
+----+------+-------------+----+----------------+---+
| 4| Kamal| Biology| 1| Chemistry| 14|
| 4| Kamal| Biology| 2|Computer Science| 15|
| 4| Kamal| Biology| 5| BioChemistry| 16|
| 4| Kamal| Biology| 8| Statistics| 14|
| 4| Kamal| Biology| 7| Real Analysis| 15|
| 5|Sohaib| BioChemistry| 1| Chemistry| 14|
| 5|Sohaib| BioChemistry| 2|Computer Science| 15|
| 5|Sohaib| BioChemistry| 5| BioChemistry| 16|
| 5|Sohaib| BioChemistry| 8| Statistics| 14|
| 5|Sohaib| BioChemistry| 7| Real Analysis| 15|
+----+------+-------------+----+----------------+---+
(output truncated; the full Cartesian product has 4 × 5 = 20 rows)
This join is useful when all combinations of records from two DataFrames are needed, but use it with caution: the result contains one row for every pairing, so its size is the product of the two row counts (here 4 × 5 = 20) and can grow very large on real datasets.
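Before materialising a cross join on real data, it can help to estimate its size from the inputs. A minimal sketch (variable names are illustrative):

val leftRows = df3.count()                 // 4 rows in df3
val rightRows = df2.count()                // 5 rows in df2
val expectedRows = leftRows * rightRows    // 20 rows in the Cartesian product
println(s"Cross join would produce $expectedRows rows")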
2. Left Semi Join
A Left Semi Join returns only the rows from the left DataFrame that have a matching value in the right DataFrame, based on the join key. Importantly, only the left DataFrame’s columns are included in the result.
This join is ideal when you simply want to check for the existence of matching records.
// Keep only the df3 rows whose Subject also appears in df2
val leftSemi = df3.join(df2, Seq("Subject"), "left_semi")
leftSemi.show()
Output
+-------------+----+------+
| Subject|Roll| Name|
+-------------+----+------+
| BioChemistry| 5|Sohaib|
|Real Analysis| 7|Tanmay|
| Statistics| 8| Tanuj|
+-------------+----+------+
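The same rows can also be selected in Spark SQL, which supports LEFT SEMI JOIN directly. A sketch assuming the two DataFrames are registered as temporary views named df2 and df3:

df2.createOrReplaceTempView("df2")
df3.createOrReplaceTempView("df3")

val leftSemiSql = spark.sql(
  "SELECT * FROM df3 LEFT SEMI JOIN df2 ON df3.Subject = df2.Subject"
)
leftSemiSql.show()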
3. Left Anti Join
The Left Anti Join does the opposite of a Left Semi Join. It returns only the rows from the left DataFrame that do not have a match in the right DataFrame.
This is useful for identifying records that are missing or unmatched between datasets.
// Keep only the df3 rows whose Subject does not appear in df2
val leftAnti = df3.join(df2, Seq("Subject"), "left_anti")
leftAnti.show()
Output
+-------+----+-----+
|Subject|Roll| Name|
+-------+----+-----+
|Biology| 4|Kamal|
+-------+----+-----+
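Together, the semi and anti joins partition the left DataFrame: every row of df3 is either matched or unmatched on Subject. A quick check, reusing the variables defined above:

val matched = leftSemi.count()     // 3 rows with a matching Subject in df2
val unmatched = leftAnti.count()   // 1 row (Kamal, Biology) with no match
assert(matched + unmatched == df3.count())   // 3 + 1 = 4 rows in df3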
Summary
In this article, we explored advanced join types in Apache Spark:
- Cross Join (Cartesian Product): Matches every row from one DataFrame with every row of another.
- Left Semi Join: Returns rows from the left DataFrame that exist in the right DataFrame, without including right DataFrame columns.
- Left Anti Join: Retrieves rows from the left DataFrame that have no match in the right DataFrame.
These joins are powerful tools for filtering and refining data in distributed processing, allowing you to build more efficient and expressive Spark queries.