Essential Joins
Last updated on: 2025-05-30
Joining is a core operation in data engineering that enables us to combine two datasets based on a common key. A join expression determines whether two rows should be joined, and the join type defines which rows should be included in the final result.
Spark provides a variety of join types for different use cases. In this article, we will cover the essential joins that form the foundation for working with DataFrames in Spark.
To refresh your memory on how to create DataFrames, refer to the Dataframes article.
Sample DataFrames
Let’s work with the following example DataFrames:
DataFrame-1
+----+--------+-----+
|Roll| Name|Marks|
+----+--------+-----+
| 1| Ajay| 55|
| 2|Bharghav| 63|
| 3| Chaitra| 60|
| 4| Kamal| 75|
| 5| Sohaib| 70|
+----+--------+-----+
DataFrame-2
+----+----------------+---+
|Roll| Subject|Age|
+----+----------------+---+
| 1| Chemistry| 14|
| 2|Computer Science| 15|
| 5| BioChemistry| 16|
| 8| Statistics| 14|
| 7| Real Analysis| 15|
+----+----------------+---+
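The article's snippets assume `df1` and `df2` already exist. One way to build them, assuming an active `SparkSession` named `spark` (as in `spark-shell`), is a sketch like this:

```scala
// Assumes an active SparkSession named `spark` (e.g. from spark-shell).
import spark.implicits._

// DataFrame-1: roll number, name, and marks.
val df1 = Seq(
  (1, "Ajay", 55),
  (2, "Bharghav", 63),
  (3, "Chaitra", 60),
  (4, "Kamal", 75),
  (5, "Sohaib", 70)
).toDF("Roll", "Name", "Marks")

// DataFrame-2: roll number, subject, and age.
val df2 = Seq(
  (1, "Chemistry", 14),
  (2, "Computer Science", 15),
  (5, "BioChemistry", 16),
  (8, "Statistics", 14),
  (7, "Real Analysis", 15)
).toDF("Roll", "Subject", "Age")
```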
Inner Join
An inner join returns only the rows whose key appears in both DataFrames. In the two DataFrames above, the common Roll values are 1, 2, and 5, so only those rows are combined in the result.
val innerJoin = df1.join(df2, Seq("Roll"), "inner")
innerJoin.show()
Output
+----+--------+-----+----------------+---+
|Roll| Name|Marks| Subject|Age|
+----+--------+-----+----------------+---+
| 1| Ajay| 55| Chemistry| 14|
| 2|Bharghav| 63|Computer Science| 15|
| 5| Sohaib| 70| BioChemistry| 16|
+----+--------+-----+----------------+---+
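As mentioned earlier, a join is driven by a join expression. The same inner join can be written with an explicit expression instead of `Seq("Roll")`; the difference is that this form keeps the Roll column from both DataFrames, so the duplicate is usually dropped. A sketch, assuming `df1` and `df2` as defined above:

```scala
// Inner join with an explicit join expression. Unlike Seq("Roll"),
// this keeps Roll from both sides, so we drop the duplicate column.
val innerJoinExpr = df1
  .join(df2, df1("Roll") === df2("Roll"), "inner")
  .drop(df2("Roll"))

innerJoinExpr.show()
```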
Full Outer Join
Returns all rows from both DataFrames. Where a row has no match on the other side, the missing columns are filled with null.
This join is useful when we want every record from both DataFrames in a single result.
val outerJoin = df1.join(df2, Seq("Roll"), "outer")
outerJoin.show()
Output
+----+--------+-----+----------------+----+
|Roll| Name|Marks| Subject| Age|
+----+--------+-----+----------------+----+
| 1| Ajay| 55| Chemistry| 14|
| 3| Chaitra| 60| null|null|
| 5| Sohaib| 70| BioChemistry| 16|
| 4| Kamal| 75| null|null|
| 8| null| null| Statistics| 14|
| 7| null| null| Real Analysis| 15|
| 2|Bharghav| 63|Computer Science| 15|
+----+--------+-----+----------------+----+
Left Outer Join
Returns all rows from the left DataFrame and the matching rows from the right. Rows in the right DataFrame whose key has no match on the left are omitted, and left-side rows without a match get null in the right-hand columns.
val leftOuterJoin = df1.join(df2, Seq("Roll"), "left")
leftOuterJoin.show()
Output
+----+--------+-----+----------------+----+
|Roll| Name|Marks| Subject| Age|
+----+--------+-----+----------------+----+
| 1| Ajay| 55| Chemistry| 14|
| 2|Bharghav| 63|Computer Science| 15|
| 3| Chaitra| 60| null|null|
| 4| Kamal| 75| null|null|
| 5| Sohaib| 70| BioChemistry| 16|
+----+--------+-----+----------------+----+
Right Outer Join
Returns all rows from the right DataFrame and the matching rows from the left. Rows in the left DataFrame whose key has no match on the right are omitted, and right-side rows without a match get null in the left-hand columns.
val rightOuterJoin = df1.join(df2, Seq("Roll"), "right")
rightOuterJoin.show()
Output
+----+--------+-----+----------------+---+
|Roll| Name|Marks| Subject|Age|
+----+--------+-----+----------------+---+
| 1| Ajay| 55| Chemistry| 14|
| 2|Bharghav| 63|Computer Science| 15|
| 5| Sohaib| 70| BioChemistry| 16|
| 8| null| null| Statistics| 14|
| 7| null| null| Real Analysis| 15|
+----+--------+-----+----------------+---+
Summary
In this article, we explored the Essential Joins in Spark:
- Inner Join: Only matching records.
- Full Outer Join: All records from both sides.
- Left Outer Join: All records from the left + matching right.
- Right Outer Join: All records from the right + matching left.
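Since the four joins differ only in the last argument to `join`, they can be compared side by side. A sketch, assuming `df1` and `df2` from the sample DataFrames above:

```scala
// Compare the row counts produced by each join type.
// For the sample data: inner = 3, outer = 7, left = 5, right = 5.
for (joinType <- Seq("inner", "outer", "left", "right")) {
  val joined = df1.join(df2, Seq("Roll"), joinType)
  println(s"$joinType join: ${joined.count()} rows")
}
```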
Understanding these foundational joins is crucial before diving into Advanced Spark Joins, where we explore filtering techniques and semi/anti joins.
Related Articles
- Dataframes
- Dataframe Select vs SelectExpr