Expressions
Last updated on: 2025-05-30
Understanding Expressions in Spark DataFrames
In Apache Spark, expressions are operations applied to DataFrame columns to derive new values. They are integral to many transformations such as select, filter, and agg.
In this article, we'll explore how to use expressions effectively in Spark, particularly within the select(), filter(), and expr() methods.
Let’s start with a sample DataFrame for demonstration:
+----+--------+-----------+-----------+------------+
|Roll| Name|Final Marks|Float Marks|Double Marks|
+----+--------+-----------+-----------+------------+
| 1| Ajay| 300| 55.5| 92.75|
| 2|Bharghav| 350| 63.2| 88.5|
| 3| Chaitra| 320| 60.1| 75.8|
| 4| Kamal| 360| 75.0| 82.3|
| 5| Sohaib| 450| 70.8| 90.6|
+----+--------+-----------+-----------+------------+
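If you'd like to follow along, here is one minimal way to build this DataFrame (a sketch, assuming a local SparkSession; note that Float Marks is inferred as a single-precision FloatType, which becomes relevant later):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ExpressionsDemo") // hypothetical app name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Tuple elements infer the column types: Int, String, Int, Float, Double
val df = Seq(
  (1, "Ajay",     300, 55.5f, 92.75),
  (2, "Bharghav", 350, 63.2f, 88.5),
  (3, "Chaitra",  320, 60.1f, 75.8),
  (4, "Kamal",    360, 75.0f, 82.3),
  (5, "Sohaib",   450, 70.8f, 90.6)
).toDF("Roll", "Name", "Final Marks", "Float Marks", "Double Marks")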
Creating a New Column with Expressions
We can use the select() method to apply arithmetic expressions on columns and create new ones. For instance, here we add 5 to each value in the Double Marks column:
import org.apache.spark.sql.functions.col

val applyExpr = df.select(
  col("Name"),
  col("Double Marks"),
  (col("Double Marks") + 5).as("Updated Marks")
)
applyExpr.show()
Output
+--------+------------+-------------+
| Name|Double Marks|Updated Marks|
+--------+------------+-------------+
| Ajay| 92.75| 97.75|
|Bharghav| 88.5| 93.5|
| Chaitra| 75.8| 80.8|
| Kamal| 82.3| 87.3|
| Sohaib| 90.6| 95.6|
+--------+------------+-------------+
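For comparison, the same derived column can be added with withColumn, which keeps every existing column instead of projecting a subset (a sketch, not part of the original example):
// withColumn appends "Updated Marks" alongside all original columns
val withNewCol = df.withColumn("Updated Marks", col("Double Marks") + 5)
withNewCol.select("Name", "Double Marks", "Updated Marks").show()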
Applying String Operations with Expressions
String transformations are also possible using expressions. Below, we use the upper() function to convert names to uppercase:
import org.apache.spark.sql.functions.upper

val applyString = df.select(
  col("Name"),
  upper(col("Name")).as("Updated Name")
)
applyString.show()
Output
+--------+------------+
| Name|Updated Name|
+--------+------------+
| Ajay| AJAY|
|Bharghav| BHARGHAV|
| Chaitra| CHAITRA|
| Kamal| KAMAL|
| Sohaib| SOHAIB|
+--------+------------+
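Other string functions from org.apache.spark.sql.functions compose the same way. For example, a hypothetical label column could combine concat, lit, and lower (an illustrative sketch, not from the original example):
import org.apache.spark.sql.functions.{col, concat, lit, lower}

// Build a derived string column: "student_" followed by the lowercased name
val labeled = df.select(
  col("Name"),
  concat(lit("student_"), lower(col("Name"))).as("Label")
)
labeled.show()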
Filtering Rows with Expressions
To filter rows using conditional expressions, embed the logic inside the filter() method. Here's how to filter students with Final Marks > 330 and Float Marks >= 70:
val filterExpr = df.filter(
  col("Final Marks") > 330 &&
  col("Float Marks") >= 70.0
)
filterExpr.show()
Output
+----+------+-----------+-----------+------------+
|Roll| Name|Final Marks|Float Marks|Double Marks|
+----+------+-----------+-----------+------------+
| 4| Kamal| 360| 75.0| 82.3|
| 5|Sohaib| 450| 70.8| 90.6|
+----+------+-----------+-----------+------------+
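The same predicate can also be written as a single SQL string, which filter() accepts directly; the backticks are required because the column names contain spaces (a sketch equivalent to the Column-based version above):
// SQL-string form of the same filter
val filterSql = df.filter("`Final Marks` > 330 AND `Float Marks` >= 70.0")
filterSql.show()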
Using expr() for Column Transformations
The expr() function allows you to specify transformations as SQL-style strings. Here, we multiply the Float Marks column by 1.5:
import org.apache.spark.sql.functions.expr

val updateExpr = df.select(
  col("Name"),
  col("Float Marks"),
  expr("`Float Marks` * 1.5 as `Updated Marks`")
)
updateExpr.show()
Output
+--------+-----------+------------------+
| Name|Float Marks| Updated Marks|
+--------+-----------+------------------+
| Ajay| 55.5| 83.25|
|Bharghav| 63.2| 94.80000114440918|
| Chaitra| 60.1| 90.14999771118164|
| Kamal| 75.0| 112.5|
| Sohaib| 70.8|106.20000457763672|
+--------+-----------+------------------+
Notice the long decimals in the output (e.g., 94.80000114440918): Float Marks is a single-precision FloatType column, and the multiplication promotes its values to double, exposing floating-point representation error. To dive deeper into the difference between select() and selectExpr(), check out this article: Select vs SelectExpr
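As a quick preview of selectExpr(), the same transformation can be written entirely as SQL strings, and SQL's round() can tidy the float artifacts seen above (a sketch under those assumptions):
// selectExpr takes SQL fragments directly; round(..., 2) trims the float noise
val updateSelectExpr = df.selectExpr(
  "Name",
  "`Float Marks`",
  "round(`Float Marks` * 1.5, 2) AS `Updated Marks`"
)
updateSelectExpr.show()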
Summary
In this article, we explored how expressions in Spark enable column-level transformations and filtering. Key takeaways include:
- Using select() with arithmetic or string expressions to create new columns.
- Filtering rows by embedding logical expressions in the filter() method.
- Applying SQL-style expressions using expr() for flexible transformations.