Expressions

Last updated on: 2025-05-30

Understanding Expressions in Spark DataFrames

In Apache Spark, an expression is an operation applied to DataFrame columns to derive new values. Expressions are integral to many transformations, such as select, filter, and agg.

In this article, we’ll explore how to use expressions effectively in Spark, especially within select, filter, and expr() methods.

Let’s start with a sample DataFrame for demonstration:

+----+--------+-----------+-----------+------------+
|Roll|    Name|Final Marks|Float Marks|Double Marks|
+----+--------+-----------+-----------+------------+
|   1|    Ajay|        300|       55.5|       92.75|
|   2|Bharghav|        350|       63.2|        88.5|
|   3| Chaitra|        320|       60.1|        75.8|
|   4|   Kamal|        360|       75.0|        82.3|
|   5|  Sohaib|        450|       70.8|        90.6|
+----+--------+-----------+-----------+------------+
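For reference, a DataFrame matching the table above can be built as follows. This is a minimal sketch: the column names and values come from the table, but the session setup (app name, master) is an assumption and may differ in your environment.

```scala
import org.apache.spark.sql.SparkSession

// Local session for demonstration purposes (hypothetical setup)
val spark = SparkSession.builder()
  .appName("ExpressionsDemo")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// Float Marks is a FloatType column, Double Marks a DoubleType column
val df = Seq(
  (1, "Ajay",     300, 55.5f, 92.75),
  (2, "Bharghav", 350, 63.2f, 88.5),
  (3, "Chaitra",  320, 60.1f, 75.8),
  (4, "Kamal",    360, 75.0f, 82.3),
  (5, "Sohaib",   450, 70.8f, 90.6)
).toDF("Roll", "Name", "Final Marks", "Float Marks", "Double Marks")

df.show()
```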

Creating a New Column with Expressions

We can use the select() method to apply arithmetic expressions on columns and create new ones. For instance, here we add 5 to each value in the Double Marks column:

import org.apache.spark.sql.functions.col

val applyExpr = df.select(
  col("Name"),
  col("Double Marks"),
  (col("Double Marks") + 5).as("Updated Marks")
)

applyExpr.show()

Output

+--------+------------+-------------+
|    Name|Double Marks|Updated Marks|
+--------+------------+-------------+
|    Ajay|       92.75|        97.75|
|Bharghav|        88.5|         93.5|
| Chaitra|        75.8|         80.8|
|   Kamal|        82.3|         87.3|
|  Sohaib|        90.6|         95.6|
+--------+------------+-------------+

Applying String Operations with Expressions

String transformations are also possible using expressions. Below, we use the upper() function to convert names to uppercase.

val applyString = df.select(
  col("Name"),
  upper(col("Name")).as("Updated Name")
)

applyString.show()

Output

+--------+------------+
|    Name|Updated Name|
+--------+------------+
|    Ajay|        AJAY|
|Bharghav|    BHARGHAV|
| Chaitra|     CHAITRA|
|   Kamal|       KAMAL|
|  Sohaib|      SOHAIB|
+--------+------------+

Filtering Rows with Expressions

To filter rows using conditional expressions, embed the logic inside the filter() method. Here's how to filter students with Final Marks > 330 and Float Marks >= 70:

val filterExpr = df.filter(
  col("Final Marks") > 330 &&
  col("Float Marks") >= 70.0
)

filterExpr.show()

Output

+----+------+-----------+-----------+------------+
|Roll|  Name|Final Marks|Float Marks|Double Marks|
+----+------+-----------+-----------+------------+
|   4| Kamal|        360|       75.0|        82.3|
|   5|Sohaib|        450|       70.8|        90.6|
+----+------+-----------+-----------+------------+

Using expr() for Column Transformations

The expr() function allows you to specify transformations as SQL-style strings.

Here, we multiply the Float Marks column by 1.5. Note that because Float Marks is a single-precision float, it is promoted to double during the multiplication, which is why some results show extra decimal digits in the output below.

val updateExpr = df.select(
  col("Name"),
  col("Float Marks"),
  expr("`Float Marks` * 1.5 as `Updated Marks`")
)

updateExpr.show()

Output

+--------+-----------+------------------+
|    Name|Float Marks|     Updated Marks|
+--------+-----------+------------------+
|    Ajay|       55.5|             83.25|
|Bharghav|       63.2| 94.80000114440918|
| Chaitra|       60.1| 90.14999771118164|
|   Kamal|       75.0|             112.5|
|  Sohaib|       70.8|106.20000457763672|
+--------+-----------+------------------+

To dive deeper into the difference between select() and selectExpr(), check out this article: Select vs SelectExpr
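As a quick taste of that difference: selectExpr() accepts SQL expression strings directly, so the expr() example above can be written more compactly. The sketch below builds a small stand-in DataFrame so it is self-contained; the session setup and sample values are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Minimal local session for demonstration (hypothetical setup)
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Small stand-in for the article's DataFrame
val df = Seq(("Ajay", 55.5f), ("Kamal", 75.0f))
  .toDF("Name", "Float Marks")

// selectExpr() takes SQL strings directly, so no expr() wrapper is needed;
// backticks quote column names that contain spaces
val updated = df.selectExpr("Name", "`Float Marks` * 1.5 as `Updated Marks`")
updated.show()
```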

Summary

In this article, we explored how expressions in Spark enable column-level transformations and filtering. Key takeaways include:

  • Using select() with arithmetic or string expressions to create new columns.

  • Filtering rows by embedding logical expressions in the filter() method.

  • Applying SQL-style expressions using expr() for flexible transformations.
