select() vs. selectExpr() in Spark

Last updated on: 2025-05-30

Both select() and selectExpr() are commonly used to retrieve specific columns from a Spark DataFrame. While they serve the same purpose, there are important differences in how each is written and used. In this article, we’ll explore these differences with practical examples.

Key Difference

The primary distinction lies in syntax:

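  • select() works with Column objects, typically built with the col() function, and composes expressions through Column methods. It also accepts plain column names as strings, but not SQL expressions.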

  • selectExpr(), on the other hand, allows SQL-style expressions as strings, making it more concise and easier to write in many scenarios.

Let’s consider the following DataFrame for demonstration:

+----+--------+-----------+-----------+------------+
|Roll|    Name|Final Marks|Float Marks|Double Marks|
+----+--------+-----------+-----------+------------+
|   1|    Ajay|        300|       55.5|       92.75|
|   2|Bharghav|        350|       63.2|        88.5|
|   3| Chaitra|        320|       60.1|        75.8|
|   4|   Kamal|        360|       75.0|        82.3|
|   5|  Sohaib|        450|       70.8|        90.6|
+----+--------+-----------+-----------+------------+
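The article does not show how this DataFrame is created, so here is a minimal sketch of one way to build it locally. The application name, the local master setting, and the column types (Int, String, Int, Float, Double) are assumptions inferred from the table above, not details given by the source.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation; app name and master are arbitrary choices.
val spark = SparkSession.builder()
  .appName("select-vs-selectExpr")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._ // enables .toDF on local Scala collections

// Column types are assumed from the column names: Int, String, Int, Float, Double.
val df = Seq(
  (1, "Ajay", 300, 55.5f, 92.75),
  (2, "Bharghav", 350, 63.2f, 88.5),
  (3, "Chaitra", 320, 60.1f, 75.8),
  (4, "Kamal", 360, 75.0f, 82.3),
  (5, "Sohaib", 450, 70.8f, 90.6)
).toDF("Roll", "Name", "Final Marks", "Float Marks", "Double Marks")

df.show()
```

In spark-shell a session named spark already exists, so the builder call can be skipped there.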

Selecting Columns: select() vs. selectExpr()

Both methods select the same columns; they differ only in how the selection is written. selectExpr() is a variant of select() that accepts SQL expressions as strings. Because those strings are parsed as SQL, a column name that contains a space, such as Final Marks, must be wrapped in backticks. Let's see how they differ.

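import org.apache.spark.sql.functions.col

// select(): each column is referenced as a Column object via col()
val showCols = df.select(col("Roll"),
  col("Name"),
  col("Final Marks")
)

showCols.show()

// selectExpr(): each argument is a SQL expression string;
// backticks are needed because "Final Marks" contains a space
val showColsExpr = df.selectExpr("Roll", "Name", "`Final Marks`")

showColsExpr.show()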

Output

+----+--------+-----------+
|Roll|    Name|Final Marks|
+----+--------+-----------+
|   1|    Ajay|        300|
|   2|Bharghav|        350|
|   3| Chaitra|        320|
|   4|   Kamal|        360|
|   5|  Sohaib|        450|
+----+--------+-----------+

As shown above, both approaches return the same DataFrame. The difference is simply in the syntax used to write the selection logic.
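As a side note, select() also has an overload that takes column names as plain strings. The sketch below compares the three spellings on a two-row stand-in for the article's DataFrame; the names sample, byColumn, byName, and byExpr are illustrative and not from the original.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("select-forms")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// Hypothetical two-row stand-in for the article's DataFrame
val sample = Seq((1, "Ajay", 300), (2, "Bharghav", 350))
  .toDF("Roll", "Name", "Final Marks")

// Column objects built with col()
val byColumn = sample.select(col("Roll"), col("Name"), col("Final Marks"))

// Plain column names: treated as names, not parsed as SQL expressions
val byName = sample.select("Roll", "Name", "Final Marks")

// SQL expression strings: the name with a space needs backticks
val byExpr = sample.selectExpr("Roll", "Name", "`Final Marks`")

// All three should agree on schema and rows
val sameSchema = byColumn.schema == byName.schema && byName.schema == byExpr.schema
val sameRows = byColumn.collect().sameElements(byName.collect()) &&
  byName.collect().sameElements(byExpr.collect())

println(s"same schema: $sameSchema, same rows: $sameRows")
```

The string overload of select() is convenient for plain column picks, but it cannot express operations such as arithmetic or aliases; those need either Column objects or selectExpr().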

Performing Column Operations with select() and selectExpr()

Although select() and selectExpr() yield the same output, the selectExpr() syntax is often more concise and readable, especially when performing operations on columns.

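// select(): arithmetic is built from Column operators, and .as() names the result
val mathOps = df.select(col("Name"),
  (col("Float Marks") + 10).as("Updated Float Marks")
)

mathOps.show()

// selectExpr(): the same operation written as a SQL expression with an AS alias
val mathOpsExpr = df.selectExpr("Name",
  "`Float Marks` + 10 AS `Updated Float Marks`")

mathOpsExpr.show()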

Output

+--------+-------------------+
|    Name|Updated Float Marks|
+--------+-------------------+
|    Ajay|               65.5|
|Bharghav|               73.2|
| Chaitra|               70.1|
|   Kamal|               85.0|
|  Sohaib|               80.8|
+--------+-------------------+

Both methods return the same DataFrame.
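The two styles can also be combined: the expr() function in org.apache.spark.sql.functions turns a SQL expression string into a Column, so select() with expr() behaves like selectExpr(). The following is a small sketch on a hypothetical two-row DataFrame; sample, viaSelectExpr, viaExpr, and mixed are illustrative names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr}

val spark = SparkSession.builder()
  .appName("select-with-expr")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// Hypothetical stand-in with only the columns needed here
val sample = Seq(("Ajay", 55.5f), ("Kamal", 75.0f)).toDF("Name", "Float Marks")

// SQL strings passed directly to selectExpr()
val viaSelectExpr = sample.selectExpr("Name",
  "`Float Marks` + 10 AS `Updated Float Marks`")

// The same SQL string wrapped in expr() and passed to select()
val viaExpr = sample.select(col("Name"),
  expr("`Float Marks` + 10 AS `Updated Float Marks`"))

// Mixed style: SQL for the arithmetic, the Column API for the alias
val mixed = sample.select(col("Name"),
  expr("`Float Marks` + 10").as("Updated Float Marks"))

viaSelectExpr.show()
viaExpr.show()
mixed.show()
```

Mixing the two can be handy when most of a query is written with the Column API but one expression is easier to state in SQL.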

Summary

In this article, we explored the difference between select() and selectExpr() in Spark DataFrames:

  • Both methods are used to select specific columns.

  • select() works with Column objects such as col() and uses .as() to name renamed or computed columns.

  • selectExpr() allows SQL-style expressions, making the syntax shorter and easier to understand.

If you're comfortable with SQL syntax, selectExpr() can save you time and make your code more concise. However, both are equally powerful and choosing one over the other often comes down to personal preference and use case complexity.
