DataFrame Column Operations


DataFrame column operations involve modifying, transforming, or enriching data columns to prepare datasets for deeper analysis and modeling.

In our previous article, we learned how to create a DataFrame and convert it into other formats. You can revisit those concepts here: Dataframe

In this article, we’ll explore various column-level operations that can be performed on a Spark DataFrame.

Let’s use the following DataFrame as a reference:

+----+--------+-----+
|Roll|    Name|Marks|
+----+--------+-----+
|   1|    Ajay|   55|
|   2|Bharghav|   63|
|   3| Chaitra|   60|
|   4|   Kamal|   75|
|   5|  Sohaib|   70|
+----+--------+-----+
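
If you want to follow along, here is a minimal sketch of how this DataFrame could be built; it assumes a SparkSession named spark is already in scope (for example, in spark-shell):

// Building the reference DataFrame (sketch; assumes an existing SparkSession `spark`)
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, "Ajay", 55),
  (2, "Bharghav", 63),
  (3, "Chaitra", 60),
  (4, "Kamal", 75),
  (5, "Sohaib", 70)
).toDF("Roll", "Name", "Marks")

The functions._ import also brings in col(), when(), and lit(), which are used in the examples below.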

Add a New Column to the DataFrame

The .withColumn() function is used to add a new column, typically derived from existing columns.

// Adding a new column to the created dataframe
val addColumn = df.withColumn("Updated Marks", 
  col("Marks") + 5
)

addColumn.show()

Output

+----+--------+-----+-------------+
|Roll|    Name|Marks|Updated Marks|
+----+--------+-----+-------------+
|   1|    Ajay|   55|           60|
|   2|Bharghav|   63|           68|
|   3| Chaitra|   60|           65|
|   4|   Kamal|   75|           80|
|   5|  Sohaib|   70|           75|
+----+--------+-----+-------------+
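
Besides deriving values from existing columns, .withColumn() can attach a constant value using lit(). A quick sketch (the "School" column and its value are purely illustrative):

// Adding a constant column with lit() (illustrative column name and value)
val withSchool = addColumn.withColumn("School", lit("Springfield High"))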

Rename an Existing Column

Use .withColumnRenamed() to rename one or more columns.

// Renaming an existing column
val renameColumn = df.withColumnRenamed("Roll", "Roll Number")

renameColumn.show()

Output

+-----------+--------+-----+
|Roll Number|    Name|Marks|
+-----------+--------+-----+
|          1|    Ajay|   55|
|          2|Bharghav|   63|
|          3| Chaitra|   60|
|          4|   Kamal|   75|
|          5|  Sohaib|   70|
+-----------+--------+-----+
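
Because .withColumnRenamed() returns a new DataFrame, calls can be chained to rename several columns in one pass, as in this sketch:

// Renaming multiple columns by chaining calls
val renamedAll = df
  .withColumnRenamed("Roll", "Roll Number")
  .withColumnRenamed("Marks", "Score")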

Drop a Column from the DataFrame

The .drop() function is used to remove specific columns from the DataFrame.

// Dropping an existing column
val dropColumn = df.drop("Roll")

dropColumn.show()

Output

+--------+-----+
|    Name|Marks|
+--------+-----+
|    Ajay|   55|
|Bharghav|   63|
| Chaitra|   60|
|   Kamal|   75|
|  Sohaib|   70|
+--------+-----+
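
.drop() also accepts multiple column names in a single call, for example:

// Dropping more than one column at once
val dropTwo = df.drop("Roll", "Marks")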

Select Specific Columns from the DataFrame

The .select() function is used to pick out the desired columns from the DataFrame.

val selectColumns = df.select("Name", "Marks")

selectColumns.show()

Output

+--------+-----+
|    Name|Marks|
+--------+-----+
|    Ajay|   55|
|Bharghav|   63|
| Chaitra|   60|
|   Kamal|   75|
|  Sohaib|   70|
+--------+-----+

You can learn more about the .select() function in this Select vs SelectExpr article.
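
As a quick taste of the related .selectExpr() method, it lets you evaluate SQL expressions while selecting (a sketch; the alias BoostedMarks is illustrative):

// Selecting with a SQL expression via selectExpr
val selectExprDemo = df.selectExpr("Name", "Marks + 5 AS BoostedMarks")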

Filter Rows based on Column Values

Use the .filter() function to keep only the rows that satisfy a condition on column values.

// Filtering based on column values
val filterColumn = addColumn.filter(col("Updated Marks") >= 65)

filterColumn.show()

Output

+----+--------+-----+-------------+
|Roll|    Name|Marks|Updated Marks|
+----+--------+-----+-------------+
|   2|Bharghav|   63|           68|
|   3| Chaitra|   60|           65|
|   4|   Kamal|   75|           80|
|   5|  Sohaib|   70|           75|
+----+--------+-----+-------------+
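
Since .filter() takes a Column expression, conditions can be combined with && and ||. For example, a sketch that keeps marks within a range:

// Combining conditions in a single filter
val rangeFilter = addColumn.filter(col("Updated Marks") >= 65 && col("Updated Marks") < 80)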

Create a Column with Conditional Values

The .when() and .otherwise() functions are used to derive new columns conditionally. They come from the org.apache.spark.sql.functions._ object, so import it before using them.

// Creating columns with conditional values
val dfCategory = addColumn.withColumn("Division", 
  when(col("Updated Marks") > 70, 
    "Distinction").otherwise("First class")
)

dfCategory.show()

Output

+----+--------+-----+-------------+-----------+
|Roll|    Name|Marks|Updated Marks|   Division|
+----+--------+-----+-------------+-----------+
|   1|    Ajay|   55|           60|First class|
|   2|Bharghav|   63|           68|First class|
|   3| Chaitra|   60|           65|First class|
|   4|   Kamal|   75|           80|Distinction|
|   5|  Sohaib|   70|           75|Distinction|
+----+--------+-----+-------------+-----------+
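
Multiple .when() calls can be chained before .otherwise() to handle more than two outcomes. A sketch with illustrative grade boundaries:

// Chaining several conditions (boundaries are illustrative)
val dfGrades = addColumn.withColumn("Division",
  when(col("Updated Marks") > 70, "Distinction")
    .when(col("Updated Marks") > 60, "First class")
    .otherwise("Second class")
)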

Summary

In this article, we covered key column-level operations in Spark DataFrames:

  • Adding new columns using .withColumn()

  • Renaming columns with .withColumnRenamed()

  • Dropping columns using .drop()

  • Selecting specific columns via .select()

  • Filtering rows using .filter()

  • Creating conditional columns using .when().otherwise()

These operations are foundational for preparing and transforming your data in Spark for further analysis and modeling.
