Replacing String Values

Last updated on: 2025-05-30

Replacing string values in large DataFrames is a common requirement in data processing. This article covers three effective methods to achieve this in Apache Spark.

  • Using UDFs (User-Defined Functions)

  • Using when and otherwise expressions

  • Using Joins with lookup DataFrames

Let’s work with the following sample Student DataFrame:

+----+--------+----------------+---------+
|Roll|    Name|            Dept|     City|
+----+--------+----------------+---------+
|   1|    Ajay|         Physics|Hyderabad|
|   2|Bharghav|  Cyber Security|   Mumbai|
|   3| Chaitra|Material Science|   Indore|
|   4|   Kamal|          Design|     Puri|
|   5| Shohaib| Nuclear Science|   Cochin|
+----+--------+----------------+---------+

Using UDFs (User-Defined Functions)

Let us now add a new column State, where the city is located. For this, we need to create a map of city with its corresponding state, using UDF.

val state = Map(
  "Hyderabad" -> "Telangana",
  "Mumbai" -> "Maharashtra",
  "Indore" -> "Madhya Pradesh",
  "Puri" -> "Odisha",
  "Cochin" -> "Kerala"
)

// Define a UDF to map the states
val getStateUDF = udf(
  (city: String) => state.getOrElse(city,"Unknown")
)

val dfWithState = studentData.withColumn("State", getStateUDF(col("City")))

dfWithState.show()

Dataframe

+----+--------+----------------+---------+--------------+
|Roll|    Name|            Dept|     City|         State|
+----+--------+----------------+---------+--------------+
|   1|    Ajay|         Physics|Hyderabad|     Telangana|
|   2|Bharghav|  Cyber Security|   Mumbai|   Maharashtra|
|   3| Chaitra|Material Science|   Indore|Madhya Pradesh|
|   4|   Kamal|          Design|     Puri|        Odisha|
|   5| Shohaib| Nuclear Science|   Cochin|        Kerala|
+----+--------+----------------+---------+--------------+

The columns, Dept, City, State are all string values and demand larger memory to store the data. To optimize space requirements, we can replace the values with short-forms.

Method-1: Using UDFs

Replacing State name with short-form

Let’s convert State names into short forms using a Map and a UDF:

val stateShorFrom = Map(
  "Telangana" -> "TG",
  "Maharashtra" -> "MH",
  "Madhya Pradesh" -> "MP",
  "Odisha" -> "OD",
  "Kerala" -> "KL"
)

// Define a UDF to map the states
val getStateShortForm = udf(
  (state: String) => stateShorFrom.getOrElse(state, "Unknown")
)

val stateShortName = dfWithState.withColumn("State", getStateShortForm(col("State")))

stateShortName.show()

Output

+----+--------+----------------+---------+-----+
|Roll|    Name|            Dept|     City|State|
+----+--------+----------------+---------+-----+
|   1|    Ajay|         Physics|Hyderabad|   TG|
|   2|Bharghav|  Cyber Security|   Mumbai|   MH|
|   3| Chaitra|Material Science|   Indore|   MP|
|   4|   Kamal|          Design|     Puri|   OD|
|   5| Shohaib| Nuclear Science|   Cochin|   KL|
+----+--------+----------------+---------+-----+

Method-2: Using when and otherwise for Conditional Replacement

Now, we’ll replace City names with short forms using Spark's native when and otherwise expressions:

val cityShortName = stateShortName.withColumn(
  "City",
  when(col("City") === "Hyderabad", "HYD")
    .when(col("City") === "Mumbai","BOM")
    .when(col("City") === "Indore", "IDR")
    .when(col("City") === "Puri", "PUR")
    .when(col("City") === "Cochin","COC")
    .otherwise("NA")
)

cityShortName.show()

Output

+----+--------+----------------+----+-----+
|Roll|    Name|            Dept|City|State|
+----+--------+----------------+----+-----+
|   1|    Ajay|         Physics| HYD|   TG|
|   2|Bharghav|  Cyber Security| BOM|   MH|
|   3| Chaitra|Material Science| IDR|   MP|
|   4|   Kamal|          Design| PUR|   OD|
|   5| Shohaib| Nuclear Science| COC|   KL|
+----+--------+----------------+----+-----+

Method-3: Using Joins for Large-Scale Replacement

When working with large datasets, UDFs can become performance bottlenecks. An efficient alternative is joining with a lookup DataFrame.

Let’s map Dept names to their codes:

val deptCode = Seq(
  ("Physics", "PHy"),
  ("Cyber Security", "CyS"),
  ("Material Science", "Mat-Sci"),
  ("Design", "DES"),
  ("Nuclear Science", "Nuc-Sci")
).toDF("Dept", "Code")

val deptShortName = cityShortName.join(deptCode, Seq("Dept"),"inner")
  .withColumn("Dept", col("Code"))
  .drop("Code")
  .select("Roll", "Name", "Dept", "City", "State") // reordering the dataframe columns after join

deptShortName.show()

Output

+----+--------+-------+----+-----+
|Roll|    Name|   Dept|City|State|
+----+--------+-------+----+-----+
|   1|    Ajay|    PHy| HYD|   TG|
|   2|Bharghav|    CyS| BOM|   MH|
|   3| Chaitra|Mat-Sci| IDR|   MP|
|   4|   Kamal|    DES| PUR|   OD|
|   5| Shohaib|Nuc-Sci| COC|   KL|
+----+--------+-------+----+-----+

Summary

In this tutorial, we demonstrated three different ways to replace string values in a Spark DataFrame using:

  • User Defined Functions (UDFs):

    • Replaced city names with corresponding state names.

    • Then shortened full state names to abbreviations using a mapping UDF.

  • when() and otherwise() Conditions:

    • Replaced full city names with their respective short forms using chained conditional logic.
  • Join with a Lookup DataFrame:

    • Used a join operation to replace department names with short forms by creating a separate mapping DataFrame.

These techniques help improve memory efficiency by minimizing long string values in your DataFrames. Joins are particularly recommended for large-scale data processing as they avoid the additional computation overhead introduced by UDFs.

References