Replacing String Values
Last updated on: 2025-05-30
Replacing string values in large DataFrames is a common requirement in data processing. This article covers three effective methods to achieve this in Apache Spark.
-
Using UDFs (User-Defined Functions)
-
Using when and otherwise expressions
-
Using Joins with lookup DataFrames
Let’s work with the following sample Student DataFrame:
+----+--------+----------------+---------+
|Roll| Name| Dept| City|
+----+--------+----------------+---------+
| 1| Ajay| Physics|Hyderabad|
| 2|Bharghav| Cyber Security| Mumbai|
| 3| Chaitra|Material Science| Indore|
| 4| Kamal| Design| Puri|
| 5| Shohaib| Nuclear Science| Cochin|
+----+--------+----------------+---------+
Using UDFs (User-Defined Functions)
Let us now add a new column State, where the city is located. For this, we need to create a map of city with its corresponding state, using UDF.
val state = Map(
"Hyderabad" -> "Telangana",
"Mumbai" -> "Maharashtra",
"Indore" -> "Madhya Pradesh",
"Puri" -> "Odisha",
"Cochin" -> "Kerala"
)
// Define a UDF to map the states
val getStateUDF = udf(
(city: String) => state.getOrElse(city,"Unknown")
)
val dfWithState = studentData.withColumn("State", getStateUDF(col("City")))
dfWithState.show()
Dataframe
+----+--------+----------------+---------+--------------+
|Roll| Name| Dept| City| State|
+----+--------+----------------+---------+--------------+
| 1| Ajay| Physics|Hyderabad| Telangana|
| 2|Bharghav| Cyber Security| Mumbai| Maharashtra|
| 3| Chaitra|Material Science| Indore|Madhya Pradesh|
| 4| Kamal| Design| Puri| Odisha|
| 5| Shohaib| Nuclear Science| Cochin| Kerala|
+----+--------+----------------+---------+--------------+
The columns, Dept, City, State are all string values and demand larger memory to store the data. To optimize space requirements, we can replace the values with short-forms.
Method-1: Using UDFs
Replacing State name with short-form
Let’s convert State names into short forms using a Map and a UDF:
val stateShorFrom = Map(
"Telangana" -> "TG",
"Maharashtra" -> "MH",
"Madhya Pradesh" -> "MP",
"Odisha" -> "OD",
"Kerala" -> "KL"
)
// Define a UDF to map the states
val getStateShortForm = udf(
(state: String) => stateShorFrom.getOrElse(state, "Unknown")
)
val stateShortName = dfWithState.withColumn("State", getStateShortForm(col("State")))
stateShortName.show()
Output
+----+--------+----------------+---------+-----+
|Roll| Name| Dept| City|State|
+----+--------+----------------+---------+-----+
| 1| Ajay| Physics|Hyderabad| TG|
| 2|Bharghav| Cyber Security| Mumbai| MH|
| 3| Chaitra|Material Science| Indore| MP|
| 4| Kamal| Design| Puri| OD|
| 5| Shohaib| Nuclear Science| Cochin| KL|
+----+--------+----------------+---------+-----+
Method-2: Using when
and otherwise
for Conditional Replacement
Now, we’ll replace City names with short forms using Spark's native when and otherwise expressions:
val cityShortName = stateShortName.withColumn(
"City",
when(col("City") === "Hyderabad", "HYD")
.when(col("City") === "Mumbai","BOM")
.when(col("City") === "Indore", "IDR")
.when(col("City") === "Puri", "PUR")
.when(col("City") === "Cochin","COC")
.otherwise("NA")
)
cityShortName.show()
Output
+----+--------+----------------+----+-----+
|Roll| Name| Dept|City|State|
+----+--------+----------------+----+-----+
| 1| Ajay| Physics| HYD| TG|
| 2|Bharghav| Cyber Security| BOM| MH|
| 3| Chaitra|Material Science| IDR| MP|
| 4| Kamal| Design| PUR| OD|
| 5| Shohaib| Nuclear Science| COC| KL|
+----+--------+----------------+----+-----+
Method-3: Using Joins for Large-Scale Replacement
When working with large datasets, UDFs can become performance bottlenecks. An efficient alternative is joining with a lookup DataFrame.
Let’s map Dept names to their codes:
val deptCode = Seq(
("Physics", "PHy"),
("Cyber Security", "CyS"),
("Material Science", "Mat-Sci"),
("Design", "DES"),
("Nuclear Science", "Nuc-Sci")
).toDF("Dept", "Code")
val deptShortName = cityShortName.join(deptCode, Seq("Dept"),"inner")
.withColumn("Dept", col("Code"))
.drop("Code")
.select("Roll", "Name", "Dept", "City", "State") // reordering the dataframe columns after join
deptShortName.show()
Output
+----+--------+-------+----+-----+
|Roll| Name| Dept|City|State|
+----+--------+-------+----+-----+
| 1| Ajay| PHy| HYD| TG|
| 2|Bharghav| CyS| BOM| MH|
| 3| Chaitra|Mat-Sci| IDR| MP|
| 4| Kamal| DES| PUR| OD|
| 5| Shohaib|Nuc-Sci| COC| KL|
+----+--------+-------+----+-----+
Summary
In this tutorial, we demonstrated three different ways to replace string values in a Spark DataFrame using:
-
User Defined Functions (UDFs):
-
Replaced city names with corresponding state names.
-
Then shortened full state names to abbreviations using a mapping UDF.
-
-
when()
andotherwise()
Conditions:- Replaced full city names with their respective short forms using chained conditional logic.
-
Join with a Lookup DataFrame:
- Used a join operation to replace department names with short forms by creating a separate mapping DataFrame.
These techniques help improve memory efficiency by minimizing long string values in your DataFrames. Joins are particularly recommended for large-scale data processing as they avoid the additional computation overhead introduced by UDFs.