String Functions
Last updated on: 2025-05-30
Following our discussion of row operations, column operations, and various mathematical functions, it's time to explore another essential category: string operations.
Real-world data is rarely just numeric. Many datasets include string-type values that require special handling. String operations in Spark DataFrames allow us to clean, transform, and manipulate textual data efficiently.
String operations on DataFrames can be broadly categorized into:
- Basic String functions
- Substring operations
- String concatenation
Let’s now explore how to perform these operations using a Spark DataFrame.
Before diving in, let’s create a sample DataFrame:
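import org.apache.spark.sql.functions._ // col, lit, lower, upper, trim, length, reverse, substring, concat
import spark.implicits._                // enables .toDF on a Seq; `spark` is the active SparkSession (predefined in spark-shell)

val df = Seq(
  (1, "Ajay", "Physics", " Hyderabad "),
  (2, "Bharghav", "Cyber Security", " Mumbai "),
  (3, "Chaitra", "Material Science", " Indore "),
  (4, "Kamal", "Design", " Puri "),
  (5, "Sohaib", "Nuclear Science", " Cochin ")
).toDF("Roll", "Name", "Dept", "Location")

df.show()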
Output
+----+--------+----------------+-------------+
|Roll| Name| Dept| Location|
+----+--------+----------------+-------------+
| 1| Ajay| Physics| Hyderabad |
| 2|Bharghav| Cyber Security| Mumbai |
| 3| Chaitra|Material Science| Indore |
| 4| Kamal| Design| Puri |
| 5| Sohaib| Nuclear Science| Cochin |
+----+--------+----------------+-------------+
Convert Strings to Lowercase
To convert string values to lowercase, apply lower(col()) to the column inside the .withColumn() method.
val lowerDf = df.withColumn("Lower Case",
lower(col("Name"))
)
lowerDf.show()
Output
+----+--------+----------------+-------------+----------+
|Roll| Name| Dept| Location|Lower Case|
+----+--------+----------------+-------------+----------+
| 1| Ajay| Physics| Hyderabad | ajay|
| 2|Bharghav| Cyber Security| Mumbai | bharghav|
| 3| Chaitra|Material Science| Indore | chaitra|
| 4| Kamal| Design| Puri | kamal|
| 5| Sohaib| Nuclear Science| Cochin | sohaib|
+----+--------+----------------+-------------+----------+
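Lowercasing is also handy when comparing values whose casing is inconsistent. The following is a minimal sketch, assuming a local SparkSession and a small hypothetical DataFrame (`mixed`) that is not part of the sample data above. It lowercases the column before the comparison so that the filter is case-insensitive.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lower}

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().appName("lower-filter").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical data with inconsistent casing in the Dept column.
val mixed = Seq(
  (1, "Ajay", "Physics"),
  (2, "Bharghav", "PHYSICS"),
  (3, "Chaitra", "Design")
).toDF("Roll", "Name", "Dept")

// Lowercase the column first so that "Physics" and "PHYSICS" both match.
val physics = mixed.filter(lower(col("Dept")) === "physics")
val physicsNames = physics.select("Name").as[String].collect().sorted

physics.show()
```

Comparing against a lowercase literal keeps the original column untouched, which may be preferable to overwriting it when the original casing still matters downstream.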
Convert Strings to Uppercase
To convert string values to uppercase, apply upper(col()) to the column inside the .withColumn() method.
val upperDf = df.withColumn("Upper Case",
  upper(col("Dept")) // upper, not lower, to produce the uppercase output shown below
)
upperDf.show()
Output
+----+--------+----------------+-------------+----------------+
|Roll| Name| Dept| Location| Upper Case|
+----+--------+----------------+-------------+----------------+
| 1| Ajay| Physics| Hyderabad | PHYSICS|
| 2|Bharghav| Cyber Security| Mumbai | CYBER SECURITY|
| 3| Chaitra|Material Science| Indore |MATERIAL SCIENCE|
| 4| Kamal| Design| Puri | DESIGN|
| 5| Sohaib| Nuclear Science| Cochin | NUCLEAR SCIENCE|
+----+--------+----------------+-------------+----------------+
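Functions such as upper() return ordinary Column expressions, so they are not tied to .withColumn(). As a sketch, assuming a local SparkSession and a two-row copy of the sample data, the same function can be used inside select() with an alias (the alias `DeptUpper` is chosen here only for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, upper}

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().appName("upper-select").master("local[*]").getOrCreate()
import spark.implicits._

val sample = Seq(
  ("Ajay", "Physics"),
  ("Kamal", "Design")
).toDF("Name", "Dept")

// select() keeps only the listed columns; .as(...) names the derived column.
val deptUpper = sample.select(col("Name"), upper(col("Dept")).as("DeptUpper"))
val upperValues = deptUpper.collect().map(_.getString(1)).toSet

deptUpper.show()
```

select() is a reasonable choice when only a few columns are needed in the result, while .withColumn() keeps every existing column and appends the new one.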
Trim Whitespaces
Use trim(col()) to remove leading and trailing whitespace from a string column.
val dfTrim = df.withColumn("Loc Trimmed",
trim(col("Location"))
)
dfTrim.show()
Output
+----+--------+----------------+-------------+-----------+
|Roll| Name| Dept| Location|Loc Trimmed|
+----+--------+----------------+-------------+-----------+
| 1| Ajay| Physics| Hyderabad | Hyderabad|
| 2|Bharghav| Cyber Security| Mumbai | Mumbai|
| 3| Chaitra|Material Science| Indore | Indore|
| 4| Kamal| Design| Puri | Puri|
| 5| Sohaib| Nuclear Science| Cochin | Cochin|
+----+--------+----------------+-------------+-----------+
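When only one side should be stripped, Spark also provides ltrim() and rtrim(). The sketch below assumes a local SparkSession and a single hypothetical padded value, and compares the three variants side by side.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, ltrim, rtrim, trim}

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().appName("trim-variants").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical value with two spaces on each side.
val padded = Seq("  Puri  ").toDF("Location")

val sides = padded.select(
  ltrim(col("Location")).as("Left"),  // strips leading spaces only
  rtrim(col("Location")).as("Right"), // strips trailing spaces only
  trim(col("Location")).as("Both")    // strips both sides
)
val trimmedRow = sides.first()

sides.show()
```

Because show() pads cells for alignment, leftover whitespace is hard to see in the printed table; inspecting the collected values or their lengths is a more reliable check.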
Find String Length
To find the length of each value in a string column, use length(col()) inside the .withColumn() method.
val dfLength = df.withColumn("Name Length",
length(col("Name"))
)
dfLength.show()
Output
+----+--------+----------------+-------------+-----------+
|Roll| Name| Dept| Location|Name Length|
+----+--------+----------------+-------------+-----------+
| 1| Ajay| Physics| Hyderabad | 4|
| 2|Bharghav| Cyber Security| Mumbai | 8|
| 3| Chaitra|Material Science| Indore | 7|
| 4| Kamal| Design| Puri | 5|
| 5| Sohaib| Nuclear Science| Cochin | 6|
+----+--------+----------------+-------------+-----------+
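Note that length() counts every character, including leading and trailing spaces. A minimal sketch, assuming a local SparkSession and two hypothetical padded values, shows the difference between the raw and the trimmed length:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, length, trim}

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().appName("length-whitespace").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical values: two spaces on each side, and one trailing space.
val padded = Seq("  Puri  ", "Indore ").toDF("Location")

val lengths = padded.select(
  col("Location"),
  length(col("Location")).as("RawLength"),          // whitespace is counted
  length(trim(col("Location"))).as("TrimmedLength") // whitespace removed first
)
val lengthPairs = lengths.collect().map(r => (r.getInt(1), r.getInt(2))).toSet

lengths.show()
```

Comparing the two columns is one simple way to detect values that carry stray whitespace.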
Reverse String Values
Use reverse(col()) to reverse the characters of each string in a column.
val dfReverse = df.withColumn("Dept Reverse",
reverse(col("Dept"))
)
dfReverse.show()
Output
+----+--------+----------------+-------------+----------------+
|Roll| Name| Dept| Location| Dept Reverse|
+----+--------+----------------+-------------+----------------+
| 1| Ajay| Physics| Hyderabad | scisyhP|
| 2|Bharghav| Cyber Security| Mumbai | ytiruceS rebyC|
| 3| Chaitra|Material Science| Indore |ecneicS lairetaM|
| 4| Kamal| Design| Puri | ngiseD|
| 5| Sohaib| Nuclear Science| Cochin | ecneicS raelcuN|
+----+--------+----------------+-------------+----------------+
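One possible use of reverse(), shown here purely as an illustrative sketch with hypothetical data and a local SparkSession, is checking whether a value reads the same in both directions. The value is lowercased first so that casing does not affect the comparison.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lower, reverse}

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().appName("reverse-palindrome").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical words, not part of the sample data.
val words = Seq("Level", "Spark", "madam").toDF("Word")

// A word is a palindrome if it equals its own reverse (ignoring case).
val checked = words.withColumn("IsPalindrome",
  lower(col("Word")) === reverse(lower(col("Word")))
)
val palindromeFlags = checked.collect().map(r => r.getString(0) -> r.getBoolean(1)).toMap

checked.show()
```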
Extract Substrings
We use substring(col(), pos, len) inside the .withColumn() method to extract a substring from the specified column, where pos is the starting position and len is the number of characters to take.
val dfSubstring = df.withColumn("Name Substring",
substring(col("Name"), 1, 3)
)
dfSubstring.show() // displays the first 3 characters of the values under "Name"
Note: Positions in substring() are 1-based rather than 0-based, so position 1 refers to the first character. Choose the starting position accordingly.
Output
+----+--------+----------------+-------------+--------------+
|Roll| Name| Dept| Location|Name Substring|
+----+--------+----------------+-------------+--------------+
| 1| Ajay| Physics| Hyderabad | Aja|
| 2|Bharghav| Cyber Security| Mumbai | Bha|
| 3| Chaitra|Material Science| Indore | Cha|
| 4| Kamal| Design| Puri | Kam|
| 5| Sohaib| Nuclear Science| Cochin | Soh|
+----+--------+----------------+-------------+--------------+
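The starting position does not have to be 1. As a sketch, assuming a local SparkSession and two names from the sample data, the example below takes three characters starting at position 2, and also uses a negative position, which Spark counts from the end of the string.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring}

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().appName("substring-positions").master("local[*]").getOrCreate()
import spark.implicits._

val names = Seq("Ajay", "Bharghav").toDF("Name")

val parts = names.select(
  col("Name"),
  substring(col("Name"), 2, 3).as("FromSecond"), // 3 characters starting at the 2nd character
  substring(col("Name"), -2, 2).as("LastTwo")    // negative position counts back from the end
)
val partsByName = parts.collect().map(r => r.getString(0) -> (r.getString(1), r.getString(2))).toMap

parts.show()
```

For "Ajay" this yields "jay" and "ay"; for "Bharghav" it yields "har" and "av".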
Concatenate String Columns
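To concatenate two string columns with a separator between them, use concat(col("col-1"), lit(" "), col("col-2")) inside the .withColumn() method. Here, lit() wraps the literal separator so that it can be used as a column expression.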
val dfConcat = df.withColumn("Name-City",
concat(col("Name"),
lit(" - "),
col("Location")
)
)
dfConcat.show(false) // truncate = false prints long values in full; "Name-City" joins Name and Location
Output
+----+--------+----------------+-------------+---------------------+
|Roll|Name |Dept |Location |Name-City |
+----+--------+----------------+-------------+---------------------+
|1 |Ajay |Physics | Hyderabad |Ajay - Hyderabad |
|2 |Bharghav|Cyber Security | Mumbai |Bharghav - Mumbai |
|3 |Chaitra |Material Science| Indore |Chaitra - Indore |
|4 |Kamal |Design | Puri |Kamal - Puri |
|5 |Sohaib |Nuclear Science | Cochin |Sohaib - Cochin |
+----+--------+----------------+-------------+---------------------+
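In the output above, the untrimmed Location carries its padding into the combined value. A minimal sketch of one way to tidy this up, assuming a local SparkSession and hypothetical data that includes a missing location, trims the column first and compares concat() with concat_ws(). concat() returns null when any input is null, while concat_ws() skips null inputs.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, concat_ws, lit, trim}

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().appName("concat-variants").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical rows: one padded location and one missing location (None becomes null).
val people = Seq[(String, Option[String])](
  ("Ajay", Some("  Hyderabad  ")),
  ("Kamal", None)
).toDF("Name", "Location")

val joined = people.select(
  col("Name"),
  concat(col("Name"), lit(" - "), trim(col("Location"))).as("WithConcat"), // null if Location is null
  concat_ws(" - ", col("Name"), trim(col("Location"))).as("WithConcatWs")  // null inputs are skipped
)
val joinedByName = joined.collect().map(r => r.getString(0) -> r).toMap

joined.show(false)
```

Which behavior is preferable depends on the data: a null result makes missing inputs visible, whereas concat_ws() always yields a usable string.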
Summary
In this article, you learned:
- What string operations are and why they matter.
- How to:
  - Convert string data to lowercase and uppercase.
  - Trim whitespace from string values.
  - Measure the length of strings.
  - Reverse strings.
  - Extract substrings.
  - Concatenate string columns.
These operations are fundamental for effective data preprocessing and transformation when working with textual data in Spark.