String Functions

Last updated on: 2025-05-30

Following our discussion of row operations, column operations, and various mathematical functions, it's time to explore another essential category: string operations.

Real-world data is rarely just numeric. Many datasets include string-type values that require special handling. String operations in Spark DataFrames allow us to clean, transform, and manipulate textual data efficiently.

String operations on DataFrames can be broadly categorized into:

  • Basic String functions
  • Substring operations
  • String concatenation

Let’s now explore how to perform these operations using a Spark DataFrame.

Before diving in, let’s create a sample DataFrame:

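import org.apache.spark.sql.functions._ // lower, upper, trim, length, reverse, substring, concat, lit, col
import spark.implicits._                // enables .toDF on a Seq; assumes `spark` is the active SparkSession (as in spark-shell)

val df = Seq(
      (1, "Ajay", "Physics","  Hyderabad  "),
      (2, "Bharghav", "Cyber Security","  Mumbai  "),
      (3, "Chaitra", "Material Science", "  Indore  "),
      (4, "Kamal", "Design", "  Puri  "),
      (5, "Sohaib", "Nuclear Science", "  Cochin  ")
    ).toDF("Roll", "Name","Dept", "Location")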

df.show()

Output

+----+--------+----------------+-------------+
|Roll|    Name|            Dept|     Location|
+----+--------+----------------+-------------+
|   1|    Ajay|         Physics|  Hyderabad  |
|   2|Bharghav|  Cyber Security|     Mumbai  |
|   3| Chaitra|Material Science|     Indore  |
|   4|   Kamal|          Design|       Puri  |
|   5|  Sohaib| Nuclear Science|     Cochin  |
+----+--------+----------------+-------------+

Convert Strings to Lowercase

To convert a string column to lowercase, pass it to lower() inside the .withColumn() method, which adds the result as a new column.

val lowerDf = df.withColumn("Lower Case", 
  lower(col("Name"))
)

lowerDf.show()

Output

+----+--------+----------------+-------------+----------+
|Roll|    Name|            Dept|     Location|Lower Case|
+----+--------+----------------+-------------+----------+
|   1|    Ajay|         Physics|  Hyderabad  |      ajay|
|   2|Bharghav|  Cyber Security|     Mumbai  |  bharghav|
|   3| Chaitra|Material Science|     Indore  |   chaitra|
|   4|   Kamal|          Design|       Puri  |     kamal|
|   5|  Sohaib| Nuclear Science|     Cochin  |    sohaib|
+----+--------+----------------+-------------+----------+
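A common use of lower() is making comparisons case-insensitive. The following is a minimal sketch, assuming a local SparkSession created just for illustration and a small hypothetical DataFrame (mixedDf) with inconsistent casing that is not part of the dataset above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lower}

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("lower-example").getOrCreate()
import spark.implicits._

// Hypothetical data: the same department spelled with different casing
val mixedDf = Seq("Physics", "PHYSICS", "physics", "Design").toDF("Dept")

// Normalize to lowercase before comparing, so all three spellings match
val physicsOnly = mixedDf.filter(lower(col("Dept")) === "physics")

val physicsCount = physicsOnly.count()
println(physicsCount) // 3
```

The original Dept values are left untouched; only the comparison uses the lowercased form.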

Convert Strings to Uppercase

To convert a string column to uppercase, pass it to upper() inside the .withColumn() method.

val upperDf = df.withColumn("Upper Case", 
  upper(col("Dept"))
)

upperDf.show()

Output

+----+--------+----------------+-------------+----------------+
|Roll|    Name|            Dept|     Location|      Upper Case|
+----+--------+----------------+-------------+----------------+
|   1|    Ajay|         Physics|  Hyderabad  |         PHYSICS|
|   2|Bharghav|  Cyber Security|     Mumbai  |  CYBER SECURITY|
|   3| Chaitra|Material Science|     Indore  |MATERIAL SCIENCE|
|   4|   Kamal|          Design|       Puri  |          DESIGN|
|   5|  Sohaib| Nuclear Science|     Cochin  | NUCLEAR SCIENCE|
+----+--------+----------------+-------------+----------------+

Trim Whitespace

Use trim(col()) to remove leading and trailing spaces from a string column.

val dfTrim = df.withColumn("Loc Trimmed", 
  trim(col("Location"))
)

dfTrim.show()

Output

+----+--------+----------------+-------------+-----------+
|Roll|    Name|            Dept|     Location|Loc Trimmed|
+----+--------+----------------+-------------+-----------+
|   1|    Ajay|         Physics|  Hyderabad  |  Hyderabad|
|   2|Bharghav|  Cyber Security|     Mumbai  |     Mumbai|
|   3| Chaitra|Material Science|     Indore  |     Indore|
|   4|   Kamal|          Design|       Puri  |       Puri|
|   5|  Sohaib| Nuclear Science|     Cochin  |     Cochin|
+----+--------+----------------+-------------+-----------+
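When only one side of the value should be cleaned, Spark also provides ltrim() and rtrim() alongside trim(). A short sketch, assuming a local SparkSession and a hypothetical one-row DataFrame (padded) used only to make the difference visible:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, ltrim, rtrim, trim}

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("trim-example").getOrCreate()
import spark.implicits._

// Hypothetical value with two spaces on each side
val padded = Seq("  Puri  ").toDF("Location")

val sides = padded.select(
  ltrim(col("Location")).as("left"),  // removes leading spaces only
  rtrim(col("Location")).as("right"), // removes trailing spaces only
  trim(col("Location")).as("both")    // removes both
).head()

// Brackets make the remaining spaces visible
println(s"[${sides.getString(0)}] [${sides.getString(1)}] [${sides.getString(2)}]")
// [Puri  ] [  Puri] [Puri]
```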

Find String Length

To find the number of characters in each value of a string column, use length(col()) inside the .withColumn() method.

val dfLength = df.withColumn("Name Length",
  length(col("Name"))
)

dfLength.show()

Output

+----+--------+----------------+-------------+-----------+
|Roll|    Name|            Dept|     Location|Name Length|
+----+--------+----------------+-------------+-----------+
|   1|    Ajay|         Physics|  Hyderabad  |          4|
|   2|Bharghav|  Cyber Security|     Mumbai  |          8|
|   3| Chaitra|Material Science|     Indore  |          7|
|   4|   Kamal|          Design|       Puri  |          5|
|   5|  Sohaib| Nuclear Science|     Cochin  |          6|
+----+--------+----------------+-------------+-----------+
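Note that length() counts every character, including leading and trailing spaces, so padded values such as those in the Location column report a larger length than the visible text suggests. A minimal sketch, assuming a local SparkSession and a hypothetical one-row DataFrame (locDf):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, length, trim}

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("length-example").getOrCreate()
import spark.implicits._

// Hypothetical value: "Puri" padded with two spaces on each side
val locDf = Seq("  Puri  ").toDF("Location")

val lengths = locDf.select(
  length(col("Location")).as("raw_len"),           // counts the padding too
  length(trim(col("Location"))).as("trimmed_len")  // counts only the trimmed text
).head()

println(s"${lengths.getInt(0)} vs ${lengths.getInt(1)}") // 8 vs 4
```

Trimming before measuring is usually what you want when the length is used for validation.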

Reverse String Values

Use reverse(col()) to reverse the characters of a string.

val dfReverse = df.withColumn("Dept Reverse", 
  reverse(col("Dept"))
)

dfReverse.show()

Output

+----+--------+----------------+-------------+----------------+
|Roll|    Name|            Dept|     Location|    Dept Reverse|
+----+--------+----------------+-------------+----------------+
|   1|    Ajay|         Physics|  Hyderabad  |         scisyhP|
|   2|Bharghav|  Cyber Security|     Mumbai  |  ytiruceS rebyC|
|   3| Chaitra|Material Science|     Indore  |ecneicS lairetaM|
|   4|   Kamal|          Design|       Puri  |          ngiseD|
|   5|  Sohaib| Nuclear Science|     Cochin  | ecneicS raelcuN|
+----+--------+----------------+-------------+----------------+
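As one illustration of where reverse() can be useful, the sketch below flags palindromes by comparing a lowercased value with its reverse. It assumes a local SparkSession and a hypothetical DataFrame (words) that is not part of the dataset above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lower, reverse}

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("reverse-example").getOrCreate()
import spark.implicits._

// Hypothetical data
val words = Seq("Level", "Kamal").toDF("Word")

// A word is a palindrome if it equals its own reverse (ignoring case)
val flagged = words.withColumn(
  "IsPalindrome",
  lower(col("Word")) === reverse(lower(col("Word")))
)

val flags = flagged.select("IsPalindrome").as[Boolean].collect().toSeq
println(flags) // true for "Level", false for "Kamal"
```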

Extract Substrings

To extract part of a string column, use substring(col(), pos, len) inside the .withColumn() method, where pos is the starting position and len is the number of characters to take.

val dfSubstring = df.withColumn("Name Substring", 
  substring(col("Name"), 1, 3)
)

dfSubstring.show() // displays the first 3 characters of the values under "Name"

Note: The position argument of substring() is 1-based, not 0-based, so position 1 refers to the first character. Choose the starting position accordingly.

Output

+----+--------+----------------+-------------+--------------+
|Roll|    Name|            Dept|     Location|Name Substring|
+----+--------+----------------+-------------+--------------+
|   1|    Ajay|         Physics|  Hyderabad  |           Aja|
|   2|Bharghav|  Cyber Security|     Mumbai  |           Bha|
|   3| Chaitra|Material Science|     Indore  |           Cha|
|   4|   Kamal|          Design|       Puri  |           Kam|
|   5|  Sohaib| Nuclear Science|     Cochin  |           Soh|
+----+--------+----------------+-------------+--------------+
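To make the position and length arguments concrete, here is a small sketch that takes slices from the start, the middle, and the end of a value. It assumes a local SparkSession and a hypothetical one-row DataFrame (names); the negative start position, which counts back from the end of the string, is behavior Spark inherits from SQL's SUBSTR:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring}

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("substring-example").getOrCreate()
import spark.implicits._

// Hypothetical one-row DataFrame
val names = Seq("Bharghav").toDF("Name")

val parts = names.select(
  substring(col("Name"), 1, 3).as("first3"), // positions 1..3
  substring(col("Name"), 4, 2).as("mid"),    // positions 4..5
  substring(col("Name"), -3, 3).as("last3")  // start 3 characters from the end
).head()

println(s"${parts.getString(0)} ${parts.getString(1)} ${parts.getString(2)}")
// Bha rg hav
```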

Concatenate String Columns

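To concatenate two string columns, use concat(col("col-1"), lit(" - "), col("col-2")) inside the .withColumn() method. The lit() call wraps a literal separator (here " - ") so that it can be used as a column expression.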

val dfConcat = df.withColumn("Name-City", 
  concat(col("Name"), 
    lit(" - "), 
    col("Location")
  )
)

dfConcat.show(false) // truncate = false: prints full values (by default, show() cuts strings longer than 20 characters)

Output

+----+--------+----------------+-------------+---------------------+
|Roll|Name    |Dept            |Location     |Name-City            |
+----+--------+----------------+-------------+---------------------+
|1   |Ajay    |Physics         |  Hyderabad  |Ajay -   Hyderabad   |
|2   |Bharghav|Cyber Security  |  Mumbai     |Bharghav -   Mumbai  |
|3   |Chaitra |Material Science|  Indore     |Chaitra -   Indore   |
|4   |Kamal   |Design          |  Puri       |Kamal -   Puri       |
|5   |Sohaib  |Nuclear Science |  Cochin     |Sohaib -   Cochin    |
+----+--------+----------------+-------------+---------------------+
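The output above carries the padding from Location into the combined value. In practice you may want to trim before concatenating, and to decide how null values should behave: concat() returns null if any input is null, while concat_ws() takes the separator as its first argument and skips null inputs. A minimal sketch, assuming a local SparkSession and a hypothetical DataFrame (people) with one missing location:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, concat_ws, lit, trim}

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("concat-example").getOrCreate()
import spark.implicits._

// Hypothetical data: the second row has no location (null)
val people = Seq[(String, Option[String])](
  ("Ajay", Some("  Hyderabad  ")),
  ("Kamal", None)
).toDF("Name", "Location")

val joined = people.select(
  concat(col("Name"), lit(" - "), trim(col("Location"))).as("with_concat"),
  concat_ws(" - ", col("Name"), trim(col("Location"))).as("with_concat_ws")
).collect()

// Row 0: both columns hold "Ajay - Hyderabad"
// Row 1: with_concat is null, with_concat_ws is "Kamal"
joined.foreach(println)
```

Which behavior is preferable depends on whether a missing part should invalidate the whole value or simply be left out.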

Summary

In this article, you learned:

  • What string operations are and why they matter.

  • How to:

    • Convert string data to lowercase and uppercase.

    • Trim whitespace from string values.

    • Measure the length of strings.

    • Reverse strings.

    • Extract substrings.

    • Concatenate string columns.

These operations are fundamental for effective data preprocessing and transformation when working with textual data in Spark.
