Creating & Using User Defined Functions
Last updated on: 2025-05-30
A User Defined Function (UDF) is a custom function you write to perform transformations on Spark data that are not readily available via Spark’s built-in functions. UDFs are especially useful for implementing complex logic, applying domain-specific rules, or performing operations not natively supported in Spark.
We typically use UDFs when the desired transformation cannot be achieved using existing Spark functions.
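For instance, capitalizing the first letter of a string is already covered by the built-in initcap function, so no UDF is needed there; it is only when the logic goes beyond the built-ins (like the second-letter capitalization below) that a UDF earns its place. A quick sketch, assuming a DataFrame df with a Name column:
import org.apache.spark.sql.functions.{col, initcap}

// no UDF needed here: initcap is a built-in that capitalizes the first letter of each word
df.withColumn("capitalized", initcap(col("Name")))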
Creating a simple UDF
Let’s start with a basic example where we capitalize the second letter of a name in a DataFrame.
Sample DataFrame: Student Records
+----+--------+-----+
|Roll| Name|Marks|
+----+--------+-----+
| 1| Ajay| 85|
| 2|Bharghav| 76|
| 3| Chaitra| 70|
| 4| Kamal| 90|
| 5| Sohaib| 83|
+----+--------+-----+
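If you want to follow along, here is a minimal sketch for building this DataFrame, assuming an active SparkSession named spark:
import spark.implicits._

// toDF assigns the column names in order
val df = Seq(
  (1, "Ajay", 85),
  (2, "Bharghav", 76),
  (3, "Chaitra", 70),
  (4, "Kamal", 90),
  (5, "Sohaib", 83)
).toDF("Roll", "Name", "Marks")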
import org.apache.spark.sql.functions.udf

// define the user defined function outside the main object
object CapitalizeSecondUDF {
  // keep the first character as-is, uppercase the second, lowercase the rest
  def capitalizeSecond(text: String): String = {
    if (text != null && text.length > 1) {
      text.substring(0, 1) + text.substring(1, 2).toUpperCase + text.substring(2).toLowerCase
    } else {
      text
    }
  }

  // wrap the plain Scala method as a Spark UDF
  val capitalizeUDF = udf(capitalizeSecond _)
}
// call the UDF on the Name column inside the main object
import org.apache.spark.sql.functions.col

val result = df.withColumn("capitalized",
  CapitalizeSecondUDF.capitalizeUDF(col("Name")))
result.show()
Output
+----+--------+-----+-----------+
|Roll| Name|Marks|capitalized|
+----+--------+-----+-----------+
| 1| Ajay| 85| AJay|
| 2|Bharghav| 76| BHarghav|
| 3| Chaitra| 70| CHaitra|
| 4| Kamal| 90| KAmal|
| 5| Sohaib| 83| SOhaib|
+----+--------+-----+-----------+
We’ve successfully transformed the Name column using a custom function, something the built-in Spark functions can’t do directly.
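The same UDF can also be made available to Spark SQL. A sketch, assuming you register the underlying function and expose the DataFrame as a temp view (the view name students is arbitrary):
// register the plain Scala function under a SQL-callable name
spark.udf.register("capitalizeSecond", CapitalizeSecondUDF.capitalizeSecond _)
df.createOrReplaceTempView("students")
spark.sql("SELECT Name, capitalizeSecond(Name) AS capitalized FROM students").show()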
More UDF Examples with a Complex Dataset
Let’s now use a more detailed dataset — a shopping bill — to demonstrate more UDF capabilities.
+--------+-----------+----------+---+------------------+-----------------+
|Item no.| Item Name| Category|MRP| Discounted Price| Price After Tax|
+--------+-----------+----------+---+------------------+-----------------+
| 1|Paper Clips|Stationery| 23| 20.7| 24.84|
| 2| Butter| Dairy| 57|51.300000000000004| 61.56|
| 3| Jeans| Clothes|799| 719.1| 862.92|
| 4| Shirt| Clothes|570| 513.0| 615.6|
| 5|Butter Milk| Dairy| 50| 45.0| 54.0|
| 6| Bag| Apparel|455| 409.5| 491.4|
| 7| Shoes| Apparel|901| 810.9|973.0799999999999|
| 8| Stapler|Stationery| 50| 45.0| 54.0|
| 9| Pens|Stationery|120| 108.0| 129.6|
+--------+-----------+----------+---+------------------+-----------------+
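As a reference point, here is one way the priceAfterTax DataFrame used below could be assembled; the 10% discount and 20% tax rates are inferred from the figures above, so treat them as assumptions:
import spark.implicits._

val items = Seq(
  (1, "Paper Clips", "Stationery", 23),
  (2, "Butter", "Dairy", 57),
  (3, "Jeans", "Clothes", 799),
  (4, "Shirt", "Clothes", 570),
  (5, "Butter Milk", "Dairy", 50),
  (6, "Bag", "Apparel", 455),
  (7, "Shoes", "Apparel", 901),
  (8, "Stapler", "Stationery", 50),
  (9, "Pens", "Stationery", 120)
).toDF("Item no.", "Item Name", "Category", "MRP")

// both derived columns use built-in column arithmetic, no UDF required
val priceAfterTax = items
  .withColumn("Discounted Price", col("MRP") * 0.9)
  .withColumn("Price After Tax", col("Discounted Price") * 1.2)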
In the example above, we defined the user function in a separate object, outside the main object, and called it from the main function. That is one way of defining a UDF. There is another: we can declare a value of type UserDefinedFunction and assign it the result of udf directly. Let us see how.
Classifying Items Based on Price
Here we define a UDF that categorizes items as Cheap, Moderate, or Expensive based on their final price.
import org.apache.spark.sql.expressions.UserDefinedFunction

// declare the UDF inline as a typed value
val classifyPriceUDF: UserDefinedFunction = udf((price: Double) => {
  if (price < 100) "Cheap"
  else if (price <= 500) "Moderate"
  else "Expensive"
})

val enrichedDF = priceAfterTax.withColumn("Price Category",
  classifyPriceUDF(col("Price After Tax")))
enrichedDF.show()
Output
+--------+-----------+----------+---+------------------+-----------------+--------------+
|Item no.| Item Name| Category|MRP| Discounted Price| Price After Tax|Price Category|
+--------+-----------+----------+---+------------------+-----------------+--------------+
| 1|Paper Clips|Stationery| 23| 20.7| 24.84| Cheap|
| 2| Butter| Dairy| 57|51.300000000000004| 61.56| Cheap|
| 3| Jeans| Clothes|799| 719.1| 862.92| Expensive|
| 4| Shirt| Clothes|570| 513.0| 615.6| Expensive|
| 5|Butter Milk| Dairy| 50| 45.0| 54.0| Cheap|
| 6| Bag| Apparel|455| 409.5| 491.4| Moderate|
| 7| Shoes| Apparel|901| 810.9|973.0799999999999| Expensive|
| 8| Stapler|Stationery| 50| 45.0| 54.0| Cheap|
| 9| Pens|Stationery|120| 108.0| 129.6| Moderate|
+--------+-----------+----------+---+------------------+-----------------+--------------+
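Once the column exists, it behaves like any other. For example, a quick count of items per bucket:
// summarize how many items fall into each price category
enrichedDF.groupBy("Price Category").count().show()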
Adding a Custom Tax Slab
Let’s define a UDF to assign a tax slab to each item based on its category.
// map each category to its tax slab via pattern matching
val taxSlabUDF = udf((category: String) => {
  category match {
    case "Stationery" => "5%"
    case "Dairy"      => "12%"
    case "Clothes"    => "18%"
    case "Apparel"    => "18%"
    case _            => "10%"
  }
})

val dfWithTaxSlab = priceAfterTax.withColumn("Tax Slab", taxSlabUDF(col("Category")))
dfWithTaxSlab.show()
Output
+--------+-----------+----------+---+------------------+-----------------+--------+
|Item no.| Item Name| Category|MRP| Discounted Price| Price After Tax|Tax Slab|
+--------+-----------+----------+---+------------------+-----------------+--------+
| 1|Paper Clips|Stationery| 23| 20.7| 24.84| 5%|
| 2| Butter| Dairy| 57|51.300000000000004| 61.56| 12%|
| 3| Jeans| Clothes|799| 719.1| 862.92| 18%|
| 4| Shirt| Clothes|570| 513.0| 615.6| 18%|
| 5|Butter Milk| Dairy| 50| 45.0| 54.0| 12%|
| 6| Bag| Apparel|455| 409.5| 491.4| 18%|
| 7| Shoes| Apparel|901| 810.9|973.0799999999999| 18%|
| 8| Stapler|Stationery| 50| 45.0| 54.0| 5%|
| 9| Pens|Stationery|120| 108.0| 129.6| 5%|
+--------+-----------+----------+---+------------------+-----------------+--------+
We used pattern matching in the UDF to dynamically return tax percentages based on item categories.
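Since Clothes and Apparel share a slab, the two cases can also be collapsed with Scala’s pattern alternation; an equivalent sketch:
// "|" matches either literal, so both categories hit the 18% branch
val taxSlabUDF = udf((category: String) => category match {
  case "Stationery"          => "5%"
  case "Dairy"               => "12%"
  case "Clothes" | "Apparel" => "18%"
  case _                     => "10%"
})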
Summary
In this article, you learned:
- What User Defined Functions (UDFs) are and when to use them.
- How to create a simple UDF to perform a string transformation.
- Two ways to declare UDFs: as a method in a separate object, or inline as a UserDefinedFunction value.
- How to use UDFs for classifying numeric values and mapping strings using pattern matching.
In the upcoming article, we’ll explore performance considerations, optimizations, and alternatives to UDFs in Spark.