User Defined Functions: Multiple Parameters & Null Value Handling
Last updated on: 2025-05-30
In our previous discussion, we covered the basics of User Defined Functions (UDFs)
in Spark — including what they are, how to define them, and different ways to implement them.
In this article, we’ll delve into more advanced use cases, such as defining UDFs with multiple input parameters and handling null values within dataframes.
We’ll continue using the shopping bill dataframe introduced earlier:
+--------+-----------+----------+---+------------------+-----------------+
|Item no.|  Item Name|  Category|MRP|  Discounted Price|  Price After Tax|
+--------+-----------+----------+---+------------------+-----------------+
|       1|Paper Clips|Stationery| 23|              20.7|            24.84|
|       2|     Butter|     Dairy| 57|51.300000000000004|            61.56|
|       3|      Jeans|   Clothes|799|             719.1|           862.92|
|       4|      Shirt|   Clothes|570|             513.0|            615.6|
|       5|Butter Milk|     Dairy| 50|              45.0|             54.0|
|       6|        Bag|   Apparel|455|             409.5|            491.4|
|       7|      Shoes|   Apparel|901|             810.9|973.0799999999999|
|       8|    Stapler|Stationery| 50|              45.0|             54.0|
|       9|       Pens|Stationery|120|             108.0|            129.6|
+--------+-----------+----------+---+------------------+-----------------+
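If you don't have this dataframe handy from the previous article, the following sketch rebuilds it. The SparkSession setup and app name are assumptions, and the 10% discount and 20% tax rates are inferred from the numbers above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder()
  .appName("UDFExamples") // hypothetical app name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Rebuild the bill: Discounted Price = MRP * 0.9, Price After Tax = Discounted Price * 1.2
val priceAfterTax = Seq(
  (1, "Paper Clips", "Stationery", 23),
  (2, "Butter", "Dairy", 57),
  (3, "Jeans", "Clothes", 799),
  (4, "Shirt", "Clothes", 570),
  (5, "Butter Milk", "Dairy", 50),
  (6, "Bag", "Apparel", 455),
  (7, "Shoes", "Apparel", 901),
  (8, "Stapler", "Stationery", 50),
  (9, "Pens", "Stationery", 120)
).toDF("Item no.", "Item Name", "Category", "MRP")
  .withColumn("Discounted Price", col("MRP") * 0.9)
  .withColumn("Price After Tax", col("Discounted Price") * 1.2)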
Defining a UDF with Multiple Input Parameters
UDFs in Spark are not limited to single-column input — they can take multiple columns as arguments and execute complex logic. Let's implement a UDF that assigns a price remark based on category, MRP, and final price after tax:
// Assign a remark by applying category-specific pricing rules
val priceRemarkUDF = udf((category: String, mrp: Double, priceAfterTax: Double) => {
  category match {
    case "Stationery" if priceAfterTax < mrp * 1.2 => "Fairly Priced"
    case "Stationery" => "Overpriced"
    case "Dairy" if priceAfterTax > mrp * 1.1 => "Overpriced"
    case "Dairy" => "Fairly Priced"
    case "Clothes" if priceAfterTax > 800 => "Expensive"
    case "Clothes" if priceAfterTax < 500 => "Cheap"
    case "Clothes" => "Moderate"
    case "Apparel" if priceAfterTax > 900 => "Overpriced"
    case "Apparel" => "Reasonable"
    case _ => "Unknown"
  }
})
val dfWithRemark = priceAfterTax.withColumn(
  "Final Remark",
  priceRemarkUDF(col("Category"), col("MRP"), col("Price After Tax"))
)

dfWithRemark.show(truncate = false)
Output
+--------+-----------+----------+---+------------------+-----------------+-------------+
|Item no.|Item Name  |Category  |MRP|Discounted Price  |Price After Tax  |Final Remark |
+--------+-----------+----------+---+------------------+-----------------+-------------+
|1       |Paper Clips|Stationery|23 |20.7              |24.84            |Fairly Priced|
|2       |Butter     |Dairy     |57 |51.300000000000004|61.56            |Fairly Priced|
|3       |Jeans      |Clothes   |799|719.1             |862.92           |Expensive    |
|4       |Shirt      |Clothes   |570|513.0             |615.6            |Moderate     |
|5       |Butter Milk|Dairy     |50 |45.0              |54.0             |Fairly Priced|
|6       |Bag        |Apparel   |455|409.5             |491.4            |Reasonable   |
|7       |Shoes      |Apparel   |901|810.9             |973.0799999999999|Overpriced   |
|8       |Stapler    |Stationery|50 |45.0              |54.0             |Fairly Priced|
|9       |Pens       |Stationery|120|108.0             |129.6            |Fairly Priced|
+--------+-----------+----------+---+------------------+-----------------+-------------+
Let's look at another example of defining a UDF that takes multiple columns as input.
Custom Discounts Based on Category and Price
Let’s say we want to apply custom discounts based on item category — but only if certain pricing conditions are met. Such conditional logic is often difficult to achieve using built-in functions, which is where UDFs shine.
// Apply an extra category-specific discount only when the pricing conditions are met
val finalDiscount = udf((category: String, discountedPrice: Double, priceAfterTax: Double) => {
  category match {
    case "Stationery" if priceAfterTax < 50 => discountedPrice * 0.95 // extra 5% off
    case "Clothes" if priceAfterTax > 800 => discountedPrice * 0.92   // extra 8% off
    case "Dairy" if discountedPrice < 50 => discountedPrice * 0.98    // extra 2% off
    case _ => discountedPrice                                         // no extra discount
  }
})
val finalDiscountBill = priceAfterTax.withColumn(
  "Final Discounted Price",
  finalDiscount(col("Category"), col("Discounted Price"), col("Price After Tax"))
)

finalDiscountBill.show()
Output
+--------+-----------+----------+---+------------------+-----------------+----------------------+
|Item no.|  Item Name|  Category|MRP|  Discounted Price|  Price After Tax|Final Discounted Price|
+--------+-----------+----------+---+------------------+-----------------+----------------------+
|       1|Paper Clips|Stationery| 23|              20.7|            24.84|                19.665|
|       2|     Butter|     Dairy| 57|51.300000000000004|            61.56|    51.300000000000004|
|       3|      Jeans|   Clothes|799|             719.1|           862.92|               661.572|
|       4|      Shirt|   Clothes|570|             513.0|            615.6|                 513.0|
|       5|Butter Milk|     Dairy| 50|              45.0|             54.0|                  44.1|
|       6|        Bag|   Apparel|455|             409.5|            491.4|                 409.5|
|       7|      Shoes|   Apparel|901|             810.9|973.0799999999999|                 810.9|
|       8|    Stapler|Stationery| 50|              45.0|             54.0|                  45.0|
|       9|       Pens|Stationery|120|             108.0|            129.6|                 108.0|
+--------+-----------+----------+---+------------------+-----------------+----------------------+
Handling Null Values with UDFs
Consider a student dataframe with some missing values:
+----+--------+-----+
|Roll|    Name|Marks|
+----+--------+-----+
|   1|    Ajay|   85|
|   2|Bharghav|   76|
|   3| Chaitra| null|
|   4|    null|   90|
|   5|  Sohaib| null|
+----+--------+-----+
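To follow along, here's a minimal sketch that builds this dataframe (assuming the spark.implicits._ import from earlier is still in scope); wrapping values in Option lets Spark infer nullable columns, with None becoming null:

// Hypothetical reconstruction of the student dataframe
val studentData = Seq(
  (1, Some("Ajay"), Some(85)),
  (2, Some("Bharghav"), Some(76)),
  (3, Some("Chaitra"), None),
  (4, None, Some(90)),
  (5, Some("Sohaib"), None)
).toDF("Roll", "Name", "Marks")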
Let us try to replace null values using UDFs.
val replaceNullName = udf((name: String) => {
  // Option(null) is None, so nulls fall through to the default
  Option(name).getOrElse("Unknown")
})

val replaceNullMarks = udf((marks: Integer) => {
  // java.lang.Integer (unlike Scala's Int) can be null; convert back
  // to Int so the UDF returns a concrete integer type
  Option(marks).map(_.toInt).getOrElse(40)
})

val resultDF = studentData
  .withColumn("Name", replaceNullName(col("Name")))
  .withColumn("Marks", replaceNullMarks(col("Marks")))

resultDF.show()
Here, we define two UDFs: one replaces null values in the Name column, and the other replaces null values in the Marks column. Note that replaceNullMarks declares its parameter as java.lang.Integer rather than Scala's Int: a primitive Int can never be null, so the boxed type is required for the null check to work.
Output
+----+--------+-----+
|Roll|    Name|Marks|
+----+--------+-----+
|   1|    Ajay|   85|
|   2|Bharghav|   76|
|   3| Chaitra|   40|
|   4| Unknown|   90|
|   5|  Sohaib|   40|
+----+--------+-----+
Note: Although UDFs can handle null values, it's generally more efficient to use Spark's built-in functions such as na.fill() or na.replace() for this task, especially on large datasets, because UDFs are opaque to the Catalyst optimizer.
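For instance, the two UDFs above collapse into a single na.fill call. This sketch uses the same studentData dataframe and the same defaults as before:

// Built-in equivalent: na.fill maps each column to its default value
val filledDF = studentData.na.fill(Map("Name" -> "Unknown", "Marks" -> 40))
filledDF.show()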
Summary
In this article, you learned:
- How to write UDFs that accept multiple column values as input.
- How to apply category-specific pricing logic using UDFs.
- How to replace null values with default values using UDFs.
While UDFs offer flexibility, prefer Spark's built-in functions where possible: they benefit from Catalyst optimization, which UDFs do not.