Skip to content

User Defined Functions

Spark user defined function is an important and powerful feature. Let's see how you can leverage this in the punch. Given a people dataset as follows:

age name sex height weight
10 John Snow M 2 80
15 Rick Grimes M 1.8 90
36 Micheal Jackson M 1.75 75

Say you want to add a columns with the age converted from years into months.

num_years age name sex height weight
120 10 John Snow M 2 80
180 15 Rick Grimes M 1.8 90
432 36 Micheal Jackson M 1.75 75

The proper way to do that is to add to spark a new function, call it for example yearToMonth, and invoke it from within SQL as follows:

SELECT yearToMonth(age) AS num_years, * FROM people_dataset

UDFs are functions that takes a number of parameters. Those parameters can either be multiple(s) column(s) and/or constant variables which can be used as options in your UDF code. UDFs returns only a single typed column.

If you need several columns, do it in two steps. Spark data types supports nested data structures, you can first generate a single column containing a nested structure. Next, use of the built-in SQL functions to explode the nested result as multiple columns !