User Defined Functions

Spark user defined functions (UDFs) are an important and powerful feature. Let's see how you can leverage them in the punch. Given a people dataset as follows:

age	name	sex	height	weight
10	John Snow	M	2	80
15	Rick Grimes	M	1.8	90
36	Micheal Jackson	M	1.75	75

Say you want to add a column with the age converted from years into months:

num_years	age	name	sex	height	weight
120	10	John Snow	M	2	80
180	15	Rick Grimes	M	1.8	90
432	36	Micheal Jackson	M	1.75	75

The proper way to do that is to register a new function in Spark, called for example yearToMonth, and invoke it from within SQL as follows:

SELECT yearToMonth(age) AS num_years, * FROM people_dataset
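
As an illustration, here is a minimal PySpark sketch of how such a function could be registered and called. It uses plain Spark rather than any punch-specific packaging, and it assumes an existing SparkSession plus a temporary view named people_dataset; the Python helper names (year_to_month, register_year_to_month, add_month_column) are hypothetical.

```python
def year_to_month(age):
    """Convert an age in years to months; a NULL (None) age stays NULL."""
    return None if age is None else age * 12


def register_year_to_month(spark):
    """Expose year_to_month to Spark SQL under the name yearToMonth.

    `spark` is assumed to be an existing SparkSession.
    """
    # Imported lazily so the plain function above can be unit-tested
    # on a machine where Spark is not installed.
    from pyspark.sql.types import IntegerType

    spark.udf.register("yearToMonth", year_to_month, IntegerType())


def add_month_column(spark):
    """Run the query from the text; assumes a view named people_dataset exists."""
    return spark.sql("SELECT yearToMonth(age) AS num_years, * FROM people_dataset")
```

With a session at hand, you would call register_year_to_month(spark) once, then add_month_column(spark).show() to display the result. Keeping the conversion logic in an ordinary function makes it easy to test without starting Spark.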

UDFs are functions that take a number of parameters. These parameters can be one or more columns and/or constant values, which can be used as options in your UDF code. A UDF returns a single typed column.
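
To make the distinction between column and constant parameters concrete, the following PySpark sketch defines a UDF that receives the age column plus a constant unit option. The names convertAge, convert_age and UNIT_FACTORS are hypothetical, and the code assumes an existing SparkSession and a view named people_dataset.

```python
# Hypothetical option table: how many of each unit make up one year.
UNIT_FACTORS = {"years": 1, "months": 12}


def convert_age(age, unit):
    """Scale an age in years by the factor selected with the `unit` option."""
    if age is None:
        return None
    return age * UNIT_FACTORS[unit]


def register_convert_age(spark):
    """Expose convert_age to Spark SQL under the name convertAge."""
    # Lazy import keeps the plain function testable without Spark.
    from pyspark.sql.types import IntegerType

    spark.udf.register("convertAge", convert_age, IntegerType())


def ages_in_months(spark):
    """`age` is a column; 'months' is a constant passed to every row."""
    return spark.sql(
        "SELECT convertAge(age, 'months') AS age_in_months, name FROM people_dataset"
    )
```

Whatever its inputs, the function still produces exactly one value per row, which Spark exposes as a single column of the declared type (IntegerType here).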

If you need several columns, do it in two steps. Spark data types support nested data structures, so you can first generate a single column containing a nested structure. Next, use the built-in SQL functions to expand the nested result into multiple columns.
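
The two steps can be sketched in PySpark as follows. This is only one possible approach: the UDF returns a struct, and the star syntax on a struct column then expands its fields into top-level columns. The names splitName, split_name, first_name and last_name are hypothetical; an existing SparkSession and a view named people_dataset are assumed.

```python
def split_name(full_name):
    """Split 'First Last' at the first space into a (first, last) pair."""
    if full_name is None:
        return None
    first, _, last = full_name.partition(" ")
    return (first, last)


def register_split_name(spark):
    """Step 1: register a UDF whose single output column is a struct."""
    # Lazy import keeps the plain function testable without Spark.
    from pyspark.sql.types import StringType, StructField, StructType

    name_struct = StructType([
        StructField("first_name", StringType()),
        StructField("last_name", StringType()),
    ])
    spark.udf.register("splitName", split_name, name_struct)


def names_as_columns(spark):
    """Step 2: expand the struct fields into separate columns with n.*"""
    return spark.sql(
        "SELECT n.*, age "
        "FROM (SELECT splitName(name) AS n, age FROM people_dataset) t"
    )
```

The inner query produces one nested column n, and the outer query turns its fields into the columns first_name and last_name.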