User Defined Functions¶
Spark user defined function is an important and powerful feature. Let's see how you can leverage this in the punch. Given a people dataset as follows:
age | name | sex | height | weight |
---|---|---|---|---|
10 | John Snow | M | 2 | 80 |
15 | Rick Grimes | M | 1.8 | 90 |
36 | Micheal Jackson | M | 1.75 | 75 |
Say you want to add a columns with the age converted from years into months.
num_years | age | name | sex | height | weight |
---|---|---|---|---|---|
120 | 10 | John Snow | M | 2 | 80 |
180 | 15 | Rick Grimes | M | 1.8 | 90 |
432 | 36 | Micheal Jackson | M | 1.75 | 75 |
The proper way to do that is to add to spark a new function, call it for example yearToMonth
,
and invoke it from within SQL as follows:
SELECT yearToMonth(age) AS num_years, * FROM people_dataset
UDFs are functions that takes a number of parameters. Those parameters can either be multiple(s) column(s) and/or constant variables which can be used as options in your UDF code. UDFs returns only a single typed column.
If you need several columns, do it in two steps. Spark data types supports nested data structures, you can first generate a single column containing a nested structure. Next, use of the built-in SQL functions to explode the nested result as multiple columns !