UDF: User Defined Function¶
Let's enrich our data with our own UDF, yearToMonth.
yearToMonth is a function that takes one parameter of type Integer as input and returns an Integer: the input value, given in years, converted to months.
people_dataset
age | name | sex | height | weight |
---|---|---|---|---|
10 | John Snow | M | 2 | 80 |
15 | Rick Grimes | M | 1.8 | 90 |
36 | Micheal Jackson | M | 1.75 | 75 |
SELECT yearToMonth(age) AS num_years, * FROM people_dataset
OUTPUT
num_years | age | name | sex | height | weight |
---|---|---|---|---|---|
120 | 10 | John Snow | M | 2 | 80 |
180 | 15 | Rick Grimes | M | 1.8 | 90 |
432 | 36 | Micheal Jackson | M | 1.75 | 75 |
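The logic behind yearToMonth can be sketched in plain Java as follows. This is only a minimal illustration under stated assumptions: the class name `YearToMonthSketch` and the `AS_FUNCTION` wrapper are hypothetical, and the real UDF from the starter kit may be written differently. The sketch deliberately avoids a Spark dependency so it runs on its own.

```java
import java.util.function.Function;

// Hypothetical sketch: the class name and the Function wrapper are illustrative
// assumptions, not the actual UDF shipped with the starter kit.
final class YearToMonthSketch {

    // Core logic of the UDF: one Integer in, one Integer out.
    // A null input yields null (an assumption about how missing ages should behave).
    public static Integer yearToMonth(Integer years) {
        return years == null ? null : years * 12;
    }

    // With Spark on the classpath, the same logic would typically be wrapped in
    // org.apache.spark.sql.api.java.UDF1<Integer, Integer> and registered with
    // spark.udf().register("yearToMonth", udf, DataTypes.IntegerType).
    public static final Function<Integer, Integer> AS_FUNCTION = YearToMonthSketch::yearToMonth;

    public static void main(String[] args) {
        int[] ages = {10, 15, 36}; // the ages from people_dataset
        for (int age : ages) {
            System.out.println(age + " -> " + yearToMonth(age));
        }
    }
}
```

Applied to the three ages of people_dataset, this yields 120, 180 and 432, matching the first column of the OUTPUT table.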
In brief...¶
You can view UDFs in Spark's ecosystem as a means to simplify data processing or data enrichment.
In general, a UDF is a function that takes a given number of parameters. Those parameters can be one or more columns and/or constant values, which can be used as options in your UDF code. A UDF returns a single column, whose type must be one of the Spark data types. Since Spark's data types support nested data structures, you can still output multiple values inside a single column. Later on, you can use some of the built-in SQL functions to explode the nested result into multiple columns.
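To make the idea of "several column parameters plus a constant option, one nested result" concrete, here is a minimal plain-Java sketch. Everything in it is hypothetical: the names `bodyStats`, `BodyStats` and `decimals` are invented for illustration, and the units (height in metres, weight in kilograms) are an assumption about people_dataset. In an actual Spark UDF, the nested value would be returned as a struct column, whose fields can then be expanded in SQL (for example with `SELECT stats.*`).

```java
// Hypothetical sketch: bodyStats, BodyStats and the "decimals" option are illustrative
// names, not part of Spark or of the PunchPlatform starter kit.
final class BodyStatsSketch {

    // Plays the role of the single nested (struct-like) column returned by the UDF.
    static final class BodyStats {
        public final double bmi;
        public final double heightCm;

        BodyStats(double bmi, double heightCm) {
            this.bmi = bmi;
            this.heightCm = heightCm;
        }
    }

    // Two column parameters (height assumed in metres, weight assumed in kilograms)
    // plus one constant option (number of decimals to keep) produce one nested result.
    public static BodyStats bodyStats(double heightM, double weightKg, int decimals) {
        double scale = Math.pow(10, decimals);
        double bmi = Math.round(weightKg / (heightM * heightM) * scale) / scale;
        double heightCm = Math.round(heightM * 100 * scale) / scale;
        return new BodyStats(bmi, heightCm);
    }

    public static void main(String[] args) {
        // Height and weight taken from the Rick Grimes row, with the option decimals = 1.
        BodyStats stats = bodyStats(1.8, 90, 1);
        System.out.println("bmi=" + stats.bmi + ", height_cm=" + stats.heightCm);
    }
}
```

Grouping the two computed values in one object mirrors the constraint described above: the UDF still returns a single column, and splitting it into separate columns is left to the query.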
Built-in functions¶
Spark Built-in¶
To have a look at the built-in API packaged with Spark, refer to: Built-in-Functions
PunchPlatform Built-in¶
Refer to Here
Spark DataTypes¶
Follow this link here
Developing and installing your custom UDF¶
To develop your own UDF, we provide a starter-kit Maven project.
Feel free to use it: UDF maven project starter-kit
After building the package, you can follow the installation guide: Installation Guide