UDF: User Defined Function
Let's try enriching our data with our own UDF.
yearToMonth is a function that takes one parameter of type Integer and returns an Integer:
SELECT yearToMonth(age) AS num_months, * FROM people_dataset
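As a minimal sketch, the logic behind such a function might look like the plain Java class below. The class name `YearToMonth`, the multiply-by-12 conversion, and the null handling are assumptions made for illustration; the actual implementation is not shown here. In a Spark project, this logic would typically be wrapped in Spark's Java `UDF1<Integer, Integer>` interface and registered under the name `yearToMonth` so that the query above can call it.

```java
import java.util.function.Function;

// Hypothetical sketch of the logic behind yearToMonth; not the actual implementation.
class YearToMonth implements Function<Integer, Integer> {

    // Assumption: the function converts a number of years into a number of months.
    @Override
    public Integer apply(Integer years) {
        // SQL columns can contain NULL, so pass null through instead of failing.
        if (years == null) {
            return null;
        }
        return years * 12;
    }

    public static void main(String[] args) {
        YearToMonth yearToMonth = new YearToMonth();
        System.out.println(yearToMonth.apply(30)); // prints 360
    }
}
```

Keeping the conversion in a small class with no Spark dependency makes it easy to unit test before packaging it as a UDF.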
You can view UDFs in Spark's ecosystem as a means to simplify data processing or data enrichment.
In general, a UDF is a function that takes a given number of parameters. Those parameters can be columns and/or defined variables that act as options in the UDF code. A UDF returns only a single column, whose type must be one of Spark's data types. Since Spark's data types support nested data structures, you can still pack your desired output into a single column. Later on, you can use some of the built-in SQL functions to expand the nested result into multiple columns.
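The idea of one column input plus an option, with several values returned in a single nested result, can be sketched in plain Java as follows. Everything here is hypothetical: the class name `AgeDetails`, the `referenceYear` option, and the field names `months` and `birth_year` are illustrative only. In Spark, the returned value would also need a matching nested return type (such as a struct or a map) declared when the UDF is registered.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: one call produces several values packed into a single nested result.
class AgeDetails {

    // age: a value coming from a column; referenceYear: an option passed as a variable.
    static Map<String, Integer> call(Integer age, int referenceYear) {
        // Pass NULL column values through untouched.
        if (age == null) {
            return null;
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        result.put("months", age * 12);
        // Approximate birth year: ignores where the birthday falls within the year.
        result.put("birth_year", referenceYear - age);
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> details = AgeDetails.call(30, 2024);
        // Reading the nested fields back out as separate values is the plain-Java
        // analogue of expanding a nested column into several columns.
        System.out.println(details.get("months"));     // prints 360
        System.out.println(details.get("birth_year")); // prints 1994
    }
}
```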
If you want to have a look at the built-in API packaged with Spark, refer to: Built-in-Functions
Refer to Here
Developing and installing your custom UDF
To develop your own UDF, we provide a starter-kit Maven project.
Feel free to use it: UDF Maven project starter-kit
After you have built the package, you can follow the installation guide: Installation Guide