
UDF: User Defined Function

Let's enrich our data with our own UDF, yearToMonth:

yearToMonth is a function that takes one parameter of type Integer (an age in years) and returns an Integer (the same age in months).

people_dataset

age  name             sex  height  weight
10   John Snow        M    2       80
15   Rick Grimes      M    1.8     90
36   Micheal Jackson  M    1.75    75
SELECT yearToMonth(age) AS num_months, * FROM people_dataset

OUTPUT

num_months  age  name             sex  height  weight
120         10   John Snow        M    2       80
180         15   Rick Grimes      M    1.8     90
432         36   Micheal Jackson  M    1.75    75
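
The first column of the output above is simply each age multiplied by 12. As a minimal sketch, the logic behind yearToMonth could look like the plain Java below. It has no Spark dependency; the class name, the constant and the null handling are assumptions made for illustration. In a Spark job this logic would typically sit behind Spark's Java UDF interface (for example `org.apache.spark.sql.api.java.UDF1`), and the starter-kit may structure it differently.

```java
// Plain-Java sketch of the yearToMonth logic (no Spark dependency).
// Class name and null handling are illustrative assumptions.
class YearToMonth {
    static final int MONTHS_PER_YEAR = 12;

    // Takes one Integer (age in years) and returns one Integer (age in months).
    // A null input yields null, assuming a SQL NULL should stay NULL.
    static Integer yearToMonth(Integer years) {
        if (years == null) {
            return null;
        }
        return years * MONTHS_PER_YEAR;
    }

    public static void main(String[] args) {
        // Ages taken from people_dataset.
        int[] ages = {10, 15, 36};
        for (int age : ages) {
            System.out.println(age + " -> " + yearToMonth(age));
        }
    }
}
```

Running `main` prints the pairs 10 -> 120, 15 -> 180 and 36 -> 432, matching the table.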

In brief...

You can view UDFs in Spark's ecosystem as a way to simplify data processing or data enrichment.

In general, a UDF is a function that takes a given number of parameters. These parameters can be columns and/or fixed values that act as options in the UDF code. A UDF returns a single column whose type is one of Spark's data types. Since Spark's data types support nested data structures, you can still pack the output you want into a single column. Later on, you can use some of the built-in SQL functions to expand the nested result into multiple columns.
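
To make the "several inputs, one nested output" idea concrete, here is a hedged plain-Java sketch. The function `bodyStats`, its field names and the `decimals` option are hypothetical, and the example assumes height is in metres and weight in kilograms, which the dataset above does not state. It shows only the shape of such a function; how the nested value is declared as a Spark data type and registered as a UDF is not covered here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch (no Spark dependency). bodyStats and its field names are hypothetical.
class BodyStats {
    // heightM and weightKg stand for two input columns;
    // decimals stands for a fixed option passed to the function.
    // The single return value is nested: one map holding several named results.
    static Map<String, Double> bodyStats(double heightM, double weightKg, int decimals) {
        double scale = Math.pow(10, decimals);
        Map<String, Double> stats = new LinkedHashMap<>();
        stats.put("bmi", roundTo(weightKg / (heightM * heightM), scale));
        stats.put("height_cm", roundTo(heightM * 100.0, scale));
        return stats;
    }

    // Rounds value to the number of decimals encoded in scale (10^decimals).
    static double roundTo(double value, double scale) {
        return Math.round(value * scale) / scale;
    }

    public static void main(String[] args) {
        // First row of people_dataset: height 2, weight 80.
        Map<String, Double> stats = bodyStats(2.0, 80.0, 1);
        System.out.println(stats);
    }
}
```

A map is used here only because it is the simplest nested container in plain Java; each key would correspond to one of the columns you later extract from the nested result.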

Built-in functions

Spark Built-in

To browse the built-in functions packaged with Spark, refer to: Built-in-Functions

PunchPlatform Built-in

Refer to Here

Spark DataTypes

Follow

Developing your custom UDF and installing it

To develop your own UDF, we provide a Maven starter-kit project.

Feel free to use it: UDF maven project starter-kit

After you have built the package, follow the installation guide: Installation Guide