
UDF: User Defined Function

Let's try enriching our data with our own UDF, yearToMonth:

yearToMonth takes one parameter of type Integer and returns an Integer: the input value converted from years to months (that is, multiplied by 12).


age  name             sex  height  weight
10   John Snow        M    2       80
15   Rick Grimes      M    1.8     90
36   Micheal Jackson  M    1.75    75
SELECT yearToMonth(age) AS num_months, * FROM people_dataset


num_months  age  name             sex  height  weight
120         10   John Snow        M    2       80
180         15   Rick Grimes      M    1.8     90
432         36   Micheal Jackson  M    1.75    75
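
As a rough illustration of what such a function could look like, here is a minimal PySpark sketch. It is not the starter-kit implementation: the Python names year_to_month and register_year_to_month are hypothetical, and the registration step assumes PySpark is installed and that you already have an active SparkSession.

```python
def year_to_month(age):
    """Convert a number of years to months.

    Returns None when the input is None, so a SQL NULL stays NULL.
    """
    if age is None:
        return None
    return age * 12


def register_year_to_month(spark):
    """Expose year_to_month to Spark SQL under the name yearToMonth.

    Assumption: `spark` is an active pyspark.sql.SparkSession.
    The import is done here so the pure function above can be used
    and tested without Spark installed.
    """
    from pyspark.sql.types import IntegerType

    spark.udf.register("yearToMonth", year_to_month, IntegerType())
```

Once registered this way, a query run through the same session can call yearToMonth(age) like any built-in function.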

In brief...

You can view UDFs in Spark's ecosystem as a means to simplify data processing or data enrichment.

In general, a UDF is a function that takes a given number of parameters. Those parameters can be one or more columns and/or constant values, which can serve as options in your UDF code. A UDF returns a single column, whose type must be one of Spark's data types. Since Spark's data types support nested data structures, you can still output multiple columns inside a single one. Later on, you can use some of the built-in SQL functions to explode the nested result into multiple columns.
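
The nested-output idea can be sketched as follows, again in PySpark. Everything named here is an assumption made for illustration: the UDF bodyMetrics, the fields height_cm and bmi, and the reading of height as meters and weight as kilograms (the sample data does not state its units).

```python
def body_metrics(height_m, weight_kg):
    """Return two derived values as one tuple (one struct column in Spark).

    Assumption: height is in meters and weight in kilograms.
    Returns None (a NULL struct) when an input is missing or invalid.
    """
    if height_m is None or weight_kg is None or height_m <= 0:
        return None
    height_cm = int(round(height_m * 100))
    bmi = weight_kg / (height_m * height_m)
    return (height_cm, bmi)


def register_body_metrics(spark):
    """Register body_metrics as a UDF that returns a struct.

    Assumption: `spark` is an active pyspark.sql.SparkSession.
    """
    from pyspark.sql.types import (
        DoubleType,
        IntegerType,
        StructField,
        StructType,
    )

    # The struct fields are matched to the returned tuple by position.
    schema = StructType([
        StructField("height_cm", IntegerType()),
        StructField("bmi", DoubleType()),
    ])
    spark.udf.register("bodyMetrics", body_metrics, schema)
```

With this registered, a query such as SELECT name, m.* FROM (SELECT name, bodyMetrics(height, weight) AS m FROM people_dataset) would turn the single struct column into the two columns height_cm and bmi.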

Built-in functions

Spark Built-in

To have a look at the built-in API packaged with Spark, refer to: Built-in-Functions

PunchPlatform Built-in

Refer to Here

Spark DataTypes

Follow this link here

Developing and installing your custom UDF

To develop your own UDF, we provide a starter-kit Maven project.

Feel free to use it: UDF maven project starter-kit

After building the package, you can follow the installation guide: Installation Guide