
Track 4: Extending Spark SQL with UDFs and UDAFs in Java


This track explains how to implement your own Spark SQL UDFs and UDAFs.

Such an implementation is directly usable within a SQL query from both the Spark Java and Python APIs.

Check out the $PUNCHPLATFORM_CONF_DIR/training/aim/track4 folder. All files referenced in this chapter are located in that folder. Start by reading them carefully.

Dependency Management

Read the Dependency Management Guide to understand the issues at stake.



UDF

Stands for User Defined Function.

A function that is applied to a Spark DataFrame row by row.
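As an illustration, a minimal Java UDF can implement Spark's `UDF1` interface (the class name and greeting logic below are hypothetical, not taken from the starter kit):

```java
import org.apache.spark.sql.api.java.UDF1;

// Hypothetical example: a UDF that prefixes its string input with "Hello ".
// It is applied row by row; a null input simply yields a null output.
public class HelloUdf implements UDF1<String, String> {
    @Override
    public String call(String name) throws Exception {
        return name == null ? null : "Hello " + name;
    }
}
```

The `call` method is invoked once per row on the column value the UDF is bound to.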


UDAF

Stands for User Defined Aggregate Function.

A function that is applied to a Spark DataFrame of N rows and M columns. The function outputs a single aggregated value computed over all the input rows.
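One way to write a UDAF in modern Spark is to extend `org.apache.spark.sql.expressions.Aggregator` (in Spark 3.x this is the recommended API; older versions used `UserDefinedAggregateFunction`). The following sketch is illustrative and not the starter kit's code:

```java
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.expressions.Aggregator;

// Hypothetical example: an aggregator that sums a column of longs.
// Input type, buffer type, and output type are all Long here.
public class SumAggregator extends Aggregator<Long, Long, Long> {
    @Override public Long zero() { return 0L; }                        // initial buffer value
    @Override public Long reduce(Long buffer, Long value) { return buffer + value; }
    @Override public Long merge(Long b1, Long b2) { return b1 + b2; }  // combine partial buffers
    @Override public Long finish(Long reduction) { return reduction; } // final result
    @Override public Encoder<Long> bufferEncoder() { return Encoders.LONG(); }
    @Override public Encoder<Long> outputEncoder() { return Encoders.LONG(); }
}
```

Spark calls `reduce` within a partition, `merge` across partitions, and `finish` on the final buffer, which is why an aggregator produces one value from N input rows.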


To benefit from the Spark SQL optimization engine, new Spark UDFs should be implemented in Java or Scala.

UDAFs, on the other hand, can only be implemented in Java or Scala.
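Once compiled, a UDF only needs to be registered on the session to become callable from plain SQL. `spark.udf().register` is the standard Spark API; the session setup and function below are an illustrative sketch, not part of the starter kit:

```java
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// Hypothetical sketch: register a Java UDF and call it from a SQL query.
public class UdfDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("udf-demo").master("local[*]").getOrCreate();

        // Once registered under a name, the function is usable in any SQL query.
        spark.udf().register("hello",
                (UDF1<String, String>) s -> "Hello " + s,
                DataTypes.StringType);

        Row row = spark.sql("SELECT hello('world') AS greeting").first();
        System.out.println(row.getString(0));
        spark.stop();
    }
}
```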

You can use the IDE (or text editor) of your choice.

Use the punchpkg tool to package and deploy your nodes on your local standalone platform. Refer to the PunchPkg section.


Working directory

cd $PUNCHPLATFORM_CONF_DIR/training/aim/track4


Prerequisites:

  • A standalone PunchPlatform installed
  • Basic knowledge of Spark DataTypes/Scala DataTypes
  • Basic knowledge of the Spark UDF API

UDF & UDAF: Try it out!

Use Maven to package the compiled bytecode into a jar.
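The `jar-with-dependencies` suffix used below suggests the starter kit's pom relies on the Maven assembly plugin. As a reminder, a typical configuration for producing such a jar looks like this (sketch only; the actual pom in the track4 folder is authoritative):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>single</goal></goals>
    </execution>
  </executions>
</plugin>
```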

Check out the implementation:

# build the project
mvn clean install

# a jar with dependencies is generated in $(pwd)/target
# install it with punchpkg
punchpkg spark install target/punchplatform-udf-starter-kit-*-jar-with-dependencies.jar

# now try without the installed udf dependency
punchlinectl start -p before_udf_helloworld.yaml

# let's try with our udf dependency
punchlinectl start -p after_udf_helloworld.yaml