Track 4: Extending Spark SQL with UDFs and UDAFs in Java¶
Abstract
This track explains how to implement your own Spark SQL UDFs and UDAFs.
Such implementations are directly usable within an SQL query, from both Java Spark and Python (PySpark) applications.
Check out the $PUNCHPLATFORM_CONF_DIR/training/aim/track4
folder. All files referenced in this chapter are located in that folder.
First, read the README.md file carefully.
Dependency Management¶
Read the Dependency Management Guide to understand the issues at stake.
Development¶
UDF
Stands for user-defined function: a function that is applied to a Spark dataframe row by row.
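To make the row-by-row contract concrete, here is a minimal plain-Java sketch of the kind of logic a UDF such as StrToArrayString might implement. The comma-splitting rule is an assumption for illustration only; the actual implementation is in StrToArrayString.java in the track folder. In Spark this logic would be wrapped in an `org.apache.spark.sql.api.java.UDF1` and registered on the Spark session.

```java
// Hypothetical sketch of the per-row logic of a string-to-array UDF.
// Assumption: the input is a comma-separated string (the real rule
// is defined in StrToArrayString.java).
public class StrToArrayStringSketch {

    // Called once per dataframe row: one input string in,
    // one array of strings out.
    public static String[] call(String input) {
        if (input == null) {
            // Return an empty array rather than null so the
            // resulting column is always a valid array.
            return new String[0];
        }
        return input.split(",");
    }

    public static void main(String[] args) {
        String[] out = call("a,b,c");
        System.out.println(String.join("|", out)); // prints a|b|c
    }
}
```

Once packaged and registered under a SQL name, such a function can be invoked directly from a query, which is what makes UDFs usable from both Java and Python punchlines.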
UDAF
Stands for user-defined aggregate function: a function that is applied to a Spark dataframe of N rows and M columns. It outputs a single value by aggregating all the input rows.
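The aggregation lifecycle Spark drives has four steps: an initial (zero) buffer, a per-row update (reduce), a merge of partial buffers computed on different partitions, and a final result (finish). The plain-Java sketch below mirrors that contract for a sum aggregate, without the Spark dependency; the real node (see SumAggregate.java in the track folder) would implement it against Spark's aggregation API, e.g. `org.apache.spark.sql.expressions.Aggregator`.

```java
// Hypothetical sketch of the UDAF lifecycle for a sum aggregate.
// Spark calls these steps for you; this class only illustrates
// the zero / reduce / merge / finish contract.
public class SumAggregateSketch {

    // zero: the neutral starting value of the aggregation buffer
    private long buffer = 0L;

    // reduce: fold one input row's value into the buffer
    public void reduce(long value) {
        buffer += value;
    }

    // merge: combine a partial buffer computed on another partition
    public void merge(SumAggregateSketch other) {
        buffer += other.buffer;
    }

    // finish: produce the single aggregated output value
    public long finish() {
        return buffer;
    }

    public static void main(String[] args) {
        // Partition 1 sees rows 1 and 2; partition 2 sees row 3.
        SumAggregateSketch p1 = new SumAggregateSketch();
        p1.reduce(1);
        p1.reduce(2);
        SumAggregateSketch p2 = new SumAggregateSketch();
        p2.reduce(3);
        p1.merge(p2);
        System.out.println(p1.finish()); // prints 6
    }
}
```

The merge step is what makes the function safe to run distributed: each executor aggregates its own partition, and Spark combines the partial buffers afterwards.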
Audience
To benefit from the Spark SQL optimization engine, new Spark UDFs should be implemented in Java
or Scala.
UDAFs, on the other hand, can only be implemented in Java or Scala.
You can use the IDE (or text editor) of your choice.
Use the punchpkg
tool to package and deploy your nodes on your local standalone platform.
Refer to the PunchPkg Section.
Example¶
Working directory

```sh
cd $PUNCHPLATFORM_CONF_DIR/training/aim/track4
```
Prerequisite¶
- A standalone PunchPlatform installed:
https://punchplatform.com
- Basic knowledge of Spark DataTypes / Scala DataTypes
- Basic knowledge of the Spark UDF API
UDF & UDAF: Try It Out!¶
Use Maven to package the compiled bytecode into a jar.
Check out the implementations:
- StrToArrayString.java
- SumAggregate.java
```sh
# build the project
mvn clean install

# a jar with dependencies is generated in $(pwd)/target;
# install it with punchpkg
punchpkg spark install target/punchplatform-udf-starter-kit-*-jar-with-dependencies.jar

# first, try a punchline without the installed UDF dependency
punchlinectl start -p before_udf_helloworld.yaml

# now try again with our UDF dependency
punchlinectl start -p after_udf_helloworld.yaml
```