
Track 4: Extending Spark SQL with UDFs and UDAFs in Java

Abstract

This track explains how to implement your own Spark SQL UDFs and UDAFs.

The resulting functions are directly usable within an SQL query from both the Spark Java and Python APIs.

Check out the $PUNCHPLATFORM_CONF_DIR/training/aim/track4 folder. All files referenced in this chapter are located in that folder. Start by reading the README.md file carefully.

Dependency Management

Read the Dependency Management Guide to understand the issues at stake.

Development

UDF

Stands for User Defined Function.

A function applied to a Spark dataframe row by row.
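As an illustration, here is a minimal, hypothetical sketch of the per-row logic a UDF encapsulates, written as plain Java with no Spark dependency (in an actual Spark UDF this logic would implement the org.apache.spark.sql.api.java.UDF1 interface; the class and method names below are illustrative):

```java
// Hypothetical sketch (plain Java, no Spark dependency): the per-row logic
// a UDF such as StrToArrayString might encapsulate. In Spark this class
// would implement org.apache.spark.sql.api.java.UDF1<String, String[]>.
public class StrToArraySketch {

    // Called once per dataframe row: split a whitespace-delimited string.
    public static String[] call(String input) {
        if (input == null) {
            return new String[0];
        }
        return input.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String[] tokens = call("hello spark udf");
        System.out.println(String.join(",", tokens)); // prints "hello,spark,udf"
    }
}
```

The point to remember is that the function sees one row at a time and returns one value per row.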

UDAF

Stands for User Defined Aggregate Function.

A function applied to a Spark dataframe with N rows and M columns. It outputs a single row and column value by aggregating all the input rows.
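A UDAF, by contrast, folds many rows into one value through an intermediate buffer. Here is a hypothetical plain-Java sketch of that lifecycle (Spark's org.apache.spark.sql.expressions.Aggregator follows the same zero/reduce/merge/finish shape; the class name below is illustrative):

```java
// Hypothetical sketch (plain Java, no Spark dependency): the buffer lifecycle
// of a UDAF such as SumAggregate. Spark's Aggregator<IN, BUF, OUT> exposes
// the same four steps: zero, reduce, merge, finish.
public class SumAggregateSketch {

    public static long zero() { return 0L; }                                  // empty buffer
    public static long reduce(long buffer, long row) { return buffer + row; } // fold one row in
    public static long merge(long left, long right) { return left + right; }  // combine partition buffers
    public static long finish(long buffer) { return buffer; }                 // final single value

    // N input rows in, one value out.
    public static long aggregate(long[] rows) {
        long buffer = zero();
        for (long row : rows) {
            buffer = reduce(buffer, row);
        }
        return finish(buffer);
    }

    public static void main(String[] args) {
        System.out.println(aggregate(new long[]{1, 2, 3, 4})); // prints 10
    }
}
```

The merge step matters because Spark aggregates each partition independently before combining the partial buffers.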

Audience

To benefit from the Spark SQL optimization engine, new Spark UDFs should be implemented in Java or Scala.

UDAFs, on the other hand, can only be implemented in Java or Scala.

You can use the IDE (or text editor) of your choice.

Use the punchpkg tool to package and deploy your nodes on your local standalone platform. Refer to the PunchPkg section.

Example

Working directory

cd $PUNCHPLATFORM_CONF_DIR/training/aim/track4

Prerequisite

  • A standalone installed: https://punchplatform.com
  • Basic knowledge of Spark DataTypes/Scala DataTypes
  • Basic knowledge of the Spark API: UDF

UDF & UDAF: Try it out!

Use Maven to package the compiled bytecode into a jar.

Check out the implementations:

StrToArrayString.java

SumAggregate.java

# build the project
mvn clean install

# a jar with dependencies is generated in $(pwd)/target
# install it with punchpkg
punchpkg spark install target/punchplatform-udf-starter-kit-*-jar-with-dependencies.jar

# first run the punchline that does not use the udf
punchlinectl start -p before_udf_helloworld.yaml

# then run the punchline that uses our udf
punchlinectl start -p after_udf_helloworld.yaml
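Once the jar is installed and the punchline declares the function, it can be invoked straight from SQL. A hypothetical example (the registered function name and the column/table names below are illustrative, not taken from the starter kit):

```sql
-- hypothetical: assumes the udf was registered under the name str_to_array_string
SELECT str_to_array_string(raw_log) AS tokens FROM input_data;
```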