Track 5: Extending Spark SQL with a Python UDF¶
Abstract
This track explains how to implement your own PySpark SQL UDF.
Once registered, such a UDF is directly usable within an SQL query.
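As a quick sketch of the idea (the function and names below are illustrative, not the track's actual code), a plain Python function can be registered as a Spark SQL UDF and then invoked by name from any SQL query:

```python
# Illustrative sketch of a PySpark SQL UDF, not the track's actual code.
# The core logic is just a plain Python function:
def my_own_func(message):
    # return the length of the input string, e.g. "punch!" -> 6
    return len(message)

# With an active SparkSession (assumed available as `spark`), the function
# is registered under a SQL-visible name and called from any query:
#
#   from pyspark.sql.types import IntegerType
#   spark.udf.register("myOwnFunc", my_own_func, IntegerType())
#   spark.sql("SELECT myOwnFunc(Message) FROM logs").show()
```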
Check out the $PUNCHPLATFORM_CONF_DIR/training/aim/track5
folder. All files referenced in this chapter are located in that folder.
First read carefully the README.md file.
Dependency Management¶
Read the Dependency Management Guide to understand the issues at stake.
Development¶
You can use the IDE (or text editor) of your choice.
Use the punchpkg
tool to package and deploy your nodes on your local standalone platform.
Refer to the PunchPkg Section.
Example¶
Working directory
cd $PUNCHPLATFORM_CONF_DIR/training/aim/track5
Prerequisite¶
- A standalone installed:
https://punchplatform.com
- Basic knowledge of Spark DataTypes and Scala DataTypes
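As a quick refresher on the DataTypes prerequisite (a minimal, non-exhaustive illustration; the exact type depends on your UDF's return value), the Python value a UDF returns must be declared with a matching Spark SQL DataType at registration time:

```python
# Rough correspondence (illustrative, not exhaustive) between the Python
# value a UDF returns and the Spark SQL DataType declared at registration:
PYTHON_TO_SPARK_TYPE = {
    "str": "StringType()",
    "int": "LongType()",
    "float": "DoubleType()",
    "list[float]": "ArrayType(DoubleType())",
}
```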
Try it out!¶
PEX is used to package our Python code and its dependencies:
#
# generate a udf.pex that contains all your python code and requirements
make package name=udf.pex
# install the udf.pex dependency like other dependencies
punchpkg pyspark install dist/udf.pex
# launch the example
punchlinectl start -p example_pyspark_udf.yaml
# output
Show node result:
+--------------+----------------------------------------+
| myOwnFunc2() | UDF:punch_str_to_array_double(Message) |
+--------------+----------------------------------------+
| 6 | [1.0, 2.0, 3.0] |
| 6 | [1.0, 2.0, 3.0, 5.0] |
| 6 | [1.0, 2.0, 3.0, 99.0, 5.0] |
| 6 | [0.3, 2.0, 3.0] |
+--------------+----------------------------------------+
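The second column above comes from a UDF that parses a delimited string of numbers into an array of doubles. A minimal sketch of such parsing logic (assuming a comma-separated input string; the node's real implementation may differ) could look like:

```python
# Hypothetical parsing logic for a punch_str_to_array_double-style UDF.
# Assumes the Message column holds comma-separated numbers, e.g. "1,2,3".
def str_to_array_double(message):
    return [float(x) for x in message.split(",")]

# Registered with Spark (session assumed available as `spark`), the
# returned list maps to an ArrayType(DoubleType()) column:
#
#   from pyspark.sql.types import ArrayType, DoubleType
#   spark.udf.register("punch_str_to_array_double",
#                      str_to_array_double,
#                      ArrayType(DoubleType()))
```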