Track 5: Extending Spark SQL with a Python UDF¶
Abstract
This track explains how to implement your own PySpark SQL UDF.
Once registered, such a UDF is directly usable within an SQL query.
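As a quick sketch of the idea (the function and names below are illustrative, not the track's actual code), a plain Python function can be registered as a Spark SQL UDF and then invoked by name from any SQL query:

```python
# Illustrative sketch of a PySpark SQL UDF, not the track's actual code.
# The core logic is just a plain Python function:
def my_own_func(message):
    # return the length of the input string, e.g. "punch!" -> 6
    return len(message)

# With an active SparkSession (assumed available as `spark`), the function
# is registered under a SQL-visible name and called from any query:
#
#   from pyspark.sql.types import IntegerType
#   spark.udf.register("myOwnFunc", my_own_func, IntegerType())
#   spark.sql("SELECT myOwnFunc(Message) FROM logs").show()
```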
Check out the $PUNCHPLATFORM_CONF_DIR/training/aim/track5
folder. All files referenced in this chapter are located in that folder.
First read carefully the README.md file.
Dependency Management¶
Read the Dependency Management Guide to understand the issues at stake.
Development¶
You can use the IDE (or text editor) of your choice.
Use the punchpkg
tool to package and deploy your nodes on your local standalone platform.
Refer to the PunchPkg Section.
Example¶
Working directory
cd $PUNCHPLATFORM_CONF_DIR/training/aim/track5
Prerequisite¶
- A standalone installed:
https://punchplatform.com
- Basic knowledge of Spark DataTypes and Scala DataTypes
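As a quick refresher on the DataTypes prerequisite (a minimal, non-exhaustive illustration; the exact type depends on your UDF's return value), the Python value a UDF returns must be declared with a matching Spark SQL DataType at registration time:

```python
# Rough correspondence (illustrative, not exhaustive) between the Python
# value a UDF returns and the Spark SQL DataType declared at registration:
PYTHON_TO_SPARK_TYPE = {
    "str": "StringType()",
    "int": "LongType()",
    "float": "DoubleType()",
    "list[float]": "ArrayType(DoubleType())",
}
```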
Try it out!¶
PEX is used to package our Python code and its dependencies:
#
# generate a udf.pex that contains all your python code and requirements
make package name=udf.pex
# install the udf.pex dependency like other dependencies
punchpkg pyspark install dist/udf.pex
# launch the example
punchlinectl start -p example_pyspark_udf.yaml
# output
Show node result:
+--------------+----------------------------------------+
| myOwnFunc2() | UDF:punch_str_to_array_double(Message) |
+--------------+----------------------------------------+
| 6 | [1.0, 2.0, 3.0] |
| 6 | [1.0, 2.0, 3.0, 5.0] |
| 6 | [1.0, 2.0, 3.0, 99.0, 5.0] |
| 6 | [0.3, 2.0, 3.0] |
+--------------+----------------------------------------+
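The second column above comes from a UDF that parses a delimited string of numbers into an array of doubles. A minimal sketch of such parsing logic (assuming a comma-separated input string; the node's real implementation may differ) could look like:

```python
# Hypothetical parsing logic for a punch_str_to_array_double-style UDF.
# Assumes the Message column holds comma-separated numbers, e.g. "1,2,3".
def str_to_array_double(message):
    return [float(x) for x in message.split(",")]

# Registered with Spark (session assumed available as `spark`), the
# returned list maps to an ArrayType(DoubleType()) column:
#
#   from pyspark.sql.types import ArrayType, DoubleType
#   spark.udf.register("punch_str_to_array_double",
#                      str_to_array_double,
#                      ArrayType(DoubleType()))
```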