Track 5: Extending Spark SQL with a Python UDF

Abstract

This track explains how to implement your own PySpark SQL UDF.

Such an implementation is directly usable within a SQL query, as illustrated below.
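As a quick illustration of the general mechanism, here is a minimal PySpark sketch of registering a plain Python function as a SQL UDF. The names (str_length, the messages view) are illustrative, not part of the track:

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

# a plain Python function...
def str_length(s):
    return len(s) if s is not None else 0

# ...registered under a name that is directly callable from SQL
spark.udf.register("str_length", str_length, IntegerType())

spark.createDataFrame([("hello",), ("world!",)], ["Message"]) \
     .createOrReplaceTempView("messages")

spark.sql("SELECT str_length(Message) FROM messages").show()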

Check out the $PUNCHPLATFORM_CONF_DIR/training/aim/track5 folder. All files referenced in this chapter are located in that folder. First, read the README.md file carefully.

Dependency Management

Read the Dependency Management Guide to understand the issues at stake.

Development

You can use the IDE (or text editor) of your choice.

Use the punchpkg tool to package and deploy your nodes on your local standalone platform. Refer to the PunchPkg section.

Example

Working directory

cd $PUNCHPLATFORM_CONF_DIR/training/aim/track5

Prerequisite

  • A standalone installed: https://punchplatform.com
  • Basic knowledge of Spark/Scala DataTypes

Try it out!

PEX is used to package our Python code and its dependencies:

# generate a udf.pex that contains all your python code and requirements
make package name=udf.pex

# install udf.pex like any other pyspark dependency
punchpkg pyspark install dist/udf.pex

# launch the example
punchlinectl start -p example_pyspark_udf.yaml

# output

Show node result:
+--------------+----------------------------------------+
| myOwnFunc2() | UDF:punch_str_to_array_double(Message) |
+--------------+----------------------------------------+
|      6       |            [1.0, 2.0, 3.0]             |
|      6       |          [1.0, 2.0, 3.0, 5.0]          |
|      6       |       [1.0, 2.0, 3.0, 99.0, 5.0]       |
|      6       |            [0.3, 2.0, 3.0]             |
+--------------+----------------------------------------+
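For reference, the punch_str_to_array_double UDF seen in the output above could be implemented along these lines. This is a minimal sketch, not the track's actual code (which ships inside udf.pex): the comma-separated input format and the registration call are assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, DoubleType

def punch_str_to_array_double(message):
    # assumed input format: comma-separated numbers, e.g. "1,2,3"
    if message is None:
        return None
    return [float(token) for token in message.split(",")]

spark = SparkSession.builder.getOrCreate()

# register under the name shown in the output table above,
# so the function becomes callable from plain SQL
spark.udf.register("punch_str_to_array_double",
                   punch_str_to_array_double,
                   ArrayType(DoubleType()))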