Track 2 PySpark Node Development¶
Abstract
This track explains how to write your own custom Python node.
Check out the $PUNCHPLATFORM_CONF_DIR/training/aim/track2 folder. All files referenced in this chapter are located in that folder.
First, read the README.md file carefully.
Dependency Management¶
Read the Dependency Management Guide to understand the issues at stake.
Development¶
You can use the IDE (or text editor) of your choice, but we recommend the PyCharm IDE.
Use the punchpkg tool to package and deploy your nodes on your local standalone platform.
Refer to the PunchPkg section.
Example¶
Working directory
cd $PUNCHPLATFORM_CONF_DIR/training/aim/track2
Prerequisites¶
- A standalone platform installed: https://punchplatform.com
Try it out!¶
A Makefile is at your disposal; use it to clean, lint and format your custom node code.
# check lint and code formatting for module algorithms
make inspect path=algorithms/
# clean unwanted .pyc if any
make clean
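The recipes of the Makefile are not reproduced in this track. As a rough idea of what a clean step of this kind typically does, here is a minimal Python sketch that deletes compiled .pyc files under a directory. The function name remove_pyc is hypothetical and this is not the content of the track's Makefile.

```python
from pathlib import Path
from typing import List


def remove_pyc(root: str) -> List[str]:
    """Delete every *.pyc file under root and return the deleted paths, sorted."""
    removed = []
    # Materialize the listing first so files are not deleted while walking.
    for pyc in list(Path(root).rglob("*.pyc")):
        if pyc.is_file():
            pyc.unlink()
            removed.append(str(pyc))
    return sorted(removed)
```

Calling remove_pyc("algorithms") from the track folder would, under these assumptions, remove the compiled files of that module only and leave the .py sources untouched.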
Note that we will be using punchpkg
...
# Use punchpkg pyspark info to get install dir of pyspark
> punchpkg pyspark info
# Let's try running our template_node
# To begin, we will make our node available to our shells and editor
> eval "$(_PUNCHPKG_COMPLETE=source punchpkg)" # for auto completion
> punchpkg pyspark link-external-nodes $(pwd) # note: pwd here is rootdir of this README.txt
> punchpkg pyspark list-external-nodes # check if node was linked properly
# install custom dependencies needed by your module
# (note: if a module is not available on PyPI, convert it to a PEX and run the same command on the PEX file)
> punchpkg pyspark install-dependencies $(pwd)/complex_algorithm_dependencies
> punchlinectl start -p $(pwd)/full_job.yaml
--[[
__________ .__ .____ .__
\______ \__ __ ____ ____ | |__ | | |__| ____ ____
| ___/ | \/ \_/ ___\| | \| | | |/ \_/ __ \
| | | | / | \ \___| Y \ |___| | | \ ___/
|____| |____/|___| /\___ >___| /_______ \__|___| /\___ >
\/ \/ \/ \/ \/ \/
--]]
____ ________ _________ _
|__]\_/ [__ |__]|__||__/|_/
| | ___]| | || \| \_
using nodes from ./nodes sources
Hello punch
Execution took 0.18007254600524902 seconds
Next, let's add autocompletion to our favorite IDE.
# Grab our punchline_python.whl file and install it with pip install in a virtualenv
# Note: when you pip install additional modules, be sure to track them in a separate file,
# i.e. don't mix our installed dependencies with yours, since this would generate big PEX files...
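One possible way to keep your own modules apart from those installed with punchline_python is to compare the environment before and after your pip install and record only the difference. The sketch below assumes you saved two pip freeze listings made of name==version lines; the helper name added_requirements and the file names in the usage note are hypothetical.

```python
from typing import Iterable, List


def _name(requirement: str) -> str:
    """Return the normalized package name of a 'name==version' line."""
    return requirement.split("==", 1)[0].strip().lower().replace("_", "-")


def added_requirements(before: Iterable[str], after: Iterable[str]) -> List[str]:
    """Return the lines of `after` whose package is absent from `before`."""
    known = {_name(line) for line in before if line.strip()}
    return sorted(
        line.strip()
        for line in after
        if line.strip() and _name(line) not in known
    )
```

For instance, added_requirements(open("before.txt"), open("after.txt")) would return the lines to write to your own dependency file. Lines that are not of the form name==version (editable installs, for example) are not handled by this sketch.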
Coding/deploying your custom node¶
Follow our development guide
Making your node available to our environment¶
# If your node uses custom modules such as pandas,
# provide a text file named after your module.
# The text file should list only the custom modules your node uses
punchpkg pyspark install full/path/to/text_file/custom_modules
> punchpkg pyspark install complex_algorithm_dependencies
# Check if your custom module is properly installed
# A JSON document is printed to stdout; look for the key custom_pex_dependencies
# Under this key, you will see custom_modules
punchpkg pyspark list-dependencies
# Check the current module
punchpkg pyspark info
# Installing your custom node from full path (use tab for autocompletion)
punchpkg pyspark install <tab><tab>
> punchpkg pyspark install $(pwd)/algorithms
# List installed nodes
punchpkg pyspark list-nodes
# Executing a node
# Either use our PL editor or the punchlinectl shell command
punchlinectl start -p full/path/to/job.punchline
> punchlinectl start -p full_job.punchline -v
--[[
__________ .__ .____ .__
\______ \__ __ ____ ____ | |__ | | |__| ____ ____
| ___/ | \/ \_/ ___\| | \| | | |/ \_/ __ \
| | | | / | \ \___| Y \ |___| | | \ ___/
|____| |____/|___| /\___ >___| /_______ \__|___| /\___ >
\/ \/ \/ \/ \/ \/
--]]
____ ________ _________ _
|__]\_/ [__ |__]|__||__/|_/
| | ___]| | || \| \_
using nodes from ./nodes sources
Hello punch
Execution took 0.18007254600524902 seconds
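The check on the output of punchpkg pyspark list-dependencies described above can also be scripted. The exact layout of the JSON document is not shown in this track, so the following sketch simply looks for the custom_pex_dependencies key at any nesting depth. The helper names find_key and custom_dependencies are hypothetical.

```python
import json
from typing import Any, List


def find_key(document: Any, key: str) -> List[Any]:
    """Return every value stored under `key`, at any nesting depth."""
    found = []
    if isinstance(document, dict):
        for name, value in document.items():
            if name == key:
                found.append(value)
            # Keep searching inside the value, whatever its type.
            found.extend(find_key(value, key))
    elif isinstance(document, list):
        for item in document:
            found.extend(find_key(item, key))
    return found


def custom_dependencies(raw_json: str) -> List[Any]:
    """Parse the JSON text and return the values of custom_pex_dependencies."""
    return find_key(json.loads(raw_json), "custom_pex_dependencies")
```

Assuming the command prints only the JSON document on stdout, its output could be piped into a small script that calls custom_dependencies(sys.stdin.read()) and checks that your dependency file name appears in the result.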