HOWTO add my python node

Why do that

A major strength of pml-pyspark is that it lets you create your own nodes to match your specific needs. This section describes how to add a custom node to your standalone.

For users working on the git repository: the global Maven installation of all the repos automatically adds your changes to the new standalone.

It is also possible to manually update an installed standalone with the zip found in the target folder of the repo. If you don't have a target folder, you need to run mvn clean install at least once in the repo. To upgrade, copy the zip to the standalone pyspark folder (replacing the old one) and unzip it. Finally, copy the .venv of the repo to the freshly unzipped pyspark folder in your standalone.

Prerequisites

You need a punchplatform-standalone installed with pyspark (use the --with-pyspark argument when running the standalone installation script, or use the graphical interface and tick pyspark).

What to do

Once your standalone is installed, the pyspark package containing all the nodes is located at: my-standalone-dir/external/punchplatform-pyspark-x.y.z/

All the following shell commands assume that you are at the root of the pyspark folder.

It should contain at least these items:

$ ls -a
.   elasticsearch-hadoop-6.8.2.jar  punchplatform-pyspark.py  requirements.txt
..  nodes.zip                       python-deps               .venv

The files that interest us are nodes.zip and requirements.txt.

Step 1 : Backup nodes.zip

A small precaution that can prevent big mistakes: back up the files.

Additionally, it is not recommended to modify existing nodes or files. Doing so is at your own risk.
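A minimal sketch of the backup, assuming you are at the root of the pyspark folder. The timestamped backup directory name is an arbitrary choice for this example, not a punchplatform convention.

```shell
# Copy nodes.zip and requirements.txt into a timestamped backup folder.
# The "backup-..." name is an assumption made for this sketch.
backup_dir="backup-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$backup_dir"

for f in nodes.zip requirements.txt; do
  if [ -f "$f" ]; then
    cp -p "$f" "$backup_dir/"
  else
    echo "warning: $f not found in $(pwd)" >&2
  fi
done
```

Copying rather than moving leaves the originals in place, so the following steps still work unchanged.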

Step 2 : Unzip nodes.zip

This creates the following folders:

  • nodes contains the nodes.

  • shared contains files that implement generic methods which can be used by the nodes.

  • core contains the backend engine files.
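The extraction itself is a single unzip call. The sketch below assumes the unzip utility is installed and that you are at the root of the pyspark folder. The -o flag overwrites files left over from a previous extraction without prompting.

```shell
# Extract nodes.zip in place, then report which of the expected folders exist.
if [ -f nodes.zip ]; then
  unzip -q -o nodes.zip
else
  echo "nodes.zip not found: run this from the pyspark folder" >&2
fi

found=""
for dir in nodes shared core; do
  if [ -d "$dir" ]; then
    found="$found $dir"
  fi
done
echo "folders present:${found:- none}"
```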

Step 3 : Add your file

To add a node, just copy the Python file containing your node to the nodes folder mentioned above.
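As a sketch, the copy can be preceded by a syntax check so that an obviously broken file never reaches the nodes folder. The path /path/to/my_node.py is a placeholder for your own file. The check uses Python's standard py_compile module, which only catches syntax errors, not problems specific to pml-pyspark.

```shell
# Placeholder path: replace with the Python file that contains your node.
node_file="/path/to/my_node.py"

if [ ! -f "$node_file" ]; then
  echo "node file not found: $node_file" >&2
elif [ ! -d nodes ]; then
  echo "nodes folder not found: unzip nodes.zip first" >&2
elif python -m py_compile "$node_file"; then
  # Syntax check passed: copy the file next to the existing nodes.
  cp "$node_file" nodes/
else
  echo "syntax errors in $node_file, not copied" >&2
fi
```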

Step 3.1 (optional) : Test your new node

At this point, you could repackage nodes.zip and test your code using the punchplatform-pyspark shell command. However, it is also possible to test your node right now, which lets you apply final corrections to your node before repackaging.

Warning: at this point you need to remove or rename the old nodes.zip to avoid conflicts (be sure to back it up first). If you don't delete or rename it, the zipped nodes will still be used and your newly added node won't be found.
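One way to set the archive aside is to rename it. The .disabled suffix is an arbitrary choice for this sketch. Renaming instead of deleting keeps the original nodes available if you need to roll back.

```shell
# Rename nodes.zip so that the zipped nodes are no longer used.
if [ -f nodes.zip ]; then
  mv nodes.zip nodes.zip.disabled
fi

# Sanity check: the archive should no longer be present under its usual name.
if [ -e nodes.zip ]; then
  echo "nodes.zip is still present" >&2
else
  echo "nodes.zip is out of the way"
fi
```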

To live test the node, run the following commands (from the pyspark directory) :

source .venv/bin/activate
python punchplatform_pyspark.py --job /path/to/my/example.pml

Or to test with a spark submit :

source .venv/bin/activate
spark-submit --jars elasticsearch-hadoop-6.8.2.jar --py-files nodes punchplatform_pyspark.py --job /path/to/my/example.pml

You need to specify the path to a test config file (.pml). It can be any pml you plan to use after adding your node, or a minimal one with just your node and a show node.

Step 3.2 (optional) : Add additional libraries

You might want to use Python libraries other than the ones included in the standalone. The list of packaged libraries is in the requirements.txt file. If your node imports a library that is not included, you need to add it:

  • Either install it manually with pip install (don't forget to source the .venv)

  • Or add it to requirements.txt. You can specify the version with the syntax package-name==version, or give just the package name as you would when installing it with pip. Then run the following:

source .venv/bin/activate
pip install -U -r requirements.txt
deactivate 
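Editing the requirements file can also be scripted, as sketched below. The package name some-library and the version 1.2.3 are placeholders, not real recommendations. The line is appended only if the package is not already listed, and this should be done before running the pip install command.

```shell
# Placeholder requirement: replace with the library your node imports.
package="some-library"
version="1.2.3"

# Append the pinned requirement unless the package is already listed,
# either pinned (name==version) or as a bare name on its own line.
if grep -Eq "^${package}(==|\$)" requirements.txt 2>/dev/null; then
  echo "$package is already listed in requirements.txt"
else
  echo "${package}==${version}" >> requirements.txt
fi
```

Pinning an exact version makes the environment reproducible, while a bare package name lets pip pick the latest release.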

Step 4 : Re-zip and run

Once the node is fully functional, it's time to repack it into a new nodes.zip:

zip nodes.zip -r nodes shared

Don't forget the shared folder: the nodes rely on the generic methods it contains.
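To confirm that the new archive contains both folders, its listing can be inspected with unzip -l. This is a sketch that assumes unzip is installed. It only checks for the top-level nodes/ and shared/ entries.

```shell
# Count the expected top-level folders that are absent from nodes.zip.
missing=0
for dir in nodes shared; do
  if unzip -l nodes.zip 2>/dev/null | grep -q " ${dir}/"; then
    echo "${dir}/ is in nodes.zip"
  else
    echo "${dir}/ is missing from nodes.zip" >&2
    missing=$((missing + 1))
  fi
done
echo "missing folders: $missing"
```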

Now you can execute your pml as if you were using any punch example (no need to source the .venv) :

punchplatform-pyspark.sh --job /path/to/new/example.pml

This script behaves like punchplatform-analytics.sh; both might be merged in the future.