HOWTO add my python node
Why do that
A major strength of pml-pyspark is that it allows you to create your own nodes matching your specific needs. This section describes how to add a custom node to your standalone.
For users working on the git repository: the global Maven installation of all the repos automatically adds your changes to the new standalone.
It is also possible to manually update an installed standalone with the zip found in the target folder of the repo. If you don't have a target folder, run mvn clean install at least once in the repo. To upgrade, copy the zip into the standalone pyspark folder (replacing the old one) and unzip it. Finally, copy the .venv of the repo into the freshly unzipped pyspark folder of your standalone.
You need a punchplatform-standalone installed with pyspark (use the --with-pyspark argument when running the standalone installation script, or use the graphical interface and tick pyspark).
What to do
Once your standalone is installed, the pyspark package containing all the nodes is located at:
All the following shell commands assume that you are at the root of the pyspark folder.
It should contain at least these items:
```shell
$ ls -a
.   elasticsearch-hadoop-6.8.2.jar  punchplatform-pyspark.py  requirements.txt
..  nodes.zip                       python-deps               .venv
```
The files that interest us are:
The workflow for adding a new node will be simplified in the Dave release: we will provide a CLI to automatically import your dependencies and generate the needed files.
Step 1: Back up nodes.zip
A little precaution that can prevent big mistakes: back up the files.
Additionally, it is not recommended to modify an existing node, since updates might happen in the future and could cause breaking changes in your PML pipeline.
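For example, a single copy run from the pyspark folder keeps the original archive recoverable (the .orig suffix is just a convention):

```shell
# Keep a pristine copy of the original archive before any change.
# The guard makes the command a no-op if nodes.zip is absent.
if [ -f nodes.zip ]; then
  cp -n nodes.zip nodes.zip.orig   # -n: never overwrite an existing backup
fi
```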
Step 2: Unzip nodes.zip
This will create the following folders:

- nodes contains the nodes (obviously)
- shared contains files implementing generic methods that can be used by the nodes
- core contains the backend engine files
Step 3.1: Add your file
To add a node, just copy the Python file containing your node into the previously mentioned nodes folder.
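The exact node contract (base class, naming, registration) is defined by the files shipped in the shared and core folders, so the safest approach is to copy an existing node from the nodes folder and adapt it. Purely as an illustration of the shape such a file can take (none of these names come from the punch API; `my_uppercase_node.py` and `transform_record` are hypothetical), a node file can keep its core logic in a small framework-free function that is easy to test on its own:

```python
# my_uppercase_node.py -- hypothetical example, NOT the real punch node API.
# Copy an existing node from nodes/ for the actual base class and wiring.

def transform_record(record: dict) -> dict:
    """Uppercase the 'message' field of a record, leaving other fields intact."""
    out = dict(record)
    if isinstance(out.get("message"), str):
        out["message"] = out["message"].upper()
    return out

if __name__ == "__main__":
    # Quick manual check of the node logic, outside any pipeline.
    print(transform_record({"message": "hello punch"}))
```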
Step 3.2 (optional): Add additional libraries
You might want to use python libraries other than the ones included in the standalone: the list of packaged libraries is in the requirements.txt file. If your node imports a library that is not included, you need to add it:
Either add it manually with pip install (don't forget to source the .venv)
Or add it to the requirements.txt file; you can pin the version with the syntax
package-name==version, or just use the package name as you would install it with pip. Then run the following:
```shell
source .venv/bin/activate
pip install -U -r requirements.txt
deactivate
```
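For instance, a requirements.txt entry pinning an exact version looks like this (the package name here is purely illustrative):

```text
# requirements.txt -- one dependency per line; == pins an exact version
requests==2.31.0
```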
Step 4: Re-zip and run
Once the node is fully functional, it's time to repack it into a new nodes.zip:
```shell
zip nodes.zip -r nodes shared
```
Don't forget the shared folder
Now you can execute your PML as if you were using any punch example (no need to source the .venv):
```shell
punchplatform-pyspark.sh --job /path/to/new/example.pml
```
This script behaves like punchplatform-analytics.sh; both might be merged in the future.