HOWTO add my Python node
Why do that
A major strength of pml-pyspark is that it lets you create your own nodes to match your specific needs. This section describes how to add a custom node to your standalone.
For users working on the git repository: the global Maven installation of all the repos automatically adds your changes to the new standalone.
It is also possible to manually update an installed standalone with the zip in the target folder of the repo. If you don't have a target folder, you need to run mvn clean install at least once in the repo. To upgrade, copy the zip to the standalone pyspark folder (replacing the old one) and unzip it. Finally, copy the .venv of the repo to the freshly unzipped pyspark folder in your standalone.
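The manual upgrade can be sketched as follows. Every path and file name below is a placeholder built in a scratch directory for illustration only; substitute your actual repo and standalone locations, and note that the real zip name may differ.

```python
import os
import shutil
import tempfile
import zipfile

# Placeholders standing in for the real repo and standalone folders
scratch = tempfile.mkdtemp()
repo = os.path.join(scratch, "pml-pyspark-repo")
standalone = os.path.join(scratch, "punchplatform-standalone")
os.makedirs(os.path.join(repo, "target"))
os.makedirs(os.path.join(repo, ".venv"))  # stand-in for the repo's venv
os.makedirs(standalone)

# Stand-in for the zip produced by `mvn clean install`
zip_src = os.path.join(repo, "target", "pyspark.zip")
with zipfile.ZipFile(zip_src, "w") as z:
    z.writestr("pyspark/punchplatform-pyspark.py", "# engine entry point")

# 1. Copy the zip to the standalone (replacing any old one) and unzip it
zip_dst = os.path.join(standalone, "pyspark.zip")
shutil.copy2(zip_src, zip_dst)
with zipfile.ZipFile(zip_dst) as z:
    z.extractall(standalone)

# 2. Copy the repo's .venv into the freshly unzipped pyspark folder
shutil.copytree(os.path.join(repo, ".venv"),
                os.path.join(standalone, "pyspark", ".venv"))
```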
You need a punchplatform standalone installed with pyspark (use the --with-pyspark argument when running the standalone installation script, or use the graphical interface and tick pyspark).
What to do
Once your standalone is installed, the pyspark package containing all the nodes is located at:
All the following shell commands assume that you are at the root of the pyspark folder.
It should contain at least these items:
$ ls -a
.    elasticsearch-hadoop-6.8.2.jar  punchplatform-pyspark.py  requirements.txt
..   nodes.zip                       python-deps               .venv
The files that interest us are:
Step 1: Back up nodes.zip
A small precaution that can prevent big mistakes: back up the files.
Additionally, it is not recommended to modify already existing nodes or files; do so at your own risk.
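A minimal sketch of the backup, meant to be run from the pyspark folder; here a scratch directory with an empty placeholder file stands in for the real nodes.zip:

```python
import os
import shutil
import tempfile
import time

# Scratch directory with a placeholder archive, for illustration only
os.chdir(tempfile.mkdtemp())
open("nodes.zip", "wb").close()  # stands in for the real nodes.zip

# Keep a dated copy next to the original before touching anything
backup = "nodes.zip.bak-" + time.strftime("%Y%m%d")
shutil.copy2("nodes.zip", backup)
```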
Step 2: Unzip nodes.zip
This will create the following folders:
nodes contains the nodes (obviously)
shared contains some files that implement generic methods which can be used by the nodes.
core contains the backend engine files.
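The step boils down to a plain unzip. In the sketch below the placeholder nodes.zip is built on the fly; the real archive already ships with these folders:

```python
import os
import tempfile
import zipfile

# Build a placeholder nodes.zip in a scratch directory (illustration only)
os.chdir(tempfile.mkdtemp())
with zipfile.ZipFile("nodes.zip", "w") as z:
    z.writestr("nodes/example_node.py", "# a packaged node")
    z.writestr("shared/helpers.py", "# generic helper methods")
    z.writestr("core/engine.py", "# backend engine files")

# Equivalent of running `unzip nodes.zip`
with zipfile.ZipFile("nodes.zip") as z:
    z.extractall(".")

folders = sorted(d for d in os.listdir(".") if os.path.isdir(d))
print(folders)  # → ['core', 'nodes', 'shared']
```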
Step 3: Add your file
To add a node, just copy the Python file containing your node to the previously mentioned nodes folder.
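Boiled down, this step is a single copy. my_custom_node.py and the scratch layout below are purely illustrative stand-ins for your node file and the unzipped pyspark folder:

```python
import os
import shutil
import tempfile

# Scratch layout standing in for the pyspark folder after step 2
os.chdir(tempfile.mkdtemp())
os.makedirs("nodes")
with open("my_custom_node.py", "w") as f:
    f.write("# your node implementation\n")

# Drop the node file into the unzipped nodes/ folder
shutil.copy("my_custom_node.py", "nodes/")
print(os.listdir("nodes"))  # → ['my_custom_node.py']
```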
Step 3.1 (optional): Test your new node
From now on, you could repackage nodes.zip and test your code using the punchplatform-pyspark shell command. However, it is also possible to test your node right away, which lets you apply last corrections to your nodes before re-packaging.
Warning: at this point you need to remove or rename the old nodes.zip to avoid conflicts (be sure to have a backup). If you don't delete or rename it, the engine will still use the zipped nodes and won't find the newly added node.
To live-test the node, run the following commands (from the pyspark directory):
source .venv/bin/activate
python punchplatform_pyspark.py --job /path/to/my/example.pml
Or to test with a spark submit :
source .venv/bin/activate
spark-submit --jars elasticsearch-hadoop-6.8.2.jar --py-files nodes punchplatform_pyspark.py --job /path/to/my/example.pml
You need to specify the path to a test configuration file (.pml). It can be any pml you plan to use after adding your node, or a minimal one with just your node and a show node.
Step 3.2 (optional): Add additional libraries
You might want to use Python libraries other than the ones included in the standalone: the list of packaged libraries is in the requirements.txt file. If your node imports a library that is not included, you need to add it:
Either add it manually with pip install (don't forget to source the .venv)
Or add it to requirements.txt; you can pin a version with the syntax package-name==version, or just give the package name as you would install it with pip. Then run the following:
source .venv/bin/activate
pip install -U -r requirements.txt
deactivate
Step 4: Re-zip and run
Once the node is fully functional, it is time to repack it in a new nodes.zip:
zip nodes.zip -r nodes shared
Don't forget the shared folder
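If the zip command is unavailable, the same packaging can be done from Python with the standard zipfile module. The scratch files below are placeholders for the real contents of the nodes/ and shared/ folders:

```python
import os
import tempfile
import zipfile

# Scratch folders standing in for the real nodes/ and shared/ contents
os.chdir(tempfile.mkdtemp())
for path in ("nodes/my_custom_node.py", "shared/helpers.py"):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()

# Equivalent of `zip nodes.zip -r nodes shared`
with zipfile.ZipFile("nodes.zip", "w") as z:
    for folder in ("nodes", "shared"):  # don't forget the shared folder
        for root, _, files in os.walk(folder):
            for name in files:
                z.write(os.path.join(root, name))

names = zipfile.ZipFile("nodes.zip").namelist()
```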
Now you can execute your pml as if it were any punch example (no need to source the .venv):
punchplatform-pyspark.sh --job /path/to/new/example.pml
This script behaves like punchplatform-analytics.sh; both might be merged together in the future.