Spark and PySpark Dependency Management
Abstract
This chapter explains how to manage the dependencies of user applications. You can skip it if you have no custom nodes on your platform.
The punch lets you add your own Java or Python nodes to the libraries of nodes, so that you can in turn refer to them in punchlines. In the end, a punchline is bundled either as a Java jar-with-dependencies package or as a PEX package, depending on whether you code in Java or Python.
Whatever you do, you must declare and include your required runtime dependencies so that, at execution, all your libraries are properly located and loaded.
Spark
All the punch libraries are shaded with Maven shading plugins, so there should be no conflict with yours. We recommend that you also shade your own jars to avoid conflicts with the standard Spark jars.
PySpark
PySpark is a wrapper around the Java Spark runtime. It is therefore possible to include jar dependencies, provided your code calls the proper APIs to make use of them.
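As an illustration of what "calling the proper APIs" can look like in plain PySpark, here is a minimal sketch. It is not the punch mechanism: it assumes the standard Spark property spark.jars is used to ship the jars, and it reaches the JVM through the private _jvm attribute of the SparkContext, which may change between Spark versions. The helper names (jar_conf, build_session, jvm_java_version) and the class named in the final comment are hypothetical.

```python
def jar_conf(jar_paths):
    """Build the standard Spark property that ships extra jars with a job."""
    return {"spark.jars": ",".join(jar_paths)}


def build_session(jar_paths, app_name="jar-demo"):
    """Create (or reuse) a SparkSession configured with the given jars."""
    # Imported lazily so that jar_conf() stays usable without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in jar_conf(jar_paths).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def jvm_java_version(spark):
    """Call a JVM static method through the py4j gateway.

    _jvm is a private PySpark attribute: treat this as an assumption,
    not as a stable API.
    """
    return spark.sparkContext._jvm.java.lang.System.getProperty("java.version")


# A class packaged in one of your own jars would be reached the same way,
# for example (hypothetical class name):
#   spark.sparkContext._jvm.com.example.MyHelper.someStaticMethod()
```

Calling build_session(["/path/to/my-lib.jar"]) needs a working Spark installation; jar_conf on its own only formats the property value.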
PEX
The punch PySpark implementation leverages PEX executables to bring in dependencies. There are two possible scenarios in the Python world:
- load all the available Python dependencies at runtime
- or load only a selected list of Python dependencies.
Note
Selecting a list of dependencies to load, instead of loading all of them, can significantly reduce the loading time at runtime.
Loading a selected list of Python dependencies also lets multiple punchlines use different versions of the same Python module.
Warning
By default, if not specified, no dependencies are taken into account.
If you want to select the dependencies loaded at runtime, use the option spark.additional.pex: pex1.pex,pex2.pex in the settings: {...} section of your punchline.
Tip
Use punchpkg pyspark list-dependencies to get an overview of the installed dependencies!
Punch Enforced Dependencies
Below is the list of Python modules that we bring out of the box and whose versions cannot be overridden. We tried our best to keep these dependencies to a minimum, so as to avoid version conflicts or code incompatibilities with yours.
art==4.2
cov-core==1.15.0
coverage==4.5.4
elasticsearch==7.0.5
hjson==3.0.1
numpy==1.18.4
pandas==1.0.3
pex==1.6.12
prettytable==0.7.2
py4j==0.10.7
pyarrow==0.17.0
pyspark==2.4.3
python-dateutil==2.8.1
pytz==2020.1
six==1.14.0
urllib3==1.25.6
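To spot a clash between your own requirements and the pinned versions before packaging a PEX, a comparison along the following lines may help. This is a minimal sketch, not a punch tool: the names parse_pins and find_conflicts are hypothetical, it only understands exact name==version lines, and the requirements of the custom node are invented for the illustration.

```python
def parse_pins(text):
    """Parse lines of the form 'name==version' into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, version = line.partition("==")
        if not sep:
            raise ValueError(f"not an exact pin: {line!r}")
        pins[name.strip().lower()] = version.strip()
    return pins


def find_conflicts(enforced, requested):
    """Return {name: (enforced_version, requested_version)} for mismatches."""
    return {
        name: (enforced[name], version)
        for name, version in requested.items()
        if name in enforced and enforced[name] != version
    }


# A subset of the pins listed above, for brevity.
ENFORCED = parse_pins("""
numpy==1.18.4
pandas==1.0.3
pyarrow==0.17.0
""")

# Hypothetical requirements of a custom node.
mine = parse_pins("pandas==1.1.0\nnumpy==1.18.4\nscikit-learn==0.23.1")

print(find_conflicts(ENFORCED, mine))
# → {'pandas': ('1.0.3', '1.1.0')}
```

Modules that are not in the enforced list, such as scikit-learn here, are ignored by the check.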
Example(s)
Below is a punchline that prints a Spark dataframe to stdout. Executing this punchline requires:
- complex_algorithm_dependencies.pex and
- pandas_v1.pex
to exist in $PUNCHPLATFORM_INSTALL_DIR/extlib/pyspark
---
type: punchline
runtime: pyspark
channel: default
version: '6.0'
tenant: default
dag:
- settings:
    input_data:
    - date: "{{ from }}"
      name: from_date
    - date: "{{ to }}"
      name: to_date
  component: input
  publish:
  - stream: data
  type: dataset_generator
- settings:
    truncate: false
  component: show
  subscribe:
  - component: input
    stream: data
  type: show
settings:
  spark.additional.pex: complex_algorithm_dependencies.pex,pandas_v1.pex
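Before submitting such a punchline, it can be useful to check that every PEX file named in spark.additional.pex is actually present in the expected directory. The sketch below is one possible pre-flight check and is not part of the punch tooling: the function names are hypothetical, and tolerating spaces around the commas is a choice of this helper, not a documented behaviour of the option.

```python
import os


def split_pex_option(value):
    """Split a 'spark.additional.pex' value into a list of file names."""
    return [name.strip() for name in value.split(",") if name.strip()]


def missing_pex_files(option_value, pex_dir):
    """Return the names from option_value that are not files in pex_dir."""
    return [
        name
        for name in split_pex_option(option_value)
        if not os.path.isfile(os.path.join(pex_dir, name))
    ]


if __name__ == "__main__":
    # Falls back to the current directory if the variable is not set.
    install_dir = os.environ.get("PUNCHPLATFORM_INSTALL_DIR", ".")
    pex_dir = os.path.join(install_dir, "extlib", "pyspark")
    option = "complex_algorithm_dependencies.pex,pandas_v1.pex"

    missing = missing_pex_files(option, pex_dir)
    if missing:
        print("missing in", pex_dir, ":", ", ".join(missing))
    else:
        print("all PEX files found in", pex_dir)
```

The check only looks at file presence; it does not open the PEX files or verify what they contain.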