Spark and PySpark Dependency Management

Abstract

This chapter explains how to manage the dependencies of user applications. You can skip it if you have no custom nodes on your platform.

The punch lets you add your own Java or Python nodes to the library of nodes, so that you can in turn refer to them in punchlines. In the end, a punchline is bundled either as a Java jar-with-dependencies package or as a PEX package, depending on whether you code in Java or Python.

In either case, you must declare and include your required runtime dependencies so that, at execution time, all your libraries are properly located and loaded.

Spark

All the punch libraries are shaded using Maven shading, so they should not conflict with yours. We recommend that you also shade your own dependencies to avoid conflicts with the standard Spark jars.

PySpark

PySpark is a wrapper around the Java Spark runtime. It is therefore possible to include jar dependencies, provided your code calls the proper APIs to make use of them.
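As a minimal sketch of what shipping jars with a PySpark job could look like, the helper below builds the standard Spark `spark.jars` property, a comma-separated list of jars to add to the driver and executor classpaths. The helper name and the jar paths are hypothetical, and how the punch itself wires jars into a PySpark punchline is not covered here.

```python
def spark_jars_conf(jar_paths):
    """Return a Spark conf mapping that ships the given jars with the job.

    'spark.jars' is a standard Spark property; the paths passed in are
    placeholders chosen for this illustration.
    """
    # Drop empty entries and surrounding whitespace.
    jars = [p.strip() for p in jar_paths if p.strip()]
    if not jars:
        return {}
    return {"spark.jars": ",".join(jars)}


conf = spark_jars_conf(["/opt/libs/my-udfs.jar", "/opt/libs/my-codec.jar"])
print(conf)
```

Each key/value pair can then be passed to `SparkSession.builder.config(key, value)` before the session is created. Classes from those jars are then reachable from Python through the py4j gateway that PySpark relies on.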

PEX

The punch PySpark implementation leverages PEX executables to bring in dependencies. There are two possible scenarios in the Python world:

load all the available Python dependencies at runtime
or load only a selected list of Python dependencies.

Note

Selecting a list of dependencies to load, instead of loading all of them, can significantly reduce the loading time at runtime.

Loading a selected list of Python dependencies also lets you have multiple punchlines that use different versions of the same Python module.

Warning

By default, if not specified, no dependencies are taken into account.

To select which dependencies are loaded at runtime, set the option spark.additional.pex: pex1.pex,pex2.pex in the settings: {...} section of your punchline.
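The value of that option is a comma-separated list of file names. Here is a small sketch of how one might build it from a Python list. The helper name is hypothetical, and the validation rules (non-empty, `.pex` suffix, no duplicates) are assumptions made for illustration, not rules the punch is known to enforce.

```python
def additional_pex_setting(pex_files):
    """Build the punchline 'settings' entry selecting the PEX files to load.

    Assumption: each entry is a bare file name ending in '.pex'.
    """
    selected = []
    for name in pex_files:
        name = name.strip()
        if not name.endswith(".pex"):
            raise ValueError(f"not a PEX file name: {name!r}")
        if name not in selected:  # keep the first occurrence, preserve order
            selected.append(name)
    if not selected:
        raise ValueError("at least one PEX file is required")
    return {"spark.additional.pex": ",".join(selected)}


settings = additional_pex_setting(["pex1.pex", "pex2.pex"])
print(settings)  # {'spark.additional.pex': 'pex1.pex,pex2.pex'}
```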

Tip

Use punchpkg pyspark list-dependencies to get an overview of the installed dependencies.

Punch Enforced Dependencies

Below is a list of Python modules that we bring out of the box and whose versions cannot be overridden. We tried our best to keep these dependencies to a minimum, so as to avoid version conflicts and code incompatibilities with yours.

art==4.2
cov-core==1.15.0
coverage==4.5.4
elasticsearch==7.0.5
hjson==3.0.1
numpy==1.18.4
pandas==1.0.3
pex==1.6.12
prettytable==0.7.2
py4j==0.10.7
pyarrow==0.17.0
pyspark==2.4.3
python-dateutil==2.8.1
pytz==2020.1
six==1.14.0
urllib3==1.25.6
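Since these versions cannot be overridden, it can help to compare your own pinned requirements against them before building a PEX. The sketch below does this for simple `name==version` pins only. The function names and the sample requirements are hypothetical, and only a few entries of the list above are copied in to keep it short.

```python
# A subset of the enforced pins listed above.
ENFORCED = {
    "numpy": "1.18.4",
    "pandas": "1.0.3",
    "pyarrow": "0.17.0",
    "pyspark": "2.4.3",
}


def parse_pins(lines):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, version = line.partition("==")
        if not sep:
            raise ValueError(f"only 'name==version' pins are supported: {line!r}")
        pins[name.strip().lower()] = version.strip()
    return pins


def find_conflicts(user_pins, enforced=ENFORCED):
    """Return {module: (wanted, enforced)} for pins that differ."""
    return {
        name: (wanted, enforced[name])
        for name, wanted in user_pins.items()
        if name in enforced and wanted != enforced[name]
    }


# Hypothetical user requirements, for illustration only.
my_requirements = ["pandas==1.1.0", "numpy==1.18.4", "scikit-learn==0.23.1"]
conflicts = find_conflicts(parse_pins(my_requirements))
print(conflicts)  # {'pandas': ('1.1.0', '1.0.3')}
```

Modules that are absent from the enforced list, such as `scikit-learn` here, are ignored by the check.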

Example(s)

Below is a punchline that prints a Spark dataframe to stdout. Executing it requires the following two files to exist in $PUNCHPLATFORM_INSTALL_DIR/extlib/pyspark:

complex_algorithm_dependencies.pex
pandas_v1.pex

---
type: punchline
runtime: pyspark
channel: default
version: '6.0'
tenant: default
dag:
- settings:
    input_data:
    - date: "{{ from }}"
      name: from_date
    - date: "{{ to }}"
      name: to_date
  component: input
  publish:
  - stream: data
  type: dataset_generator
- settings:
    truncate: false
  component: show
  subscribe:
  - component: input
    stream: data
  type: show
settings:
  spark.additional.pex: complex_algorithm_dependencies.pex,pandas_v1.pex
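
Before submitting the punchline above, one could verify that the PEX files it names are actually present. This is a minimal sketch, assuming the files are looked up by name directly under `$PUNCHPLATFORM_INSTALL_DIR/extlib/pyspark`; the helper name is hypothetical.

```python
import os


def missing_pex_files(setting_value, install_dir=None):
    """Return the PEX names from a 'spark.additional.pex' value that are
    not found under <install_dir>/extlib/pyspark."""
    if install_dir is None:
        # Fall back to the environment variable mentioned in the text.
        install_dir = os.environ.get("PUNCHPLATFORM_INSTALL_DIR", "")
    pex_dir = os.path.join(install_dir, "extlib", "pyspark")
    names = [n.strip() for n in setting_value.split(",") if n.strip()]
    return [n for n in names if not os.path.isfile(os.path.join(pex_dir, n))]


value = "complex_algorithm_dependencies.pex,pandas_v1.pex"
missing = missing_pex_files(value)
print("missing PEX files:", missing)
```

An empty list means every file named in the setting was found.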