Spark and Pyspark dependencies management

This section can be skipped if you don't plan on developing custom nodes in one of the following runtimes:

  • spark;
  • or pyspark.

Spark runtime

All of our libraries are shaded using the Maven Shade plugin. There should therefore be no conflict when you include jars containing your third-party dependencies.

PySpark runtime

Jars

Since PySpark is a wrapper around the Spark runtime, it is possible to include jar dependencies, provided your code calls the proper APIs to make use of them.

PEX

Our PySpark implementation relies heavily on PEX executables to bring in dependencies. There are two possible scenarios in the Python world:

  • load all available Python dependencies at runtime;
  • or load only a selected list of Python dependencies.

Note

Selecting a list of dependencies to load, instead of loading all of them, can significantly reduce loading time at runtime.

Loading a selected list of Python dependencies also enables you to have multiple punchlines using different versions of the same Python module!
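For instance (a hedged sketch; the pex file names are hypothetical), two punchlines on the same platform could each load their own build of pandas through their respective settings sections:

```hjson
# punchline A: ships pandas 1.x in its own pex
settings: {
  spark.additional.pex: pandas_v1.pex
}

# punchline B: ships a different pandas build, without conflicting with A
settings: {
  spark.additional.pex: pandas_v2.pex
}
```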

Warning

By default, if not specified, all dependencies returned by the punchpkg pyspark list-dependencies command are taken into account at runtime. If you want to reduce or select the dependencies loaded at runtime, use the option spark.additional.pex: pex1.pex,pex2.pex in the settings: {...} section of your punchline.

Tip

Use punchpkg pyspark list-dependencies to get an overview of installed dependencies!

Things that should be taken into account

Below is the list of Python modules that we provide out of the box and whose versions cannot be overridden! We did our best to limit these dependencies as much as possible, so as to avoid version conflicts and code incompatibilities when users develop their nodes.

art==4.2
cov-core==1.15.0
coverage==4.5.4
elasticsearch==7.0.5
hjson==3.0.1
numpy==1.18.4
pandas==1.0.3
pex==1.6.12
prettytable==0.7.2
py4j==0.10.7
pyarrow==0.17.0
pyspark==2.4.3
python-dateutil==2.8.1
pytz==2020.1
six==1.14.0
urllib3==1.25.6
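Since these versions cannot be overridden, it can be worth checking your own requirement pins against them before building a PEX. Below is a minimal, hypothetical helper (not part of punchpkg) that flags clashing pins; the pinned versions are copied from the list above:

```python
# Hypothetical helper (not provided by punchpkg): compare your own
# requirement pins against the platform's frozen module versions.
# Subset of the frozen list above, for brevity.
PLATFORM_PINS = {
    "numpy": "1.18.4",
    "pandas": "1.0.3",
    "pyarrow": "0.17.0",
    "pyspark": "2.4.3",
    "elasticsearch": "7.0.5",
}

def find_conflicts(requirements, pins=PLATFORM_PINS):
    """Return {module: (wanted, pinned)} for every requirement that pins
    a different version than the platform does."""
    conflicts = {}
    for line in requirements:
        name, _, wanted = line.strip().partition("==")
        name = name.lower()
        pinned = pins.get(name)
        if pinned is not None and wanted and wanted != pinned:
            conflicts[name] = (wanted, pinned)
    return conflicts

# Example: pandas 1.1.0 clashes with the frozen pandas 1.0.3;
# requests is not frozen, so it is ignored.
print(find_conflicts(["pandas==1.1.0", "requests==2.23.0"]))
# → {'pandas': ('1.1.0', '1.0.3')}
```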

Example(s)

{
  type: punchline
  runtime: spark
  channel: default
  version: "6.0"
  tenant: default
  dag:
  [
    {
      settings:
      {
        input_data:
        [
          {
            date: "{{ from }}"
            name: from_date
          }
          {
            date: "{{ to }}"
            name: to_date
          }
        ]
      }
      component: input
      publish:
      [
        {
          stream: data
        }
      ]
      type: dataset_generator
    }
    {
      settings:
      {
        truncate: false
      }
      component: show
      subscribe:
      [
        {
          component: input
          stream: data
        }
      ]
      type: show
    }
  ]
  settings: {
    spark.additional.pex: complex_algorithm_dependencies.pex,pandas_v1.pex
  }
}