Spark and PySpark Dependency Management

Abstract

This chapter explains how to manage the dependencies of your user applications. You can skip it if you have no custom nodes on your platform.

The punch lets you add your own Java or Python nodes to the library of nodes, so that you can in turn refer to them in punchlines. In the end, a punchline is bundled either as a Java jar with its dependencies or as a PEX package, depending on whether you code in Java or Python.

Whatever you do, you must declare and include your required runtime dependencies so that, at execution, all your libraries are properly located and loaded.

Spark

All the punch libraries are shaded using Maven shading plugins, so there should be no conflict with yours. We recommend that you also use shading to avoid conflicts with the standard Spark jars.

PySpark

PySpark is a wrapper around the Java Spark runtime. It is therefore possible to include jar dependencies, provided your code calls the proper APIs to make use of them.

PEX

The punch PySpark implementation leverages PEX executables to bring in dependencies. There are two possible scenarios in the Python world:

  • load all the available Python dependencies at runtime
  • or load only a selected list of Python dependencies.

Note

Selecting a list of dependencies to load, instead of loading all of them, can significantly reduce the loading time at runtime.

Loading a selected list of Python dependencies also lets you run multiple punchlines that use different versions of the same Python module.

Warning

By default, if nothing is specified, all the dependencies returned by the punchpkg pyspark list-dependencies command are taken into account at runtime. To restrict the dependencies loaded at runtime, set the option spark.additional.pex: pex1.pex,pex2.pex in the settings: {...} section of your punchline.

Tip

Use punchpkg pyspark list-dependencies to get an overview of the installed dependencies.
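
As an illustration of the spark.additional.pex option described above, here is a minimal Python sketch that assembles its value from a list of PEX file names. The helper name build_additional_pex is hypothetical and not part of the punch tooling, and the value format (comma-separated names without spaces) is only inferred from the example shown in this chapter.

```python
def build_additional_pex(pex_files):
    """Build the value of spark.additional.pex (hypothetical helper).

    Assumption: the option expects PEX file names separated by commas,
    without spaces, as in the example of this chapter.
    """
    selected = []
    for name in pex_files:
        name = name.strip()
        if not name.endswith(".pex"):
            raise ValueError(f"not a PEX file: {name!r}")
        if "," in name:
            raise ValueError(f"a file name cannot contain a comma: {name!r}")
        if name not in selected:  # drop duplicates, keep the first occurrence
            selected.append(name)
    return ",".join(selected)


value = build_additional_pex(
    ["complex_algorithm_dependencies.pex", "pandas_v1.pex"]
)
print(value)  # complex_algorithm_dependencies.pex,pandas_v1.pex
```

The resulting string can then be pasted as the value of spark.additional.pex in the settings section of the punchline.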

Punch Enforced Dependencies

Below is the list of Python modules that we bring out of the box and whose versions cannot be overridden. We tried our best to keep these dependencies to a minimum, so as to avoid version conflicts or code incompatibilities with yours.

art==4.2
cov-core==1.15.0
coverage==4.5.4
elasticsearch==7.0.5
hjson==3.0.1
numpy==1.18.4
pandas==1.0.3
pex==1.6.12
prettytable==0.7.2
py4j==0.10.7
pyarrow==0.17.0
pyspark==2.4.3
python-dateutil==2.8.1
pytz==2020.1
six==1.14.0
urllib3==1.25.6
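
Because these versions cannot be overridden, it can help to check your own requirements against them before packaging. The following is a minimal sketch, not a punch tool: the function names parse_pins and find_conflicts are hypothetical, the pinned versions are copied from the list above, and only exact name==version pins are compared (other specifiers such as >= are ignored).

```python
# Versions copied from the enforced list above.
ENFORCED_PINS = """
art==4.2
cov-core==1.15.0
coverage==4.5.4
elasticsearch==7.0.5
hjson==3.0.1
numpy==1.18.4
pandas==1.0.3
pex==1.6.12
prettytable==0.7.2
py4j==0.10.7
pyarrow==0.17.0
pyspark==2.4.3
python-dateutil==2.8.1
pytz==2020.1
six==1.14.0
urllib3==1.25.6
"""


def parse_pins(lines):
    """Parse 'name==version' lines into a dict; other lines are skipped."""
    pins = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower().replace("_", "-")] = version.strip()
    return pins


def find_conflicts(requirements):
    """Return (module, requested, enforced) for each mismatching pin."""
    enforced = parse_pins(ENFORCED_PINS.splitlines())
    conflicts = []
    for name, version in parse_pins(requirements).items():
        pinned = enforced.get(name)
        if pinned is not None and pinned != version:
            conflicts.append((name, version, pinned))
    return conflicts


# Hypothetical requirements of a custom node.
print(find_conflicts(["pandas==1.1.0", "numpy==1.18.4", "requests==2.23.0"]))
# [('pandas', '1.1.0', '1.0.3')]
```

An empty result only means that no exact pin disagrees with the enforced list; transitive dependencies are not inspected.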

Example(s)

{
  type: punchline
  runtime: spark
  channel: default
  version: "6.0"
  tenant: default
  dag:
  [
    {
      settings:
      {
        input_data:
        [
          {
            date: "{{ from }}"
            name: from_date
          }
          {
            date: "{{ to }}"
            name: to_date
          }
        ]
      }
      component: input
      publish:
      [
        {
          stream: data
        }
      ]
      type: dataset_generator
    }
    {
      settings:
      {
        truncate: false
      }
      component: show
      subscribe:
      [
        {
          component: input
          stream: data
        }
      ]
      type: show
    }
  ]
  settings: {
    spark.additional.pex: complex_algorithm_dependencies.pex,pandas_v1.pex
  }
}