Spark

Apache Spark in Punch ships with the official Spark binaries and with our own extended binaries, for both Python and Java.

Description

See the Spark getting started guide for more information.

Spark/Pyspark binaries

Spark binaries in Punch can be broken down into three parts:

  • internal punch binaries
  • external binaries
  • official spark binaries

An overview below:

$ ls $PUNCHPLATFORM_INSTALL_DIR
apache-storm-2.3.0  lib  logstash-7.10.2  plugins resources  spark-2.4.3-bin-hadoop2.7

$ tree $PUNCHPLATFORM_INSTALL_DIR/lib
├── punch-archive-app-6.1.0-jar-with-dependencies.jar
├── ...
├── pyspark
│   ├── aws-java-sdk-1.7.4.jar
│   ├── hadoop-aws-2.7.7.jar
│   ├── log4j-api-2.12.1.jar
│   ├── log4j-core-2.12.1.jar
│   ├── punchline_python-6.1.0_SNAPSHOT-py3-none-any.whl
│   ├── punchplatform-pyspark-6.1.0.pex
│   ├── punch-punchlang-compile-lib-6.1.0.jar
│   ├── punch-punchlang-runtime-lib-6.1.0.jar
│   ├── punch-spark-metrics-6.1.0-jar-with-dependencies.jar
│   ├── punch-spark-node-es-basic-6.1.0.jar
│   ├── punch-spark-node-es-spark-6.1.0.jar
│   ├── punch-spark-node-json-6.1.0.jar
│   ├── punch-spark-node-kafka-6.1.0.jar
│   ├── punch-spark-udf-6.1.0-jar-with-dependencies.jar
│   ├── python_main
│   │   ├── pex_path.py
│   │   ├── punchline.py
│   │   └── scanner.py
│   ├── spark-sql-kafka-0-10_2.11-2.4.3.jar
│   └── spark-streaming-kafka-0-8_2.11-2.4.3.jar
└── spark
    ├── punch-spark-client-6.1.0-jar-with-dependencies.jar
    ├── punch-spark-configuration-6.1.0.jar
    ├── punch-spark-job-6.1.0.jar
    ├── punch-spark-metrics-6.1.0-jar-with-dependencies.jar
    ├── punch-spark-scan-6.1.0.jar
    └── punch-spark-uber-6.1.0-jar-with-dependencies.jar

Notice that the extlib directory is missing from $PUNCHPLATFORM_INSTALL_DIR. This is because no external dependencies have been deployed.
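As a minimal sketch of the layout above, the following Python snippet reports which of the binary locations exist under an installation directory and which files they contain. The helper name punch_spark_locations is hypothetical and not part of Punch; the snippet assumes only the directory names shown on this page.

```python
import os
from pathlib import Path

# Sub-directories described on this page, relative to $PUNCHPLATFORM_INSTALL_DIR.
LOCATIONS = {
    "internal spark": "lib/spark",
    "internal pyspark": "lib/pyspark",
    "external spark": "extlib/spark",
    "external pyspark": "extlib/pyspark",
}


def punch_spark_locations(install_dir):
    """Hypothetical helper: map each location to its sorted file names,
    or to None when the directory does not exist."""
    root = Path(install_dir)
    report = {}
    for label, relative in LOCATIONS.items():
        directory = root / relative
        if directory.is_dir():
            # Only regular files are listed; sub-directories such as
            # python_main are skipped.
            report[label] = sorted(p.name for p in directory.iterdir() if p.is_file())
        else:
            report[label] = None
    return report


if __name__ == "__main__":
    install_dir = os.environ.get("PUNCHPLATFORM_INSTALL_DIR")
    if install_dir:
        for label, files in punch_spark_locations(install_dir).items():
            status = "missing" if files is None else f"{len(files)} file(s)"
            print(f"{label}: {status}")
```

On an installation without external dependencies, both external locations would be reported as missing, which matches the listing above.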

Internal Punch binaries

For Spark, internal Punch binaries are located under:

$PUNCHPLATFORM_INSTALL_DIR/lib/spark

and for Pyspark under:

$PUNCHPLATFORM_INSTALL_DIR/lib/pyspark

When a patch is applied, the binaries located under these directories are replaced by the freshly deployed ones.

External binaries

For Spark, external binaries are located under:

$PUNCHPLATFORM_INSTALL_DIR/extlib/spark

and for Pyspark under:

$PUNCHPLATFORM_INSTALL_DIR/extlib/pyspark

All external jars deployed by the operator should be placed under these directories.

These dependencies can then be added to a punchline running on the Spark runtime through spark.additional.jar or spark.additional.pex.
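To see which deployed files are available before referencing them, a small sketch such as the one below can group the content of an extlib directory by file type. The helper name external_candidates is hypothetical, and the mapping of .jar files to spark.additional.jar and .pex files to spark.additional.pex is an assumption based on the setting names; the exact syntax for declaring them in a punchline is not shown on this page.

```python
from pathlib import Path


def external_candidates(install_dir, runtime="spark"):
    """Hypothetical helper: group the files found under extlib/<runtime>
    by the setting they would presumably be referenced from."""
    directory = Path(install_dir) / "extlib" / runtime
    candidates = {"spark.additional.jar": [], "spark.additional.pex": []}
    if not directory.is_dir():
        # No external dependencies were deployed for this runtime.
        return candidates
    for path in sorted(directory.iterdir()):
        if path.suffix == ".jar":
            candidates["spark.additional.jar"].append(path.name)
        elif path.suffix == ".pex":
            candidates["spark.additional.pex"].append(path.name)
        # Any other file type is ignored (assumption: only jar and pex matter).
    return candidates
```

Calling external_candidates(install_dir, "pyspark") returns empty lists when extlib/pyspark does not exist, as in the listing earlier on this page.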

Patch internal punch binaries for Spark and Pyspark

Refer to this documentation

Add an external library for Spark and Pyspark

Refer to this documentation