Spark
Punch ships Apache Spark as the official Spark binaries together with Punch's own extended binaries, for both Python and Java.
Description¶
Check Spark getting started for more information.
Spark/Pyspark binaries¶
Spark binaries in Punch fall into three groups:
- internal punch binaries
- external binaries
- official spark binaries
An overview of an installation directory is shown below:
$ ls $PUNCHPLATFORM_INSTALL_DIR
apache-storm-2.3.0 lib logstash-7.10.2 plugins resources spark-2.4.3-bin-hadoop2.7
$ tree $PUNCHPLATFORM_INSTALL_DIR/lib
├── punch-archive-app-6.1.0-jar-with-dependencies.jar
├── ...
├── pyspark
│ ├── aws-java-sdk-1.7.4.jar
│ ├── hadoop-aws-2.7.7.jar
│ ├── log4j-api-2.12.1.jar
│ ├── log4j-core-2.12.1.jar
│ ├── punchline_python-6.1.0_SNAPSHOT-py3-none-any.whl
│ ├── punchplatform-pyspark-6.1.0.pex
│ ├── punch-punchlang-compile-lib-6.1.0.jar
│ ├── punch-punchlang-runtime-lib-6.1.0.jar
│ ├── punch-spark-metrics-6.1.0-jar-with-dependencies.jar
│ ├── punch-spark-node-es-basic-6.1.0.jar
│ ├── punch-spark-node-es-spark-6.1.0.jar
│ ├── punch-spark-node-json-6.1.0.jar
│ ├── punch-spark-node-kafka-6.1.0.jar
│ ├── punch-spark-udf-6.1.0-jar-with-dependencies.jar
│ ├── python_main
│ │ ├── pex_path.py
│ │ ├── punchline.py
│ │ └── scanner.py
│ ├── spark-sql-kafka-0-10_2.11-2.4.3.jar
│ └── spark-streaming-kafka-0-8_2.11-2.4.3.jar
└── spark
  ├── punch-spark-client-6.1.0-jar-with-dependencies.jar
  ├── punch-spark-configuration-6.1.0.jar
  ├── punch-spark-job-6.1.0.jar
  ├── punch-spark-metrics-6.1.0-jar-with-dependencies.jar
  ├── punch-spark-scan-6.1.0.jar
  └── punch-spark-uber-6.1.0-jar-with-dependencies.jar
Notice that extlib is missing from $PUNCHPLATFORM_INSTALL_DIR. This is because no external dependencies have been deployed.
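To check whether external dependencies have been deployed on a given node, the two extlib sub-directories described on this page can be tested directly. The following is a minimal sketch, not a Punch command: the function name check_extlib is hypothetical, and it only assumes the extlib/spark and extlib/pyspark layout described in this page.

```shell
#!/bin/sh
# check_extlib is a hypothetical helper, not part of the Punch tooling.
# It reports, for each runtime, whether <install_dir>/extlib/<runtime> exists.
check_extlib() {
    install_dir="$1"
    for runtime in spark pyspark; do
        dir="$install_dir/extlib/$runtime"
        if [ -d "$dir" ]; then
            echo "$runtime: present"
        else
            echo "$runtime: missing"
        fi
    done
}
```

It would typically be called as check_extlib "$PUNCHPLATFORM_INSTALL_DIR". On an installation like the one listed above, both runtimes would be reported as missing.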
Internal Punch binaries¶
For Spark, internal Punch binaries are located under:
$PUNCHPLATFORM_INSTALL_DIR/lib/spark
whereas for pyspark:
$PUNCHPLATFORM_INSTALL_DIR/lib/pyspark
When a patch is applied, the binaries located under these directories are replaced by the freshly deployed ones.
External binaries¶
For Spark, external binaries are located under:
$PUNCHPLATFORM_INSTALL_DIR/extlib/spark
and for pyspark under:
$PUNCHPLATFORM_INSTALL_DIR/extlib/pyspark
All external jars deployed by the operator should be placed under these directories.
These dependencies can later be added to a punchline running on the Spark runtime through the use of spark.additional.jar or spark.additional.pex.
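Before referencing a dependency from a punchline, it can help to list what is actually deployed. The sketch below prints the .jar and .pex files found directly under the two extlib directories. The function name list_external_libs is hypothetical, and the exact value format expected by spark.additional.jar and spark.additional.pex is not covered here, so treat the output only as a list of candidate files.

```shell
#!/bin/sh
# list_external_libs is a hypothetical helper, not part of the Punch tooling.
# It prints every .jar or .pex file found directly under
# <install_dir>/extlib/spark and <install_dir>/extlib/pyspark.
list_external_libs() {
    install_dir="$1"
    for runtime in spark pyspark; do
        dir="$install_dir/extlib/$runtime"
        # Skip runtimes for which nothing has been deployed.
        [ -d "$dir" ] || continue
        find "$dir" -maxdepth 1 -type f \( -name '*.jar' -o -name '*.pex' \) | sort
    done
}
```

It would typically be called as list_external_libs "$PUNCHPLATFORM_INSTALL_DIR". The function prints nothing when no extlib directory exists.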
Patch internal punch binaries for Spark and Pyspark¶
Refer to this documentation
Add an external library for Spark and Pyspark¶
Refer to this documentation