
Track 1 Spark/PySpark Punchline Fundamentals


This track explains the punch analytics and machine learning capabilities. What are the issues to solve? What is the scope covered by the Punchplatform? When is custom development necessary?



A punchline is a data processing application represented by a graph of nodes. The punch provides a simple graph configuration language that can represent different kinds of applications: stream or batch, from simple extract-transform-load (ETL) to machine learning applications.

Have a look at the spark_punchline.yaml in the $PUNCHPLATFORM_CONF_DIR/trainings/aim/track1 directory of your standalone. For now, you should focus only on the punchline itself.
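If you cannot open that file right away, the sketch below gives the general shape of a spark punchline. It is only an illustration: the node types (dataset_generator, show) and their settings are assumptions, and the file shipped with your standalone is the reference.

version: '6.0'
runtime: spark
dag:
  # an input node that generates a small in-memory dataset
  # (node type and settings are illustrative)
  - type: dataset_generator
    component: input
    settings:
      input_data:
        - name: hello
    publish:
      - stream: data
  # an output node that prints whatever it receives
  - type: show
    component: show
    subscribe:
      - component: input
        stream: data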

Execute it as a Spark Java application using the punchlinectl command-line tool:

punchlinectl start -p $PUNCHPLATFORM_CONF_DIR/trainings/aim/track1/spark_punchline.yaml

Now execute it as a Spark Python application:

punchlinectl start -p $PUNCHPLATFORM_CONF_DIR/trainings/aim/track1/pyspark_punchline.yaml


There are many benefits to representing applications as graphs:

  1. you no longer write code: you assemble ready-to-use nodes
  2. your application can be checked and audited before running in production
  3. data access is made dramatically easier by the many input/output nodes that read and write files, SQL databases, Kafka, Elasticsearch, etc.
  4. applications can easily be protected



Often, you need to run a punchline periodically to process some range of data: for example, an application executed every hour to detect data that arrived during the last hour. This is a basic need. Instead of coding this on your own, the punch provides you with plans. A plan is a punchline scheduler.

Have a look at the plan.yaml in the conf/tenants/training/channels/spark_plan folder, together with the hello_world.yaml.
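Conceptually, a plan is a cron-like schedule plus a punchline template: at each tick, time variables are computed and substituted into the template before the punchline is launched. The sketch below only illustrates that idea; the keys shown (model, cron, dates) and the offset format are assumptions, so refer to the actual plan.yaml for the exact schema.

version: '6.0'
name: hello_plan
model:
  # launch the templated punchline every minute (illustrative schedule)
  cron: "*/1 * * * *"
  # time variables injected into the punchline template at each run
  dates:
    from:
      offset: -PT1m
      format: yyyy-MM-dd'T'HH:mmZ
    to:
      format: yyyy-MM-dd'T'HH:mmZ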

Use the planctl command-line tool to start it. Notice that here we request it to run on the regular (Java-based) runtime.

Working directory

cd $PUNCHPLATFORM_CONF_DIR/trainings/aim/track1

Spark runtime

planctl start \
    --plan plan.yaml \
    --template spark_punchline.yaml

Plans can also submit applications to the Python Spark engine:

planctl start \
    --plan plan.yaml \
    --template pyspark_punchline.yaml


If you think about it, this is really magic: the same application can be submitted either as a Java or as a Python application.

Why does the punch offer this feature? Not really because it makes sense from a runtime point of view: in practice you will set up punchlines that leverage Python-only or Java-only processing or I/O nodes, and those punchlines will only be compatible with a given runtime. That said, the same DAG model is used for all punch applications: stream, batch, Python or Java. This is a key strength of the punch.

Coming back to the example you just executed, it turns out that all the nodes of the sample hello punchline are available in both the Java and the Python punch node libraries. Hence the magic.


Can you imagine three functional examples where a plan would be useful?

Working with Channels

Scheduling a punchline within a channel


On a production platform, a spark punchline must be highly available. You achieve that by declaring it as a scheduled application: this guarantees that your punchline will always be running. Should a server fail, it will be rescheduled on another server.

Check out the standalone trainings tenant. It provides two examples: sparkplan and sparkpunchline.


Have a look at the channel_structure.yaml file; it is self-explanatory.
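In essence, a channel structure declares the applications to schedule and the runtime in charge of keeping them alive. The sketch below illustrates the idea for the sparkpunchline channel; the exact keys (cluster, command, args) are assumptions, and the file shipped with your standalone is authoritative.

version: '6.0'
start_by_tenant: true
stop_by_tenant: true
applications:
  # the punchline is declared as a shiva application:
  # shiva will start it, and restart it should its server fail
  - name: spark_punchline
    runtime: shiva
    cluster: common
    command: punchlinectl
    args:
      - start
      - --punchline
      - spark_punchline.yaml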

Use channelctl to query the status of, start, and stop these examples.

First query the status of all the trainings channels:

channelctl --tenant trainings status

Start a channel that includes a spark punchline:

channelctl --tenant trainings start --channel sparkpunchline

Check the channel status; it must now be active:

channelctl --tenant trainings status

Stop the channel:

channelctl --tenant trainings stop --channel sparkpunchline

Then recheck its status:

channelctl --tenant trainings status

Scheduling a plan within a channel


Just like punchlines, plans must be executed in a resilient manner. Shiva or Kubernetes is in charge of scheduling, running and restarting them whenever needed.

Execute the following sequence of actions:

channelctl --tenant trainings status
channelctl --tenant trainings start --channel sparkplan
channelctl --tenant trainings status
channelctl --tenant trainings stop --channel sparkplan

channelctl --tenant trainings status