Track 1 Spark/PySpark Punchline Fundamentals
Abstract
This track explains the punch analytics and machine learning capabilities. What problems are there to solve? What scope does the Punchplatform cover? When is custom development necessary?
Punchlines
Info
A punchline is a data processing application represented by a graph of nodes. The punch provides a simple graph configuration language that can be used to represent different kinds of applications: stream or batch, simple extract-transform-load (ETL) or machine learning applications.
Have a look at the spark_punchline.yaml file in the $PUNCHPLATFORM_CONF_DIR/trainings/aim/track1 directory of your standalone. For now, focus only on the punchline itself.
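To give you an idea of what such a file contains, here is a minimal sketch of a punchline: a runtime declaration plus a dag of nodes wired together through published and subscribed streams. The node types, component names and settings below are illustrative placeholders, not the exact content of spark_punchline.yaml.
# Illustrative punchline sketch: placeholder node types and settings,
# not the exact content of spark_punchline.yaml.
runtime: spark
dag:
  - type: some_input_node      # e.g. a node reading files or Elasticsearch data
    component: input
    settings: {}
    publish:
      - stream: data
  - type: some_show_node       # e.g. a node printing the resulting dataset
    component: show
    subscribe:
      - component: input
        stream: data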
Execute it as a Spark Java application using the punchlinectl command-line tool:
punchlinectl start -p $PUNCHPLATFORM_CONF_DIR/trainings/aim/track1/spark_punchline.yaml
Now execute it as a Spark Python (PySpark) application:
punchlinectl start -p $PUNCHPLATFORM_CONF_DIR/trainings/aim/track1/pyspark_punchline.yaml
Tip
There are many benefits to representing applications as graphs:
- you do not write code anymore; you assemble ready-to-use nodes
- your application can be checked and audited before running in production
- data access is made dramatically easier by the many input/output nodes that read files, SQL databases, Kafka, Elasticsearch, etc.
- applications can easily be protected
Plans
Info
Often you need to run a punchline periodically to process successive ranges of data: for example, an application that runs every hour and detects the data that arrived during the last hour. This is a basic need. Instead of coding it on your own, the punch provides you with plans. A plan is a punchline scheduler.
Have a look at the plan.yaml file in the conf/tenants/training/channels/spark_plan folder, together with the hello_world.yaml file.
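Conceptually, a plan couples a cron-like schedule with a punchline template whose time-range variables are filled in at each run. The sketch below only conveys that idea; the key names are assumptions, not necessarily the exact schema used by plan.yaml.
# Illustrative plan sketch: key names are assumptions, not
# necessarily the exact schema of plan.yaml.
name: hourly-plan
model:
  cron: "0 * * * *"            # fire at the top of every hour
  from: "{{ one hour ago }}"   # placeholder: start of the processed range
  to: "{{ now }}"              # placeholder: end of the processed range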
Use the planctl command-line tool to start it. Notice that here we request it to run in the regular (Java-based) runtime.
Working directory
cd $PUNCHPLATFORM_CONF_DIR/trainings/aim/track1
Spark runtime
planctl start \
--plan plan.yaml \
--template spark_punchline.yaml
Plans can also submit applications to the Python Spark (PySpark) engine:
planctl start \
--plan plan.yaml \
--template pyspark_punchline.yaml
Tip
If you think about it, this is really magic: the same application can be submitted as either a Java or a Python application.
Why does the punch offer this feature? Not really because it makes sense from a runtime point of view: in practice you will set up punchlines leveraging Python-only or Java-only processing or I/O nodes, and those punchlines will only be compatible with a given runtime. That said, the same DAG model is used for all punch applications: stream, batch, Python or Java. This is a key strength of the punch.
Coming back to the example you just executed, it turns out that all the nodes of the sample hello punchline are available in both the Java and the Python punch node libraries. Hence the magic.
Question
Can you imagine three functional examples where a plan would be useful?
Working with Channels
Scheduling a punchline within a channel
Info
On a production platform, a Spark punchline must be highly available. You achieve that by declaring it as a scheduled application: this guarantees that your punchline will always be running. Should a server fail, the punchline will be rescheduled on another server.
Check out the standalone trainings tenant. It provides two examples: sparkplan and sparkpunchline.
Have a look at the channel_structure.yaml file; it is self-explanatory.
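For reference, a channel_structure.yaml that schedules a punchline typically declares an application entry handed over to a resilient scheduler such as Shiva. The sketch below conveys the idea; the keys shown are assumptions and may differ from the actual trainings examples.
# Illustrative channel_structure.yaml sketch: keys are assumptions and
# may differ from the actual trainings examples.
version: "6.0"
applications:
  - name: spark_punchline
    runtime: shiva         # the scheduler restarts the app on another server if one fails
    command: punchlinectl
    args:
      - start
      - -p
      - spark_punchline.yaml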
Use the channelctl command-line tool to query the status of, start, and stop these examples.
First query the status of all the trainings channels:
channelctl --tenant trainings status
Start a channel that includes a spark punchline:
channelctl --tenant trainings start --channel sparkpunchline
Check the channel status; it must be active:
channelctl --tenant trainings status
Then stop the channel and check its status again:
channelctl --tenant trainings stop --channel sparkpunchline
channelctl --tenant trainings status
Scheduling a plan within a channel
Info
Just like punchlines, plans must be executed in a resilient manner: Shiva or Kubernetes is in action to schedule, run and restart them whenever needed.
Execute the following sequence of actions:
channelctl --tenant trainings status
channelctl --tenant trainings start --channel sparkplan
channelctl --tenant trainings status
channelctl --tenant trainings stop --channel sparkplan
channelctl --tenant trainings status