Track 1 Spark/Pyspark Punchline Fundamentals

Abstract

This track explains the punch analytics and machine learning capabilities. What issues does it solve? What perimeter does the Punchplatform cover? At what point is a custom development necessary?

Punchlines

Info

A punchline is a data processing application represented by a graph of nodes. The punch provides a simple graph configuration language that can be used to represent many different kinds of applications: stream or batch, simple extract-load-transform or machine learning apps.

Have a look at the hello_world.punchline in the conf/tenants/training/channels/spark_punchline folder of your standalone. Forget for now that it is part of a channel folder and concentrate only on the punchline itself.
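To fix ideas, the sketch below shows the general shape of such a file: a dag of nodes wired together through published and subscribed streams. It is purely illustrative (the node types, component names and settings are assumptions, not the actual hello_world.punchline content); the file in your standalone is authoritative.

```hjson
{
  # illustrative sketch only, not the actual hello_world.punchline
  dag: [
    {
      # a hypothetical input node publishing a stream of data
      type: batch_input
      component: input
      publish: [ { stream: data } ]
    }
    {
      # a hypothetical node printing whatever it receives
      type: show
      component: show
      subscribe: [ { component: input, stream: data } ]
    }
  ]
}
```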

Execute it as a spark java application using the punchlinectl command line tool:

punchlinectl start -p examples/punchlines/hello_world.punchline -r spark
punchlinectl start -p resources/track1/punchlinectl/hello_world.punchline -r spark

Execute it now as a spark python application:

punchlinectl start -p resources/track1/punchlinectl/hello_world.punchline -r pyspark

Tip

There are many benefits to representing applications as graphs:

1. you do not write code anymore, you assemble ready-to-use nodes
2. your application can be checked and audited before running in production
3. data access is made dramatically easier using the many input/output nodes to read files, sql databases, kafka, elasticsearch etc.
4. applications can be protected without impact to you

Plans

Info

Often, you need to periodically run a punchline to process some range of data: for example, an application that runs every hour and detects the data that arrived during the last hour. This is a basic need. Instead of coding this on your own, the punch provides you with plans. A plan simply schedules a punchline.

Have a look at the plan.hjson in the conf/tenants/training/channels/spark_plan folder, together with the hello_world.punchline.
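As an illustration of what a plan file can look like, here is a minimal sketch. Every field below (the cron expression, the template variable names) is an assumption made for the example; refer to the actual plan.hjson in your standalone for the exact schema.

```hjson
{
  # illustrative sketch only: check plan.hjson for the real schema
  name: hello_world_plan
  model: {
    # hypothetical cron expression: trigger the punchline every hour
    cron: "0 * * * *"
  }
}
```

The idea to retain is that the plan holds the scheduling logic (when and with which parameters to run), while the punchline template holds the processing logic.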

Use the planctl command line tool to start it. Notice here we request it to run in the regular (java-based) runtime.

planctl start --plan resources/track1/planctl/plan.hjson \
        --template resources/track1/planctl/hello_world.punchline \
        -r spark

Plans can also submit applications to a python spark engine.

planctl start --plan resources/track1/planctl/plan.hjson \
        --template resources/track1/planctl/hello_world.punchline \
        -r pyspark

Tip

If you think about it, this is really magic: the same application can be submitted as a java or python app. Why does the punch offer this feature? Not really because it makes sense from a runtime point of view. In reality you will set up punchlines leveraging python-only or java-only processing or IO nodes, and your punchlines will then only be compatible with a given runtime. That said, the same dag model is used for all punch applications: stream, batch, python or java. This is a key strength of the punch.

Coming back to the example you just executed, it turns out that all the nodes in the sample hello punchline are available in both the java and python punch node libraries. Hence the magic.

Question

Can you imagine three functional examples where a plan would be useful?

Working with Channels

Add a Punchline to a channel

Info

On a production platform a spark punchline must be properly scheduled to run. You do that by declaring it as a shiva scheduled application. That guarantees your punchline will always be running: should a server fail, it will be properly rescheduled on another server.

Check out the conf/tenants/training/channels/spark_punchline channel. It illustrates how to define a channel that includes a single punchline application. Have a look at the channel_structure.json file, it is self explanatory.
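For illustration only, a channel_structure.json declaring a single shiva-scheduled punchline application roughly looks like the sketch below. The field names, cluster name and arguments are assumptions made for this example; the real file in your standalone is authoritative.

```json
{
  "version": "6.0",
  "applications": [
    {
      "name": "hello_world",
      "runtime": "shiva",
      "cluster": "common",
      "command": "punchlinectl",
      "args": ["start", "--punchline", "hello_world.punchline"]
    }
  ]
}
```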

Use now the channelctl tool to start and stop that channel:

# step 1 check nothing is running
channelctl --tenant training status

# step 2 start the channel that includes the spark punchline
channelctl start --tenant training --channel spark_punchline

# step 3 check the status, it must be active
channelctl --tenant training status

# step 4 stop your channel
channelctl --tenant training stop --channel spark_punchline

# step 5 and make sure it is indeed stopped
channelctl --tenant training status

Repeat the same sequence with the pyspark_punchline channel.

Add a Plan to a channel

Info

Just like punchlines, plans must be executed in a resilient manner. Plans can also be declared as part of a shiva application.

Check out the conf/tenants/training/channels/spark_plan channel. It illustrates how to define a channel that includes a single plan application. Have a look at the channel_structure.json file, it is self explanatory.

Execute the following sequence of actions:

# step 1 check nothing is running
channelctl --tenant training status

# step 2 start the channel that includes the spark plan
channelctl start --tenant training --channel spark_plan

# step 3 check the status, it must be active
channelctl --tenant training status

# step 4 stop your channel
channelctl --tenant training stop --channel spark_plan

# step 5 and make sure it is indeed stopped
channelctl --tenant training status

Repeat the same sequence with the pyspark_plan channel.