
Punchlines

Abstract

The punchline is the base execution unit. It can represent stream or batch processing, and run as a storm, spark, pyspark or plain python application. This is powerful and elegant but can lead to some confusion. This essential chapter explains the key punchline concepts. Make sure you read it.

A punch application is represented as a directed acyclic graph (DAG) of steps called nodes. Each node can publish some values and subscribe to values published by other node(s). Think of a node as a function with input and output parameters. All that is expressed in a configuration file. A node can publish any type of data: tuples, numbers, datasets, machine-learning models, ...

You basically deal with three kinds of nodes, sketched just after this list:

  1. Input nodes in charge of fetching or receiving data from some sources
  2. Processing nodes where you apply some processing logic to your data
  3. Output nodes to save your processing results to some data sink.
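
For illustration only, here is a sketch of such a chain. The file_input and punchlet_node types appear later in this chapter; elasticsearch_output is an assumed output node name, and the node types actually available depend on your runtime:

dag: [
  # input node: fetch data from some source
  {
    type: file_input
    ...
  }
  # processing node: apply your processing logic
  {
    type: punchlet_node
    ...
  }
  # output node: save results to a data sink
  {
    type: elasticsearch_output
    ...
  }
]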

The overall structure of every punchline is identical. Here it is:

{
  version: "6.0"
  type: punchline
  runtime: spark
  dag: [
    ...
  ]
  settings: {
    ...
  }
}
Each property deserves a quick explanation. Before describing each, note first that a punchline is always executed in the scope of a per-tenant channel. That basically means that each punchline is identified by three key identifiers:

  1. the tenant
  2. the channel
  3. the name of the punchline file

These three identifiers are automatically attached to submitted punchlines. You do not need to (and in fact cannot) set these properties explicitly.

Version

Starting with the 6.0 release, a version identifier is required. This string property is used internally to ensure backward compatibility, as well as to validate your punchlines before submitting them for execution.

Important

If no version appears in your punchline, it will be assumed to be a 5.x topology or pml file and automatically converted. This facility is only meant to ease your migration; we strongly suggest you quickly adapt your configuration templates and files to embrace the new punchline format presented here.

Type

Every punch hjson or json file has a type field. This simply makes every file self-describing. Punchlines have (as expected) the punchline type.

Runtime

The punchline format is generic: stream or batch, spark or pyspark, etc. That said, most punchlines are only compatible with a given runtime environment. A runtime environment here does not mean the actual runtime engine used to run a punchline, but rather the class or type of runtime. Low-latency record-based processing, micro-batching, or batch processing? Data-parallel centric or task-parallel centric?

These are key characteristics, but they are highly technical and not easy to grasp. The punch makes the choice easier by providing only a few variants, listed here (a minimal header sketch follows the list):

  • storm: record-based processing. This runtime only supports streaming punchlines:
    • java based
    • task-parallel centric
    • providing end-to-end acknowledgements and at-least-once semantics
  • spark: spark compatible runtime:
    • java based
    • data-parallel centric
    • can deal with plain java spark or spark structured streaming capabilities
    • supports the spark dataframe API
    • SQL support
  • pyspark: spark python variant. Same capabilities as spark, but also supports python nodes, possibly mixed with java nodes
  • python: plain python applications:
    • used for non-spark applications. These can leverage your python libraries of choice: pandas, numpy, etc.
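
For instance, only the runtime property changes between otherwise similar punchline headers; the node types you may then use in the dag depend on that choice. A minimal sketch:

{
  version: "6.0"
  type: punchline
  # one of: storm | spark | pyspark | python
  runtime: storm
  dag: [
    ...
  ]
}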

Each runtime comes with specific settings and characteristics. As a rule of thumb:

  • The storm runtime environments are used for real-time stateless processing such as ETL, log shipping and parsing, or data routing.
  • The spark and pyspark engines are used for stateful and batch applications such as machine learning applications, or to compute aggregations. They can also be used for streaming use cases, in particular to benefit from spark SQL and machine learning streaming capabilities.
  • The python applications are used for machine learning or any simple application that does not require scaling. If your data is small enough, python applications can do the job perfectly, without the overhead of going through a distributed engine such as spark.

Important

The storm runtime environment does not mean a storm cluster is actually required. Storm punchlines can run in various engines, including the lightweight single-process punch proprietary engine. Similarly, the punch provides single-process variants of the spark and pyspark engines, best suited for small applications.

Dag Nodes

The dag array provides the list of nodes. Each node has the following format:

{
  type: punchlet_node
  component: punchlet_node
  settings: {
    ...
  }
  publish: [
    {
      stream: logs
      fields: [ ... ]
    }
  ]
  subscribe: [
    {
      component: file_input
      stream: logs
    }
  ]
}

I.e. a node subscribes to some data (from previous nodes) and publishes some data (for subsequent nodes). Input nodes only publish data, while output nodes only subscribe.

The data flows between nodes are identified by a stream identifier. Here, for example, the stream of data is called "logs".

Depending on the runtime environment (punch versus spark|pyspark versus python), the data traversing these streams can be of different natures: records, dataframes, vectors, etc. Refer to the documentation of each node type for configuration details.
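
To make the wiring concrete, here is a minimal illustrative dag chaining the node types shown above with an assumed elasticsearch_output node. Settings are omitted for brevity, and the component properties are left implicit (they default to the node type, as explained below):

dag: [
  {
    type: file_input
    publish: [
      {
        stream: logs
        fields: [ ... ]
      }
    ]
  }
  {
    type: punchlet_node
    subscribe: [
      {
        # refers to the component of the node above,
        # implicitly set to its type (file_input)
        component: file_input
        stream: logs
      }
    ]
    publish: [
      {
        stream: logs
        fields: [ ... ]
      }
    ]
  }
  {
    # assumed output node type, for illustration only
    type: elasticsearch_output
    subscribe: [
      {
        component: punchlet_node
        stream: logs
      }
    ]
  }
]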

type

The type of the node refers to one of the punch-provided nodes, or to your own if you provide it to the punch.

component

This property uniquely identifies your node in the dag. Often you have a single node of each type; in that case you may omit the component property, and it will be implicitly set to your node type.

When a node subscribes to another one, it refers to that component identifier.
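
For example, if two punchlet_node instances appear in the same dag, each needs an explicit and distinct component so that subscribers can tell them apart. A sketch, with arbitrary component names:

{
  type: punchlet_node
  component: enrichment
  ...
}
{
  type: punchlet_node
  component: filtering
  subscribe: [
    {
      # refers to the first punchlet node by its component name
      component: enrichment
      stream: logs
    }
  ]
}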

settings

Each node has its own specific settings. These are grouped in the settings dictionary.
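
As an illustration, a file_input node typically needs to know what to read. The path setting below is an assumption made for the sake of the example; refer to each node's documentation for its actual settings:

{
  type: file_input
  settings: {
    # hypothetical setting: where to read input data from
    path: /tmp/input.log
  }
  ...
}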

publish

Nodes can publish data to downstream nodes. Data is published onto so-called streams. A node can publish more than one stream. A typical example is to publish regular data, errors and alerts on distinct streams.
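
As a sketch, a processing node could publish its regular data on a logs stream and its failures on a separate errors stream (the stream names here are illustrative):

publish: [
  {
    # regular, successfully processed data
    stream: logs
    fields: [ ... ]
  }
  {
    # records that failed processing
    stream: errors
    fields: [ ... ]
  }
]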

subscribe

Inner nodes subscribe to data from previous nodes. Each subscription includes the previous node's component identifier.

Backward Compatibility

Old style topology and pml files prior to 6.0 are supported for backward compatibility. If you wish to easily translate them to the new format, use the punchlinectl convert option:

With the long options:

punchlinectl convert --hjson --in-place --punchline your_old_topology_or_pml.json

Or with the short options:

punchlinectl convert --hjson -i -p your_old_topology_or_pml.json

Without the -i option, the translated punchline is simply written to stdout.