A punchline is the base execution unit. It can represent stream or batch processing, and run as a storm, spark, pyspark or plain python application. This is very powerful and elegant but can lead to some confusion. This essential chapter explains the key punchline concepts. Make sure you read it.

A punch application is represented as a directed acyclic graph (DAG) structure. It allows you to combine three types of nodes:

  1. Input nodes, in charge of fetching or receiving data from some source
  2. Processing nodes, where you apply some processing logic to your data
  3. Output nodes, to save your processing results to some data sink

The overall structure of every punchline is identical. Here it is:

  version: "6.0"
  type: punchline
  tenant: mytenant
  channel: apache
  runtime: spark
  dag: [
    ...
  ]
  settings: {
    ...
  }
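As an illustration only (this is a hypothetical sketch, not the actual punch validator), checking the required top-level properties of a punchline represented as a plain Python dict could look like this:

```python
# Hypothetical sketch: validate the top-level structure of a punchline,
# represented here as a plain Python dict. NOT the actual punch validator.
REQUIRED = ["version", "type", "runtime", "dag"]

def check_punchline(conf: dict) -> list:
    """Return a list of human-readable problems (empty if the punchline looks sane)."""
    problems = [f"missing property: {key}" for key in REQUIRED if key not in conf]
    if conf.get("type") != "punchline":
        problems.append("type must be 'punchline'")
    if conf.get("runtime") not in ("storm", "spark", "pyspark", "python"):
        problems.append("unknown runtime")
    return problems

punchline = {
    "version": "6.0",
    "type": "punchline",
    "tenant": "mytenant",
    "channel": "apache",
    "runtime": "spark",
    "dag": [],
}
print(check_punchline(punchline))  # []
```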

Each property deserves some explanation.


Version

Starting with the 6.0 release, a version identifier is required. This string property is used internally to ensure backward compatibility as well as to validate your punchlines before submitting them for execution.


If no version appears in your punchline, it will be assumed to be a 5.x topology or pml file and automatically converted. This facility is only meant to ease your migration; we strongly suggest you quickly adapt your configuration templates and files to embrace the new punchline format presented here.


Tenant

The tenant property indicates to which tenant a punchline belongs. On a production system, punchlines are part of a channel, itself part of a tenant. The tenant is used for security, to prevent a punchline from accessing other tenants' data, as well as to tag all sorts of generated monitoring data (events, kpis, logs, etc.).


Channel

Similarly, a punchline is executed in the scope of a channel. A channel is simply an administrative grouping of one or several punchlines. Once grouped as part of a channel, you can start or stop your punchlines at once.

Again the channel name is used to tag the many metrics and logs generated by a punchline. This in turn is key to properly monitor your channels.


In development mode, you can omit the tenant and channel properties. If you do that your punchline will run using the 'default' tenant and channel identifiers.
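This defaulting rule can be sketched as follows (hypothetical helper, not a punch API):

```python
# Hypothetical sketch of the development-mode defaulting rule:
# a missing tenant or channel falls back to the 'default' identifier.
def resolve_scope(conf: dict) -> tuple:
    return (conf.get("tenant", "default"), conf.get("channel", "default"))

print(resolve_scope({"tenant": "mytenant", "channel": "apache"}))  # ('mytenant', 'apache')
print(resolve_scope({}))                                           # ('default', 'default')
```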


Type

Every punch hjson or json file has a type field. This simply makes every file self-describing. Punchlines have (as expected) the punchline type.


Runtime

The punchline format is generic: stream or batch, spark or pyspark, etc. That said, most punchlines are only compatible with a given runtime environment. A runtime environment here does not identify the actual runtime engine used to run a punchline, but rather the class or type of runtime: low-latency record-based processing, micro-batching, or batch processing? Data-parallel centric or task-parallel centric?

These are key and crucial characteristics, but they are highly technical and not easy to grasp. The punch makes the choice easier by providing only a few variants. Here they are:

  • storm: record-based processing. This runtime only supports streaming punchlines:
    • java based
    • task-parallel centric
    • provides end-to-end acknowledgements and at-least-once semantics
  • spark: spark compatible runtime
    • java based
    • data-parallel centric
    • can deal with plain java spark or spark structured streaming capabilities
    • supports the spark dataframe api
    • SQL support
  • pyspark: the spark python variant. Same capabilities as spark, but additionally supports python nodes, possibly mixed with java nodes
  • python: plain python applications
    • used for non-spark applications. These can leverage your python libraries of choice: pandas, numpy, etc.

Each runtime comes with specific settings and characteristics. As a rule of thumb:

  • The storm runtime environment is used for real-time stateless processing such as ETL, log shipping and parsing, or data routing.
  • The spark and pyspark engines are used for stateful and batch applications such as machine learning applications, or to compute aggregations. They can also be used for streaming use cases, in particular to benefit from spark SQL and machine learning streaming capabilities.
  • The python runtime is used for machine learning or any simple application that does not require scaling. If your data is small enough, a python application can do the job perfectly without the overhead of a distributed engine such as spark.


The storm runtime environment does not mean a storm cluster is actually required. Storm punchlines can run on various engines, including the lightweight single-process punch proprietary engine. Similarly, the punch provides single-process variants of the spark and pyspark engines, best suited for small applications.

Dag Nodes

The dag array provides the list of nodes. Each node has the following format:

  {
    type: punchlet_node
    component: punchlet_node
    settings: {
      ...
    }
    publish: [
      {
        stream: logs
        fields: [ ... ]
      }
    ]
    subscribe: [
      {
        component: file_input
        stream: logs
      }
    ]
  }

I.e. a node subscribes to some data (from previous nodes) and publishes some data (for subsequent nodes). Input nodes only publish data, while output nodes only subscribe to data.

The data flows between nodes are identified by a stream identifier. Here for example the stream of data is called "logs".

Depending on the runtime environment (punch versus spark|pyspark versus python), the data traversing these streams can be of a different nature: records, dataframes, vectors, etc. Refer to the documentation of each node type for configuration details.
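To illustrate, here is a sketch of a complete minimal streaming punchline. The file_input and punchlet_node types are the ones used in this chapter; the field name shown is illustrative only:

  version: "6.0"
  type: punchline
  runtime: storm
  tenant: mytenant
  channel: apache
  dag: [
    {
      type: file_input
      publish: [
        {
          stream: logs
          fields: [ log ]
        }
      ]
    }
    {
      type: punchlet_node
      subscribe: [
        {
          component: file_input
          stream: logs
        }
      ]
    }
  ]

Note that the component properties are omitted: each node's component then defaults to its type, which is why the punchlet_node subscription can refer to file_input directly.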


Type

The type of the node refers to one of the punch-provided nodes, or to your own node if you provide it to the punch.


Component

This property uniquely identifies your node in the dag. Often you have a single node of each type; in that case you may omit the component property, and it will be implicitly set to your node type.

When a node subscribes to another one, it refers to that component identifier.
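The component defaulting and subscription wiring just described can be sketched as follows (hypothetical helper, not punch code):

```python
# Hypothetical sketch: default a node's component to its type, then check
# that every subscription refers to a declared component.
def wire_dag(dag: list) -> list:
    """Return the (publisher, subscriber) edges of the dag."""
    for node in dag:
        node.setdefault("component", node["type"])  # component defaults to type
    components = {node["component"] for node in dag}
    edges = []
    for node in dag:
        for sub in node.get("subscribe", []):
            if sub["component"] not in components:
                raise ValueError(f"unknown component: {sub['component']}")
            edges.append((sub["component"], node["component"]))
    return edges

dag = [
    {"type": "file_input", "publish": [{"stream": "logs"}]},
    {"type": "punchlet_node", "subscribe": [{"component": "file_input", "stream": "logs"}]},
]
print(wire_dag(dag))  # [('file_input', 'punchlet_node')]
```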


Settings

Each node has its own specific settings. These are grouped in the settings dictionary.


Publish

Nodes can publish data to downstream nodes. Data is published onto so-called streams. A node can publish more than one stream: a typical example is to publish errors, regular data and alerts on distinct streams.
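For example (the stream and field names here are illustrative), a processing node could publish its correctly processed records and its failed ones on two distinct streams:

  publish: [
    {
      stream: logs
      fields: [ log ]
    }
    {
      stream: errors
      fields: [ error ]
    }
  ]

Downstream nodes then subscribe only to the stream they need.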


Subscribe

Inner nodes subscribe to data from previous nodes. Each subscription includes the previous node's component identifier.

Backward Compatibility

Old-style topology and pml files prior to 6.0 are supported for backward compatibility. If you wish to easily translate them to the new format, use the punchlinectl convert option:

With the long options:

  punchlinectl convert --hjson --in-place --punchline your_old_topology_or_pml.json

or with the short options:

  punchlinectl convert --hjson -i -p your_old_topology_or_pml.json

Without the -i option the translated punchline is simply written to stdout.