Punch Node

The punch node executes one or several punchlets on the fly. It cannot be used to communicate with an external application: it is necessarily internal to a punchline. Its configuration looks like this:

{
  "type": "punchlet_node",
  "settings": {
    "punchlet": [
        "standard/common/input.punch",
        "standard/common/parsing_syslog_header.punch"
    ]
  },
  "component": "my_punch_bolt",
  "subscribe": [
      {
        "component": "kafka_input",
        "stream": "logs"
      }
  ],
  "publish": [
      {
        "stream": "logs",
        "fields": [
          "log", "_ppf_id", "_ppf_timestamp"
        ]
      },
      {
        "stream": "_ppf_errors",
        "fields": [
          "_ppf_id", "_ppf_error_document", "_ppf_error_message", "_ppf_timestamp"
        ]
      }
  ]
}

The key concept to understand is the relationship between the punchlet and the subscribed and published streams/fields. In this example your punchlet receives a punch tuple corresponding to each storm tuple arriving on the subscribed stream. When it returns, your punchlet will have produced an arbitrary punch tuple (depending on the punchlet logic), but the node will only emit the tuples/fields that match the publish section.
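For illustration, here is a minimal, hypothetical punchlet matching the configuration above (the "extra" field is made up for the sake of the example):

{
  // the subscribed storm tuple arrives as a punch tuple with
  // fields [logs][log], [logs][_ppf_id] and [logs][_ppf_timestamp]

  // "log" is listed in the publish section: this change is emitted downstream
  [logs][log] = "some updated value";

  // this field is produced by the punchlet but never emitted, because
  // "extra" does not appear in the published fields of the "logs" stream
  [logs][extra] = "dropped on emit";
}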

Info

Make sure you understand the stream and field fundamental concepts of input nodes and inner nodes.

The _ppf_errors stream and its fields are explained below in the dedicated error handling section.

Punchlets

The punchlet property refers to the punchlet(s) you want to execute in the node. Each entry is a path, relative by default to the $PUNCHPLATFORM_CONF_DIR/resources/punch folder.

Some punchlets require resource files, typically when they use the findByKey or findByInterval punch operators.
Others use Siddhi rules that must likewise be loaded. To add resource files to your punchlet, proceed as follows:

{
  "type": "punchlet_node",
  "settings": {
    "punchlet_json_resources": [
      "standard/apache_httpd/http_codes.json",
      "standard/apache_httpd/taxonomy.json"
    ],
    "punchlet_rule_resources": [
      "standard/common/detection.rule"
    ],
    "punchlet_grok_pattern_dirs": [
      "%{channel}%/patterns"
    ],
    "punchlet": [
      "standard/common/input.punch",
      "standard/common/parsing_syslog_header.punch",
      "standard/apache_httpd/parsing.punch",
      "standard/apache_httpd/enrichment.punch",
      "standard/apache_httpd/normalization.punch",
      "standard/common/geoip.punch"
    ]
  }
}
  • punchlet_json_resources (list): the required JSON resource files.
  • punchlet_grok_pattern_dirs (list): paths of the grok pattern folders.

    All files in these folders (not only those with the *.grok extension) will be registered as grok patterns.

  • punchlet_rule_resources (list): the required rule files.

    All of these will be loaded prior to punchlet execution. These properties must contain paths relative to the $PUNCHPLATFORM_CONF_DIR/resources/punch folder.

You can also store these files relative to the configuration directory, tenant directory or channel directory by using (respectively) the %{conf}%, %{tenant}% or %{channel}% placeholder. Note that %{conf}% is the default, so %{conf}%/standard/common/input.punch is equivalent to standard/common/input.punch. Punchlets can also be referenced by their absolute path: any path starting with a / is considered absolute. Note that placeholders only work if the tenant attribute is filled in the topology file.

For example, if your resource file (say) taxonomy.json is located under:

$PUNCHPLATFORM_CONF_DIR/tenants/mytenant/channels/apache/resources/taxonomy.json

Here is the configuration you must use:

{
  "type": "punchlet_node",
  "settings": {
    "punchlet_json_resources": [
      "%{channel}%/resources/taxonomy.json"
    ],
    "punchlet": [
      "%{conf}%/standard/common/input.punch",
      ...
    ]
  },
  ...
}

Warning

Use absolute paths with caution, preferably only for development, since these files will likely not be stored in your platform configuration directory.

Error Handling

A punchlet can raise an exception, either explicitly or because it encounters a runtime error. Most often you cannot afford to lose the input data, and must arrange to get it back together with the exception information and forward it to a backend for later reprocessing. Doing that on the punchplatform is easy: simply add an additional publish stream to make the node emit the error information into the topology:

{
  "type": "punchlet_node",
  "settings": {
    ...
  },
  "storm_settings": {
    "subscribe": [
      ...
    ],
    "publish": [
      {
        "stream": "logs",
        "fields": [
          "log",
          "_ppf_id",
          "_ppf_timestamp"
        ]
      },
      {
        "stream": "_ppf_errors",
        "fields": [
          "_ppf_id",
          "_ppf_error_document",
          "_ppf_error_message",
          "_ppf_timestamp"
        ]
      }
    ]
  }
}

The _ppf_errors stream and the _ppf_error_document, _ppf_error_message fields are reserved. This makes the node emit an error storm tuple into the topology, which you can handle the same way you handle regular data. It basically contains the exception message (which includes the input data). Because it is emitted just like any other data, you can arrange to have it forwarded to whatever final destination you need, to save it and reprocess it later: archiving, elasticsearch or any other.

Info

The generated error field is a ready-to-use JSON document. Most often you simply need to forward it and save it somewhere. If you would like to enrich or normalise its content in some way, simply deploy a punchlet node that subscribes to it; your punchlet will then be able to change its content. But in turn that punchlet should not fail. Bottom line: do this only if strictly required, and if so pay extra attention to write an error handling punchlet that can never fail.
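For instance, here is a sketch of such an error-processing node; the component name and the inline punchlet are illustrative only:

{
  "type": "punchlet_node",
  "component": "error_enrichment",
  "settings": {
    "punchlet_code": "{ /* enrich or normalise the error document; must never fail */ }"
  },
  "subscribe": [
    {
      "component": "my_punch_bolt",
      "stream": "_ppf_errors"
    }
  ],
  "publish": [
    {
      "stream": "_ppf_errors",
      "fields": [
        "_ppf_id", "_ppf_error_document", "_ppf_error_message", "_ppf_timestamp"
      ]
    }
  ]
}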

Additional fields can be published in the error stream. They can either be copied from the input stream (any field name is supported, as long as it is present in the subscribed stream) or generated by the Punch node:

  • _ppf_timestamp : the standard input timestamp (long number of milliseconds since 1/1/1970)
  • _ppf_id : the unique id (string) of the input document
  • _ppf_platform : the unique id of the punchplatform instance
  • _ppf_tenant : the tenant name of the current channel
  • _ppf_channel : the name of the channel containing the failed punchlet
  • _ppf_topology : the name of the topology containing the failed punchlet
  • _ppf_component : the component name of the PunchNode containing the failed punchlet
  • _ppf_error_message : the exception message or class that the punchlet raised at failure time.
  • _ppf_error_document : the JSON-escaped string document being processed when the punchlet failure occurred
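For illustration, here is what a saved error tuple carrying some of these fields might look like; every value below is made up:

{
  "_ppf_id": "a7f3c0de-0000-0000-0000-000000000000",
  "_ppf_timestamp": 1546300800000,
  "_ppf_tenant": "mytenant",
  "_ppf_channel": "apache",
  "_ppf_topology": "parsing_topology",
  "_ppf_component": "my_punch_bolt",
  "_ppf_error_message": "unexpected log format",
  "_ppf_error_document": "{\"logs\":{\"log\":\"the input data that failed\"}}"
}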

More than one Punchlet

It is extremely common to deploy a straight sequence of punchlets. You can do that using one PunchNode per punchlet, but a more efficient solution is to deploy all the punchlets as a sequence in the same PunchNode. This avoids extra serialisation between PunchNodes and can save considerable CPU resources. The way you do this is very simple:

{
  "type": "punchlet_node",
  "settings": {
    "punchlets": [
      "standard/common/input.punch",
      "standard/common/parsing_syslog_header.punch",
      "standard/sourcefire/parsing.punch",
      "standard/common/geoip.punch"
    ]
  },
  ...
}

If one of the punchlets in the sequence raises an exception, the following ones are skipped.

Dealing with Timers

Some punchlets need to be called regularly even if no input data is received. The typical example is a so-called stateful punchlet that needs to flush some internal data structure.

You can achieve that by requesting the punch node to periodically invoke your punchlet with a special empty tuple, referred to as a tick tuple. Here is how you make your punchlet receive an empty tuple every 5 seconds:

{
  "type": "punchlet_node",
  "settings": {
    "punchlet_tick_frequency" : 5,
    ...
  }
}
In your punchlet you can catch these empty tuples as simply as:

{
  if (root.isEmpty()) {
    // deal with your flush or expiration logic
    return;
  } else {
    // deal with a regular tuple
  }
}

Latency Tracking

Just like all nodes, you can subscribe to and publish the _ppf_metrics stream and _ppf_latency field to make your node part of the latency tracking path. Here is an example:

{
  "type": "punchlet_node",
  "settings": {
    ...
  },
  "subscribe": [
    {
      "component": "kafka_input",
      "stream": "logs"
    },
    {
      "component": "kafka_input",
      "stream": "_ppf_metrics"
    }
  ],
  "publish": [
    {
      "stream": "logs",
      ...
    },
    {
      "stream": "_ppf_errors",
      ...
    },
    {
      "stream": "_ppf_metrics",
      "fields": [
        "_ppf_latency"
      ]
    }
  ]
}

Info

The only special thing about the _ppf_metrics stream and _ppf_latency field is that they do not traverse your punchlets: you do not have to explicitly protect your punchlet code logic to ignore them. Make sure you understand the stream and field fundamental concepts of input nodes and inner nodes.

Multi Threading

It is very common to require high-performance punchlet chaining with several stages of punchlets, each stage executed using several threads. You can achieve this using the punchline dag by requesting several executors, if the corresponding runtime engine (nifi, storm or spark) supports it.

There is however a simpler and more efficient solution: the punch node can be configured to execute a multithreaded dag on its own. Here is an example:

{
  "type": "punchlet_node",
  "component": "punchlet_dag",
  "settings": {
    "punchlet_dag": [
      {
        "punchlet": [
          "standard/common/first_punchlet.punch",
          "standard/common/second_punchlet.punch"
        ],
        "executors": 5
      },
      {
        "punchlet": [
          "standard/common/third_punchlet.punch",
          "standard/common/fourth_punchlet.punch"
        ],
        "executors": 2,
        "affinity": "logs.log.id"
      }
    ]
  },
  "subscribe": [ .. ],
  "publish": [ .. ]
}

With this configuration:

  • the first and second punchlets are executed in 5 independent threads. Each thread runs the sequence of the two punchlets. The incoming node traffic is dispatched among these 5 threads.
  • the third and fourth punchlets are executed similarly in 2 threads. In addition, each pair of (third and fourth) punchlets receives the input traffic with a sticky affinity strategy based on the [logs][log][id] field value. That is, all tuples with the same [logs][log][id] value are directed to the same thread, so that the third and fourth punchlets can perform stateful processing such as correlation rules, or anything that requires related tuples to reach the same punchlet.

Info

This is really the same directed acyclic graph concept as the punchline one. The difference is simply that it is executed in a lightweight local thread engine instead of going from one punchline node to another. The benefit of this mode is to avoid extra tuple serialisation; the performance gain can be significant. The only limitation of such embedded dags is that they can only be executed as part of a single node, i.e. as part of a single process. This mode is thus perfect for high-performance single-process/multi-threaded punchlines.

Troubleshooting and Debugging

If you write punchlets, make sure you are aware of the many resources available to easily test them. Check the punch language documentation chapters.

A good trick to know, should you have issues with streams/fields not being emitted the way you expect, is to add a small punchlet to the chain that simply prints out and forwards the received data. This is easy to do by deploying an inline punchlet in a node you add to your topology:

{
  "type": "punchlet_node",
  "settings": {
    "punchlet_code": "{ print(root); }"
  },
  ...
}

Make sure you are aware of the following punch features; they dramatically ease your punchlet coding experience:

  • inline punchlets
  • sublime text editor
  • punch log injector
  • the kibana inline plugin to execute grok and punchlets
  • punchlinectl.sh
  • topologies log levels