
Enablers

Abstract

Providing a complete list of punch features is difficult. First, there are many. Second, they differ in nature: some are low-level (e.g. a back pressure algorithm), others high-level (e.g. archive data housekeeping). All are essential to build a complete industrial solution.

The punch value is therefore best described in terms of what it enables you to solve. For that reason this chapter lists enablers. For each we provide its rationale, and we list alternatives should you need to solve the same problem on your own.

This chapter will help you quickly determine if and how the punch can help you. Refer to the feature list for a more compact, easy-to-scan list of features.

Overview

Whatever your use case (cybersecurity, IoT, system or application monitoring), it is common to assemble the same functions over and over again. This can be depicted as follows:

image

This does not mean that you always need all these functions. It is just likely that you need a fair amount of them. The punch has been designed to deliver scalable and resilient components that are easy to assemble. You can use them on tiny systems or on large scale big data platforms.

Before we describe each enabler in detail, it helps to have a quick understanding of their value depending on your target use case and environment. We will consider four illustrative environments:

  • Edge : refers to small platforms from one to 10 servers. These must be highly autonomous and cheap to build, deploy and operate. They run on limited hardware resources yet require advanced stream processing capabilities, potentially including machine learning.
  • DataCenter : refers to large scale on-premise platforms of up to several hundred servers. Besides their size and resources, these are managed by human operators on a daily basis. Although they run critical services, they must be easy to upgrade, adapt, and enrich with new services.
  • Socs : refers to Security Operation Centers. See a Soc as a DataCenter-like platform focused on providing cybersecurity detection and forensic capabilities. Some Socs, however, must run on limited hardware so as to monitor small systems.
  • Clouds : refers to cloud and cloud-native environments where applications and resources need to be monitored, either for security concerns or for tracking resource and application usage. Most often for both.

| Punch Enabler | Edge | DataCenter | Socs | Clouds |
|---|---|---|---|---|
| Search and Analytics Data Engine | * | *** | *** | *** |
| Pipelines | *** | *** | *** | *** |
| Stream Pipeline Processor | * | *** | *** | *** |
| Batch Pipeline Processor | ** | *** | *** | *** |
| Punch Light Processor | *** | ** | ** | ** |
| Channels | *** | *** | *** | *** |
| IO Connectors | *** | *** | *** | *** |
| Data Shipper Connectors | *** | ** | ** | *** |
| Punch Language | *** | *** | *** | *** |
| Multi Tenancy | ** | *** | *** | *** |
| Rule Engine | ** | ** | *** | ** |
| Long Term Storage | - | *** | *** | *** |
| Short Term Storage | *** | *** | *** | *** |
| Deployer | *** | ** | ** | ** |

Search and Analytics Data Engine

  • aggregate
  • correlate
  • search

The Punch provides a distributed search and analytics data engine. That part is fully built on top of Elasticsearch.

Rationale

Elasticsearch offers key benefits. It is simple to set up and operate, while providing scalability and resiliency from small to large systems. It provides near real time operations. It makes it quick and easy to build applications for a variety of use cases, in a way drastically simpler than with a Hadoop stack or a more specialised stack.

Elasticsearch comes with a variety of tools to visualise, collect, aggregate and process the data. Last, it can be enriched with additional machine learning capabilities using various connectors to batch and streaming frameworks such as Spark and Storm.

Use Cases

  • log analytics
  • full-text search
  • security intelligence
  • business analytics
  • operational intelligence

Alternatives

| Alternative | Details |
|---|---|
| elasticsearch | There are now several variants of open source elasticsearch releases. Using elasticsearch on your own requires you to provide multi-tenancy, data housekeeping, log parsing, etc. yourself. |
| elastic suite | The paid license offers additional features and services such as reporting, security, machine learning and alerting. Note that the punch team recommends the use of an X-Pack license in combination with the punch. |
| splunk | Splunk provides the same excellent ease-of-use, real-time operations and data visualisation capabilities. |
| fluentd | Although only focused on log management, there are a number of open source, managed or commercial variants that are built on top of elasticsearch. Fluentd is an excellent one. |
| graylog | Another well known log management offer. |

Pipelines

  • collect, transport, route
  • transform, parse, enrich
  • archive, extract, replay
  • learn, detect, alert

Besides a data analytics database, most of your effort will be devoted to designing data processing applications to parse, filter, route and enrich your data. Others will be required to aggregate, detect or reprocess your data. The punch proposes a dataflow programming tool to design such arbitrary data processing applications using a model-driven approach. Each application is represented as a Directed Acyclic Graph (DAG). We refer to these DAGs as pipelines. Using pipelines, users can design arbitrary stream or batch applications.

Each pipeline groups several processing nodes. The simplest pipelines allow you to design input-processing-output patterns, such as the one you need for a log management solution. You can however model much more sophisticated graphs, not only simple sequences. The punch provides a number of ready-to-use nodes (listed separately), and allows you to add your own.
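
To make the DAG idea concrete, here is a minimal Python sketch. It is not punch code: the node names, the `run_pipeline` helper and the sample record are all hypothetical, and real pipelines are described in configuration files rather than written this way. The sketch runs each node once all of its upstream nodes have produced a result, which is what lets a graph fan out and join instead of following a simple sequence.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical nodes: each takes a dict of upstream results and returns its own result.
NODES = {
    "input": lambda ins: ins["record"],
    "parse": lambda ins: dict(kv.split("=", 1) for kv in ins["input"].split()),
    "enrich": lambda ins: {**ins["parse"], "site": "edge-1"},
    "count": lambda ins: len(ins["parse"]),
    "output": lambda ins: {"event": ins["enrich"], "fields": ins["count"]},
}

# Each node maps to the set of nodes it reads from. "parse" fans out to
# "enrich" and "count", which are joined again in "output".
EDGES = {
    "parse": {"input"},
    "enrich": {"parse"},
    "count": {"parse"},
    "output": {"enrich", "count"},
}


def run_pipeline(nodes, edges, record):
    """Run one record through the graph, visiting nodes in dependency order."""
    results = {}
    for name in TopologicalSorter(edges).static_order():
        upstream = edges.get(name, set())
        # Source nodes (no upstream) receive the raw record instead.
        inputs = {u: results[u] for u in upstream} if upstream else {"record": record}
        results[name] = nodes[name](inputs)
    return results
```

Calling `run_pipeline(NODES, EDGES, "user=bob action=login")` returns the result of every node, so the value under `"output"` holds both the enriched event and the field count.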

Rationale

Having processing pipelines delivered as mere configuration files provides your system with auditability, security and ease-of-use.

Quoting the Storm Flux documentation: "Bad things happen when configuration is hard-coded. No one should have to recompile or repackage an application in order to change configuration."

Use Cases

To help you select the right processor, here is a quick overview of typical punch pipeline use cases.

Using Spark Streaming processor:

  • Accurate counts
  • Windowing Aggregations
  • Progressive Analysis
  • Continuous Machine Learning

Using Storm or PunchLightProcessor processors:

  • Log management pipelines
  • Site to site data transport
  • Real-time alerting and complex event processing rules

Using Spark batch processor:

  • Machine learning
  • Data extractions and Aggregations

Features

  • reliable : punch pipelines provide the data acknowledgement required to achieve exactly-once or at-least-once data processing
  • monitored : pipeline metrics are collected by the punch integrated monitoring (see Integrated Monitoring)
  • scalable : from single process to distributed multi-process deployment. Scalability is provided by processors
  • binary or textual data : pipelines can process both binary (images, ...) and textual (json, xml, logs, ...) data
  • extensible : pipelines can be enriched with your own nodes and modules
  • autonomous : through various backpressure configurations, pipelines stay stable under varying traffic or data ingestion conditions
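
As an illustration of the backpressure idea behind the "autonomous" feature, here is a minimal Python sketch of a bounded stage. It is not the punch algorithm: the `BoundedStage` class and its capacity are hypothetical. The point it shows is that a stage with a fixed-size queue refuses new records when full, so the upstream producer has to slow down instead of the stage running out of memory.

```python
import queue


class BoundedStage:
    """A processing stage that accepts at most `capacity` pending records."""

    def __init__(self, capacity):
        self._pending = queue.Queue(maxsize=capacity)

    def offer(self, record):
        """Return True if the record was accepted, False if the caller must back off."""
        try:
            self._pending.put_nowait(record)
            return True
        except queue.Full:
            return False

    def drain(self, handler):
        """Process every record currently queued and return how many were handled."""
        handled = 0
        while True:
            try:
                record = self._pending.get_nowait()
            except queue.Empty:
                return handled
            handler(record)
            handled += 1
```

A producer that receives `False` from `offer` would typically wait or stop reading from its own input, which is how the slowdown propagates back to the source.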

Alternatives

| Alternative | Details |
|---|---|
| flux | Storm specific. Similar to punch topologies. |
| streamsets | A system for creating, executing and operating continuous dataflows that connect various parts of your data infrastructure. |
| nifi | A popular flow automation tool. |
| logstash | Limited to simple log management pipelines. |
| cdap | An open source framework for building data analytic applications. |
| mlflow | MLflow is a framework that supports the machine learning lifecycle. It has components to monitor your model during training and running, store models, load a model in production code and create a pipeline. |
| apache beam | Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. |

Pipeline Processors

  • host and run pipelines

To run pipelines, the punch provides you with four possible processors. A processor is simply the component in charge of running your pipelines.

| Technology | Description |
|---|---|
| Storm | Stream processing engine in charge of running pipelines referred to as topologies. |
| Spark | Batch processing engine in charge of running batch pipelines, in particular the ones running machine learning modules. |
| Spark Streaming | Stream processing engine in charge of running continuous pipelines, in particular the ones running machine learning modules. |
| Punch | Punch provides a lightweight pipeline engine. It offers a minimal yet performant single-process engine to run pipelines on constrained environments such as edge platforms. It is also well suited to deploying pipelines on dockerized platforms. |

Rationale

Running large scale distributed pipelines requires state-of-the-art yet robust stream and batch frameworks. The punch aims at making it simple and fast to leverage key technologies, without the burden of developing applications on your own, or the difficulty of identifying and investing in a particular technology.

| Alternative | Details |
|---|---|
| kafka stream | Does not provide a runtime framework, only libraries. |
| h2o | Dedicated to machine learning. |
| flink | A storm and kafka stream competitor. |
| tibco streambase | A CEP platform for rapidly building and deploying applications that analyze and act on real-time streaming data. |
| ibm streams | IBM stream processing platform that can ingest, filter, analyze and correlate massive volumes of continuous data streams. |
| google dataflow | A cloud-based data processing service for both batch and real-time data streaming applications. It builds on and extends the MapReduce model. |
| apache samza | Samza is a stream processing framework. See it as a higher level stream framework built on top of Kafka. |
| azure stream analytics | Microsoft serverless scalable complex event processing engine. |
| amazon kinesis | Amazon stream engine that integrates with other amazon services such as Redshift, DynamoDB and Simple Storage Service (Amazon S3). |
| twitter heron | Twitter's stream processing engine, designed as a storm-compatible successor to the original storm framework. |

Channels

  • end-to-end application composition
  • configuration
  • monitoring
  • user commands

Channels are a unique punch feature. Using channels you can assemble several pipelines into a single application. You are then offered three basic commands to start, stop or reload your application. A channel can group both stream pipelines to ingest live data and batch pipelines to perform various batch-only tasks: machine learning, alerting, aggregating, etc.

Channels also offer a key feature: they can span across one or several Kafka hops.

Rationale

Real production systems run tens of pipelines. Some are streaming, others batch. Users and operators must be offered a logical and management view of these pipelines, grouped by tenant and by applicative role.


Multi-Tenancy

  • security
  • resource mutualisation

From the ground up, the punch is designed with a multi-tenant architecture. It allows you to design and deploy your architecture with end-to-end multi-tenant isolation. Multi-tenancy is achieved through a number of lower-level punch features: monitoring, pipeline configuration, and the naming scheme and management of kafka topics, elasticsearch indices and ceph pools.

It is a punch feature in that the punch deployer and the various administration services help you configure and operate your tenants.
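
The role of a naming scheme in tenant isolation can be sketched in a few lines of Python. This is an assumption-laden illustration: the `<tenant>-<channel>-<kind>` pattern and both helper functions are hypothetical and are not the actual punch convention. What it shows is that once every topic, index or pool name starts with the tenant identifier, ownership checks and per-tenant access rules reduce to a prefix test.

```python
import re

# Hypothetical rule: components are lowercase alphanumerics, so "-" can only be a separator.
_VALID_COMPONENT = re.compile(r"^[a-z][a-z0-9]*$")


def resource_name(tenant, channel, kind):
    """Build a tenant-scoped resource name such as a topic, index or pool name."""
    for part in (tenant, channel, kind):
        if not _VALID_COMPONENT.match(part):
            raise ValueError(f"invalid name component: {part!r}")
    return f"{tenant}-{channel}-{kind}"


def belongs_to(tenant, name):
    """Return True if the resource name is owned by the given tenant."""
    return name.startswith(tenant + "-")
```

Forbidding the separator inside components matters: it guarantees that a tenant named `acme` never matches resources of a tenant named `acmecorp`.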

Rationale

This is a must-have feature for most production systems. It allows you to mutualise key expensive services and components such as big data technologies (Kafka or Elasticsearch clusters) while providing safe and secure isolation between several customers, domains or zones.

Alternatives

This usually requires ad-hoc configurations of the technologies at play. This requires extensive work.


IO Connectors

The punch provides you with a set of ready to use input-output connectors. They come as configurable pipeline nodes. They can ingest textual or binary data.

  • Azure Blob Storage : connectors to read data from azure blob storage, including nsg data.
  • Kafka : connectors to kafka are provided off the shelf.
  • Ceph : Storm connectors to Ceph are provided natively. This lets you set up archiving services.
  • Sockets : TCP, UDP, SSL and Lumberjack socket connectors are built-in.
  • Files : files and folders can be consumed directly.
  • Snmp : Storm pipelines can ingest snmp traps natively.
  • Netflow : Storm pipelines can ingest netflow data natively.
  • Elasticsearch : reading and writing data from/to elasticsearch is a key punch feature. In particular some pipelines can consume massive data from elasticsearch through spark and the elasticsearch hadoop connectors.

Rationale

Leveraging performant and reliable input-output connectors is a critical requirement. These provide the backbone on top of which you will construct your applications. Not losing a single data record and ensuring at-least-once or exactly-once semantics directly depend on your connectors.

Alternatives

Most platforms or frameworks provide you with IO connectors. It usually requires lots of testing to ensure they provide the required level of robustness, acknowledgement protocol support, and integration inside

On the other hand, a number of open source components are available. Most miss the required level of configurability and monitoring to be deployed as is.

We recommend you only use components that are actually used in production on critical systems, not just on dedicated ad hoc platforms.


Data Shippers

Some IO connectors deserve particular attention. They allow you to forward data between distant sites with four key functions:

  • load-balancing : multi-destination, multi-room or multi-site, to automatically switch the data forwarding to secondary destinations
  • encryption and compression : to protect data in transit and reduce its volume on limited bandwidth environments
  • end-to-end acknowledgement : so as to ensure data retransmission in case of failure
  • local buffering : so as to cope with network interruptions.
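
How failover, acknowledgement and local buffering fit together can be sketched in Python as follows. This is a simplified illustration, not the punch shipper: the `Shipper` class is hypothetical, destinations are modelled as plain callables that return `True` when the record is acknowledged, and encryption and compression are left out.

```python
from collections import deque


class Shipper:
    """Forward records to the first destination that acknowledges them."""

    def __init__(self, destinations):
        # Ordered list of callables send(record) -> bool, primary first.
        self.destinations = list(destinations)
        # Local buffer holding every record that is not yet acknowledged.
        self.buffer = deque()

    def submit(self, record):
        self.buffer.append(record)
        return self.flush()

    def flush(self):
        """Send buffered records in order; return False if all destinations failed."""
        while self.buffer:
            record = self.buffer[0]
            # any() stops at the first destination that acknowledges the record,
            # so secondary destinations are only tried when earlier ones fail.
            if not any(send(record) for send in self.destinations):
                return False  # keep the record buffered for a later retry
            self.buffer.popleft()  # drop it only once acknowledged
        return True
```

Because a record leaves the buffer only after an acknowledgement, a network interruption delays delivery but does not lose data. The price is possible duplicates if an acknowledgement is lost, which is the usual at-least-once trade-off.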

Rationale

These are all mandatory functions for edge data collectors.


Punch Language

  • parsers
  • data routing and filtering
  • arbitrary data processing
  • logstash-like operators : dates, grok, csv, json, key-value, loops

Inside your pipeline you can deploy so-called punchlets. These leverage a compact programming language that provides a number of operators to manipulate your data.
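
To give a feel for the kind of work a punchlet does, here is a Python stand-in for a small key-value parser with date normalisation. It is deliberately written in Python, not in the punch language, and the function names and the `ts` field are hypothetical. In the punch language the same steps would rely on the built-in key-value and date operators.

```python
from datetime import datetime, timezone


def parse_kv(line, pair_sep=" ", kv_sep="="):
    """Split a 'key=value key=value' log line into a dictionary."""
    fields = {}
    for token in line.split(pair_sep):
        key, sep, value = token.partition(kv_sep)
        if sep:  # ignore tokens that carry no key-value separator
            fields[key] = value
    return fields


def normalise(line):
    """Parse a log line and convert its epoch 'ts' field to an ISO 8601 UTC date."""
    doc = parse_kv(line)
    if "ts" in doc:
        moment = datetime.fromtimestamp(int(doc["ts"]), tz=timezone.utc)
        doc["ts"] = moment.isoformat()
    return doc
```

For example, `normalise("ts=0 user=bob action=login")` yields a document whose `ts` field is `1970-01-01T00:00:00+00:00`.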

Alternatives

| Alternative | Details |
|---|---|
| logstash | The punch language is inspired by logstash filters but is a real imperative language, and is more compact (up to 5 times). It provides the same operators. |
| splunk | Splunk provides a powerful pipeline-style language with many similar operators. |
| graylog | Graylog provides a more verbose logstash-style language, specific to log management. |
| custom developments | Write your own processing nodes or applications. You will lack the support of built-in operators, and of course you will be tied to the apis and frameworks you select. |

Rule Engine

  • alert
  • correlation

Siddhi rules are supported and embedded natively in the punch language. A storm node is also provided to use it independently of the punch language. Refer to the CEP guide for details.
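
A typical correlation rule raises an alert when the same source produces too many events within a time window. The Python sketch below illustrates that logic only. It is not Siddhi syntax, and the `ThresholdRule` class, its threshold and its window are hypothetical. In the punch such a rule would be written as a Siddhi query.

```python
from collections import defaultdict, deque


class ThresholdRule:
    """Fire when one key is seen at least `threshold` times within `window_seconds`."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window = window_seconds
        self._seen = defaultdict(deque)  # key -> timestamps still inside the window

    def on_event(self, key, timestamp):
        """Record one event and return True if the rule fires for this key."""
        times = self._seen[key]
        times.append(timestamp)
        # Slide the window: forget events older than `window` seconds.
        while times and timestamp - times[0] > self.window:
            times.popleft()
        return len(times) >= self.threshold
```

With `ThresholdRule(3, 60)` fed failed-login events keyed by source address, the third failure from the same address within one minute returns `True`, while failures from other addresses are counted separately.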

Alternatives

| Alternative | Details |
|---|---|
| esper | Esper is a language, compiler and runtime for complex event processing (CEP) and streaming analytics, available for Java as well as for .NET. Watch out for its licensing. |
| wso2 | The siddhi library used in the punch comes from the excellent wso2 product. The way siddhi is integrated into Storm is very similar to their architecture. |
| splunk | Using the splunk language you can implement correlation rules on your own. |
| qradar | IBM qradar provides an event correlator and rule engine as part of its siem solution. |
| custom developments | Stream processing frameworks allow you to write your own CEP rules using stream SQL variants. |

Archiving Service

  • archiving
  • data lifecycle : hot cold frozen delete

The punch ships with pipelines dedicated to providing you with an archiving service. This service provides:

  • Compact and cost-effective Data Archiving and Indexing : choose from various replication levels.
  • Extraction and data replay services.
  • Housekeeping : to automatically purge your old data.

The archiving service runs on top of the punch long term storage.
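
The hot, cold, frozen, delete lifecycle boils down to classifying data by age. Here is a minimal Python sketch of that decision. The function name and the retention thresholds are hypothetical defaults chosen for illustration, not punch settings.

```python
from datetime import date


def lifecycle_stage(data_day, today, hot_days=7, cold_days=30, frozen_days=365):
    """Return the lifecycle stage of data produced on `data_day`."""
    age = (today - data_day).days
    if age < hot_days:
        return "hot"      # recent data, kept on the fastest storage
    if age < cold_days:
        return "cold"     # still searchable, on cheaper resources
    if age < frozen_days:
        return "frozen"   # archived, restored on demand
    return "delete"       # past retention, eligible for housekeeping
```

A housekeeping job would typically run this check once a day over every index or archive batch and act on whatever changed stage.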

Alternatives

| Alternative | Details |
|---|---|
| splunk | Splunk provides cold, frozen, delete data lifecycle configurations. It can also leverage san or nfs external storage. |
| log management products | The many log management solutions all provide some sort of equivalent service. |

Short Term Storage

  • resiliency
  • scalability
  • distributed end-to-end pipeline setup

The punch completely integrates Kafka, used internally to design end-to-end processing channels.

Rationale

Kafka is now the de facto standard for architecting real time and reliable stream processing pipelines. The punch has leveraged Kafka since its early days.


Long Term Storage

The punch provides a Ceph storage engine for storing years of data. It can be exposed through the object storage api, CephFS or S3/Swift. Refer to the archiving service documentation.

Alternatives

| Alternative | Details |
|---|---|
| cloud blob storages | OVH, S3, Azure Blob Storage, or Google Cloud Storage all provide excellent and cost effective storage offerings. |
| splunk | Splunk provides cold and frozen storage. |
| qradar | IBM qradar provides log storage in the form of data nodes. |

Deployer

  • multi-tenancy
  • automated deployment
  • configuration management
  • monitoring

The punch deployer is a high-level deployment tool that installs a complete punch platform on your target servers. It works using only simple configuration files, without tying you to complex deployment technologies.

Rationale

Many customers have only hardware or virtual servers to start with. Building a complete solution such as the punch represents a significant

Alternatives

| Alternative | Details |
|---|---|
| ansible | Deploying the many components using ansible is your most probable alternative should you operate your infrastructure. It requires significant work to ensure your roles are idempotent and robust. Extra care must be taken not to end up with complex inventories that are hard to maintain. The punch deployer relies on ansible but provides you with roles, and fully takes care of generating the inventories. |
| application or container paas | E.g. pivotal cloudfoundry, kubernetes or any in-between variant (cfcr, openshift). You can then deploy only the pipelines part, and leverage the other components (elasticsearch, kafka and others) as platform services. You still must assemble security, monitoring, multi-tenancy, and the overall configuration management plane on top of it. |

Warning

This area is a source of great confusion. The punch is a lightweight distribution that can be deployed on many target systems including clouds and paas. However it does not impose such a paas. These are heavyweight infrastructures requiring their own dependencies. For example pivotal cloudfoundry requires a (mysql) database, whereas kubernetes requires a distributed store (etcd). Each supports different filesystems, depending on its underlying container implementation.

Warning

Quoting Niels Goossens: "Even though a PaaS – and particularly a container platform – works best when adhering to the Twelve-factor app standard, specifically the factor about stateless processes, it is possible to run statefull applications on OpenShift. Why you would want to do this apart from a proof-of-concept is beyond me, though. For more information see this excellent article on Kubernetes and statefull."


Machine Learning

  • learn, train, detect
  • stream or batch
  • monitored
  • powered by spark mllib
  • supports both python and java/scala runtimes

The punch supports special pipelines dedicated to running machine learning applications. Refer to the PML documentation. This feature is quite unique as it allows users to design machine learning pipelines on top of Elasticsearch data. In addition, a graphical editor is provided to simplify their design and execution.

Rationale

Such a dataflow, model-driven solution to design, test and deploy ml functions onto a production platform dramatically reduces the time-to-market of ml modules.

| Alternative | Details |
|---|---|
| cdap | An open source framework for building data analytic applications. |
| mlflow | The Databricks ml framework. |
| dataiku | A collaborative data science software platform for teams of data scientists, data analysts, and engineers. It simplifies the design of ml pipelines. |
| sinequa | An integrated machine learning platform that addresses citizen data scientists. |

Data Visualisation

  • security
  • multitenant
  • data visualisation
  • data forensics
  • real time

The punch relies on Kibana to provide you with data visualisation capabilities.

Rationale

The reason we exclusively leverage Kibana is explained in a dedicated blog. The key argument is to keep the punch simple, not requiring ourselves and our users to master too many technologies.

| Alternative | Details |
|---|---|
| grafana | A popular monitoring visualisation tool. It is compatible with elasticsearch and more specifically dedicated to monitoring and supervision use cases. |
| data dog | A sophisticated metrics and monitoring solution. |
| prometheus | A time series metrics and monitoring solution. It provides visualisation capabilities, although not as rich as grafana and kibana. |

Graphical User Interface

Delivered as a Kibana plugin, the punch graphical user interface lets you:

  • access the documentation
  • extract data from elasticsearch
  • code, run and test punchlets
  • design machine learning jobs using a graphical editor

Integrated Monitoring

  • monitoring
  • alerting

Monitoring is a key requirement to stay in control of your application, pipeline and platform. This is true for tiny and large scale use cases alike. The punch integrates monitoring agents to collect various system and applicative metrics so as to provide users with:

  • normalised and aggregated metrics
  • dashboards
  • alerting
  • REST api for plugging in external supervision tools

Rationale

The punch leverages the Elastic stack, all of it: Beats (metrics agents), Elasticsearch and Kibana. We decided to stop using alternate technologies, even though some of them are more efficient and specialised for handling monitoring metrics. The benefits are significant. First, the overall solution stays simple, not requiring additional tools and technologies. Second, the elastic components have excellent performance. Last and most importantly, leveraging elastic and punch features, we can plug in aggregation and machine learning jobs on these monitoring metrics too.

Alternatives

| Alternative | Details |
|---|---|
| grafana | A popular monitoring visualisation tool. It is compatible with elasticsearch and more specifically dedicated to monitoring and supervision use cases. |
| prometheus | A time series metrics and monitoring solution. Simple to set up for simple alerting use cases. Low footprint and performant. |

Log Parsers

  • standard parsers
  • normalisation
  • log management

The punch ships with 70 standard parsers. Check out the log management documentation.

Rationale

The punch is a strategic Thales development, started with the idea of building an open-source based log management solution. Key to that strategy is the management and sharing of common assets, in particular log parsers. The punch language has allowed various teams to produce parsers. Compact, easy to maintain, and fully integrated into continuous integration tooling, these parsers now make it cost effective to deploy the many log management solutions required by Thales projects.

Alternatives

| Alternative | Details |
|---|---|
| splunk | Splunk benefits from a rich community that has contributed a large number of parsers. |
| fluentd | Community parsers. |
| graylog | Community parsers. |
| qradar | IBM qradar ships with so-called dsm, ready-to-use parsers. |

Security

  • access control
  • encryption of data in transit
  • multi-tenant data access
  • RBAC user management

The punch provides RBAC access control and authentication. Refer to the punch security documentation.

this section is under construction

Alternatives

| Alternative | Details |
|---|---|
| elastic suite | The paid license offers security features to protect kibana and elasticsearch. |

Open Technology

  • open
  • extensible

Although not technical, this last enabler is key. The punch source code is extremely well organised. Projects can use the complete packaging or only leverage some specific punch libraries.

Rationale

Such an ambitious stack can only succeed if various Thales and/or external customer teams participate in and contribute to the punch. This has been a design driver from day one.

Punch Simulator Tool

  • development
  • test
  • performance

The punch injector tool provides users with invaluable means to:

  • inject arbitrary data : the punch injector templating format lets you generate variable fields in Json, XML or any textual data
  • stress the system : the punch injector can efficiently send tens of thousands of events per second.
  • consume data : the injector can also read data from Kafka or sockets. That makes it ideal to investigate, debug or even (stress) test various pipelines.
  • various sources and destinations : elasticsearch, kafka, sockets, files. This makes it extremely easy to design pocs and mvps, or to provide production systems with the required testing tooling.
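
The idea of template-driven injection can be illustrated with a short Python sketch. It does not use the punch injector format: the `generate` function, the `$name` placeholder syntax (borrowed from Python's `string.Template`) and the field specifications are all assumptions made for this example. Each generated message fills the same template with a counter or a value picked from a list.

```python
import random
from string import Template

# Hypothetical template and field specifications.
TEMPLATE = '{"id": $id, "user": "$user", "action": "$action"}'
FIELDS = {
    "id": "counter",                    # 1, 2, 3, ...
    "user": ["alice", "bob", "carol"],  # picked at random
    "action": ["login", "logout"],
}


def generate(template, fields, count, seed=0):
    """Yield `count` messages built from the template and the field specifications."""
    rng = random.Random(seed)  # seeded so that a test run is reproducible
    tpl = Template(template)
    for index in range(1, count + 1):
        values = {
            name: index if spec == "counter" else rng.choice(spec)
            for name, spec in fields.items()
        }
        yield tpl.substitute(values)
```

Because the generator is seeded, two runs with the same arguments produce the same messages, which is convenient when a test must be replayed exactly.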

Alternatives

None that we know of.