Enablers¶

Abstract

Providing a complete list of punch features is difficult. First, there are many. Second, they differ in nature: some are low-level (e.g. a back pressure algorithm), others high-level (e.g. archive data housekeeping). All are essential to build a complete industrial solution.

The punch value is therefore best described in terms of what it enables you to solve. For that reason this chapter lists enablers. For each we provide its rationale, and we list alternatives should you need to solve the same problem on your own.

This chapter will help you determine quickly if and how the punch can help you. Refer to the feature list for a more compact and easy-to-scan list of features.

Overview¶

Whatever your use case (cybersecurity, IoT, system or application monitoring), it is common to assemble the same functions over and over again. This can be depicted as follows:

This does not mean that you always need all these functions. It is just likely that you need a fair amount of them. The punch has been designed to deliver scalable and resilient components that are easy to assemble. You can use them on tiny systems or on large scale big data platforms.

Before we describe each enabler in detail, it helps to have a quick understanding of their value depending on your target use case and environment. We will consider four illustrative environments:

Edge : refers to small platforms from one to ten servers. These must be highly autonomous, and cheap to build, deploy and operate. They run on limited hardware resources yet require advanced stream processing capabilities, potentially including machine learning capabilities.
DataCenter : refers to large scale on-premise platforms of up to several hundred servers. Besides their size and resources, these are managed by human operators on a daily basis. Although they run critical services, they must be easily upgraded, adapted, and enriched with new services.
Socs : refers to Security Operation Centers. See a Soc as a DataCenter-like platform focused on providing cybersecurity detection and forensic capabilities. Some Socs, however, must run on limited hardware so as to monitor small systems.
Clouds : refers to cloud and cloud-native environments where applications and resources need to be monitored, possibly for security concerns, or for tracking resource and application usage, and most often for both.

Punch Enabler	Edge	DataCenter	Socs	Clouds
Search and Analytics Data Engine	*	***	***	***
Pipelines	***	***	***	***
Stream Pipeline Processor	*	***	***	***
Batch Pipeline Processor	**	***	***	***
Punch Light Processor	***	**	**	**
Channels	***	***	***	***
IO Connectors	***	***	***	***
Data Shipper Connectors	***	**	**	***
Punch Language	***	***	***	***
Multi Tenancy	**	***	***	***
Rule Engine	**	**	***	**
Long Term Storage	-	***	***	***
Short Term Storage	***	***	***	***
Deployer	***	**	**	**

Search and Analytics Data Engine¶

aggregate
correlate
search

The Punch provides a distributed search and analytics data engine. That part is fully built on top of Elasticsearch.

Rationale

Elasticsearch offers key benefits. It is simple to set up and operate, while providing scalability and resiliency from small to large systems. It provides near real-time operations. It makes it easy and quick to build applications for a variety of use cases, in a way drastically simpler than with a Hadoop stack or with a more specialised stack.

Elasticsearch comes with a variety of tools to visualise, collect, aggregate and process the data. Last, it can be enriched with additional machine learning capabilities using various connectors to batch and streaming frameworks such as Spark and Storm.

Use Cases

Alternatives

Alternative	Details
elasticsearch	there are now several variants of open source elasticsearch releases. Using elasticsearch on your own requires you to provide multi-tenancy, data housekeeping, log parsing, etc. yourself.
elastic suite	the paid license offers additional features and services such as reporting, security, machine learning, alerting. Note that the punch team recommends the use of an X-Pack license in combination with the punch.
splunk	Splunk provides the same excellent ease-of-use, real-time operations and data visualisation capabilities.
fluentd	although only focused on log management, there are a number of open source, managed or commercial variants that are built on top of elasticsearch. Fluentd is an excellent one.
graylog	Another well known log management offer

Pipelines¶

collect, transport, route
transform, parse, enrich
archive, extract, replay
learn, detect, alert

Besides a data analytics database, most of your effort will be devoted to designing data processing applications that parse, filter, route and enrich your data. Others will be required to aggregate, detect or reprocess your data. The punch provides a dataflow programming tool to design such arbitrary data processing applications using a model-driven approach. Each application is represented as a Directed Acyclic Graph (DAG). We refer to these DAGs as punchlines. Using punchlines, users can design arbitrary stream or batch applications.

Each punchline groups several processing nodes. The simplest punchline allows you to design input-processing-output patterns, such as the one you need for a log management solution. You can however model much more sophisticated graphs, not only simple sequences. The punch provides a number of ready to use nodes (listed separately), and allows you to add your own.
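To make the DAG idea concrete, here is a minimal Python sketch. It is not the punchline configuration format: the node names and the dictionary layout are assumptions chosen for illustration. It only shows that a graph with a fan-out (one parser feeding two outputs) can be validated and ordered so that every node runs after its inputs.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline description: each node maps to the nodes it reads from.
# These names are illustrative, they are not real punch node types.
pipeline = {
    "syslog_input": [],
    "parser": ["syslog_input"],
    "enricher": ["parser"],
    "elasticsearch_output": ["enricher"],
    "archive_output": ["parser"],  # fan-out: the parser feeds two branches
}


def execution_order(dag):
    """Return the nodes in an order where every node follows its inputs."""
    return list(TopologicalSorter(dag).static_order())


def is_acyclic(dag):
    """Return True if the description is a valid DAG, False if it has a cycle."""
    try:
        execution_order(dag)
        return True
    except CycleError:
        return False


order = execution_order(pipeline)
```

Describing the graph as plain data is what allows it to be checked, audited and versioned like any other configuration file.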

Rationale

Having processing pipelines delivered as mere configuration files provides your system with auditability, security and ease-of-use.

flux

Bad things happen when configuration is hard-coded. No one should have to recompile or repackage an application in order to change configuration.

Use Cases

To help you select the right processor, here is a quick overview of typical punch pipeline use cases.

Using Spark Streaming processor:

Accurate counts
Windowing Aggregations
Progressive Analysis
Continuous Machine Learning

Using Storm or PunchLightProcessor processors:

Log management pipelines
Site to site data transport
Real-time alerting and complex event processing rules

Using Spark batch processor:

Machine learning
Data extractions and Aggregations

Features

reliable : punchlines provide the data acknowledgement required to achieve exactly-once or at-least-once data processing
monitored : punchlines provide data acknowledgement required to achieve exactly or at least once data processing
scalable : from single to distributed multi process deployment. Scalability is provided by processors
binary or textual data : punchlines can process both binary (images, ...) and textual (json, xml, logs, ...) data
extensible : punchlines can be enriched with your own nodes and modules
autonomous : through various backpressure configurations, punchlines will stay stable under various traffic or data ingestion conditions
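As an illustration of the backpressure idea mentioned in the list above, the following Python sketch uses a bounded queue between a fast producer and a slow consumer. This is a generic technique, not the punch implementation; the function name, the capacity and the simulated delay are assumptions.

```python
import queue
import threading
import time


def run_pipeline(n_items, capacity):
    """Push n_items through a bounded buffer.

    Returns the peak buffer size observed by the producer and the list of
    items seen by the consumer.
    """
    buffer = queue.Queue(maxsize=capacity)
    consumed = []

    def consumer():
        while True:
            item = buffer.get()
            if item is None:       # sentinel: the producer is done
                return
            time.sleep(0.001)      # simulate a slow downstream node
            consumed.append(item)

    worker = threading.Thread(target=consumer)
    worker.start()
    peak = 0
    for i in range(n_items):
        buffer.put(i)              # blocks while the buffer is full
        peak = max(peak, buffer.qsize())
    buffer.put(None)
    worker.join()
    return peak, consumed


peak, consumed = run_pipeline(50, capacity=5)
```

Because `put` blocks when the buffer is full, the producer is slowed down to the consumer's pace and memory usage stays bounded regardless of the input rate.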

Alternatives

Alternative	Details
flux	Storm specific. Similar to punch topologies
streamset	streamset is a system for creating, executing and operating continuous dataflows that connect various parts of your data infrastructure.
nifi	A popular flow automation tool.
logstash	Limited to simple log management pipelines.
cdap	an open source framework for building data analytic applications.
mlflow	mlFlow is a framework that supports the machine learning lifecycle. This means that it has components to monitor your model during training and running, ability to store models, load the model in production code and create a pipeline.
apache beam	Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines.

Pipeline Processors¶

host and run pipelines

To run pipelines, the punch provides you with 4 possible processors. A processor is simply the component in charge of running your pipelines.

Technology	Description
Storm	Stream processing engine in charge of running pipelines referred to as topologies.
Spark	Batch processing engine in charge of running batch pipelines in particular the ones running machine learning modules.
Spark Streaming	Stream processing engine in charge of running continuous pipelines, in particular the ones running machine learning modules.
Punch	Punch provides a lightweight pipeline engine. It offers a minimal yet performant single-process engine to run pipelines on constrained environments such as edge platforms. It is also well-suited to deploy pipelines on dockerized platforms.

Rationale

Running large scale distributed pipelines requires state-of-the-art yet robust stream and batch frameworks. The punch aims at making it simple and fast to leverage key technologies, without the burden of developing applications on your own, or the difficulty of identifying and investing in a particular technology.

Alternative	Details
kafka stream	Does not provide a runtime framework, only libraries
h2o	dedicated to machine learning
flink	A storm and kafka stream competitor
tibco streambase	A CEP platform for rapidly building and deploying applications that analyze and act on real-time streaming data.
ibm streams	IBM stream processing platform that can ingest, filter, analyze and correlate massive volumes of continuous data streams.
google dataflow	a cloud-based data processing service for both batch and real-time data streaming applications. It leverages and expands the map reduce project.
apache samza	Samza is a stream processing framework. See it as a higher level stream framework built on top of Kafka.
azure stream analytics	Microsoft serverless, scalable complex event processing engine.
amazon kinesis	Amazon stream engine that integrates with other Amazon services such as Redshift, DynamoDB and Simple Storage Service (Amazon S3).
twitter heron	a Storm-compatible stream processing engine developed by Twitter as the successor to Storm.

Channels¶

end-to-end application composition
configuration
monitoring
user commands

Channels are a unique punch feature. Using channels you can assemble several pipelines into a single application. You are then offered three basic commands to start, stop or reload your application. A channel can group both stream pipelines to ingest live data and batch pipelines to perform various batch-only tasks: machine learning, alerting, aggregating, etc.

Channels also offer a key feature : they can span across one or several Kafka hops.
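The grouping described above can be pictured with a small Python sketch. The class and attribute names are hypothetical and do not reflect the punch channel configuration or command-line tools; the sketch only shows one object owning several stream and batch pipelines and exposing the three start, stop and reload commands.

```python
from dataclasses import dataclass, field


@dataclass
class Pipeline:
    name: str
    kind: str              # "stream" or "batch"
    running: bool = False


@dataclass
class Channel:
    """A named group of pipelines that are operated together (illustrative)."""
    tenant: str
    name: str
    pipelines: list = field(default_factory=list)

    def start(self):
        for pipeline in self.pipelines:
            pipeline.running = True

    def stop(self):
        for pipeline in self.pipelines:
            pipeline.running = False

    def reload(self):
        # Simplest possible reload: stop everything, then start again.
        self.stop()
        self.start()

    def status(self):
        return {pipeline.name: pipeline.running for pipeline in self.pipelines}


channel = Channel("acme", "apache_httpd", [
    Pipeline("ingest", "stream"),
    Pipeline("daily_aggregation", "batch"),
])
channel.start()
```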

Rationale

Real production systems run tens of pipelines. Some of them are stream pipelines, others batch. Users and operators must be offered a logical and management view of these pipelines, grouped by tenant and by applicative role.

Multi-Tenancy¶

security
resource mutualisation

From the ground up, the punch is designed with a multi-tenant architecture. It allows you to design your architecture with end-to-end multi-tenant isolation. Multi-tenancy is achieved through a number of lower-level punch features: monitoring, pipeline configuration, and a naming scheme and management for Kafka topics, Elasticsearch indices and Ceph pools.

It is a punch feature in that the punch deployer and the various administration services help you configure and operate your tenants.
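A naming scheme is easier to reason about with an example. The Python sketch below derives per-tenant resource names and checks ownership from a name. The exact patterns are assumptions made for illustration; they are not the naming convention used by the punch.

```python
import re

# Tenant and channel names are restricted so that "-" can safely be used as
# a separator in the derived names (hypothetical rule).
_SAFE = re.compile(r"^[a-z0-9_]+$")


def resource_names(tenant, channel, day):
    """Derive per-tenant resource names from a hypothetical convention."""
    for part in (tenant, channel):
        if not _SAFE.match(part):
            raise ValueError(f"invalid name component: {part!r}")
    return {
        "kafka_topic": f"{tenant}_{channel}",
        "elasticsearch_index": f"{tenant}-{channel}-{day}",
        "ceph_pool": f"{tenant}-data",
    }


def belongs_to(tenant, index_name):
    """Return True if an index name falls in the given tenant's namespace."""
    return index_name.startswith(f"{tenant}-")


names = resource_names("acme", "apache", "2024.01.31")
```

With such a convention, access rules and housekeeping jobs can be expressed per tenant using simple prefix patterns.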

Rationale

This is a must-have feature for most production systems. It allows you to mutualise key expensive services and components such as big data technologies (Kafka or Elasticsearch clusters) while providing a safe and secure isolation between several customers/domains/zones.

Alternatives

This usually requires ad-hoc configurations of the technologies at play. This requires extensive work.

IO Connectors¶

The punch provides you with a set of ready to use input-output connectors. They come as configurable pipeline nodes. They can ingest textual or binary data.


Azure Blob Storage	connectors to read data from Azure Blob Storage, including NSG data.
kafka	connectors to kafka are provided off the shelf.
Ceph	Storm connectors to Ceph are provided natively. This allows you to set up archiving services.
Sockets	TCP, UDP, SSL and Lumberjack socket connectors are built-in.
Files	Files and folders can directly be consumed.
Snmp	Storm pipelines can ingest snmp traps natively.
Netflow	Storm pipelines can ingest netflow data natively.
Elasticsearch	reading and writing data from/to elasticsearch is a key punch feature. In particular some pipelines can consume massive data from elasticsearch through spark and elasticsearch hadoop-connectors.

Rationale

Leveraging performant and reliable input-output connectors is a critical requirement. These provide the backbone on top of which you will construct your applications. Not losing a single data record, and ensuring at-least-once or exactly-once semantics, directly depends on your connectors.
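The following Python sketch illustrates what at-least-once delivery means at the connector level. It is a generic acknowledgement-and-retry loop, not the protocol of any punch connector; `deliver_at_least_once`, `FlakySink` and `max_attempts` are names invented for this example.

```python
def deliver_at_least_once(records, send, max_attempts=5):
    """Send each record until it is acknowledged.

    `send(record)` is a caller-supplied function that returns True when the
    destination acknowledged the record. Returns the total number of sends.
    """
    sends = 0
    for record in records:
        for _ in range(max_attempts):
            sends += 1
            if send(record):
                break
        else:
            raise RuntimeError(f"record not acknowledged: {record!r}")
    return sends


class FlakySink:
    """Test double: stores every record but loses the first ack of each one."""

    def __init__(self):
        self.received = []
        self._seen = set()

    def send(self, record):
        self.received.append(record)
        if record in self._seen:
            return True
        self._seen.add(record)
        return False   # the record was stored, but its ack got lost


sink = FlakySink()
total_sends = deliver_at_least_once(["a", "b"], sink.send)
```

No record is lost, but the sink receives each record twice. Getting from at-least-once to exactly-once therefore requires the destination to deduplicate or to apply writes idempotently.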

Alternatives

Most platforms or frameworks provide you with IO connectors. It usually requires lots of testing to ensure they provide the required level of robustness, acknowledgement protocol support, and integration inside

On the other hand, a number of open source components are available. Most miss the required level of configurability and monitoring to be deployed as is.

We recommend you only use components effectively used in production on critical systems, not just on dedicated ad hoc platforms.

Data Shippers¶

Some IO connectors deserve particular attention. They allow you to forward data between distant sites with four key functions:

load-balancing : multi-destination, multi-room or multi-site forwarding, with automatic switching to secondary destinations
encryption and compression : to protect data in transit and to reduce the amount of data sent in limited bandwidth environments
end-to-end acknowledgement : so as to ensure data retransmission in case of failure
local buffering : so as to cope with network interruptions.
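Failover and local buffering, two of the functions listed above, can be sketched in a few lines of Python. This is an illustrative model only: the `Shipper` class, its in-memory buffer and the use of `ConnectionError` as the "not acknowledged" signal are assumptions, and a real shipper would typically persist its buffer on disk.

```python
from collections import deque


class Shipper:
    """Forward records to the first reachable destination, buffering locally."""

    def __init__(self, destinations, buffer_size=1000):
        self.destinations = destinations   # ordered: primary first
        # When the buffer is full, deque(maxlen=...) silently drops the
        # oldest record; a production shipper would rather apply backpressure.
        self.buffer = deque(maxlen=buffer_size)

    def _try_send(self, record):
        for destination in self.destinations:
            try:
                destination(record)        # raising means "not acknowledged"
                return True
            except ConnectionError:
                continue                   # switch to the next destination
        return False

    def ship(self, record):
        self.buffer.append(record)
        self.flush()

    def flush(self):
        """Send buffered records in order; stop at the first failure."""
        while self.buffer:
            if not self._try_send(self.buffer[0]):
                return                     # keep records for a later retry
            self.buffer.popleft()
```

A record leaves the buffer only after one destination accepted it, so a network interruption delays data instead of losing it, as long as the buffer does not overflow.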

Rationale

These are all mandatory functions for edge data collectors.

Punch Language¶

parsers
data routing and filtering
arbitrary data processing
logstash-like operators : dates, grok, csv, json, key-value, loops

Inside your pipeline you can deploy so-called punchlets. These leverage a compact programming language that provides a number of operators to manipulate your data.
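To give a feel for the kind of transformation such operators perform, here is a Python sketch of a key-value operator. It is deliberately written in Python and does not show punch language syntax; the function name and the sample log line are assumptions.

```python
import shlex


def parse_key_value(line):
    """Split 'k=v k2="quoted value"' into a dict.

    Tokens that do not contain '=' are ignored.
    """
    fields = {}
    for token in shlex.split(line):
        key, separator, value = token.partition("=")
        if separator:
            fields[key] = value
    return fields


fields = parse_key_value('src=10.0.0.1 action=deny msg="port scan detected"')
```

A parser is typically a short sequence of such steps: extract fields, normalise them, then route or enrich the resulting document.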

Alternatives

Alternative	Details
logstash	the punch language is inspired by logstash filters but is a real imperative language, and is more compact (up to 5 times). It provides the same operators.
splunk	splunk provides a powerful pipeline-style language with many similar operators.
graylog	graylog provides a more verbose logstash-style language. Specific to log management.
custom developments	write your own processing nodes or applications. You will lack the support of built-in operators, and of course you will be tied to the apis and frameworks you select.

Rule Engines¶

alert
correlation

Different rule engines apply depending on the needed algorithmic (partitioning, stateless/stateful...) features.

Elastalert is a rule engine for Elasticsearch-based correlation/detection that simplifies the rule configuration process. This component is used to support Cybels Analytics correlation/detection rule set management and templating, including leveraging Sigma rules. The Elastalert engine also provides notification actions (e.g. e-mails) without requiring additional pipelining.
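The distinction between stateless and stateful, partitioned rules can be illustrated with a small Python sketch of a frequency rule: it fires when a given key produces a number of events within a sliding time window. This is a generic example, not Elastalert or punch code; the class name and parameters are assumptions, and timestamps are assumed to arrive in non-decreasing order per key.

```python
from collections import defaultdict, deque


class FrequencyRule:
    """Fire when `threshold` events share a key within `window` seconds."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.window = window
        # State is partitioned by key: each key has its own timestamps.
        self._events = defaultdict(deque)

    def observe(self, key, timestamp):
        """Record one event; return True if the rule fires for this key."""
        times = self._events[key]
        times.append(timestamp)
        # Evict timestamps that fell out of the sliding window.
        while times and timestamp - times[0] > self.window:
            times.popleft()
        return len(times) >= self.threshold


rule = FrequencyRule(threshold=3, window=60)
results = [rule.observe("10.0.0.1", t) for t in (0, 30, 50)]
```

Because the state is kept per key, such a rule can be distributed by routing all events of a key to the same worker, which is where partitioning comes into play.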

Alternatives

Alternative	Details
esper	Esper is a language, compiler and runtime for complex event processing (CEP) and streaming analytics, available for Java as well as for .NET. Watch out for its licensing.
wso2	the siddhi library used in the punch comes from the excellent wso2 product. The way siddhi is integrated into Storm is very similar to their architecture.
splunk	using splunk language you can implement correlation rules on your own.
qradar	ibm qradar provides an event correlator and rule engine as part of its siem solution
custom developments	Stream processing frameworks allow you to write your own CEP rules using stream SQL variants.

Archiving Service¶

archiving
data lifecycle : hot cold frozen delete

The punch ships with pipelines dedicated to providing you with an archiving service. This service provides:

Compact and cost-effective Data Archiving and Indexing : choose from various replication levels.
Extraction and data replay services.
Housekeeping : to automatically expunge your old data.

The archiving service runs on top of the punch long term storage.
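The hot, cold, frozen, delete lifecycle can be expressed as a simple age-based policy. The Python sketch below is illustrative only: the thresholds and function names are assumptions and do not reflect the punch housekeeping configuration.

```python
from datetime import date

# Hypothetical retention thresholds, in days: data younger than the limit
# is in the associated phase.
PHASES = [(7, "hot"), (30, "cold"), (365, "frozen")]


def lifecycle_phase(created, today):
    """Return the lifecycle phase of a data set given its creation date."""
    age = (today - created).days
    for limit, phase in PHASES:
        if age < limit:
            return phase
    return "delete"


def due_for_deletion(datasets, today):
    """Return the names of the data sets that housekeeping should expunge."""
    return [
        name for name, created in datasets.items()
        if lifecycle_phase(created, today) == "delete"
    ]


today = date(2024, 1, 1)
datasets = {
    "logs-2023.12.30": date(2023, 12, 30),
    "logs-2023.12.15": date(2023, 12, 15),
    "logs-2023.06.01": date(2023, 6, 1),
    "logs-2022.06.01": date(2022, 6, 1),
}
expired = due_for_deletion(datasets, today)
```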

Alternatives

Alternative	Details
splunk	Splunk provides cold, frozen, delete data lifecycle configurations. It can also leverage SAN or NFS external storage.
log management products	the many log management solutions all provide some sort of equivalent service.

Short Term Storage¶

resiliency
scalability
distributed end-to-end pipeline setup

The punch completely integrates Kafka, used internally to design end-to-end processing channels.

Rationale

Kafka is now the de facto standard for architecting real-time and reliable stream processing pipelines. The punch has leveraged Kafka since its early days.

Long Term Storage¶

The punch provides a Ceph storage engine for storing years of data. It can be exposed through the object storage api, CephFS or S3/Swift. Refer to the archiving service documentation.

Alternatives

Alternative	Details
cloud blob storage	OVH, S3, Azure Blob Storage, or Google Cloud Storage all provide excellent and cost effective storage offerings.
splunk	Splunk provides cold and frozen storage.
qradar	ibm qradar provides log storage under the form of data nodes.

Deployer¶

multi-tenancy
automated deployment
configuration management
monitoring

The punch deployer is a high-level deployment tool that installs a complete punch platform on your target servers. It works using only simple configuration files, without tying you to complex deployment technologies.

Rationale

Many customers have only hardware or virtual servers to start with. Building a complete solution such as the punch represents a significant

Alternatives

Alternative	Details
ansible	Deploying the many components using ansible is your most probable alternative should you operate your infrastructure. It requires significant work to ensure your roles are idempotent and robust. Extra care must be taken to not end up with complex inventories, hard to maintain. The punch deployer relies on ansible but provides you with roles, and fully takes care of generating the inventories.
application or container paas	E.g. pivotal cloudfoundry or kubernetes or any in-between mixed variant (cfcr, openshift). You can then deploy only the pipelines part, and leverage the other components (elasticsearch, kafka and others) as platform services. You still must assemble security, monitoring, multi-tenancy, and the overall configuration management plane on top of it.

Warning

This area is a source of great confusion. The punch is a lightweight distribution that can be deployed on many target systems including clouds and paas. However it does not impose such a paas. These are heavyweight infrastructures requiring their own dependencies. For example pivotal cloudfoundry requires a (mysql) database, whereas kubernetes requires a distributed store (etcd). Each supports different filesystems, depending on its underlying container implementation.

Warning

Quoting Niels Goossens: "Even though a PaaS – and particularly a container platform – works best when adhering to the Twelve-factor app standard, specifically the factor about stateless processes, it is possible to run stateful applications on OpenShift. Why you would want to do this apart from a proof-of-concept is beyond me, though. For more information see this excellent article on Kubernetes and stateful."

Machine Learning¶

The punch supports special pipelines dedicated to running machine learning applications. Refer to the PML documentation. This feature is quite unique as it allows users to design machine learning pipelines on top of Elasticsearch data. In addition, a graphical editor is provided to simplify their design and execution.

Rationale

Such a dataflow, model-driven solution to design, test and deploy ml functions onto a production platform dramatically reduces the time-to-market of ml modules.

Alternative	Details
cdap	an open source framework for building data analytic applications
mlflow	the Databricks ml framework
dataiku	a collaborative data science software platform for teams of data scientists, data analysts, and engineers. It simplifies the design of ml pipelines.
sinequa	an integrated machine learning platform that addresses citizen data scientists.

Data Visualisation¶

The punch relies on Kibana to provide you with data visualisation capabilities.

Rationale

The reason we exclusively leverage Kibana is explained in a dedicated blog. The key argument is to keep the punch simple, not requiring ourselves and our users to master too many technologies.

Alternative	Details
grafana	A popular monitoring visualisation tool. It is compatible with elasticsearch and more specifically dedicated to monitoring and supervision use cases.
data dog	A sophisticated metrics and monitoring solution.
prometheus	A time series metrics and monitoring solution. It provides visualisation capabilities, although not as rich as grafana and kibana.

Integrated Monitoring¶

monitoring
alerting

Monitoring is a key requirement to stay in control of your application, pipeline and platform. This is true for tiny or large scale use cases. The punch integrates monitoring agents to collect various system and applicative metrics so as to provide users with:

normalised and aggregated metrics
dashboards
alerting
REST api for plugging in external supervision tools

Rationale

The punch leverages the Elastic stack, all of it: Beats (metrics agents), Elasticsearch and Kibana. We decided to stop using alternate technologies, some of them more efficient and specialised to handle monitoring metrics. Benefits are significant. First, the overall solution stays simple, not requiring additional tools and technologies. Second, the elastic components have excellent performance. Last and most importantly, leveraging elastic and punch features, we can plug in aggregation and machine learning jobs on these monitoring metrics as well.

Alternatives

Alternative	Details
grafana	A popular monitoring visualisation tool. It is compatible with elasticsearch and more specifically dedicated to monitoring and supervision use cases.
prometheus	A time series metrics and monitoring solution. Simple to set up for simple alerting use cases. Low footprint and performant.

Log Parsers¶

standard parsers
normalisation
log management

The punch ships with 70 standard parsers. Check out the log management documentation.

Rationale

The punch is a strategic Thales development with the idea of building an open-source based log management solution. Key to that strategy is the management and sharing of common assets, in particular log parsers. The punch language has allowed various teams to produce parsers. Compact, easy to maintain, and fully integrated into continuous integration tooling, these parsers now make it cost effective to deploy many log management solutions as required by the many Thales projects.

Alternatives

Alternative	Details
splunk	Splunk benefits from a rich community that has contributed to develop a large number of parsers
fluentd	community parsers
graylog	community parsers
qradar	ibm qradar ships with so-called dsm, ready-to-use parsers.

Security¶

access control
encryption of data in movement
multi-tenant data access
RBAC user management

The punch provides RBAC access control and authentication. Refer to the punch security documentation.

this section is under construction

Alternatives

Alternative	Details
elastic suite	the paid license offers security features to protect kibana and elasticsearch.

Open Technology¶

open
extensible

Although not technical, this last enabler is key. The punch source code is extremely well organised. Projects can use the complete packaging or only leverage some specific punch libraries.

Rationale

Such an ambitious stack can only succeed if various Thales and/or external customer teams participate and contribute to the punch. This has been a design driver from day one.

Punch Simulator Tool¶

development
test
performance

The punch injector tool provides users with invaluable means to:

inject arbitrary data : the punch injector templating format allows to generate variable fields in Json, XML or any textual data
stress the system : the punch injector can efficiently send tens of thousands of events per second.
consume data : the injector can also read data from Kafka or sockets. That makes it ideal to investigate, debug or even (stress) test various pipelines.
various sources and destinations : elasticsearch, kafka, sockets, files. It makes it extremely easy to design pocs and mvps, or to provide production systems with the required testing tooling.
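Template-driven data generation, as described in the first item of the list above, can be sketched in Python as follows. This does not use the punch injector templating format: the `$placeholder` syntax comes from Python's `string.Template`, and the field generators are assumptions made for the example.

```python
import json
import random
from string import Template


def generate_events(template, fields, count, seed=0):
    """Yield `count` strings, filling each $placeholder from its generator.

    Each generator receives a seeded random source and the event counter,
    so that runs are reproducible.
    """
    rng = random.Random(seed)
    compiled = Template(template)
    for i in range(count):
        values = {name: make(rng, i) for name, make in fields.items()}
        yield compiled.substitute(values)


template = '{"id": $id, "src": "$src", "action": "$action"}'
fields = {
    "id": lambda rng, i: i,
    "src": lambda rng, i: f"10.0.0.{rng.randint(1, 254)}",
    "action": lambda rng, i: rng.choice(["accept", "deny"]),
}
events = [json.loads(event) for event in generate_events(template, fields, 3)]
```

The same template could just as well describe XML or a raw log line, since the generator only substitutes text.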

Alternatives

None that we know of.