Providing a complete list of punch features is difficult. First, there are many. Second, they differ in nature: some are low-level (e.g. a back-pressure algorithm), others high-level (e.g. archive data housekeeping). All are essential to build a complete industrial solution.
The punch value is therefore best described in terms of what it enables you to solve. For that reason this chapter lists enablers. For each we provide its rationale, and we list alternatives should you need to solve the same problem on your own.
This chapter will help you quickly determine if and how the punch can help you. Refer to the feature list for a more compact, easy-to-scan list of features.
Whatever your use case (cybersecurity, IoT, system or application monitoring), it is common to assemble the same functions over and over again. This can be depicted as follows:
This does not mean that you always need all these functions. It is just likely that you need a fair amount of them. The punch has been designed to deliver scalable and resilient components that are easy to assemble. You can use them on tiny systems or on large scale big data platforms.
Before we describe each enabler in detail, it helps to have a quick understanding of their value depending on your target use case and environment. We will consider four illustrative environments:
- Edge : refers to small platforms of one to ten servers. These must be highly autonomous and cheap to build, deploy and operate. They run on limited hardware resources yet require advanced stream processing capabilities, potentially including machine learning.
- DataCenter : refers to large-scale on-premise platforms of up to several hundred servers. Besides the size and resources, these are managed by human operators on a daily basis. Although running critical services, they must be easy to upgrade, adapt, and enrich with new services.
- Socs : refers to Security Operation Centers. See a Soc as a DataCenter-like platform focused on providing cybersecurity detection and forensic capabilities. Some Socs however must run on limited hardware so as to monitor small systems.
- Clouds : refers to cloud and cloud-native environments where applications and resources need to be monitored, possibly for security concerns, or for tracking resource and application usage. Most often for both.
| Enabler | Edge | DataCenter | Socs | Clouds |
|---|---|---|---|---|
| Search and Analytics Data Engine | * | *** | *** | *** |
| Stream Pipeline Processor | * | *** | *** | *** |
| Batch Pipeline Processor | ** | *** | *** | *** |
| Punch Light Processor | *** | ** | ** | ** |
| Data Shipper Connectors | *** | ** | ** | *** |
| Long Term Storage | - | *** | *** | *** |
| Short Term Storage | *** | *** | *** | *** |
## Search and Analytics Data Engine
The Punch provides a distributed search and analytics data engine. That part is fully built on top of Elasticsearch.
Elasticsearch offers key benefits. It is simple to set up and operate, while providing scalability and resiliency from small to large systems. It provides near-real-time operations. It makes it easy and quick to build applications for a variety of use cases, in a way drastically simpler than with a Hadoop stack or with more specialised stacks.
Elasticsearch comes with a variety of tools to visualise, collect, aggregate and process the data. Last, it can be enriched with additional machine learning capabilities using various connectors to batch and streaming frameworks such as Spark and Storm. Typical use cases include:
- log analytics
- full-text search
- security intelligence
- business analytics
- operational intelligence
| Alternative | Comment |
|---|---|
| elasticsearch | There are now several variants of open source elasticsearch releases. Using elasticsearch on your own requires you to provide multi-tenancy, data housekeeping, log parsing etc. on your own. |
| elastic suite | The paid license offers additional features and services such as reporting, security, machine learning and alerting. Note that the punch team recommends the use of an X-Pack license in combination with the punch. |
| splunk | Splunk provides the same excellent ease of use, real-time operations and data visualisation capabilities. |
| fluentd | Although focused only on log management, there are a number of open source, managed or commercial variants built on top of elasticsearch. Fluentd is an excellent one. |
| graylog | Another well-known log management offering. |
- collect, transport, route
- transform, parse, enrich
- archive, extract, replay
- learn, detect, alert
Besides a data analytics database, most of your effort will be devoted to designing data processing applications to parse, filter, route and enrich your data. Others will be required to aggregate, detect and reprocess it. The punch provides a dataflow programming tool to design such arbitrary data processing applications using a model-driven approach. Each application is represented as a directed acyclic graph (DAG). We refer to these DAGs as punchlines. Using punchlines, users can design arbitrary stream or batch applications.
Each punchline groups several processing nodes. The simplest punchline lets you design input-processing-output patterns, such as the one you need for a log management solution. You can however model much more sophisticated graphs, not just simple sequences. The punch provides a number of ready-to-use nodes (listed separately), and allows you to add your own.
Having processing pipelines delivered as mere configuration files provides your system with auditability, security and ease-of-use.
Bad things happen when configuration is hard-coded. No one should have to recompile or repackage an application in order to change configuration.
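To make the idea concrete, here is a minimal sketch of what a punchline-style configuration looks like as plain data, together with a check that the node wiring really forms a DAG. The node types, field names and structure are illustrative only, not the actual punch configuration schema.

```python
# Hypothetical punchline-style configuration: a DAG of named nodes
# wired together by published/subscribed streams. All names here are
# illustrative, not the real punch schema.
punchline = {
    "dag": [
        {"type": "kafka_input", "name": "input", "publish": ["logs"]},
        {"type": "punchlet_node", "name": "parser",
         "subscribe": ["logs"], "publish": ["parsed"]},
        {"type": "elasticsearch_output", "name": "output",
         "subscribe": ["parsed"]},
    ]
}

def is_acyclic(dag):
    """Check that the subscribe/publish wiring forms a directed acyclic graph."""
    # A node publishing stream S has an edge to every node subscribing to S.
    producers = {}
    for node in dag:
        for stream in node.get("publish", []):
            producers.setdefault(stream, []).append(node["name"])
    edges = {node["name"]: set() for node in dag}
    for node in dag:
        for stream in node.get("subscribe", []):
            for producer in producers.get(stream, []):
                edges[producer].add(node["name"])
    # Kahn's algorithm: repeatedly remove nodes with no incoming edge.
    indegree = {name: 0 for name in edges}
    for targets in edges.values():
        for target in targets:
            indegree[target] += 1
    queue = [n for n, d in indegree.items() if d == 0]
    visited = 0
    while queue:
        node = queue.pop()
        visited += 1
        for target in edges[node]:
            indegree[target] -= 1
            if indegree[target] == 0:
                queue.append(target)
    return visited == len(edges)
```

Because the whole application is mere data, such validation (and auditing, versioning, diffing) can happen before anything is deployed, which is precisely the value of configuration-driven pipelines.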
To help you select the right processor, here is a quick overview of typical punch pipeline use cases.
Using Spark Streaming processor:
- Accurate counts
- Windowing Aggregations
- Progressive Analysis
- Continuous Machine Learning
Using Storm or PunchLightProcessor processors:
- Log management pipelines
- Site to site data transport
- Real-time alerting and complex event processing rules
Using Spark batch processor:
- Machine learning
- Data extractions and Aggregations
- reliable : punchlines provide the data acknowledgement required to achieve exactly-once or at-least-once data processing
- monitored : punchlines publish metrics so that their traffic, latency and health can be tracked
- scalable : from single-process to distributed multi-process deployment. Scalability is provided by the processors
- binary or textual data : punchlines can process both binary (images, ..) and textual (json, xml, logs, ..) data
- extensible : punchlines can be enriched with your own nodes and modules
- autonomous : through various backpressure configurations, punchlines stay stable under varying traffic and data ingestion conditions
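The reliability property above rests on a simple acknowledgement pattern: an input node keeps every record pending until the end of the pipeline acknowledges it, and replays anything left unacknowledged. The sketch below illustrates that at-least-once mechanism in isolation; the class and method names are invented for illustration, not punch APIs.

```python
# Minimal sketch of the acknowledgement pattern behind at-least-once
# delivery: unacknowledged records are replayed. Names are illustrative.
class AtLeastOnceInput:
    def __init__(self, records):
        self.pending = dict(enumerate(records))  # id -> record, awaiting ack

    def emit(self):
        """Yield (id, record) pairs for every still-unacknowledged record."""
        return list(self.pending.items())

    def ack(self, record_id):
        """Called once the record has been fully processed downstream."""
        self.pending.pop(record_id, None)

source = AtLeastOnceInput(["log-1", "log-2", "log-3"])
for record_id, record in source.emit():
    if record != "log-2":          # simulate a downstream failure on log-2
        source.ack(record_id)

# log-2 was never acknowledged, so it shows up again on the next pass.
replayed = [record for _, record in source.emit()]
```

Note that replaying is exactly why this gives at-least-once rather than exactly-once semantics: achieving the latter additionally requires idempotent or transactional writes downstream.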
| Alternative | Comment |
|---|---|
| flux | Storm specific. Similar to punch topologies. |
| streamset | Streamset is a system for creating, executing and operating continuous dataflows that connect various parts of your data infrastructure. |
| nifi | A popular flow automation tool. |
| logstash | Limited to simple log management pipelines. |
| cdap | An open source framework for building data analytic applications. |
| mlflow | mlFlow is a framework that supports the machine learning lifecycle: it has components to monitor your model during training and running, the ability to store models, load a model in production code and create a pipeline. |
| apache beam | Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. |
- host and run pipelines
To run pipelines, the punch provides you with four processors. A processor is simply the component in charge of running your pipelines.
| Processor | Description |
|---|---|
| Storm | Stream processing engine in charge of running pipelines referred to as topologies. |
| Spark | Batch processing engine in charge of running batch pipelines, in particular the ones running machine learning modules. |
| Spark Streaming | Stream processing engine in charge of running continuous pipelines, in particular the ones running machine learning modules. |
| Punch | A lightweight pipeline engine. It offers a minimal yet performant single-process engine to run pipelines in constrained environments such as edge platforms. It is also well-suited to deploying pipelines on dockerized platforms. |
Running large-scale distributed pipelines requires state-of-the-art yet robust stream and batch frameworks. The punch aims at making it simple and fast to leverage key technologies, without the burden of developing applications on your own, or the difficulty of identifying and investing in a particular technology.
| Alternative | Comment |
|---|---|
| kafka stream | Does not provide a runtime framework, only libraries. |
| h2o | Dedicated to machine learning. |
| flink | A Storm and Kafka Streams competitor. |
| tibco streambase | A CEP platform for rapidly building and deploying applications that analyze and act on real-time streaming data. |
| ibm streams | IBM's stream processing platform, able to ingest, filter, analyze and correlate massive volumes of continuous data streams. |
| google dataflow | A cloud-based data processing service for both batch and real-time data streaming applications. It leverages and expands the MapReduce project. |
| apache samza | Samza is a stream processing framework. See it as a higher-level stream framework built on top of Kafka. |
| azure stream analytics | Microsoft's serverless, scalable complex event processing engine. |
| amazon kinesis | Amazon's stream engine; it integrates with other Amazon services such as Redshift, DynamoDB and Simple Storage Service (Amazon S3). |
| twitter heron | A fork of the original Storm framework, now used by Twitter. |
- end-to-end application composition
- user commands
Channels are a unique punch feature. Using channels you can assemble several pipelines into a single application. You then have three basic commands to start, stop or reload your application. A channel can group both stream pipelines to ingest live data and batch pipelines to perform various batch-only tasks: machine learning, alerting, aggregating, etc.
Channels also offer a key feature: they can span one or several Kafka hops.
Real production systems run tens of pipelines, some streaming, others batch. Users and operators need a logical management view of these pipelines, grouped by tenant and by applicative role.
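The channel concept can be sketched as follows: a named group of pipelines, owned by a tenant, driven by the start/stop commands described above. The class and field names are illustrative, not the actual punch channel configuration format.

```python
# Hypothetical sketch of a channel: a tenant-scoped, named group of
# pipelines managed as one unit. Structure and names are illustrative.
class Channel:
    def __init__(self, tenant, name, pipelines):
        self.tenant = tenant
        self.name = name
        self.pipelines = {p: "stopped" for p in pipelines}

    def start(self):
        for pipeline in self.pipelines:
            self.pipelines[pipeline] = "running"

    def stop(self):
        for pipeline in self.pipelines:
            self.pipelines[pipeline] = "stopped"

    def status(self):
        return dict(self.pipelines)

# One channel grouping a stream ingestion pipeline with a batch
# aggregation pipeline, started with a single command.
apache = Channel("mytenant", "apache", ["ingest_stream", "daily_aggregation"])
apache.start()
```

The operational benefit is that operators reason about one logical application ("the apache channel of mytenant"), not about each of its pipelines individually.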
- resource mutualisation
From the ground up, the punch is designed with a multi-tenant architecture. It allows you to design your architecture with end-to-end multi-tenant isolation. Multi-tenancy is achieved through a number of lower-level punch features: monitoring, pipeline configuration, and the kafka topic, elasticsearch index and ceph pool naming schemes and management.
It is a punch feature in that the punch deployer and the various [administration services](#administration-services) help you configure and operate your tenants.
This is a must-have feature for most production systems. It allows you to share key expensive services and components such as big data technologies (Kafka or Elasticsearch clusters) while providing safe and secure isolation between several customers, domains or zones.
Doing this on your own usually requires ad hoc configuration of each technology at play, which represents extensive work.
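In practice, tenant isolation on shared Kafka and Elasticsearch clusters boils down to a strict naming convention applied everywhere. The sketch below shows the idea with invented prefixes; the actual punch naming rules may differ.

```python
# Sketch of a per-tenant naming convention for shared resources, in the
# spirit of the naming schemes described above. The exact separators and
# prefixes are illustrative, not the punch's actual naming rules.
def kafka_topic(tenant, channel, stream):
    """Every topic is scoped by tenant and channel, so tenants never collide."""
    return f"{tenant}.{channel}.{stream}"

def elasticsearch_index(tenant, kind, day):
    """Daily per-tenant indices also make per-tenant housekeeping trivial."""
    return f"{tenant}-{kind}-{day}"

topic = kafka_topic("mytenant", "apache", "logs")
index = elasticsearch_index("mytenant", "events", "2023.01.15")
```

Once all topics and indices carry the tenant name, access control, quota and retention policies can be expressed as simple per-prefix rules.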
The punch provides you with a set of ready to use input-output connectors. They come as configurable pipeline nodes. They can ingest textual or binary data.
| Connector | Comment |
|---|---|
| Azure Blob Storage | Connectors to read data from azure blob storage, including nsg data. |
| Kafka | Connectors to kafka are provided off the shelf. |
| Ceph | Storm connectors to Ceph are provided natively. This allows you to set up archiving services. |
| Sockets | TCP, UDP, SSL and Lumberjack socket connectors are built in. |
| Files | Files and folders can be consumed directly. |
| Snmp | Storm pipelines can ingest snmp traps natively. |
| Netflow | Storm pipelines can ingest netflow data natively. |
| Elasticsearch | Reading and writing data from/to elasticsearch is a punch key feature. In particular, some pipelines can consume massive data from elasticsearch through spark and the elasticsearch-hadoop connectors. |
Leveraging performant and reliable input/output connectors is a critical requirement. They provide the backbone on top of which you will construct your applications. Not losing a single data record and ensuring at-least-once or exactly-once semantics directly depend on your connectors.
Most platforms or frameworks provide you with IO connectors. It usually requires lots of testing to ensure they provide the required level of robustness, acknowledgement protocol support, and integration inside your platform. On the other hand, a number of open source components are available, but most lack the level of configurability and monitoring required to deploy them as is. We recommend you only use components effectively used in production on critical systems, not just on dedicated ad hoc platforms.
Some IO connectors deserve particular attention. They allow you to forward data between distant sites with four key functions:
- load-balancing : multi-destination, multi-room or multi-site, to automatically switch data forwarding to secondary destinations
- encryption and compression : to reduce the amount of data on limited bandwidth environments
- end-to-end acknowledgement : so as to ensure data retransmission in case of failure
- local buffering : so as to cope with network interruptions.
These are all mandatory functions for edge data collectors.
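Two of the functions above, destination failover and local buffering, can be sketched together: try destinations in order, and keep undeliverable records in a local buffer so nothing is lost during a network interruption. This is an illustrative model of the behaviour, not punch forwarder code.

```python
# Sketch of failover with local buffering: each record is tried against
# the destinations in order; if none accepts it, it stays buffered and
# is retried on the next flush. Illustrative only.
class Forwarder:
    def __init__(self, destinations):
        # Each destination is a callable returning True on successful delivery.
        self.destinations = destinations
        self.buffer = []  # local buffer for records not yet delivered

    def send(self, record):
        self.buffer.append(record)
        self.flush()

    def flush(self):
        remaining = []
        for record in self.buffer:
            # any() short-circuits: the secondary is only tried if the
            # primary fails, which is the failover behaviour.
            delivered = any(dest(record) for dest in self.destinations)
            if not delivered:
                remaining.append(record)
        self.buffer = remaining

sent = []
primary = lambda record: False               # primary site unreachable
secondary = lambda record: sent.append(record) or True
forwarder = Forwarder([primary, secondary])
forwarder.send("event-1")                    # delivered via the secondary

isolated = Forwarder([lambda record: False])
isolated.send("event-2")                     # no destination: buffered locally
```

A real forwarder would of course persist the buffer to disk and bound its size; end-to-end acknowledgement then decides when a buffered record may finally be dropped.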
- data routing and filtering
- arbitrary data processing
- logstash-like operators : dates, grok, csv, json, key-value, loops
Inside your pipelines you can deploy so-called punchlets. These leverage a compact programming language that provides a number of operators to manipulate your data.
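To give an idea of what one of the listed operators does, here is a Python sketch of a key-value extraction step applied to a raw log line. The real punch language expresses this far more compactly and with its own syntax; this only illustrates the transformation itself.

```python
# Python sketch of what a key-value extraction operator does to a raw
# log line: turn 'a=1 b=2' style text into a structured document.
# This illustrates the transformation, not the punch language syntax.
import re

def kv_operator(line):
    """Extract key=value pairs from a log line into a dictionary."""
    return dict(re.findall(r"(\w+)=(\S+)", line))

doc = kv_operator("src=10.0.0.1 dst=10.0.0.2 action=DROP")
```

Grok, csv or json operators follow the same pattern: each takes unstructured input and produces a structured document that downstream nodes can route on or enrich.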
| Alternative | Comment |
|---|---|
| logstash | The punch language is inspired by logstash filters but is a real imperative language, and is more compact (up to 5 times). It provides the same operators. |
| splunk | Splunk provides a powerful pipeline-style language with many similar operators. |
| graylog | Graylog provides a more verbose logstash-style language, specific to log management. |
| custom developments | Write your own processing nodes or applications. You will lack the support of built-in operators, and of course you will be tied to the APIs and frameworks you select. |
The siddhi rule engine has been deprecated starting with 7.x. It is replaced by Flink SQL capabilities.
Different rule engines apply depending on the required algorithmic features (partitioning, stateless/stateful processing, ...).
Elastalert is a rule engine for Elasticsearch-based correlation and detection that simplifies the rule configuration process. This brick is used to support Cybels Analytics correlation/detection rule set management and templating, including leveraging Sigma rules. The Elastalert engine also provides notification actions (e.g. e-mails) without requiring additional pipelining.
Flink and Spark Streaming punchlines provide various capabilities to implement CEP use cases. These support both streaming and batch use cases.
| Alternative | Comment |
|---|---|
| esper | Esper is a language, compiler and runtime for complex event processing (CEP) and streaming analytics, available for Java as well as for .NET. Watch out for its licensing. |
| wso2 | The siddhi library used in the punch comes from the excellent wso2 product. The way siddhi is integrated into Storm is very similar to their architecture. |
| splunk | Using the splunk language you can implement correlation rules on your own. |
| qradar | IBM QRadar provides an event correlator and rule engine as part of its SIEM solution. |
| custom developments | Stream processing frameworks allow you to write your own CEP rules using stream SQL variants. |
- data lifecycle : hot, cold, frozen, delete
The punch ships with pipelines dedicated to providing you with an archiving service. This service provides:
- Compact and cost-effective Data Archiving and Indexing : choose from various replication levels.
- Extraction and data replay services.
- Housekeeping : to automatically expunge your old data.
The archiving service runs on top of the punch long term storage.
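The housekeeping part of the service amounts to applying a retention policy: anything older than the configured retention period is expunged. Here is a minimal sketch of that selection logic; the archive field names and retention parameter are illustrative.

```python
# Sketch of housekeeping logic: select archives older than the
# configured retention period for deletion. Field names are illustrative.
from datetime import date, timedelta

def expired(archives, retention_days, today):
    """Return the ids of archives whose date falls before the cutoff."""
    cutoff = today - timedelta(days=retention_days)
    return [archive["id"] for archive in archives if archive["date"] < cutoff]

archives = [
    {"id": "arch-old", "date": date(2022, 1, 1)},
    {"id": "arch-new", "date": date(2023, 1, 10)},
]
to_delete = expired(archives, retention_days=365, today=date(2023, 1, 15))
```

In a multi-tenant setup the retention period is typically configured per tenant, which is why the per-tenant naming of storage pools matters for this service.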
| Alternative | Comment |
|---|---|
| splunk | Splunk provides cold, frozen and delete data lifecycle configurations. It can also leverage SAN or NFS external storage. |
| log management products | The many log management solutions all provide some sort of equivalent service. |
## Short Term Storage
- distributed end-to-end pipeline setup
The punch fully integrates Kafka, used internally to design end-to-end processing channels.
Kafka is now the de facto standard for architecting real-time and reliable stream processing pipelines. The punch has leveraged Kafka since its early days.
## Long Term Storage
| Alternative | Comment |
|---|---|
| cloud blob storage | OVH, S3, Azure Blob Storage or Google Cloud Storage all provide excellent and cost-effective storage offerings. |
| splunk | Splunk provides cold and frozen storage. |
| qradar | IBM QRadar provides log storage in the form of data nodes. |
- automated deployment
- configuration management
The punch deployer is a high-level deployment tool that installs a complete punch platform onto your target servers. It works using only simple configuration files, without requiring adherence to complex deployment technologies.
Many customers have only hardware or virtual servers to start with. Building a complete solution such as the punch represents a significant engineering effort.
| Alternative | Comment |
|---|---|
| ansible | Deploying the many components using ansible is your most probable alternative should you operate your own infrastructure. It requires significant work to ensure your roles are idempotent and robust. Extra care must be taken not to end up with complex, hard-to-maintain inventories. The punch deployer relies on ansible but provides you with roles, and fully takes care of generating the inventories. |
| application or container paas | E.g. pivotal cloudfoundry, kubernetes or any in-between variant (cfcr, openshift). You can then deploy only the pipeline part, and leverage elasticsearch, kafka and the others as platform services. You still must assemble security, monitoring, multi-tenancy and the overall configuration management plane on top of it. |
This area is a source of great confusion. The punch is a lightweight distribution that can be deployed on many target systems, including clouds and paas. However it does not impose such a paas. These are heavyweight infrastructures with their own dependencies: for example pivotal cloudfoundry requires a (mysql) database, whereas kubernetes requires a distributed store (etcd). Each supports different filesystems, depending on its underlying container implementation.
Quoting Niels Goossens: "Even though a PaaS – and particularly a container platform – works best when adhering to the Twelve-factor app standard, specifically the factor about stateless processes, it is possible to run stateful applications on OpenShift. Why you would want to do this apart from a proof-of-concept is beyond me, though. For more information see this excellent article on Kubernetes and stateful."
- learn, train, detect
- stream or batch
- powered by spark mllib
- supports both python and java/scala runtimes
The punch supports special pipelines dedicated to running machine learning applications. Refer to the PML documentation. This feature is quite unique in that it allows users to conceive machine learning pipelines on top of Elasticsearch data. In addition, a graphical editor is provided to simplify their design and execution.
Such a dataflow, model-driven solution to design, test and deploy ml functions onto a production platform dramatically reduces ml modules' time-to-market.
| Alternative | Comment |
|---|---|
| cdap | An open source framework for building data analytic applications. |
| mlflow | Databricks' ml framework. |
| dataiku | A collaborative data science software platform for teams of data scientists, data analysts and engineers. It simplifies the design of ml pipelines. |
| sinequa | An integrated machine learning platform that addresses citizen data scientists. |
- data visualisation
- data forensics
- real time
The punch relies on Kibana to provide you with data visualisation capabilities.
The reason we exclusively leverage Kibana is explained in a dedicated blog. The key argument is to keep the punch simple, not requiring ourselves and our users to master too many technologies.
| Alternative | Comment |
|---|---|
| grafana | A popular monitoring visualisation tool. It is compatible with elasticsearch and more specifically dedicated to monitoring and supervision use cases. |
| data dog | A sophisticated metrics and monitoring solution. |
| prometheus | A time-series metrics and monitoring solution. It provides visualisation capabilities, though not as rich as grafana's or kibana's. |
## Graphical User Interface
Delivered as a Kibana plugin, the punch graphical user interface lets you:
- access the documentation
- extract data from elasticsearch
- code, run and test punchlets
- design machine learning jobs using a graphical editor
Monitoring is a key requirement to stay in control of your applications, pipelines and platform. This is true for tiny and large-scale use cases alike. The punch integrates monitoring agents to collect various system and applicative metrics so as to provide users with:
- normalised and aggregated metrics
- REST api for plugin in external supervision tools
The punch leverages the Elastic stack, all of it: Beats (metrics agents), Elasticsearch and Kibana. We decided to stop using alternate technologies, some of them more efficient and specialised in handling monitoring metrics. The benefits are significant. First, the overall solution stays simple, not requiring additional tools and technologies. Second, the elastic components have excellent performance. Last and most importantly, leveraging elastic and punch features, we can plug aggregation and machine learning jobs into these monitoring metrics as well.
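The "normalised and aggregated metrics" idea can be sketched simply: raw per-node measurements are folded into one summary document, the kind of view a supervision REST API could expose. The metric names and structure below are illustrative, not the punch metric schema.

```python
# Sketch of metric normalisation: fold raw per-node measurements into a
# single aggregated document. Metric names are illustrative.
def aggregate(samples):
    """Summarise per-node samples into one normalised document."""
    total_eps = sum(sample["eps"] for sample in samples)
    worst_latency = max(sample["latency_ms"] for sample in samples)
    return {
        "eps.total": total_eps,        # overall throughput of the pipeline
        "latency_ms.max": worst_latency,  # the slowest node dominates
        "nodes": len(samples),
    }

samples = [
    {"node": "input",  "eps": 1200, "latency_ms": 3},
    {"node": "parser", "eps": 1200, "latency_ms": 11},
]
summary = aggregate(samples)
```

Because such summaries land in Elasticsearch like any other data, the same aggregation and machine learning pipelines can be pointed at the platform's own monitoring metrics.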
| Alternative | Comment |
|---|---|
| grafana | A popular monitoring visualisation tool. It is compatible with elasticsearch and more specifically dedicated to monitoring and supervision use cases. |
| prometheus | A time-series metrics and monitoring solution. Simple to set up for simple alerting use cases. Low footprint and performant. |
- standard parsers
- log management
The punch ships with 70 standard parsers. Check out the log management documentation.
The punch is a strategic Thales development with the idea of building an open-source-based log management solution. Key to that strategy is the management and sharing of common assets, in particular log parsers. The punch language has allowed various teams to produce parsers. Compact, easy to maintain, and fully integrated into continuous integration tooling, these parsers now make it cost-effective to deploy the many log management solutions required by the many Thales projects.
| Alternative | Comment |
|---|---|
| splunk | Splunk benefits from a rich community that has contributed a large number of parsers. |
| qradar | IBM QRadar ships with so-called DSMs, ready-to-use parsers. |
- access control
- encryption of data in movement
- multi-tenant data access
- RBAC user management
The punch provides RBAC access control and authentication. Refer to the punch security documentation.
This section is under construction.
| Alternative | Comment |
|---|---|
| elastic suite | The paid license offers security features to protect kibana and elasticsearch. |
Although not technical, this last enabler is key. The punch source code is extremely well organised. Projects can use the complete packaging or leverage only specific punch libraries.
Such an ambitious stack can only succeed if various Thales and/or external customer teams participate and contribute to the punch. This has been a design driver from day one.
## Punch Simulator Tool
The punch injector tool provides users with invaluable means to:
- inject arbitrary data : the punch injector templating format allows you to generate variable fields in Json, XML or any textual data
- stress the system : the punch injector can efficiently send tens of thousands of events per second
- consume data : the injector can also read data from Kafka or sockets, which makes it ideal to investigate, debug or even (stress) test various pipelines
- various sources and destinations : elasticsearch, kafka, sockets, files. This makes it extremely easy to design PoCs and MVPs, or provide a production system with the required testing tooling
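The templated injection idea above can be sketched as follows: a template whose variable fields are regenerated for every event. The template representation shown here is invented for illustration; the punch injector has its own templating syntax.

```python
# Sketch of templated event injection: fixed fields stay constant while
# variable fields are regenerated per event. The template format is
# illustrative, not the punch injector's actual syntax.
import json
import random

def generate(template, n, rng):
    """Produce n JSON events, calling each callable field per event."""
    events = []
    for i in range(n):
        event = {key: (value(i, rng) if callable(value) else value)
                 for key, value in template.items()}
        events.append(json.dumps(event))
    return events

template = {
    "type": "apache",                                  # constant field
    "seq": lambda i, rng: i,                           # per-event counter
    "ip": lambda i, rng: f"10.0.0.{rng.randint(1, 254)}",  # random field
}
events = generate(template, n=3, rng=random.Random(42))
```

Seeding the random generator, as done here, makes an injection run reproducible, which is handy when comparing stress-test results across pipeline versions.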
None that we know of.