Rationale

Abstract

To take an analytics or big data platform to production, you need to deal easily with stream processing, batch processing and administrative tasks. The Punch aims at making that simple. This guide provides the rationale behind the Punch architecture, components and overall design.

Stream Processing

To reliably process data at a high rate, the PunchPlatform chains hops, where data goes from one safe place to another. In between two hops, you plug in your processing. By safe, what is meant here is persistent storage where the data resides long enough to be reprocessed in case of failure. Each hop thus looks as follows:

(figure: a single processing hop between two reliable stores)

The PunchPlatform uses Kafka as intermediate storage: it is reliable and extremely fast, both at writing data to disk and at publishing it to consumers.
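
To give a concrete (if simplified) picture of what writing to such a hop involves, here is a minimal Java sketch using the standard Kafka producer API. The broker address and the logs-input topic are placeholder assumptions, not actual PunchPlatform settings.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class HopWriter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // acks=all: the write is only acknowledged once fully replicated,
        // which is what makes the hop a "safe place".
        props.put("acks", "all");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // "logs-input" is a hypothetical topic name.
            producer.send(new ProducerRecord<>("logs-input", "{\"message\":\"hello\"}"));
        }
    }
}
```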

For processing the data, what is required is a runtime platform that can host and run your processing functions, yet is capable of scaling out, surviving failures, and detecting data losses so as to replay the data. The PunchPlatform uses Twitter's Storm technology, which does exactly that.

Before looking deeper into Storm, note that a complete processing chain requires more than one hop. Think of taking the data from (say) network equipment up to several destinations for (say) indexing, archiving, or correlation. Hence, what you get is more like this:

(figure: a complete chain of hops, from the sources to several destinations)

Because each hop is reliable, the chain is reliable end to end. In practice, setting up such a processing chain is hard work: hardware and software configuration, testing, monitoring, up to (hot)-(re)-deploying processings, possibly going through several security zones, or even distant sites. If not carefully designed, the setup just illustrated will turn out to be hard to manage, and will likely not behave as expected.

Hence the Punch. A PunchPlatform can act as a data emitter (generating events), as a proxy (running processing on the fly), or as a receiver (indexing the data in the appropriate backend).

Channels

Going back to the previous example, you immediately see that one functional pipeline may consist of more than one piece. You may need to combine several pieces as part of a consistent overall (bigger) pipeline.

This is where channels come into play.

A channel consists of an arbitrary number of hops, each taking the data from one (reliable) queue to another, up to some backend. Each hop runs your processing, or simply transports your data to the next destination. Channels are high-performance, scalable and reliable, but also monitored. Starting, stopping or reloading a channel is straightforward, whatever the actual complexity of the underlying channel structure.

For example, the previous chain could be composed of three channels, as follows:

(figure: the previous chain decomposed into three channels)

Having one or several platforms depends on your constraints; for example, two platforms can be deployed on two distant sites. Channels, however, are defined and managed in the scope of a single platform.

Concretely, let us illustrate a real use case. Logs are collected from equipment at some site. They must be transported to several destinations on several sites, to end up being processed in several backends: (say) several Elasticsearch clusters for log analysis and forensics, and a QRadar correlation engine installed on one of the sites. The PunchPlatform makes it possible (and easy) to set up a complete transport and processing chain that takes care of:

  1. log transport, with scalability and reliability guarantees
  2. log transformation: parsing, enrichment, filtering, duplication
  3. log indexing and (short- and long-term) storage
  4. visualization and forensics using Kibana dashboards
  5. log forwarding to third-party tools such as QRadar

All in all, this is synthesized next:

(figure: an end-to-end log management setup, from collection to indexing, dashboards and third-party forwarding)

Batch Processing

Having real-time streaming pipelines is only a part of the story. You may need to enrich them with batch processing. The PunchPlatform lets you define batch processing using a number of technologies, in particular Spark applications (http://spark.apache.org).

Spark is a great fit for designing stateful processing: extractions, aggregations, machine learning, data reprocessing.
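
As a minimal illustration of such a batch job, the following Java sketch aggregates previously archived documents with Spark. The input path and the vendor field are hypothetical; the point is only the shape of a stateful computation.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class DailyAggregation {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("daily-aggregation")
                .getOrCreate();

        // Hypothetical path: one day of archived, parsed logs in JSON form.
        Dataset<Row> logs = spark.read().json("hdfs:///archive/logs/2017-01-01");

        // Count the archived logs per (hypothetical) vendor field.
        logs.groupBy(col("vendor")).count().show();

        spark.stop();
    }
}
```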

What you need is an easy way to assemble your streaming pipelines with a periodic scheduling of your batch jobs. The way you do that on the PunchPlatform is by grouping them all, in a simple and consistent way, as part of your channels. This is explained hereafter.

Such architectures, combining stream and batch processing, are often referred to as lambda architectures.

A Closer Look at Channels

A channel is composed of one or several processing pipelines. Technically speaking, a pipeline is either a Storm topology or a Spark job (stream or batch). Functionally, each does something useful with your data (parse, enrich, train, learn, detect).

The following is a broad view of a typical channel, one used for log processing.

(figure: a broad view of a typical log processing channel)

In there, red bullets are pipelines. It is worth understanding the basic concepts of pipelines: they are simple, generic and powerful. The PunchPlatform inherits much of its power directly from the Storm and Spark pipeline models.

For example, a Storm topology is a directed graph of two kinds of functions: so-called spouts get or receive the data from some data source, and ingest that data into the graph. The rest of the graph is composed of so-called bolts, which actually run some function on the data. Some bolts (usually the terminating ones) save or forward the processed data to the next sink.

A Storm topology can be an arbitrary directed acyclic graph. Data does not necessarily traverse the whole graph: upon receiving a data item, a bolt can emit zero, one or more items further down the chain. Here is a summary:

(figure: a Storm topology as a directed acyclic graph of spouts and bolts)
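
To make the spout/bolt model concrete, here is a minimal, self-contained Storm topology in Java. The spout and bolt are toy placeholders (TestWordSpout is a Storm test utility), not actual PunchPlatform components.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class SampleTopology {

    // A toy bolt: upper-cases whatever it receives and emits it downstream.
    public static class UppercaseBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            collector.emit(new Values(tuple.getString(0).toUpperCase()));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // The spout ingests data into the graph...
        builder.setSpout("source", new TestWordSpout(), 1);
        // ...and the bolt runs a function on each data item.
        builder.setBolt("process", new UppercaseBolt(), 2).shuffleGrouping("source");

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("demo", new Config(), builder.createTopology());
        Thread.sleep(10_000);
        cluster.shutdown();
    }
}
```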

Input Connectors

Data enters a platform through input connectors. These come as Storm spouts, deployed inside Storm topologies, and acting as the platform's data entry points.

The PunchPlatform supports Kafka, files, and TCP/UDP/SSL sockets to consume input data. Connectors can easily be added to the platform whenever needed.
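
For instance, with the standard storm-kafka-client module, declaring a Kafka input connector boils down to a few lines. The broker address and topic name below are placeholders; the PunchPlatform's own spouts wrap this kind of logic for you.

```java
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.TopologyBuilder;

public class KafkaInputExample {
    public static void main(String[] args) {
        // Consume the (hypothetical) "logs-input" topic as the topology entry point.
        KafkaSpoutConfig<String, String> spoutConfig =
                KafkaSpoutConfig.builder("localhost:9092", "logs-input").build();

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-input", new KafkaSpout<>(spoutConfig), 1);
        // ... bolts processing the ingested data would be wired here.
    }
}
```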

!!! note
    On TCP sockets, the PunchPlatform accepts the acknowledged Lumberjack protocol. This makes it possible to ingest data from Elastic Beats or Logstash forwarders, and provides reliability from the start.

Output Connectors

Data can be saved to Kafka, Elasticsearch, Ceph, or to a socket peer. Using the acknowledged Lumberjack protocol just cited, you can chain several platforms and still benefit from end-to-end reliability. More precisely, if the destination platform suffers a failure, the lost data will not be acknowledged, and will be replayed until it is eventually processed and acknowledged.
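
Under the hood, this replay behaviour rests on Storm's acknowledgement mechanism: each emitted tuple is anchored to its input, and the input is only acked once safely handled. The following bolt is a simplified illustration of that pattern, not actual PunchPlatform code.

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class ForwardingBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            // Anchored emit: the new tuple is tied to its input in the ack tree.
            collector.emit(input, new Values(input.getString(0)));
            collector.ack(input);   // success: the spout can forget this tuple
        } catch (Exception e) {
            collector.fail(input);  // failure: the spout will replay it
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("log"));
    }
}
```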

Processing Functions

In many real-time streaming big data applications, the processing itself is actually simple: regex-based pattern matching, transforming the data using key-value or CSV-like operators, filtering or pruning the data, computing aggregations, and so on. Even so, only a programming language has the required expressiveness.

Storm is Java-centric. Besides the language, the API is subtle and low-level. We want our users to express their document transformations using a much simpler, JSON-friendly programming language. The PunchPlatform brings in such a language: Punch, which makes working with JSON documents as simple as possible. As an example, here is how you add nested properties to a JSON document:

```punch
{
    [user][name] = "bob";
    [user][age] = 24;
}
```

The Punch language has many characteristics that make it very compact and easy to work with. Users actually deploy punchlets as part of their channels in just minutes: punchlets are small Punch programs, deployed, compiled and executed on the fly. Punchlets give platform administrators complete control over the documents:

  • timestamping, versioning, and enriching the data in arbitrary ways
  • filtering or pruning the data
  • running regular expressions, key-value or CSV formatting, and many more operators
  • detecting and/or generating various kinds of events, from the content of the data going through or from accumulated information
  • routing the data to Elasticsearch indexes, backends, or other punchlet hops
  • ...

Punchlets are designed for Storm: a punchlet can generate documents into one or several Storm data streams, to be consumed by downstream Storm components.

Tip

If you are familiar with Logstash, punchlets will feel very familiar. All the Logstash regex patterns are available to punchlets. In fact, as far as log processing is concerned, the PunchPlatform is Logstash on Storm steroids.

This said, in some cases you may need to code using a standard programming language, and benefit from powerful editors. The PunchPlatform lets you code in Java, and provides a Java API that gives you all the Punch language goodies. Here is the same example as above, written in Java:

```java
root.get("user").get("name").set("bob");
root.get("user").get("age").set(24);
```

As you can see, it is not far from being as compact.

Searching Your Data

Our vision is to let you benefit from the Kibana and Elasticsearch capabilities, not to restrict their use. You will find a great amount of online, public resources to help you design the dashboards you need. Both the Elasticsearch and Kibana communities are extremely active, and keep adding features and delivering improvements and bug fixes. We designed the PunchPlatform to be easily upgraded to new versions, without service interruption or data loss.

The following is a very basic Kibana dashboard, providing a quick view of a set of log parsers. That level of dashboarding is what you start from, using a standalone PunchPlatform package.

(figure: a basic Kibana dashboard over a set of log parsers)

Tip

The data highlighted here was processed using the standard log parsers delivered as part of the platform. It was injected using the PunchPlatform injector. All of that is available to you in minutes.

Operations

Running a big data streaming platform requires state-of-the-art supervision capabilities. Standard supervision tools, collecting metrics every now and then, cannot scale to follow the event rate. The PunchPlatform ships with a supervision chain capable of dealing with high-throughput metric collection. Not only are these metrics stored, you also benefit from a Kibana graphical user interface that you can customize the way you need.

After installing a PunchPlatform (i.e. after a few minutes), you start with a complete platform-level supervision plane and ready-to-use dashboards.

You can then easily define your application-level monitoring using the ready-to-use metrics generated by all PunchPlatform components and centralised in an Elasticsearch instance.

The PunchPlatform publishes many useful metrics. Some of them let you keep an eye on the traversal time of your data through your channels. The principle, illustrated next, consists in injecting monitoring payloads into the data streams, so as to collect information along their journey. Each traversed hop publishes the corresponding information in real time, letting you visualize how the data traverses the complete chain:

(figure: monitoring payloads traversing each hop of a channel)
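
The following is a purely hypothetical Java sketch of that principle, just to fix ideas: a monitoring payload travels like a regular document, and each hop stamps it on the way. All names are illustrative, not the PunchPlatform's actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical monitoring payload: each hop appends its own timestamp,
// so the traversal time of the whole chain can be derived and charted.
public class MonitoringPayload {

    // Hop name -> time (ms) at which the payload traversed that hop.
    private final Map<String, Long> hops = new LinkedHashMap<>();

    public void stamp(String hopName) {
        hops.put(hopName, System.currentTimeMillis());
    }

    // Elapsed time between the first and the last traversed hop.
    public long endToEndMillis() {
        if (hops.isEmpty()) {
            return 0L;
        }
        long first = hops.values().iterator().next();
        long last = first;
        for (long t : hops.values()) {
            last = t;
        }
        return last - first;
    }
}
```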

This also works through a proxy, letting you visualize and monitor information coming from a remote PunchPlatform running on a different system, typically used as a data collector and forwarder. This is illustrated next:

(figure: monitoring a remote collector platform through a proxy)

Important events must be detected in real time, so as to immediately trigger the generation of new, aggregated events. The PunchPlatform offers many ways to implement alerting. Its components publish many useful metrics to standard metrics receivers, which can then perform their own alerting using various strategies and calculations.

The PunchPlatform always publishes metrics in a way that scales: it performs statistical analysis on the fly and only publishes consolidated metrics, so that the scheme works whatever the traffic load.

The PunchPlatform publishes metrics at various levels: platform, tenant, channel, module. Besides, punchlets can publish their own metrics or alerts: for example, to track the usage of a particular string sub-pattern in a log field, to react to a timed occurrence of some value, or to immediately notify the administrator of dropped traffic.
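
As an illustration of consolidated publication, here is a sketch using the Dropwizard Metrics library (a common choice for this kind of job; whether it matches the platform's internals is not implied here). A meter accumulates events locally, and the reporter only publishes rates at a fixed period, whatever the traffic load. The metric name hierarchy below is a made-up example.

```java
import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;
import java.util.concurrent.TimeUnit;

public class DroppedTrafficMetric {
    public static void main(String[] args) throws InterruptedException {
        MetricRegistry registry = new MetricRegistry();

        // Made-up name following a tenant/channel style hierarchy.
        Meter dropped = registry.meter("mytenant.mychannel.dropped-logs");

        // The reporter publishes consolidated rates every 10 seconds,
        // instead of one message per event.
        ConsoleReporter reporter = ConsoleReporter.forRegistry(registry)
                .convertRatesTo(TimeUnit.SECONDS)
                .build();
        reporter.start(10, TimeUnit.SECONDS);

        for (int i = 0; i < 1000; i++) {
            dropped.mark(); // record one dropped log
            Thread.sleep(5);
        }
    }
}
```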

Footprints

The PunchPlatform can take on several roles. Parts of your channel can be implemented using small Java processes, only in charge of forwarding logs to the first Kafka hop, yet executing a punchlet to take care of a first transformation. The following illustrates the memory occupation of such a forwarder topology: it runs with a 128 MB RAM setting, processes 5,000 logs per second, and garbage collection stays under 1% of the overall CPU usage.

(figure: memory profile of a forwarder topology running with 128 MB of RAM)

In contrast, other parts of the channel may consist of bigger Java processes performing aggregations. These will need more RAM to store intermediate computations.

The point is: both can be part of a single channel. The PunchPlatform provides a smart Storm scheduler to place the right components at the right locations in your cluster of machines. You can dedicate servers to ingesting data, in a dedicated security zone, with small machines (in terms of RAM), while submitting data-intensive treatments to another partition of your servers, where you use more powerful machines.

Summary

With just a few lines of Punch code, you can build your own topologies, scale them to your needs, and cover a large set of use cases: parsing, enrichment, normalization, filtering, machine learning. Although the PunchPlatform was initially designed to cover the log management use case, it is a great fit for any high-performance textual data processing.

Because Kafka, Storm and Spark are designed for simplicity, scalability and reliability, there are few limits to what you can do. The real value of the PunchPlatform is to let you benefit from these great technologies in minutes, not days or weeks, and to be immediately ready to run and manage a production environment.
