Components

Introduction

The PunchPlatform is a scalable and resilient data processing platform that can serve various functional use cases.

Using the PunchPlatform you can design end-to-end data processing chains, capturing data on remote (typically client) sites and carrying it all the way to your analytics backend. The PunchPlatform software components can be configured to play the various roles required in such chains.

In this chapter we first describe the PunchPlatform functional building blocks. These can be combined in a number of ways to fit the needs of specific applications such as log management solutions, supervision centers, analytics platforms, etc.

Next we detail some key PunchPlatform configurations that together provide a complete end-to-end log management solution.

Log Management

Log management is an important PunchPlatform use case.

  • LMC : an LMC refers to a complete log management solution. It is composed of three parts:

    • LTR : runs on remote sites, in charge of collecting the logs and forwarding them to the backend platform.
    • LMR : the LTR counterpart; it receives and acknowledges the logs from an LTR, and forwards them to the LMS.
    • LMS : the server part, providing all the services required to compute, store, index and search the data.

These names have a well-defined meaning and are part of the Thales commercial product portfolio. Make sure you refer to them with the meaning explained here.

Note

This does not mean you cannot invent new ways of using the PunchPlatform software, for instance to research, test or prototype new architectures, or to address new use cases.

LTR

Design

The LTR is a specific PunchPlatform configuration used to collect data on the client side and transport it to a remote LMR/LMS.

../../../_images/Collect_archi.jpg

The key idea behind the LTR is to rely on a Kafka cluster to save the logs locally, on the client side, before sending them to the remote LMR/LMS systems. An LTR can accumulate several hours (or days) of data should the network connection be down. When the connection resumes, the LTR flushes the Kafka queues.
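That buffering capacity essentially boils down to Kafka topic retention. The Java sketch below shows the idea; the topic name (ltr-logs), broker address, and the partition and replica counts are illustrative assumptions, not actual PunchPlatform settings, which are driven through its own configuration files:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.common.config.TopicConfig;

    import java.util.Map;
    import java.util.Properties;
    import java.util.Set;

    public class CreateBufferTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "ltr-node1:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // Topic replicated on 2 nodes; retention sized to survive a multi-day outage.
                NewTopic topic = new NewTopic("ltr-logs", 3, (short) 2)
                        .configs(Map.of(
                                TopicConfig.RETENTION_MS_CONFIG,
                                String.valueOf(3L * 24 * 60 * 60 * 1000),   // keep 3 days of logs
                                TopicConfig.RETENTION_BYTES_CONFIG, "-1")); // no size cap per partition
                admin.createTopics(Set.of(topic)).all().get();
            }
        }
    }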

The following picture illustrates a possible configuration where the LTR relies on a socket forwarder, connected to a LMR listener.

../../../_images/LTR-PUSH.png

The socket protocol used between the LTR and the LMR/LMS is the acknowledged Lumberjack protocol. It makes it possible for the LTR to ensure that logs have been saved on the remote side. That acknowledgement is in turn used to advance the Kafka offset on the forwarder side. This is illustrated next.

../../../_images/LTR-PUSH-ACKS.png
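In code, this commit-after-acknowledgement loop can be sketched as follows. This is a minimal illustration rather than the actual LTR implementation: the topic name, broker address and the sendAndAwaitAck placeholder (standing in for the Lumberjack round trip) are assumptions:

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class AckDrivenForwarder {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "ltr-node1:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "ltr-forwarder");
            // No auto-commit: the offset only advances after the remote side acknowledges.
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("ltr-logs"));
                while (true) {
                    ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : batch) {
                        sendAndAwaitAck(record.value()); // blocks until the remote ack arrives
                    }
                    if (!batch.isEmpty()) {
                        consumer.commitSync(); // logs are safe remotely: move the offset forward
                    }
                }
            }
        }

        // Placeholder: a real forwarder would frame the log in Lumberjack and wait for the ack.
        private static void sendAndAwaitAck(String log) { /* ... */ }
    }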

The following picture details how Kafka, Zookeeper and Storm are combined on three different servers. Only Kafka and Zookeeper share their data across the servers. Each server runs its own distinct (mini) Storm cluster: should one server fail, the complete Storm cluster on that server disappears.

../../../_images/LtrActiveActiveSetup.png

Note

This setup works because several Storm processes can share the load from a common Kafka topic, using a dynamic, configuration-free load-sharing behavior. If one Storm process fails on one server, or if one server fails altogether, the other Storm processes will immediately consume the data.
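This load sharing matches Kafka consumer-group semantics: all processes subscribing with the same group id split the topic's partitions among themselves, and Kafka reassigns partitions automatically when a member disappears. A minimal sketch, where the topic name, group id and broker address are illustrative:

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    import java.time.Duration;
    import java.util.Collection;
    import java.util.List;
    import java.util.Properties;

    public class SharedConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "ltr-node1:9092");
            // Every process using this group id shares the partitions of the topic.
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "ltr-shared-readers");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("ltr-logs"), new ConsumerRebalanceListener() {
                    public void onPartitionsAssigned(Collection<TopicPartition> parts) {
                        // Fires whenever Kafka redistributes partitions, e.g. after a peer failure.
                        System.out.println("now consuming: " + parts);
                    }
                    public void onPartitionsRevoked(Collection<TopicPartition> parts) {
                        System.out.println("handing back: " + parts);
                    }
                });
                while (true) {
                    consumer.poll(Duration.ofSeconds(1)).forEach(r -> System.out.println(r.value()));
                }
            }
        }
    }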

A better option consists in directly pulling the data from Kafka, as illustrated next. This architecture has several advantages:

  • better performance : there is one less userland hop, and Kafka relies on a sendfile zero-copy data forwarding strategy. This pull setup thus consumes less CPU on the forwarder side.
  • better monitoring : the receiver topology will automatically monitor the Kafka topics and make that information available to the Lmc monitoring plane. In short, the Kafka backlogs and data processing rates will be visible on the Lmc Grafana, allowing easier supervision of remote forwarders.

../../../_images/LTR-PULL.png

Of course an LTR must be resilient. Several options are offered depending on the level of resiliency required. The recommended setup is illustrated next:

../../../_images/LTR-PULL-HA.png
  • Three LTR appliances (typically VMs) run in an active-active scheme. A Kafka cluster runs on these three appliances.
  • A virtual IP address (VIP) is up on one of the three appliances, so that all incoming traffic is directed to a single appliance. An internal load-balancer dispatches that traffic to three Storm topologies in charge of writing the data to Kafka.
  • On the LMR/LMS side, a topology consumes the data from the LmcForwarder Kafka and saves it to the Lmc Kafka.

Note that with this setup, there is a single VIP in the complete system. Each log is replicated on two Kafka nodes, hence the system guarantees data delivery in case of a single hardware, network or software failure.
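On the write path, that replication guarantee is typically paired with strict producer acknowledgement settings: with acks=all, a log is confirmed only once all in-sync replicas have stored it, so a single node failure cannot lose confirmed data. A hedged sketch, where the broker addresses and topic name are assumptions:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.util.Properties;

    public class DurableWriter {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
                    "ltr-node1:9092,ltr-node2:9092,ltr-node3:9092");
            // Wait for every in-sync replica before considering a log written.
            props.put(ProducerConfig.ACKS_CONFIG, "all");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("ltr-logs", "a sample log line"));
            } // close() flushes pending records before returning
        }
    }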

Note

For this setup to work, it must be possible for the Lmc receiver topology to issue TCP connections to the LmcForwarder Kafka cluster. This must be authorized by the client security rules.

Comparison

The following picture illustrates a comparison with an equivalent setup using a Splunk Forwarder. The Splunk Forwarder is similar to the LTR in that it has a persistent queue to back up the logs should it fail to forward them to the remote Splunk Receiver.

../../../_images/LTR-SPLUNK.png

This setup suffers from several drawbacks:

  1. The SplunkForwarder queue is file-system based and cannot be replicated easily. If the local disk is used to store the forwarder queue, you lose your logs in case of hardware failure.
  2. There is an additional userland process in the chain, because the SplunkReceiver cannot write to Kafka. Hence you suffer from additional CPU usage on the receiving server.
  3. There is no acknowledged protocol between the Splunk Receiver and the Lmc Receiver topology, because Splunk's acknowledged protocol is proprietary and cannot be used there. Hence if you restart the SplunkReceiver, the Lmc receiver topology or the server itself, you will lose some logs.
  4. The Splunk Forwarder, by design, will stop sending traffic to the Splunk Receiver if a single one of its connections is slow. That basically means that if you have 10 log channels (i.e. on 10 different TCP ports) between the two, the traffic of all 10 will stop if you have an issue on a single channel.

For these reasons, we recommend using an LTR over any other solution. Besides solving the issues just explained, it provides you with key benefits:

  1. The Kafka cluster can be set up on several servers, providing you with additional resiliency and scalability on the client side.
  2. The LTR can be configured to perform automatic supervision that will be available to you on the Lmc side.
  3. You can plug in simple but key configurations, such as timestamping and tagging logs with unique identifiers, as sketched after this list.
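As an illustration of the third point, a hypothetical enrichment step could stamp each log with an arrival time and a unique identifier before it enters the Kafka queue. The JSON layout and field names below are assumptions, not the PunchPlatform's actual format:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.time.Instant;
    import java.util.Properties;
    import java.util.UUID;

    public class TagAndForward {
        // Hypothetical enrichment: add an arrival timestamp and a unique id to a raw log.
        static String enrich(String rawLog) {
            return String.format("{\"@timestamp\":\"%s\",\"uuid\":\"%s\",\"message\":\"%s\"}",
                    Instant.now(), UUID.randomUUID(), rawLog);
        }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "ltr-node1:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("ltr-logs", enrich("user login from 10.0.0.1")));
            }
        }
    }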

PunchMonitoring Platform

Design

PunchMonitoring is used to monitor an infrastructure with Beats. On top of the collected metrics, predictive algorithms can then be run to perform active monitoring.

../../../_images/PunchplatformMetricMonitoring.jpg